Sequence Database Setup: Database Manager
Database Manager is a browser-based utility for configuring and
updating local copies of sequence databases.
It replaces both the Database Maintenance Utility and the Database Update script,
which were components of Mascot 2.3 and earlier.
The file formats
and download locations of sequence databases change from time to time. One of the smart features
of Database Manager is that database configurations for the most popular public databases
are updated automatically, by downloading configuration data from the Matrix Science
web site.
This means that, for databases such as SwissProt and NCBInr, all you need
do to make a new sequence database available for searching in Mascot is:
- Choose to enable the database
- Decide where the files should live
- Optionally, specify an update schedule
If the file format or the download URL changes at some future date, Database Manager
will get new configuration information prior to updating the database files.
Predefined Database Definition: Configuration information for the most popular
public databases is kept up-to-date on the Matrix Science web site, and downloaded as
required by Database Manager. You don't need to know file URLs or worry about parse rules, etc.
for a Predefined Database.
Custom Database Definition: If you want to search a database that is not
included in the list of Predefined Database Definitions, or if you want to configure one
of these databases in some non-standard way, you create a Custom Database Definition.
Synchronisation: If a custom definition is very similar to a predefined
definition, it can be converted into a predefined definition by being synchronised.
The advantage of doing this is that the configuration will then be kept up-to-date
automatically.
Tasks: Database files can be very large, and downloading may take a long time.
Database Manager processes tasks serially in
the background as long as Mascot Monitor (ms-monitor.exe) is running.
Update Schedule: A schedule can be created to update all the files associated with a
database automatically, for example once each week or each month. Files will only be downloaded
if a new version is available.
Parse Rules: The format of Fasta title lines varies from database to database.
Each database definition must include accession and description parse rules,
which tell Mascot how to extract a unique identifier and a description for each entry.
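As a rough illustration of what accession and description parse rules do, the sketch below applies two regular expressions to a Fasta title line. The title line and both patterns are made up for this example, and they use Python regular expression syntax, which is not necessarily the syntax Database Manager expects.

```python
import re

# Hypothetical Fasta title line in a "db|accession|id description" layout.
title = ">sp|P12345|EXAMPLE_HUMAN Example protein OS=Homo sapiens"

# Illustrative rules: the first capture group is the value to extract.
accession_rule = re.compile(r">[^|]*\|([^|]*)\|")
description_rule = re.compile(r">\S+\s+(.*)")

def apply_rule(rule, line):
    """Return the first capture group, or None if the rule does not match."""
    match = rule.match(line)
    return match.group(1) if match else None

accession = apply_rule(accession_rule, title)
description = apply_rule(description_rule, title)
print(accession)
print(description)
```

A rule that returns None for some entries would be a sign that the rule does not fit the title line format of that database.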
Active vs. Inactive: An Inactive database definition is effectively
hidden from Mascot, and the database will not appear on the Database Status page.
A new custom definition is inactive until configuration is complete and all the required files
are in their final locations. An Active database definition is visible to Mascot.
If there is a Fasta file, Mascot will try to compress it and bring the database On-line.
This can fail for all sorts of reasons, such as a missing or corrupt file or a mistake in
the configuration, and Database Status will show an error. Hence, an active database
definition does not necessarily mean the database is In Use.
You might wish to set an active definition to be inactive if you don't want to
see it listed in the search form or if there is some problem with the definition that
you don't want to resolve immediately.
IMPORTANT: Database Manager must be allowed exclusive control of database configuration.
Editing mascot.dat outside of Database Manager will
just cause confusion because Database Manager re-writes mascot.dat whenever
a configuration changes. If you prefer to configure sequence databases manually,
by editing mascot.dat, never run Database Manager. Manual procedures are
described here.
When Database Manager is run for the first time, it imports existing
database definitions from mascot.dat. If a definition looks similar to a Predefined
Definition, you will be offered the option to synchronise it.
Database Manager will also try to download the latest configuration
file from the Matrix Science web site. It is at this point that any problem with the
connection to the Internet will be discovered. If connection warnings are displayed,
then, unless the Mascot Server is intentionally isolated from the Internet,
choose to configure the proxy settings and save them.
If the Mascot Server is intentionally isolated from the Internet,
choose Do not use the Internet to avoid seeing constant error messages about
failed connections.
Predefined database definitions will be taken from a file that was part of
the Mascot installation, and may now be out-of-date. You can update this
file manually by downloading
databases_1.xml
on a machine with Internet access and copying the file to the Mascot
config/db_manager/public directory.
Once any connection issues have been resolved, the configuration import page
is displayed. If Database Manager is being run after a clean installation of Mascot, the only
existing definition will be SwissProt. In most cases, you will want to synchronise
this definition with the predefined one, and choose Import. If you have upgraded
an existing Mascot installation, the database definitions in mascot.dat will
be listed, and you need to decide which to synchronise and which to keep as
custom definitions.
Database Manager tries to match existing definitions against predefined definitions and
reports the quality of the match as none, poor, good, or perfect. For poor or good
matches, the differences can be inspected. Usually, these arise because the existing definition
is out-of-date in some respect.
If the Mascot Server is not allowed to access the Internet,
choose Keep as Custom unless the
match is perfect. This is because synchronisation of any definition
where the match is not perfect requires the database files to be updated.
Even if you have an Internet connection, choose Keep as Custom for
any database with a poor or good match if you do not want to
update the database files, or if you see a difference
in the existing definition that you want to preserve.
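The guidance above can be summarised as a small decision function. This is only one reading of the advice, not how Database Manager itself behaves; the function name and its parameters are invented for illustration, and the handling of a "none" match is an assumption.

```python
def suggest_import_choice(match, internet_allowed,
                          want_file_update=False, preserve_differences=False):
    """Suggest 'synchronise' or 'keep as custom' for one imported definition."""
    if match not in ("none", "poor", "good", "perfect"):
        raise ValueError(f"unknown match quality: {match}")
    if match == "perfect":
        return "synchronise"
    if match == "none":
        # Assumption: without a matching predefined definition,
        # there is nothing to synchronise with.
        return "keep as custom"
    # Poor or good match: synchronising requires the files to be updated.
    if not internet_allowed:
        return "keep as custom"
    if want_file_update and not preserve_differences:
        return "synchronise"
    return "keep as custom"

print(suggest_import_choice("good", internet_allowed=False))
print(suggest_import_choice("good", internet_allowed=True, want_file_update=True))
```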
Choose Import to proceed. The
list of Databases will be displayed, with status information for those that
have been synchronised and need updating.
Custom definitions that are possible matches to predefined ones can be made
predefined at any time by choosing Synchronise custom definitions.
You can add new databases in four different ways:
- Enable predefined definition
Apart from confirming a location
for the downloaded files, everything will be handled automatically.
Only one instance of each predefined definition can be enabled at any one time,
as database names must be unique.
If you want to enable a predefined database, but make changes to the configuration, e.g.
to keep an old version on-line, choose Create New; Use predefined definition template.
- Create New; Custom
Create a new custom database definition from scratch.
- Create New; Copy Of
Create a new custom database definition by copying an existing definition. You will be required to
enter a new database name and given the choice of copying the existing database files.
- Create New; Use predefined definition template
Create a new custom database definition by starting from a predefined definition.
The differences between this and enabling a predefined definition are
(i) you can make changes to the configuration, (ii) the definition will not be
kept up-to-date automatically.
When a new database is created, unless it is predefined, you will either need to supply
download URLs for the files or copy the files manually to the specified directory
on the Mascot Server before configuration can
be completed. This is primarily to allow parse rules to be tested against the Fasta file,
but it also verifies that the download URL works or that the manually copied files have
the correct names and security settings / permissions.
Drop-down help is provided for each element in the configuration pages.
The following terms may benefit from additional explanation:
Database Name: Each database must have a unique name. Ideally, the name should be short and descriptive.
Note that these names are case sensitive, and much confusion can be caused by creating both SwissProt and swissprot.
Local paths: The delimiters between directories
must always be forward slashes, even if Mascot is running on a Windows system. The default
parent directory for sequence database directories can be specified on the Settings page.
Memory mapping and locking: Memory mapped files can be locked in memory,
but only if the computer has sufficient RAM.
Having a database locked in memory means that it can never be swapped out to disk, ensuring maximum possible
search speed. If you try to lock databases into RAM when there isn't room, this will not be a major problem.
The locking will fail, generate an error message, and Mascot will carry on regardless. A more serious problem
is when there is just sufficient RAM to lock the databases, but none left over for searches or other applications.
In this case, the whole system will slow down and the hard disk will be observed to be "thrashing". Eventually,
the system is likely to hang or crash.
Threads: A Mascot search can use multiple threads, so as to make use of all the logical processors
covered by the licence.
Usually, it is best to leave threads set to -1, which means automatic. If you want to restrict the number
of threads on a non-cluster (SMP) system, you can do so by setting a value of 1 or more.
Each CPU in the Mascot licence allows use of up to 4 cores,
which requires 8 threads for a hyperthreaded processor or 4 otherwise. On a cluster system, the number of threads
is set for each search node in a separate configuration file, nodelist.txt.
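The thread arithmetic described above can be written out as a short calculation. The function and constant names are made up for this sketch; the figure of 4 cores per licensed CPU comes from the text, and applying it to more than one licensed CPU is a straightforward extrapolation.

```python
CORES_PER_LICENSED_CPU = 4  # from the text: each licensed CPU allows up to 4 cores

def max_useful_threads(licensed_cpus, hyperthreaded):
    """Threads needed to occupy every logical processor covered by the licence."""
    threads_per_core = 2 if hyperthreaded else 1
    return licensed_cpus * CORES_PER_LICENSED_CPU * threads_per_core

print(max_useful_threads(1, hyperthreaded=True))   # 8
print(max_useful_threads(1, hyperthreaded=False))  # 4
```

Setting threads to a value above this number would not be expected to help, which is one reason to leave the setting at -1 unless there is a specific need to restrict it.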
If a URL is specified for downloading the Fasta file,
you can create an update schedule. This can be done when the database is first added or later,
by clicking on the name hyperlink in the databases list.
Allow Internet access: If the Mascot Server machine has no Internet connection
or if you do not wish Database Manager to access the Internet, this should be set
explicitly to avoid getting error messages. If Internet access is prevented, you cannot
download databases, which means that predefined definitions cannot be enabled.
Use Create New; Use predefined definition template instead, and manually copy
the required files to the Mascot server.
You can update the predefined database definitions file periodically by downloading
databases_1.xml
on a machine with Internet access and copying the file to the Mascot
config/db_manager/public directory.
Allow external full-text reports: Even when Internet access is enabled, it
may be undesirable to allow reports for specific database entries to be
retrieved from Internet sources. You can change the source for external reports
in a custom definition but not in a predefined definition. Disabling external sources
here blocks all external full-text reports.
Proxy: If automatic HTTP proxy detection fails, or if the proxy server
is password protected, enter and save details. Native FTP proxy servers are not
supported.
Sequence directory: The files for each database reside in a directory with the same
name as the database. When a new database is added, the sequence directory
specifies the default path
under which the database directory will be created unless it already exists.
This is only a default, and you can change the path during
configuration of a database. Database directories do not have to be kept together,
and can be distributed across drives or partitions as convenient.
If you choose remote storage, make sure the connection
is fast and reliable and that memory mapping is supported.
Windows UNC paths are not supported. The delimiters between directories must
always be forward slashes, even if Mascot is running on a Windows system.
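If paths are being copied from a Windows file browser, a small helper can put them into the form described above before they are typed into Database Manager. This is a sketch only: the function name and the example path are hypothetical, and the UNC check simply looks for a leading double slash or double backslash.

```python
def to_forward_slashes(path):
    """Return the path with forward-slash delimiters; reject Windows UNC paths."""
    if path.startswith("\\\\") or path.startswith("//"):
        raise ValueError("Windows UNC paths are not supported")
    return path.replace("\\", "/")

# Hypothetical sequence directory on a Windows system.
converted = to_forward_slashes(r"C:\mascot\sequence\SwissProt")
print(converted)
```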
Important files:
- db_manager.etags.2
Successful downloads of database files are recorded in a file called
db_manager.etags.2 in the incoming directory for the database.
Each new version of a database is downloaded once. If you try to download the
same file(s) a second time, maybe because the Fasta was accidentally deleted,
Database Manager will report that no new files are available.
To force a new download, delete db_manager.etags.2 before choosing Update.
- mascot.dat
The general configuration file, mascot.dat in the config directory,
is re-written by Database Manager
whenever configuration changes are saved, so it is pointless to edit the database
related sections. Files with names like 2012-04-12_135833.mascot.dat are
backups of mascot.dat.
- global.conf
Global settings, such as proxy server details, are saved to global.conf in
the config/db_manager directory.
- databases_1.xml
When Mascot is installed, the initial set of predefined database definitions
is a file called databases_1.xml in the config/db_manager/public directory. If
there is an Internet connection, whenever Database Manager tries to update a
database, it checks for updates to this file. If a new version is available, it
is downloaded to a file with a name like 2012-04-13-15-34-30.xml. (Note: these
files are not backups and must not be deleted.)
- configuration.xml
Database configuration information is saved to configuration.xml in
the config/db_manager directory. Files with names like 2012-04-14_144014.configuration.xml
are backups. For custom definitions, all the configuration information
is in configuration.xml. For predefined databases, only limited settings are in
configuration.xml. Most settings are inherited from either databases_1.xml or a later
version, e.g. 2012-04-13-15-34-30.xml. This is because the configuration for a predefined
database is only updated when the database files themselves are updated.
To illustrate, imagine
we have SwissProt 2011_01 as a predefined database. Six months later, with release 2011_07,
the Fasta title line changes so as to require a new accession parse rule. A new
version of databases_1.xml is posted on the Matrix Science web site and downloaded by
Database Manager. However, until your local copy of
SwissProt is updated to 2011_07 or later, you don't actually
want to use the new accession parse rule, because this
could break the configuration for the files from the earlier release. So, the definition
in configuration.xml specifies that the configuration settings are inherited from the earlier file
until SwissProt is updated, at which point, the definition in configuration.xml will be changed to specify
that settings are inherited from the latest public file.
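For the db_manager.etags.2 entry in the list above, forcing a fresh download amounts to deleting one file from the incoming directory of the database. The sketch below shows that step in isolation; the function name is invented, and a temporary directory stands in for a real incoming directory so that the example can be run safely.

```python
import tempfile
from pathlib import Path

ETAGS_NAME = "db_manager.etags.2"

def clear_download_record(incoming_dir):
    """Delete the download record; return True if a record was removed."""
    record = Path(incoming_dir) / ETAGS_NAME
    if record.is_file():
        record.unlink()
        return True
    return False

# Demonstration against a temporary stand-in for an incoming directory.
with tempfile.TemporaryDirectory() as incoming:
    (Path(incoming) / ETAGS_NAME).write_text("")
    removed_first = clear_download_record(incoming)
    removed_again = clear_download_record(incoming)
print(removed_first, removed_again)
```

After the record has been removed from the real incoming directory, choosing Update in Database Manager should download the files again.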
Editing configuration.xml:
In the first release of Database Manager, there is no user interface for
the following:
- Change which Unigene index files are associated with a database
- Create or modify a taxonomy parse rule. You can only select from existing
taxonomy parse rules
If you wish to change either of these, create a custom definition
for the database in question, possibly by using a predefined definition as a template.
(Trying to modify a predefined definition
is problematic because your changes will be lost each time the database is updated after a new
databases_1.xml file has been downloaded.)
Open configuration.xml in a plain text editor or
specialised XML editor and identify the element containing the database definition. For example,
if the database is called citrus_EST, the definition will be between
tags <msgd:database name="citrus_EST"> and </msgd:database>.
UniGene: The list of known UniGene index files follows the tag <msgd:unigene_entries>.
If you wanted to make the indexes Citrus_clementina and Citrus_sinensis available for
citrus_EST, you would need to change the database definition as follows.
Replace
<msgd:unigene>
<msgd:enabled>0</msgd:enabled>
</msgd:unigene>
With
<msgd:unigene>
<msgd:enabled>1</msgd:enabled>
<msgd:indices>Citrus_clementina Citrus_sinensis</msgd:indices>
</msgd:unigene>
And save.
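The same edit could be scripted rather than made by hand. The sketch below works on an in-memory sample, not on a real configuration.xml: the namespace URI and the root element name are placeholders invented for this example, so check what your own file declares for the msgd prefix and back up the file before adapting anything like this.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI and root element, for illustration only.
NS = "http://example.com/msgd"
ET.register_namespace("msgd", NS)  # keep the msgd: prefix when serialising

SAMPLE = f"""<msgd:configuration xmlns:msgd="{NS}">
  <msgd:database name="citrus_EST">
    <msgd:unigene>
      <msgd:enabled>0</msgd:enabled>
    </msgd:unigene>
  </msgd:database>
</msgd:configuration>"""

def q(tag):
    """Build a namespace-qualified tag name for ElementTree lookups."""
    return f"{{{NS}}}{tag}"

def enable_unigene(root, db_name, indices):
    """Enable UniGene for the named database; return False if it is not found."""
    for db in root.iter(q("database")):
        if db.get("name") != db_name:
            continue
        unigene = db.find(q("unigene"))
        if unigene is None:
            unigene = ET.SubElement(db, q("unigene"))
        enabled = unigene.find(q("enabled"))
        if enabled is None:
            enabled = ET.SubElement(unigene, q("enabled"))
        enabled.text = "1"
        idx = unigene.find(q("indices"))
        if idx is None:
            idx = ET.SubElement(unigene, q("indices"))
        idx.text = " ".join(indices)
        return True
    return False

root = ET.fromstring(SAMPLE)
changed = enable_unigene(root, "citrus_EST", ["Citrus_clementina", "Citrus_sinensis"])
result = ET.tostring(root, encoding="unicode")
print(result)
```

Note that ElementTree may change whitespace and attribute quoting when it writes the document back out, so a manual edit in a text editor remains the more conservative option.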
Taxonomy: It is extremely unlikely that you will need to create a new taxonomy
parse rule. But, for completeness, the syntax is described in Chapter 9 of the
Installation & Setup manual. Taxonomy parse rules follow the <msgd:taxonomy_entries>
tag. Identify a similar taxonomy parse rule in the latest public definitions file, copy it
to configuration.xml, and modify it as required, not forgetting to give it a unique name.
In the database definition, the taxonomy parse rule is referenced by this unique name, e.g.
<msgd:taxonomy_entry>All human with TaxID 9606</msgd:taxonomy_entry>.
If you simply want to add a new category to the taxonomy filter drop-down list that
appears in the search form, this does not require any changes to database configuration files.
Just edit the file called taxonomy in the Mascot config directory, as explained in Chapter 9 of the
Installation & Setup manual.