Sequence Database Setup: UniRef
This is a Predefined Database Definition
The information on this page is maintained as a service to users of Mascot 2.3 and earlier.
In Mascot 2.4, UniRef100 is a predefined database, meaning up-to-date configuration information can be
downloaded automatically by Mascot Database Manager.
Overview
UniRef, also known as UniProt NREF, is a set of comprehensive protein
databases curated by the Universal Protein Resource consortium.
There are three versions of UniRef: UniRef100, UniRef90, and UniRef50. UniRef100 is non-identical,
meaning that identical sequences are merged into a single entry, while UniRef90 and UniRef50 are
non-redundant at sequence similarity levels of 90% and 50% respectively. Searching with mass
spectrometry data requires the exact sequence to be present in the database, so UniRef100 is the
version to choose.
Download
PIR:
ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/
EBI:
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/uniref/uniref100/
Expasy:
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/uniref100/
The files are:
- Version info: uniref100.release_note
- Fasta file: uniref100.fasta.gz
Note that the XML file, uniref100.xml.gz, contains essentially the same information as the
Fasta file. It is not a full text reference file.
To download UniRef100 updates automatically in Mascot 2.3 and earlier,
the relevant definition block in db_update.pl is UniRef100_Fasta_from_EBI.
Taxonomy
If you have Mascot 2.0 or earlier, add the following taxonomy
definition to mascot.dat, changing the taxonomy block number so as to be consecutive with the existing
blocks. If you have Mascot 2.1 or 2.2, you will need to update the existing taxonomy definition, because the database curators
recently made changes to the fasta title syntax. Make a backup copy of mascot.dat, then use a text editor to
make these changes. Note that the file must be saved as plain text, so be careful if using a word processor,
and ensure the filename is not changed to mascot.dat.txt or similar.
# TAXONOMY FOR UniRef
Taxonomy_12
Identifier UniRef Fasta
Enabled 1 # 0 to disable it
FromRefFile 0
ErrorLevel 0
SpeciesFiles NCBI:names.dmp
NodesFiles NCBI:nodes.dmp
DefaultRule NCBI, CHOP:W "Tax=\(.*\) RepID=" # from release 14.0 onwards
end
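As a rough illustration of what the DefaultRule pattern captures, the sketch below applies an equivalent Python regular expression to a UniRef title line. Mascot writes the capture group as \( and \), which become plain parentheses in Python. The function name species_from_title is hypothetical, and the sketch does not model the NCBI name lookup or the CHOP:W option.

```python
import re

# Python equivalent of the Mascot pattern "Tax=\(.*\) RepID="
TAX_PATTERN = re.compile(r"Tax=(.*) RepID=")


def species_from_title(title):
    """Return the text between 'Tax=' and ' RepID=', or None if absent."""
    match = TAX_PATTERN.search(title)
    return match.group(1) if match else None


title = (">UniRef100_Q4U9M9 104 kDa microneme/rhoptry antigen "
         "n=1 Tax=Theileria annulata RepID=104K_THEAN")
print(species_from_title(title))  # Theileria annulata
```

A title without both the Tax= and RepID= keys, such as one in the older syntax, returns None here, which is why the existing definition needs updating in Mascot 2.1 and 2.2.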
The following taxonomy file is required:
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
Remember that the taxonomy files go into the taxonomy directory, not into the sequence database
directory. Also, these files need to be unpacked (using tar) as well as uncompressed.
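If tar is not to hand, the unpacking step can be sketched in Python. This is a minimal example that assumes names.dmp and nodes.dmp sit at the top level of the archive; the function name unpack_taxonomy and the paths in the usage comment are hypothetical.

```python
import shutil
import tarfile
from pathlib import Path


def unpack_taxonomy(archive, taxonomy_dir, wanted=("names.dmp", "nodes.dmp")):
    """Extract the named members of a .tar.gz archive into taxonomy_dir."""
    taxonomy_dir = Path(taxonomy_dir)
    taxonomy_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    # Mode "r:gz" uncompresses and unpacks in a single pass
    with tarfile.open(archive, "r:gz") as tar:
        for name in wanted:
            member = tar.getmember(name)  # raises KeyError if missing
            target = taxonomy_dir / name
            with tar.extractfile(member) as source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            extracted.append(target)
    return extracted


# Hypothetical usage; adjust both paths to your installation:
# unpack_taxonomy("taxdump.tar.gz", r"C:\Inetpub\MASCOT\taxonomy")
```

Only the two files named in the taxonomy definition are written; other archive members are ignored.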
Parse Rules
A typical UniRef Fasta title line is:
>UniRef100_Q4U9M9 104 kDa microneme/rhoptry antigen n=1 Tax=Theileria annulata RepID=104K_THEAN
The literal text, UniRef100_, should be dropped from the accession string, to make linking easier.
Accession from Fasta title: ">UniRef100_\([^ ]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
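To check what these two rules extract, the sketch below applies Python equivalents to the example title. It is an approximation for illustration only: Mascot's \( and \) groups are written as plain parentheses in Python, and the function name parse_title is hypothetical.

```python
import re

# Python equivalents of the two Mascot parse rules
ACCESSION_RULE = re.compile(r">UniRef100_([^ ]*)")
DESCRIPTION_RULE = re.compile(r">[^ ]* (.*)")


def parse_title(title):
    """Return (accession, description) from a Fasta title line."""
    acc = ACCESSION_RULE.match(title)
    desc = DESCRIPTION_RULE.match(title)
    return (acc.group(1) if acc else None,
            desc.group(1) if desc else None)


title = (">UniRef100_Q4U9M9 104 kDa microneme/rhoptry antigen "
         "n=1 Tax=Theileria annulata RepID=104K_THEAN")
accession, description = parse_title(title)
print(accession)    # Q4U9M9
print(description)  # 104 kDa microneme/rhoptry antigen n=1 Tax=Theileria annulata RepID=104K_THEAN
```

The accession comes back without the UniRef100_ prefix, while the description keeps everything after the first space, including the n=, Tax= and RepID= fields.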
Configuration (Mascot 2.3 and earlier)
For this example, the fasta file was downloaded to
C:\Inetpub\MASCOT\sequence\uniref100\current, decompressed using Gzip,
and renamed to uniref100_9.6.fasta. Note that the rule numbers in your copy of mascot.dat may differ from
those in the screenshot.
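The decompress and rename steps can also be scripted. The following is a minimal sketch using Python's standard gzip module rather than the Gzip command-line tool; the function name decompress_fasta is hypothetical, and the paths in the usage comment simply repeat the example above.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta(gz_path, fasta_path):
    """Stream-decompress gz_path into fasta_path and return the new path."""
    fasta_path = Path(fasta_path)
    # Copy in chunks so that a multi-gigabyte file is never held in memory
    with gzip.open(gz_path, "rb") as source, open(fasta_path, "wb") as target:
        shutil.copyfileobj(source, target)
    return fasta_path


# Hypothetical usage, following the example paths in the text:
# folder = Path(r"C:\Inetpub\MASCOT\sequence\uniref100\current")
# decompress_fasta(folder / "uniref100.fasta.gz", folder / "uniref100_9.6.fasta")
```

Writing straight to the final filename combines the decompress and rename steps; the original .gz file is left in place.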
There isn't a downloadable reference file for UniRef, but full text for individual entries can be
retrieved across the web from the EBI SRS server.
For an SRS7 server, the syntax for the Path field is:
HTML: /srsbin/cgi-bin/wgetz?-e+[UNIREF100:UniRef100_#ACCESSION#]
Plain text: /srsbin/cgi-bin/wgetz?-e+[UNIREF100:UniRef100_#ACCESSION#]+-vn+2
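As a quick way to see what a filled-in Path looks like, the sketch below substitutes an accession into the two templates. It assumes that Mascot replaces #ACCESSION# with the accession produced by the parse rule, which is why the UniRef100_ prefix has to be restored in the Path; the function name srs_path is hypothetical.

```python
# Path templates copied from the SRS7 syntax above
HTML_TEMPLATE = "/srsbin/cgi-bin/wgetz?-e+[UNIREF100:UniRef100_#ACCESSION#]"
TEXT_TEMPLATE = HTML_TEMPLATE + "+-vn+2"


def srs_path(accession, plain_text=False):
    """Fill the #ACCESSION# placeholder in the chosen template."""
    template = TEXT_TEMPLATE if plain_text else HTML_TEMPLATE
    return template.replace("#ACCESSION#", accession)


print(srs_path("Q4U9M9"))
# /srsbin/cgi-bin/wgetz?-e+[UNIREF100:UniRef100_Q4U9M9]
print(srs_path("Q4U9M9", plain_text=True))
# /srsbin/cgi-bin/wgetz?-e+[UNIREF100:UniRef100_Q4U9M9]+-vn+2
```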
If you don't require full text in a Mascot Protein View report, simply leave the Host, Port,
and Path fields blank and choose "--- no full text report ---" in the drop-down list.
Always test a new definition before applying the changes to mascot.dat.