Sequence Database Setup: UniRef

This is a Predefined Database Definition

The information on this page is maintained as a service to users of Mascot 2.3 and earlier. In Mascot 2.4, UniRef100 is a predefined database, meaning up-to-date configuration information can be downloaded automatically by Mascot Database Manager.

Overview

UniRef, also known as UniProt NREF, is a set of comprehensive protein databases curated by the Universal Protein Resource consortium. There are three versions of UniRef: UniRef100, UniRef90, and UniRef50. UniRef100 contains no identical sequences, while UniRef90 and UniRef50 are non-redundant at sequence similarity levels of 90% and 50%, respectively. A mass spectrometry search can only match a peptide whose exact sequence is present in the database, so UniRef100 is the version to choose.

Download

PIR: ftp://ftp.uniprot.org/pub/databases/uniprot/uniref/uniref100/
EBI: ftp://ftp.ebi.ac.uk/pub/databases/uniprot/uniref/uniref100/
Expasy: ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/uniref/uniref100/

The files are:

Version info: uniref100.release_note
Fasta file: uniref100.fasta.gz

Note that the XML file, uniref100.xml.gz, contains essentially the same information as the Fasta file. It is not a full text reference file.

To download UniRef100 updates automatically in Mascot 2.3 and earlier, the relevant definition block in db_update.pl is UniRef100_Fasta_from_EBI.

Taxonomy

If you have Mascot 2.0 or earlier, add the following taxonomy definition to mascot.dat, changing the taxonomy block number so that it is consecutive with the existing blocks. If you have Mascot 2.1 or 2.2, you will need to update the existing taxonomy definition, because the database curators recently made changes to the Fasta title syntax. Make a backup copy of mascot.dat, then use a text editor to make these changes. Note that the file must be saved as plain text, so be careful if using a word processor, and ensure that the editor does not append an extension to the filename (for example, mascot.dat.txt).

# TAXONOMY FOR UniRef
Taxonomy_12
Identifier      UniRef Fasta
Enabled         1       # 0 to disable it
FromRefFile     0
ErrorLevel      0
SpeciesFiles    NCBI:names.dmp
NodesFiles      NCBI:nodes.dmp
DefaultRule     NCBI, CHOP:W "Tax=\(.*\) RepID="  # from release 14.0 onwards
end
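
To check what the expression in DefaultRule captures, the following minimal Python sketch applies an equivalent pattern to the example title line from the Parse Rules section of this page. Mascot writes the capture group as \( \), whereas Python's re module uses bare parentheses, so the pattern is a translation rather than a copy. The function name species_from_title is hypothetical and not part of Mascot.

```python
import re

# Python translation of the Mascot rule "Tax=\(.*\) RepID=".
# Mascot marks the capture group with \( \); Python's re uses bare ( ).
TAX_RULE = re.compile(r"Tax=(.*) RepID=")


def species_from_title(title):
    """Return the text between 'Tax=' and ' RepID=', or None if absent."""
    match = TAX_RULE.search(title)
    return match.group(1) if match else None


title = (">UniRef100_Q4U9M9 104 kDa microneme/rhoptry antigen "
         "n=1 Tax=Theileria annulata RepID=104K_THEAN")
print(species_from_title(title))  # prints: Theileria annulata
```

A title written in the older syntax, without the RepID= field, would return None here, which is one way to spot a database release that predates the change in title format.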

The following taxonomy file is required:
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

Remember that the taxonomy files go into the taxonomy directory, not into the sequence database directory. Also, these files need to be unpacked (using tar) as well as uncompressed.
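
As a rough illustration of the unpacking step, the Python sketch below extracts only names.dmp and nodes.dmp (the two files named in the taxonomy definition) from a downloaded taxdump.tar.gz. The function name unpack_taxonomy and the example directory in the comment are assumptions; substitute the taxonomy directory of your own Mascot installation.

```python
import shutil
import tarfile
from pathlib import Path


def unpack_taxonomy(archive, taxonomy_dir):
    """Extract names.dmp and nodes.dmp from a taxdump archive.

    Returns the sorted list of file names that were written.
    """
    wanted = {"names.dmp", "nodes.dmp"}
    taxonomy_dir = Path(taxonomy_dir)
    taxonomy_dir.mkdir(parents=True, exist_ok=True)
    extracted = []
    # Mode "r:gz" removes the gzip layer and reads the tar layer in one pass.
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            name = Path(member.name).name  # ignore any leading directory
            if member.isfile() and name in wanted:
                source = tar.extractfile(member)
                with source, open(taxonomy_dir / name, "wb") as out:
                    shutil.copyfileobj(source, out)
                extracted.append(name)
    return sorted(extracted)


# Example call (directory is an assumed location, adjust as needed):
# unpack_taxonomy("taxdump.tar.gz", r"C:\Inetpub\MASCOT\taxonomy")
```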

Parse Rules

A typical UniRef Fasta title line is:

>UniRef100_Q4U9M9 104 kDa microneme/rhoptry antigen n=1 Tax=Theileria annulata RepID=104K_THEAN

The literal text, UniRef100_, should be dropped from the accession string, to make linking easier.

Accession from Fasta title: ">UniRef100_\([^ ]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
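
The two parse rules can be tried outside Mascot with a short Python sketch. The patterns below are translations of the rules above into Python's re syntax (bare parentheses instead of \( \)); the helper name parse_title is hypothetical.

```python
import re

# Python translations of the two Mascot parse rules.
ACCESSION_RULE = re.compile(r">UniRef100_([^ ]*)")
DESCRIPTION_RULE = re.compile(r">[^ ]* (.*)")


def parse_title(title):
    """Return (accession, description) parsed from a UniRef100 title line."""
    acc = ACCESSION_RULE.match(title)
    desc = DESCRIPTION_RULE.match(title)
    return (acc.group(1) if acc else None,
            desc.group(1) if desc else None)


title = (">UniRef100_Q4U9M9 104 kDa microneme/rhoptry antigen "
         "n=1 Tax=Theileria annulata RepID=104K_THEAN")
accession, description = parse_title(title)
print(accession)    # prints: Q4U9M9
print(description)
```

The accession comes back without the UniRef100_ prefix, as intended, while the description is everything after the first space in the title line.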

Configuration (Mascot 2.3 and earlier)

For this example, the Fasta file was downloaded to C:\Inetpub\MASCOT\sequence\uniref100\current, decompressed using Gzip, and renamed to uniref100_9.6.fasta. Note that the rule numbers in your copy of mascot.dat may differ from those in the screenshot.

[Screenshot: Mascot database maintenance utility]
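
If Gzip is not to hand, the decompress-and-rename step can be sketched in Python using only the standard library. The function name decompress_fasta is hypothetical, and the file names in the commented example simply mirror those used above.

```python
import gzip
import shutil
from pathlib import Path


def decompress_fasta(gz_path, fasta_path):
    """Decompress a gzipped Fasta file to fasta_path.

    The data is streamed in chunks, so a large database does not have
    to fit in memory. The compressed file is left in place.
    """
    gz_path, fasta_path = Path(gz_path), Path(fasta_path)
    with gzip.open(gz_path, "rb") as src, open(fasta_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return fasta_path


# Example call, using the directory and file names from this page:
# current = Path(r"C:\Inetpub\MASCOT\sequence\uniref100\current")
# decompress_fasta(current / "uniref100.fasta.gz",
#                  current / "uniref100_9.6.fasta")
```

Writing directly to the versioned name combines decompression and renaming in one step.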

There isn't a downloadable reference file for UniRef, but full text for individual entries can be retrieved across the web from the EBI SRS server. For an SRS7 server, the syntax for the Path field is:

HTML: /srsbin/cgi-bin/wgetz?-e+[UNIREF100:UniRef100_#ACCESSION#]
Plain text: /srsbin/cgi-bin/wgetz?-e+[UNIREF100:UniRef100_#ACCESSION#]+-vn+2
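
To preview the request that results from a given Path setting, the sketch below fills in the #ACCESSION# placeholder by hand. It assumes that Mascot substitutes the parsed accession (with the UniRef100_ prefix already removed by the parse rule) for the placeholder. The function name full_text_url and the host srs.example.org are hypothetical; use the host name of the SRS server you actually configure.

```python
HTML_PATH = "/srsbin/cgi-bin/wgetz?-e+[UNIREF100:UniRef100_#ACCESSION#]"
TEXT_PATH = "/srsbin/cgi-bin/wgetz?-e+[UNIREF100:UniRef100_#ACCESSION#]+-vn+2"


def full_text_url(host, path_template, accession, port=80):
    """Build the URL obtained by replacing #ACCESSION# in the Path field."""
    path = path_template.replace("#ACCESSION#", accession)
    return f"http://{host}:{port}{path}"


# srs.example.org is a placeholder host, not a real SRS server.
print(full_text_url("srs.example.org", HTML_PATH, "Q4U9M9"))
print(full_text_url("srs.example.org", TEXT_PATH, "Q4U9M9"))
```

Because the parse rule strips UniRef100_ from the accession, the Path has to add it back in front of the placeholder, which is why the prefix appears literally in both templates.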

If you don't require full text in a Mascot Protein View report, simply leave the Host, Port, and Path fields blank and choose "--- no full text report ---" in the drop-down list.

Always test a new definition before applying the changes to mascot.dat.