Sequence Database Setup: UniProt Proteomes
Overview
A UniProt complete proteome consists of the set of proteins thought to be
expressed by an organism whose genome has been completely sequenced.
A reference proteome is the complete proteome of a representative,
well-studied model organism or an organism of interest for biomedical research.
UniProtKB is a collaboration between the European Bioinformatics Institute,
the Swiss Institute of Bioinformatics and the Protein Information Resource.
Download
Fasta files representing the proteome for an organism
can be downloaded by searching for a specific taxonomy accompanied by the
keyword "Complete proteome":
- Perform the query and view the resulting list of entries
(e.g. organism:9606 AND keyword:"Complete proteome" for the human proteome)
- Click the orange Download button in the query result page
- Choose the Fasta option that provides canonical and isoform sequence data in FASTA format
For example, to get the complete proteome for rice, search for
taxonomy:4530 AND keyword:"Complete proteome".
In Database Manager, create a new custom definition using UniProt_proteome_template as the template.
You can enable automatic updating of a
UniProt Proteome by setting the Fasta file URL. Just change the taxonomy ID in this sample URL to
the one for your proteome of interest:
http://www.uniprot.org/uniprot/?query=taxonomy:4530+AND+keyword:"Complete+proteome"&force=yes&format=fasta&include=yes
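Changing the taxonomy ID in the sample URL can be sketched in a few lines of Python. This is only an illustration: the function name proteome_fasta_url and the constant SAMPLE_URL are hypothetical, and the URL layout is copied from the sample above, so it is assumed rather than checked against the current UniProt web site.

```python
# Template copied from the sample URL above; only the taxonomy ID varies.
# SAMPLE_URL and proteome_fasta_url are illustrative names, not part of Mascot.
SAMPLE_URL = (
    'http://www.uniprot.org/uniprot/?query=taxonomy:{taxid}+AND+'
    'keyword:"Complete+proteome"&force=yes&format=fasta&include=yes'
)


def proteome_fasta_url(taxonomy_id):
    """Return the sample Fasta file URL with the given taxonomy ID substituted."""
    taxid = int(taxonomy_id)  # raises ValueError for non-numeric input
    if taxid <= 0:
        raise ValueError("taxonomy ID must be a positive integer")
    return SAMPLE_URL.format(taxid=taxid)


# Rice (4530) reproduces the sample URL; 9606 would give the human equivalent.
rice_url = proteome_fasta_url(4530)
human_url = proteome_fasta_url(9606)
```

The resulting string is what would be pasted into the Fasta file URL field.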
The complete configuration for the rice proteome in Database Manager will look similar to the settings described below.
Taxonomy
Taxonomy is not required for a single organism database.
Parse Rules
When a single entry is expanded into entries for multiple isoforms, they all share the same ID, so
the accession (AC) must be used as the unique identifier. For example:
>sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA ligase 4 OS=Oryza sativa subsp. japonica GN=4CL4
AC from Fasta title: ">..|\([^|]*\)"
Description from Fasta title: ">[^ ]* \(.*\)"
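To see what these two rules extract, they can be tried against the example title in Python. This is a rough check, not Mascot's own parser: the patterns are translated to Python syntax (groups written as ( ) rather than \( \), and the literal | escaped), and the names AC_RULE, DESC_RULE and parse_title are hypothetical.

```python
import re

# Python translations of the two parse rules above (assumed equivalent).
AC_RULE = re.compile(r">..\|([^|]*)")   # two characters, "|", then the accession
DESC_RULE = re.compile(r">[^ ]* (.*)")  # everything after the first space


def parse_title(title):
    """Return (accession, description) extracted from a Fasta title line."""
    ac_match = AC_RULE.match(title)
    desc_match = DESC_RULE.match(title)
    if ac_match is None or desc_match is None:
        raise ValueError("title does not match the parse rules")
    return ac_match.group(1), desc_match.group(1)


title = (">sp|Q67W82-2|4CL4_ORYSJ Isoform 2 of Probable 4-coumarate--CoA "
         "ligase 4 OS=Oryza sativa subsp. japonica GN=4CL4")
accession, description = parse_title(title)
print(accession)    # Q67W82-2
print(description)
```

The accession keeps its isoform suffix (-2), which is what makes it unique where the ID (4CL4_ORYSJ) is not.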
A Fasta file containing canonical and isoform sequences for the rice proteome was downloaded to
/usr/local/mascot/sequence/rice_proteome/current and renamed to
rice_proteome_20120414.fasta.
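The file name above appears to combine the database name with a download date. A minimal sketch of generating such a name, assuming the numeric part is a date in YYYYMMDD form (the helper name dated_fasta_name is hypothetical):

```python
from datetime import date


def dated_fasta_name(db_name, day):
    """Return a file name of the form <db_name>_<YYYYMMDD>.fasta."""
    # The YYYYMMDD reading of the stamp is an assumption based on the example.
    return f"{db_name}_{day:%Y%m%d}.fasta"


example_name = dated_fasta_name("rice_proteome", date(2012, 4, 14))
print(example_name)  # rice_proteome_20120414.fasta
```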
Full text for individual entries can be retrieved across the web
from UniProt:
Host: www.uniprot.org
Port: 80
Path: /uniprot/#ACCESSION#.txt
Parse rule: RULE_23 "\(.*\)"
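The host, port and path settings combine into a request URL once #ACCESSION# is replaced by an entry's accession. The sketch below illustrates that substitution; it assumes plain HTTP (suggested by port 80) and does not reproduce how Mascot itself issues the request. The function name full_text_url is hypothetical.

```python
def full_text_url(accession, host="www.uniprot.org", port=80,
                  path="/uniprot/#ACCESSION#.txt"):
    """Build the full text URL for one accession from the settings above."""
    resolved_path = path.replace("#ACCESSION#", accession)
    # Port 80 is the HTTP default, so it is left out of the URL.
    netloc = host if port == 80 else f"{host}:{port}"
    return f"http://{netloc}{resolved_path}"


entry_url = full_text_url("Q67W82-2")
print(entry_url)  # http://www.uniprot.org/uniprot/Q67W82-2.txt
```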
Always test a new definition before applying the changes to mascot.dat.