MS/MS Results Interpretation
Other help pages describe the format and content of the various result reports. In particular, refer to
Result Report Overview and
Summary Reports for MS/MS.
This page attempts to explain some of the underlying concepts, especially those relating to protein
inference.
In Mascot, the ions score for an MS/MS match is based on the calculated probability, P,
that the observed match between the experimental data and the database sequence
is a random event. The reported score is -10Log(P). So, during a
search, if 1500 peptides fell within the mass tolerance window about the
precursor mass, and the significance threshold was chosen to be 0.05 (a 1 in 20
chance of being a false positive), this would translate into a score threshold of 45.
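The arithmetic behind this figure can be checked with a short sketch. It assumes that the per-match probability cut-off is the significance threshold divided by the number of candidate peptides, which reproduces the value of 45 quoted above; the function name is hypothetical and is not part of Mascot.

```python
import math

def identity_threshold(significance: float, n_candidates: int) -> float:
    """Score needed for a match among n_candidates to reach the significance level.

    Assumption: the probability cut-off for a single match is
    significance / n_candidates, and the score is -10 * log10(P).
    """
    p_cutoff = significance / n_candidates
    return -10.0 * math.log10(p_cutoff)

# 1500 candidate peptides, significance threshold 0.05
threshold = identity_threshold(0.05, 1500)
print(round(threshold))
```

Under this assumption, a larger number of candidates (wider mass tolerance or a larger database) raises the threshold, while a smaller number lowers it.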
If the quality of an MS/MS spectrum is poor, particularly if the signal to noise
ratio is low, a match to the "correct" sequence might not exceed this absolute
threshold. Even so, the best match could have a relatively
high score, which is well separated from the distribution of
1500 random scores. In other words, the score is an outlier. This would
indicate that the match is not a random event and, if tested using a method
such as a target-decoy search,
such matches can be shown to be reliable.
For this reason, Mascot also attempts to characterise the distribution of random
scores, and provide a second, lower threshold to highlight the presence of any
outlier. The lower, relative threshold is reported as the homology threshold
while the higher threshold is reported as the identity threshold.
The identity threshold is still useful because it is not always possible to
estimate a homology threshold. If the instrument accuracy is very high or the database
is very small, there may only be a small handful of candidate sequences,
so that it is not possible to say whether a match is an outlier.
For a search of at least 1000 spectra, where an automatic
decoy search was used, you can choose to process the Mascot scores through
Percolator. This
uses machine learning to re-rank the matches, so as to obtain an optimum false discovery
rate. The revised probabilities are converted to scores for reporting purposes, together with
a single score threshold to indicate significance.
The protein score in the result report from an MS/MS search is derived from the ions scores.
For a search that contains a small number of queries, the protein score is the sum of the
highest ions score for each distinct sequence. The scores of duplicate matches,
which are shown in parentheses, are excluded.
A small correction is applied to reduce the contribution of low-scoring random matches.
This correction is a function of the total number of molecular mass matches for each query,
and is usually very small, except in no enzyme searches.
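As a rough illustration of the standard protein score, the sketch below sums the best ions score for each distinct peptide sequence. The small correction for random matches is not modelled, because its exact form is not given here, and the function name, sequences and scores are illustrative only.

```python
def standard_protein_score(matches):
    """Sum the highest ions score for each distinct peptide sequence.

    `matches` is an iterable of (sequence, ions_score) pairs assigned to one
    protein. Duplicate matches to the same sequence contribute nothing beyond
    the best score. The correction for random matches is NOT applied here.
    """
    best = {}
    for sequence, score in matches:
        if score > best.get(sequence, float("-inf")):
            best[sequence] = score
    return sum(best.values())

# Illustrative data: the second match is a lower-scoring duplicate
matches = [("LVNELTEFAK", 52.0), ("LVNELTEFAK", 31.0), ("YLYEIAR", 40.0)]
print(standard_protein_score(matches))
```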
This protein score works well for small searches, and provides a logical order to the report. If multiple
queries match to a single protein, but the individual ions scores are below threshold, the
combined ions scores can still place the protein high in the report. However, the
standard protein score is less satisfactory for searches with very large numbers of queries,
such as MudPIT data sets. For each MS/MS query, Mascot
retains up to 10 peptide matches. When the number of queries is comparable with the number
of entries in the database, this means that there can be random, low-scoring matches
for every entry. Although the average number of random matches per entry might be low, the actual number
will follow a distribution, and some entries will have large numbers of low scoring matches,
leading to large protein scores.
While it is obvious from a detailed study of the report that these are meaningless matches,
it would be better to eliminate them entirely. So, if the ratio between the number of queries
and the number of entries in the database exceeds a
pre-determined threshold, the basis for calculating the protein score is changed. Only
those ions scores that exceed one or both significance thresholds contribute to the score,
so that low scoring, random matches have no effect. This gives a much cleaner report for a large
scale search. This threshold is 0.001 by default, and can be changed on a global
basis in the configuration file, mascot.dat, or changed for a single report by using the
format controls at the top of the report. Note
that, when calculating this threshold, if a taxonomy filter is being used, the number of
entries in the database is the number remaining after the taxonomy filter.
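The switch between the two scoring modes can be sketched as follows. This is a simplified illustration, not the Mascot implementation: it uses a single significance threshold for all matches, and it only shows which ions scores would contribute, not how the final protein score is assembled. All names are hypothetical.

```python
def use_mudpit_scoring(n_queries, n_entries, switch=0.001):
    """Return True when the queries-to-entries ratio exceeds the switch value.

    n_entries should be the number of database entries remaining after any
    taxonomy filter has been applied.
    """
    return n_queries / n_entries > switch

def contributing_scores(ions_scores, threshold, mudpit):
    """Select the ions scores that are allowed to contribute to a protein score.

    Simplification: one threshold for every match.
    """
    if not mudpit:
        return list(ions_scores)
    return [score for score in ions_scores if score > threshold]

# 50,000 queries against 500,000 entries gives a ratio of 0.1
mudpit = use_mudpit_scoring(50_000, 500_000)
print(mudpit, contributing_scores([12, 18, 47, 63], threshold=40, mudpit=mudpit))
```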
When MS/MS spectra are searched against a sequence database, we are matching peptides, not proteins.
In most cases, the matched peptides will not be unique to a single protein. Yet,
we usually want to know which proteins were present in the sample. So, we are faced
with the challenge of protein inference: given a set of peptide matches, which
proteins do we believe were present in the sample?
The usual approach is based on the "Principle of Parsimony". We report
the minimum set of proteins that account for the observed peptide matches. If we had
four peptide matches, two of which occurred in protein A and two in protein B but
all four were found in protein C, we would report that protein C had been identified.
Proteins A and B might be listed as "sub-set" proteins. It is perfectly possible
that our sample actually contained a mixture of proteins A and B, but there is no
evidence for this.
The Peptide Summary and
Select Summary use a very simple algorithm.
First, we take the protein with the highest protein score, and call this hit number 1.
We then take all other proteins that share the same set of peptide matches or a sub-set
and include these in the same hit. In the report, they are listed as same-set and
sub-set proteins. With these proteins removed from the list, we now take the remaining
protein with the highest score and repeat the process until all the significant
peptide matches are accounted for.
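A minimal sketch of this greedy grouping is shown below. It assumes that the input contains only significant peptide matches, represents each protein as a set of matched peptides, and uses hypothetical names throughout; it is an illustration of the idea rather than the report code itself.

```python
def group_hits(protein_peptides, protein_scores):
    """Greedily group proteins into hits with same-set and sub-set members.

    protein_peptides: dict mapping accession -> set of matched peptides
    protein_scores:   dict mapping accession -> protein score
    """
    remaining = dict(protein_peptides)
    hits = []
    while remaining:
        # The highest scoring remaining protein anchors the next hit
        anchor = max(remaining, key=lambda acc: protein_scores[acc])
        anchor_peptides = remaining.pop(anchor)
        same_set, sub_set = [], []
        for acc in sorted(remaining):
            peptides = remaining[acc]
            if peptides == anchor_peptides:
                same_set.append(acc)
            elif peptides < anchor_peptides:  # proper sub-set
                sub_set.append(acc)
        # Grouped proteins are removed before the next round
        for acc in same_set + sub_set:
            del remaining[acc]
        hits.append({"anchor": anchor, "same_set": same_set, "sub_set": sub_set})
    return hits

# The example from the text: A and B each hold two matches, C holds all four
peptides = {"A": {"p1", "p2"}, "B": {"p3", "p4"}, "C": {"p1", "p2", "p3", "p4"}}
scores = {"A": 80, "B": 90, "C": 170}
print(group_hits(peptides, scores))
```

With this input, protein C becomes hit number 1 and proteins A and B are listed as its sub-set proteins.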
This sounds simple enough, and works well for small datasets, but larger search results
create difficulties:
- What if two proteins have many strong matches in common but one has an additional
weak match? Should we treat one as the outright winner, and relegate the other to
the status of sub-set?
- What if we have intersections? That is, the protein is not a sub-set of
any other one protein, but all the matches can be
found in a set of proteins, each of which has additional matches.
- In many cases, the exact sequence of the protein that was analysed is not in the
database. All the peptide sequences are present, but spread across several homologous
proteins, which might be splice variants or represent different combinations of SNPs.
The Protein Family Summary tries
to address these difficulties by clustering proteins into families. The algorithm works
as follows:
1. Create a list of proteins, ordered by protein score
2. Take the highest scoring protein
3. Find all the family members for this protein:
   - select all matches with a score at or above the homology threshold
   - for each match, select all the other proteins that contain this match
     (using the score as a test to include matches that are identical matches though
     not identical sequences, e.g. I to L substitution or other differences that
     have no impact on the score)
   - for each new protein, select all new matches with a score at or above the homology threshold
   - loop until all related proteins and matches have been found
   Note that this grouping into families is based on significant matches. Non-significant matches are ignored.
4. Report this family as a single hit. All these proteins can be removed from the list
   - For each protein in the family, make a list of the distinct peptide sequences. That is, ignore differences in score,
     modifications, charge, etc. Where there are duplicate matches, use the highest score
   - Divide and group the proteins into same-set proteins and sub-set proteins; sub-sets include intersections
   - Where there are same-set proteins, collapse into a single family member
   - Move any proteins that are sub-sets or intersections to the sub-sets list
   - Perform hierarchical clustering on the family members, using the score
     excess over threshold of the non-shared matches as the distance metric
5. Loop from step 2 until no more proteins remain that contain matches with homology score or better
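The family-finding part of this procedure is essentially a search for all proteins connected through shared matches. The sketch below illustrates that idea with a breadth-first search. It assumes the input already contains only matches at or above the homology threshold, and it does not model the score test for equivalent sequences such as I to L substitutions. The names are hypothetical.

```python
from collections import defaultdict, deque

def build_family(seed, protein_matches):
    """Collect every protein linked to `seed` through shared significant matches.

    protein_matches: dict mapping accession -> set of match identifiers,
    restricted to matches at or above the homology threshold.
    """
    # Invert the mapping so each match points at the proteins containing it
    match_proteins = defaultdict(set)
    for accession, matches in protein_matches.items():
        for match in matches:
            match_proteins[match].add(accession)

    family = {seed}
    queue = deque([seed])
    while queue:  # loop until no new proteins are found
        protein = queue.popleft()
        for match in protein_matches[protein]:
            for other in match_proteins[match]:
                if other not in family:
                    family.add(other)
                    queue.append(other)
    return family

protein_matches = {
    "P1": {"a", "b"},
    "P2": {"b", "c"},
    "P3": {"c", "d"},
    "P4": {"e"},
}
print(sorted(build_family("P1", protein_matches)))
```

In this example P1 and P3 end up in the same family even though they have no match in common, because both share a match with P2.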
The goal is to present the possible protein assignments clearly, so that someone with knowledge
of the biology can make an informed decision as to which proteins are present. In most cases, there
will be some ambiguity about precisely which proteins are present. As mentioned earlier, the exact sequence of
an analyte may not be in the database, and peptide matches may be distributed across multiple, homologous
database entries. If it is essential to characterise the complete protein sequence, or to choose between
splice variants, or to confirm a SNP, it is likely that additional, targeted experiments will be required.
To cluster proteins into families, we use the score of the non-shared matches as the distance
between two proteins. More precisely, we use the score excess over the significance threshold,
since a score below significance threshold could be random, and should not be taken as evidence
for two different proteins being present. This means that matches below threshold play no part
in the clustering process. Each distinct peptide sequence is represented once
by the match with the highest score. Matches to the same sequence with different charge states
or with different modifications are considered duplicates.
If two proteins have the same set of
peptide matches, the distance
between them is zero. If they have just a single shared match, the distance between them is the
sum of the score excesses of all the non-shared matches in one protein, since discarding these
would make the protein a sub-set of the other, based on the single shared match.
There are some subtleties to this procedure. Consider the case of two proteins which have different
peptide matches to the same query with the same score. Only one of these matches can be correct,
but we don't know which. One obvious example is where the
two sequences differ only in exchange of I and L. In terms of the mass spectrum, these
sequences are identical. Unless the mass accuracy is high, the same is true for exchange
of Q and K or F and oxidised M. Clearly, a sequence containing F at a particular position
is very different, in biological terms, from one containing M at the same position. But,
if the scores are the same, there is simply no evidence from the mass spectrometry data for
two proteins. In terms of a distance matrix, we must treat it as if there was no match
to either peptide.
Now, consider the case where we have two proteins with different
peptide matches to the same query and the scores are not the same. Assume the threshold is 40 and
one has a score of 50 and the other has a score of 60. Again, only one of these matches can be correct;
it is not the same as if they were independent matches to different queries. Extending the logic that
matches to the same query with the same score correspond to a distance of zero, matches to the
same query with different scores correspond to a distance that is the score difference. In this example,
the distance would be 10. If the two matches came from different queries, and could be treated independently,
the distance would be (60 - 40) + (50 - 40) = 30.
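The two cases in this example can be written out as a small function. This is a sketch of the arithmetic only, for a single pair of non-shared matches, and it assumes both scores are at or above the threshold; the function name is hypothetical.

```python
def match_pair_distance(score_a, score_b, threshold, same_query):
    """Distance contributed by one non-shared match in each of two proteins.

    Assumes both scores are at or above the significance threshold.
    """
    if same_query:
        # Only one of the two matches can be correct: use the score difference
        return abs(score_a - score_b)
    # Independent matches: sum the score excesses over threshold
    return (score_a - threshold) + (score_b - threshold)

print(match_pair_distance(60, 50, 40, same_query=True))
print(match_pair_distance(60, 50, 40, same_query=False))
```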
To create the dendrogram, we first compute a distance matrix, which is the distance between each
pair of proteins. The two proteins separated by the smallest distance are joined to create a
node, with the length of the branches from the node being the score distance between the proteins.
The two joined proteins
are removed from the list, replaced by the node, and the distances between the new node and all other remaining
proteins (or nodes) are computed. The process is repeated until only one node remains.
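The joining procedure can be sketched as a simple agglomerative clustering loop. How the distance between a new node and the remaining proteins is computed is not stated here, so the sketch assumes the average of the distances from the two joined members; this linkage rule, like the function name, is an assumption made for illustration.

```python
from itertools import combinations

def cluster_proteins(distances):
    """Join the closest pair repeatedly until a single node remains.

    distances: dict mapping frozenset({a, b}) -> distance for every pair.
    Returns the joins as (left, right, distance) tuples, in the order made.
    ASSUMPTION: node-to-other distance is the mean of the two members'
    distances to that other protein or node.
    """
    dist = dict(distances)
    nodes = set()
    for pair in distances:
        nodes |= pair
    merges = []
    while len(nodes) > 1:
        ordered = sorted(nodes, key=str)  # deterministic tie-breaking
        left, right = min(combinations(ordered, 2),
                          key=lambda p: dist[frozenset(p)])
        merges.append((left, right, dist[frozenset((left, right))]))
        nodes -= {left, right}
        node = (left, right)  # the new node replaces the joined pair
        for other in nodes:
            dist[frozenset((node, other))] = (
                dist[frozenset((left, other))] + dist[frozenset((right, other))]
            ) / 2
        nodes.add(node)
    return merges

distances = {
    frozenset(("A", "B")): 10,
    frozenset(("A", "C")): 30,
    frozenset(("B", "C")): 40,
}
print(cluster_proteins(distances))
```

Here A and B are joined first at a distance of 10, and the resulting node is then joined to C.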
When the dendrogram (or tree) is drawn, the order is chosen to avoid any branches crossing. There
is no other significance to the order of the branches, and there are many possible ways to order the branches
so as to avoid crossings. In the tabular part of the report, proteins are sorted in order of decreasing score,
and this will often be different from the dendrogram order.
Note that, if you select a pair of family members from a large family, it is perfectly possible
that they will have no shared matches. Each family member will have shared matches with at least one other
family member, or they would not have been grouped into the same family, but this doesn't mean that there are
going to be shared matches between every pair.
The Preferred Taxonomy format control allows you to
specify a preferred taxonomy for the anchor protein in cases where
there is a choice of indistinguishable proteins.
Imagine we are studying dormice, which are not well represented in any protein database.
We choose the broader taxonomy of Rodentia so that we can get matches to homologous
proteins from other rodents. But, if a hit contains same-set matches to proteins from
rat, mouse and dormouse, we can ensure the
dormouse entry will be selected as the anchor protein by specifying Gliridae as
the Preferred Taxonomy.
Another situation where Preferred Taxonomy can come in useful is for a database like
NCBI nr, where each entry represents multiple proteins. By default, it is always
the first protein in the title line that is selected as anchor protein. You might
search with a taxonomy filter of dog and pull out an entry for a protein that was found in both
cat and dog and happened to have cat listed first. Setting a Preferred Taxonomy of dog
will ensure the dog accession and description are selected for display in such cases.
In Mascot 2.4, the additional taxonomy information required for this function is saved in the
result file, and the preferred taxonomy control will always be available for new searches
of databases for which taxonomy is defined. If the result
file comes from Mascot 2.3 or earlier, the databases that were used in the search need to be
online. Otherwise the control will be hidden, because there would be no way
to retrieve the required taxonomy information.
Finally, note that the default taxonomy list shipped with Mascot is limited
to a small number of well characterised organisms, and this doesn't include either
cat or dog. So, for the second example, you would need to edit the file called taxonomy
in the Mascot config directory to add the required entries. For example, the categories under
mammals in the default file might look like this:
Title:. . . . . . . . . . . . Mammalia (mammals)
Include: 40674
Exclude:
*
Title:. . . . . . . . . . . . . . Primates
Include: 9443
Exclude:
*
Title:. . . . . . . . . . . . . . . . Homo sapiens (human)
Include: 9606
Exclude:
*
Title:. . . . . . . . . . . . . . . . Other primates
Include: 9443
Exclude: 9606
*
Title:. . . . . . . . . . . . . . Rodentia (Rodents)
Include: 9989
Exclude:
*
Title:. . . . . . . . . . . . . . . . Mus.
Include: 10088
Exclude:
*
Title:. . . . . . . . . . . . . . . . . . Mus musculus (house mouse)
Include: 10090
Exclude:
*
Title:. . . . . . . . . . . . . . . . Rattus
Include: 10114
Exclude:
*
Title:. . . . . . . . . . . . . . . . Other rodentia
Include: 9989
Exclude: 10088, 10114
*
Title:. . . . . . . . . . . . . . Other mammalia
Include: 40674
Exclude: 9443, 9989
*
To add dog to the list of choices, insert a Canis familiaris entry and add its TaxID, 9615, to the Exclude list of Other mammalia:
Title:. . . . . . . . . . . . . . . . Other rodentia
Include: 9989
Exclude: 10088, 10114
*
Title:. . . . . . . . . . . . . . Canis familiaris
Include: 9615
Exclude:
*
Title:. . . . . . . . . . . . . . Other mammalia
Include: 40674
Exclude: 9443, 9989, 9615
*
The NCBI Taxonomy Browser is invaluable for looking up TaxID codes and
finding where a particular organism fits into the tree of life. It also lists
the number of entries in GenBank for each taxonomy, which is a useful way to discover whether
a particular taxonomy might be too narrow. Never choose a taxonomy that has fewer
than two thousand proteins; move to a higher level so as to search a reasonable number
of entries.
The Protein Family Summary is expressly designed for
large search results. Because it is a paged report that initially displays
only the first ten families, it will usually succeed on a 32-bit platform and always on a
64-bit platform. If, for some reason, you need to view results using the earlier,
Select Summary report, this section contains some tips.
The format controls near the top of the report
can help streamline the results from a large search by eliminating most of the "junk".
If the report is too large to open in the first place, these options can also
be specified by adding URL switches to
the report URL.
- View the report on a client with plenty of free physical RAM.
Do not try to view the report in a browser running on the Mascot server.
- Select Summary: Ensure you are using the Select Summary.
If you are using a third party client that has specified Peptide Summary,
add this to the URL before opening the file: &REPTYPE=select
- Don't specify too many hits: Use AUTO to report only protein hits that contain
significant peptide matches
Add this to the URL before opening the file: &REPORT=AUTO
- MudPIT Protein Scoring: By default, large searches
will switch to using more aggressive protein scoring. This removes
many of the junk protein hits, which have high protein scores but no high scoring
peptide matches. Do not be tempted to switch back to standard scoring.
Add this to the URL before opening the file: &_server_mudpit_switch=0.000000001
- Require Bold Red: The Select Summary report does not detect intersections.
Red and bold
typefaces are used to highlight the most logical assignment
of peptides to proteins. The first time a peptide match to a query appears in the report, it is
shown in bold face. Whenever the top ranking peptide match appears, it is shown in red.
Thus, a bold red match is the highest scoring match to a particular query listed under the
highest scoring protein containing that match.
This means that protein hits with many peptide matches that are both bold and red are the most
likely assignments. Conversely, a protein that does not contain any bold red matches is
an intersection of proteins listed higher in the report.
Requiring a protein hit to include at least one bold
red peptide match is a good way to filter homologous proteins from a report.
The down-side is that you may sometimes throw out the wrong protein! For example, imagine you
are searching with a taxonomy of mammals but are mainly interested in yeti proteins. If the
same strong peptide matches are found in a yeti protein and also in the human homolog,
and one or more junk peptide matches prevent the two proteins collapsing into a single hit, but give the
human protein a slightly higher score, that is the one that will feature in the report.
Add this to the URL before opening the file: &_requireboldred=1
- Ignore Ions Score Below: You can minimise the previous problem by judicious use of the
Ions score cut-off field. By setting this to a value of 1 or more, you filter out all of the matches
with lower scores. When set to a value between 0 and 1, it becomes an expect value cut-off,
filtering out matches with higher expect values. Removing random matches means that
homologous proteins are more likely to collapse
into a single hit.
Add this to the URL before opening the file: &_ignoreionsscorebelow=0.5
- Suppress the pop-ups: The JavaScript pop-up windows that show the top 10 peptide matches
for each query are very useful, but they make the HTML report much larger and slower to load
in a web browser. If you have a report that never seems to load, or is very slow to scroll, try
using the radio buttons to suppress pop-ups.
Add this to the URL before opening the file: &_showpopups=FALSE
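When several of these switches are needed at once, it can be convenient to build the report URL in a script. The sketch below simply appends the switches listed above to an existing report URL. The base URL is a made-up placeholder, not a real Mascot address, and the helper name is hypothetical.

```python
def add_report_switches(report_url, switches):
    """Append &NAME=value pairs to a report URL that already has a query string.

    switches: dict mapping switch name -> value, added in insertion order.
    """
    suffix = "".join(f"&{name}={value}" for name, value in switches.items())
    return report_url + suffix

# Placeholder URL for illustration only
base = "http://mascot.example/report?file=F000001.dat"
url = add_report_switches(base, {
    "REPTYPE": "select",
    "REPORT": "AUTO",
    "_requireboldred": 1,
    "_showpopups": "FALSE",
})
print(url)
```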