Percolator
Percolator is an algorithm that uses semi-supervised machine learning to improve the
discrimination between correct and incorrect spectrum identifications. The matches from
searching a decoy database provide the negative examples for the classifier, and a subset of
the high-scoring matches from the target database provides the positive examples. Percolator
trains a support vector machine (SVM) to discriminate between the positive and negative
matches by assigning weights to a number of features. Examples of features include Mascot
score, precursor mass error, fragment mass error, and number of variable modifications.
The vector of features with their optimal weights is then used to re-rank matches from all
queries, often leading to improved sensitivity.
Percolator was developed by Lukas Käll, Jesse D Canterbury, Jason Weston,
William Stafford Noble, & Michael J MacCoss at the University of Washington,
Department of Genome Sciences. The software is released under an
Apache 2.0 licence
and included with Mascot by permission.
We would also like to acknowledge the work of Markus Brosch and colleagues at
the Sanger Centre, Hinxton, UK, who first applied Percolator to Mascot results
and developed a wrapper application called
Mascot Percolator.
There are a number of relevant publications:
- Käll, L., et al., Semi-supervised learning for peptide identification from shotgun proteomics datasets, Nature Methods 4 923-925 (2007)
- Käll, L., et al., Posterior error probabilities and false discovery rates: Two sides of the same coin, Journal of Proteome Research 7 40-44 (2008)
- Käll, L., et al., Assigning significance to peptides identified by tandem mass spectrometry using decoy databases, Journal of Proteome Research 7 29-34 (2008)
- Käll, L., et al., Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry, Bioinformatics 24 I42-I48 (2008)
- Brosch, M., et al., Accurate and Sensitive Peptide Identification with Mascot Percolator, Journal of Proteome Research 8 3176-3181 (2009)
- Spivak, M., et al., Improvements to the Percolator Algorithm for Peptide Identification from Shotgun Proteomics Data Sets, Journal of Proteome Research 8 3737-3745 (2009)
Percolator returns p values, q values and Posterior Error Probabilities (PEPs) for each match.
The q value of a match is the minimum false discovery rate at which that match would be accepted. If we accept
all matches with q values of 0.01 or less, the false discovery rate will be 1%. The PEP is the probability that
an individual match is a chance event.
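To make the relationship between q values and the false discovery rate concrete, here is a minimal sketch of a plain target-decoy estimate. It is not Percolator's own implementation, which is more refined; the function name `q_values` and its inputs are hypothetical, and ties between scores are ignored for simplicity.

```python
def q_values(scores, is_decoy):
    """Estimate q values by simple target-decoy counting (illustrative only).

    scores   -- list of match scores, higher is better
    is_decoy -- list of booleans, True where the match came from the decoy database
    """
    # Walk through the matches from best score to worst.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    fdrs = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR if the threshold is placed just below this match.
        fdrs.append(decoys / max(targets, 1))

    # The q value is the smallest FDR at this threshold or any more permissive one.
    q = [0.0] * len(scores)
    running_min = float("inf")
    for rank in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdrs[rank])
        q[order[rank]] = min(running_min, 1.0)
    return q


# Accepting every target match with q <= 0.01 gives an estimated FDR of 1%.
example_q = q_values([10, 9, 8, 7], [False, False, True, False])
```

Taking the running minimum is what makes the q value monotonic in the score, so a better-scoring match never has a larger q value than a worse one.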
The requirements for using Percolator to re-rank the matches from a Mascot search are:
- The search must be an MS/MS search
- The search must include the results from an automatic decoy database search
- The search must contain at least 100 queries
- At least 100 database entries must be searched
If these requirements are met,
the result report will include a checkbox Show Percolator scores. When this is checked and the report re-loaded, the
original Mascot scores will be replaced as follows:
- Score: -10*log10(PEP)
- Expect value: PEP
- Identity threshold score for p<0.05: 13
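The substitution listed above can be checked with a few lines of arithmetic. This sketch assumes the logarithm is base 10, which is consistent with the threshold of 13 for p<0.05; the function names are hypothetical and not part of Mascot.

```python
import math


def percolator_score(pep):
    """Convert a posterior error probability into a score of -10*log10(PEP)."""
    if not 0.0 < pep <= 1.0:
        raise ValueError("PEP must be in the interval (0, 1]")
    return -10.0 * math.log10(pep)


def pep_from_score(score):
    """Invert the conversion: recover the PEP from a displayed score."""
    return 10.0 ** (-score / 10.0)


# A PEP of 0.05 maps onto the fixed identity threshold.
threshold = percolator_score(0.05)
print(round(threshold, 2))  # 13.01
```

Because every match is scored on the same PEP scale, a single threshold of 13 applies to all queries.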
Features
The complete set of features that can be made available to Percolator is defined in code. You can
choose a sub-set of these features using a setting in the Options section of the Mascot configuration file,
mascot.dat. The default setting, as shipped, is:
PercolatorFeatures mScore,lgDScore, mrCalc, charge, dM, dMppm, absDM, absDMppm, isoDM, isoDMppm,
mc, varmods, totInt, intMatchedTot,relIntMatchedTot
For complete details of Percolator configuration settings and a description of the data flow, refer
to the Mascot Setup & Installation manual.
List of features available to Percolator
Feature name | Description |
retentionTime | Retention time in seconds if available |
dM | Calculated minus observed peptide mass in Da |
mScore | Mascot score (always on) |
lgDScore | Mascot score minus Mascot score of next best non-isobaric peptide hit |
mrCalc | Calculated Mr |
charge | Charge |
dMppm | Calculated minus observed peptide mass in ppm |
absDM | Absolute value of calculated minus observed peptide mass in Da |
absDMppm | Absolute value of calculated minus observed peptide mass in ppm |
isoDM | Calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in Da |
isoDMppm | Calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in ppm |
mc | Number of missed cleavages (always 0 if no enzyme) |
varmods | Number of modified sites divided by number of modifiable sites |
varmodsCount | The number of distinct variable modifications used in the peptide. For example, if there are 10 Met and 5 of these are oxidised, this counts as 1. A peptide with Met-Ox, phosphoS, deamidation, and acetylation would count as 4. |
modifiable | Total number of modifiable sites |
modified | Total number of modified residues and termini |
totInt | Log total ion intensity. Only the 20 most intense peaks in each 100 Da bin are used for all features, and totInt is computed from these peaks |
intMatchedTot | Log total matched ion intensity |
relIntMatchedTot | Total matched ion intensity divided by total ion intensity as a percentage (no logs involved) |
fragDeltaMed | Median value of all matched fragment errors in Da |
fragDeltaIqr | Interquartile range value of all matched fragment errors in Da |
fragDeltaMedPPM | Median value of all matched fragment errors in ppm |
fragDeltaIqrPPM | Interquartile range value of all matched fragment errors in ppm |
fragDeltaPolyFit | 2nd order polynomial fit to m/z vs delta. Result is RSquared multiplied by the number of points divided by 100 |
longest | Longest sequence of matched ions, reported separately for each ion series (backbone only), as with fracIonsMatched |
fracIonsMatched | Fraction of calculated ions matched, reported separately for each ion series, with NLs lumped together (e.g. fracIonsMatchedB1, fracIonsMatchedB1deriv, fracIonsMatchedB2, fracIonsMatchedB2deriv) |
matchedIntensity | Matched ion intensity, reported separately for each ion series, as with fracIonsMatched |
qmatch | The number of peptide matches for which an MS/MS match was attempted |
peptide | The peptide string that was matched |
proteins | A tab-separated list of accessions of proteins that contain this peptide. Must be the last feature in the list |
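As an illustration of how the precursor mass error features in the table relate to one another, the following sketch derives them from a calculated and an observed peptide mass. It is not Mascot's code. The function name, the isotope spacing of 1.00335 Da, and the choice of the calculated mass as the ppm denominator are all assumptions made for this example.

```python
ISOTOPE_SPACING = 1.00335  # assumed 13C - 12C mass difference in Da


def mass_error_features(mr_calc, mr_obs, max_isotope_error=2):
    """Return a dict of mass error features, named as in the table above."""
    d_m = mr_calc - mr_obs          # calculated minus observed, in Da
    to_ppm = 1e6 / mr_calc          # assumption: ppm relative to calculated mass

    # Try shifting by -2..+2 isotope spacings and keep the smallest residual error.
    candidates = [
        d_m + k * ISOTOPE_SPACING
        for k in range(-max_isotope_error, max_isotope_error + 1)
    ]
    iso_d_m = min(candidates, key=abs)

    return {
        "dM": d_m,
        "dMppm": d_m * to_ppm,
        "absDM": abs(d_m),
        "absDMppm": abs(d_m) * to_ppm,
        "isoDM": iso_d_m,
        "isoDMppm": iso_d_m * to_ppm,
    }


# A precursor picked one isotope peak too high leaves only a small residual in isoDM.
features = mass_error_features(1000.0, 1001.00435)
```

In this example dM is about -1.004 Da, while isoDM is about -0.001 Da, which is why the two features carry different information for the classifier.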
One feature is treated differently from the others: retention time. If retention time is included in the peak list, so that it
is available in the Mascot result file, it can be used as a feature by comparing the experimental RT values with values calculated by
Percolator. To enable this:
- The peak list must supply retention time information using the MGF RTINSECONDS parameter. It is not sufficient to have the information embedded in the scan title string
- retentionTime must be listed in the PercolatorFeatures line in mascot.dat
- In the Options section of mascot.dat, set PercolatorUseRT to 1 to turn this feature on by default. Otherwise, add the argument percolate_rt=1 to the report URL
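Before enabling the retention time feature, it can be useful to confirm that the peak list really carries RTINSECONDS for its spectra. The sketch below is a hypothetical helper, not part of Mascot, and it assumes standard MGF layout in which each spectrum sits between BEGIN IONS and END IONS lines.

```python
def count_rt_coverage(mgf_text):
    """Return (spectra_with_rt, total_spectra) for MGF-formatted text."""
    total = with_rt = 0
    in_ions = has_rt = False
    for raw in mgf_text.splitlines():
        upper = raw.strip().upper()
        if upper == "BEGIN IONS":
            in_ions, has_rt = True, False
        elif upper == "END IONS":
            if in_ions:
                total += 1
                with_rt += has_rt  # True counts as 1, False as 0
            in_ions = False
        elif in_ions and upper.startswith("RTINSECONDS="):
            # Only the dedicated parameter counts; a time in TITLE does not.
            has_rt = True
    return with_rt, total


sample = """BEGIN IONS
TITLE=scan 1
RTINSECONDS=1234.5
PEPMASS=500.25
100.1 20
END IONS
BEGIN IONS
TITLE=scan 2 RT=20.5 min
PEPMASS=600.30
110.2 15
END IONS
"""
print(count_rt_coverage(sample))  # (1, 2)
```

The second spectrum in the sample has a retention time only in its title string, so it is not counted.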
Application Notes
Percolator will usually give a worthwhile improvement in sensitivity. There are occasions when it can fail.
For example, if there are very few good matches in the search results, it may not have enough positive
examples to work with.
If there are multiple, high scoring matches to a single query, the current
approach is to submit only the first rank match to Percolator. The other matches to the same
query are then re-scored by pro-rating the new score for the rank 1 match. Thus, if there were matches
to multiple peptides which differed only in (say) I and L, all of which had the same Mascot score, they
would still have the same score after re-ranking with Percolator. Similarly, if the top 3 matches had
Mascot scores of 60, 50, and 40, and Percolator re-scored the rank 1 match to 54, the rank 2 and 3 matches
would be re-scored to 45 and 36. This avoids anomalies, but it is not ideal. If we accept that the
weighted vector of features is doing a good job of re-ranking matches from different queries, it
is only logical to re-rank the alternative matches to a single query. This would allow a rank 2 match
to be promoted over the rank 1 match, which cannot happen using the current approach.
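The pro-rating described above amounts to scaling every match for a query by the same factor. This is a minimal sketch with a hypothetical function name, reproducing the worked example from the text.

```python
def prorate_scores(mascot_scores, new_rank1_score):
    """Rescale all matches to one query in proportion to the rank 1 re-score.

    mascot_scores   -- original Mascot scores, rank 1 first
    new_rank1_score -- score assigned to the rank 1 match by Percolator
    """
    if not mascot_scores:
        return []
    rank1 = mascot_scores[0]
    if rank1 <= 0:
        raise ValueError("rank 1 Mascot score must be positive")
    # Every match keeps the same ratio to the rank 1 match as before.
    return [score * new_rank1_score / rank1 for score in mascot_scores]


print(prorate_scores([60, 50, 40], 54))  # [54.0, 45.0, 36.0]
```

Because the scaling factor is shared, matches with equal Mascot scores stay equal and the rank order within a query never changes, which is the limitation noted above.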