Percolator

Percolator is an algorithm that uses semi-supervised machine learning to improve the discrimination between correct and incorrect spectrum identifications. The matches from searching a decoy database provide the negative examples for the classifier, and a subset of the high-scoring matches from the target database provides the positive examples. Percolator trains a machine learning algorithm called a support vector machine (SVM) to discriminate between the positive and negative matches by assigning weights to a number of features. Examples of features include Mascot score, precursor mass error, fragment mass error, and number of variable modifications. The vector of features with their optimal weights is then used to re-rank matches from all queries, often leading to improved sensitivity.
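
The training loop described above can be sketched in a few lines of Python. This is an illustration only, not Percolator's implementation: a linear classifier trained by hinge-loss subgradient descent stands in for the real SVM, the two-feature data set is synthetic, and every name (accepted_cutoff, train_linear, rerank) is hypothetical.

```python
import random

def accepted_cutoff(scores, is_decoy, fdr=0.01):
    """Lowest target score at which decoys/targets among equal-or-better matches is <= fdr."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    targets = decoys = 0
    cutoff = float("inf")  # stays infinite if no target passes
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
            if decoys / targets <= fdr:
                cutoff = scores[i]
    return cutoff

def train_linear(rows, labels, epochs=50, lr=0.01, reg=0.01):
    """Hinge-loss subgradient descent: a simple stand-in for the SVM."""
    n = len(rows[0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in zip(rows, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            violated = margin < 1
            for j in range(n):
                grad = reg * w[j] - (y * x[j] if violated else 0.0)
                w[j] -= lr * grad
            if violated:
                b += lr * y
    return w, b

def rerank(rows, is_decoy, iterations=3):
    """Iteratively pick confident targets, train against decoys, and re-score everything."""
    scores = [r[0] for r in rows]  # start from the first feature (e.g. the search score)
    for _ in range(iterations):
        cutoff = accepted_cutoff(scores, is_decoy)
        train_rows, labels = [], []
        for r, s, d in zip(rows, scores, is_decoy):
            if d:                      # every decoy match is a negative example
                train_rows.append(r)
                labels.append(-1)
            elif s >= cutoff:          # confident target matches are positive examples
                train_rows.append(r)
                labels.append(1)
        w, b = train_linear(train_rows, labels)
        scores = [sum(wi * xi for wi, xi in zip(w, r)) + b for r in rows]
    return scores

# Synthetic demo: correct target matches score higher on both features.
rng = random.Random(0)
rows, is_decoy, is_correct = [], [], []
for kind in ["correct"] * 200 + ["incorrect"] * 200 + ["decoy"] * 400:
    mean = 3.0 if kind == "correct" else 0.0
    rows.append([rng.gauss(mean, 1.0), rng.gauss(mean, 1.0)])
    is_decoy.append(kind == "decoy")
    is_correct.append(kind == "correct")

new_scores = rerank(rows, is_decoy)
```

Because the second feature also separates correct from incorrect matches, the learned weights combine both features, which is the mechanism behind the sensitivity gain described above.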

Percolator was developed by Lukas Käll, Jesse D Canterbury, Jason Weston, William Stafford Noble, & Michael J MacCoss at the University of Washington, Department of Genome Sciences. The software is released under an Apache 2.0 licence and included with Mascot by permission.

We would also like to acknowledge the work of Markus Brosch and colleagues at the Sanger Centre, Hinxton, UK, who first applied Percolator to Mascot results and developed a wrapper application called Mascot Percolator.

There are a number of relevant publications:

Käll, L., et al., Semi-supervised learning for peptide identification from shotgun proteomics datasets, Nature Methods 4 923-925 (2007)
Käll, L., et al., Posterior error probabilities and false discovery rates: Two sides of the same coin, Journal of Proteome Research 7 40-44 (2008)
Käll, L., et al., Assigning significance to peptides identified by tandem mass spectrometry using decoy databases, Journal of Proteome Research 7 29-34 (2008)
Käll, L., et al., Non-parametric estimation of posterior error probabilities associated with peptides identified by tandem mass spectrometry, Bioinformatics 24 I42-I48 (2008)
Brosch, M., et al., Accurate and Sensitive Peptide Identification with Mascot Percolator, Journal of Proteome Research 8 3176-3181 (2009)
Spivak, M., et al., Improvements to the Percolator Algorithm for Peptide Identification from Shotgun Proteomics Data Sets, Journal of Proteome Research 8 3737-3745 (2009)

Percolator returns p values, q values and Posterior Error Probabilities (PEPs) for each match. The q value can be thought of as the false discovery rate. If we accept all matches with q values of 0.01 or less, the false discovery rate among the accepted matches will be 1%. The PEP is the probability that an individual match is a chance event.
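
To make the q value concrete, here is a minimal sketch of one common way to estimate it from target and decoy matches. It assumes the false discovery rate at a score threshold is estimated as decoys divided by targets at or above that threshold, and that the q value of a match is the lowest such estimate at any threshold that still accepts the match. Percolator's own estimator may differ, and the function name q_values is hypothetical.

```python
def q_values(scores, is_decoy):
    """Return a q value for each match, in input order (higher score = better)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Estimated FDR when the threshold is set at each match in turn.
    fdr_at = []
    targets = decoys = 0
    for i in order:
        if is_decoy[i]:
            decoys += 1
        else:
            targets += 1
        fdr_at.append(decoys / max(targets, 1))

    # q value = minimum FDR at this threshold or any more permissive one.
    q = [0.0] * len(scores)
    running_min = float("inf")
    for pos in range(len(order) - 1, -1, -1):
        running_min = min(running_min, fdr_at[pos])
        q[order[pos]] = running_min
    return q

example_q = q_values([10, 9, 8, 7, 6], [False, False, True, False, False])
# example_q == [0.0, 0.0, 0.25, 0.25, 0.25]
```

Accepting every match with a q value of 0.01 or less then corresponds to an estimated false discovery rate of 1% among the accepted matches.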

The requirements for using Percolator to re-rank the matches from a Mascot search are:

The search must be an MS/MS search
The search must include the results from an automatic decoy database search
The search must contain at least 100 queries
At least 100 database entries must be searched

If these requirements are met, the result report will include a checkbox Show Percolator scores. When this is checked and the report re-loaded, the original Mascot scores will be replaced as follows:

Score: -10log10(PEP)
Expect value: PEP
Identity threshold score for p<0.05: 13
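
The relationship between the replacement score and the fixed threshold can be checked with a short calculation. The sketch below assumes the logarithm is base 10, which is consistent with the threshold of 13 quoted for p<0.05. The function name percolator_score is hypothetical.

```python
import math

def percolator_score(pep):
    """Convert a posterior error probability into a -10*log10 score."""
    if pep <= 0.0:
        raise ValueError("PEP must be greater than zero")
    return -10.0 * math.log10(pep)

# A PEP of 0.05 corresponds to a score of about 13.01, hence the threshold of 13.
threshold = percolator_score(0.05)
```

A smaller PEP gives a larger score, so a PEP of 0.01 maps to a score of 20.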

Features

The complete set of features that can be made available to Percolator is defined in code. You can choose a sub-set of these features using a setting in the Options section of the Mascot configuration file, mascot.dat. The default setting, as shipped, is:

PercolatorFeatures mScore,lgDScore, mrCalc, charge, dM, dMppm, absDM, absDMppm, isoDM, isoDMppm, mc, varmods, totInt, intMatchedTot,relIntMatchedTot

For complete details of Percolator configuration settings and a description of the data flow, refer to the Mascot Setup & Installation manual.
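
If you need to inspect the configured feature list from a script, the PercolatorFeatures line can be split as sketched below. This assumes the setting name is followed by whitespace and a comma-separated list in which spaces around the names are not significant. The function name is hypothetical and this is not part of Mascot.

```python
def parse_percolator_features(line):
    """Split a PercolatorFeatures line from mascot.dat into a list of feature names."""
    parts = line.split(None, 1)  # setting name, then the rest of the line
    if not parts or parts[0] != "PercolatorFeatures":
        raise ValueError("not a PercolatorFeatures line")
    value = parts[1] if len(parts) > 1 else ""
    return [name.strip() for name in value.split(",") if name.strip()]

default_line = ("PercolatorFeatures mScore,lgDScore, mrCalc, charge, dM, dMppm, "
                "absDM, absDMppm, isoDM, isoDMppm, mc, varmods, totInt, "
                "intMatchedTot,relIntMatchedTot")
features = parse_percolator_features(default_line)
```

For the default setting shown above, this yields 15 feature names, starting with mScore.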

List of features available to Percolator
Feature name	Description
retentionTime	Retention time in seconds if available
dM	Calculated minus observed peptide mass in Da
mScore	Mascot score (always on)
lgDScore	Mascot score minus Mascot score of next best non-isobaric peptide hit
mrCalc	Calculated Mr
charge	Charge
dMppm	Calculated minus observed peptide mass in ppm
absDM	Absolute value of calculated minus observed peptide mass in Da
absDMppm	Absolute value of calculated minus observed peptide mass in ppm
isoDM	Calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in Da
isoDMppm	Calculated minus observed peptide mass, after eliminating possible isotope errors up to 2 Da, in ppm
mc	Number of missed cleavages (always 0 if no enzyme)
varmods	Number of modified sites divided by number of modifiable sites
varmodsCount	The number of distinct variable mods used in the peptide. That is, if there are 10 Met and 5 of these are oxidised, this counts as 1. A peptide with Met-OX, phosphoS, deamidation, and acetylation would count as 4.
modifiable	Total number of modifiable sites
modified	Total number of modified residues and termini
totInt	Log total ion intensity. All features use only the 20 most intense peaks in each 100 Da bin, and totInt is calculated from these peaks
intMatchedTot	Log total matched ion intensity
relIntMatchedTot	Total matched ion intensity divided by total ion intensity as a percentage (no logs involved)
fragDeltaMed	Median value of all matched fragment errors in Da
fragDeltaIqr	Interquartile range value of all matched fragment errors in Da
fragDeltaMedPPM	Median value of all matched fragment errors in ppm
fragDeltaIqrPPM	Interquartile range value of all matched fragment errors in ppm
fragDeltaPolyFit	2nd order polynomial fit to m/z vs delta. Result is RSquared multiplied by the number of points divided by 100
longest	Longest sequence of matched ions, reported separately for each ion series (backbone only), as with fracIonsMatched
fracIonsMatched	Fraction of calculated ions matched, reported separately for each ion series, with NLs lumped together (e.g. fracIonsMatchedB1, fracIonsMatchedB1deriv, fracIonsMatchedB2, fracIonsMatchedB2deriv)
matchedIntensity	Matched ion intensity, reported separately for each ion series, as with fracIonsMatched
qmatch	The number of peptide matches for which an MS/MS match was attempted
peptide	The peptide string that was matched
proteins	A tab-separated list of accessions of proteins that contain this peptide. Must be the last feature in the list
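
As an illustration of how the mass error features in the table relate to one another, the sketch below derives them from a calculated and an observed peptide mass. Several details are assumptions rather than Mascot's documented behaviour: ppm values are taken relative to the calculated mass, the isotope spacing is taken as 1.00335 Da (the approximate 13C/12C mass difference), and the isotope correction simply keeps the smallest residual after removing up to two spacings in either direction. The function name is hypothetical.

```python
ISOTOPE_SPACING = 1.00335  # assumed: approximate 13C - 12C mass difference in Da

def mass_error_features(mr_calc, mr_obs, max_isotope_error=2):
    """Return the dM family of features for one peptide match."""
    dm = mr_calc - mr_obs  # calculated minus observed, in Da

    # Try each isotope offset and keep the residual closest to zero.
    candidates = (dm - k * ISOTOPE_SPACING
                  for k in range(-max_isotope_error, max_isotope_error + 1))
    iso_dm = min(candidates, key=abs)

    to_ppm = 1e6 / mr_calc  # assumed: ppm relative to the calculated mass
    return {
        "dM": dm,
        "dMppm": dm * to_ppm,
        "absDM": abs(dm),
        "absDMppm": abs(dm) * to_ppm,
        "isoDM": iso_dm,
        "isoDMppm": iso_dm * to_ppm,
    }

# Precursor picked one isotope peak too high, plus a 1 mDa measurement error.
example = mass_error_features(1000.0, 1001.00435)
```

In this example dM is about -1.004 Da, while isoDM is about -0.001 Da, which shows why the isotope-corrected features can be more informative when the wrong isotope peak was selected.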

One feature is treated differently from the others: retention time. If retention time is included in the peak list, so that it is available in the Mascot result file, it can be used as a feature by comparing the experimental RT values with values calculated by Percolator. To enable this:

The peak list must supply retention time information using the MGF RTINSECONDS parameter. It is not sufficient to have the information embedded in the scan title string
retentionTime must be listed in the PercolatorFeatures line in mascot.dat
In the Options section of mascot.dat, set PercolatorUseRT to 1 to turn this feature on by default. Otherwise, add the argument percolate_rt=1 to the report URL
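
Before enabling retention time as a feature, it can be useful to confirm that every spectrum in the peak list carries an RTINSECONDS line. The sketch below scans MGF text for spectra that lack one. The function name and the sample data are hypothetical, and only the standard BEGIN IONS / END IONS block structure is assumed.

```python
def spectra_missing_rt(mgf_text):
    """Return the titles of spectra that have no RTINSECONDS parameter."""
    missing = []
    in_ions, title, has_rt = False, None, False
    for raw in mgf_text.splitlines():
        line = raw.strip()
        if line == "BEGIN IONS":
            in_ions, title, has_rt = True, None, False
        elif line == "END IONS" and in_ions:
            if not has_rt:
                missing.append(title)
            in_ions = False
        elif in_ions and line.upper().startswith("RTINSECONDS="):
            has_rt = True
        elif in_ions and line.upper().startswith("TITLE="):
            title = line[len("TITLE="):]
    return missing

sample_mgf = """BEGIN IONS
TITLE=scan 1
PEPMASS=500.25
CHARGE=2+
RTINSECONDS=120.0
100.1 10
200.2 20
END IONS
BEGIN IONS
TITLE=scan 2 rt=130.5
PEPMASS=600.30
CHARGE=2+
100.1 10
END IONS
"""

missing = spectra_missing_rt(sample_mgf)
```

The second spectrum is reported as missing because its retention time appears only in the title string, which is not sufficient.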

Application Notes

Percolator will usually give a worthwhile improvement in sensitivity. There are occasions when it can fail. For example, if there are very few good matches in the search results, it may not have enough positive examples to work with.

If there are multiple, high scoring matches to a single query, the current approach is to submit only the first rank match to Percolator. The other matches to the same query are then re-scored by pro-rating the new score for the rank 1 match. Thus, if there were matches to multiple peptides which differed only in (say) I and L, all of which had the same Mascot score, they would still have the same score after re-ranking with Percolator. Similarly, if the top 3 matches had Mascot scores of 60, 50, and 40, and Percolator re-scored the rank 1 match to 54, the rank 2 and 3 matches would be re-scored to 45 and 36. This avoids anomalies, but it is not ideal. If we accept that the weighted vector of features is doing a good job of re-ranking matches from different queries, it is only logical to re-rank the alternative matches to a single query. This would allow a rank 2 match to be promoted over the rank 1 match, which cannot happen using the current approach.
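
The pro-rating arithmetic in the example above can be written out as follows. This is a sketch of the calculation as described in the text, not Mascot's code, and the function name prorate is hypothetical.

```python
def prorate(mascot_scores, new_rank1_score):
    """Scale all matches to one query by the ratio applied to the rank 1 match.

    mascot_scores is ordered by rank, so mascot_scores[0] is the rank 1 score.
    """
    top = mascot_scores[0]
    if top == 0:
        raise ValueError("rank 1 score must be non-zero")
    return [s * new_rank1_score / top for s in mascot_scores]

rescored = prorate([60, 50, 40], 54)
# rescored == [54.0, 45.0, 36.0]
```

Because every match to the query is multiplied by the same factor, the order of the matches cannot change, which is why a rank 2 match can never be promoted over the rank 1 match under this scheme.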