Quantitation: Statistical procedures

Usually, identification and quantitation are performed at the peptide level. The Mascot result report assigns the peptide matches to protein hits, and the ratios for individual peptide matches are combined to determine ratios for the protein hits. The methods provided for calculating a protein ratio from a set of peptide ratios are median, average, or weighted average. The standard deviation of the peptide ratios provides a measure of the uncertainty in the protein ratio.

Since we are dealing with ratios, the average is the geometric mean and the standard deviation is the geometric standard deviation, which is a multiplicative factor rather than an additive offset. In other words, the confidence interval is obtained by dividing and multiplying the average by the standard deviation, which is never less than 1.0. For example, if the average is 0.91 and SD(geo) is 1.06 then the confidence interval is 0.86 to 0.96.
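The arithmetic above can be sketched in a few lines of Python. This is an illustration only, not Mascot's code: the function names are hypothetical, and the use of the sample standard deviation (N-1 in the denominator) is an assumption.

```python
import math

def geometric_mean(ratios):
    """Geometric mean: exponential of the mean of the log ratios."""
    logs = [math.log(r) for r in ratios]
    return math.exp(sum(logs) / len(logs))

def geometric_sd(ratios):
    """Geometric SD: exponential of the SD of the log ratios.

    Assumption: sample SD with N-1 in the denominator.
    The result is a factor and is never less than 1.0.
    """
    logs = [math.log(r) for r in ratios]
    mean = sum(logs) / len(logs)
    variance = sum((v - mean) ** 2 for v in logs) / (len(logs) - 1)
    return math.exp(math.sqrt(variance))

def interval(average, sd_geo):
    """Divide and multiply the average by the geometric SD."""
    return average / sd_geo, average * sd_geo

# The worked example from the text: average 0.91, SD(geo) 1.06
low, high = interval(0.91, 1.06)
```

Rounded to two decimal places, `low` and `high` reproduce the 0.86 to 0.96 interval quoted above.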

Ratios for peptide matches are only reported if various quality criteria are fulfilled, the most important being:

Peptide modification state
Minimum precursor charge (default 1)
Strength of the peptide match, defined in terms of either a minimum score, a maximum expect value, or the score being at or above either the identity threshold or the homology threshold (default: maximum expect value of 0.05)
Method specific criteria, such as a minimum number of fragment ion pairs for multiplex
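As a rough sketch of how the two default criteria might be applied to a list of peptide matches: the `PeptideMatch` class and its field names are hypothetical, and the modification-state and method-specific checks are left out because their details are not given here.

```python
from dataclasses import dataclass

@dataclass
class PeptideMatch:
    # Hypothetical record; field names are illustrative only
    charge: int
    expect: float
    ratio: float

def passes_quality(match, min_charge=1, max_expect=0.05):
    """Apply the default charge and expect-value criteria."""
    return match.charge >= min_charge and match.expect <= max_expect

matches = [
    PeptideMatch(charge=2, expect=0.001, ratio=0.95),
    PeptideMatch(charge=2, expect=0.20, ratio=3.10),  # fails the expect criterion
]
reported = [m for m in matches if passes_quality(m)]
```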

A ratio for a protein hit is only reported if the minimum number of peptide matches is achieved (default 2). If the ratios for the peptide matches are not consistent with a sample from a normal distribution, the SD(geo) value is displayed in italics, and should be treated with caution.

Testing for normality

Testing for outliers and reporting a standard deviation for the protein ratio relies on the peptide ratios being consistent with a sample from a normal distribution (in log space). If the peptide ratios do not appear to be from a normal distribution, this may indicate that the values are meaningless, and that something went systematically wrong with the analysis. On the other hand, it may indicate something interesting, such as peptides that have been mis-assigned and actually come from two proteins with very different ratios, so that the distribution is bimodal.

Shapiro-Wilk W test

In the Shapiro-Wilk W test, the null hypothesis is that the sample is taken from a normal distribution. This hypothesis is rejected if the P value associated with the test statistic W is less than 0.05. The routine used is valid for sample sizes between 3 and 2000.

Outlier removal

The available methods for testing and removing outliers are none, auto, dixons, grubbs, and rosners. Choosing auto means that Dixon's method will be used if the number of values is between 4 and 25, while Rosner's method will be used if the number of values is greater than 25. If the ratios for the peptide matches are not consistent with a sample from a normal distribution, the SD(geo) value is displayed in italics, and outlier removal is skipped.
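The auto rule can be written as a small selection function. This is a sketch with a hypothetical function name; returning "none" for fewer than 4 values is an assumption, since the behaviour in that case is not stated above.

```python
def choose_outlier_method(n_values):
    """Pick the outlier test that 'auto' would use for a given sample size."""
    if 4 <= n_values <= 25:
        return "dixons"
    if n_values > 25:
        return "rosners"
    # Assumption: too few values for either test, so no outlier removal
    return "none"
```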

Any statistician will advise of the dangers of blindly removing outliers. The general advice is to analyze the data both with and without the outlier(s) and see if the conclusions are qualitatively different.

Dixon's method

Dixon's r11 test, also referred to as N9, is used to detect and remove a single outlier at a time from either the upper or lower extreme of the range. Critical values for a significance level of 0.05 are used, as tabulated by Verma and Quiroz-Ruiz. The test is applicable to between 4 and 100 values. Each time a value is removed, the test is repeated.
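A minimal sketch of the repeated r11 test follows, assuming the usual definition of the statistic: the gap between the suspect value and its nearest neighbour, divided by the range excluding the value at the opposite extreme. The tabulated critical values are not reproduced here; `critical_value` is a hypothetical caller-supplied lookup from sample size to critical value, and the input is assumed to be log ratios.

```python
def dixon_r11(sorted_values):
    """Return (upper, lower) r11 statistics for an ascending list."""
    x = sorted_values
    upper_range = x[-1] - x[1]   # range excluding the smallest value
    lower_range = x[-2] - x[0]   # range excluding the largest value
    upper = (x[-1] - x[-2]) / upper_range if upper_range > 0 else 0.0
    lower = (x[1] - x[0]) / lower_range if lower_range > 0 else 0.0
    return upper, lower

def remove_outliers_dixon(values, critical_value):
    """Remove one extreme value at a time until the test no longer rejects.

    critical_value: function mapping sample size n to the critical value
    (caller-supplied; an assumption of this sketch).
    """
    x = sorted(values)
    removed = []
    while len(x) >= 4:
        upper, lower = dixon_r11(x)
        if max(upper, lower) <= critical_value(len(x)):
            break
        # Drop whichever extreme gave the larger statistic
        removed.append(x.pop() if upper >= lower else x.pop(0))
    return x, removed
```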

Grubbs' method

Grubbs' method is used to detect and remove a single outlier at a time from either the upper or lower extreme of the range. Critical values for a significance level of 0.05 are used, as tabulated by Verma and Quiroz-Ruiz (Table A1 for discordancy test N1). The test is applicable to between 3 and 100 values. Each time a value is removed, the test is repeated.
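The following sketch illustrates the repeated Grubbs test, assuming the standard statistic: the distance of the extreme value from the sample mean, in units of the sample standard deviation. As before, the tabulated critical values are not reproduced; `critical_value` is a hypothetical caller-supplied lookup, and the input is assumed to be log ratios.

```python
import statistics

def grubbs_statistics(values):
    """Return (upper, lower) Grubbs statistics for the two extremes."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return 0.0, 0.0
    return (max(values) - mean) / sd, (mean - min(values)) / sd

def remove_outliers_grubbs(values, critical_value):
    """Remove one extreme value at a time until the test no longer rejects.

    critical_value: function mapping sample size n to the critical value
    (caller-supplied; an assumption of this sketch).
    """
    x = sorted(values)
    removed = []
    while len(x) >= 3:
        upper, lower = grubbs_statistics(x)
        if max(upper, lower) <= critical_value(len(x)):
            break
        # Drop whichever extreme lies further from the mean
        removed.append(x.pop() if upper >= lower else x.pop(0))
    return x, removed
```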

Rosner's method

Rosner's method will detect and remove multiple outliers in a single pass. Critical values for a significance level of 0.05 are used. The test will remove up to 10 outliers from a sample of at least 25 values.
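A sketch of the generalized ESD procedure is shown below, under stated assumptions: at each step the value furthest from the mean is set aside and its statistic recorded, and the number of outliers is the largest step at which the statistic exceeds its critical value. The critical values are not computed here; `critical_values` is a hypothetical caller-supplied list, one entry per step, and the input is assumed to be log ratios.

```python
import statistics

def rosner_esd(values, critical_values, max_outliers=10, min_size=25):
    """Return (kept, outliers) using the generalized ESD procedure.

    critical_values[i] is the critical value for step i+1
    (caller-supplied; an assumption of this sketch).
    """
    if len(values) < min_size:
        raise ValueError("sample too small for Rosner's method")
    remaining = list(values)
    candidates = []
    n_outliers = 0
    for step, crit in enumerate(critical_values[:max_outliers], start=1):
        mean = statistics.mean(remaining)
        sd = statistics.stdev(remaining)
        if sd == 0:
            break
        # Most extreme remaining value and its test statistic
        extreme = max(remaining, key=lambda v: abs(v - mean))
        r = abs(extreme - mean) / sd
        remaining.remove(extreme)
        candidates.append(extreme)
        if r > crit:
            n_outliers = step  # largest step so far that rejects
    outliers = candidates[:n_outliers]
    kept = list(values)
    for value in outliers:
        kept.remove(value)
    return kept, outliers
```

Because the count is taken from the largest rejecting step, an outlier that is masked by a more extreme one at an earlier step is still removed.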

References:

Rosner, B., Percentage Points for a Generalized ESD Many-Outlier Procedure, Technometrics 25 165-172 (1983)

Protein ratio calculation

The three methods of deriving a protein hit ratio from a set of peptide ratios are median, average, and weighted.

Median: The median peptide ratio is selected to represent the protein ratio. If there is an even number of peptide ratios, the geometric mean of the median pair is used.
Average: The protein ratio is the geometric average of the peptide ratios.
Weighted: For each component, the intensity values of the set of peptides are summed and the protein ratio(s) calculated from the summed values. This gives a weighted average, which will be the best measure if the accuracy is limited by counting statistics. When a weighted average is reported, the standard deviation is the weighted standard deviation. This is the geometric standard deviation divided by the square root of the effective base.
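The three methods can be compared side by side in a short sketch. The function names and the two-component layout (one list of intensities for the numerator component, one for the denominator) are assumptions made for illustration.

```python
import math

def median_ratio(ratios):
    """Median ratio; geometric mean of the middle pair for an even count."""
    x = sorted(ratios)
    mid = len(x) // 2
    if len(x) % 2 == 1:
        return x[mid]
    return math.sqrt(x[mid - 1] * x[mid])

def average_ratio(ratios):
    """Geometric average of the peptide ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def weighted_ratio(numerator_intensities, denominator_intensities):
    """Sum the intensities for each component, then take the ratio."""
    return sum(numerator_intensities) / sum(denominator_intensities)

# Two peptides with ratios 100/50 = 2.0 and 10/10 = 1.0
protein_ratio = weighted_ratio([100.0, 10.0], [50.0, 10.0])
```

Summing before dividing is equivalent to averaging the individual ratios weighted by their denominator intensities, here (2.0 × 50 + 1.0 × 10) / 60, so intense peptides dominate the result.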

Significant changes

A protein ratio is reported in bold face if it is significantly different from unity. The comparison test is:

\[ |x - \mu| \le t \cdot \frac{s}{\sqrt{N}} \]

If this inequality is true, then there is no significant difference at the stated confidence level. (N is the number of peptide ratios, s is the standard deviation and x the mean of the peptide ratios, both numbers calculated in log space. The true value of the ratio, µ, is 0 in log space. t is Student's t for N-1 degrees of freedom and a two-sided confidence level of 95%.)
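A sketch of the comparison follows, assuming it is the usual one-sample t-test, |x - µ| ≤ t·s/√N, built from the symbols defined above. The function name is hypothetical, and the critical value of t is passed in by the caller because the Python standard library does not provide Student's t quantiles.

```python
import math
import statistics

def is_significant(ratios, t_critical, mu=0.0):
    """True if the mean log ratio differs significantly from mu.

    t_critical: two-sided 95% Student's t for N-1 degrees of freedom
    (caller-supplied, e.g. 3.182 for N = 4).
    """
    logs = [math.log(r) for r in ratios]
    n = len(logs)
    x = statistics.mean(logs)   # mean in log space
    s = statistics.stdev(logs)  # standard deviation in log space
    return abs(x - mu) > t_critical * s / math.sqrt(n)

# Four consistent peptide ratios near 2.0: significantly different from unity
changed = is_significant([2.0, 2.1, 1.9, 2.05], t_critical=3.182)
# Four ratios scattered around 1.0: no significant difference
unchanged = is_significant([0.5, 2.0, 1.0, 1.0], t_critical=3.182)
```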

Data normalisation strongly influences which protein ratios are shown in bold. If a large number of values are bold, this is likely to indicate that the comparison to unity has no meaning. If the ratios for the peptide matches are not consistent with a sample from a normal distribution, the SD(geo) value is displayed in italics, and the significance test is omitted.