Matrix Science

Quantitation: Statistical procedures

Usually, identification and quantitation are performed at the peptide level. The Mascot result report assigns the peptide matches to protein hits, and the ratios for individual peptide matches are combined to determine ratios for the protein hits. The methods provided for calculating a protein ratio from a set of peptide ratios are median, average, or weighted average. The standard deviation of the peptide ratios provides a measure of the uncertainty in the protein ratio.

Since we are dealing with ratios, the average is the geometric mean and the standard deviation is the geometric standard deviation, which is a factor. In other words, the confidence interval is obtained by dividing and multiplying the average by the standard deviation, which is never less than 1.0. For example, if the average is 0.91 and SD(geo) is 1.06 then the confidence interval is 0.86 to 0.96.
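The arithmetic above can be sketched in a few lines of Python. This is only an illustration, not Mascot's code: the function names are hypothetical, natural logarithms are used, and the sample (N-1) form of the standard deviation is an assumption.

```python
import math

def geometric_mean_and_sd(ratios):
    """Geometric mean and geometric SD of a list of positive ratios."""
    logs = [math.log(r) for r in ratios]
    n = len(logs)
    mean_log = sum(logs) / n
    # Assumption: sample variance with N-1 in the denominator.
    var_log = sum((v - mean_log) ** 2 for v in logs) / (n - 1)
    return math.exp(mean_log), math.exp(math.sqrt(var_log))

def interval(average, sd_geo):
    """Divide and multiply the average by the geometric SD."""
    return average / sd_geo, average * sd_geo

# The worked example from the text: average 0.91, SD(geo) 1.06.
low, high = interval(0.91, 1.06)
print(round(low, 2), round(high, 2))  # 0.86 0.96
```

Because the geometric SD is the exponential of a non-negative number, it can never be less than 1.0, which is why the interval is built by division and multiplication rather than subtraction and addition.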

Ratios for peptide matches are only reported if various quality criteria are fulfilled, the most important being:

  • Peptide modification state
  • Minimum precursor charge (default 1)
  • Strength of the peptide match, defined in terms of either a minimum score, a maximum expect value, or the score being at or above either the identity threshold or the homology threshold (default maximum expect of 0.05)
  • Method-specific criteria, such as a minimum number of fragment ion pairs for multiplex

A ratio for a protein hit is only reported if the minimum number of peptide matches is achieved (default 2). If the ratios for the peptide matches are not consistent with a sample from a normal distribution, the SD(geo) value is displayed in italics, and should be treated with caution.

Testing for normality

Testing for outliers and reporting a standard deviation for the protein ratio both rely on the peptide ratios being consistent with a sample from a normal distribution (in log space). If the peptide ratios do not appear to be from a normal distribution, this may indicate that the values are meaningless, and that something went systematically wrong with the analysis. On the other hand, it may indicate something interesting, for example that the peptides have been mis-assigned and actually come from two proteins with very different ratios, so that the distribution is bimodal.

Shapiro-Wilk W test

In the Shapiro-Wilk W test, the null hypothesis is that the sample is taken from a normal distribution. This hypothesis is rejected if the p-value for the test statistic W is less than 0.05. The routine used is valid for sample sizes between 3 and 2000.

References:

  1. Royston, J. P., An Extension of Shapiro and Wilk's W Test for Normality to Large Samples, Applied Statistics 31 115-124 (1982)
  2. Royston, P., Remark AS R94: A Remark on Algorithm AS 181: The W-test for Normality, Applied Statistics 44 547-551 (1995)

Outlier removal

The available methods for testing and removing outliers are none, auto, dixons, grubbs, and rosners. Choosing auto means that Dixon's method will be used if the number of values is between 4 and 25, while Rosner's method will be used if the number of values is greater than 25. If the ratios for the peptide matches are not consistent with a sample from a normal distribution, the SD(geo) value is displayed in italics, and outlier removal is skipped.
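The auto rule can be written as a small selection function. This is a sketch only: the function name is hypothetical, and returning none for fewer than 4 values is an assumption, since the text does not say what auto does in that case.

```python
def choose_outlier_method(setting, n_values):
    """Resolve the outlier setting to the method actually applied."""
    if setting != "auto":
        return setting          # none, dixons, grubbs or rosners as given
    if 4 <= n_values <= 25:
        return "dixons"
    if n_values > 25:
        return "rosners"
    # Assumption: too few values to test, so nothing is removed.
    return "none"

print(choose_outlier_method("auto", 12))   # dixons
print(choose_outlier_method("auto", 40))   # rosners
```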

Any statistician will advise of the dangers of blindly removing outliers. The general advice is to analyze the data both with and without the outlier(s) and see if the conclusions are qualitatively different.

Dixon's method

Dixon's r11 test, also referred to as N9, is used to detect and remove a single outlier at a time from either the upper or lower extreme of the range. Critical values for a significance level 0.05 are used, as tabulated by Verma and Quiroz-Ruiz. The test is applicable to between 4 and 100 values. Each time a value is removed, the test is repeated.
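As an illustration of the procedure, the sketch below computes the r11 statistic for both extremes and removes values one at a time. It is not Mascot's implementation. The `critical_value` argument is a hypothetical caller-supplied function that returns the tabulated critical value for a given sample size; no table values are reproduced here, and the constant used in the example is a placeholder, not a value from Verma and Quiroz-Ruiz. Values are assumed to be log ratios.

```python
def dixon_r11(sorted_values):
    """Return (low_stat, high_stat) for values sorted in ascending order."""
    x = sorted_values
    n = len(x)
    low_den = x[n - 2] - x[0]    # range excluding the largest value
    high_den = x[n - 1] - x[1]   # range excluding the smallest value
    low = (x[1] - x[0]) / low_den if low_den > 0 else 0.0
    high = (x[n - 1] - x[n - 2]) / high_den if high_den > 0 else 0.0
    return low, high

def remove_outliers_dixon(values, critical_value):
    """Remove one extreme value at a time while r11 exceeds the critical value."""
    kept = sorted(values)
    removed = []
    while 4 <= len(kept) <= 100:
        low, high = dixon_r11(kept)
        if max(low, high) <= critical_value(len(kept)):
            break
        # Drop whichever extreme gave the larger statistic, then retest.
        removed.append(kept.pop(0) if low > high else kept.pop())
    return kept, removed

# Placeholder critical value (NOT a tabulated one), for demonstration only.
kept, removed = remove_outliers_dixon(
    [-0.1, -0.05, 0.0, 0.05, 0.1, 3.0], lambda n: 0.7)
print(removed)  # [3.0]
```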

References:

  1. Dixon, W. J., Processing Data for Outliers, Biometrics 9 74-89 (1953)
  2. Verma, S. P. and Quiroz-Ruiz, A., Critical values for six Dixon tests for outliers in normal samples up to sizes 100, and applications in science and engineering, Revista Mexicana de Ciencias Geológicas, 23 133-161 (2006)

Grubbs' method

Grubbs' method is used to detect and remove a single outlier at a time from either the upper or lower extreme of the range. Critical values for a significance level 0.05 are used, as tabulated by Verma and Quiroz-Ruiz (Table A1 for discordancy test N1). The test is applicable to between 3 and 100 values. Each time a value is removed, the test is repeated.
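A minimal sketch of the iteration follows. It is not Mascot's implementation: the statistic is taken as the largest absolute deviation from the mean divided by the sample standard deviation, which is an assumption about how the two extremes are compared. `critical_value` is a hypothetical caller-supplied lookup by sample size, and the constant in the example is a placeholder rather than a tabulated value.

```python
import math

def grubbs_statistic(values):
    """Return (index, G) for the value furthest from the mean."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    if sd == 0:
        return None, 0.0         # all values identical, nothing to test
    idx = max(range(n), key=lambda i: abs(values[i] - mean))
    return idx, abs(values[idx] - mean) / sd

def remove_outliers_grubbs(values, critical_value):
    """Remove one value at a time while G exceeds the critical value."""
    kept = list(values)
    removed = []
    while 3 <= len(kept) <= 100:
        idx, g = grubbs_statistic(kept)
        if idx is None or g <= critical_value(len(kept)):
            break
        removed.append(kept.pop(idx))
    return kept, removed

# Placeholder critical value (NOT a tabulated one), for demonstration only.
kept, removed = remove_outliers_grubbs(
    [0.0, 0.1, -0.1, 0.05, -0.05, 3.0], lambda n: 1.5)
print(removed)  # [3.0]
```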

References:

  1. Grubbs, F. E., Procedures for Detecting Outlying Observations in Samples, Technometrics 11 1-21 (1969)
  2. Verma, S. P. and Quiroz-Ruiz, A., Critical values for 22 discordancy test variants for outliers in normal samples up to sizes 100, and applications in science and engineering, Revista Mexicana de Ciencias Geológicas, 23 302-319 (2006)

Rosner's method

Rosner's method will detect and remove multiple outliers in a single pass. Critical values for a significance level 0.05 are used. The test will remove up to 10 outliers from a sample of at least 25 values.
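The sketch below follows the generalized ESD procedure from the Rosner reference: compute a statistic for the most extreme value, set it aside, repeat up to the maximum number of candidates, and then declare as outliers the first i candidates, where i is the largest index whose statistic exceeds its critical value. It is not Mascot's implementation. `critical_value` is a hypothetical caller-supplied function of the step number, and the constant in the example is a placeholder, not a real percentage point.

```python
import math

def esd_statistics(values, max_outliers=10):
    """Return [(candidate_value, R_i), ...] for successive extreme values."""
    work = list(values)
    stats = []
    for _ in range(min(max_outliers, len(work) - 2)):
        n = len(work)
        mean = sum(work) / n
        sd = math.sqrt(sum((v - mean) ** 2 for v in work) / (n - 1))
        if sd == 0:
            break
        idx = max(range(n), key=lambda i: abs(work[i] - mean))
        stats.append((work[idx], abs(work[idx] - mean) / sd))
        work.pop(idx)
    return stats

def rosner_outliers(values, critical_value, max_outliers=10):
    """Values flagged as outliers; empty if fewer than 25 values."""
    if len(values) < 25:
        return []
    stats = esd_statistics(values, max_outliers)
    count = 0
    for i, (_, r) in enumerate(stats, start=1):
        if r > critical_value(i):
            count = i            # keep the largest step that is significant
    return [v for v, _ in stats[:count]]

# 28 well-behaved log ratios plus two extreme ones.
sample = [0.1, -0.1] * 14 + [5.0, 4.0]
# Placeholder critical value (NOT a real percentage point).
print(rosner_outliers(sample, lambda i: 3.0))  # [5.0, 4.0]
```

Taking the largest significant step, rather than stopping at the first non-significant one, is what lets the procedure find several outliers in one pass even when one extreme value masks another.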

References:

  1. Rosner, B., Percentage Points for a Generalized ESD Many-Outlier Procedure, Technometrics 25 165-172 (1983)

Protein ratio calculation

The three methods of deriving a protein hit ratio from a set of peptide ratios are median, average, and weighted.
  • Median: The median peptide ratio is selected to represent the protein ratio. If there is an even number of peptide ratios, the geometric mean of the median pair is used.
  • Average: The protein ratio is the geometric average of the peptide ratios.
  • Weighted: For each component, the intensity values of the set of peptides are summed and the protein ratio(s) calculated from the summed values. This gives a weighted average, which will be the best measure if the accuracy is limited by counting statistics. When a weighted average is reported, the standard deviation is the weighted standard deviation. This is the geometric standard deviation divided by the square root of the effective base.
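The three calculations can be sketched as follows. The function names are hypothetical, and the weighted case is shown for a single ratio between two components, with each peptide contributing one intensity per component.

```python
import math

def median_ratio(ratios):
    """Median ratio; geometric mean of the middle pair if the count is even."""
    s = sorted(ratios)
    mid = len(s) // 2
    if len(s) % 2:
        return s[mid]
    return math.sqrt(s[mid - 1] * s[mid])

def average_ratio(ratios):
    """Geometric average of the peptide ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def weighted_ratio(numerator_intensities, denominator_intensities):
    """Ratio of summed intensities, so intense peptides carry more weight."""
    return sum(numerator_intensities) / sum(denominator_intensities)

# Two peptides: a strong one with ratio 2.0 and a weak one with ratio 1.0.
num = [100.0, 10.0]
den = [50.0, 10.0]
ratios = [n / d for n, d in zip(num, den)]
print(median_ratio(ratios))    # geometric mean of 1.0 and 2.0
print(average_ratio(ratios))   # same value here, since there are two ratios
print(weighted_ratio(num, den))  # 110 / 60, pulled towards the strong peptide
```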

Significant changes

A protein ratio is reported in bold face if it is significantly different from unity. The comparison test is:

\[ |x - \mu| \le t \cdot \frac{s}{\sqrt{N}} \]

If this inequality is true, then there is no significant difference at the stated confidence level. (N is the number of peptide ratios, s is the standard deviation and x the mean of the peptide ratios, both calculated in log space. The true value of the ratio, µ, is 0 in log space. t is Student's t for N-1 degrees of freedom and a two-sided confidence level of 95%.)
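A minimal sketch of this comparison, which checks |x - µ| against t·s/√N in log space, is shown below. The function name is hypothetical and the critical value of t is passed in by the caller rather than computed. Natural logarithms are used, but the outcome does not depend on the base because both sides of the comparison scale by the same factor.

```python
import math

def is_significantly_different(ratios, t_critical, mu=0.0):
    """True if the mean log ratio differs from mu at the chosen level."""
    logs = [math.log(r) for r in ratios]
    n = len(logs)
    x = sum(logs) / n
    s = math.sqrt(sum((v - x) ** 2 for v in logs) / (n - 1))
    return abs(x - mu) > t_critical * s / math.sqrt(n)

# Student's t for 4 degrees of freedom, two-sided 95%, is 2.776.
T_4DF = 2.776
print(is_significantly_different([1.9, 2.0, 2.1, 2.05, 1.95], T_4DF))  # True
print(is_significantly_different([0.5, 2.0, 1.0, 0.8, 1.25], T_4DF))   # False
```

The first set of ratios is tightly clustered around 2, so the protein ratio would be shown in bold. The second set is scattered symmetrically around 1, so it would not.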

Data normalisation strongly influences which protein ratios are shown in bold. If a large number of values are bold, this is likely to indicate that the comparison to unity has no meaning. If the ratios for the peptide matches are not consistent with a sample from a normal distribution, the SD(geo) value is displayed in italics, and the significance test is omitted.

 
 
Copyright © 2012 Matrix Science Ltd. All Rights Reserved.