Dataset Information

Systematic Errors in Peptide and Protein Identification and Quantification by Modified Peptides.

ABSTRACT: The principle of shotgun proteomics is to use peptide mass spectra in order to identify corresponding sequences in a protein database. The quality of peptide and protein identification and quantification critically depends on the sensitivity and specificity of this assignment process. Many peptides in proteomic samples carry biochemical modifications, and a large fraction of unassigned spectra arise from modified peptides. Spectra derived from modified peptides can erroneously be assigned to wrong amino acid sequences. However, the impact of this problem on proteomic data has not yet been investigated systematically. Here we use combinations of different database searches to show that modified peptides can be responsible for 20-50% of false positive identifications in deep proteomic data sets. These false positive hits are particularly problematic as they have significantly higher scores and higher intensities than other false positive matches. Furthermore, these wrong peptide assignments lead to hundreds of false protein identifications and systematic biases in protein quantification. We devise a "cleaned search" strategy to address this problem and show that this considerably improves the sensitivity and specificity of proteomic data. In summary, we show that modified peptides cause systematic errors in peptide and protein identification and quantification and should therefore be considered to further improve the quality of proteomic data annotation.

SUBMITTER: Bogdanow B

PROVIDER: S-EPMC4974352 | biostudies-literature | 2016 Aug

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Systematic Errors in Peptide and Protein Identification and Quantification by Modified Peptides.

Bogdanow Boris B Zauber Henrik H Selbach Matthias M

Molecular & cellular proteomics : MCP 20160523 8

The principle of shotgun proteomics is to use peptide mass spectra in order to identify corresponding sequences in a protein database. The quality of peptide and protein identification and quantification critically depends on the sensitivity and specificity of this assignment process. Many peptides in proteomic samples carry biochemical modifications, and a large fraction of unassigned spectra arise from modified peptides. Spectra derived from modified peptides can erroneously be assigned to wro ...[more]

PMID: 27215553

Similar Datasets

Project description:BackgroundMass spectrometry based quantification of peptides can be performed using the iTRAQ reagent in conjunction with mass spectrometry. This technology yields information about the relative abundance of single peptides. A method for the calculation of reliable quantification information is required in order to obtain biologically relevant data at the protein expression level.ResultsA method comprising sound error estimation and statistical methods is presented that allows precise abundance analysis plus error calculation at the peptide as well as at the protein level. This yields the relevant information that is required for quantitative proteomics. Comparing the performance of our method named Quant with existing approaches the error estimation is reliable and offers information for precise bioinformatic models. Quant is shown to generate results that are consistent with those produced by ProQuant, thus validating both systems. Moreover, the results are consistent with that of Mascot 2.2. The MATLAB scripts of Quant are freely available via http://www.protein-ms.de and http://sourceforge.net/projects/protms/, each under the GNU Lesser General Public License.ConclusionThe software Quant demonstrates improvements in protein quantification using iTRAQ. Precise quantification data can be obtained at the protein level when using error propagation and adequate visualization. Quant integrates both and additionally provides the possibility to obtain more reliable results by calculation of wise quality measures. Peak area integration has been replaced by sum of intensities, yielding more reliable quantification results. Additionally, Quant allows the combination of quantitative information obtained by iTRAQ with peptide and protein identifications from popular tandem MS identification tools. Hence Quant is a useful tool for the proteomics community and may help improving analysis of proteomic experimental data. In addition, we have shown that a lognormal distribution fits the data of mass spectrometry based relative peptide quantification.

Project description:While the importance of random sequencing errors decreases at higher DNA or RNA sequencing depths, systematic sequencing errors (SSEs) dominate at high sequencing depths and can be difficult to distinguish from biological variants. These SSEs can cause base quality scores to underestimate the probability of error at certain genomic positions, resulting in false positive variant calls, particularly in mixtures such as samples with RNA editing, tumors, circulating tumor cells, bacteria, mitochondrial heteroplasmy, or pooled DNA. Most algorithms proposed for correction of SSEs require a training data set, which is typically either from a part of the data set being M-bM-^@M-^\recalibratedM-bM-^@M-^] (Genome Analysis ToolKit, or GATK) or from a separate data set with special characteristics (SysCall). Here, we combine the advantages of these approaches by adding synthetic RNA spike-in standards to human RNA, and use GATK to recalibrate base quality scores with reads mapped to the spike-in standards. Compared to conventional GATK recalibration that uses reads mapped to the genome, spike-ins improve the accuracy of Illumina base quality scores by a mean of 5 units, and by as much as 13 units M-BM- at CpG sites. In addition, since reads mapping to the genome are not used for recalibration, our method allows run-specific recalibration even for the many species without a comprehensive and accurate SNP database. We also use GATK with the spike-in standards to demonstrate that the Illumina RNA sequencing runs overestimate quality scores for AC, CC, GC, GG, and TC dinucleotides, while SOLiD has less dinucleotide SSEs but more SSEs for certain cycles. We conclude that using these DNA and RNA spike-in standards with GATK improves base quality score recalibration. Four human RNA samples with equimolar ERCC spike-in standards were sequenced on Illumina. Two human brain/liver/muscle RNA mixtures with dynamic range of ERCC spike-in standards were sequenced on SOLiD.

Dataset Information

Systematic Errors in Peptide and Protein Identification and Quantification by Modified Peptides.

Publications

Systematic Errors in Peptide and Protein Identification and Quantification by Modified Peptides.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets