Enhancing Top-Down Proteomics Data Analysis by Combining Deconvolution Results through a Machine Learning Strategy
Ontology highlight
ABSTRACT: Top-down mass spectrometry (MS) is a powerful tool for identification and comprehensive characterization of proteoforms arising from alternative splicing, sequence variation, and post-translational modifications. While the technique is powerful, it suffered from the complex dataset generated from top-down MS experiments, which requires sequential data processing steps for data interpretation. Deconvolution of the complex isotopic distribution that arises from naturally occurring isotopes is a critical step in the data processing process. Multiple algorithms are currently available to deconvolute top-down mass spectra; however, each algorithm generates different deconvoluted peak lists with varied accuracy comparing to true positive annotations. In this study, we have designed a machine learning strategy that can process and combine the peak lists from different deconvolution results. By optimizing clustering results, deconvolution results from THRASH, TopFD, MS-Deconv, and SNAP algorithms were combined into consensus peak lists at various thresholds using either a simple voting ensemble method or a random forest machine learning algorithm. The random forest model outperformed the single best algorithm. This machine learning strategy could enhance the accuracy and confidence in protein identification during database search by accelerating detection of true positive peaks while filtering out false positive peaks. Thus, this method showed promises in enhancing proteoform identification and characterization for high-throughput data analysis in top-down proteomics.
INSTRUMENT(S): Bruker Daltonics solarix series
ORGANISM(S): Macaca Mulatta (rhesus Macaque)
TISSUE(S): Skeletal Muscle Fiber
SUBMITTER: Zhijie Wu
LAB HEAD: Sean J McIlwain
PROVIDER: PXD018043 | Pride | 2020-05-06
REPOSITORIES: Pride
ACCESS DATA