Dataset Information

Enhancing Top-Down Proteomics Data Analysis by Combining Deconvolution Results through a Machine Learning Strategy

ABSTRACT: Top-down mass spectrometry (MS) is a powerful tool for the identification and comprehensive characterization of proteoforms arising from alternative splicing, sequence variation, and post-translational modifications. Although the technique is powerful, it suffers from the complexity of the data sets generated by top-down MS experiments, which require sequential data processing steps for interpretation. Deconvolution of the complex isotopic distributions that arise from naturally occurring isotopes is a critical step in data processing. Multiple algorithms are currently available to deconvolute top-down mass spectra; however, each algorithm generates a different deconvoluted peak list, with varied accuracy compared with true positive annotations. In this study, we designed a machine learning strategy that can process and combine the peak lists from different deconvolution results. By optimizing clustering results, deconvolution results from the THRASH, TopFD, MS-Deconv, and SNAP algorithms were combined into consensus peak lists at various thresholds using either a simple voting ensemble method or a random forest machine learning algorithm. The random forest model outperformed the single best algorithm. This machine learning strategy could enhance the accuracy and confidence of protein identification during database search by accelerating the detection of true positive peaks while filtering out false positive peaks. Thus, this method shows promise for enhancing proteoform identification and characterization in high-throughput data analysis for top-down proteomics.
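
The voting ensemble mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the clustering rule (single-linkage grouping of sorted masses within a ppm tolerance), the function name `consensus_peaks`, the parameters `tol_ppm` and `min_votes`, and the example masses are all assumptions made here for illustration. The random forest variant is not shown, because this record does not describe the features it was trained on.

```python
from typing import Dict, List, Tuple


def consensus_peaks(
    peak_lists: Dict[str, List[float]],
    tol_ppm: float = 10.0,   # hypothetical mass tolerance, not taken from the paper
    min_votes: int = 2,      # hypothetical voting threshold
) -> List[Tuple[float, int]]:
    """Group monoisotopic masses reported by several deconvolution algorithms
    and keep the groups supported by at least `min_votes` distinct algorithms.

    Returns (mean mass, number of supporting algorithms) for each kept group.
    """
    # Flatten to (mass, algorithm) pairs and sort by mass.
    tagged = sorted(
        (mass, name) for name, masses in peak_lists.items() for mass in masses
    )

    clusters: List[List[Tuple[float, str]]] = []
    for mass, name in tagged:
        # Join the current cluster if this peak lies within the tolerance of
        # the previous peak; otherwise start a new cluster.
        if clusters and (mass - clusters[-1][-1][0]) / mass * 1e6 <= tol_ppm:
            clusters[-1].append((mass, name))
        else:
            clusters.append([(mass, name)])

    consensus: List[Tuple[float, int]] = []
    for cluster in clusters:
        # One vote per algorithm, even if it reports several nearby peaks.
        votes = len({name for _, name in cluster})
        if votes >= min_votes:
            mean_mass = sum(m for m, _ in cluster) / len(cluster)
            consensus.append((mean_mass, votes))
    return consensus


# Made-up masses (Da), for illustration only.
example = {
    "THRASH": [10000.00, 15000.00],
    "TopFD": [10000.05, 20000.00],
    "MS-Deconv": [10000.02, 15000.10],
    "SNAP": [25000.00],
}
result = consensus_peaks(example, tol_ppm=10.0, min_votes=2)
```

Raising `min_votes` trades sensitivity for specificity, which corresponds to the "various thresholds" at which consensus peak lists are built.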

INSTRUMENT(S): Bruker Daltonics solarix series

ORGANISM(S): Macaca mulatta (rhesus macaque)

TISSUE(S): Skeletal Muscle Fiber

SUBMITTER: Zhijie Wu

LAB HEAD: Sean J McIlwain

PROVIDER: PXD018043 | Pride | 2020-05-06

REPOSITORIES: pride

ACCESS DATA

Dataset's files

File	Type
Database.zip	Other
LDB3_CID_20180612_Rh2426_F2-4_CID_6V_998.1mz_3width_2M_30_45.0s_100scans_000001.mzXML	mzXML
LDB3_ECD_20170616_F2-6_042917_ECD_910mz_2.5width_0.7V_20ms_1.0s_1000scans_000001.mzXML	mzXML
MLC-1F_CID_20170620_F1-12_042917_CID_870.6mz_3width_8V_2M_0.08s_300scans_000001.mzXML	mzXML
MLC-1F_ECD_20170620_F1-12_042917_ECD_870.6mz_3width_0.8V_25ms_2M_0.2s_1000scans_000001.mzXML	mzXML

Showing files 1 - 5 of 34.

Publications

Enhancing Top-Down Proteomics Data Analysis by Combining Deconvolution Results through a Machine Learning Strategy.

McIlwain SJ, Wu Z, Wetzel M, Belongia D, Jin Y, Wenger K, Ong IM, Ge Y

Journal of the American Society for Mass Spectrometry, 2020-04-08, issue 5

Top-down mass spectrometry (MS) is a powerful tool for the identification and comprehensive characterization of proteoforms arising from alternative splicing, sequence variation, and post-translational modifications. However, the complex data set generated from top-down MS experiments requires multiple sequential data processing steps to successfully interpret the data for identifying and characterizing proteoforms. One critical step is the deconvolution of the complex isotopic distribution that ...

PMID: 32223200

Similar Datasets

Project description: Objectives: Our goal was to evaluate the diagnostic value of DNA methylation analysis in combination with machine learning to differentiate pleural mesothelioma (PM) from important histopathological mimics. Material and methods: DNA methylation data of PM, lung adenocarcinomas, lung squamous cell carcinomas and chronic pleuritis were used to train a random forest as well as a support vector machine. These classifiers were validated using an independent validation cohort including pleural carcinosis and pleomorphic variants of lung adeno- and squamous cell carcinomas. Furthermore, we used a deconvolution method to estimate the composition of the tumor microenvironment. Results: T-distributed stochastic neighbor embedding clearly separated PM from lung adenocarcinomas and squamous cell carcinomas, but there was considerable overlap between chronic pleuritis specimens and PM with low tumor cell content. While both machine learning algorithms achieved comparable accuracies in a nested cross-validation on the training cohort (random forest: 94.9%; support vector machine: 95.5%), the support vector machine outperformed the random forest in distinguishing PM from chronic pleuritis. Differential methylation analysis revealed promoter hypermethylation in PM specimens, including the tumor suppressor genes BCL11B, EBF1, FOXA1, and WNK2. Furthermore, we observed comparable accuracies for the support vector machine on the validation cohort (97.1%), while the random forest performed considerably worse (89.9%). Deconvolution of the stromal and immune cell composition revealed higher rates of regulatory T-cells and endothelial cells in tumor specimens and a heterogeneous inflammation including macrophages, B-cells and natural killer cells in chronic pleuritis. Conclusion: DNA methylation in combination with machine learning is a promising tool to reliably differentiate PM from chronic pleuritis and lung cancer, including pleomorphic carcinomas. Furthermore, our study highlights new candidate genes for PM carcinogenesis and shows that deconvolution of DNA methylation data can provide reasonable insights into the composition of the tumor microenvironment.