Project description:Deep learning has achieved a notable success in mass spectrometry-based proteomics and is now emerging in glycoproteomics. While various deep learning models can predict fragment mass spectra of peptides with good accuracy, they cannot cope with the non-linear glycan structure in an intact glycopeptide. Herein, we propose a deep learning-based approach for the prediction of fragment spectra of intact glycopeptides. Our model adopts tree-structured long-short term memory networks to process the glycan moiety and a graph neural network architecture to incorporate potential fragmentation pathways of a specific glycan structure. This feature is beneficial to model explainability and differentiation ability of glycan structural isomers. We further demonstrated that predicted spectral libraries can be used for analyzing DIA data of glycopeptides as a supplement for library completeness. We expect that this work will provide a valuable deep learning resource for glycoproteomics.
Project description:In this study, we extended the ProteomeTools peptide library (PROPEL, see PXD004732 and PXD010595) to train a deep neural network resulting in chromatographic retention time and fragment ion intensity predictions for (tryptic) peptides that exceed the quality of the experimental data.
Project description:Chemical derivatization is a powerful strategy to enhance sensitivity and selectivity of liquid chromatography-mass spectrometry for non-targeted analysis of chemicals in complex mixtures. However, it remains impossible to obtain large sets of reference spectra for chemically derived molecules (CDMs), representing a major barrier in real-world applications. Herein, we describe a deep learning approach that enables accurate prediction of electrospray ionization tandem mass spectra for CDMs (DeepCDM). DeepCDM is established by transfer learning from a generic spectrum predicting model using a small set of experimentally acquired tandem mass spectra of CDMs, which converts a generic model with low predictability for CDMs into a specialized model with high predictability. We demonstrate DeepCDM by predicting electrospray ionization tandem mass spectra of dansylated molecules. The success in establishing Dns-MS further enables the development of DnsBank, a dansylation-specialized in silico spectral library. DnsBank achieves significant increases of accurate annotation rates of dansylated molecules, facilitating discovery of new hazardous pollutants from an environmental study of leather industrial wastewater. DeepCDM is also highly versatile for other classes of CDMs. Therefore, we envision that DeepCDM will pave a way for high-throughput identification of CDMs in non-targeted analysis to dig unknowns with potential health impacts from emerging anthropogenic chemicals.
Project description:"The ProteomeTools project aims to derive molecular and digital tools from the human proteome to facilitate biomedical and life science research. Here, we describe the second iteration of the generation and multimodal LC-MS/MS analysis of >220,000 synthetic tryptic peptides representing nearly all canonical human gene products. This resource will be extended to 1.4 million peptides and all data will be made available to the public in ProteomicsDB."
Project description:SummaryThis manuscript describes an open-source program, DrawGlycan-SNFG (version 2), that accepts IUPAC (International Union of Pure & Applied Chemist)-condensed inputs to render Symbol Nomenclature For Glycans (SNFG) drawings. A wide range of local and global options enable display of various glycan/peptide modifications including bond breakages, adducts, repeat structures, ambiguous identifications, etc. These facilities make DrawGlycan-SNFG ideal for integration into various glycoinformatics software, including glycomics and glycoproteomics mass spectrometry applications. As a demonstration of such usage, we incorporated DrawGlycan-SNFG into gpAnnotate, a standalone application to score and annotate individual MS/MS glycopeptide spectrum in different fragmentation modes.Availability and implementationDrawGlycan-SNFG and gpAnnotate are platform independent. While originally coded using MATLAB, compiled packages are also provided to enable DrawGlycan-SNFG implementation in Python and Java. All programs are available from https://virtualglycome.org/drawglycan; https://virtualglycome.org/gpAnnotate.Supplementary informationSupplementary Material are available at Bioinformatics online.
Project description:Spectral similarity calculation is widely used in protein identification tools and mass spectra clustering algorithms while comparing theoretical or experimental spectra. The performance of the spectral similarity calculation plays an important role in these tools and algorithms especially in the analysis of large-scale datasets. Recently, deep learning methods have been proposed to improve the performance of clustering algorithms and protein identification by training the algorithms with existing data and the use of multiple spectra and identified peptide features. While the efficiency of these algorithms is still under study in comparison with traditional approaches, their application in proteomics data analysis is becoming more common. Here, we propose the use of deep learning to improve spectral similarity comparison. We assessed the performance of deep learning for spectral similarity, with GLEAMS and a newly trained embedder model (DLEAMSE), which uses high-quality spectra from PRIDE Cluster. Also, we developed a new bioinformatics tool (mslookup - https://github.com/bigbio/DLEAMSE/) that allows users to quickly search for spectra in previously identified mass spectra publish in public repositories and spectral libraries. Finally, we released a human database to enable bioinformaticians and biologists to search for identified spectra in their machines. SIGNIFICANCE STATEMENT: Spectral similarity calculation plays an important role in proteomics data analysis. With deep learning's ability to learn the implicit and effective features from large-scale training datasets, deep learning-based MS/MS spectra embedding models has emerged as a solution to improve mass spectral clustering similarity calculation algorithms. We compare multiple similarity scoring and deep learning methods in terms of accuracy (compute the similarity for a pair of the mass spectrum) and computing-time performance. The benchmark results showed no major differences in accuracy between DLEAMSE and normalized dot product for spectrum similarity calculations. The DLEAMSE GPU implementation is faster than NDP in preprocessing on the GPU server and the similarity calculation of DLEAMSE (Euclidean distance on 32-D vectors) takes about 1/3 of dot product calculations. The deep learning model (DLEAMSE) encoding and embedding steps needed to run once for each spectrum and the embedded 32-D points can be persisted in the repository for future comparison, which is faster for future comparisons and large-scale data. Based on these, we proposed a new tool mslookup that enables the researcher to find spectra previously identified in public data. The tool can be also used to generate in-house databases of previously identified spectra to share with other laboratories and consortiums.
Project description:In proteomics, database search programs are routinely used for peptide identification from tandem mass spectrometry data. However, many low-quality spectra cannot be interpreted by any programs. Meanwhile, certain high-quality spectra may not be identified due to incompleteness of the database or failure of the software. Thus, spectrum quality assessment tools are helpful programs that can eliminate poor-quality spectra before the database search and highlight the high-quality spectra that are not identified in the initial search. These spectra may be valuable candidates for further analyses. We propose SPEQ: a spectrum quality assessment tool that uses a deep neural network to classify spectra into high-quality, which are worthy candidates for interpretation, and low-quality, which lack sufficient information for identification. SPEQ was compared with a few other prediction models and demonstrated improved prediction accuracy. Source code and scripts are freely available at github.com/sor8sh/SPEQ, implemented in Python. Supplementary data are available at Bioinformatics online.
Project description:The ability to predict tandem mass (MS/MS) spectra from peptide sequences can significantly enhance our understanding of the peptide fragmentation process and could improve peptide identification in proteomics. However, current approaches for predicting high-energy collisional dissociation (HCD) spectra are limited to predict the intensities of expected ion types, that is, the a/b/c/x/y/z ions and their neutral loss derivatives (referred to as backbone ions). In practice, backbone ions only account for <70% of total ion intensities in HCD spectra, indicating many intense ions are ignored by current predictors. In this paper, we present a deep learning approach that can predict the complete spectra (both backbone and nonbackbone ions) directly from peptide sequences. We made no assumptions or expectations on which kind of ions to predict but instead predicting the intensities for all possible m/z. Training this model needs no annotations of fragment ion nor any prior knowledge of the fragmentation rules. Our analyses show that the predicted 2+ and 3+ HCD spectra are highly similar to the experimental spectra, with average full-spectrum cosine similarities of 0.820 (±0.088) and 0.786 (±0.085), respectively, very close to the similarities between the experimental replicated spectra. In contrast, the best-performed backbone only models can only achieve an average similarity below 0.75 and 0.70 for 2+ and 3+ spectra, respectively. Furthermore, we developed a multitask learning (MTL) approach for predicting spectra of insufficient training samples, which allows our model to make accurate predictions for electron transfer dissociation (ETD) spectra and HCD spectra of less abundant charges (1+ and 4+).
Project description:Mass spectrometry data is one of the key sources of information in many workflows in medicine and across the life sciences. Mass fragmentation spectra are generally considered to be characteristic signatures of the chemical compound they originate from, yet the chemical structure itself usually cannot be easily deduced from the spectrum. Often, spectral similarity measures are used as a proxy for structural similarity but this approach is strongly limited by a generally poor correlation between both metrics. Here, we propose MS2DeepScore: a novel Siamese neural network to predict the structural similarity between two chemical structures solely based on their MS/MS fragmentation spectra. Using a cleaned dataset of > 100,000 mass spectra of about 15,000 unique known compounds, we trained MS2DeepScore to predict structural similarity scores for spectrum pairs with high accuracy. In addition, sampling different model varieties through Monte-Carlo Dropout is used to further improve the predictions and assess the model's prediction uncertainty. On 3600 spectra of 500 unseen compounds, MS2DeepScore is able to identify highly-reliable structural matches and to predict Tanimoto scores for pairs of molecules based on their fragment spectra with a root mean squared error of about 0.15. Furthermore, the prediction uncertainty estimate can be used to select a subset of predictions with a root mean squared error of about 0.1. Furthermore, we demonstrate that MS2DeepScore outperforms classical spectral similarity measures in retrieving chemically related compound pairs from large mass spectral datasets, thereby illustrating its potential for spectral library matching. Finally, MS2DeepScore can also be used to create chemically meaningful mass spectral embeddings that could be used to cluster large numbers of spectra. Added to the recently introduced unsupervised Spec2Vec metric, we believe that machine learning-supported mass spectral similarity measures have great potential for a range of metabolomics data processing pipelines.
Project description:The majority of tandem mass spectrometry (MS/MS) spectra in untargeted metabolomics and exposomics studies lack any annotation. Our deep learning framework, Integrated Data Science Laboratory for Metabolomics and Exposomics-Mass INTerpreter (IDSL_MINT) can translate MS/MS spectra into molecular fingerprint descriptors. IDSL_MINT allows users to leverage the power of the transformer model for mass spectrometry data, similar to the large language models. Models are trained on user-provided reference MS/MS libraries via any customizable molecular fingerprint descriptors. IDSL_MINT was benchmarked using the LipidMaps database and improved the annotation rate of a test study for MS/MS spectra that were not originally annotated using existing mass spectral libraries. IDSL_MINT may improve the overall annotation rates in untargeted metabolomics and exposomics studies. The IDSL_MINT framework and tutorials are available in the GitHub repository at https://github.com/idslme/IDSL_MINT .Scientific contribution statement.Structural annotation of MS/MS spectra from untargeted metabolomics and exposomics datasets is a major bottleneck in gaining new biological insights. Machine learning models to convert spectra into molecular fingerprints can help in the annotation process. Here, we present IDSL_MINT, a new, easy-to-use and customizable deep-learning framework to train and utilize new models to predict molecular fingerprints from spectra for the compound annotation workflows.