Dataset Information

Retip: Retention Time Prediction for Compound Annotation in Untargeted Metabolomics.

ABSTRACT: Unidentified peaks remain a major problem in untargeted metabolomics by LC-MS/MS. Confidence in peak annotations increases by combining MS/MS matching and retention time. We here show how retention times can be predicted from molecular structures. Two large, publicly available data sets were used for model training in machine learning: the Fiehn hydrophilic interaction liquid chromatography data set (HILIC) of 981 primary metabolites and biogenic amines, and the RIKEN plant specialized metabolome annotation (PlaSMA) database of 852 secondary metabolites that uses reversed-phase liquid chromatography (RPLC). Five different machine learning algorithms have been integrated into the Retip R package: the random forest, Bayesian-regularized neural network, XGBoost, light gradient-boosting machine (LightGBM), and Keras algorithms for building the retention time prediction models. A complete workflow for retention time prediction was developed in R. It can be freely downloaded from the GitHub repository (https://www.retip.app). Keras outperformed other machine learning algorithms in the test set with minimum overfitting, verified by small error differences between training, test, and validation sets. Keras yielded a mean absolute error of 0.78 min for HILIC and 0.57 min for RPLC. Retip is integrated into the mass spectrometry software tools MS-DIAL and MS-FINDER, allowing a complete compound annotation workflow. In a test application on mouse blood plasma samples, we found a 68% reduction in the number of candidate structures when searching all isomers in MS-FINDER compound identification software. Retention time prediction increases the identification rate in liquid chromatography and subsequently leads to an improved biological interpretation of metabolomics data.
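
The two quantities the abstract reports, a mean absolute error and a reduction in candidate structures, can be illustrated with a minimal Python sketch. This is not Retip's code (Retip is an R package); the function names, the retention times, and the choice of a tolerance window of twice the mean absolute error are all assumptions made for illustration only.

```python
from statistics import mean


def mean_absolute_error(observed, predicted):
    """Average absolute difference between observed and predicted retention times (min)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return mean(abs(o - p) for o, p in zip(observed, predicted))


def filter_candidates(candidates, observed_rt, tolerance):
    """Keep candidate names whose predicted retention time is within +/- tolerance."""
    return [name for name, rt in candidates if abs(rt - observed_rt) <= tolerance]


# Made-up retention times in minutes (not taken from the HILIC or PlaSMA data sets).
observed = [1.0, 2.0, 4.0]
predicted = [1.5, 2.0, 3.0]
mae = mean_absolute_error(observed, predicted)

# Hypothetical isomer candidates for one peak, each with a predicted retention time.
candidates = [
    ("isomer_A", 3.1),
    ("isomer_B", 5.4),
    ("isomer_C", 3.9),
    ("isomer_D", 7.2),
]

# Assumed tolerance: twice the mean absolute error of the model.
kept = filter_candidates(candidates, observed_rt=3.5, tolerance=2 * mae)
print(f"MAE = {mae} min; kept {len(kept)} of {len(candidates)} candidates: {kept}")
```

With these illustrative numbers, two of the four hypothetical isomers fall outside the window and are discarded, which is the kind of candidate reduction the abstract describes.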

SUBMITTER: Bonini P

PROVIDER: S-EPMC8715951 | biostudies-literature | 2020 Jun

REPOSITORIES: biostudies-literature

Publications

Retip: Retention Time Prediction for Compound Annotation in Untargeted Metabolomics.

Bonini P, Kind T, Tsugawa H, Barupal DK, Fiehn O

Analytical Chemistry, 2020-05-21, issue 11

PMID: 32390414

Similar Datasets

Project description: Retention time information is used for metabolite annotation in metabolomic experiments. However, its usefulness is hindered by the limited availability of experimental retention time data in metabolomic databases, and by the lack of reproducibility between different chromatographic methods. Accurate prediction of retention time for a given chromatographic method would be a valuable support for metabolite annotation. We have trained state-of-the-art machine learning regressors using the 80,038 experimental retention times from the METLIN Small Molecule Retention Time (SMRT) dataset. The models included deep neural networks, deep kernel learning, several gradient boosting models, and a blending approach. 5,666 molecular descriptors and 2,214 fingerprints (MACCS166, Extended Connectivity, and Path fingerprints) were generated with the alvaDesc software. The models were trained using only the descriptors, only the fingerprints, and both types of features simultaneously. Bayesian hyperparameter search was used for parameter tuning. To avoid data leakage when reporting the performance metrics, nested cross-validation was employed. The best results were obtained by a heavily regularized deep neural network trained with cosine annealing warm restarts and stochastic weight averaging, achieving mean and median absolute errors of [Formula: see text] and [Formula: see text], respectively. To the best of our knowledge, these are the most accurate predictions published to date on the SMRT dataset. To project retention times between chromatographic methods, a novel Bayesian meta-learning approach that can learn from just a few molecules is proposed. By applying this projection between the deep neural network retention time predictions and a given chromatographic method, our approach can be integrated into a metabolite annotation workflow to obtain z-scores for the candidate annotations. To this end, it is enough that as few as 10 molecules of a given experiment have been identified (probably by using pure metabolite standards). The use of z-scores permits considering the uncertainty in the projection, and not only its accuracy, when ranking candidates. In this scenario, our results show that in 68% of the cases the correct molecule was among the top three candidates filtered by mass and ranked according to z-scores. This shows the usefulness of this information to support metabolite annotation. Python code is available on GitHub at https://github.com/constantino-garcia/cmmrt.
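
The z-score ranking described above can be sketched as follows. This is not the cmmrt implementation: it uses the textbook z-score, (observed - projected mean) / projected standard deviation, and assumes candidates are ranked by smallest absolute z-score. The function names, candidate names, and all numbers are hypothetical.

```python
def z_score(observed_rt, projected_mean, projected_std):
    """Standard z-score of an observed retention time under a projected prediction."""
    if projected_std <= 0:
        raise ValueError("projected_std must be positive")
    return (observed_rt - projected_mean) / projected_std


def rank_candidates(observed_rt, candidates):
    """Sort (name, mean, std) candidates by increasing absolute z-score."""
    scored = [
        (name, abs(z_score(observed_rt, mean_rt, std_rt)))
        for name, mean_rt, std_rt in candidates
    ]
    return sorted(scored, key=lambda item: item[1])


# Hypothetical candidates sharing one mass: (name, projected RT in min, uncertainty in min).
candidates = [
    ("cand_A", 4.5, 0.125),  # closest projection, but very confident
    ("cand_B", 6.0, 2.0),    # farther away, but highly uncertain
    ("cand_C", 8.0, 1.0),
]

ranking = rank_candidates(observed_rt=5.0, candidates=candidates)
for name, score in ranking:
    print(f"{name}: |z| = {score}")
```

In this made-up case cand_A has the smallest absolute error (0.5 min) yet ranks last, because its projection is confident enough that a 0.5 min miss is implausible. That is the sense in which z-scores account for uncertainty and not only accuracy.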

Project description: Untargeted approaches, and thus the biological interpretation of metabolomics results, are still hampered by the difficulty of reliably assigning the global metabolome as well as classifying and (putatively) identifying metabolites. In this work we present a liquid chromatography-mass spectrometry (LC-MS)-based stable isotope assisted approach that combines global metabolome and tracer-based isotope labeling for improved characterization of (unknown) metabolites and their classification into tracer-derived submetabolomes. To this end, wheat plants were cultivated in a customized growth chamber, which was kept at 400 ± 50 ppm 13CO2 to produce highly enriched, uniformly 13C-labeled sample material. Additionally, native plants were grown in the greenhouse and treated with either 13C9-labeled phenylalanine (Phe) or 13C11-labeled tryptophan (Trp) to study their metabolism and biochemical pathways. After sample preparation, liquid chromatography-high resolution mass spectrometry (LC-HRMS) analysis, and automated data evaluation, the results of the global metabolome and tracer-labeling approaches were combined. A total of 1,729 plant metabolites were detected, of which 122 and 58 metabolites account for the Phe- and Trp-derived submetabolomes, respectively. Besides m/z and retention time, the total number of carbon atoms as well as the number in the incorporated tracer moieties were obtained for the detected metabolite ions. With this information at hand, characterization of unknown compounds was improved, as the additional knowledge from the tracer approaches considerably reduced the number of plausible sum formulas and structures of the detected metabolites. Finally, the number of putative structural formulas was further reduced by isotope-assisted annotation of tandem mass spectrometry (MS/MS)-derived product ion spectra of the detected metabolites. A major innovation of this paper is the classification of the metabolites into submetabolomes, which turned out to be valuable information for effective filtering of database hits based on characteristic structural subparts. This allows the generation of a final list of true plant metabolites, which can be characterized at different levels of specificity.
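
How a known carbon count narrows the list of plausible sum formulas can be sketched in Python. This is an illustrative assumption, not the authors' software: it estimates the carbon count from the m/z shift between the native ion and its fully 13C-labeled counterpart, using the 13C/12C mass difference of about 1.003355 Da, and then keeps only formulas with that many carbons. The function names, m/z values, and candidate formulas are hypothetical.

```python
import re

C13_C12_MASS_DIFF = 1.003355  # Da, mass difference between 13C and 12C


def carbon_count_from_shift(mz_native, mz_labeled, charge=1):
    """Estimate the number of carbons from the native vs. fully 13C-labeled m/z shift."""
    if charge == 0:
        raise ValueError("charge must be non-zero")
    return round((mz_labeled - mz_native) * abs(charge) / C13_C12_MASS_DIFF)


def carbons_in_formula(formula):
    """Count carbon atoms in a sum formula such as 'C9H11NO2' (skips Cl, Ca, ...)."""
    match = re.search(r"C(?![a-z])(\d*)", formula)
    if match is None:
        return 0
    return int(match.group(1) or 1)


def filter_by_carbon_count(formulas, n_carbons):
    """Keep only the sum formulas with exactly n_carbons carbon atoms."""
    return [f for f in formulas if carbons_in_formula(f) == n_carbons]


# Illustrative ion pair: native at m/z 166.0863, fully labeled at m/z 175.1165.
n_carbons = carbon_count_from_shift(166.0863, 175.1165)

# Hypothetical database hits for the same peak.
hits = ["C9H11NO2", "C8H11N3O", "C6H15NO4"]
plausible = filter_by_carbon_count(hits, n_carbons)
print(f"{n_carbons} carbons -> plausible formulas: {plausible}")
```

The same filter could be applied a second time with the number of tracer-derived carbons, but that step is left out here to keep the sketch short.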
