Dataset Information

Consolidating metabolite identifiers to enable contextual and multi-platform metabolomics data analysis.

ABSTRACT:

Background

Analysis of data from high-throughput experiments depends on the availability of well-structured data that describe the assayed biomolecules. Procedures for obtaining and organizing such meta-data on genes, transcripts and proteins have been streamlined in many data analysis packages, but are still lacking for metabolites. Chemical identifiers are notoriously incoherent, encompassing a wide range of different referencing schemes with varying scope and coverage. Online chemical databases use multiple types of identifiers in parallel but lack a common primary key for reliable database consolidation. Connecting identifiers of analytes found in experimental data with the identifiers of their parent metabolites in public databases can therefore be very laborious.

Results

Here we present a strategy and a software tool for integrating metabolite identifiers from local reference libraries and public databases that do not depend on a single common primary identifier. The program constructs groups of interconnected identifiers of analytes and metabolites to obtain a local, metabolite-centric SQLite database. The resulting database can be used to map in-house identifiers and synonyms to external resources such as the KEGG database. New identifiers can be imported and integrated directly with existing data. Queries can be run flexibly, both from the command line and from the statistical programming environment R, to obtain identifier mappings tailored to a given data set.
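
The grouping idea described above can be illustrated with a minimal Python sketch. It is not MetMask's actual code, schema or interface. It assumes that each source record is a set of (identifier type, value) pairs, that any two records sharing a pair refer to the same metabolite, and that a single three-column table is enough to hold the result. The function names, the table layout and the in-house identifiers A0012 and A0040 are hypothetical, chosen only for illustration.

```python
import sqlite3
from collections import defaultdict


def group_identifiers(records):
    """Merge records that share any (id_type, value) pair, using union-find.

    No identifier type acts as a primary key: two records are linked as soon
    as they have one identifier in common, and links are transitive.
    """
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[max(ri, rj)] = min(ri, rj)

    first_seen = {}  # identifier pair -> index of the first record holding it
    for idx, rec in enumerate(records):
        for pair in rec:
            if pair in first_seen:
                union(idx, first_seen[pair])
            else:
                first_seen[pair] = idx

    groups = defaultdict(set)
    for idx, rec in enumerate(records):
        groups[find(idx)].update(rec)
    return list(groups.values())


def build_database(groups):
    """Store one row per identifier, tagged with its group number (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE identifier (group_id INTEGER, id_type TEXT, value TEXT)"
    )
    for gid, members in enumerate(groups):
        conn.executemany(
            "INSERT INTO identifier VALUES (?, ?, ?)",
            [(gid, id_type, value) for id_type, value in sorted(members)],
        )
    return conn


def map_identifier(conn, from_type, value, to_type):
    """Return all identifiers of to_type in the same group as the given one."""
    rows = conn.execute(
        "SELECT b.value FROM identifier a "
        "JOIN identifier b ON a.group_id = b.group_id "
        "WHERE a.id_type = ? AND a.value = ? AND b.id_type = ? "
        "ORDER BY b.value",
        (from_type, value, to_type),
    )
    return [row[0] for row in rows]


# Illustrative records: an in-house library entry, two public-database
# entries for glycine, and an unrelated in-house entry for glucose.
records = [
    {("inhouse", "A0012"), ("cas", "56-40-6")},
    {("cas", "56-40-6"), ("kegg", "C00037")},
    {("kegg", "C00037"), ("synonym", "glycine")},
    {("inhouse", "A0040"), ("cas", "50-99-7")},
]
groups = group_identifiers(records)
conn = build_database(groups)
print(map_identifier(conn, "inhouse", "A0012", "kegg"))  # ['C00037']
```

In this sketch the in-house identifier A0012 reaches its KEGG identifier only through a shared CAS number, which is the kind of indirect link that a single-primary-key design would miss. A real tool would also need to handle ambiguous or conflicting identifiers, which this example ignores.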

Conclusions

Efficient cross-referencing of metabolite identifiers is a key technology for metabolomics data analysis. We provide a practical and flexible solution to this task and an open-source program, the metabolite masking tool (MetMask), available at http://metmask.sourceforge.net, that implements our ideas.

SUBMITTER: Redestig H

PROVIDER: S-EPMC2879285 | biostudies-literature | 2010 Apr

REPOSITORIES: biostudies-literature

Publications

Consolidating metabolite identifiers to enable contextual and multi-platform metabolomics data analysis.

Redestig H, Kusano M, Fukushima A, Matsuda F, Saito K, Arita M

BMC Bioinformatics, 2010-04-29

PMID: 20426876

Similar Datasets

Project description:
Introduction: The American Hospital Association (AHA) has hospital-level data, while the Centers for Medicare & Medicaid Services (CMS) has patient-level data. Merging these with other distinct databases would permit analyses of hospital-based specialties, units, or departments, and patient outcomes. One distinct database is the National Emergency Department Inventory (NEDI), which contains information about all EDs in the United States. However, a challenge with merging these databases is that NEDI lists all US EDs individually, while the AHA and CMS group some EDs by hospital network. Consolidating data for this merge may be preferential to excluding grouped EDs. Our objectives were to consolidate ED data to enable linkage with administrative datasets and to determine the effect of excluding grouped EDs on ED-level summary results.
Methods: Using the 2014 NEDI-USA database, we surveyed all New England EDs. We individually matched NEDI EDs with corresponding EDs in the AHA and CMS. A "group match" was assigned when more than one NEDI ED was matched to a single AHA or CMS facility identification number. Within each group, we consolidated individual ED data to create a single observation based on sums or weighted averages of responses as appropriate.
Results: Of the 195 EDs in New England, 169 (87%) completed the NEDI survey. Among these, 130 (77%) EDs were individually listed in AHA and CMS, while 39 were part of groups consisting of 2-3 EDs but represented by one facility ID. Compared to the individually listed EDs, the 39 EDs included in a "group match" had a larger number of annual visits and beds, were more likely to be freestanding, and were less likely to be rural (all P<0.05). Two grouped EDs were excluded because the listed ED did not respond to the NEDI survey; the remaining 37 EDs were consolidated into 19 observations. Thus, the consolidated dataset contained 149 observations representing 171 EDs; this consolidated dataset yielded summary results that were similar to those of the 169 responding EDs.
Conclusion: Excluding grouped EDs would have resulted in a non-representative dataset. The original vs consolidated NEDI datasets yielded similar results and enabled linkage with large administrative datasets. This approach presents a novel opportunity to use characteristics of hospital-based specialties, units, and departments in studies of patient-level outcomes, to advance health services research.

Project description:
Objective: To interrogate the pathogenesis of intrauterine growth restriction (IUGR) and apply Artificial Intelligence (AI) techniques to multi-platform, i.e. nuclear magnetic resonance (NMR) spectroscopy and mass spectrometry (MS) based, metabolomic analysis for the prediction of IUGR.
Materials and methods: MS and NMR based metabolomic analyses were performed on cord blood serum from 40 IUGR (birth weight < 10th percentile) cases and 40 controls. Three variable selection algorithms, namely correlation-based feature selection (CFS), partial least squares regression (PLS) and learning vector quantization (LVQ), were tested for their diagnostic performance. For each selected set of metabolites, and for the panel of metabolites common to all three selection algorithms (the so-called overlapping set, OL), support vector machine (SVM) models were developed, with parameter selection performed using 10-fold cross-validation. Area under the receiver operating characteristics curve (AUC), sensitivity and specificity values were calculated for IUGR diagnosis. Metabolite set enrichment analysis (MSEA) was performed to identify which metabolic pathways were perturbed as a direct result of IUGR in cord blood serum.
Results: All selected metabolites and their overlapping set achieved statistically significant accuracies in the range of 0.78-0.82 for their optimized SVM models. The model utilizing all metabolites in the dataset had an AUC = 0.91 with a sensitivity of 0.83 and specificity equal to 0.80. CFS and OL (Creatinine, C2, C4, lysoPC.a.C16.1, lysoPC.a.C20.3, lysoPC.a.C28.1, PC.aa.C24.0) showed the highest performance with sensitivity (0.87) and specificity (0.87), respectively. MSEA revealed significantly altered metabolic pathways in IUGR cases. Dysregulated pathways include: beta oxidation of very long fatty acids, oxidation of branched chain fatty acids, phospholipid biosynthesis, lysine degradation, urea cycle and fatty acid metabolism.
Conclusion: A systematically selected panel of metabolites was shown to accurately detect IUGR in newborn cord blood serum. Significant disturbance of hepatic function and energy generating pathways were found in IUGR cases.

Project description:
Introduction: Accuracy of feature annotation and metabolite identification in biological samples is a key element in metabolomics research. However, the annotation process is often hampered by the lack of spectral reference data in experimental conditions, as well as logistical difficulties in the spectral data management and exchange of annotations between laboratories.
Objectives: To design an open-source infrastructure allowing hosting both nuclear magnetic resonance (NMR) and mass spectra (MS), with an ergonomic Web interface and Web services to support metabolite annotation and laboratory data management.
Methods: We developed the PeakForest infrastructure, an open-source Java tool with automatic programming interfaces that can be deployed locally to organize spectral data for metabolome annotation in laboratories. Standardized operating procedures and formats were included to ensure data quality and interoperability, in line with international recommendations and FAIR principles.
Results: PeakForest is able to capture and store experimental spectral MS and NMR metadata as well as collect and display signal annotations. This modular system provides a structured database with inbuilt tools to curate information, browse and reuse spectral information in data treatment. PeakForest offers data formalization and centralization at the laboratory level, facilitating shared spectral data across laboratories and integration into public databases.
Conclusion: PeakForest is a comprehensive resource which addresses a technical bottleneck, namely large-scale spectral data annotation and metabolite identification for metabolomics laboratories with multiple instruments. PeakForest databases can be used in conjunction with bespoke data analysis pipelines in the Galaxy environment, offering the opportunity to meet the evolving needs of metabolomics research. Developed and tested by the French metabolomics community, PeakForest is freely available at https://github.com/peakforest.
