Dataset Information

Empowering large chemical knowledge bases for exposomics: PubChemLite meets MetFrag.

ABSTRACT: Compound (or chemical) databases are an invaluable resource for many scientific disciplines. Exposomics researchers need to find and identify relevant chemicals that cover the entirety of potential (chemical and other) exposures over entire lifetimes. This daunting task, with over 100 million chemicals in the largest chemical databases, coupled with broadly acknowledged knowledge gaps in these resources, leaves researchers faced with too much-yet not enough-information at the same time to perform comprehensive exposomics research. Furthermore, the improvements in analytical technologies and computational mass spectrometry workflows coupled with the rapid growth in databases and increasing demand for high throughput "big data" services from the research community present significant challenges for both data hosts and workflow developers. This article explores how to reduce candidate search spaces in non-target small molecule identification workflows, while increasing content usability in the context of environmental and exposomics analyses, so as to profit from the increasing size and information content of large compound databases, while increasing efficiency at the same time. In this article, these methods are explored using PubChem, the NORMAN Network Suspect List Exchange and the in silico fragmentation approach MetFrag. A subset of the PubChem database relevant for exposomics, PubChemLite, is presented as a database resource that can be (and has been) integrated into current workflows for high resolution mass spectrometry. Benchmarking datasets from earlier publications are used to show how experimental knowledge and existing datasets can be used to detect and fill gaps in compound databases to progressively improve large resources such as PubChem, and topic-specific subsets such as PubChemLite. PubChemLite is a living collection, updating as annotation content in PubChem is updated, and exported to allow direct integration into existing workflows such as MetFrag. The source code and files necessary to recreate or adjust this are jointly hosted between the research parties (see data availability statement). This effort shows that enhancing the FAIRness (Findability, Accessibility, Interoperability and Reusability) of open resources can mutually enhance several resources for whole community benefit. The authors explicitly welcome additional community input on ideas for future developments.

SUBMITTER: Schymanski EL

PROVIDER: S-EPMC7938590 | biostudies-literature | 2021 Mar

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Empowering large chemical knowledge bases for exposomics: PubChemLite meets MetFrag.

Schymanski Emma L EL Kondić Todor T Neumann Steffen S Thiessen Paul A PA Zhang Jian J Bolton Evan E EE

Journal of cheminformatics 20210308 1

Compound (or chemical) databases are an invaluable resource for many scientific disciplines. Exposomics researchers need to find and identify relevant chemicals that cover the entirety of potential (chemical and other) exposures over entire lifetimes. This daunting task, with over 100 million chemicals in the largest chemical databases, coupled with broadly acknowledged knowledge gaps in these resources, leaves researchers faced with too much-yet not enough-information at the same time to perfor ...[more]

PMID: 33685519

Similar Datasets

Project description:IntroductionThe International Methodology Consortium for Coded Health Information (IMeCCHI) is a collaboration of health services researchers who promote methodological advances in coded health information. The IMeCCHI-DATANETWORK initiative focuses on developing a multi-purpose distributed data infrastructure and common data model (CDM) to enable cross-border data sharing and international comparisons.MethodsIMeCCHI consortium partners from six different countries - Canada, Denmark, Italy, New Zealand, South Korea, and Switzerland - used a questionnaire to describe their original databases which differ in size, structure, content and coding systems. To standardize these data, they agreed on a CDM and mapped their population-based databases to meet the CDM specifications. At the end of this process, local data had a more homogenous content and structure, which made them syntactically and semantically interoperable. Data transformation was performed using a common data management software called TheMatrix.ResultsThe CDM encompasses four tables of structured data (person characteristics, hospitalizations, outpatient prescription medication and death), linked at the individual level through a person identifier. It can be used to answer research questions across countries using locally converted databases, which facilitates study replication in a distributed fashion. As a proof-of-concept study, an initial research question was addressed using an agreed protocol. Local data were transformed in csv files in the CDM structure and TheMatrix was tested to transform the standardized data from each partner into local analytical datasets. This allowed results to be shared between countries, whilst maintaining local control over each region's data.ConclusionThe IMeCCHI-DATANETWORK, a model of a distributed data network, demonstrated that it is feasible to analyze international data using standardized analytical methods that enable independent analyses by regions, without relocating datasets thereby protecting local confidentiality obligations. The distributed data infrastructure can produce results that can be generalized to several countries, while facilitating cross-border data sharing and international comparisons.KeywordsCommon data model, international comparison, cross-border data sharing, interoperability, observational data.

Dataset Information

Empowering large chemical knowledge bases for exposomics: PubChemLite meets MetFrag.

Publications

Empowering large chemical knowledge bases for exposomics: PubChemLite meets MetFrag.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets