Dataset Information

Mapping between databases of compounds and protein targets.

ABSTRACT: Databases that provide links between bioactive compounds and their protein targets are increasingly important in drug discovery and chemical biology. They join the expanding universes of cheminformatics via chemical structures on the one hand and bioinformatics via sequences on the other. However, it is difficult to assess the relative utility of databases without the explicit comparison of content. We have exemplified an approach to this by comparing resources that each has a different focus on bioactive chemistry (ChEMBL, DrugBank, Human Metabolome Database, and Therapeutic Target Database) both at the chemical structure and protein levels. We compared the compound sets at different representational stringencies using NCI/CADD Structure Identifiers. The overlap and uniqueness in chemical content can be broadly interpreted in the context of different data capture strategies. However, we recorded apparent anomalies, such as many compounds-in-common between the metabolite and drug databases. We also compared the content of sequences mapped to the compounds via their UniProt protein identifiers. While these were also generally interpretable in the context of individual databases we discerned differences in coverage and the types of supporting data used. For example, the target concept is applied differently between DrugBank and the Therapeutic Target Database. In ChEMBL it encompasses a broader range of mappings from chemical biology and species orthologue cross-screening in addition to drug targets per se. Our analysis should assist users not only in exploiting the synergies between these four high-value resources but also in assessing the utility of other databases at the interface of chemistry and biology.
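The set comparison described in the abstract (content shared between sources versus content unique to one source) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `overlap_report` is hypothetical, and the identifiers are made-up placeholders standing in for normalized structure identifiers or UniProt accessions.

```python
from itertools import combinations


def overlap_report(id_sets):
    """Count pairwise shared identifiers and identifiers unique to each source.

    id_sets maps a source name to a set of already-normalized identifiers.
    """
    # Size of the intersection for every unordered pair of sources.
    pairwise = {
        (a, b): len(id_sets[a] & id_sets[b])
        for a, b in combinations(sorted(id_sets), 2)
    }
    # Identifiers found in one source and in none of the others.
    unique = {}
    for name, ids in id_sets.items():
        others = set().union(*(s for n, s in id_sets.items() if n != name))
        unique[name] = len(ids - others)
    return pairwise, unique


# Toy data only: placeholder identifiers, not real database content.
toy = {
    "ChEMBL": {"id1", "id2", "id3"},
    "DrugBank": {"id2", "id3", "id4"},
    "HMDB": {"id3", "id5"},
    "TTD": {"id2", "id6"},
}
pairwise, unique = overlap_report(toy)
```

Running the same function on identifier sets generated at different stringencies (an assumption about how one might mirror the comparison) would show how the overlap changes with the strictness of the structure representation.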

SUBMITTER: Muresan S

PROVIDER: S-EPMC7449375 | biostudies-literature | 2012

REPOSITORIES: biostudies-literature


Publications

Mapping between databases of compounds and protein targets.

Muresan S, Sitzmann M, Southan C

Methods in molecular biology (Clifton, N.J.), 2012-01-01


PMID: 22821596

Similar Datasets

Project description: Since 2004 public cheminformatic databases and their collective functionality for exploring relationships between compounds, protein sequences, literature and assay data have advanced dramatically. In parallel, commercial sources that extract and curate such relationships from journals and patents have also been expanding. This work updates a previous comparative study of databases chosen because of their bioactive content, availability of downloads and facility to select informative subsets. Where they could be calculated, extracted compounds-per-journal article were in the range of 12 to 19 but compound-per-protein counts increased with document numbers. Chemical structure filtration to facilitate standardised comparisons typically reduced source counts by between 5% and 30%. The pair-wise overlaps between 23 databases and subsets were determined, as well as changes between 2006 and 2008. While all compound sets have increased, PubChem has doubled to 14.2 million. The 2008 comparison matrix shows not only overlap but also unique content across all sources. Many of the detailed differences could be attributed to individual strategies for data selection and extraction. While there was a big increase in patent-derived structures entering PubChem since 2006, GVKBIO contains over 0.8 million unique structures from this source. Venn diagrams showed extensive overlap between compounds extracted by independent expert curation from journals by GVKBIO, WOMBAT (both commercial) and BindingDB (public) but each included unique content. In contrast, the approved drug collections from GVKBIO, MDDR (commercial) and DrugBank (public) showed surprisingly low overlap. Aggregating all commercial sources established that while 1 million compounds overlapped with PubChem, 1.2 million did not. On the basis of chemical structure content per se, public sources have covered an increasing proportion of commercial databases over the last two years. However, commercial products included in this study provide links between compounds and information from patents and journals at a larger scale than current public efforts. They also continue to capture a significant proportion of unique content. Our results thus demonstrate not only an encouraging overall expansion of data-supported bioactive chemical space but also that both commercial and public sources are complementary for its exploration.

Project description: Background: Understanding living systems is crucial for curing diseases. To achieve this task we have to understand biological networks based on protein-protein interactions. Bioinformatics has come up with a large number of databases and tools that support analysts in exploring protein-protein interactions on an integrated level for knowledge discovery. They provide predictions and correlations, indicate possibilities for future experimental research and fill the gaps to complete the picture of biochemical processes. There are numerous and huge databases of protein-protein interactions used to gain insights into answering some of the many questions of systems biology. Many computational resources integrate interaction data with additional information on molecular background. However, the vast number of diverse bioinformatics resources poses an obstacle to the goal of understanding. We present a survey of databases that enable the visual analysis of protein networks. Results: We selected M=10 out of N=53 resources supporting visualization, and we tested them against the following set of criteria: interoperability, data integration, quantity of possible interactions, data visualization quality and data coverage. The study reveals differences in usability, visualization features and quality as well as the quantity of interactions. StringDB is the recommended first choice. CPDB presents a comprehensive dataset and IntAct lets the user change the network layout. A comprehensive comparison table is available via the web. The supplementary table can be accessed at http://tinyurl.com/PPI-DB-Comparison-2015. Conclusions: Only some web resources featuring graph visualization can be successfully applied to interactive visual analysis of protein-protein interactions. Study results underline the necessity for further enhancements of visualization integration in biochemical analysis tools. Identified challenges are data comprehensiveness, confidence, interactive features and visualization maturity.

