Dataset Information

Mining basic active structures from a large-scale database.

ABSTRACT:

BACKGROUND: The PubChem database is a large-scale resource for chemical information, containing millions of chemical compound activities derived from high-throughput screening (HTS). The ability to extract characteristic substructures from such enormous amounts of data is steadily growing in importance. Compounds with shared basic active structures (BASs) exhibiting G-protein-coupled receptor (GPCR) activity and repeated-dose toxicity have been mined from small datasets. However, the mining process employed was not applicable to large datasets because of the large imbalance between the numbers of active and inactive compounds. In most datasets, one active compound appears for every 1000 inactive compounds, whereas most mining techniques work well only when these numbers are similar.

RESULTS: This difficulty was overcome by sampling an equal number of active and inactive compounds. The sampling process was repeated to maintain the structural diversity of the inactive compounds. An interactive KNIME workflow that enabled effective sampling and data cleaning was created. Application of the cascade model and subsequent structural refinement yielded the BAS candidates. Repeated sampling increased the ratio of active compounds containing these substructures. Three samplings were deemed adequate to identify all of the meaningful BASs. BASs with similar structures were grouped to give the final set of BASs. This method was applied to HIV integrase and protease inhibitor activities in the MDL Drug Data Report (MDDR) database and to procaspase-3 activators in the PubChem BioAssay database, yielding 14, 12, and 18 BASs, respectively.

CONCLUSIONS: The proposed mining scheme successfully extracted meaningful substructures from large datasets of chemical structures. The resulting BASs were deemed reasonable by an experienced medicinal chemist. The mining itself requires about 3 days to extract BASs with a given physiological activity. Thus, the method described herein is an effective way to analyze large HTS databases.
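The repeated balanced sampling described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' KNIME workflow or the cascade model; the function name `balanced_rounds`, its parameters, and the toy compound identifiers are all hypothetical, and the default of three rounds simply mirrors the "three samplings" reported above.

```python
import random


def balanced_rounds(actives, inactives, n_rounds=3, seed=0):
    """Build n_rounds balanced datasets (hypothetical helper, not from the paper).

    Each round keeps every active compound and pairs it with a fresh random
    draw of the same number of inactive compounds, so repeated rounds cover
    more of the structural diversity of the inactive class.
    """
    if len(actives) > len(inactives):
        raise ValueError("expected the inactive class to be the larger one")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rounds = []
    for _ in range(n_rounds):
        # Draw without replacement within a round; rounds are independent.
        drawn = rng.sample(inactives, len(actives))
        labelled = [(c, 1) for c in actives] + [(c, 0) for c in drawn]
        rounds.append(labelled)
    return rounds


# Toy data at the 1:1000 active-to-inactive ratio mentioned in the abstract.
actives = [f"active_{i}" for i in range(5)]
inactives = [f"inactive_{i}" for i in range(5000)]
rounds = balanced_rounds(actives, inactives)
```

Each element of `rounds` is a list of `(compound, label)` pairs with equal numbers of active (1) and inactive (0) entries, which could then be passed to whichever mining step is in use.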

SUBMITTER: Takada N

PROVIDER: S-EPMC3618305 | biostudies-other | 2013

REPOSITORIES: biostudies-other

Publications

Mining basic active structures from a large-scale database.

Naoto Takada, Norihito Ohmori, Takashi Okada

Journal of Cheminformatics, 2013-03-16, issue 1

PMID: 23497729

Similar Datasets

Large-scale database mining reveals hidden trends and future directions for cancer immunotherapy.
S-EPMC5993505 | biostudies-literature

Development of a Large-Scale Chemogenomics Database to Improve Drug Candidate Selection
2005-06-01 | GSE2409 | GEO

Development of a Large-Scale Chemogenomics Database to Improve Drug Candidate Selection
2005-05-31 | E-GEOD-2409 | biostudies-arrayexpress

Transcription forms and remodels supercoiling domains unfolding large scale chromatin structures
2013-02-17 | E-GEOD-43451 | biostudies-arrayexpress

A global database of large-scale transverse drainages.
S-EPMC6369416 | biostudies-literature

Analysis of active and inactive X chromosome architecture reveals the independent organization of 30-nm and large scale chromatin structures
2010-08-27 | GSE23818 | GEO

Transcription forms and remodels supercoiling domains unfolding large scale chromatin structures
2013-02-17 | GSE43451 | GEO

MiRonTop: mining microRNAs targets across large scale gene expression studies.
S-EPMC2995122 | biostudies-literature

Uncovering Capgras delusion using a large-scale medical records database.
S-EPMC5541249 | biostudies-other

BCSearch: fast structural fragment mining over large collections of protein structures.
S-EPMC4489267 | biostudies-literature