Dataset Information

In-silico predictive mutagenicity model generation using supervised learning approaches.

ABSTRACT:

Background

Experimental screening of chemical compounds for biological activity is a time-consuming and expensive practice. In silico predictive models permit inexpensive, rapid "virtual screening" to prioritize the selection of compounds for experimental testing. Both experimental and in silico screening can be used to test compounds for desirable or undesirable properties. Prior work on prediction of mutagenicity has primarily involved identification of toxicophores rather than whole-molecule predictive models. In this work, we examined a range of in silico predictive classification models for prediction of the mutagenic properties of compounds, including methods such as J48 and SMO, which have not previously been widely applied in cheminformatics.

Results

The Bursi mutagenicity data set containing 4337 compounds (Set 1) and a Benchmark data set of 6512 compounds (Set 2) were used as input data in this work. A third data set (Set 3) was prepared by merging the previous two sets. Classification algorithms including Naïve Bayes, Random Forest, J48 and SMO, with 10-fold cross-validation and default parameters, were used for model generation on these data sets. Models built using the combined data set (Set 3) performed better than those developed from the Benchmark data set. Significantly, Random Forest outperformed the other classifiers on all data sets, especially on Set 3, with 89.27% accuracy, 89% precision and a ROC area of 95.3%. To validate the developed models, two external data sets with mutagenicity data, AID1189 and AID1194, were tested, showing 62% accuracy, 67% precision and 65% ROC area for AID1189, and 91% accuracy, 91% precision and 96.3% ROC area for AID1194. A Random Forest model was also applied to approved drugs from DrugBank and metabolites from the ZINC database, with a true positive rate of almost 85%, showing the robustness of the model.
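The figures reported above (accuracy, precision, true positive rate and ROC area) can be illustrated with a minimal sketch. This is not the authors' code: the function names are hypothetical, labels are assumed to be coded 1 for mutagenic and 0 for non-mutagenic, and the ROC area is computed with the rank-based (Mann-Whitney) formulation, which is one common way to obtain it.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for binary labels coded 1 (mutagenic) / 0 (non-mutagenic)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of compounds classified correctly."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def precision(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predicted mutagens that are true mutagens."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp)


def true_positive_rate(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of true mutagens that the model flags (recall)."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC area as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC area needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data, invented purely for illustration (not from any data set named above).
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.7, 0.2]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 cut-off is an assumption
    print(accuracy(y_true, y_pred), precision(y_true, y_pred),
          true_positive_rate(y_true, y_pred), roc_auc(y_true, scores))
```

In a 10-fold cross-validation setting, such metrics would typically be computed on the held-out fold predictions; the threshold-free ROC area uses the classifier's scores, while accuracy, precision and true positive rate depend on the chosen cut-off.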

Conclusion

We have created a new mutagenicity benchmark data set with around 8,000 compounds. Our work shows that highly accurate predictive mutagenicity models can be built using machine learning methods based on chemical descriptors and trained on this set, and that these models complement toxicophore-based methods. Further, our work supports other recent literature in showing that Random Forest models generally outperform other comparable machine learning methods for this kind of application.

SUBMITTER: Seal A

PROVIDER: S-EPMC3542175 | biostudies-literature | 2012 May

REPOSITORIES: biostudies-literature

Publications

In-silico predictive mutagenicity model generation using supervised learning approaches.

Seal Abhik, Passi Anurag, Jaleel UC Abdul, Wild David J

Journal of Cheminformatics, 15 May 2012, issue 1

PMID: 22587596

Similar Datasets

Project description: Drug-resistant strains of Mycobacterium tuberculosis (M.tb) are evolving at an alarming rate, which indicates an urgent need for the development of novel antitubercular drugs. However, genetic mutations, the complex cell wall system of M.tb, and influx-efflux transporter systems are the major permeability barriers that significantly affect the activity of M.tb drugs. Thus, most small molecules are ineffective at arresting M.tb cell growth, even though they are effective at the cellular level. To address the permeability issue, different machine learning models that effectively distinguish permeable from impermeable compounds were developed. Enzyme-based (IC50) and cell-based (minimal inhibitory concentration) data were considered for the classification of M.tb permeable and impermeable compounds. It was assumed that compounds with high activity in both enzyme-based and cell-based assays possess the required M.tb cell wall permeability. The XGBoost model outperformed the models generated from other algorithms such as random forest, support vector machine, and naïve Bayes. The XGBoost model was further validated using a validation data set (21 permeable and 19 impermeable compounds). The machine learning models suggested that descriptors such as molecular weight, atom type, electrotopological state, hydrogen bond donor/acceptor counts, and extended topochemical atoms are the major determining factors for both M.tb cell permeability and inhibitory activity. Furthermore, potential antimycobacterial drugs were identified using computational drug repurposing. All the approved drugs from DrugBank were collected and screened using the developed permeability model. The screened compounds were given as input to the PASS server for the identification of possible antimycobacterial compounds. The drugs retained after these two filters were docked to the active sites of 10 different potential antimycobacterial drug targets. The results obtained from this study may improve the understanding of M.tb permeability and activity, which may aid in the development of novel antimycobacterial drugs.
