Dataset Information

Feature combination networks for the interpretation of statistical machine learning models: application to Ames mutagenicity.

ABSTRACT:

Background

A new algorithm has been developed to enable the interpretation of black box models. The developed algorithm is agnostic to learning algorithm and open to all structural based descriptors such as fragments, keys and hashed fingerprints. The algorithm has provided meaningful interpretation of Ames mutagenicity predictions from both random forest and support vector machine models built on a variety of structural fingerprints.A fragmentation algorithm is utilised to investigate the model's behaviour on specific substructures present in the query. An output is formulated summarising causes of activation and deactivation. The algorithm is able to identify multiple causes of activation or deactivation in addition to identifying localised deactivations where the prediction for the query is active overall. No loss in performance is seen as there is no change in the prediction; the interpretation is produced directly on the model's behaviour for the specific query.

Results

Models have been built using multiple learning algorithms including support vector machine and random forest. The models were built on public Ames mutagenicity data and a variety of fingerprint descriptors were used. These models produced a good performance in both internal and external validation with accuracies around 82%. The models were used to evaluate the interpretation algorithm. Interpretation was revealed that links closely with understood mechanisms for Ames mutagenicity.

Conclusion

This methodology allows for a greater utilisation of the predictions made by black box models and can expedite further study based on the output for a (quantitative) structure activity model. Additionally the algorithm could be utilised for chemical dataset investigation and knowledge extraction/human SAR development.

SUBMITTER: Webb SJ

PROVIDER: S-EPMC3997921 | biostudies-literature | 2014 Mar

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Feature combination networks for the interpretation of statistical machine learning models: application to Ames mutagenicity.

Webb Samuel J SJ Hanser Thierry T Howlin Brendan B Krause Paul P Vessey Jonathan D JD

Journal of cheminformatics 20140325 1

<h4>Background</h4>A new algorithm has been developed to enable the interpretation of black box models. The developed algorithm is agnostic to learning algorithm and open to all structural based descriptors such as fragments, keys and hashed fingerprints. The algorithm has provided meaningful interpretation of Ames mutagenicity predictions from both random forest and support vector machine models built on a variety of structural fingerprints.A fragmentation algorithm is utilised to investigate t ...[more]

PMID: 24661325

Similar Datasets

Project description:The International Conference on Harmonization (ICH) M7 guideline allows the use of in silico approaches for predicting Ames mutagenicity for the initial assessment of impurities in pharmaceuticals. This is the first international guideline that addresses the use of quantitative structure-activity relationship (QSAR) models in lieu of actual toxicological studies for human health assessment. Therefore, QSAR models for Ames mutagenicity now require higher predictive power for identifying mutagenic chemicals. To increase the predictive power of QSAR models, larger experimental datasets from reliable sources are required. The Division of Genetics and Mutagenesis, National Institute of Health Sciences (DGM/NIHS) of Japan recently established a unique proprietary Ames mutagenicity database containing 12140 new chemicals that have not been previously used for developing QSAR models. The DGM/NIHS provided this Ames database to QSAR vendors to validate and improve their QSAR tools. The Ames/QSAR International Challenge Project was initiated in 2014 with 12 QSAR vendors testing 17 QSAR tools against these compounds in three phases. We now present the final results. All tools were considerably improved by participation in this project. Most tools achieved >50% sensitivity (positive prediction among all Ames positives) and predictive power (accuracy) was as high as 80%, almost equivalent to the inter-laboratory reproducibility of Ames tests. To further increase the predictive power of QSAR tools, accumulation of additional Ames test data is required as well as re-evaluation of some previous Ames test results. Indeed, some Ames-positive or Ames-negative chemicals may have previously been incorrectly classified because of methodological weakness, resulting in false-positive or false-negative predictions by QSAR tools. These incorrect data hamper prediction and are a source of noise in the development of QSAR models. It is thus essential to establish a large benchmark database consisting only of well-validated Ames test results to build more accurate QSAR models.

Project description:BackgroundIn drug discovery, a positive Ames test for bacterial mutation presents a significant hurdle to advancing a drug to clinical trials. In a previous paper, we discussed success in predicting the genotoxicity of reagent-sized aryl-amines (ArNH2), a structure frequently found in marketed drugs and in drug discovery, using quantum mechanics calculations of the energy required to generate the DNA-reactive nitrenium intermediate (ArNH:+). In this paper we approach the question of what molecular descriptors could improve these predictions and whether external data sets are appropriate for further training.ResultsIn trying to extend and improve this model beyond this quantum mechanical reaction energy, we faced considerable difficulty, which was surprising considering the long history and success of QSAR model development for this test. Other quantum mechanics descriptors were compared to this reaction energy including AM1 semi-empirical orbital energies, nitrenium formation with alternative leaving groups, nitrenium charge, and aryl-amine anion formation energy. Nitrenium formation energy, regardless of the starting species, was found to be the most useful single descriptor. External sets used in other QSAR investigations did not present the same difficulty using the same methods and descriptors. When considering all substructures rather than just aryl-amines, we also noted a significantly lower performance for the Novartis set. The performance gap between Novartis and external sets persists across different descriptors and learning methods. The profiles of the Novartis and external data are significantly different both in aryl-amines and considering all substructures. The Novartis and external data sets are easily separated in an unsupervised clustering using chemical fingerprints. The chemical differences are discussed and visualized using Kohonen Self-Organizing Maps trained on chemical fingerprints, mutagenic substructure prevalence, and molecular weight.ConclusionsDespite extensive work in the area of predicting this particular toxicity, work in designing and publishing more relevant test sets for compounds relevant to drug discovery is still necessary. This work also shows that great care must be taken in using QSAR models to replace experimental evidence. When considering all substructures, a random forest model, which can inherently cover distinct neighborhoods, built on Novartis data and previously reported external data provided a suitable model.

Project description:In the recent JAVELIN Bladder 100 phase 3 trial, avelumab plus best supportive care significantly prolonged overall survival relative to best supportive care alone as first-line maintenance therapy following first-line platinum-based chemotherapy in patients with advanced urothelial cancer (aUC). Discovering biomarkers using genomic profiling to understand potential patient heterogeneity is essential to help improve patient care with precision medicine. For the JAVELIN Bladder 100 trial, it is unclear which variable selection methods can most reliably identify biomarkers to inform patient care because the dataset is characterized by high collinearity and low signal. The aim of this paper was to evaluate available selection methods and their ability to discover prognostic and predictive biomarkers in patients with aUC receiving first-line maintenance therapy. A simulation study evaluated the performance of popular variable selection approaches for high-dimensional data, including penalized regression models, random survival forests, and Bayesian variable selection methods. For Bayesian variable selection methods, a modified Bayesian Information Criterion (BIC) thresholding rule was proposed in addition to the traditional BIC thresholding rule. These methods were applied to the JAVELIN Bladder 100 dataset to investigate potential biomarkers associated with survival benefit. Results from the simulations demonstrated the strengths and limitations of the different methods. The variable selection methods demonstrated low false discovery rates under different conditions. However, their performance declined in the presence of high collinearity. Using the JAVELIN Bladder 100 data, we identified some potentially significant biomarkers across multiple models. Several lasso-related methods were able to identify potentially biologically meaningful variables in the trial. Some variable selection methods (such as stochastic search variable selection and random survival forest) may not be well suited to this type of data due to the presence of extreme collinearity and low signal. Future research should explore novel variable selection methods that may be more suitable for identifying prognostic and predictive biomarkers in this population.Trial registration: ClinicalTrials.gov Identifier: NCT02603432.

Project description:The robust control of genotoxic N-nitrosamine (NA) impurities is an important safety consideration for the pharmaceutical industry, especially considering recent drug product withdrawals. NAs belong to the 'cohort of concern' list of genotoxic impurities (ICH M7) because of the mutagenic and carcinogenic potency of this chemical class. In addition, regulatory concerns exist regarding the capacity of the Ames test to predict the carcinogenic potential of NAs because of historically discordant results. The reasons postulated to explain these discordant data generally point to aspects of Ames test study design. These include vehicle solvent choice, liver S9 species, bacterial strain, compound concentration, and use of pre-incubation versus plate incorporation methods. Many of these concerns have their roots in historical data generated prior to the harmonization of Ames test guidelines. Therefore, we investigated various Ames test assay parameters and used qualitative analysis and quantitative benchmark dose modelling to identify which combinations provided the most sensitive conditions in terms of mutagenic potency. Two alkyl-nitrosamines, N-nitrosodimethylamine (NDMA) and N-nitrosodiethylamine (NDEA) were studied. NDMA and NDEA mutagenicity was readily detected in the Ames test and key assay parameters were identified that contributed to assay sensitivity rankings. The pre-incubation method (30-min incubation), appropriate vehicle (water or methanol), and hamster-induced liver S9, alongside Salmonella typhimurium strains TA100 and TA1535 and Escherichia coli strain WP2uvrA(pKM101) provide the most sensitive combination of assay parameters in terms of NDMA and NDEA mutagenic potency in the Ames test. Using these parameters and further quantitative benchmark dose modelling, we show that N-nitrosomethylethylamine (NMEA) is positive in Ames test and therefore should no longer be considered a historically discordant NA. The results presented herein define a sensitive Ames test design that can be deployed for the assessment of NAs to support robust impurity qualifications.

Project description:In view of the growth of clinical risk prediction models using genetic data, there is an increasing need for studies that use appropriate methods to select the optimum number of features from a large number of genetic variants with a high degree of redundancy between features due to linkage disequilibrium (LD). Filter feature selection methods based on information theoretic criteria, are well suited to this challenge and will identify a subset of the original variables that should result in more accurate prediction. However, data collected from cohort studies are often high-dimensional genetic data with potential confounders presenting challenges to feature selection and risk prediction machine learning models. Patients with psoriasis are at high risk of developing a chronic arthritis known as psoriatic arthritis (PsA). The prevalence of PsA in this patient group can be up to 30% and the identification of high risk patients represents an important clinical research which would allow early intervention and a reduction of disability. This also provides us with an ideal scenario for the development of clinical risk prediction models and an opportunity to explore the application of information theoretic criteria methods. In this study, we developed the feature selection and psoriatic arthritis (PsA) risk prediction models that were applied to a cross-sectional genetic dataset of 1462 PsA cases and 1132 cutaneous-only psoriasis (PsC) cases using 2-digit HLA alleles imputed using the SNP2HLA algorithm. We also developed stratification method to mitigate the impact of potential confounder features and illustrate that confounding features impact the feature selection. The mitigated dataset was used in training of seven supervised algorithms. 80% of data was randomly used for training of seven supervised machine learning methods using stratified nested cross validation and 20% was selected randomly as a holdout set for internal validation. The risk prediction models were then further validated in UK Biobank dataset containing data on 1187 participants and a set of features overlapping with the training dataset.Performance of these methods has been evaluated using the area under the curve (AUC), accuracy, precision, recall, F1 score and decision curve analysis(net benefit). The best model is selected based on three criteria: the 'lowest number of feature subset' with the 'maximal average AUC over the nested cross validation' and good generalisability to the UK Biobank dataset. In the original dataset, with over 100 different bootstraps and seven feature selection (FS) methods, HLA_C_*06 was selected as the most informative genetic variant. When the dataset is mitigated the single most important genetic features based on rank was identified as HLA_B_*27 by the seven different feature selection methods, consistent with previous analyses of this data using regression based methods. However, the predictive accuracy of these single features in post mitigation was found to be moderate (AUC= 0.54 (internal cross validation), AUC=0.53 (internal hold out set), AUC=0.55(external data set)). Sequentially adding additional HLA features based on rank improved the performance of the Random Forest classification model where 20 2-digit features selected by Interaction Capping (ICAP) demonstrated (AUC= 0.61 (internal cross validation), AUC=0.57 (internal hold out set), AUC=0.58 (external dataset)). The stratification method for mitigation of confounding features and filter information theoretic feature selection can be applied to a high dimensional dataset with the potential confounders.

Dataset Information

Feature combination networks for the interpretation of statistical machine learning models: application to Ames mutagenicity.

Background

Results

Conclusion

Publications

Feature combination networks for the interpretation of statistical machine learning models: application to Ames mutagenicity.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets