Dataset Information

Error Tolerance of Machine Learning Algorithms across Contemporary Biological Targets.

ABSTRACT: Machine learning continues to make rapid advances in predicting properties relevant to drug development. However, its efficacy in this area depends on highly accurate and abundant data. These two limitations, accuracy and abundance, are often considered together; however, understanding how much dataset inaccuracy contemporary machine learning algorithms can tolerate may show whether non-bench sources of data can be used to build useful models where experimental data are scarce. We took highly accurate data across six kinase types, one GPCR, one polymerase, a human protease, and HIV protease, and intentionally introduced error at varying population proportions in the datasets for each target. With the generated error in the data, we explored how the retrospective accuracy of a Naïve Bayes Network, a Random Forest Model, and a Probabilistic Neural Network model decayed as a function of error. Additionally, we explored whether a training dataset with an error profile resembling that produced by the Free Energy Perturbation method (FEP+) could generate machine learning models with useful retrospective capabilities. The Naïve Bayes Network algorithm tolerated a high degree of categorical error: on average, 39% error in the training set was required to lose predictivity on the test set. The Random Forest also tolerated a significant degree of categorical error in the training set, with an average of 29% error required to lose predictivity. The Probabilistic Neural Network algorithm was less tolerant, requiring an average of 20% error to lose predictivity. Finally, we found that a Naïve Bayes Network and a Random Forest could both use datasets with an error profile resembling that of FEP+. This work demonstrates that computational methods of known error distribution like FEP+ may be useful in generating machine learning models not based on extensive and expensive in vitro-generated datasets.
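The abstract's idea of introducing categorical error "at varying population proportions" can be illustrated with a minimal sketch. It is not the authors' protocol: the binary active/inactive labels, the uniform random flipping scheme, and the function names `inject_label_error` and `observed_error` are all assumptions made here for illustration.

```python
import random


def inject_label_error(labels, error_fraction, seed=0):
    """Return a copy of binary (0/1) labels with a fixed fraction flipped.

    Hypothetical helper: flipping labels uniformly at random is an assumed
    noise model, not the scheme used in the paper.
    """
    if not 0.0 <= error_fraction <= 1.0:
        raise ValueError("error_fraction must be between 0 and 1")
    rng = random.Random(seed)
    n_flip = round(error_fraction * len(labels))
    # Choose which positions receive a wrong label.
    flip_idx = set(rng.sample(range(len(labels)), n_flip))
    return [1 - y if i in flip_idx else y for i, y in enumerate(labels)]


def observed_error(clean, noisy):
    """Fraction of positions where the noisy label differs from the clean one."""
    return sum(c != n for c, n in zip(clean, noisy)) / len(clean)


# Toy data: 50 "active" (1) and 50 "inactive" (0) compounds.
clean_labels = [1] * 50 + [0] * 50
for fraction in (0.0, 0.1, 0.2, 0.3, 0.4):
    noisy = inject_label_error(clean_labels, fraction)
    print(f"target error {fraction:.0%} -> observed {observed_error(clean_labels, noisy):.0%}")
```

In a study of this kind, each noisy label set would be used to train a classifier, and its accuracy on a clean test set would be tracked as the error fraction grows.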

SUBMITTER: Kaiser TM

PROVIDER: S-EPMC6601015 | biostudies-literature | 2019 Jun

REPOSITORIES: biostudies-literature


Publications

Error Tolerance of Machine Learning Algorithms across Contemporary Biological Targets.

Kaiser, Thomas M.; Burger, Pieter B.

Molecules (Basel, Switzerland), 2019 Jun 4, issue 11

PMID: 31167452

Similar Datasets

Project description:
Objectives: Post-stroke depression (PSD) is a common and serious psychiatric complication that hinders the functional recovery and social participation of stroke patients. Stroke is characterized by dynamic changes in metabolism and hemodynamics; however, effective and reliable metabolism-associated diagnostic markers and therapeutic targets for PSD are still lacking. Our study was dedicated to the discovery of metabolism-related diagnostic and therapeutic biomarkers for PSD.
Methods: Expression profiles of GSE140275, GSE122709, and GSE180470 were obtained from the GEO database. Differentially expressed genes (DEGs) were detected in GSE140275 and GSE122709. Functional enrichment analysis was performed for DEGs in GSE140275. Weighted gene co-expression network analysis (WGCNA) was performed in GSE122709 to identify key module genes. Moreover, correlation analysis was performed to obtain metabolism-related genes. Interaction analysis of key module genes, metabolism-related genes, and DEGs in GSE122709 was performed to obtain candidate hub genes. Two machine learning algorithms, least absolute shrinkage and selection operator (LASSO) and random forest, were used to identify signature genes. Expression of signature genes was validated in GSE140275, GSE122709, and GSE180470. Gene set enrichment analysis (GSEA) was applied to the signature genes. Based on the signature genes, a nomogram model was constructed in our PSD cohort (27 PSD patients vs. 54 controls). ROC curves were used to estimate its diagnostic value. Finally, correlation analysis between expression of signature genes and several clinical traits was performed.
Results: Functional enrichment analysis indicated that DEGs in GSE140275 were enriched in metabolism pathways. A total of 8,188 metabolism-associated genes were identified by correlation analysis. WGCNA identified 3,471 key module genes. A total of 557 candidate hub genes were identified by interaction analysis. Furthermore, two signature genes (SDHD and FERMT3) were selected using LASSO and random forest analysis. GSEA found that the two signature genes had major roles in depression. Subsequently, a PSD cohort was collected for constructing a PSD diagnostic model. The nomogram model showed good reliability and validity. The AUC values of the receiver operating characteristic (ROC) curves of SDHD and FERMT3 were 0.896 and 0.964, respectively. ROC curves showed that the two signature genes played a significant role in the diagnosis of PSD. Correlation analysis found that SDHD (r = 0.653, P < 0.001) and FERMT3 (r = 0.728, P < 0.001) were positively related to the Hamilton Depression Rating Scale 17-item (HAMD) score.
Conclusion: A total of 557 metabolism-associated candidate hub genes were obtained by interaction analysis of DEGs in GSE122709, key module genes, and metabolism-related genes. Based on machine learning algorithms, two signature genes (SDHD and FERMT3) were identified and shown to be valuable therapeutic and diagnostic biomarkers for PSD. Our findings make early diagnosis and prevention of PSD possible.

Project description:
Background: Here, we outline a method of applying existing machine learning (ML) approaches to aid citation screening in an on-going broad and shallow systematic review of preclinical animal studies. The aim is to achieve a high-performing algorithm comparable to human screening that can reduce human resources required for carrying out this step of a systematic review.
Methods: We applied ML approaches to a broad systematic review of animal models of depression at the citation screening stage. We tested two independently developed ML approaches which used different classification models and feature sets. We recorded the performance of the ML approaches on an unseen validation set of papers using sensitivity, specificity and accuracy. We aimed to achieve 95% sensitivity and to maximise specificity. The classification model providing the most accurate predictions was applied to the remaining unseen records in the dataset and will be used in the next stage of the preclinical biomedical sciences systematic review. We used a cross-validation technique to assign ML inclusion likelihood scores to the human screened records, to identify potential errors made during the human screening process (error analysis).
Results: ML approaches reached 98.7% sensitivity based on learning from a training set of 5749 records, with an inclusion prevalence of 13.2%. The highest level of specificity reached was 86%. Performance was assessed on an independent validation dataset. Human errors in the training and validation sets were successfully identified using the assigned inclusion likelihood from the ML model to highlight discrepancies. Training the ML algorithm on the corrected dataset improved the specificity of the algorithm without compromising sensitivity. Error analysis correction leads to a 3% improvement in sensitivity and specificity, which increases precision and accuracy of the ML algorithm.
Conclusions: This work has confirmed the performance and application of ML algorithms for screening in systematic reviews of preclinical animal studies. It has highlighted the novel use of ML algorithms to identify human error. This needs to be confirmed in other reviews with different inclusion prevalence levels, but represents a promising approach to integrating human decisions and automation in systematic review methodology.

Project description:
Background: Epilepsy is a disorder that can manifest as abnormalities in neurological or physical function. Stress cardiomyopathy is closely associated with neurological stimulation. However, the mechanisms underlying the interrelationship between epilepsy and stress cardiomyopathy are unclear. This paper aims to explore the genetic features and potential molecular mechanisms shared in epilepsy and stress cardiomyopathy.
Methods: The epilepsy dataset and the stress cardiomyopathy dataset were analyzed separately, and the intersection of the differential genes co-expressed in the two diseases was obtained. The co-expressed differential genes were used to reveal biological functions, a network was constructed, and core modules were identified to reveal the interaction mechanism. Co-expressed genes with diagnostic validity were screened by machine learning algorithms and validated in parallel on the epilepsy single-cell data and the stress cardiomyopathy rat model.
Results: Epilepsy causes stress cardiomyopathy. Its key pathways are Complement and coagulation cascades and the HIF-1 signaling pathway, and its key co-expressed genes include SPOCK2, CTSZ, HLA-DMB, ALDOA, SFRP1, and ERBB3. The key immune cell subpopulations localized by single-cell data are the T_cells subgroup, Microglia subgroup, Macrophage subgroup, Astrocyte subgroup, and Oligodendrocytes subgroup.
Conclusion: We believe epilepsy causing stress cardiomyopathy results from a multi-gene, multi-pathway combination. We identified the core co-expressed genes (SPOCK2, CTSZ, HLA-DMB, ALDOA, SFRP1, ERBB3) and the pathways that function in them (Complement and coagulation cascades, HIF-1 signaling pathway, JAK-STAT signaling pathway), and finally localized their key cellular subgroups (T_cells subgroup, Microglia subgroup, Macrophage subgroup, Astrocyte subgroup, and Oligodendrocytes subgroup). Also, combining cell subpopulations with hypercoagulability as well as sympathetic excitation further narrowed the cell subpopulations of related functions.

