Machine learning for discovery: deciphering RNA splicing logic
ABSTRACT: Machine learning methods, particularly neural networks trained on large datasets, are transforming how scientists approach scientific discovery and experimental design. However, current state-of-the-art neural networks are limited by their uninterpretability: despite providing accurate predictions, they cannot describe how they arrived at their predictions. Here, using an "interpretable-by-design" approach, we present a neural network model that provides insights into RNA splicing, a fundamental process in the transfer of genomic information into functional biochemical products. Although we designed our model to emphasize interpretability, its predictive accuracy is on par with state-of-the-art models. To demonstrate the model's interpretability, we introduce a visualization that, for any given exon, allows us to trace and quantify the entire decision process from input sequence to output splicing prediction. Importantly, the model revealed novel components of the splicing logic, which we experimentally validated. This study highlights how interpretable machine learning can advance scientific discovery.
Project description: The incorporation of machine learning methods into proteomics workflows improves the identification of disease-relevant biomarkers and biological pathways. However, machine learning models, such as deep neural networks, typically suffer from a lack of interpretability. Here, we present a deep learning approach that combines biological pathway analysis and biomarker identification to increase the interpretability of proteomics experiments. Our approach integrates a priori knowledge of the relationships among proteins, biological pathways, and biological processes into sparse neural networks to create biologically informed neural networks. We employ these networks to differentiate between clinical subphenotypes of septic acute kidney injury and COVID-19, as well as acute respiratory distress syndrome of different aetiologies. To gain biological insight into these complex syndromes, we use feature-attribution methods to introspect the networks and identify the proteins and pathways that are important for distinguishing between subtypes. The algorithms are implemented in a freely available open-source Python package (https://github.com/InfectionMedicineProteomics/BINN).
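The idea of building a priori protein-to-pathway knowledge into a sparse network can be illustrated with a minimal sketch. This is not the BINN package's API: the protein and pathway names, the weights, and the helper functions `build_mask` and `masked_layer` are all hypothetical, and the sketch assumes that sparsity is enforced by multiplying each weight by a 0/1 annotation mask.

```python
import math

# Hypothetical toy annotation: which proteins belong to which pathways.
PATHWAY_MEMBERS = {
    "pathway_A": ["P1", "P2"],
    "pathway_B": ["P2", "P3"],
}
PROTEINS = ["P1", "P2", "P3"]


def build_mask(proteins, pathway_members):
    """Return mask[j][i] = 1 if protein i is annotated to pathway j, else 0."""
    return [
        [1 if protein in members else 0 for protein in proteins]
        for members in pathway_members.values()
    ]


def masked_layer(x, weights, mask, bias):
    """Dense layer whose weights are zeroed wherever the annotation mask is 0.

    Each output node corresponds to one pathway, so its activation depends
    only on the proteins annotated to that pathway.
    """
    outputs = []
    for w_row, m_row, b in zip(weights, mask, bias):
        z = b + sum(w * m * xi for w, m, xi in zip(w_row, m_row, x))
        outputs.append(math.tanh(z))
    return outputs


mask = build_mask(PROTEINS, PATHWAY_MEMBERS)
weights = [[0.5, -0.2, 0.9], [0.3, 0.8, -0.4]]  # arbitrary illustrative values
bias = [0.0, 0.0]
activations = masked_layer([1.0, 2.0, 3.0], weights, mask, bias)
```

Because every hidden node maps to a named pathway and every surviving edge to a known annotation, an attribution score assigned to a node or edge can be read directly in biological terms. In this toy example, changing the abundance of P3 cannot affect the pathway_A node.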
Project description: With the emergence of various drug-based treatment strategies, many drug response prediction models have been developed to understand their effects. To give a comprehensive understanding of a drug response, a prediction model should reflect the underlying biological mechanisms, but current models suffer from interpretability and scalability problems. Machine learning-based models base their predictions on inferred features, which are usually not well correlated with biological mechanisms, and this makes their response predictions hard to interpret. Boolean modeling schemes may allow the mechanisms that contribute to a particular response to be interpreted, but optimizing Boolean models is difficult because of their high-dimensional search space and discontinuous loss function. Here, we developed a scalable derivative-free optimizer for weighted-sum Boolean networks through meta-reinforcement learning. By using a graph network and a coordinate-wise policy, our learned optimizer can optimize high-dimensional Boolean networks of arbitrary structure containing over 100 parameters, with higher sample efficiency than other meta-heuristic algorithms. The optimized Boolean networks successfully predict drug responses consistent with public databases and in-house experimental data. Moreover, mechanistic analysis of the optimized networks shows that their predictions are reliably interpretable, pointing to known drug response prediction markers from basket trials.
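The difficulty with a discontinuous loss can be seen in a small sketch. It assumes one common form of weighted-sum Boolean network, in which a node switches on when its bias plus the weighted sum of its inputs exceeds zero; the three-node network, its weights, and all function names are hypothetical. It does not implement the meta-reinforcement learning optimizer described above.

```python
def step(state, weights, bias):
    """One synchronous update: node i is on iff bias + weighted input sum > 0."""
    return [
        1 if b + sum(w * s for w, s in zip(row, state)) > 0 else 0
        for row, b in zip(weights, bias)
    ]


def simulate(state, weights, bias, n_steps):
    """Apply the synchronous update rule n_steps times."""
    for _ in range(n_steps):
        state = step(state, weights, bias)
    return state


def mismatch_loss(weights, bias, initial, target, n_steps=5):
    """Count nodes whose final state differs from the target state."""
    final = simulate(initial, weights, bias, n_steps)
    return sum(f != t for f, t in zip(final, target))


# Hypothetical network: node 0 = drug, node 1 = drug target, node 2 = response.
# The drug sustains itself, inhibits its target, and the target drives the response.
WEIGHTS = [[1, 0, 0], [-1, 0, 0], [0, 1, 0]]
BIAS = [0, 0.5, -0.5]


def with_inhibition(strength):
    """Copy WEIGHTS with a different drug-to-target inhibition strength."""
    w = [row[:] for row in WEIGHTS]
    w[1][0] = -strength
    return w


initial, target = [1, 1, 1], [1, 0, 0]
loss_strong = mismatch_loss(with_inhibition(0.9), BIAS, initial, target)  # 0
loss_weak = mismatch_loss(with_inhibition(0.4), BIAS, initial, target)    # 2
```

The loss is identical for every inhibition strength above 0.5 and jumps once the strength falls to 0.5 or below, so it carries no gradient signal anywhere. That is the property that calls for a derivative-free optimizer.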
Project description: High-throughput screening and gene signature analyses frequently identify lead therapeutic compounds with unknown modes of action (MoAs), and the resulting uncertainties can lead to the failure of clinical trials. We developed a multi-omics approach for uncovering MoAs through an interpretable machine learning model of the effects of compounds on transcriptomic, epigenomic, metabolomic, and proteomic data. We applied this approach to examine compounds with beneficial effects in models of Huntington’s disease, finding common MoAs for previously unrelated compounds that were not predicted based on similarities in the compounds’ structures, connectivity scores, or binding targets. We experimentally validated two such disease-relevant MoAs, autophagy activation and bioenergetics manipulation. This interpretable machine learning approach can be used to find and evaluate MoAs in future drug development efforts.
Project description: To identify genes with cell-lineage-specific expression not accessible by experimental micro-dissection, we developed a genome-scale iterative method, in-silico nano-dissection, which leverages high-throughput functional-genomics data from tissue homogenates using a machine-learning framework. This study applied nano-dissection to chronic kidney disease (CKD) and identified transcripts specific to podocytes, key cells in the glomerular filter responsible for hereditary proteinuric syndromes and acquired CKD. In-silico prediction accuracy exceeded that of predictions derived from fluorescence-tagged murine podocytes; the method also identified genes recently implicated in hereditary glomerular disease and predicted genes significantly correlated with kidney function. The nano-dissection method is broadly applicable to defining lineage specificity in many functional and disease contexts. We applied a machine-learning framework to high-throughput gene expression data from human kidney biopsy tissue homogenates and predicted novel podocyte-specific genes. The predictions were validated at the protein level using the Human Protein Atlas. Prediction accuracy was compared with that of predictions derived from an experimental approach using fluorescence-tagged murine podocytes.