Project description:Here, we combine comparative regulatory genomics with machine learning to investigate enhancer logic in melanoma. Through epigenomics profiling of 26 melanoma cell lines across six species, we examine the conservation of the two main melanoma states and underlying master regulators. By training a deep neural network on topic models derived from the human lines, we were able to classify not only human melanoma enhancers, but also regulatory regions in the other species. The deep learning model revealed important genomic features (i.e. TF binding motifs) for the different melanoma states, how they co-occur within melanoma enhancers, and where they are placed with respect to the central enhancer nucleosome. This in-depth analysis of the melanoma enhancer code allowed us to propose a mechanistic model of TF binding in MEL melanoma enhancers. Finally, by exploiting the deep layers of our model, we are able to identify causal mutations for melanoma enhancer loss and gain through evolution, not only affecting enhancer accessibility but also activity.
Project description:Deciphering the genomic regulatory code of enhancers is a key challenge in biology because this code underlies cellular identity. A better understanding of how enhancers work will improve the interpretation of noncoding genome variation and empower the generation of cell type-specific drivers for gene therapy. Here, we explore the combination of deep learning and cross-species chromatin accessibility profiling to build explainable enhancer models. We apply this strategy to decipher the enhancer code in melanoma, a relevant case study owing to the presence of distinct melanoma cell states. We trained and validated a deep learning model, called DeepMEL, using chromatin accessibility data of 26 melanoma samples across six different species. We show the accuracy of DeepMEL predictions on the CAGI5 challenge, where it significantly outperforms existing models on the melanoma enhancer of IRF4. Next, we exploit DeepMEL to analyze enhancer architectures and identify accurate transcription factor binding sites for the core regulatory complexes in the two different melanoma states, with distinct roles for each transcription factor, in terms of nucleosome displacement or enhancer activation. Finally, DeepMEL identifies orthologous enhancers across distantly related species, where sequence alignment fails, and the model highlights specific nucleotide substitutions that underlie enhancer turnover. DeepMEL can be used from the Kipoi database to predict and optimize candidate enhancers and to prioritize enhancer mutations. In addition, our computational strategy can be applied to other cancer or normal cell types.
Project description:Machine learning methods, particularly neural networks trained on large datasets, are transforming how scientists approach scientific discovery and experimental design. However, current state-of-the-art neural networks are limited by their uninterpretability: despite providing accurate predictions, they cannot describe how they arrived at their predictions. Here, using an "interpretable-by-design" approach, we present a neural network model that provides insights into RNA splicing, a fundamental process in the transfer of genomic information into functional biochemical products. Although we designed our model to emphasize interpretability, its predictive accuracy is on par with state-of-the-art models. To demonstrate the model's interpretability, we introduce a visualization that, for any given exon, allows us to trace and quantify the entire decision process from input sequence to output splicing prediction. Importantly, the model revealed novel components of the splicing logic, which we experimentally validated. This study highlights how interpretable machine learning can advance scientific discovery.
Project description:Increasingly, experimental data on biological systems are obtained from several sources, and computational approaches are required to integrate this information and derive models for the function of the system. Here, we demonstrate the power of a logic-based machine learning approach to propose hypotheses for gene function by integrating information from two diverse experimental approaches. Specifically, we use inductive logic programming, which automatically proposes hypotheses explaining the empirical data with respect to logically encoded background knowledge. We study the capsular polysaccharide biosynthetic pathway of the major human gastrointestinal pathogen Campylobacter jejuni. We consider several key steps in the formation of capsular polysaccharide, involving 15 genes of which 8 have assigned function, and we explore the extent to which functions can be hypothesised for the remaining 7. Two sources of experimental data provide the information for learning: the results of knockout experiments on the genes involved in capsule formation and the absence/presence of capsule genes in a multitude of strains of different serotypes. The machine learning uses the pathway structure as background knowledge. We propose assignments of specific genes to five previously unassigned reaction steps. For four of these steps, there was an unambiguous optimal assignment of gene to reaction, and for the fifth, there were three candidate genes. Several of these assignments were consistent with additional experimental results. We therefore show that the logic-based methodology provides a robust strategy to integrate results from different experimental approaches and propose hypotheses for the behaviour of a biological system. [Data is also available from http://bugs.sgul.ac.uk/E-BUGS-132]
Project description:Leaf senescence is a tightly controlled and complex developmental process that shares many similarities across species, yet our understanding of the underlying conserved molecular mechanisms is still lacking. Here, we observed functional conservation of the pathways underlying leaf senescence in A. thaliana, O. sativa, and S. lycopersicum. By integrating data from nearly 10 000 samples with machine learning to obtain a universal regulatory network of leaf senescence, we found that mitostasis is the central biological hub across species. We measured and compared changes in the transcriptome and metabolome of A. thaliana, O. sativa, and S. lycopersicum leaves under mitostress and natural senescence. In data from the different species, enrichment of mitostasis-related transcription factor binding sites and changes in amino acid expression converge on putative senescence modulators. Our study provides a cross-species, multi-omics perspective for understanding the conserved mechanisms of leaf senescence.
Project description:Core regulatory transcription factors (CR TFs) define cell identity and lineage through an exquisitely precise and logical order during embryogenesis and development. These CR TFs regulate one another in three-dimensional space via distal enhancers that serve as logic gates embedded in their TF recognition sequences. Aberrant chromatin organization resulting in miswired circuitry of enhancer logic is a newly recognized feature in many cancers. Here, we report that PAX3-FOXO1 expression is driven by a translocated FOXO1 distal super enhancer (SE). ChIP-seq in tumors bearing rare PAX translocations implicates enhancer miswiring as a pervasive feature across all FP-RMS tumors. Therefore, our data reveal a mechanism in which a translocated, hijacked enhancer disrupts the normal CR TF logic during skeletal muscle development (PAX3 to MYOD to MYOG), replacing it with an infinite loop logic that makes rhabdomyosarcoma cells unable to exit the undifferentiated proliferating stage.
Project description:Combinations of transcription factors govern the identity of cell types, which is reflected by genomic enhancer codes. We utilized deep learning to characterize these enhancer codes and devised three novel metrics to compare cell types in the telencephalon between mammals and birds. To this end, we generated single-cell multiome and spatially-resolved transcriptomics data of the chicken telencephalon. Enhancer codes of orthologous non-neuronal and GABAergic cell types show a high degree of similarity across vertebrates, while excitatory neurons of the mammalian neocortex and avian pallium exhibit varying degrees of similarity. Enhancer codes of avian mesopallial neurons are most similar to those of mammalian deep layer neurons. With this study, we present generally applicable deep learning approaches to characterize and compare cell types solely based on genomic sequences.