Structure primed embedding on the transcription factor manifold enables transparent model architectures for gene regulatory network and latent activity inference
ABSTRACT: Regulation of gene expression in biological systems is a complex, nonlinear process composed of context-specific interactions, from signaling and transcription to genome modification. Modeling of gene regulatory networks (GRNs) is often limited by a lack of direct measurements of regulatory features in genome-wide screens. Most GRN inference methods are consequently forced to model covariance between regulatory genes and their targets as a proxy for causal interactions. This in turn complicates validation and reuse of predictive modeling frameworks. Disentangling covariance from causal influence requires aggregating independent and complementary sets of evidence, such as transcription factor (TF) binding and target gene expression. Common approaches overlap these lines of evidence to infer causal relations. However, the complete state of the system, e.g. TF activity (TFA), is unknown. Other methods try to estimate these latent features. These models often use linear frameworks that are unable to account for non-linearities, TF-TF interactions, and other higher order features. Deep learning frameworks can be used to model complex interactions between features and capture latent features of higher order. However, deep learning methods often discard central concepts in biological systems modeling, such as sparsity and latent feature interpretability, in favour of increased model complexity. In this work, we demonstrate that GRN inference using latent features such as TFA can be built into a single framework. We present a novel deep learning approach (the Supirfactor framework) that incorporates orthogonal evidence of regulation from multiple data types and maintains interpretable parameter estimates.
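One way to picture how a latent layer can stay sparse and interpretable is to restrict each latent unit to a single named TF and let it read only the genes that prior evidence (for example TF binding) links to that TF. The sketch below illustrates that general idea only; it is not the Supirfactor implementation. The function name `masked_tf_activity`, the gene and TF labels, the weights, and the ReLU nonlinearity are all assumptions made for the example.

```python
# Illustrative sketch (assumed names and values, not the Supirfactor code):
# a prior-masked layer in which each latent unit is one TF and only genes
# flagged in the prior mask can contribute to that TF's activity.

def masked_tf_activity(expression, weights, prior_mask):
    """Return one non-negative activity value per TF for a single sample.

    expression: list of n_genes expression values
    weights:    n_genes x n_tfs list of lists (hypothetical learned weights)
    prior_mask: n_genes x n_tfs list of lists with 1 where prior evidence
                links the gene to the TF and 0 elsewhere
    """
    n_genes = len(expression)
    n_tfs = len(prior_mask[0])
    activity = []
    for t in range(n_tfs):
        total = 0.0
        for g in range(n_genes):
            # Weights outside the prior are zeroed, which keeps the layer sparse.
            total += expression[g] * weights[g][t] * prior_mask[g][t]
        activity.append(max(0.0, total))  # ReLU: an assumed nonlinearity
    return activity


genes = ["geneA", "geneB", "geneC"]   # hypothetical gene labels
tfs = ["TF1", "TF2"]                  # hypothetical TF labels
prior_mask = [[1, 0], [1, 1], [0, 1]]
weights = [[0.5, 9.0], [0.5, 1.0], [9.0, -2.0]]
expression = [2.0, 4.0, 1.0]

activity = masked_tf_activity(expression, weights, prior_mask)
for name, value in zip(tfs, activity):
    print(name, value)
```

Because the mask zeroes every weight without prior support, the large off-prior weights (9.0) have no effect, and each output can be read directly as the activity of one named TF.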
Project description:Spatially resolved transcriptomics technologies have significantly enhanced our ability to understand cellular characteristics within tissue contexts. However, current analytical tools often treat cell type inference and cellular neighbourhood identification as separate and hard clustering processes, resulting in models that are not comparable across tissue feature scales and samples, thus hindering a unified understanding of tissue features. Our computational framework, SPARROW, addresses these challenges by representing cell types and cellular organization patterns as latent embeddings learned through an interconnected neural network architecture. SPARROW integrates clustering directly into the learning of these latent embeddings, enabling feature extraction specific to clustering while ensuring comparability across samples through shared latent spaces. When applied to diverse datasets, SPARROW outperformed state-of-the-art methods in cell type inference and microenvironment zone delineation and uncovered microenvironment zone-specific fine cell states that reveal underlying biology. Furthermore, SPARROW algorithmically achieves single cell spatial resolution and whole transcriptome coverage, an experimental challenge, by integrating spatially resolved transcriptomics and scRNA-seq data in a shared latent space. This formulation enabled SPARROW to uncover both established and novel microenvironment zone-specific ligand-receptor interactions in human tonsils, discoveries not possible with either data modality alone. Overall, SPARROW provides a comprehensive characterization of tissue features across scales, samples and conditions.
Project description:Sequence-based deep learning models have become the state of the art for the analysis of the genomic regulatory code. Particularly for transcriptional enhancers, deep learning models excel at deciphering sequence features and grammar that underlie their spatiotemporal activity. To enable end-to-end enhancer modeling and design, we developed a software and modeling package, called CREsted. It combines preprocessing starting from single-cell ATAC-seq data; modeling with a choice of several architectures for training classification and regression models on either topics or pseudobulk peak heights; sequence design using multiple strategies; and downstream analysis through a collection of tools to locate transcription factor (TF) binding sites, infer the effect of a TF (activating or repressing) on enhancer accessibility, decipher enhancer grammar, and score gene loci. We demonstrate CREsted using a mouse cortex model that we validate using the BICCN collection of in vivo validated mouse brain enhancers. Classical enhancers in immune cells, including the IFN-β enhanceosome are revisited using a PBMC model, and we assess the accuracy of TF binding site predictions with ChIP-seq. Additionally, we use CREsted to compare mesenchymal-like cancer cell states between tumor types; and we investigate different fine-tuning strategies of Borzoi within CREsted, comparing their performance and explainability with CREsted models trained from scratch. Finally, we train a CREsted model on a scATAC-seq atlas of zebrafish development, and use this to design and in vivo validate cell type-specific synthetic enhancers in 3 tissues. For varying datasets we demonstrate that CREsted facilitates efficient training and analyses, enabling scrutinization of the enhancer logic and design of synthetic enhancers across tissues and species. CREsted is available at https://crested.readthedocs.io.
Project description:Transcriptional regulatory networks (TRNs) provide insight into cellular behavior by describing interactions between transcription factors (TFs) and their gene targets. The Assay for Transposase Accessible Chromatin (ATAC)-seq, coupled with transcription-factor motif analysis, provides indirect evidence of chromatin binding for hundreds of TFs genome-wide. Here, we propose methods for TRN inference in a mammalian setting, using ATAC-seq data to influence gene expression modeling. We rigorously test our methods in the context of T Helper Cell Type 17 (Th17) differentiation, generating new ATAC-seq data to complement existing Th17 genomic resources (plentiful gene expression data, TF knock-outs and ChIP-seq experiments). In this resource-rich mammalian setting, our extensive benchmarking provides quantitative, genome-scale evaluation of TRN inference combining ATAC-seq and RNA-seq data. We refine and extend our previous Th17 TRN, using our new TRN inference methods to integrate all Th17 data (gene expression, ATAC-seq, TF KO, ChIP-seq). We highlight new roles for individual TFs and groups of TFs (“TF-TF modules”) in Th17 gene regulation. Given the popularity of ATAC-seq (a widely adopted protocol with high resolution and low sample input requirements), we anticipate that application of our methods will improve TRN inference in new mammalian systems and be of particular use for rare, uncharacterized cell types.
Project description:The identification of cell-type-specific 3D chromatin interactions between regulatory elements can help to decipher gene regulation and to interpret the function of disease-associated non-coding variants. However, current chromosome conformation capture (3C) technologies are unable to resolve interactions at this resolution when only small numbers of cells are available as input. We therefore present ChromaFold, a deep learning model that predicts 3D contact maps and regulatory interactions from single-cell ATAC sequencing (scATAC-seq) data alone. ChromaFold uses pseudobulk chromatin accessibility, co-accessibility profiles across metacells, and predicted CTCF motif tracks as input features and employs a lightweight architecture to enable training on standard GPUs. Once trained on paired scATAC-seq and Hi-C data in human cell lines and tissues, ChromaFold can accurately predict both the 3D contact map and peak-level interactions across diverse human and mouse test cell types. In benchmarking against a recent deep learning method that uses bulk ATAC-seq, DNA sequence, and CTCF ChIP-seq to make cell-type-specific predictions, ChromaFold yields superior prediction performance when including CTCF ChIP-seq data as an input and comparable performance without. Finally, fine-tuning ChromaFold on paired scATAC-seq and Hi-C in a complex tissue enables deconvolution of chromatin interactions across cell subpopulations. ChromaFold thus achieves state-of-the-art prediction of 3D contact maps and regulatory interactions using scATAC-seq alone as input data, enabling accurate inference of cell-type-specific interactions in settings where 3C-based assays are infeasible.
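The description above lists co-accessibility across metacells as an input feature without saying how it is computed. As a rough illustration of the concept, the sketch below scores two peaks by the Pearson correlation of their accessibility across metacells. Pearson correlation, the function name `pearson`, and the toy accessibility values are assumptions for this example and may differ from what ChromaFold actually does.

```python
# Illustrative sketch (assumed metric and toy data, not the ChromaFold code):
# co-accessibility of two peaks as the Pearson correlation of their
# accessibility values across metacells.
from math import sqrt


def pearson(u, v):
    """Pearson correlation of two equal-length lists; 0.0 if either is constant."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return cov / (norm_u * norm_v)


# Hypothetical accessibility of three peaks across four metacells.
peak_a = [1.0, 2.0, 3.0, 4.0]
peak_b = [2.0, 4.0, 6.0, 8.0]   # opens together with peak_a
peak_c = [4.0, 3.0, 2.0, 1.0]   # opens when peak_a closes

print(pearson(peak_a, peak_b))
print(pearson(peak_a, peak_c))
```

Peaks that open and close together across metacells score near 1, while peaks with opposite patterns score near -1.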
Project description:Here, we combine comparative regulatory genomics with machine learning to investigate enhancer logic in melanoma. Through epigenomics profiling of 26 melanoma cell lines across six species, we examine the conservation of the two main melanoma states and underlying master regulators. By training a deep neural network on topic models derived from the human lines, we were able to classify not only human melanoma enhancers, but also regulatory regions in the other species. The deep learning model revealed important genomic features (i.e. TF binding motifs) for the different melanoma states, how they co-occur within melanoma enhancers, and where they are placed with respect to the central enhancer nucleosome. This in-depth analysis of the melanoma enhancer code allowed us to propose a mechanistic model of TF binding in MEL melanoma enhancers. Finally, by exploiting the deep layers of our model, we are able to identify causal mutations for melanoma enhancer loss and gain through evolution, not only affecting enhancer accessibility but also activity.
Project description:Transcription factors (TFs) bind combinatorially to genomic cis-regulatory elements (cREs), orchestrating transcription programs. While studies of chromatin state and chromosomal interactions have revealed dynamic neurodevelopmental cRE landscapes, parallel understanding of the underlying TF binding lags. To elucidate the combinatorial TF-cRE interactions driving mouse basal ganglia development, we integrated ChIP-seq for twelve TFs, H3K4me3-associated enhancer-promoter interactions, chromatin and transcriptional state, and transgenic enhancer assays. We identified TF-cRE modules with distinct chromatin features and enhancer activity that have complementary roles driving GABAergic neurogenesis and suppressing other developmental fates. While the majority of distal cREs were bound by one or two TFs, a small proportion were extensively bound, and these enhancers also exhibited exceptional evolutionary conservation, motif density, and complex chromosomal interactions. Our results provide new insights into how modules of combinatorial TF-cRE interactions activate and repress developmental expression programs and demonstrate the value of TF binding data in modeling gene regulatory wiring.
Project description:Genetic regulatory networks (GRNs) regulate the flow of genetic information from the genome to expressed messenger RNAs (mRNAs) and thus are critical to controlling the phenotypic characteristics of cells. Numerous methods exist for profiling mRNA transcript levels and identifying protein-DNA binding interactions at the genome-wide scale. These enable researchers to determine the structure and output of transcriptional regulatory networks, but uncovering the complete structure and regulatory logic of GRNs remains a challenge. The field of GRN inference aims to meet this challenge using computational modeling to derive the structure and logic of GRNs from experimental data and to encode this knowledge in Boolean networks, Bayesian networks, ordinary differential equation (ODE) models, or other modeling frameworks. However, most existing models do not incorporate dynamic transcriptional data, since it has historically been less widely available than “static” transcriptional data. We report the development of an evolutionary algorithm-based ODE modeling approach that integrates kinetic transcription data and the theory of attractor dynamics analysis to infer GRN architecture and regulatory logic. Our method outperformed six leading GRN inference methods, none of which incorporate kinetic transcriptional data, in predicting regulatory connections among TFs when applied to a small-scale engineered synthetic GRN in S. cerevisiae. Moreover, we have shown the potential of our method to predict unknown transcription profiles that would be produced upon genetic perturbation of the GRN governing a two-state phenotypic switch in C. albicans. We established an iterative refinement strategy to facilitate candidate selection for experimentation, and the experimental results in turn provide validation or improvement for the model. In this way, our GRN inference approach can expedite the development of a sophisticated mathematical model that accurately describes the structure and dynamics of the in vivo GRN.
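To make the link between ODE models, attractors, and a two-state switch concrete, the sketch below simulates a textbook two-gene mutual-repression circuit with forward Euler integration. It is not the authors' model or their evolutionary algorithm. The Hill-type equations, the parameter values (`alpha=4.0`, `hill=2`), the step size, and the function name `simulate_toggle` are all assumptions chosen for illustration.

```python
# Illustrative sketch (assumed equations and parameters, not the authors' model):
# two genes that repress each other,
#   dx/dt = alpha / (1 + y**hill) - x
#   dy/dt = alpha / (1 + x**hill) - y
# integrated with forward Euler. With these parameters the system has two
# stable attractors, and the initial condition decides which one is reached.

def simulate_toggle(x0, y0, alpha=4.0, hill=2, dt=0.01, steps=5000):
    """Integrate the mutual-repression ODEs and return the final (x, y)."""
    x, y = x0, y0
    for _ in range(steps):
        dx = alpha / (1.0 + y ** hill) - x
        dy = alpha / (1.0 + x ** hill) - y
        x += dt * dx
        y += dt * dy
    return x, y


# Starting with gene x slightly ahead settles in the "x high, y low" state.
x_high = simulate_toggle(2.0, 1.0)
# Starting with gene y slightly ahead settles in the mirror-image state.
y_high = simulate_toggle(1.0, 2.0)

print("start (2, 1) ->", x_high)
print("start (1, 2) ->", y_high)
```

The two runs differ only in their starting point yet end in opposite expression states, which is the behaviour an attractor analysis of a phenotypic switch looks for.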