DeepSTARR predicts enhancer activity from DNA sequence and enables the de novo design of synthetic enhancers [Drosophila genome-wide UMI-STARR-seq]
ABSTRACT: Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood and enhancer de novo design is considered impossible. Here we built a deep learning model, DeepSTARR, to quantitatively predict the activities of thousands of developmental and housekeeping enhancers directly from DNA sequence in Drosophila melanogaster S2 cells. The model learned relevant TF motifs and higher-order syntax rules, including functionally non-equivalent instances of the same TF motif that are determined by motif-flanking sequence and inter-motif distances. We validated these rules experimentally and demonstrated their conservation in human by testing more than 40,000 wildtype and mutant Drosophila and human enhancers. Finally, we designed and functionally validated synthetic enhancers with desired activities de novo.
Project description:Enhancer sequences control gene expression and comprise binding sites (motifs) for different transcription factors (TFs). Despite extensive genetic and computational studies, the relationship between DNA sequence and regulatory activity is poorly understood and enhancer de novo design is considered impossible. Here we built a deep learning model, DeepSTARR, to quantitatively predict the activities of thousands of developmental and housekeeping enhancers directly from DNA sequence in Drosophila melanogaster S2 cells. The model learned relevant TF motifs and higher-order syntax rules, including functionally non-equivalent instances of the same TF motif that are determined by motif-flanking sequence and inter-motif distances. We validated these rules experimentally and demonstrated their conservation in human by testing more than 40,000 wildtype and mutant Drosophila and human enhancers. Finally, we designed and functionally validated synthetic enhancers with desired activities de novo. This SuperSeries is composed of the SubSeries listed below.
Project description:The information about when and where each gene is to be expressed is mainly encoded in the DNA sequence of enhancers, sequence elements that comprise binding sites (motifs) for different transcription factors (TFs). Most of the research on enhancer sequences has focused on TF motif presence, while enhancer syntax, i.e. the flexibility of important motif positions and how the sequence context modulates the activity of TF motifs, remains poorly understood. Here, we explore the rules of enhancer syntax by a two-pronged approach in Drosophila melanogaster S2 cells: we (1) replace important motifs by an exhaustive set of all possible 65,536 eight-nucleotide-long random sequences and (2) paste eight important TF motif types into 763 motif positions within 496 enhancers. These complementary strategies reveal that enhancers display constrained sequence flexibility and the context-specific modulation of motif function. Important motifs can be functionally replaced by hundreds of sequences constituting several distinct motif types, but only a fraction of all possible sequences and motif types restore enhancer activity. Moreover, TF motifs contribute with different intrinsic strengths that are strongly modulated by the enhancer sequence context (the flanking sequence, presence and diversity of other motif types, and distance between motifs), such that not all motif types can work in all positions. Constrained sequence flexibility and the context-specific modulation of motif function are also hallmarks of human enhancers and TF motifs, as we demonstrate experimentally. Overall, these two general principles of enhancer sequences are important to understand and predict enhancer function during development, evolution and in disease.
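The figure of 65,536 replacement sequences quoted above is the number of distinct DNA sequences of length eight (4^8). As a minimal sketch of that arithmetic only, and not the authors' library-design code, the exhaustive set can be enumerated as follows; the names BASES and all_kmers are hypothetical and chosen purely for illustration.

```python
from itertools import product

# The four DNA bases; every 8-mer is one choice of base per position.
BASES = "ACGT"


def all_kmers(k):
    """Yield every DNA sequence of length k in lexicographic order."""
    for combo in product(BASES, repeat=k):
        yield "".join(combo)


# Enumerate the exhaustive set of eight-nucleotide sequences.
octamers = list(all_kmers(8))
print(len(octamers))  # 65536, i.e. 4**8
```

The list begins at AAAAAAAA and ends at TTTTTTTT, and it contains no repeats because each sequence corresponds to exactly one combination of bases.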
Project description:Gene expression is determined by genomic elements called enhancers, which contain short motifs bound by different transcription factors (TFs). However, how enhancer sequences and TF motifs relate to enhancer activity is unknown and general sequence requirements for enhancers or comprehensive sets of important enhancer sequence elements have remained elusive. Here, we computationally dissect thousands of functional enhancer sequences from three different Drosophila cell lines. We find that the enhancers display distinct cis-regulatory sequence signatures, which are predictive of the enhancers’ cell type-specific or broad activities. These signatures contain transcription factor motifs and a novel class of enhancer sequence elements, dinucleotide repeat motifs (DRMs). DRMs are highly enriched in enhancers, particularly in enhancers that are broadly active across different cell types. We experimentally validate the importance of the identified TF motifs and DRMs for enhancer function and show that they can be sufficient to create an active enhancer de novo from non-functional sequence. The function of DRMs as a novel class of general enhancer features that are also enriched in human regulatory regions might explain their implication in several diseases and provides important insights into gene regulation. STARR-seq was performed in BG3 cells with paired-end sequencing in two replicates and respective inputs.
Project description:Gene expression is determined by genomic elements called enhancers, which contain short motifs bound by different transcription factors (TFs). However, how enhancer sequences and TF motifs relate to enhancer activity is unknown and general sequence requirements for enhancers or comprehensive sets of important enhancer sequence elements have remained elusive. Here, we computationally dissect thousands of functional enhancer sequences from three different Drosophila cell lines. We find that the enhancers display distinct cis-regulatory sequence signatures, which are predictive of the enhancers’ cell type-specific or broad activities. These signatures contain transcription factor motifs and a novel class of enhancer sequence elements, dinucleotide repeat motifs (DRMs). DRMs are highly enriched in enhancers, particularly in enhancers that are broadly active across different cell types. We experimentally validate the importance of the identified TF motifs and DRMs for enhancer function and show that they can be sufficient to create an active enhancer de novo from non-functional sequence. The function of DRMs as a novel class of general enhancer features that are also enriched in human regulatory regions might explain their implication in several diseases and provides important insights into gene regulation.
Project description:Genomic approaches have predicted hundreds of thousands of tissue-specific cis-regulatory sequences, but the determinants critical to their function and evolutionary history are mostly unknown [1-4]. Here, we systematically decode a set of brain enhancers active in the zona limitans intrathalamica (zli), a signaling center essential for vertebrate forebrain development via the secreted morphogen, Sonic hedgehog (Shh) [5,6]. We apply a de novo motif analysis tool to identify six position-independent sequence motifs together with their cognate transcription factors that are essential for zli enhancer activity and Shh expression in the mouse embryo. Using knowledge of this regulatory lexicon, we discover novel Shh zli enhancers in mice, and a functionally equivalent element in hemichordates, indicating an ancient origin of the Shh zli regulatory network that predates the chordate phylum. These findings establish a paradigm for delineating functionally conserved enhancers in the absence of overt sequence homologies, and over extensive evolutionary distances. Gene expression profiles from the mouse zona limitans intrathalamica (ZLI) region at E10.5.
Project description:Sequence-specific transcription factors (TFs) regulate gene expression by binding to cognate motifs in promoters and enhancers. However, predicting genomic TF binding events and their quantitative contribution to expression remains a major challenge. In principle, the binding and enhancer activity of specific sites in vivo might depend on: (i) latent properties of the motif instance, (ii) cooperative interactions with other TFs that bind in the immediate vicinity, and (iii) the chromatin state of the sites in the genome. Here, we used massively parallel reporter assays (MPRA) involving 32,115 natural and synthetic enhancers, together with high-throughput in vivo assays, to systematically dissect the contributions of motif affinity, cooperative interactions, and chromatin accessibility to the binding and regulatory activity of genomic sequences that contain motifs for PPARγ, a TF that serves as a key regulator of adipogenesis. We show that PPARγ binding and enhancer activity are governed by distinct features. Genomic PPARγ binding to motif sites is largely governed by larger-scale features, such as chromatin accessibility, whereas the degree to which a PPARγ motif site enhances transcriptional activity depends on the sequence immediately surrounding the motif. We detect and functionally validate a network of TFs comprised of multiple functional classes that collaborate with PPARγ to drive transcription. We extensively perturb this network, revealing functional cooperativity among classes of TFs that does not depend on precise positioning. Together, these results present a clear picture of how chromatin and TFs from distinct functional classes interact with PPARγ to determine binding and enhancer activity, and provide a paradigm for studying any TF.