Sequence basis of transcription initiation in the human genome
ABSTRACT: Transcription initiation is an essential process for ensuring proper function of any gene; however, we still lack a unified understanding of the sequence patterns and rules that explain most transcription initiation sites in the human genome. By explaining transcription initiation at basepair resolution from sequence with a deep learning-inspired explainable modeling approach, here we show that simple rules can explain the vast majority of human promoters. We identified key sequence patterns that contribute to human promoter function, each activating transcription with a distinct position-specific effect curve that likely reflects its mechanism of promoting transcription initiation. Most of these position-specific effects have not been previously characterized, and we verified them using experimental perturbations of transcription factor binding sequences. We revealed the sequence basis of bidirectional transcription at promoters and the links between promoter selectivity and gene expression variation across cell types. Additionally, by analyzing 241 mammalian genomes and mouse transcription initiation site data, we showed that the sequence determinants are conserved across mammalian species. Taken together, we provide a unified model for the sequence basis of transcription initiation at basepair resolution that is broadly applicable across mammalian species, which sheds new light on fundamental questions related to promoter sequence and function.
Project description:DNA sequence signals in the core promoter, such as the initiator (Inr), direct transcription initiation by RNA polymerase II. Here we show that the human Inr has the consensus of BBCA+1BW at focused promoters in which transcription initiates at a single site or a narrow cluster of sites. The analysis of 7,678 focused transcription start sites revealed 40% with a perfect match to the Inr and 16% with a single mismatch outside of the CA+1 core. TATA-like sequences are underrepresented in Inr promoters. This consensus is a key component of the DNA sequence rules that specify transcription initiation in humans.
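As an illustration of how a consensus such as BBCA+1BW can be checked against a sequence, the following is a minimal sketch and not the authors' pipeline. It assumes the standard IUPAC codes (B = C, G or T; W = A or T), treats the six consensus letters as positions -3 to +3 with the A at +1, and uses hypothetical helper names (`inr_mismatches`, `classify`) invented for this example.

```python
# IUPAC degenerate codes used by the consensus: B = C/G/T, W = A/T.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "B": "CGT", "W": "AT"}

INR_CONSENSUS = "BBCABW"  # positions -3, -2, -1, +1, +2, +3 (there is no position 0)
PLUS_ONE_INDEX = 3        # 0-based index of the +1 A within the consensus
CORE_INDICES = {2, 3}     # the C at -1 and the A at +1 (the "CA+1 core")


def inr_mismatches(seq, tss_index):
    """Count mismatches to the consensus around a TSS.

    seq is a DNA string; tss_index is the 0-based position of the +1
    nucleotide. Returns (total_mismatches, core_mismatches).
    """
    start = tss_index - PLUS_ONE_INDEX
    end = start + len(INR_CONSENSUS)
    if start < 0 or end > len(seq):
        raise ValueError("TSS is too close to the sequence edge")
    window = seq[start:end].upper()
    total = core = 0
    for i, (code, base) in enumerate(zip(INR_CONSENSUS, window)):
        if base not in IUPAC[code]:
            total += 1
            if i in CORE_INDICES:
                core += 1
    return total, core


def classify(seq, tss_index):
    """Label a TSS as a perfect match, a single non-core mismatch, or other."""
    total, core = inr_mismatches(seq, tss_index)
    if total == 0:
        return "perfect"
    if total == 1 and core == 0:
        return "single_mismatch_outside_core"
    return "other"
```

For example, `classify("GGTTCATTGG", 5)` examines the window `TTCATT` around the A at index 5. Applying such a classifier to a set of focused start sites would give tallies of the kind reported above, under the stated assumptions about position numbering.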
Project description:Transcription regulation occurs frequently through promoter-associated pausing of RNA polymerase II (Pol II). We developed a Precision nuclear Run-On and sequencing assay (PRO-seq) to map the genome-wide distribution of transcriptionally-engaged Pol II at base-pair resolution. Pol II accumulates immediately downstream of promoters, at intron-exon junctions that are efficiently used for splicing, and over 3' poly-adenylation sites. Focused analyses of promoters reveal that pausing is not fixed relative to initiation sites nor is it specified directly by the position of a particular core promoter element or the first nucleosome. Core promoter elements function beyond initiation, and when optimally positioned they act collectively to dictate the position and strength of pausing. We test this ‘Complex Interaction’ model with insertional mutagenesis of the Drosophila Hsp70 core promoter. Identification of RNA polymerase active sites in Drosophila S2 cell line using PRO-seq method. Identification of transcription initiation sites in Drosophila S2 cell line using PRO-cap method. Identification of changes in RNA polymerase active sites on transgenic Hsp70 promoters upon disruption of DNA sequence elements in 3 transgenic fly lines using PRO-seq method.
Project description:How DNA sequence affects the dynamics and position of RNA Polymerase II during transcription remains poorly understood. Here we used naturally occurring genetic variation in F1 hybrid mice to explore how DNA sequence differences affect the genome-wide distribution of Pol II. We measured the position and orientation of Pol II in eight organs collected from heterozygous F1 hybrid mice using ChRO-seq. Our data revealed a strong genetic basis for the precise coordinates of transcription initiation and promoter proximal pause, which was composed of both existing and novel DNA sequence motifs, allowing us to redefine molecular models of both core transcriptional processes. Our results implicate the strength of base pairing between A-T or G-C dinucleotides as a key determinant of the position of Pol II initiation and pause. We reveal substantial and heritable differences in the position of transcription termination, which frequently do not affect the composition of the mature mRNA. Finally, we identified frequent, organ-specific changes in transcription that affect mRNA and ncRNA expression across broad genomic domains. Collectively, we reveal how DNA sequences shape core transcriptional processes at single nucleotide resolution in mammals.
Project description:Despite the conventional distinction between promoters and enhancers, they share many features in mammals, including divergent transcription and similar modes of transcription factor (TF) binding. Here, we examine the architecture of transcription initiation genome-wide through comprehensive mapping of transcription start sites (TSSs) in human lymphoblastoid B-cell (GM12878) and chronic myelogenous leukemic (K562) tier 1 ENCODE cell lines using a nuclear run-on protocol called GRO-cap. This method captures TSSs for both stable and unstable transcripts, thus allowing us to conduct detailed comparisons between thousands of promoters and enhancers in human cells. These analyses reveal a common architecture of initiation at both promoters and enhancers, including tightly spaced (110 bp) divergent initiation that features similar frequencies of core-promoter sequence elements, highly-positioned flanking nucleosomes, and two modes of TF binding. Transcript elongation stability, a feature determined after transcription initiation, provides a more fundamental distinction between promoters and enhancers than the relative abundance of histone modifications and the presence of TFs or coactivators. These results support a unified model of transcription initiation at both promoters and enhancers.
Project description:Transcription start site (TSS) selection is a key step in gene expression and occurs at many promoter positions over a wide range of efficiencies. Here, we develop a massively parallel reporter assay to quantitatively dissect contributions of promoter sequence, NTP substrate levels, and RNA polymerase II (Pol II) activity to TSS selection by "promoter scanning" in Saccharomyces cerevisiae (Pol II MAssively Systematic Transcript End Readout, "Pol II MASTER"). Using Pol II MASTER, we measure the efficiency of Pol II initiation at 1,000,000 individual TSS sequences in a defined promoter context. Pol II MASTER confirms proposed critical qualities of S. cerevisiae TSS -8, -1, and +1 positions quantitatively in a controlled promoter context. Pol II MASTER extends quantitative analysis to surrounding sequences and determines that they tune initiation over a wide range of efficiencies. These results enabled the development of a predictive model for initiation efficiency based on sequence. We show that genetic perturbation of Pol II catalytic activity alters initiation efficiency mostly independently of TSS sequence, but selectively modulates preference for initiating nucleotide. Intriguingly, we find that Pol II initiation efficiency is directly sensitive to GTP levels at the first five transcript positions and to CTP and UTP levels at the second position genome wide. These results suggest individual NTP levels can have transcript-specific effects on initiation, representing a cryptic layer of potential regulation at the level of Pol II biochemical properties. The results establish Pol II MASTER as a method for quantitative dissection of transcription initiation in eukaryotes.
Project description:RNA polymerase II (RNAPII) transcription converts the DNA sequence of a single gene into multiple transcript isoforms that may carry alternative functions. Gene isoforms result from variable transcription start sites (TSSs) at the beginning and polyadenylation sites (PASs) at the end of transcripts. How alternative TSSs relate to variable PASs is poorly understood. Here, we identify both ends of RNA molecules in Arabidopsis thaliana by transcription isoform sequencing (TIF-seq) and report four transcript isoforms per expressed gene. While intragenic initiation represents a large source of regulated isoform diversity, we observe that ~14% of expressed genes generate relatively unstable short promoter-proximal RNAs (sppRNAs) from nascent transcript cleavage and polyadenylation shortly after initiation. The location of sppRNAs correlates with the position of promoter-proximal RNAPII stalling, indicating that large pools of promoter-stalled RNAPII may engage in transcriptional termination. We propose that promoter-proximal RNAPII stalling linked to premature transcriptional termination may represent a checkpoint that governs plant gene expression.
Project description:CRISPR interference (CRISPRi), the targeting of a catalytically dead Cas protein to block transcription, is the leading technique to silence gene expression in bacteria. However, design rules for CRISPRi remain poorly defined, limiting predictable design for gene interrogation, pathway manipulation, and high-throughput screens. Here we develop a best-in-class prediction algorithm for guide silencing efficiency by systematically investigating factors influencing guide depletion in multiple genome-wide essentiality screens, with the surprising discovery that gene-specific features such as transcriptional activity substantially impact prediction of guide activity. Accounting for these features as part of algorithm development allowed us to develop a mixed-effect random forest regression model that provides better estimates of guide efficiency than existing methods, as demonstrated in an independent saturating screen. We further applied methods from explainable AI to extract interpretable design rules from the model, such as sequence preferences in the vicinity of the PAM distinct from those previously described for genome engineering applications. Our approach provides a blueprint for the development of predictive models for CRISPR technologies where only indirect measurements of guide activity are available.
2023-12-19 | GSE196911 | GEO
Project description:XACT-seq comprehensively defines the promoter position and promoter sequence determinants for initial transcription pausing