Project description:The human K562 chronic myeloid leukemia cell line has long served as an experimental paradigm for functional genomic studies. To systematically and functionally annotate the human genome, the ENCODE consortium generated hundreds of functional genomic data sets, such as chromatin immunoprecipitation coupled to sequencing (ChIP-seq). While ChIP-seq analyses have provided tremendous insights into gene regulation, spatiotemporal insights were limited by a resolution of several hundred base pairs. ChIP-exonuclease (ChIP-exo) is a refined version of ChIP-seq that overcomes this limitation by providing higher precision mapping of protein-DNA interactions. To study the interplay of transcription initiation and chromatin, we profiled the genome-wide locations for RNA polymerase II (Pol II), the histone variant H2A.Z, and the histone modification H3K4me3 using ChIP-seq and ChIP-exo. In this Data Descriptor, we present detailed information on parallel experimental design, data generation, quality control analysis, and data validation. We discuss how these data lay the foundation for future analysis to understand the relationship between the occupancy of Pol II and nucleosome positions at near base pair resolution.
Project description:While a role of promoter-proximal RNA Polymerase II (Pol II) pausing in regulation of eukaryotic gene expression is implied, the mechanisms and dynamics of this process are poorly understood. We performed genome-wide analysis of short capped RNAs (scRNAs) and Pol II chromatin immunoprecipitation sequencing (ChIP-seq) in human breast cancer MCF-7 cells to better understand Pol II pausing (Samarakkody, A., Abbas, A., Scheidegger, A., Warns, J., Nnoli, O., Jokinen, B., Zarns, K., Kubat, B., Dhasarathy, A. and Nechaev, S. (2015) RNA polymerase II pausing can be retained or acquired during activation of genes involved in the epithelial to mesenchymal transition. Nucleic Acids Res 43, 3938-3949). The data are available at the NCBI Gene Expression Omnibus under accession number GSE67041. For both ChIP and scRNA samples, we used paired-end sequencing on the Illumina MiSeq instrument. For ChIP-seq, the use of paired-end sequencing allowed us to avoid ambiguities in center-read definition. For scRNA-seq, this allowed us to identify both the 5'-end and the 3'-end in the same run that represent, respectively, the transcription start sites and the locations of Pol II pausing. The sharpening of Pol II ChIP-seq metagene profiles when aligned against 5'-ends of scRNAs indicates that these RNAs can be used to define the start sites for the majority of mRNA transcription events.
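The point about paired-end reads removing ambiguity in the fragment center can be illustrated with a minimal sketch. It is not the authors' pipeline: the `ReadPair` type, the 0-based half-open coordinates, and the 200 bp default fragment length used for the single-end comparison are all assumptions made for illustration.

```python
from typing import NamedTuple


class ReadPair(NamedTuple):
    """Hypothetical container for one aligned paired-end fragment."""
    chrom: str
    start: int  # 0-based leftmost coordinate of the fragment
    end: int    # 0-based exclusive rightmost coordinate of the fragment


def fragment_center(pair: ReadPair) -> int:
    """Midpoint of the sequenced fragment, rounded down.

    Both fragment ends are observed, so no fragment length is assumed.
    """
    if pair.end <= pair.start:
        raise ValueError("fragment end must be greater than start")
    return pair.start + (pair.end - pair.start) // 2


def estimated_center_single_end(five_prime: int, strand: str,
                                assumed_fragment_len: int = 200) -> int:
    """Single-end estimate: shift the 5' end by half an ASSUMED fragment length."""
    half = assumed_fragment_len // 2
    return five_prime + half if strand == "+" else five_prime - half


pairs = [ReadPair("chr1", 1000, 1200), ReadPair("chr1", 1050, 1201)]
centers = [fragment_center(p) for p in pairs]  # [1100, 1125]
```

With single-end data, the second fragment (151 bp) would be centered at 1150 under the 200 bp assumption rather than at its observed midpoint of 1125, which is the kind of ambiguity the paired-end design avoids.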
Project description:Use of alternative gene promoters that drive widespread cell-type, tissue-type or developmental gene regulation in mammalian genomes is a common phenomenon. Chromatin immunoprecipitation methods coupled with DNA microarray (ChIP-chip) or massive parallel sequencing (ChIP-seq) are enabling genome-wide identification of active promoters in different cellular conditions using antibodies against Pol-II. However, these methods produce enrichment not only near the gene promoters but also inside the genes and other genomic regions due to the non-specificity of the antibodies used in ChIP. Further, the use of these methods is limited by their high cost and strong dependence on cellular type and context. We trained and tested different state-of-the-art ensemble and meta classification methods for identification of Pol-II enriched promoter and Pol-II enriched non-promoter sequences, each of length 500 bp. The classification models were trained and tested on a benchmark dataset, using a set of 39 different feature variables that are based on chromatin modification signatures and various DNA sequence features. The best performing model was applied to seven published ChIP-seq Pol-II datasets to provide genome-wide annotation of mouse gene promoters. We present a novel algorithm based on supervised learning methods to discriminate promoter-associated Pol-II enrichment from enrichment elsewhere in the genome in ChIP-chip/seq profiles. We accumulated a dataset of 11,773 promoter and 46,167 non-promoter sequences, each of length 500 bp, generated from RNA Pol-II ChIP-seq data of five tissues (Brain, Kidney, Liver, Lung and Spleen). We evaluated the classification models in building the best predictor and found that Bagging and Random Forest based approaches give the best accuracy.
We applied the algorithm to seven different published ChIP-seq datasets to provide a comprehensive set of promoter annotations for both protein-coding and non-coding genes in the mouse genome. The resulting annotations contain 13,413 (4,747) protein-coding (non-coding) genes with single promoters and 9,929 (1,858) protein-coding (non-coding) genes with two or more alternative promoters, and a significant number of unassigned novel promoters. Our new algorithm can successfully predict the promoters from the genome-wide profile of Pol-II bound regions. In addition, our algorithm performs significantly better than existing promoter prediction methods and can be applied for genome-wide predictions of Pol-II promoters.
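As a rough illustration of the bagging idea mentioned above (bootstrap resampling plus majority voting), the following sketch classifies toy feature vectors as promoter (1) or non-promoter (0). It is not the published algorithm: the decision-stump base learner, the two toy features standing in for the 39 feature variables, and all function names are assumptions chosen to keep the example self-contained.

```python
import random
from collections import Counter


def fit_stump(rows, labels):
    """Choose the (feature, threshold) whose rule 'value >= threshold -> 1'
    makes the fewest errors on the given sample."""
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            errors = sum((r[f] >= t) != bool(y) for r, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]


def fit_bagging(rows, labels, n_estimators=25, seed=0):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    n = len(rows)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([rows[i] for i in idx], [labels[i] for i in idx]))
    return models


def predict(models, row):
    """Majority vote over all stumps; an odd n_estimators avoids ties."""
    votes = Counter(int(row[f] >= t) for f, t in models)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two features per 500 bp window, label 1 = promoter.
rows = [(0.9, 0.8), (0.8, 0.7), (0.85, 0.9), (0.7, 0.75),
        (0.2, 0.1), (0.1, 0.3), (0.3, 0.2), (0.15, 0.25)]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
models = fit_bagging(rows, labels)
```

A Random Forest differs mainly in using full decision trees and sampling a random subset of features at each split; in practice an established library implementation would be preferable to this hand-rolled version.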
Project description:Alternative promoters that are differentially used in various cellular contexts and tissue types add to the transcriptional complexity of the mammalian genome. Identification of alternative promoters and the annotation of their activity in different tissues is one of the major challenges in understanding the transcriptional regulation of mammalian genes and their isoforms. To determine the use of alternative promoters in different tissues, we performed ChIP-seq experiments using an antibody against RNA Pol-II in five adult mouse tissues (brain, liver, lung, spleen and kidney). Our analysis identified 38 639 Pol-II promoters, including 12 270 novel promoters, for both protein-coding and non-coding mouse genes. Of these, 6384 promoters are tissue specific and CpG poor, and only 34% of the novel promoters are located in CpG-rich regions, suggesting that novel promoters are mostly tissue specific. By identifying the Pol-II bound promoter(s) of each annotated gene in a given tissue, we found that 37% of the protein-coding genes use alternative promoters in the five mouse tissues. The promoter annotations and ChIP-seq data presented here will aid ongoing efforts of characterizing gene regulatory regions in mammalian genomes.
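The per-tissue bookkeeping described above can be sketched as follows. The input layout, the gene and promoter identifiers, and the working definitions (a gene "uses alternative promoters" if two or more distinct promoters are bound across tissues; a promoter is "tissue specific" if bound in exactly one tissue) are assumptions for illustration and may differ from the criteria used in the study.

```python
from collections import defaultdict


def promoter_usage(bound):
    """bound: {tissue: {gene: set of promoter ids with Pol-II signal}}.

    Returns (genes with >= 2 bound promoters,
             (gene, promoter) pairs bound in exactly one tissue,
             fraction of genes using alternative promoters).
    """
    promoters_by_gene = defaultdict(set)
    tissues_by_promoter = defaultdict(set)
    for tissue, genes in bound.items():
        for gene, promoters in genes.items():
            promoters_by_gene[gene] |= promoters
            for p in promoters:
                tissues_by_promoter[(gene, p)].add(tissue)
    alt_genes = {g for g, ps in promoters_by_gene.items() if len(ps) >= 2}
    tissue_specific = {k for k, ts in tissues_by_promoter.items() if len(ts) == 1}
    fraction = len(alt_genes) / len(promoters_by_gene) if promoters_by_gene else 0.0
    return alt_genes, tissue_specific, fraction


# Toy input with hypothetical identifiers.
bound = {
    "brain": {"GeneA": {"P1"}, "GeneB": {"P1"}},
    "liver": {"GeneA": {"P2"}, "GeneB": {"P1"}},
    "lung": {"GeneC": {"P1"}},
}
alt_genes, tissue_specific, fraction = promoter_usage(bound)
```

In this toy input only GeneA switches promoters between tissues, so one of three genes is counted as using alternative promoters.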
Project description:RNA Polymerase II transcribes protein-coding and many non-coding RNA genes in eukaryotes. The largest subunit of RNA Polymerase II, Rpb1, contains a heptapeptide repeat on its C-terminal tail with three potential phosphorylation sites (Serine 2, Serine 5 and Serine 7). Mammalian Rpb1 contains 52 repeats. The phosphorylation events are catalyzed by specific protein kinases, and the phosphorylation of specific residues is coupled to the transcription cycle. For example, the Cdk7 subunit of TFIIH phosphorylates both Serine 5 and Serine 7 during initiation and the Cdk9 subunit of P-TEFb phosphorylates Serine 2 during the transition into productive elongation. The dataset presented here is the genome-wide distribution of RNA Pol II with Serine 7 of the CTD phosphorylated in murine embryonic stem cells. These data, in addition to phospho-specific datasets generated in the same cell type in Rahl et al. Cell 2010 and Seila et al. Science 2008, represent the genome-wide distribution of multiple RNA Pol II isoforms in murine embryonic stem cells: total Pol II, hypophosphorylated CTD Pol II, Serine 2 phosphorylated CTD Pol II, Serine 5 phosphorylated CTD Pol II and Serine 7 phosphorylated CTD Pol II. An antibody specific to RNA Pol II Serine 7 phosphorylated CTD (gift of Dirk Eick; Chapman et al. Science 2008) was used to enrich for DNA fragments associated with this Pol II isoform in murine embryonic stem cells. DNA was purified and prepared for Illumina/Solexa sequencing following their standard protocol. This is a single dataset; together with the datasets from Rahl et al. Cell 2010 and Seila et al. Science 2008, it completes the set of Pol II isoform profiles listed above.
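To make the repeat arithmetic concrete, the sketch below enumerates the candidate Serine 2, 5 and 7 positions in an idealized CTD built from 52 copies of the consensus heptad YSPTSPS. This is a simplification: many repeats in the real mammalian CTD deviate from the consensus, and the function name and output layout are hypothetical.

```python
CONSENSUS = "YSPTSPS"          # consensus heptad: Y1 S2 P3 T4 S5 P6 S7
PHOSPHO_SERINES = (2, 5, 7)    # 1-based positions within one heptad


def serine_sites(ctd: str, repeat_len: int = 7):
    """Return (repeat number, heptad position) for each serine found at
    position 2, 5 or 7 of each complete repeat in the given sequence."""
    sites = []
    for i in range(0, len(ctd) - repeat_len + 1, repeat_len):
        repeat = ctd[i:i + repeat_len]
        for pos in PHOSPHO_SERINES:
            if repeat[pos - 1] == "S":
                sites.append((i // repeat_len + 1, pos))
    return sites


idealized_ctd = CONSENSUS * 52      # idealized, not the real Rpb1 sequence
sites = serine_sites(idealized_ctd)
n_sites = len(sites)                # 52 repeats x 3 serines = 156
```

A non-consensus repeat such as YSPTSPK would contribute only its Serine 2 and Serine 5 positions, which is one reason the count for the real protein is lower than this idealized figure.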