Unknown,Transcriptomics,Genomics,Proteomics

Dataset Information

Small RNA-seq from ENCODE/Cold Spring Harbor Lab

ABSTRACT: This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Jonathan Preall jpreall@cshl.edu (Generation 0 Data from Hannon Lab), Carrie Davis davisc@cshl.edu (experimental), Alex Dobin dobin@cshl.edu (computational), Wei Lin wlin@cshl.edu (computational), Tom Gingeras gingeras@cshl.edu (primary investigator)). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). hg18: This data was produced by the Hannon lab at Cold Spring Harbor Laboratory as part of the ENCODE Project. The series depicts NextGen sequencing information for RNAs between the sizes of 20-200 nt isolated from RNA samples from tissues or subcellular compartments of cell lines. hg19: This track depicts NextGen sequencing information for RNAs between the sizes of 20-200 nt isolated from RNA samples from tissues or subcellular compartments from ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome. hg19: This cloning protocol generates directional libraries that are read from the 5' ends of the inserts, which should largely correspond to the 5' ends of the mature RNAs. The libraries were sequenced on a Solexa platform for a total of 36, 50 or 76 cycles; however, the reads undergo post-processing that trims their 3' ends. Consequently, the mapped read lengths are variable. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf hg18: Small RNAs between 20-200 nt were ribominus treated according to the manufacturer's protocol (Invitrogen) using custom LNA probes targeting ribosomal RNAs (some datasets are also depleted of U snRNAs and highly abundant microRNAs). The RNA was treated with Tobacco Alkaline Pyrophosphatase to eliminate any 5' cap structure. 
Poly-A Polymerase was used to catalyze the addition of C's to the 3' end. The 5' ends were phosphorylated using T4 PNK and an RNA linker was ligated onto the 5' end. Reverse transcription was carried out using a poly-G oligo with a defined 5' extension. The inserts were then amplified using oligos targeting the 5' linker and poly-G extension and containing sequencing adapters. The library was sequenced on an Illumina GA machine for a total of 36, 50 or 76 cycles. Initially, one lane is run. If an appreciable number of mappable reads are obtained, additional lanes are run. Sequence reads underwent quality filtration using the Illumina standard pipeline (GERALD). The read lengths may exceed the insert sizes and consequently introduce 3' adaptor sequence into the 3' end of the reads. The 3' sequencing adaptor was removed from the reads using a custom clipper program, which aligned the adaptor sequence to the short-reads, allowing up to 2 mismatches and no indels. Regions that aligned were "clipped" off from the read. The trimmed portions were collapsed into identical reads, their count noted and aligned to the human genome (NCBI build 36, hg18 unmasked) using Nexalign (Lassmann et al., not published). The alignment parameters are tuned to tolerate up to 2 mismatches with no indels and will allow for trimmed portions as small as 5 nucleotides to be mapped. We report reads that mapped 10 or fewer times. Data obtained from each lane is processed and mapped independently. The processed/mapped data from each lane is then compiled into a single track without additional processing and submitted to UCSC. Consequently, identical reads within a lane were collapsed and their value is reported as the "transfrag" signal value. However, the redundancy between lanes has not been eliminated so the same transfrag may appear multiple times within a signal. 
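The hg18 clipping and collapsing steps described above can be sketched roughly as follows. This is an illustrative Python sketch only, not the custom clipper program itself: the adapter sequence, the example reads, the `min_overlap` guard and the choice to cut at the leftmost acceptable match are all assumptions made for the example.

```python
from collections import Counter


def clip_adapter(read, adapter, max_mismatches=2, min_overlap=5):
    """Clip a 3' adapter from a read using ungapped matching.

    Scans left to right and cuts at the first position where the adapter
    (or the adapter prefix that fits before the read ends) matches with
    at most `max_mismatches` substitutions. No indels are considered.
    `min_overlap` is an assumption of this sketch, not a documented setting.
    """
    for start in range(len(read) - min_overlap + 1):
        window = read[start:start + len(adapter)]
        mismatches = sum(a != b for a, b in zip(window, adapter))
        if mismatches <= max_mismatches:
            return read[:start]
    return read


def collapse_reads(reads):
    """Collapse identical sequences and record how often each occurred."""
    return Counter(reads)


# Hypothetical adapter and reads, for illustration only.
ADAPTER = "TCGTATGCCGTC"
reads = [
    "AAAAAAAAAAAAAAAA" + "TCGTATGCCG",  # insert followed by an adapter prefix
    "AAAAAAAAAAAAAAAA" + "TCGAATGCCG",  # same insert, one adapter mismatch
]
clipped = [clip_adapter(r, ADAPTER) for r in reads]
counts = collapse_reads(clipped)
```

Both example reads clip back to the same 16 nt insert, so the collapsed table holds one sequence with a count of 2. A short partial overlap with two mismatches allowed is permissive, so a production clipper would likely need stricter rules for matches near the read end.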
hg19: Small RNAs between 20-200 nt were ribominus treated according to the manufacturer's protocol (Invitrogen) using custom LNA probes targeting ribosomal RNAs (some datasets are also depleted of U snRNAs and highly abundant microRNAs). The RNA was treated with Tobacco Alkaline Pyrophosphatase to eliminate any 5' cap structures. Poly-A Polymerase was used to catalyze the addition of C's to the 3' end. The 5' ends were phosphorylated using T4 PNK and an RNA linker was ligated onto the 5' end. Reverse transcription was carried out using a poly-G oligo with a defined 5' extension. The inserts were then amplified using oligos targeting the 5' linker and poly-G extension and containing sequencing adapters. The library was sequenced on an Illumina GA machine for a total of 36, 50 or 76 cycles. Initially, one lane was run. If an appreciable number of mappable reads were obtained, additional lanes were run. Sequence reads underwent quality filtration using the Illumina standard pipeline (GERALD). The Illumina reads were initially trimmed to discard any bases following a quality score less than or equal to 20 and converted into FASTA format, thereby discarding quality information for the rest of the pipeline. As a result, the sequence quality scores in the BAM output are all displayed as "40" to indicate no quality information. The read lengths may exceed the insert sizes and consequently introduce 3' adapter sequence into the 3' end of the reads. The 3' sequencing adapter was removed from the reads using a custom clipper program (available at http://hannonlab.cshl.edu/fastx_toolkit/), which aligned the adapter sequence to the short-reads using up to 2 mismatches and no indels. Regions that aligned were "clipped" off from the read. Terminal C nucleotides introduced at the 3' end of the RNA via the cloning procedure are also trimmed. 
The trimmed portions were collapsed into identical reads, their count noted and aligned to the human genome (version hg19, using the gender build appropriate to the sample in question - female/male) using Bowtie (Langmead B. et al). The alignment parameter allowed 0, 1, or 2 mismatches iteratively. We report reads that mapped 20 or fewer times. Discrepancies between hg18 and hg19 versions of CSHL small RNA data: The alignment pipeline for the CSHL small RNA data was updated upon the release of the human genome version hg19, resulting in a few noteworthy discrepancies with the hg18 dataset. First, mapping was conducted with the open-source Bowtie algorithm (http://bowtie-bio.sourceforge.net/index.shtml) rather than the custom NexAlign software. As each algorithm uses different strategies to perform alignments, the mapping results may vary even in genomic regions that do not differ between builds. The read processing pipeline also varies slightly, in that we no longer retain information regarding whether a read was 'clipped' off adapter sequence.
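As a rough illustration of the hg19 read pre-processing described above, the Python sketch below trims at the first low-quality base, strips terminal Cs, and filters by number of mapping locations. It is not the production pipeline. The function names are hypothetical, quality values are passed as plain integers to avoid assuming a FASTQ encoding, and the sketch assumes the low-quality base itself is discarded along with everything after it.

```python
def quality_trim(seq, quals, threshold=20):
    """Truncate a read at the first base whose quality is <= threshold.

    Assumption: the low-quality base is dropped together with all
    following bases.
    """
    for i, q in enumerate(quals):
        if q <= threshold:
            return seq[:i]
    return seq


def strip_terminal_c(seq):
    """Remove a run of C nucleotides from the 3' end of a read.

    Note that this cannot tell cloning-derived Cs from genuine 3' Cs.
    """
    return seq.rstrip("C")


def reportable(hit_counts, max_hits=20):
    """Keep reads that map to at least one and at most `max_hits` loci."""
    return {read: n for read, n in hit_counts.items() if 1 <= n <= max_hits}


# Hypothetical values, for illustration only.
trimmed = quality_trim("ACGTACGT", [30, 30, 30, 30, 20, 30, 30, 30])
untailed = strip_terminal_c("ACGTTGCCC")
kept = reportable({"readA": 1, "readB": 20, "readC": 21, "readD": 0})
```

Here `max_hits=20` mirrors the hg19 reporting rule; the hg18 pipeline described earlier used 10.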

ORGANISM(S): Homo sapiens

SUBMITTER: UCSC ENCODE DCC

PROVIDER: E-GEOD-24565 | biostudies-arrayexpress |

REPOSITORIES: biostudies-arrayexpress

Publications

Landscape of transcription in human cells.

Djebali Sarah, Davis Carrie A, Merkel Angelika, Dobin Alex, Lassmann Timo, Mortazavi Ali, Tanzer Andrea, Lagarde Julien, Lin Wei, Schlesinger Felix, Xue Chenghai, Marinov Georgi K, Khatun Jainab, Williams Brian A, Zaleski Chris, Rozowsky Joel, Röder Maik, Kokocinski Felix, Abdelhamid Rehab F, Alioto Tyler, Antoshechkin Igor, Baer Michael T, Bar Nadav S, Batut Philippe, Bell Kimberly, Bell Ian, Chakrabortty Sudipto, Chen Xian, Chrast Jacqueline, Curado Joao, Derrien Thomas, Drenkow Jorg, Dumais Erica, Dumais Jacqueline, Duttagupta Radha, Falconnet Emilie, Fastuca Meagan, Fejes-Toth Kata, Ferreira Pedro, Foissac Sylvain, Fullwood Melissa J, Gao Hui, Gonzalez David, Gordon Assaf, Gunawardena Harsha, Howald Cedric, Jha Sonali, Johnson Rory, Kapranov Philipp, King Brandon, Kingswood Colin, Luo Oscar J, Park Eddie, Persaud Kimberly, Preall Jonathan B, Ribeca Paolo, Risk Brian, Robyr Daniel, Sammeth Michael, Schaffer Lorian, See Lei-Hoon, Shahab Atif, Skancke Jorgen, Suzuki Ana Maria, Takahashi Hazuki, Tilgner Hagen, Trout Diane, Walters Nathalie, Wang Huaien, Wrobel John, Yu Yanbao, Ruan Xiaoan, Hayashizaki Yoshihide, Harrow Jennifer, Gerstein Mark, Hubbard Tim, Reymond Alexandre, Antonarakis Stylianos E, Hannon Gregory, Giddings Morgan C, Ruan Yijun, Wold Barbara, Carninci Piero, Guigó Roderic, Gingeras Thomas R

Nature, 2012-09-01, issue 7414

Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification ...

PMID: 22955620

Similar Datasets

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Florencia Pauli mailto:fpauli@hudsonalpha.org). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE Project. RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly (Mortazavi et al., 2008). Biological replicates of ENCODE cell lines were grown on separate culture plates, total RNA was purified and polyA selected two times. mRNA was then fragmented by magnesium-catalyzed hydrolysis, reverse transcribed to cDNA by random priming and amplified. The cDNA was sequenced on an Illumina Genome Analyzer (GAI or GAIIx). The DNA sequences were aligned to the NCBI Build37 (hg19) version of the human genome using the sequence alignment programs ELAND (Illumina) or Bowtie (Langmead et al., 2009). The first 10 residues of sequencing have a weak characteristic nucleotide bias of unknown origin. This RNA-seq protocol does not specify the coding strand. As a result, there will be ambiguity at loci where both strands are transcribed. This is the first NCBI Build37 (hg19) release of this track (Jan 2012). This release includes the 3 datasets (Jurkat, A549/DEX100nm, and A549/EtOH2pct) previously released on NCBI Build36 (hg18) and adds data for several more cell types and growth conditions in replicate. Four types of download files are available for each replicate including the Raw Data (fastq), Transcripts GencodeV7 (gtf), Raw Signal (bigwig), and Alignments (bam). 
For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Experimental Procedures Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell) except for H1-hESC for which frozen cell pellets were purchased from Cellular Dynamics. Cells were lysed in RLT buffer (Qiagen RNeasy kit) and processed on RNeasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNase digestion step to remove residual genomic DNA. mRNA was isolated from at least 10 µg of total RNA with oligo(dT) two times (Dynabeads mRNA Purification Kit, Invitrogen). Alternatively, cells were lysed and mRNA was purified directly two times with oligo(dT) (Dynabeads mRNA DIRECT Kit, Invitrogen). 100 ng of mRNA was fragmented by magnesium-catalyzed hydrolysis and reverse transcribed to cDNA by random priming according to the protocol in Mortazavi et al. (2008). cDNA was prepared for sequencing on the Genome Analyzer flowcell according to the protocol for the ChIPSeq DNA genomic DNA kit (Illumina). The sequencing libraries were size-selected around 225 bp and amplified with 15 rounds of PCR. Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Single end reads of 36 nt in length were obtained. Data Processing and Analysis Fastq files were made from qseq files generated by the Illumina pipeline (Casava 1.7). The Raw Signal files (bigWig) were generated from bedgraph files and the score was calculated as the number of reads at that position divided by the total number of reads in millions. Casava export files were aligned to the NCBI Build37 (hg19) version of the human genome with ELAND (Illumina), generating SAM files. 
Fastq files of experiments that were previously aligned to NCBI Build36 (hg18) were aligned to NCBI Build37 (hg19) using Bowtie (Langmead et al., 2009; parameters: -S -n 2 -k 11 -m 10 --best), also generating SAM files. SAM files were converted to BAM with SAMtools (Li et al., 2009). Gene expression within Gencode.v7 (Harrow et al., 2006) gene models was estimated using Cufflinks v0.9.3 (Roberts et al., 2011). Estimates of transcript abundance were reported in Fragments Per Kilobase of exon per Million fragments mapped (FPKM). FPKM is calculated by dividing the total number of fragments that align to the gene model by the size of the spliced transcript (exons) in kilobases. This number is then divided by the total number of reads in millions for the experiment. FPKM is reported in the last column of the gtf (TranscriptGencV7) files. Raw Data (fastq), Raw Signal (bigWig), Alignments (bam) and Transcript Gencode V7 (gtf) files are available from the Downloads (http://hgwdev.cse.ucsc.edu/cgi-bin/hgFileUi?g=wgEncodeHaibRnaSeq) page.
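The two normalizations described above (the bigWig signal score and FPKM) reduce to simple arithmetic. The Python sketch below follows the wording of the description; the function names and the example numbers are hypothetical, and the signal score is read here as reads per million, which is an interpretation of the text rather than a confirmed detail of the pipeline.

```python
def reads_per_million(reads_at_position, total_reads):
    """Signal score: reads at a position per million total reads."""
    return reads_at_position / (total_reads / 1_000_000)


def fpkm(fragments, transcript_length_bp, total_reads):
    """Fragments per kilobase of exon per million reads.

    fragments / (spliced transcript length in kb) / (total reads in millions)
    """
    per_kilobase = fragments / (transcript_length_bp / 1000)
    return per_kilobase / (total_reads / 1_000_000)


# Hypothetical numbers, for illustration only.
signal = reads_per_million(30, 15_000_000)
expression = fpkm(500, 2000, 10_000_000)
```

With these inputs, 500 fragments on a 2 kb spliced transcript in a 10 million read experiment give 500 / 2 / 10 = 25 FPKM.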

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (mailto:georgi@caltech.edu for data coordination/informatics/experimental questions, mailto:diane@caltech.edu for informatics questions, mailto:bawilli_91125@yahoo.com for experimental questions). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE Project. RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly. RNA-seq is performed by reverse-transcribing an RNA sample into cDNA, followed by high throughput DNA sequencing, which was done here on an Illumina Genome Analyzer (GAI or GAIIx) (Mortazavi et al., 2008). The transcriptome measurements shown on these tracks were performed on polyA selected RNA (http://genome.ucsc.edu/cgi-bin/hgEncodeVocab?term=longPolyA&type=rnaExtract) from total cellular RNA (http://genome.ucsc.edu/cgi-bin/hgEncodeVocab?term=cell&type=localization) using two different protocols - one that preserves information about which strand the read is coming from and one that does not. Due to the specifics of the enzymology of library construction, gene and transcript quantification is more accurate based on the non-strand-specific protocol, while the strand-specific protocol is useful for assigning strandedness, but in general less reliable for quantification. Non-strand-specific protocol (deep "reference" transcriptome measurements, 2x75 bp reads): PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming and amplified. Data have been produced in two formats: single reads, each of which comes from one end of a cDNA molecule, and paired-end reads, which are obtained as pairs from both ends of cDNAs. This RNA-seq protocol does not specify the coding strand. 
As a result, there will be ambiguity at loci where both strands are transcribed. The "randomly primed" reverse transcription is, apparently, not fully random. This is inferred from a sequence bias in the first residues of the read population, and this likely contributes to observed unevenness in sequence coverage across transcripts. Strand specific protocol (1x75 bp reads): PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis. 3' adapters were ligated to the 3' end of fragments, then 5' adapters were ligated to the 5' end. The resulting RNA molecules were converted to cDNA and amplified. This RNA-seq protocol does specify the coding strand as each read is in the same 5'-3' orientation as the original RNA strand. As a result, loci where both strands are transcribed can be disambiguated. However, RNA ligation is an inherently biased process and as a result greater unevenness in sequence coverage across transcripts is observed compared to the non-strand-specific data, and quantification is less accurate. Data Analysis: Reads were aligned to the hg19 human reference genome using TopHat, a program specifically designed to align RNA-seq reads and discover splice junctions de novo. Cufflinks, a de novo transcript assembly and quantification software package, was run on the TopHat alignments to discover and quantify novel transcripts and to obtain transcript expression estimates based on the GENCODE annotation. All sequence files, alignments, gene and transcript models and expression estimates files are available for download. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Experimental Procedures: Cells were grown according to the approved ENCODE cell culture protocols except for H1-hESC for which frozen cell pellets were purchased from Cellular Dynamics. 
Cells were lysed in RLT buffer (Qiagen RNeasy kit) and processed on RNeasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNAse digestion step to remove residual genomic DNA. 75 µg of total RNA was selected twice with oligo-dT beads (Dynal) according to the manufacturer's protocol to isolate mRNA from each of the preparations. For 2x75 bp non-stranded RNA-seq, 100 ng of mRNA was then processed according to the protocol in Mortazavi et al (2008), and prepared for sequencing on the Genome Analyzer flow cell according to the protocol for the ChIPSeq DNA genomic DNA kit (Illumina). The majority of paired-end libraries were size-selected around 200 bp (fragment length) with the exception of a few additional replicates that were size-selected at 400 bp with the specific intent to investigate the effect of fragment length on results. Strand-specific RNA-seq libraries were prepared from 100 ng of mRNA from the same preparation following Illumina's Strand-Specific RNA-seq protocol. Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Reads of 75 bp length were obtained, single end for directional, strand-specific libraries (1x75D) and paired end for non-strand-specific libraries (2x75). Data Processing and Analysis: Reads were mapped to the reference human genome (version hg19), with or without the Y chromosome, depending on the sex of the cell line, and without the random chromosomes and haplotypes in all cases, using TopHat (version 1.0.14). TopHat was used with default settings with the exception of specifying an empirically determined mean inner-mate distance. After mapping reads to the genome and identifying splice junctions, the data was further analyzed using the transcript assembly and quantification software Cufflinks (version 0.9.3) using the sequence bias detection and correction option. 
Cufflinks was used in two modes: first, expression for genes and individual transcripts was quantified based on the GENCODE annotation, for both versions v3c and v4 of GENCODE GRCh37, and second, Cufflinks was run in de novo transcript assembly and quantification mode to obtain candidate novel transcript and gene models and expression estimates for them.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Florencia Pauli mailto:fpauli@hudsonalpha.org). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). The ChIP-Seq method was used to assay chromatin fragments bound by specific or general transcription factors as described below. DNA isolated by ChIP-Seq was size-selected (~225 bp) and sequenced. Short reads of 25-36 bp were mapped to the human reference genome, and enriched regions of high read density relative to total input chromatin control reads were identified. The sequence reads with quality scores (fastq files) and alignment coordinates (BAM files) from these experiments are available for download. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell). Cross-linked chromatin was immunoprecipitated with an antibody. The Protein:DNA crosslinks were then reversed and the DNA fragments were recovered and sequenced. Please see protocol notes below and go to http://hudsonalpha.org/myers-lab/protocols for the most current version of the protocol. Biological replicates from each experiment were completed. Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Sequence data produced by the Illumina data pipeline software were quality filtered and then mapped to NCBI Build37 (hg19) using the integrated Eland software; 32 nt of the sequence reads were used for alignment; up to two mismatches were tolerated; reads that mapped to multiple sites in the genome were discarded. 
To identify likely binding sites, peak calling was applied to the aligned sequence data sets using Model-based Analysis of ChIP-Seq (MACS; Zhang Y, et al., 2008) (http://liulab.dfci.harvard.edu/MACS/00README.html). MACS models the shift size of ChIP-Seq tags empirically, and uses it to improve the spatial resolution of predicted binding sites. MACS also uses a dynamic Poisson distribution to capture local biases in the genome, allowing for more robust predictions (Zhang Y, et al., 2008). Protocol Notes: Several changes and improvements were made to the original ChIP-Seq protocol (Johnson et al., 2008). The major differences between protocols are the number of cells and magnetic beads used for IP, the method of sonication used to fragment DNA, and the number of cycles of PCR used to amplify the sequencing library. The most current protocol used by the Myers lab can be found at http://hudsonalpha.org/myers-lab/protocols. The protocol field for each file denotes the version of the protocol used as being PCR1x, PCR2x or a version number (for example, v041610.1). The sequencing libraries labeled as PCR2x were made with two rounds of amplification (25 and 15 cycles) and those labeled as PCR1x were made with one 15-cycle round of amplification. These experiments were completed prior to January 2010 and were originally aligned to NCBI Build36 (hg18). They have been re-aligned to NCBI Build37 (hg19) with the Bowtie software (Langmead, et al., 2009) for this data release (http://bowtie-bio.sourceforge.net/index.shtml). The libraries labeled with a protocol version number were completed after January 2010 and were only aligned to NCBI Build37 (hg19). Please refer to the Myers Lab website (http://hudsonalpha.org/myers-lab/protocols) for details on each protocol version. Verification: The MACS (http://liulab.dfci.harvard.edu/MACS/00README.html) peak caller was used to call significant peaks on the individual replicates of a ChIP-Seq experiment. 
Afterwards, the irreproducible discovery rate (IDR) method, developed by Li et al. (submitted), was used to quantify the consistency between pairs of ranked peak lists from replicates. The IDR method uses a model that assumes that the ranked lists of peaks in a pair of replicates consist of two groups - a reproducible group and an irreproducible group. In general, the signals in the reproducible group are more consistent (i.e. with a larger rank correlation coefficient) and are ranked higher than the irreproducible group. The proportion of peaks that belong to the irreproducible component and the correlation of the reproducible component are estimated adaptively from the data. The model also provides an IDR score for each peak, which reflects the posterior probability of the peak belonging to the irreproducible group. The aligned reads were pooled from all replicates and the MACS peak caller was used to call significant peaks on the pooled data. Only datasets containing at least 100 peaks passing the IDR threshold are considered valid and submitted for release.

Project description:This track depicts high throughput sequencing of long RNAs (>200 nt) from RNA samples from tissues or subcellular compartments from ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Sample preparation and sequencing: K562 and GM12878 total cell, total RNA: Standard Illumina Pair-end kit with the sole exception that a "tagged" random hexamer was used to prime the 1st strand synthesis: 5'-ACTGTAGGN6-3'. The addition of this tag is what permits us to make strand assignments for the reads. The sequence of the tag is reported in the 5' end of the read. Asymmetric PCR can place the tag on either the 1st or 2nd read depending on which strand it used as a template. Strand assignments are made by looking for the tag at the 5' end of either read 1 or read 2. Read 1 is physically linked to read 2. Therefore, if a tag is present on one end strand assignments are made for both ends. We noted during analysis that the tags are generally 5' truncated. We only "strand" reads that contain ACTGTAGG, CTGTAGG, TGTAGG, GTAGG. Between 63-68% of reads could be stranded in these libraries. It is possible to cull additional stranded reads that contain non-templated TAGG, AGG, GG, or G sequences at their 5' end. The peak in insert size distribution is between 200-250 nucleotides. K562 cytosol, polyA+ RNA: Oligo-dT selected poly-A+ RNA was RiboMinus-treated according to the manufacturer's protocol (Invitrogen). The RNA was treated with tobacco alkaline pyrophosphatase to eliminate any 5' cap structures and hydrolyzed to ~200 bases via alkaline hydrolysis. 
The 3' end was repaired using calf intestinal alkaline phosphatase, and poly-A polymerase was used to catalyze the addition of Cs to the 3' end. The 5' end was phosphorylated using T4 PNK, and an RNA linker was ligated onto the 5' end. Reverse transcription was carried out using a poly-G oligo with a defined 5' extension. The inserts were then amplified using oligos targeting the 5' linker and poly-G extension. This cloning protocol generated stranded reads that were read from the 5' ends of the inserts. The library was sequenced on a Solexa platform for a total of 36 cycles; however, the reads underwent post-processing, resulting in trimming of their 3' ends. Consequently, the mapped read lengths are variable. Analysis: K562 and GM12878 total cell, total RNA: Tags were removed from the 5' ends of the reads according to their lengths and strand assignments made. Subsequently, the reads were trimmed from their 3' ends to a final length of 50 nucleotides and were mapped using NexAlign, a program developed by Timo Lassmann, RIKEN. We allowed up to 2 mismatches across the entire length and only report reads that mapped to a single/unique locus in the assembled hg18 genome. K562 cytosol, polyA+ RNA: Reads were mapped to the human (hg18, March 2006) assembly using Nexalign, with only uniquely mapping (one locus), exactly matching (no mismatches) aligned reads reported in the processed files, as follows:
1) Collect the read sequences from Illumina non-filtered output files.
2) Filter out all reads that contain undefined nucleotides ('N').
3) Perform iterative alignment/C-tail chopping algorithm (below). On each alignment step, the reads are aligned to the genome with 100% identity. All reads that align to a single locus are withdrawn from the alignment pool and only the reads that could not be aligned continue to the next step.
a) Align to the hg18 genome using Nexalign 1.3.3 (© Timo Lassmann) without chopping off any nucleotides.
b) Chop off any C-blocks (until the first non-C) at the ends of the reads.
c) Align to the genome -> remove and save those that align.
d) Chop off any non-Cs until the next C.
e) Chop off C-block until the next non-C.
f) Align to the genome -> remove and save those that align.
g) Repeat steps d, e, and f until the reads align to the genome, or chopping results in the reduction of the reads' lengths to below 16 (default), or there are no non-Cs left.
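Steps a to g above can be sketched for a single read as follows. This is a minimal Python illustration, not the original implementation: `aligns_uniquely` is a hypothetical stand-in for an exact-match, single-locus Nexalign query, and the sketch assumes that chopping happens only at the 3' end of the read.

```python
def chop_and_align(read, aligns_uniquely, min_length=16):
    """Iteratively chop a 3' C-tail until the read aligns.

    `aligns_uniquely` is a caller-supplied function (hypothetical here)
    that returns True when a sequence matches exactly one genomic locus.
    Returns the aligned sequence, or None if no alignment was found.
    """
    if aligns_uniquely(read):               # step a: try the unchopped read
        return read
    seq = read.rstrip("C")                  # step b: drop a terminal C-block
    while seq and len(seq) >= min_length:   # stop when too short or empty
        if aligns_uniquely(seq):            # steps c and f: align, keep hits
            return seq
        seq = seq[:seq.rfind("C") + 1]      # step d: drop non-Cs up to next C
        seq = seq.rstrip("C")               # step e: drop that C-block
    return None


# Toy "genome" of uniquely alignable sequences, for illustration only.
unique_hits = {"AGTTGACCATGGATTGAT"}
tailed_read = "AGTTGACCATGGATTGAT" + "CCCC" + "GA"
result = chop_and_align(tailed_read, lambda s: s in unique_hits)
```

In this toy case the trailing `GA` and the `CCCC` block are removed in one round of steps d and e, and the remaining 18 nt insert aligns. When the read has no terminal C-block, step c repeats the alignment from step a, which is redundant but harmless in a sketch.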

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track was produced as part of the ENCODE project. It reports the percentage of DNA molecules that exhibit cytosine methylation. In general, DNA methylation within a gene's promoter is associated with gene silencing, and DNA methylation within the exons and introns of a gene is associated with gene expression. Proper regulation of DNA methylation is essential during development and aberrant DNA methylation is a hallmark of cancer. DNA methylation status was assayed with Whole Genome Bisulfite Sequencing (WGBS). Genomic DNA was sheared by sonication, end-repaired and then ligated to methylated sequencing adapters. The library fragments were treated with sodium bisulfite and amplified by PCR to convert every unmethylated cytosine to a thymine while leaving methylated cytosines intact. The sequenced fragments were aligned to a bisulfite-converted reference genome. For each assayed cytosine, the number of sequencing reads covering that C and the percentage of those reads that were methylated were reported. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf DNA methylation at cytosines across the genome was assayed with Whole Genome Bisulfite Sequencing (WGBS). WGBS was performed on cell lines grown by ENCODE production groups. WGBS was carried out by the Myers production group at the HudsonAlpha Institute for Biotechnology. Isolation of Genomic DNA: Genomic DNA was isolated from each cell line using the QIAGEN DNeasy Blood & Tissue Kit according to the instructions provided by the manufacturer. 
DNA concentrations for each genomic DNA preparation were determined using fluorescent DNA-binding dye and a fluorometer (Invitrogen Quant-iT dsDNA High Sensitivity Kit and Qubit Fluorometer). Typically, 2 µg of genomic DNA is used to make WGBS libraries. WGBS Library Construction and Sequencing: WGBS library construction started with sonication of genomic DNA on a Covaris S2 instrument. Sheared ends were then repaired and blunted with DNA polymerase I, T4 DNA polymerase and T4 polynucleotide kinase in the presence of dATP, dGTP and dTTP. After end repair, Klenow exo- DNA Polymerase was used to add an adenosine as a 3' overhang. Next, a methylated version of the Illumina paired-end adapters was ligated onto the DNA. Adapter-ligated 400 bp genomic DNA fragments were selected using a 2% agarose SizeSelect E-gel. The selected adapter-ligated fragments were treated with sodium bisulfite using the Zymo Research EZ DNA Methylation Gold Kit, which converts unmethylated cytosines to uracils and leaves methylated cytosines unchanged. Bisulfite-treated DNA was amplified in a final PCR reaction which was optimized to uniformly amplify diverse fragment sizes and sequence contexts in the same reaction. During this final PCR reaction, uracils were copied as thymines, resulting in a thymine in the PCR products wherever an unmethylated cytosine existed in the genomic DNA. These libraries were then sequenced with an Illumina HiSeq 2000 according to the manufacturer's recommendations as paired-end 50 bp reads. Libraries were sequenced to a depth of 600 million aligned reads. Data Analysis: To analyze the sequence data, Bismark (Krueger and Andrews, 2011) was used to align sequence reads. Generally, each read went through a conversion of Cs to Ts and was then aligned to fully converted plus and minus strands of the hg19 build of the human genome. A few custom refinements were made to the Bismark program. 
Since these libraries were made in a directional orientation with the first read always being C-poor, we skipped unnecessary alignments to impossible orientations. We also implemented a more stringent uniqueness filter, only allowing reads that have one acceptable alignment (based on default Bowtie parameters) across both strands. Once reads were aligned, the percent methylation was calculated for each cytosine using the original sequence reads. The percent methylation and number of reads are reported for each CpG in the wgEncodeHaibMethylWgbsXXXXCpg.bigBed file and for each non-CpG cytosine in the wgEncodeHaibMethylWgbsXXXXNoncpg.bigBed file.
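
The per-cytosine calculation described above can be illustrated with a minimal Python sketch. It is not the modified Bismark code used by the producing lab; the function names and the (chrom, pos, base) input layout are assumptions made for illustration. It assumes that, at each reference cytosine, a C in the original read indicates a methylated (protected) base and a T indicates an unmethylated (converted) base.

```python
from collections import defaultdict


def in_silico_convert(read: str) -> str:
    """Fully convert a read (every C becomes T), as done before alignment."""
    return read.upper().replace("C", "T")


def percent_methylation(calls):
    """Summarize methylation per cytosine.

    calls: iterable of (chrom, pos, base), where base is the base seen in the
    ORIGINAL (unconverted) read at a reference cytosine (hypothetical layout).
    Returns {(chrom, pos): (n_reads, percent_methylated)}.
    """
    counts = defaultdict(lambda: [0, 0])  # [methylated reads, total reads]
    for chrom, pos, base in calls:
        if base not in ("C", "T"):
            continue  # skip mismatches and Ns; they are uninformative
        entry = counts[(chrom, pos)]
        entry[1] += 1
        if base == "C":
            entry[0] += 1
    return {
        site: (total, 100.0 * meth / total)
        for site, (meth, total) in counts.items()
    }


calls = [
    ("chr1", 100, "C"), ("chr1", 100, "T"), ("chr1", 100, "C"),
    ("chr1", 100, "C"), ("chr1", 200, "T"), ("chr1", 200, "N"),
]
result = percent_methylation(calls)
```

Here the cytosine at chr1:100 is covered by four informative reads, three of which retain the C, so it is reported with four reads and 75 percent methylation.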

Project description: This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Florencia Pauli mailto:fpauli@hudsonalpha.org). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track shows average methylation status in CpG islands. In general, methylation of CpG sites within a promoter causes silencing of the gene associated with that promoter. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf CpG regions were assayed via Methyl-seq, a method developed in the Myers laboratory to measure the methylation status at CpGs throughout the genome. It combines DNA digestion by a methyl-sensitive enzyme HpaII and its methyl-insensitive isoschizomer MspI with the Illumina DNA sequencing platform. The method was first applied in a collaboration with the laboratory of Dr. Julie Baker at Stanford University to study methylation and gene expression changes that occur in human embryonic stem cells before and after differentiation to definitive endoderm. A paper describing the results as well as the method has been submitted for publication [1]. This study profiled genomic DNA and mRNA samples derived from two human embryonic stem cell lines: H9 and BG02. These cells were differentiated into definitive endoderm, embryoid bodies, embryoid body-derived cells, and AFP+ (alpha-fetoprotein positive) hepatocytes. These in vitro samples were profiled with Methyl-seq and compared with normal tissue samples from 11-week and 24-week fetal liver and adult liver. Methyl-seq assays more than 250,000 methyl-sensitive restriction enzyme cleavage sites, representing more than 90,000 genomic regions. 
These regions include 35,528 annotated CpG islands, while the remaining 55,084 non-CpG island regions are distributed across the genome in promoters, genes, and intergenic regions. Sequence tags present in MspI libraries but not in HpaII libraries are derived from methylated regions. Conversely, sequence tags that occur in HpaII libraries come from at least partially unmethylated regions. In vitro differentiation: Definitive endoderm precursor cells were generated from H9 hES cells by treating them with activin A. Embryoid bodies (EBs) were generated by growing undifferentiated H9 and BG02 hESCs in suspension. EB-derived cells were obtained by plating clumps of the cells from the EBs. AFP+ fetal hepatocytes were derived from EBs by plating EB cells with FGF, followed by fluorescence activated cell sorting (FACS) to isolate cells expressing the green fluorescent protein (GFP) reporter gene driven from the AFP promoter. Isolation of genomic DNA: Genomic DNA is isolated from biological replicates of each cell line by using the QIAGEN DNeasy Blood & Tissue Kit according to the instructions provided by the manufacturer. The DNA concentration and quality of each preparation are determined by UV absorbance. HpaII and MspI digestions: Cleavage of DNA by restriction endonuclease HpaII is prevented by the presence of a 5-methyl group at the internal C residue of its recognition sequence CCGG. MspI, an isoschizomer of HpaII, cleaves DNA irrespective of the presence of a methyl group at this position. For the MspI library, 5 µg genomic DNA was digested in a 100 µl reaction with 1X NEB Buffer2 and 20 units MspI restriction enzyme and incubated for 18 hr at 37°C. For the HpaII library, 5 µg genomic DNA was digested in a 100 µl reaction with 1X NEB Buffer1 and 20 units HpaII restriction enzyme and incubated for 18 hr at 37°C. 
Note that in subsequent versions of the Methyl-seq protocol, which will be described later, much lower amounts of genomic DNA were used (1 µg and potentially lower). DNA library construction and sequencing: High-throughput sequencing libraries were generated from DNA fragments of the HpaII or MspI digested genomic DNA according to the protocol posted at the website: http://myers.hudsonalpha.org/content/protocols.html. This approach was recently modified by removing the first PCR amplification step, just prior to the gel electrophoresis size-selection step, which was found to reduce a fragment-size bias in the sequencing libraries. These libraries were sequenced with an Illumina Genome Analyzer (GA2) according to the manufacturer's recommendations. Data analysis: For this analysis, reads that align to human genome sequence version hg19 and contain the 5'-CGG-3' HpaII-cut signature on their 5' end were used. These aligned sequence reads were mapped to CCGG sites predicted in silico on hg19. Sites with four or more MspI tags occurring in either the forward or reverse direction were retained for analysis. These "assayable" sites were then grouped with neighboring sites that are within 35-75 bp of each other. Thus, a "region" can comprise between 2 and 18 digestion sites that are each within 35-75 bp of another site. Methylated and non-methylated calls were made by using HpaII tag data from all assayable cut sites. For each site across each region, the larger of either the forward read count or reverse read count was used. Regions that have an average of 0 or 1 read per cut site are called methylated, and regions with more than one sequence read per site are called unmethylated.
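
The site filtering, grouping and region-calling rules above can be sketched in Python as follows. This is an illustrative reading of the text, not the producing lab's pipeline. The Site record and function names are hypothetical. The sketch assumes that "four or more MspI tags in either direction" means the larger of the two strand counts is at least 4, that consecutive sites are chained when their spacing is 35-75 bp inclusive, and that an average of at most one HpaII read per site means methylated.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """Tag counts at one in silico CCGG site (hypothetical record layout)."""
    pos: int
    msp_fwd: int
    msp_rev: int
    hpa_fwd: int
    hpa_rev: int


def assayable(sites, min_msp_tags=4):
    """Keep sites with at least min_msp_tags MspI tags on either strand."""
    return [s for s in sites if max(s.msp_fwd, s.msp_rev) >= min_msp_tags]


def group_regions(sites, min_gap=35, max_gap=75):
    """Chain position-sorted sites whose spacing is within [min_gap, max_gap]."""
    regions, current = [], []
    for s in sorted(sites, key=lambda s: s.pos):
        if current and not (min_gap <= s.pos - current[-1].pos <= max_gap):
            if len(current) >= 2:  # a region needs at least two sites
                regions.append(current)
            current = []
        current.append(s)
    if len(current) >= 2:
        regions.append(current)
    return regions


def call_region(region):
    """Call a region from the mean of per-site max(forward, reverse) HpaII counts."""
    mean_hpa = sum(max(s.hpa_fwd, s.hpa_rev) for s in region) / len(region)
    return "methylated" if mean_hpa <= 1 else "unmethylated"


sites = [
    Site(100, 5, 0, 0, 1), Site(150, 4, 2, 1, 0), Site(200, 1, 1, 9, 9),
    Site(240, 6, 0, 3, 0), Site(300, 0, 7, 2, 5), Site(1000, 8, 8, 0, 0),
]
regions = group_regions(assayable(sites))
calls = [call_region(r) for r in regions]
```

In this toy input the site at position 200 is dropped for insufficient MspI coverage, the isolated site at 1000 forms no region, and the two remaining regions are called methylated and unmethylated respectively.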

Project description: This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Piero Carninci mailto:carninci@riken.jp). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track shows 5' cap analysis gene expression (CAGE) tags and clusters in RNA extracts (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=rnaExtract) from different sub-cellular localizations (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=localization) in multiple cell lines (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=cellType). A CAGE cluster is a region of overlapping tags with an assigned value that represents the expression level. The data in this track were produced as part of the ENCODE Transcriptome Project. Release 2 has three new download-only files per experiment (Clusters, TSS Gencode 7 and TSS HMM) and four new cell lines (A459, AG04450, BJ and SK-N-SH_RA). Release 1 on hg19 contained the original data on hg18 (http://hgwdev.cse.ucsc.edu/cgi-bin/hgTrackUi?db=hg18&g=wgEncodeRikenCage) that was remapped and indicated in this release as Generation 0 since that data had no replicates. If there is both old and new generation data available for a particular experiment, only the new generation data is displayed and the older data is available for download. The new data for this track were generated with a different process and have standard replicate numbers. The replicate labeling in the genome browser view is a counter indicating the total number of replicates submitted. The producing lab has replicate numbers that correspond to their internal bio-replicate numbering. Where these two numbering systems conflict, both are listed in the long label of the specific track. 
For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell). RNA molecules longer than 200 nt were isolated from each subcellular compartment and then were fractionated into polyA+ and polyA- fractions as described in these protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/general/rnaExtracts.txt). The CAGE tags were sequenced from the 5' ends of cap-trapped cDNAs produced using RIKEN CAGE technology (Kodzius et al. 2006; Valen et al. 2009). To create the tag, a linker was attached to the 5' end of polyA+ or polyA- reverse-transcribed cDNAs which were selected by cap trapping (Carninci et al. 1996). The first 27 bp of the cDNA were cleaved using class II restriction enzymes. A linker was then attached to the 3' end of the cDNA. After PCR amplification, the tags were sequenced (36 bp single reads) using Illumina's Genome Analyzer. Tags were mapped to the human genome (hg19) using the program delve (T. Lassmann, manuscript in preparation). Delve is a new probabilistic aligner focused on giving the best possible alignment of reads to a genome rather than focusing on speed. It calculates the mapping accuracy (probability of each alignment being true or not) for each alignment. There is no set limit on the number of errors allowed and therefore the mapping rate is commonly 100%. However, for analysis it is recommended to discard alignments with low mapping qualities. Exceptions to the above protocol are the polyA- RNA samples from K562 cytosol, K562 nucleus, and prostate whole cell which were sequenced using ABI SOLiD (http://www.appliedbiosystems.com/absite/us/en/home/applications-technologies/solid-next-generation-sequencing.html) technology. These reads were mapped with Bowtie using default parameters. 
Clusters were defined as regions of overlapping CAGE reads. The expression level was computed as the number of reads making up the cluster, divided by the total number of reads sequenced, times 1 million.
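
The clustering and normalization just described amount to merging overlapping tags and scaling each cluster's read count to tags per million. A minimal Python sketch follows. The tuple layout (chrom, strand, start, end) with half-open coordinates, the function name, and the choice to cluster each strand separately are assumptions for illustration, not details taken from the RIKEN pipeline.

```python
def cluster_cage_tags(tags, total_reads):
    """Merge overlapping tags per chromosome and strand, then normalize.

    tags: iterable of (chrom, strand, start, end), half-open coordinates.
    total_reads: total number of reads sequenced for the library.
    Returns a list of cluster dicts with a 'tpm' expression value.
    """
    clusters = []
    for chrom, strand, start, end in sorted(tags):
        last = clusters[-1] if clusters else None
        same_group = (
            last is not None
            and last["chrom"] == chrom
            and last["strand"] == strand
        )
        if same_group and start < last["end"]:
            # Tag overlaps the open cluster: extend it and count the read.
            last["end"] = max(last["end"], end)
            last["reads"] += 1
        else:
            clusters.append(
                {"chrom": chrom, "strand": strand,
                 "start": start, "end": end, "reads": 1}
            )
    for c in clusters:
        # reads in cluster / total reads sequenced * 1 million
        c["tpm"] = c["reads"] * 1_000_000 / total_reads
    return clusters


tags = [
    ("chr1", "+", 100, 127), ("chr1", "+", 110, 137),
    ("chr1", "+", 500, 527), ("chr1", "-", 105, 132),
]
clusters = cluster_cage_tags(tags, total_reads=2_000_000)
```

With two million reads sequenced, the two overlapping plus-strand tags form one cluster with an expression level of 1.0, and each singleton cluster scores 0.5.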

Project description: This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (mailto:nshoresh@broad.mit.edu). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track displays maps of chromatin state generated by the Broad/MGH ENCODE group using ChIP-seq. Chemical modifications (methylation, acetylation) to the histone proteins present in chromatin influence gene expression by changing how accessible the chromatin is to transcription. The ChIP-seq method involves first using formaldehyde to cross-link histones and other DNA-associated proteins to genomic DNA within cells. The cross-linked chromatin is subsequently extracted, mechanically sheared, and immunoprecipitated using specific antibodies. After reversal of cross-links, the immunoprecipitated DNA is sequenced and mapped to the human reference genome. The relative enrichment of each antibody-target (epitope) across the genome is inferred from the density of mapped fragments. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf ChIP-seq: Cells were grown according to the approved ENCODE cell culture protocols. Cells were fixed in 1% formaldehyde and resuspended in lysis buffer. Chromatin was sheared to 200-700 bp using a Diagenode Bioruptor. Solubilized chromatin was immunoprecipitated with each of the antibodies listed above. Antibody-chromatin complexes were pulled down using protein A-sepharose (or anti-IgM-conjugated agarose for RNA polymerase II), washed and then eluted. After cross-link reversal and proteinase K treatment, immunoprecipitated DNA was extracted with phenol-chloroform, ethanol precipitated, treated with RNase and purified. 
One to ten nanograms of DNA were end-repaired, adapter-ligated and sequenced by Illumina Genome Analyzers as recommended by the manufacturer. Alignment: Sequence reads from each IP experiment were aligned to the human reference genome (GRCh37/hg19) using MAQ with default parameters, except '-C 11' and '-H output_file', which outputs up to 11 additional best matches for each read (if any are found) to a file. This information was used to filter out any read that had more than 10 best matches on the genome. Note: It is likely that instances where multiple reads align to the same position and with the same orientation are due to enhanced PCR amplification of a single DNA fragment. No attempt has been made, however, to remove such artifacts from the data, following ENCODE practices. Signal: Fragment densities were computed by counting the number of reads overlapping each 25 bp bin along the genome. Densities were computed using igvtools count with default parameters (in particular, '-w 25' to set window size of 25 bp and '-f mean' to report the mean value across the window), except for '-e' set to extend the reads to 200 bp, and the .wig output was converted to bigWig using wigToBigWig from the UCSC Kent software package. Peaks: Discrete intervals of ChIP-seq fragment enrichment were identified using Scripture, a scan statistics approach, under the assumption of uniform background signal. All data sets were processed with '-task chip', and with '-windows 100,200,500,1000,5000,10000,100000'. (Neither a mask file nor the '-trim' option was used.) The resulting called segments were then further filtered to remove intervals that are significantly enriched only because they contain smaller enriched intervals within them. This post-processing step was implemented in MATLAB. 
The use of the post-processing step allowed very large enriched intervals (of the order of Mbps for H3K27me3, for instance) to be detected, as well as much smaller intervals, without the need to tailor the parameters of Scripture based on prior expectations.
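
The read filtering and binned signal computation described in the Alignment and Signal steps can be approximated with a short Python sketch. It does not reproduce igvtools or MAQ behavior exactly. It simply drops reads with more than 10 best matches, extends each remaining read to 200 bp in its direction of sequencing, and counts the extended reads overlapping each 25 bp bin. The input tuple layout and function name are hypothetical.

```python
from collections import Counter


def bin_density(reads, bin_size=25, extend_to=200, max_best_matches=10):
    """Count extended reads overlapping each bin on one chromosome.

    reads: iterable of (start, length, strand, n_best_matches) with a
    0-based leftmost start coordinate (hypothetical layout).
    Returns a Counter mapping bin start coordinate -> read count.
    """
    density = Counter()
    for start, length, strand, n_best in reads:
        if n_best > max_best_matches:
            continue  # ambiguous read: too many equally good placements
        span = max(length, extend_to)
        if strand == "+":
            frag_start, frag_end = start, start + span
        else:
            # Minus-strand reads extend leftward from their right end.
            frag_end = start + length
            frag_start = max(0, frag_end - span)
        first_bin = frag_start // bin_size
        last_bin = (frag_end - 1) // bin_size
        for b in range(first_bin, last_bin + 1):
            density[b * bin_size] += 1
    return density


reads = [(0, 36, "+", 1), (1000, 36, "-", 1), (50, 36, "+", 11)]
density = bin_density(reads)
```

In this toy input the third read is discarded because it has 11 best matches, so the bin starting at position 50 is covered only by the first read's 200 bp extension.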

Project description: The tracks show enrichment of RNA sequence tags generated by high throughput sequencing (RNA-seq) and mapped to the human genome. Double stranded cDNA was synthesized from polyadenylated RNA (polyA+). PCR amplified, adapter ligated cDNA, 150-300 nt long, was sequenced on an Illumina GA sequencer. Where designated, cell lines received specific treatments prior to RNA isolation. As indicated, K562 cells were treated with either interferon-α or interferon-γ for 30 minutes or 6 hours. These experiments were carried out in conjunction with ChIP-Seq experiments on the transcription factors STAT1 and STAT2 in order to examine the effects that inducers of a specific transcriptional response might have on gene expression and on transcription factor binding site discovery. K562 cells were treated with α-amanitin in order to examine the effects of RNA polymerase II inhibition on RNA polymerase III-mediated transcription. This track shows expression data generated as confirmation of the SYDH TFBS (http://genome-preview.ucsc.edu/cgi-bin/hgTrackUi?db=hg19&g=wgEncodeSydhTfbs) tracks currently available on genome-preview. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell). Total RNA was extracted using TRIzol reagents (15596-018, Life Tech), following the manufacturer's protocol. For polyA+ samples, polyadenylated RNA was purified using the MicroPoly(A) Purist kit (AM1919, Life Tech) and fragmented using RNA Fragmentation Reagent (AM8740, Life Tech). Illumina adapters were ligated to double stranded cDNA which was synthesized using reagents from Life Tech (11917-010). PCR amplified adapter ligated cDNA (150-300 bp) was sequenced using Illumina GA. Sequence reads 27-33 nt long with 0-2 mismatches were mapped to the genome. 
The signal height corresponds to the number of overlapping fragments at each nucleotide position in the genome. Samples originally mapped to the hg18 version of the human genome were remapped to hg19 using the BWA aligner, version 0.5.7.

Publications

Landscape of transcription in human cells.