
Dataset Information

RNA-seq from ENCODE/Caltech

ABSTRACT: This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (mailto:georgi@caltech.edu for data coordination/informatics/experimental questions, mailto:diane@caltech.edu for informatics questions, mailto:bawilli_91125@yahoo.com for experimental questions). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE Project.

RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly. RNA-seq is performed by reverse-transcribing an RNA sample into cDNA, followed by high-throughput DNA sequencing, which was done here on an Illumina Genome Analyzer (GAI or GAIIx) (Mortazavi et al., 2008). The transcriptome measurements shown on these tracks were performed on polyA-selected RNA (http://genome.ucsc.edu/cgi-bin/hgEncodeVocab?term=longPolyA&type=rnaExtract) from total cellular RNA (http://genome.ucsc.edu/cgi-bin/hgEncodeVocab?term=cell&type=localization) using two different protocols - one that preserves information about which strand the read is coming from and one that does not. Due to the specifics of the enzymology of library construction, gene and transcript quantification is more accurate with the non-strand-specific protocol, while the strand-specific protocol is useful for assigning strandedness but is in general less reliable for quantification.

Non-strand-specific protocol (deep "reference" transcriptome measurements, 2x75 bp reads): PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming and amplified. Data have been produced in two formats: single reads, each of which comes from one end of a cDNA molecule, and paired-end reads, which are obtained as pairs from both ends of cDNAs. This RNA-seq protocol does not specify the coding strand. As a result, there will be ambiguity at loci where both strands are transcribed. The "randomly primed" reverse transcription is, apparently, not fully random. This is inferred from a sequence bias in the first residues of the read population, and this bias likely contributes to the observed unevenness in sequence coverage across transcripts.

Strand-specific protocol (1x75 bp reads): PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis. 3' adapters were ligated to the 3' end of fragments, then 5' adapters were ligated to the 5' end. The resulting RNA molecules were converted to cDNA and amplified. This RNA-seq protocol does specify the coding strand, as each read is in the same 5'-3' orientation as the original RNA strand. As a result, loci where both strands are transcribed can be disambiguated. However, RNA ligation is an inherently biased process; as a result, greater unevenness in sequence coverage across transcripts is observed compared to the non-strand-specific data, and quantification is less accurate.

Data Analysis: Reads were aligned to the hg19 human reference genome using TopHat, a program specifically designed to align RNA-seq reads and discover splice junctions de novo. Cufflinks, a de novo transcript assembly and quantification software package, was run on the TopHat alignments to discover and quantify novel transcripts and to obtain transcript expression estimates based on the GENCODE annotation. All sequence files, alignments, gene and transcript models, and expression estimate files are available for download.

For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf

Experimental Procedures: Cells were grown according to the approved ENCODE cell culture protocols, except for H1-hESC, for which frozen cell pellets were purchased from Cellular Dynamics. Cells were lysed in RLT buffer (Qiagen RNeasy kit) and processed on RNeasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNase digestion step to remove residual genomic DNA. 75 µg of total RNA was selected twice with oligo-dT beads (Dynal) according to the manufacturer's protocol to isolate mRNA from each of the preparations. For 2x75 bp non-stranded RNA-seq, 100 ng of mRNA was then processed according to the protocol in Mortazavi et al. (2008), and prepared for sequencing on the Genome Analyzer flow cell according to the protocol for the ChIP-Seq DNA genomic DNA kit (Illumina). The majority of paired-end libraries were size-selected around 200 bp (fragment length), with the exception of a few additional replicates that were size-selected at 400 bp with the specific intent of investigating the effect of fragment length on results. Strand-specific RNA-seq libraries were prepared from 100 ng of mRNA from the same preparation following Illumina's Strand-Specific RNA-seq protocol. Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Reads of 75 bp length were obtained, single end for directional, strand-specific libraries (1x75D) and paired end for non-strand-specific libraries (2x75).

Data Processing and Analysis: Reads were mapped to the reference human genome (version hg19), with or without the Y chromosome, depending on the sex of the cell line, and without the random chromosomes and haplotypes in all cases, using TopHat (version 1.0.14). TopHat was used with default settings, with the exception of specifying an empirically determined mean inner-mate distance. After mapping reads to the genome and identifying splice junctions, the data were further analyzed using the transcript assembly and quantification software Cufflinks (version 0.9.3) with the sequence bias detection and correction option. Cufflinks was used in two modes: first, expression for genes and individual transcripts was quantified based on the GENCODE annotation, for both versions v3c and v4 of GENCODE GRCh37; second, Cufflinks was run in de novo transcript assembly and quantification mode to obtain candidate novel transcript and gene models and expression estimates for them.
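
The mean inner-mate distance mentioned above is the gap between the two reads of a pair, that is, the fragment length minus both read lengths. The following is a minimal sketch of that arithmetic, not the submitters' actual procedure; the function names and the sample of fragment lengths are hypothetical, and only the nominal sizes (200 bp and 400 bp fragments, 75 bp reads) come from the text.

```python
from statistics import mean


def inner_mate_distance(fragment_length: int, read_length: int) -> int:
    """Gap between the two mates: fragment length minus both read lengths."""
    return fragment_length - 2 * read_length


def mean_inner_distance(fragment_lengths, read_length: int) -> int:
    """Rounded average inner distance over a sample of observed fragment lengths."""
    return round(mean(inner_mate_distance(f, read_length) for f in fragment_lengths))


# Nominal library sizes from the text, sequenced as 2x75 bp.
print(inner_mate_distance(200, 75))  # 200 bp fragments
print(inner_mate_distance(400, 75))  # 400 bp fragments

# Hypothetical fragment lengths, e.g. taken from a pilot alignment.
sample_fragments = [190, 200, 205, 210, 195]
print(mean_inner_distance(sample_fragments, 75))
```

An empirical estimate like the last call is what one would pass to an aligner that expects a mean inner-mate distance; how the submitters derived theirs is not stated in the record.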

ORGANISM(S): Homo sapiens

SUBMITTER: ENCODE DCC

PROVIDER: E-GEOD-33480 | biostudies-arrayexpress |

REPOSITORIES: biostudies-arrayexpress


Publications

Landscape of transcription in human cells.

Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F, Xue C, Marinov GK, Khatun J, Williams BA, Zaleski C, Rozowsky J, Röder M, Kokocinski F, Abdelhamid RF, Alioto T, Antoshechkin I, Baer MT, Bar NS, Batut P, Bell K, Bell I, Chakrabortty S, Chen X, Chrast J, Curado J, Derrien T, Drenkow J, Dumais E, Dumais J, Duttagupta R, Falconnet E, Fastuca M, Fejes-Toth K, Ferreira P, Foissac S, Fullwood MJ, Gao H, Gonzalez D, Gordon A, Gunawardena H, Howald C, Jha S, Johnson R, Kapranov P, King B, Kingswood C, Luo OJ, Park E, Persaud K, Preall JB, Ribeca P, Risk B, Robyr D, Sammeth M, Schaffer L, See LH, Shahab A, Skancke J, Suzuki AM, Takahashi H, Tilgner H, Trout D, Walters N, Wang H, Wrobel J, Yu Y, Ruan X, Hayashizaki Y, Harrow J, Gerstein M, Hubbard T, Reymond A, Antonarakis SE, Hannon G, Giddings MC, Ruan Y, Wold B, Carninci P, Guigó R, Gingeras TR

Nature, 2012-09-01, issue 7414

Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification ...

PMID: 22955620

Similar Datasets

Project description: This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Carrie Davis mailto:davisc@cshl.edu (experimental), Alex Dobin mailto:dobin@cshl.edu (computational), Felix Schlesinger mailto:schlesin@cshl.edu (computational), Tom Gingeras mailto:gingeras@cshl.edu (primary investigator), and Roderic Guigo's group mailto:rguigo@imim.es at the CRG). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). These tracks were generated by the ENCODE Consortium. They contain information about human RNAs > 200 nucleotides in length obtained as short reads from the Illumina GAIIx platform. Data are available from biological replicates of several cell lines. In addition to profiling Poly-A+ and Poly-A- RNA from whole cells, we have also gathered data from various subcellular compartments. In many cases, Cap Analysis of Gene Expression (CAGE, RIKEN Institute), Small RNA-Seq (<200 nucleotides, CSHL) and Pair-End di-TAG-RNA (PET-RNA, Genome Institute of Singapore) datasets are available from the same biological replicates. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf We are using the published protocol http://www.ncbi.nlm.nih.gov/pubmed/19620212. This protocol generates directional libraries and reports the transcript's strand of origin. Exogenous RNA spike-ins (Round 5, pool 14), in development at the National Institute of Standards and Technology, were added to each endogenous RNA isolate and carried through library construction and sequencing. The Illumina PhiX control library was also spiked in at 1% to each completed human library just prior to cluster formation. Accompanying each RNA-Seq dataset is a "Production Document". This document contains details about the RNA isolations and treatments, library construction and spike-ins, as well as quality control figures for individual libraries. The spike-in sequences and concentrations are available for download in the supplemental directory. The libraries are sequenced on the Illumina platform to an average depth of ~200 million reads (100 million mate-pairs). The data are mapped against hg19 using Spliced Transcript Alignment and Reconstruction (STAR), written by Alex Dobin (CSHL). More information about STAR, including the parameters used for these data, can be found at: http://gingeraslab.cshl.edu/STAR/. Additionally, we provide the following processed "element" data files: de novo splice junctions, de novo transcripts, and contigs. These elements are assessed for reproducibility using a nonparametric irreproducible discovery rate (IDR) script. The IDR values for each element are included in the files for end-users to threshold on. An IDR value of 0.1 means that the probability of detecting that element in a third experiment equivalent in depth to the sum of the bioreplicates is 90%. In addition, we also compute expression values for annotated genes, transcripts and exons.
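
Thresholding on the per-element IDR values described above could look roughly like the sketch below. The record layout (a dictionary with "name" and "idr" keys) and the example values are hypothetical and do not reflect the actual file format; the 1 - IDR reading of reproducibility follows the interpretation given in the description.

```python
def filter_by_idr(elements, max_idr=0.1):
    """Keep elements whose IDR is at or below the chosen threshold."""
    return [e for e in elements if e["idr"] <= max_idr]


def reproducibility(idr):
    """Probability of re-detection, per the description's reading of IDR (1 - IDR)."""
    return 1.0 - idr


# Hypothetical elements; real files carry many more fields.
elements = [
    {"name": "junction_A", "idr": 0.01},
    {"name": "junction_B", "idr": 0.10},
    {"name": "contig_C", "idr": 0.35},
]

for e in filter_by_idr(elements, max_idr=0.1):
    print(e["name"], round(reproducibility(e["idr"]), 2))
```

A stricter cutoff such as max_idr=0.05 trades sensitivity for reproducibility; the description leaves that choice to the end user.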

Project description: This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Carrie Davis mailto:davisc@cshl.edu (experimental), Roderic Guigo mailto:rguigo@imim.es and lab (data processing) and Tom Gingeras mailto:gingeras@cshl.edu (primary investigator)). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). These tracks were generated by the ENCODE Consortia. They contain information about mouse RNAs > 200 nucleotides in length obtained as short reads from the Illumina platform. Data are available from biological replicates. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Tissue Samples: Individual tissues were harvested from mouse strain C57BL/6NJ at different timepoints according to ENCODE cell culture protocols. Whenever possible, biological replicates were taken from littermates. Library Preparation: The published cDNA sequencing protocol was used. This protocol generates directional libraries and reports the transcript's strand of origin. Exogenous RNA spike-ins were added to each endogenous RNA isolate and carried through library construction and sequencing. The spike-in sequences and concentrations are available for download in the supplemental directory. Sequencing and Mapping: The libraries were sequenced on the Illumina platform (either GAIIx or Hi-Seq) in mate-pair fashion (either paired-end 76 or paired-end 101) to an average depth of 100 million mate-pairs. The data were mapped against hg19 using Spliced Transcript Alignment and Reconstruction (STAR), written by Alex Dobin (CSHL). More information about STAR, including the parameters used for these data, is available from the Gingeras lab. Verification: FPKM (fragments per kilobase of exon per million fragments mapped) values were calculated for annotated exons and Spearman correlation coefficients were computed. In general, rho values are > 0.90 between biological replicates.
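
The verification step above combines two standard calculations: FPKM normalization and a Spearman rank correlation between replicates. The sketch below shows both in plain Python under stated assumptions; the fragment counts, exon lengths and library sizes are invented for illustration, and this is not the pipeline the submitters ran.

```python
def fpkm(fragments, exon_length_bp, total_mapped_fragments):
    """Fragments per kilobase of exon per million fragments mapped."""
    return fragments * 1e9 / (exon_length_bp * total_mapped_fragments)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical exons: (fragments in replicate 1, fragments in replicate 2, length in bp).
exons = [(500, 450, 1000), (20, 35, 400), (1200, 1500, 2500), (80, 60, 800)]
rep1 = [fpkm(a, length, 10_000_000) for a, _, length in exons]
rep2 = [fpkm(b, length, 12_000_000) for _, b, length in exons]
print(spearman(rep1, rep2))
```

Because Spearman works on ranks, it is insensitive to the overall scaling differences between libraries that FPKM only partly removes.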

Project description: This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Richard Sandstrom mailto:sull@u.washington.edu). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE Project. This track shows DNaseI sensitivity measured genome-wide in different cell lines using the Digital DNaseI methodology (see below), and DNaseI hypersensitive sites. DNaseI has long been used to map general chromatin accessibility, and DNaseI hypersensitivity is a universal feature of active cis-regulatory sequences. The use of this method has led to the discovery of functional regulatory elements that include enhancers, insulators, promoters, locus control regions and novel elements. For each experiment (cell type) this track shows DNaseI sensitivity as a continuous function using sequencing tag density (Raw Signal), and discrete loci of DNaseI sensitive zones (HotSpots) and hypersensitive sites (Peaks). For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Digital DNaseI was performed by DNaseI digestion of intact nuclei, isolating DNaseI 'double-hit' fragments as described in Sabo et al. (2006), and direct sequencing of fragment ends (which correspond to in vivo DNaseI cleavage sites) using the Solexa platform (36 bp reads). Uniquely mapping high-quality reads were mapped to the genome. DNaseI sensitivity is directly reflected in raw tag density (Raw Signal), which is shown in the track as the density of tags mapping within a 150 bp sliding window (at a 20 bp step across the genome). DNaseI sensitive zones (HotSpots) were identified using the HotSpot algorithm described in Sabo et al. (2004). 1.0% false discovery rate thresholds (FDR 0.01) were computed for each cell type by applying the HotSpot algorithm to an equivalent number of random uniquely mapping 36mers. DNaseI hypersensitive sites (DHSs or Peaks) were identified as signal peaks within FDR 1.0% hypersensitive zones using a peak-finding algorithm.
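
The Raw Signal described above is a count of tags in a 150 bp window moved in 20 bp steps. The sketch below illustrates only that counting step, assuming half-open windows and a hypothetical list of tag positions; the HotSpot algorithm, the FDR thresholds and the peak finder are not reproduced here.

```python
from bisect import bisect_left


def tag_density(tag_positions, region_start, region_end, window=150, step=20):
    """Count tags in each half-open window [start, start + window) across a region."""
    tags = sorted(tag_positions)
    densities = []
    for start in range(region_start, region_end - window + 1, step):
        first = bisect_left(tags, start)
        past_last = bisect_left(tags, start + window)
        densities.append((start, past_last - first))
    return densities


# Hypothetical cleavage positions on one chromosome.
tags = [10, 50, 149, 150, 169]
for start, count in tag_density(tags, 0, 200):
    print(start, count)
```

Sorting once and using binary search keeps each window lookup logarithmic in the number of tags, which matters when the region is a whole chromosome.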

Project description: RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly (Mortazavi et al., 2008). RNA-seq is performed by reverse-transcribing an RNA sample into cDNA, followed by high-throughput DNA sequencing, which was done here on the Illumina HiSeq sequencer. The transcriptome measurements shown on these tracks were performed on polyA-selected RNA (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?term=longPolyA&type=rnaExtract) from total cellular RNA (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?term=cell&type=localization). PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming and amplified. Paired-end 2x100 bp reads were obtained from each end of a cDNA fragment. Reads were aligned to the mm9 mouse reference genome using TopHat (Trapnell et al., 2009), a program specifically designed to align RNA-seq reads and discover splice junctions de novo. All sequence and alignment files are available at http://hgwdev.cse.ucsc.edu/cgi-bin/hgFileUi?db=mm9&g=wgEncodeCaltechRnaSeq. Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell/mouse). Cells were lysed in RLT buffer (Qiagen RNeasy kit) and processed on RNeasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNase digestion step to remove residual genomic DNA. A quantity of 75 µg of total RNA was selected twice with oligo-dT beads (Dynal) according to the manufacturer's protocol to isolate mRNA from each of the preparations. A quantity of 100 ng of mRNA was then processed according to the protocol in Mortazavi et al. (2008), and prepared for sequencing on the Illumina GAIIx or HiSeq platforms according to the protocol for the ChIP-Seq DNA genomic DNA kit (Illumina). Paired-end libraries were size-selected around 200 bp (fragment length). Libraries were sequenced with the Illumina HiSeq according to the manufacturer's recommendations. Paired-end reads of 100 bp length were obtained. Reads were mapped to the reference mouse genome (version mm9 with or without the Y chromosome, depending on the sex of the cell line, and without the random chromosomes in all cases) using TopHat (version 1.3.1) (http://tophat.cbcb.umd.edu/). TopHat was used with default settings, with the exception of specifying an empirically determined mean inner-mate distance and supplying known ENSEMBL version 63 splice junctions.
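
Choosing which reference sequences to map against, as described above (Y chromosome only for male cell lines, random chromosomes never), amounts to a simple name filter. The sketch below assumes UCSC-style mm9 sequence names in which unplaced sequences carry a "_random" suffix; the function name and the abbreviated name list are hypothetical.

```python
def reference_chromosomes(all_names, sex):
    """Select chromosomes to map against: no random contigs, chrY only for males."""
    selected = []
    for name in all_names:
        if "_random" in name:
            continue  # unplaced/random sequences are excluded in all cases
        if name == "chrY" and sex == "female":
            continue  # female cell lines are mapped without the Y chromosome
        selected.append(name)
    return selected


# Abbreviated, hypothetical subset of mm9 sequence names.
mm9_names = ["chr1", "chr2", "chrX", "chrY", "chrM", "chr1_random", "chrUn_random"]

print(reference_chromosomes(mm9_names, "male"))
print(reference_chromosomes(mm9_names, "female"))
```

Building a sex-specific index this way avoids spurious alignments to chrY in female samples.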
