Unknown,Transcriptomics,Genomics,Proteomics

Dataset Information


RNA-seq from ENCODE/Caltech


ABSTRACT: These data were generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (georgi@caltech.edu for data coordination/informatics/experimental questions, diane@caltech.edu for informatics questions, bawilli_91125@yahoo.com for experimental questions). If you have questions about the Genome Browser track associated with these data, contact ENCODE (genome@soe.ucsc.edu). This track is produced as part of the ENCODE Project.

RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly. RNA-seq is performed by reverse-transcribing an RNA sample into cDNA, followed by high-throughput DNA sequencing, which was done here on an Illumina Genome Analyzer (GAI or GAIIx) (Mortazavi et al., 2008). The transcriptome measurements shown on these tracks were performed on polyA-selected RNA (http://genome.ucsc.edu/cgi-bin/hgEncodeVocab?term=longPolyA&type=rnaExtract) from total cellular RNA (http://genome.ucsc.edu/cgi-bin/hgEncodeVocab?term=cell&type=localization) using two different protocols: one that preserves information about which strand the read comes from and one that does not. Because of the enzymology of library construction, gene and transcript quantification is more accurate with the non-strand-specific protocol, while the strand-specific protocol is useful for assigning strandedness but in general less reliable for quantification.

Non-strand-specific protocol (deep "reference" transcriptome measurements, 2x75 bp reads): PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming and amplified. Data have been produced in two formats: single reads, each of which comes from one end of a cDNA molecule, and paired-end reads, which are obtained as pairs from both ends of cDNAs. This RNA-seq protocol does not specify the coding strand. As a result, there will be ambiguity at loci where both strands are transcribed. The "randomly primed" reverse transcription is, apparently, not fully random. This is inferred from a sequence bias in the first residues of the read population, and this bias likely contributes to the observed unevenness in sequence coverage across transcripts.

Strand-specific protocol (1x75 bp reads): PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis. 3' adapters were ligated to the 3' end of fragments, then 5' adapters were ligated to the 5' end. The resulting RNA molecules were converted to cDNA and amplified. This RNA-seq protocol does specify the coding strand, as each read is in the same 5'-3' orientation as the original RNA strand. As a result, loci where both strands are transcribed can be disambiguated. However, RNA ligation is an inherently biased process; as a result, greater unevenness in sequence coverage across transcripts is observed compared to the non-strand-specific data, and quantification is less accurate.

Data Analysis: Reads were aligned to the hg19 human reference genome using TopHat, a program specifically designed to align RNA-seq reads and discover splice junctions de novo. Cufflinks, a de novo transcript assembly and quantification software package, was run on the TopHat alignments to discover and quantify novel transcripts and to obtain transcript expression estimates based on the GENCODE annotation. All sequence files, alignments, gene and transcript models and expression estimate files are available for download. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf

Experimental Procedures: Cells were grown according to the approved ENCODE cell culture protocols, except for H1-hESC, for which frozen cell pellets were purchased from Cellular Dynamics. Cells were lysed in RLT buffer (Qiagen RNeasy kit) and processed on RNeasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNase digestion step to remove residual genomic DNA. 75 µg of total RNA was selected twice with oligo-dT beads (Dynal) according to the manufacturer's protocol to isolate mRNA from each of the preparations. For 2x75 bp non-stranded RNA-seq, 100 ng of mRNA was then processed according to the protocol in Mortazavi et al. (2008), and prepared for sequencing on the Genome Analyzer flow cell according to the protocol for the ChIPSeq DNA genomic DNA kit (Illumina). The majority of paired-end libraries were size-selected around 200 bp (fragment length), with the exception of a few additional replicates that were size-selected at 400 bp with the specific intent of investigating the effect of fragment length on results. Strand-specific RNA-seq libraries were prepared from 100 ng of mRNA from the same preparation following Illumina's Strand-Specific RNA-seq protocol. Libraries were sequenced with an Illumina Genome Analyzer I or an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Reads of 75 bp length were obtained, single end for directional, strand-specific libraries (1x75D) and paired end for non-strand-specific libraries (2x75).

Data Processing and Analysis: Reads were mapped to the reference human genome (version hg19), with or without the Y chromosome, depending on the sex of the cell line, and without the random chromosomes and haplotypes in all cases, using TopHat (version 1.0.14). TopHat was used with default settings with the exception of specifying an empirically determined mean inner-mate distance. After mapping reads to the genome and identifying splice junctions, the data were further analyzed using the transcript assembly and quantification software Cufflinks (version 0.9.3) with the sequence bias detection and correction option. Cufflinks was used in two modes: first, expression for genes and individual transcripts was quantified based on the GENCODE annotation, for both versions v3c and v4 of GENCODE GRCh37; second, Cufflinks was run in de novo transcript assembly and quantification mode to obtain candidate novel transcript and gene models and expression estimates for them.
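
The reference selection and the inner-mate distance mentioned under Data Processing and Analysis can be illustrated with a short Python sketch. This is not the submitters' pipeline. The function names are hypothetical, the substring tests assume UCSC-style hg19 sequence names (for example chr1_gl000191_random or chr6_apd_hap1), and the distance computed here is only the nominal value (fragment length minus both read lengths), whereas the record states that the value actually used was determined empirically. The record does not say how unplaced chrUn sequences were handled, so the sketch leaves them untouched.

```python
from typing import Iterable, List


def select_reference_chromosomes(names: Iterable[str], sex: str) -> List[str]:
    """Return the sequence names to keep in the mapping reference.

    Assumption: "random" and haplotype sequences are recognised by the
    UCSC hg19 naming patterns "_random" and "_hap". chrY is kept only
    for male cell lines, following the description in the record.
    """
    if sex not in ("male", "female"):
        # The record only distinguishes by sex; other cases are not covered.
        raise ValueError("sex must be 'male' or 'female'")

    kept = []
    for name in names:
        lowered = name.lower()
        if "_random" in lowered or "_hap" in lowered:
            continue  # dropped in all cases
        if name == "chrY" and sex == "female":
            continue  # Y chromosome omitted for female cell lines
        kept.append(name)
    return kept


def nominal_inner_mate_distance(fragment_length: int, read_length: int) -> int:
    """Nominal distance between the inner ends of two paired reads.

    Assumption: fragment_length excludes adapters. This is a starting
    estimate only, not the empirically determined value from the record.
    """
    return fragment_length - 2 * read_length


if __name__ == "__main__":
    example = ["chr1", "chrX", "chrY", "chrM",
               "chr1_gl000191_random", "chr6_apd_hap1"]
    print(select_reference_chromosomes(example, "female"))
    # 2x75 bp reads from libraries size-selected around 200 bp
    print(nominal_inner_mate_distance(200, 75))
```

Under these assumptions, a library size-selected at 200 bp with 2x75 bp reads has a nominal inner distance of 50 bp, and the 400 bp replicates a nominal 250 bp. An empirical estimate from the aligned pairs would replace these values in practice.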

ORGANISM(S): Homo sapiens

SUBMITTER: ENCODE DCC 

PROVIDER: E-GEOD-33480 | biostudies-arrayexpress

REPOSITORIES: biostudies-arrayexpress


Publications

Landscape of transcription in human cells.

Sarah Djebali, Carrie A. Davis, Angelika Merkel, Alex Dobin, Timo Lassmann, Ali Mortazavi, Andrea Tanzer, Julien Lagarde, Wei Lin, Felix Schlesinger, Chenghai Xue, Georgi K. Marinov, Jainab Khatun, Brian A. Williams, Chris Zaleski, Joel Rozowsky, Maik Röder, Felix Kokocinski, Rehab F. Abdelhamid, Tyler Alioto, Igor Antoshechkin, Michael T. Baer, Nadav S. Bar, Philippe Batut, Kimberly Bell, Ian Bell, Sudipto Chakrabortty, Xian Chen, Jacqueline Chrast, Joao Curado, Thomas Derrien, Jorg Drenkow, Erica Dumais, Jacqueline Dumais, Radha Duttagupta, Emilie Falconnet, Meagan Fastuca, Kata Fejes-Toth, Pedro Ferreira, Sylvain Foissac, Melissa J. Fullwood, Hui Gao, David Gonzalez, Assaf Gordon, Harsha Gunawardena, Cedric Howald, Sonali Jha, Rory Johnson, Philipp Kapranov, Brandon King, Colin Kingswood, Oscar J. Luo, Eddie Park, Kimberly Persaud, Jonathan B. Preall, Paolo Ribeca, Brian Risk, Daniel Robyr, Michael Sammeth, Lorian Schaffer, Lei-Hoon See, Atif Shahab, Jorgen Skancke, Ana Maria Suzuki, Hazuki Takahashi, Hagen Tilgner, Diane Trout, Nathalie Walters, Huaien Wang, John Wrobel, Yanbao Yu, Xiaoan Ruan, Yoshihide Hayashizaki, Jennifer Harrow, Mark Gerstein, Tim Hubbard, Alexandre Reymond, Stylianos E. Antonarakis, Gregory Hannon, Morgan C. Giddings, Yijun Ruan, Barbara Wold, Piero Carninci, Roderic Guigó, Thomas R. Gingeras

Nature, 2012-09-01, issue 7414


Eukaryotic cells make many types of primary and processed RNAs that are found either in specific subcellular compartments or throughout the cells. A complete catalogue of these RNAs is not yet available and their characteristic subcellular localizations are also poorly understood. Because RNA represents the direct output of the genetic information encoded by genomes and a significant proportion of a cell's regulatory capabilities are focused on its synthesis, processing, transport, modification ...

Similar Datasets

2012-07-12 | GSE33480 | GEO
2011-07-13 | E-GEOD-30567 | biostudies-arrayexpress
2012-05-10 | GSE37909 | GEO
2012-04-02 | E-GEOD-36025 | biostudies-arrayexpress
2011-09-29 | GSE32465 | GEO
2011-07-13 | GSE30567 | GEO
2011-11-10 | GSE33600 | GEO
2011-06-03 | E-GEOD-29692 | biostudies-arrayexpress
2012-05-09 | E-GEOD-37909 | biostudies-arrayexpress
2012-04-27 | E-GEOD-35584 | biostudies-arrayexpress