Project description: This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Yijun Ruan mailto:ruanyj@gis.a-star.edu.sg). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE Transcriptome Project. It shows the starts and ends of full length mRNA transcripts determined by GIS paired-end ditag (PET) sequencing using RNA extracts (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=rnaExtract) from different sub-cellular localizations (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=localization) in different cell lines (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?type=cellType). The RNA-PET information provided in this track comes in two PET length versions, depending on how the PETs were extracted. The cloning-based PET (18 bp and 16 bp) is the earlier version; detailed information can be found in Ng et al. (2006). The cloning-free PET (25 bp and 25 bp) is a recently modified version which uses the Type II enzyme EcoP15I to generate a longer PET (unpublished), which results in a significant enhancement in both library construction and mapping efficiency. Both versions of PET templates were sequenced on the Illumina platform using 2 x 36 bp paired-end sequencing. See the Methods and References sections below for more details. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf. Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell).
Two different GIS RNA-PET protocols were used to generate the full length transcriptome PETs: one is based on a cloning-free RNA-PET library construction and sequencing strategy (unpublished), and the other on cloning-based library construction (Ng et al. 2005) and recent Illumina paired-end sequencing. Cloning-free RNA-PET (50 bp reads, 25 bp and 25 bp tag for each of the 5' and 3' ends)--Method: The cloning-free RNA-PET libraries were generated from polyA mRNA samples and constructed using a recently modified GIS protocol (unpublished). Good-quality total RNA was used as starting material and purified through a MACs polyT column to obtain full length polyA mRNAs. Approximately 5 micrograms of enriched polyA mRNA were used for reverse transcription to convert polyA mRNA to full length cDNA. The obtained full length cDNA was modified and ligated with specific linker sequences, followed by circularization through ligation to generate circular cDNA molecules. The 25 bp tag from each end of the full length cDNA was extracted by digestion with the type II enzyme EcoP15I. The resulting PETs were ligated with sequencing adaptors at both ends, amplified by PCR, and further purified as complex templates for paired-end (PE) sequencing using Illumina platforms. Data: The sequenced RNA-PETs have a uniform length of 25 bp from each end of a cDNA. After redundant and noise tags are filtered out, the unique PETs proceed to the analysis pipeline. First, the orientation of each tag is determined from the barcode built into the sequencing template, and the tags are then paired into an orientation-determined PET. Each orientation-determined RNA-PET is mapped onto the reference genome, allowing up to two mismatches. The majority of PETs map to known transcripts or splice variants.
A small portion of misaligned PETs, defined as discordant PETs, are mapped either too far from each other, in the wrong orientation, or to different chromosomes, indicating that some transcription variations exist which could be caused by genome structure variations: such as fusion, deletion, insertion, inversion, tandem repeat and translocation; or RNA trans-splicing etc. Cloning-based RNA-PET (34 bp reads, 18 bp and 16 bp tag for each of the 5' and 3' ends)--Method: The cloning-based RNA-PET (GIS-PET) libraries were generated from polyA RNA samples and constructed using the protocol described by Ng et al. (2005). Good-quality total RNA was used as starting material and further purified through a MACs polyT column to enrich polyA mRNA and remove any contaminants (e.g., rRNA, tRNA, DNA, protein). Approximately 10 micrograms of polyA mRNA were then used for reverse transcription to convert polyA mRNA into full length cDNA. The obtained full length cDNA was modified with specific linker sequences, then ligated to a GIS-developed vector (pGIS4) to form a complex full length cDNA library, which was cloned into E. coli. The plasmid DNA was then isolated from the library, followed by digestion with MmeI (a type II enzyme) to generate ditags with a final length of 18 bp/16 bp from each end of the full length cDNA. Single ditags (also called PETs) were then ligated to form a diPET structure (a concatemer of two unrelated PETs joined by a linker sequence) to facilitate Illumina paired-end sequencing. Data: The cloning-based RNA-PETs have uniform lengths of 18 bp and 16 bp, extracted from the 5' and 3' ends of each cDNA, respectively. Redundant reads were filtered out first, and only unique reads were included in the analysis. PET sequences were then mapped to the reference genome (GRCh37, hg19, excluding mitochondrion, haplotypes, randoms and chromosome Y) using the following specific criteria (Ruan et al.
2007): A minimal continuous 16 bp match must exist for the 5' signature; the 3' signature must have a minimal continuous 14 bp match. Both 5' and 3' signatures must be present on the same chromosome. Their 5' to 3' orientation must be correct (5' signature followed by 3' signature). The maximal genomic span of a PET genomic alignment must be less than one million bp. PETs mapping to 2-10 locations are also included and may represent duplicated genes or pseudogenes in the genome. The majority of PETs mapped to known transcripts or splice variants. A small portion of misaligned PETs, defined as discordant PETs, were mapped either too far from each other, in the wrong orientation, or to different chromosomes, indicating that some transcription variations exist which could be caused by genome structure variations: such as fusion, deletion, insertion, inversion, tandem repeat and translocation; or RNA trans-splicing etc. Clusters: To cluster the PETs, the following procedure was applied: the mapping location of the 5' and 3' tag of a given PET was extended by 100 bp in both directions, creating 5' and 3' search windows. If the 5' and 3' tags of a second PET mapped within the 5' and 3' search windows of the first PET, respectively, then the two PETs were clustered and the search windows were adjusted so that they contained the tag extensions of the second PET. PETs which subsequently mapped with their 5' and 3' tags within the adjusted 5' and 3' search windows, respectively, were also assigned to this cluster and the search windows were readjusted. This iterative process continued until no new PET was found to fall within the search windows, at which stage all the PETs found were classified as belonging to a single cluster. This process was repeated until all PETs were assigned to a cluster. Verification: To assess overall PET quality and mapping specificity, the top ten most abundant PET clusters that mapped to well-characterized known genes were examined.
Over 99% of the PETs represented full-length transcripts, and the majority fell within 10 bp of the known 5' and 3' boundaries of these transcripts. The PET mapping was further verified by confirming the existence of physical cDNA clones represented by the ditags. PCR primers were designed based on the PET sequences and used to amplify the corresponding cDNA inserts, either from the full length cDNA library (cloning-based PET) or from the total RNA isolate (cloning-free PET), for sequencing confirmation.
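
The clustering procedure described under "Clusters" can be sketched in Python as follows. This is a minimal illustration, not the submitting laboratory's pipeline. The `Pet` record, the function names, and the coordinate representation are hypothetical. The sketch also assumes that "mapped within" means a tag lies entirely inside a search window, and that only PETs on the same chromosome and strand can be clustered; the description above does not state either point.

```python
from dataclasses import dataclass
from typing import List, Tuple

EXTENSION = 100  # bp added on each side of a tag, as stated in the description

Interval = Tuple[int, int]  # (start, end) genomic coordinates of one tag


@dataclass(frozen=True)
class Pet:
    """Hypothetical record for one mapped PET."""
    chrom: str
    strand: str
    five: Interval   # mapped location of the 5' tag
    three: Interval  # mapped location of the 3' tag


def extend(tag: Interval, ext: int = EXTENSION) -> Interval:
    """Extend a tag by `ext` bp in both directions to form a search window."""
    return (max(0, tag[0] - ext), tag[1] + ext)


def within(tag: Interval, window: Interval) -> bool:
    """Assumption: a tag is 'within' a window if fully contained in it."""
    return window[0] <= tag[0] and tag[1] <= window[1]


def union(a: Interval, b: Interval) -> Interval:
    """Smallest interval covering both inputs."""
    return (min(a[0], b[0]), max(a[1], b[1]))


def cluster_pets(pets: List[Pet]) -> List[List[Pet]]:
    """Group PETs by iteratively growing 5' and 3' search windows."""
    remaining = list(pets)
    clusters = []
    while remaining:
        seed = remaining.pop(0)
        members = [seed]
        win5, win3 = extend(seed.five), extend(seed.three)
        grew = True
        # Rescan until a full pass adds no new PET to the cluster.
        while grew:
            grew = False
            unassigned = []
            for pet in remaining:
                same_locus = (pet.chrom, pet.strand) == (seed.chrom, seed.strand)
                if same_locus and within(pet.five, win5) and within(pet.three, win3):
                    members.append(pet)
                    # Adjust windows to contain the new PET's tag extensions.
                    win5 = union(win5, extend(pet.five))
                    win3 = union(win3, extend(pet.three))
                    grew = True
                else:
                    unassigned.append(pet)
            remaining = unassigned
        clusters.append(members)
    return clusters
```

The rescan in the inner loop matters: a PET that falls outside the seed's initial windows can still join the cluster once another member has widened them, which mirrors the iterative window readjustment in the description.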
2011-11-10 | E-GEOD-33600 | biostudies-arrayexpress