Project description:This track is produced as part of the ENCODE Project. It shows high throughput sequencing of RNA samples from tissues or sub cellular compartments from cell lines included in the ENCODE Transcriptome subproject. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf The RNA-Seq data were generated from high quality polyA RNA, and the RNA-Seq libraries were constructed using SOLiD Whole Transcriptome (WT) protocol and reagent kit. Total RNA in good quality was used as starting materials and purified twice through MACs polyT column aimed to enrich polyA and remove any contaminants (e.g., rRNA, tRNA, DNA, protein etc.). A one microgram enriched polyA RNA sample was then fragmented to small pieces, and a gel-based selection method was performed to collect fragmented random polyA at a size-range of 50-150 nt in length. The collected fragmental RNA was then hybridized and ligated to a mix of adapters provided from ABI, followed by reverse transcription to generate corresponding cDNAs. The resulting cDNA library was further amplified by PCR and sequenced by SOLiD platform for single reads at 35 bp length (new version in 50 bp length). Cells were grown according to the approved ENCODE cell culture protocols. Data: The SOLiD-generated RNA-Seq reads were 35 bp in length. An initial filtering process was performed to remove any non-desirable contamination sequences, such as rRNA, tRNA, and repeats etc. A read-split mapping approach was developed to map the 35 bp reads onto the reference genome (GRCh37/hg19) excluding mitochondrion, haplotypes, randoms and chromosome Y. Mapping parameters: Strand specific mapping was done using Applied Biosystems' SOLiD alignment where all the reads were mapped to the genome, and to exon-exon junction database. Seed and extend strategy is adopted where initial seed length of 25 is mapped with maximum of 2 mismatches and then extended to read length, each color space match is awarded a score of +1 and each mismatch is awarded a penalty of -2. After extension each read is trimmed to its maximum score, shortest length. The color space sequences are then converted into base space and checked to ensure that each sequence has a maximum of 2 base pair mismatches. If any sequence has more than 2 mismatches, then that sequence is discarded.

Project description:This track depicts high throughput sequencing of long RNAs (>200 nt) from RNA samples from tissues or subcellular compartments from ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Sample preparation and sequencing: K562 and GM12878 total cell, total RNA: Standard Illumina Pair-end kit with the sole exception that a "tagged" random hexamer was used to prime the 1st strand synthesis: 5'-ACTGTAGGN6-3'. The addition of this tag is what permits us to make strand assignments for the reads. The sequence of the tag is reported in the 5' end of the read. Asymmetric PCR can place the tag on either the 1st or 2nd read depending on which strand it used as a template. Strand assignments are made by looking for the tag at the 5' end of either read 1 or read 2. Read 1 is physically linked to read 2. Therefore, if a tag is present on one end strand assignments are made for both ends. We noted during analysis that the tags are generally 5' truncated. We only "strand" reads that contain ACTGTAGG, CTGTAGG, TGTAGG, GTAGG. Between 63-68% of reads could be stranded in these libraries. It is possible to cull additional stranded reads that contain non-templated TAGG, AGG, GG, or G sequences at their 5' end. The peak in insert size distribution is between 200-250 nucleotides. K562 cytosol, polyA+ RNA: Oligo-dT selected poly-A+ RNA was RiboMinus-treated according to the manufacturer's protocol (Invitrogen). The RNA was treated with tobacco alkaline pyrophosphatase to eliminate any 5' cap structures and hydrolyzed to ~200 bases via alkaline hydrolysis. The 3' end was repaired using calf intestinal alkaline phosphatase, and poly-A polymerase was used to catalyze the addition of Cs to the 3' end. The 5' end was phosphorylated using T4 PNK, and an RNA linker was ligated onto the 5' end. Reverse transcription was carried out using a poly-G oligo with a defined 5' extension. The inserts were then amplified using oligos targeting the 5' linker and poly-G extension. This cloning protocol generated stranded reads that were read from the 5' ends of the inserts. The library was sequenced on a Solexa platform for a total of 36 cycles; however, the reads underwent post-processing, resulting in trimming of their 3' ends. Consequently, the mapped read lengths are variable. Analysis: K562 and GM12878 total cell, total RNA: Tags were removed from the 5' ends of the reads in accordance to their lengths and strand assignments made. Subsequently, the reads were trimmed from their 3' ends to a final length of 50 nucleotides and were mapped using NexAlign, a program developed by Timo Lassman, RIKEN. We allowed up to 2 mismatches across the entire length and only report reads that mapped to a single/unique locus in the assembled hg18 genome. K562 cytosol, polyA+ RNA: Reads were mapped to the human (hg18, March 2006) assembly using Nexalign, with only uniquely mapping (one loci), exactly matching (no mis-matches) aligned reads reported in the processed files, as follows: 1) Collect the read sequences from Illumina non-filtered output files. 2) Filter out all reads that contain undefined nucleotides ('N'). 3) Perform iterative alignment/C-tail chopping algorithm (below). On each alignment step, the reads are aligned to the genome with 100% identity. All reads that align to a single locus are withdrawn from the alignment pool and only the reads that could not be aligned continue to the next step. a) Align to the hg18 genome using Nexalign 1.3.3 (© Timo Lassmann) without chopping off any nucleotides. b) Chop off any C-blocks (until the first non-C) at the ends of the reads. c) Align to the genome -> remove and save those that align. d) Chop off any non-Cs until the next C. e) Chop off C-block until the next non-C. f) Align to the genome -> remove and save those that align. g) Repeat steps d, e, and f until the reads align to the genome, or chopping results in the reduction of the reads' lengths to below 16 (default), or there are no non-Cs left.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Florencia Pauli mailto:fpauli@hudsonalpha.org). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE project. The track reports the percentage of DNA molecules that exhibit cytosine methylation at specific CpG dinucleotides. In general, DNA methylation within a gene's promoter is associated with gene silencing, and DNA methylation within the exons and introns of a gene is associated with gene expression. Proper regulation of DNA methylation is essential during development and aberrant DNA methylation is a hallmark of cancer. DNA methylation status is assayed at more than 500,000 CpG dinucleotides in the genome using Reduced Representation Bisulfite Sequencing (RRBS). Genomic DNA is digested with the methyl-insensitive restriction enzyme MspI, small genomic DNA fragments are purified by gel electrophoresis, and then used to construct an Illumina sequencing library. The library fragments are treated with sodium bisulfite and amplified by PCR to convert every unmethylated cytosine to a thymidine while leaving methylated cytosines intact. The sequenced fragments are aligned to a customized reference genome sequence and for each assayed CpG we report the number of sequencing reads covering that CpG and the percentage of those reads that are methylated. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf DNA methylation at CpG sites was assayed with a modified version of Reduced Representation Bisulfite Sequencing (RRBS; Meissner et al., 2008). RRBS was performed on cell lines grown by many ENCODE production groups. The production group that grew the cells and isolated genomic DNA is indicated in the "obtainedBy" field of the metadata. When a cell type was provided by more than one lab, the data for the cells from only one lab are displayed in the table above. However, the data for every cell type from every lab is available from the Downloads page. RRBS was carried out by the Myers production group at the HudsonAlpha Institute for Biotechnology. Isolation of genomic DNA Genomic DNA is isolated from biological replicates of each cell line using the QIAGEN DNeasy Blood & Tissue Kit according to the instructions provided by the manufacturer. DNA concentrations for each genomic DNA preparation are determined using fluorescent DNA binding dye and a fluorometer (Invitrogen Quant-iT dsDNA High Sensitivity Kit and Qubit Fluorometer). Typically, 1 µg of DNA is used to make an RRBS library; however, we have also had success in making libraries with 200 ng genomic DNA from rare or precious samples. RRBS library construction and sequencing RRBS library construction starts with MspI digestion of genomic DNA , which cuts at every CCGG regardless of methylation status. Klenow exo- DNA Polymerase is then used to fill in the recessed end of the genomic DNA and add an adenosine as a 3prime overhang. Next, a methylated version of the Illumina paired-end adapters is ligated onto the DNA. Adapter ligated genomic DNA fragments between 105 and 185 basepairs are selected using agarose gel electrophoresis and Qiagen Qiaquick Gel Extraction Kit. The selected adapter-ligated fragments are treated with sodium bisulfite using the Zymo Research EZ DNA Methylation Gold Kit, which converts unmethylated cytosines to uracils and leaves methylated cytosines unchanged. Bisulfite treated DNA is amplified in a final PCR reaction which has been optimized to uniformly amplify diverse fragment sizes and sequence contexts in the same reaction. During this final PCR reaction uracils are copied as thymines resulting in a thymine in the PCR products wherever an unmethylated cytosine existed in the genomic DNA. The sample is now ready for sequencing on the Illumina sequencing platform. These libraries were sequenced with an Illumina Genome Analyzer IIx according to the manufacturer's recommendations. Data analysis To analyze the sequence data, a reference genome is created that contains only the 36 base pairs adjacent to every MspI site and every C in those sequences is changed to T. A converted sequence read file is then created by changing each C in the original sequence reads to a T. The converted sequence reads are aligned to the converted reference genome, and only reads that map uniquely to the reference genome are kept. Once reads are aligned the percent methylation is calculated for each CpG using the original sequence reads. The percent methylation and number of reads is reported for each CpG.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Richard Sandstrom mailto:sull@u.washington.edu). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE Project. This track shows DNaseI sensitivity measured genome-wide in different cell lines using the Digital DNaseI methodology (see below), and DNaseI hypersensitive sites. DNaseI has long been used to map general chromatin accessibility and DNaseI hypersensitivity is a universal feature of active cis-regulatory sequences. The use of this method has led to the discovery of functional regulatory elements that include enhancers, insulators, promotors, locus control regions and novel elements. For each experiment (cell type) this track shows DNaseI sensitivity as a continuous function using sequencing tag density (Raw Signal), and discrete loci of DNaseI sensitive zones (HotSpots) and hypersensitive sites (Peaks)." For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Digital DNaseI was performed by DNaseI digestion of intact nuclei, isolating DNaseI 'double-hit' fragments as described in Sabo et al. (2006), and direct sequencing of fragment ends (which correspond to in vivo DNaseI cleavage sites) using the Solexa platform (36 bp reads). Uniquely mapping high-quality reads were mapped to the genome. DNaseI sensitivity is directly reflected in raw tag density (Raw Signal), which is shown in the track as density of tags mapping within a 150 bp sliding window (at a 20 bp step across the genome). DNaseI sensitive zones (HotSpots) were identified using the HotSpot algorithm described in Sabo et al. (2004). 1.0% false discovery rate thresholds (FDR 0.01) were computed for each cell type by applying the HotSpot algorithm to an equivalent number of random uniquely mapping 36mers. DNaseI hypersensitive sites (DHSs or Peaks) were identified as signal peaks within FDR 1.0% hypersensitive zones using a peak-finding algorithm.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Yijun Ruan mailto:ruanyj@gis.a-star.edu.sg). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track, produced as part of the ENCODE Project, contains deep sequencing DNase data that will be used to identify sites where regulatory factors bind to the genome (footprints). Footprinting is a technique used to define the DNA sequences that interact with and bind DNA-binding proteins, such as transcription factors, zinc-finger proteins, hormone-receptor complexes, and other chromatin-modulating factors like CTCF. The technique depends upon the strength and tight nature of protein-DNA interactions. In their native chromatin state, DNA sequences that interact directly with DNA-binding proteins are relatively protected from DNA degrading endonucleases, while the exposed/unbound portions are readily degraded by such endonucleases. A massively parallel next-generation sequencing technique to define the DNase hypersensitive sites in the genome was adopted. Sequencing these next-generation-sequencing DNase samples to significantly higher depths of 300-fold or greater produces a base-pair level resolution of the DNase susceptibility maps of the native chromatin state. These base-pair resolution maps represent and are dependent upon the nature and the specificity of interaction of the DNA with the regulatory/modulatory proteins binding at specific loci in the genome; thus they represent the native chromatin state of the genome under investigation. The deep sequencing approach has been used to define the footprint landscape of the genome by identifying DNA motifs that interact with known or novel DNA binding proteins. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Digital DNaseI was performed by DNaseI digestion of intact nuclei, followed by isolating DNaseI 'double-hit' fragments as described in Sabo et al. (2006), and direct sequencing of fragment ends (which correspond to in vivo DNaseI cleavage sites) using the Solexa platform (27 bp reads). High-quality reads were mapped to the GRCh37/hg19 human genome using Bowtie 0.12.5 (Eland was used to map to NCBI36/hg18); only unique mappings were kept. DNaseI sensitivity is directly reflected in raw tag density (Signal), which is shown in the track as density of tags mapping within a 150 bp sliding window (at a 20 bp step across the genome). DNaseI hypersensitive zones (HotSpots) were identified using the HotSpot algorithm described in Sabo et al. (2004). False discovery rate thresholds of 1.0% (FDR 0.01) were computed for each cell type by applying the HotSpot algorithm to an equivalent number of random uniquely mapping 36-mers. DNaseI hypersensitive sites (DHSs or Peaks) were identified as signal peaks within 1.0% (FDR 0.01) hypersensitive zones using a peak-finding algorithm. Only DNase Solexa libraries from unique cell types producing the highest quality data, as defined by Percent Tags in Hotspots (PTIH ~40%) were designated for deep sequencing to a depth of over 200 million tags.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Richard Sandstrom mailto:sull@u.washington.edu). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track, produced as part of the ENCODE Project, contains deep sequencing DNase data that will be used to identify sites where regulatory factors bind to the genome (footprints). Footprinting is a technique used to define the DNA sequences that interact with and bind DNA-binding proteins, such as transcription factors, zinc-finger proteins, hormone-receptor complexes, and other chromatin-modulating factors like CTCF. The technique depends upon the strength and tight nature of protein-DNA interactions. In their native chromatin state, DNA sequences that interact directly with DNA-binding proteins are relatively protected from DNA degrading endonucleases, while the exposed/unbound portions are readily degraded by such endonucleases. A massively parallel next-generation sequencing technique to define the DNase hypersensitive sites in the genome was adopted. Sequencing these next-generation-sequencing DNase samples to significantly higher depths of 300-fold or greater produces a base-pair level resolution of the DNase susceptibility maps of the native chromatin state. These base-pair resolution maps represent and are dependent upon the nature and the specificity of interaction of the DNA with the regulatory/modulatory proteins binding at specific loci in the genome; thus they represent the native chromatin state of the genome under investigation. The deep sequencing approach has been used to define the footprint landscape of the genome by identifying DNA motifs that interact with known or novel DNA binding proteins. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Digital DNaseI was performed by DNaseI digestion of intact nuclei, followed by isolating DNaseI 'double-hit' fragments as described in Sabo et al. (2006), and direct sequencing of fragment ends (which correspond to in vivo DNaseI cleavage sites) using the Solexa platform (27 bp reads). High-quality reads were mapped to the GRCh37/hg19 human genome using Bowtie 0.12.5 (Eland was used to map to NCBI36/hg18); only unique mappings were kept. DNaseI sensitivity is directly reflected in raw tag density (Signal), which is shown in the track as density of tags mapping within a 150 bp sliding window (at a 20 bp step across the genome). DNaseI hypersensitive zones (HotSpots) were identified using the HotSpot algorithm described in Sabo et al. (2004). False discovery rate thresholds of 1.0% (FDR 0.01) were computed for each cell type by applying the HotSpot algorithm to an equivalent number of random uniquely mapping 36-mers. DNaseI hypersensitive sites (DHSs or Peaks) were identified as signal peaks within 1.0% (FDR 0.01) hypersensitive zones using a peak-finding algorithm. Only DNase Solexa libraries from unique cell types producing the highest quality data, as defined by Percent Tags in Hotspots (PTIH ~40%) were designated for deep sequencing to a depth of over 200 million tags.

Project description:RNA-seq is a method for mapping and quantifying the transcriptome of any organism that has a genomic DNA sequence assembly (Mortazavi et al., 2008). RNA-seq is performed by reverse-transcribing an RNA sample into cDNA, followed by high-throughput DNA sequencing, which was done here on the Illumina HiSeq sequencer. The transcriptome measurements shown on these tracks were performed on polyA selected RNA (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?term=longPolyA&type=rnaExtract) from total cellular RNA (http://hgwdev.cse.ucsc.edu/cgi-bin/hgEncodeVocab?term=cell&type=localization). PolyA-selected RNA was fragmented by magnesium-catalyzed hydrolysis and then converted into cDNA by random priming and amplified. Paired-end 2x100 bp reads were obtained from each end of a cDNA fragment. Reads were aligned to the mm9 human reference genome using TopHat (Trapnell et al., 2009), a program specifically designed to align RNA-seq reads and discover splice junctions de novo. All sequence and alignments files are available at http://hgwdev.cse.ucsc.edu/cgi-bin/hgFileUi?db=mm9&g=wgEncodeCaltechRnaSeq. Cells were grown according to the approved ENCODE cell culture protocols (http://hgwdev.cse.ucsc.edu/ENCODE/protocols/cell/mouse). Cells were lysed in RLT buffer (Qiagen RNEasy kit), and processed on RNEasy midi columns according to the manufacturer's protocol, with the inclusion of the "on-column" DNAse digestion step to remove residual genomic DNA. A quantity of 75 µgs of total RNA was selected twice with oligo-dT beads (Dynal) according to the manufacturer's protocol to isolate mRNA from each of the preparations. A quantity of 100 ngs of mRNA was then processed according to the protocol in Mortazavi et al. (2008), and prepared for sequencing on the Illumina GAIIx or HiSeq platforms according to the protocol for the ChIP-Seq DNA genomic DNA kit (Illumina). Paired-end libraries were size-selected around 200 bp (fragment length). Libraries were sequenced with the Illumina HiSeq according to the manufacturer's recommendations. Paired-end reads of 100 bp length were obtained. Reads were mapped to the reference mouse genome (version mm9 with or without the Y chromosome, depending on the sex of the cell line, and without the random chromosomes in all cases) using TopHat (version 1.3.1) (http://tophat.cbcb.umd.edu/). TopHat was used with default settings with the exception of specifying an empirically determined mean inner-mate distance and supplying known ENSEMBL version 63 splice junctions.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Richard Sandstrom mailto:sull@u.washington.edu). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). This track is produced as part of the ENCODE Project. This track displays maps of genome-wide binding of the CTCF transcription factor in different cell lines using ChIP-seq high-throughput sequencing For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf Cells were grown according to the approved ENCODE cell culture protocols. Cells were crosslinked with 1% formaldehyde, and the reaction was quenched by the addition of glycine. Fixed cells were rinsed with PBS, lysed in nuclei lysis buffer, and the chromatin was sheared to 200-500 bp fragments using Fisher Dismembrator (model 500). Sheared chromatin fragments were immunoprecipitated with specific polyclonal antibodies at 4 degrees C with gentle rotation. Antibody-chromatin complexes were washed and eluted. The cross linking in immunoprecipitated DNA was reversed and treated with RNase-A. Following proteinase K treatment, the DNA fragments were purified by phenol-chloroform-isoamyl alcohol extraction and ethanol precipitation. 20-50 ng of ChIP DNA was end-repaired, adenine ligated to Illumina adapters was added, and then a Solexa library was made for sequencing. ChIP-seq affinity is directly reflected in raw tag density (Raw Signal), which is shown in the track as density of tags mapping within a 150 bp sliding window (at a 20 bp step across the genome). ChIP-seq affinity zones (HotSpots) were identified using the HotSpot algorithm described in Sabo et al. (2004). 1.0% false discovery rate thresholds (FDR 0.01) were computed for each cell type by applying the HotSpot algorithm to an equivalent number of random uniquely mapping 36mers. ChIP-seq affinity (Peaks) were identified as signal peaks within FDR 1.0% hypersensitive zones using a peak-finding algorithm.

Project description:This data was generated by ENCODE. If you have questions about the data, contact the submitting laboratory directly (Jonathan Preall jpreall@cshl.edu (Generation 0 Data from Hannon Lab), Carrie Davis davisc@cshl.edu (experimental), Alex Dobin dobin@cshl.edu (computational), Wei Lin wlin@cshl.edu (computational), Tom Gingeras gingeras@cshl.edu (primary investigator)). If you have questions about the Genome Browser track associated with this data, contact ENCODE (mailto:genome@soe.ucsc.edu). hg18: This data was produced by Hannon lab part of Cold Spring Harbor as part of the ENCODE Project. The series depicts NextGen sequencing information for RNAs between the sizes of 20-200 nt isolated from RNA samples from tissues or sub cellular compartments of cell lines. hg19: This track depicts NextGen sequencing information for RNAs between the sizes of 20-200 nt isolated from RNA samples from tissues or sub cellular compartments from ENCODE cell lines. The overall goal of the ENCODE project is to identify and characterize all functional elements in the sequence of the human genome. hg19: This cloning protocol generates directional libraries that are read from the 5' ends of the inserts, which should largely correspond to the 5' ends of the mature RNAs. The libraries were sequenced on a Solexa platform for a total of 36, 50 or 76 cycles however the reads undergo post-processing resulting in trimming of their 3' ends. Consequently, the mapped read lengths are variable. For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf hg18: Small RNAs between 20-200 nt were ribominus treated according to the manufacturer's protocol (Invitrogen) using custom LNA probes targeting ribosomal RNAs (some datasets are also depleted of U snRNAs and high abundant microRNAs). The RNA was treated with Tobacco Alkaline Pyrophosphatase to eliminate any 5' cap structure. Poly-A Polymerase was used to catalyze the addition of C's to the 3' end. The 5' ends were phosphorylated using T4 PNK and an RNA linker was ligated onto the 5' end. Reverse transcription was carried out using a poly-G oligo with a defined 5' extension. The inserts were then amplified using oligos targeting the 5' linker and poly-G extension and containing sequencing adapters. The library was sequenced on an Illumina GA machine for a total of 36, 50 or 76 cycles. Initially 1 lane is run. If an appreciable number of mappable reads are obtained, additional lanes are run. Sequence reads underwent quality filtration using Illumina standard pipeline (Gerlad). The read lengths may exceed the insert sizes and consequently introduce 3' adaptor sequence into the 3' end of the reads. The 3' sequencing adaptor was removed from the reads using a custom clipper program, which aligned the adaptor sequence to the short-reads, allowing up to 2 mismatches and no indels. Regions that aligned were "clipped" off from the read. The trimmed portions were collapsed into identical reads, their count noted and aligned to the human genome (NCBI build 36, hg18 unmasked) using Nexalign (Lassmann et al., not published). The alignment parameters are tuned to tolerate up to 2 mismatches with no indels and will allow for trimmed portions as small as 5 nucleotides to be mapped. We report reads that mapped 10 or fewer times. Data obtained from each lane is processed and mapped independently. The processed/mapped data from each lane is then complied as a single track without additional processing and submitted to UCSC. Consequently, identical reads within a lane were collapsed and their value is reported as the "transfrag" signal value. However, the redundancy between lanes has not been eliminated so the same transfrag may appear multiple times within a signal. hg19: Small RNAs between 20-200 nt were ribominus treated according to the manufacturer's protocol (Invitrogen) using custom LNA probes targeting ribosomal RNAs (some datasets are also depleted of U snRNAs and high abundant microRNAs). The RNA was treated with Tobacco Alkaline Pyrophosphatase to eliminate any 5' cap structures. Poly-A Polymerase was used to catalyze the addition of C's to the 3' end. The 5' ends were phosphorylated using T4 PNK and an RNA linker was ligated onto the 5' end. Reverse transcription was carried out using a poly-G oligo with a defined 5' extension. The inserts were then amplified using oligos targeting the 5' linker and poly-G extension and containing sequencing adapters. The library was sequenced on an Illumina GA machine for a total of 36, 50 or 76 cycles. Initially, one lane was run. If an appreciable number of mappable reads were obtained, additional lanes were run. Sequence reads underwent quality filtration using Illumina standard pipeline (GERALD). The Illumina reads were initially trimmed to discard any bases following a quality score less than or equal to 20 and converted into FASTA format, thereby discarding quality information for the rest of the pipeline. As a result, the sequence quality scores in the BAM output are all displayed as "40" to indicate no quality information. The read lengths may exceed the insert sizes and consequently introduce 3' adapter sequence into the 3' end of the reads. The 3' sequencing adapter was removed from the reads using a custom clipper program (available at http://hannonlab.cshl.edu/fastx_toolkit/), which aligned the adapter sequence to the short-reads using up to 2 mismatches and no indels. Regions that aligned were "clipped" off from the read. Terminal C nucleotides introduced at the 3' end of the RNA via the cloning procedure are also trimmed. The trimmed portions were collapsed into identical reads, their count noted and aligned to the human genome (version hg19, using the gender build appropriate to the sample in question - female/male) using Bowtie (Langmead B. et al). The alignment parameter allowed 0, 1, or 2 mismatches iteratively. We report reads that mapped 20 or fewer times. Discrepancies between hg18 and hg19 versions of CSHL small RNA data: The alignment pipeline for the CSHL small RNA data was updated upon the release of the human genome version hg19, resulting in a few noteworthy discrepancies with the hg18 dataset. First, mapping was conducted with the open-source Bowtie algorithm (http://bowtie-bio.sourceforge.net/index.shtml) rather than the custom NexAlign software. As each algorithm uses different strategies to perform alignments, the mapping results may vary even in genomic regions that do not differ between builds. The read processing pipeline also varies slightly, in that we no longer retain information regarding whether a read was 'clipped' off adapter sequence.

Dataset Information

ENCODE Caltech RNA-seq

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets