High-confidence Coding and Noncoding Transcriptome Maps
ABSTRACT: The advent of high-throughput RNA sequencing (RNA-seq) has led to the discovery of unprecedentedly large transcriptomes encoded by eukaryotic genomes. However, transcriptome maps remain incomplete, partly because most were reconstructed from RNA-seq reads that lack orientation (known as unstranded reads) and certain boundary information. Methods that expand the usability of unstranded RNA-seq data by predetermining read orientation and precisely determining the boundaries of assembled transcripts could significantly improve the quality of the resulting transcriptome maps. Here, we present a high-performing transcriptome assembly pipeline, called CAFE, that significantly improves original assemblies built from stranded and/or unstranded RNA-seq data by orienting unstranded reads using maximum likelihood estimation and by integrating information about transcription start sites and cleavage and polyadenylation sites. Applying CAFE to large-scale transcriptomic data comprising 230 billion RNA-seq reads from the ENCODE and Human BodyMap projects, The Cancer Genome Atlas, and GTEx, we predicted the directions of about 220 billion unstranded reads, which led to the construction of more accurate transcriptome maps, comparable to the manually curated map, and a comprehensive lncRNA catalogue that includes thousands of novel lncRNAs. Our pipeline should help not only to build comprehensive, precise transcriptome maps from complex genomes but also to expand the universe of non-coding genomes.
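The abstract does not spell out the likelihood model CAFE uses, so the following is only a minimal sketch of the general idea, not CAFE's implementation. It assumes a simple binomial model in which stranded reads at a locus are used to estimate the fraction of transcription coming from the plus strand, and unstranded reads at that locus are then assigned the more likely orientation. The function names, the `min_confidence` threshold, and the toy counts are all hypothetical.

```python
from typing import Optional


def mle_plus_fraction(n_plus: int, n_minus: int) -> Optional[float]:
    """Binomial maximum likelihood estimate of the plus-strand fraction.

    For k plus-strand reads out of n stranded reads, the MLE is k / n.
    Returns None when the locus has no stranded reads to learn from.
    """
    n = n_plus + n_minus
    if n == 0:
        return None
    return n_plus / n


def orient_locus(n_plus: int, n_minus: int, min_confidence: float = 0.9) -> str:
    """Pick an orientation for unstranded reads at one locus.

    Returns "+" or "-" when the estimated fraction for that strand reaches
    min_confidence (an assumed cutoff), and "." when the evidence is
    missing or ambiguous.
    """
    p_plus = mle_plus_fraction(n_plus, n_minus)
    if p_plus is None:
        return "."
    if p_plus >= min_confidence:
        return "+"
    if 1.0 - p_plus >= min_confidence:
        return "-"
    return "."


# Hypothetical stranded read counts per locus: (plus, minus)
loci = {"locus_a": (95, 5), "locus_b": (2, 48), "locus_c": (10, 10)}
orientations = {name: orient_locus(*counts) for name, counts in loci.items()}
```

Leaving ambiguous loci unassigned (".") is a deliberate choice in this sketch: a wrong orientation would mislead the assembler, whereas an unassigned read simply stays unstranded.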
Project description: This SuperSeries is composed of the SubSeries listed below.
Project description: We aim to develop a transcriptome assembly pipeline that significantly improves the quality of assemblies constructed from stranded and/or unstranded RNA-seq data. The transcriptome of mouse embryonic stem cells (mESCs) was assembled using stranded and unstranded libraries sequenced on the Illumina HiSeq 2000.
Project description: The Yeonsan Ogye (Ogye) is a rare black chicken breed domesticated on the Korean peninsula, noted for its entirely black appearance, including feathers, skin, comb, eyes, shanks, claws, and internal organs. In this study, whole-genome, transcriptome, and epigenome sequencing of Ogye was performed using high-throughput NGS platforms. We produced Illumina short reads (paired-end, mate-pair, and fosmid) and PacBio long reads for whole-genome sequencing (WGS), 1.4 billion reads for RNA-seq, and 123 million reads for reduced representation bisulfite sequencing (RRBS). Using the WGS data, the Ogye genome was assembled, and coding and non-coding transcriptome maps were constructed on it from the large-scale sequencing data. We predicted 17,472 protein-coding transcripts (3,550 newly annotated and 13,922 known) and 9,443 long non-coding RNAs (lncRNAs; 6,689 novel and 2,754 known).
Project description: Discovery of genes driving axolotl limb regeneration has been challenging due to limited genomic resources. We assembled 42 RNA-Seq samples totaling approximately 1.3 billion 100-base paired-end reads using Trinity (Grabherr M.G. et al., Nature Biotechnology, 2011; Haas B.J. et al., Nature Protocols, 2013; https://github.com/trinityrnaseq/trinityrnaseq/wiki). We created a transcriptome with complete sequence information for most axolotl genes, identified transcriptional profiles that distinguish blastemas from differentiated limb tissues, and uncovered functional roles for cirbp and kazald1 in limb regeneration.
Project description: Purpose: The goal of this study is to compare the gene expression profiles of drought-treated and control plants using NGS. Methods: The four RNA samples were pooled into one, using equivalent quantities of each sample, for transcriptome sequencing. Meanwhile, the four RNA samples were used to construct the libraries for DGE sequencing. Results: Using Illumina sequencing technology, we generated over two billion bases of high-quality sequence data on H. ammodendron and conducted de novo assembly and annotation of genes without prior genome information. These reads were assembled into 79,918 unigenes (mean length = 728 bp). In addition, DGE reads were mapped to the assembled transcriptome for gene expression analysis under drought stress. In total, 1,060 differentially expressed genes were identified. H. ammodendron seedlings were grown for one month; one set of seedlings was then subjected to a one-week (7 d) stress treatment, and a second set served as an untreated control. Each treatment had two replicates.
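The description does not say how differential expression was called from the DGE reads. As an illustration only, and not the study's method, the sketch below shows one common first step: scaling read counts mapped to each unigene to counts per million (CPM) and computing a log2 fold change between drought and control. The unigene identifiers, counts, library sizes, and pseudocount are hypothetical.

```python
from math import log2


def counts_per_million(counts, library_size):
    """Scale raw mapped-read counts to counts per million reads."""
    return {gene: count * 1e6 / library_size for gene, count in counts.items()}


def log2_fold_change(treated, control, pseudocount=1.0):
    """Per-gene log2(treated / control) for genes present in both samples.

    The pseudocount (an assumed value) keeps the ratio defined when a
    gene has zero reads in one condition.
    """
    shared = treated.keys() & control.keys()
    return {
        gene: log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in shared
    }


# Hypothetical toy counts for two unigenes, one library per condition
drought_cpm = counts_per_million({"unigene_1": 7, "unigene_2": 1}, library_size=1_000_000)
control_cpm = counts_per_million({"unigene_1": 1, "unigene_2": 1}, library_size=1_000_000)
fold_changes = log2_fold_change(drought_cpm, control_cpm)
```

With only two replicates per treatment, a real analysis would also need a statistical test that models count dispersion; a fold change on its own does not establish significance.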
Project description: In Europe, ticks are the most important vectors of diseases threatening humans, livestock, wildlife and companion animals. Nevertheless, genomic sequence information and functional annotation of proteins for the most important European tick, Ixodes ricinus, are limited. Here we present the first analysis of the I. ricinus genome and of the transcriptome of the unfed I. ricinus midgut. We combined and integrated data from the genome, transcriptome and proteome. The de novo assembly of 1 billion paired-end sequences identified 6,415 putative genes, providing an unprecedented insight into the I. ricinus genome. Mapping our midgut mRNA reads to the assembled contigs let us estimate that we cover around two thirds of the unique genomic sequences. In addition, more than 10,000 transcripts from naïve midgut were annotated functionally and/or locally. By combining the alignment-based approach with a motif-search-based annotation approach, we could double the number of annotations across all groups without shifting the dataset. Moreover, 1,175 proteins expressed in the naïve midgut were identified by mass spectrometry, confirming the high completeness of our transcriptome database, and 608 were significantly annotated for function and/or localization. This multi-omics study vastly extends the publicly available DNA, RNA and protein databases for I. ricinus and ticks in general.
Project description: This experiment contains the subset of data corresponding to rhesus macaque RNA-Seq data from experiment E-GEOD-30352 (http://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-30352/), whose goal is to understand the dynamics of mammalian transcriptome evolution. To study mammalian transcriptome evolution at high resolution, we generated RNA-Seq data (∼3.2 billion Illumina Genome Analyser IIx reads of 76 base pairs) for the polyadenylated RNA fraction of brain (cerebral cortex or whole brain without cerebellum), cerebellum, heart, kidney, liver and testis (usually from one male and one female per somatic tissue and two males for testis) from nine mammalian species: placental mammals (great apes, including humans; rhesus macaque; mouse), marsupials (gray short-tailed opossum) and monotremes (platypus). Corresponding data (∼0.3 billion reads) were generated for a bird (red jungle fowl, a non-domesticated chicken) and used as an evolutionary outgroup.