High-throughput manual-quality annotation of full-length long noncoding RNAs with Capture Long-Read Sequencing (CLS)
ABSTRACT: Accurate annotation of genes and their transcripts is a foundation of genomics, but no annotation technique presently combines throughput and accuracy. As a result, the GENCODE reference collection of long noncoding RNAs (lncRNAs) remains far from complete: many are fragmentary, while thousands more remain uncatalogued. To accelerate lncRNA annotation, we have developed RNA Capture Long Seq (CLS), which combines targeted RNA capture with third-generation long-read sequencing. We present an experimental re-annotation of the entire GENCODE intergenic lncRNA populations in matched human and mouse tissues. CLS approximately doubles the complexity of targeted loci, in terms of both validated splice junctions and transcript models. Through its identification of full-length transcript models, CLS allows the first definitive measurement of promoter features, gene structure and protein-coding potential of lncRNAs. Thus, CLS removes a longstanding bottleneck of transcriptome annotation, generating manual-quality full-length transcript models at high-throughput scales.
Project description: The GENCODE project is a long-term international effort to produce a comprehensive and accurate map of genes and transcripts for the human and mouse genomes. While the annotation of protein-coding genes is nearly complete, long non-coding RNAs (lncRNAs) remain poorly characterized, with existing catalogs lacking consistency and experimental support. To address this, GENCODE used a targeted RNA sequencing approach to capture RNA from various human and mouse tissues, employing advanced sequencing technologies (ONT, PacBio, and Illumina). This resulted in the prediction of around half a million transcript models for both species. GENCODE then re-engineered its curation pipeline to handle this data, leading to the annotation of 16,817 new human genes (132,049 transcripts) and 22,210 new mouse genes (131,546 transcripts), a significant increase in lncRNA annotations. The newly identified genes and transcripts have similar features to previously annotated lncRNAs and are linked to human phenotypes through GWAS and evolutionary conservation. Furthermore, the project has expanded the map of lncRNA orthology between humans and mice, especially for disease-associated lncRNAs. These updates enhance the functional interpretation of the human genome, connecting millions of previously unassigned omics data points (e.g., CAGE tags, ChIP-Seq peaks, genetic variants) to specific transcriptional units and regulatory regions. This marks a significant advance toward a complete lncRNA catalog for the human and mouse genomes.
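As a rough illustration of the scale of the counts quoted above, the mean number of new transcripts per new gene can be worked out directly from the figures in the description. This is a minimal sketch: the dictionary layout and the function name `transcripts_per_gene` are illustrative choices made here and are not part of any GENCODE tooling.

```python
# Counts quoted in the project description above.
new_annotations = {
    "human": {"genes": 16_817, "transcripts": 132_049},
    "mouse": {"genes": 22_210, "transcripts": 131_546},
}


def transcripts_per_gene(counts):
    """Mean number of newly annotated transcripts per newly annotated gene."""
    return counts["transcripts"] / counts["genes"]


# Round to one decimal place for reporting.
ratios = {
    species: round(transcripts_per_gene(counts), 1)
    for species, counts in new_annotations.items()
}

for species, ratio in ratios.items():
    print(f"{species}: about {ratio} transcripts per new gene")
```

Under these figures, the new human genes carry on average somewhat more transcripts each than the new mouse genes.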
Project description: As part of the ENCODE consortium the GENCODE project is producing a reference gene set through manual and automated gene prediction. Selected transcript models are verified experimentally by RT-PCR amplification followed by sequencing. In batch IX, a set of de novo transcript models was tested aiming to incorporate new long non-coding RNA models into the GENCODE annotation. The original set was built with Cufflinks from ENCODE RNAseq data derived from 15 cell lines by the Gingeras (CSHL) and Wold (CalTech) labs. A subset of multiexonic transcripts not overlapping the GENCODE v10 annotation was selected for this experiment.
Project description: As part of the ENCODE consortium the GENCODE project is producing a reference gene set through manual and automated gene prediction. Selected transcript models are verified experimentally by RT-PCR amplification followed by sequencing. This is batch II, based on annotation from April 2009.
Project description: Transcription profiling by high throughput sequencing of polyA+ RNAs from eight different human tissues to test a set of de novo transcript models (GENCODE PCR-Seq Batch IX). As part of the ENCODE consortium the GENCODE project is producing a reference gene set through manual and automated gene prediction. Selected transcript models are verified experimentally by RT-PCR amplification followed by sequencing. In batch IX, a set of de novo transcript models was tested aiming to incorporate new long non-coding RNA models into the GENCODE annotation. The original set was built with Cufflinks from ENCODE RNAseq data derived from 15 cell lines by the Gingeras (CSHL) and Wold (CalTech) labs. A subset of multiexonic transcripts not overlapping the GENCODE v10 annotation was selected for this experiment.
ArrayExpress Release Date: 2012-10-01
Person Roles: submitter; Person Last Name: Gonzalez; Person First Name: Jose; Person Mid Initials: M; Person Email: jmg@sanger.ac.uk; Person Phone: -498006; Person Address: Wellcome Trust Genome Campus, Hinxton, UK; Person Affiliation: Wellcome Trust Sanger Institute
Person Roles: investigator; Person Last Name: Hubbard; Person First Name: Tim; Person Email: th@sanger.ac.uk; Person Phone: -498055; Person Address: Wellcome Trust Genome Campus, Hinxton, UK; Person Affiliation: Wellcome Trust Sanger Institute
Person Roles: investigator; Person Last Name: Reymond; Person First Name: Alexandre; Person Email: Alexandre.Reymond@unil.ch; Person Address: Lausanne, Switzerland; Person Affiliation: University of Lausanne
Person Roles: investigator; Person Last Name: Guigo; Person First Name: Roderic; Person Email: roderic.guigo@crg.cat; Person Address: Barcelona, Spain; Person Affiliation: Centre for Genomic Regulation (CRG)
For data usage terms and conditions, please refer to http://www.genome.gov/27528022 and http://www.genome.gov/Pages/Research/ENCODE/ENCODEDataReleasePolicyFinal2008.pdf
Project description: As part of the ENCODE consortium the GENCODE project is producing a reference gene set through manual and automated gene prediction. Selected transcript models are verified experimentally by RT-PCR amplification followed by sequencing. This is the RNASeq part of batch I, based on annotation from July 2008 (without pseudogenes).
Project description: Long non-coding RNAs (lncRNAs) constitute a large fraction of mammalian transcriptomes that remains unexplored, mainly because the lack of comprehensive, high-quality lncRNA annotation limits the ability to fully explore their functional capacity. We have developed RACE-seq, an experimental workflow based on RACE (Rapid Amplification of cDNA Ends) and long-read RNA sequencing, aimed at both rare isoform discovery and better definition of gene boundaries. We applied 3' and 5' RACE-seq to 398 low-expressed GENCODE v7 lncRNA genes in seven human tissues (brain, testis, heart, kidney, liver, lung and spleen). The sequences obtained led to the discovery of 2,641 on-target, previously unknown alternative transcripts. Novel isoforms extended 60% of the 398 targeted lncRNA loci further in either the 5' or 3' direction, and often reached genome hallmarks typical of gene boundaries. In parallel, we used nested RACE-seq, and confirmed that it has overwhelmingly better sensitivity than its standard counterpart.
Project description: As part of the ENCODE consortium the GENCODE project is producing a reference gene set through manual and automated gene prediction. In the current phase of ENCODE we have found strong evidence that many lncRNA transcript termini are still unknown. This experiment aims to set up an experimental validation strategy to accurately determine the 5' and 3' ends of transcripts, based on semi-nested RACE extensions of annotated 5' and 3' ends followed by high throughput sequencing. A total of 400 highly expressed lncRNA transcript models from GENCODE 7 which did not have any CAGE/PET support were selected as the test set, whereas 25 transcripts with transcript start site (TSS) supported by CAGE tags and transcript termination site (TTS) supported by PET ditags formed the positive control set. Transcript ends were amplified by RACE-PCR from brain and testis RNA samples and sequenced using the Roche 454 platform. The sequencing was performed at the Andalusian Human Genome Sequencing Centre (CASEGH), Seville, Spain.
Project description: As part of the ENCODE consortium the GENCODE project is producing a reference gene set through manual and automated gene prediction. In the current phase of ENCODE we have found strong evidence that many lncRNA transcript termini are still unknown. This experiment aims to set up an experimental validation strategy to accurately determine the 5' and 3' ends of transcripts, based on semi-nested RACE extensions of annotated 5' and 3' ends followed by high throughput sequencing. A total of 400 highly expressed lncRNA transcript models from GENCODE 7 which did not have any CAGE/PET support were selected as the test set, whereas 25 transcripts with transcript start site (TSS) supported by CAGE tags and transcript termination site (TTS) supported by PET ditags formed the positive control set. Transcript ends were amplified by RACE-PCR from brain and testis RNA samples and sequenced using the Roche 454 platform. The sequencing was performed at the Centre for Genomic Regulation (CRG), Barcelona, Spain.