Project description:Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and is a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq cannot resolve full-length transcript isoforms despite increasingly sophisticated computational methods. Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short reads. Here we describe TALON, the ENCODE4 pipeline for analyzing PacBio cDNA and ONT direct-RNA transcriptomes. We apply TALON to three human ENCODE Tier 1 cell lines and show that while both technologies perform well at full-transcript discovery and quantification, each displays distinct artifacts. We further apply TALON to mouse cortical and hippocampal transcriptomes and find that a substantial proportion of neuronal genes have more reads associated with novel isoforms than with annotated ones. These data show that TALON is a technology-agnostic long-read transcriptome discovery and quantification pipeline capable of tracking both known and novel transcript models, as well as their expression levels, across datasets in both simple studies and larger projects. These properties will enable TALON users to move beyond the limitations of short-read data to perform isoform discovery and quantification in a uniform manner on existing and future long-read platforms.
Project description:New tools are required to improve long-read transcript assembly and its integration with short-read assembly. Using our short- and long-read measurements from different cell lines with spiked-in standards, we systematically compared key parameters and biases in the read alignment and assembly of transcripts. We report a cDNA synthesis artifact in long-read datasets that impacts the identity and quantitation of assembled transcripts. We developed a computational pipeline to strand long-read cDNA libraries that markedly improves the assembly of transcripts from long reads. Incorporating stranded long reads in a new hybrid assembly approach, we demonstrate its efficacy for improved characterization of challenging lncRNA transcripts. Our workflow can be applied to a wide range of transcriptomics datasets for superior demarcation of transcript ends and refined isoform structure, which can enable better differential gene expression analyses and molecular manipulations of transcripts.
Project description:Long-read RNA sequencing (RNA-seq) holds great potential for characterizing transcriptome variation and full-length transcript isoforms, but the relatively high error rate of current long-read sequencing platforms poses a major challenge. We present ESPRESSO, a computational tool for robust discovery and quantification of transcript isoforms from error-prone long reads. ESPRESSO jointly considers alignments of all long reads aligned to a gene and uses error profiles of individual reads to improve the identification of splice junctions and the discovery of their corresponding transcript isoforms. On both a synthetic spike-in RNA sample and human RNA samples, ESPRESSO outperforms multiple contemporary tools in not only transcript isoform discovery but also transcript isoform quantification. In total, we generated and analyzed ~1.1 billion nanopore RNA-seq reads covering 30 human tissue samples and three human cell lines. ESPRESSO and its companion dataset provide a useful resource for studying the RNA repertoire of eukaryotic transcriptomes.
Project description:To identify aberrant splicing isoforms and potential neoantigens, we performed full-length cDNA sequencing of lung adenocarcinoma cell lines using the MinION long-read sequencer. We constructed a comprehensive catalog of aberrant splicing isoforms and detected isoform-specific peptides using proteome analysis.
Project description:Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and is a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq cannot resolve full-length transcript isoforms despite increasingly sophisticated computational methods. Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short reads. Here we describe TALON, the ENCODE4 pipeline for analyzing PacBio cDNA and ONT direct-RNA transcriptomes. We apply TALON to three human ENCODE Tier 1 cell lines and show that while both technologies perform well at full-transcript discovery and quantification, each has distinct artifacts. We further apply TALON to mouse cortical and hippocampal transcriptomes and find that a substantial proportion of neuronal genes have more reads associated with novel isoforms than with annotated ones. The TALON pipeline for technology-agnostic, long-read transcriptome discovery and quantification tracks both known and novel transcript models, as well as their expression levels, across datasets, both for simple studies and for larger projects such as ENCODE that seek to decode transcriptional regulation in the human and mouse genomes in order to predict expression levels of genes and transcripts more accurately than is possible with short reads alone.
Project description:In this study, we used a barcoding-based synthetic long-read (SLR) isoform sequencing approach (LoopSeq) to generate sequencing reads sufficiently long and accurate to identify isoforms using standard short-read Illumina sequencers.