Project description:Because they can compromise genome integrity, transposable elements (TEs) are significantly associated with human diseases. Short-read sequencing has been used to study the expression of TEs; however, the highly repetitive nature of these elements makes multimapping a critical issue. Here we implement LocusMasterTE, an improved quantification method that integrates long-read sequencing. By introducing transcripts per million (TPM) values computed from long-read sequencing as a prior distribution in the Expectation-Maximization (EM) model used for short-read TE quantification, LocusMasterTE reassigns multi-mapped reads to correct expression values. Based on simulated short reads, LocusMasterTE outperforms current quantification approaches and performs significantly better at capturing newly inserted TEs. We also verified that TEs quantified by LocusMasterTE are clearly related to euchromatin and heterochromatin in cell line samples. With LocusMasterTE, we anticipate that more accurate quantification can be performed, allowing novel functions of TEs to be uncovered.
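The idea of using long-read TPM values as a prior in an EM model for multi-mapped short reads can be illustrated with a minimal sketch. This is not LocusMasterTE's actual implementation: it is a generic MAP-style EM in which the prior enters as pseudocounts, and the function name `em_with_prior`, the `prior_weight` parameter, and the 0/1 compatibility matrix are all assumptions made for illustration. It also assumes every read is compatible with at least one locus.

```python
import numpy as np


def em_with_prior(compat, prior, prior_weight=1.0, n_iter=200, tol=1e-8):
    """Hypothetical MAP-style EM for assigning multi-mapped reads to loci.

    compat       : (n_reads, n_loci) 0/1 matrix; 1 means the read aligns to the locus.
    prior        : non-negative weight per locus (e.g. long-read TPM); assumed, not
                   taken from any real tool's interface.
    prior_weight : total pseudocount mass given to the prior (0 disables it).
    Returns (theta, expected_counts).
    """
    compat = np.asarray(compat, dtype=float)
    prior = np.asarray(prior, dtype=float)
    n_loci = compat.shape[1]

    # Turn the prior into pseudocounts that sum to prior_weight.
    pseudo = prior_weight * prior / prior.sum()

    # Start from a uniform guess of relative locus abundance.
    theta = np.full(n_loci, 1.0 / n_loci)

    for _ in range(n_iter):
        # E-step: split each read across its compatible loci in proportion to theta.
        resp = compat * theta
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: expected read counts plus prior pseudocounts, renormalised.
        counts = resp.sum(axis=0) + pseudo
        new_theta = counts / counts.sum()

        converged = np.abs(new_theta - theta).max() < tol
        theta = new_theta
        if converged:
            break

    # Final expected counts per locus under the converged abundances.
    resp = compat * theta
    resp /= resp.sum(axis=1, keepdims=True)
    return theta, resp.sum(axis=0)


if __name__ == "__main__":
    # Four reads, each aligning equally well to two TE copies.
    reads = np.ones((4, 2))
    # Made-up long-read prior favouring the first copy.
    theta, counts = em_with_prior(reads, prior=[9.0, 1.0])
    print(theta, counts)
```

With no prior, reads that align equally well to two loci stay split evenly; with the prior, the estimate is pulled toward the locus that the long-read data supports. How strongly the prior should be weighted relative to the short-read evidence is a design choice this sketch leaves open.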
Project description:Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and is a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq cannot resolve full-length transcript isoforms despite increasingly sophisticated computational methods. Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short reads. Here we describe TALON, the ENCODE4 pipeline for analyzing PacBio cDNA and ONT direct-RNA transcriptomes. We apply TALON to three human ENCODE Tier 1 cell lines and show that while both technologies perform well at full-transcript discovery and quantification, each displays distinct artifacts. We further apply TALON to mouse cortical and hippocampal transcriptomes and find that a substantial proportion of neuronal genes have more reads associated with novel isoforms than with annotated ones. These data show that TALON is a technology-agnostic long-read transcriptome discovery and quantification pipeline capable of tracking both known and novel transcript models, as well as their expression levels, across datasets in both simple studies and larger projects. These properties will enable TALON users to move beyond the limitations of short-read data to perform isoform discovery and quantification in a uniform manner on existing and future long-read platforms.
Project description:Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and is a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq cannot resolve full-length transcript isoforms despite increasingly sophisticated computational methods. Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short reads. Here we describe TALON, the ENCODE4 pipeline for analyzing PacBio cDNA and ONT direct-RNA transcriptomes. We apply TALON to three human ENCODE Tier 1 cell lines and show that while both technologies perform well at full-transcript discovery and quantification, each technology has its own distinct artifacts. We further apply TALON to mouse cortical and hippocampal transcriptomes and find that a substantial proportion of neuronal genes have more reads associated with novel isoforms than with annotated ones. The TALON pipeline for technology-agnostic, long-read transcriptome discovery and quantification tracks both known and novel transcript models, as well as their expression levels, across datasets in both simple studies and larger projects such as ENCODE. Such projects seek to decode transcriptional regulation in the human and mouse genomes and to predict expression levels of genes and transcripts more accurately than is possible with short reads alone.
Project description:Accurate quantification of transcript isoforms is crucial for understanding gene regulation, functional diversity, and cellular behavior. Existing methods using either short-read (SR) or long-read (LR) RNA sequencing have significant limitations: SR sequencing provides high depth but struggles with isoform deconvolution, while LR sequencing offers isoform resolution at the cost of lower depth, higher noise, and technical biases. Addressing this gap, we introduce Multi-Platform Aggregation and Quantification of Transcripts (MPAQT), a generative model that combines the complementary strengths of different sequencing platforms to achieve state-of-the-art isoform-resolved transcript quantification, as demonstrated by extensive simulations and experimental benchmarks. Applying MPAQT to an in vitro model of human embryonic stem cell differentiation into cortical neurons, followed by machine learning-based modeling of mRNA abundance determinants, reveals the role of untranslated regions (UTRs) in isoform regulation through isoform-specific interactions with RNA-binding proteins that modulate mRNA stability. These findings highlight MPAQT's potential to enhance our understanding of transcriptomic complexity and underline the role of splicing-independent post-transcriptional mechanisms in shaping the isoform and exon usage landscape of the cell.
Project description:Long-read RNA sequencing (RNA-seq) holds great potential for characterizing transcriptome variation and full-length transcript isoforms, but the relatively high error rate of current long-read sequencing platforms poses a major challenge. We present ESPRESSO, a computational tool for robust discovery and quantification of transcript isoforms from error-prone long reads. ESPRESSO jointly considers alignments of all long reads aligned to a gene and uses error profiles of individual reads to improve the identification of splice junctions and the discovery of their corresponding transcript isoforms. On both a synthetic spike-in RNA sample and human RNA samples, ESPRESSO outperforms multiple contemporary tools in not only transcript isoform discovery but also transcript isoform quantification. In total, we generated and analyzed ~1.1 billion nanopore RNA-seq reads covering 30 human tissue samples and three human cell lines. ESPRESSO and its companion dataset provide a useful resource for studying the RNA repertoire of eukaryotic transcriptomes.
Project description:Ongoing improvements to next generation sequencing technologies are leading to longer sequencing read lengths, but a thorough understanding of the impact of longer reads on RNA sequencing analyses is lacking. To address this issue, we generated and compared two RNA sequencing datasets of differing read lengths, 2x75 bp (L75) and 2x262 bp (L262), and investigated the impact of read length on various aspects of analysis, including the performance of currently available read-mapping tools, gene and transcript quantification, and detection of allele-specific expression patterns. Our results indicate that, while the scalability of read-mapping tools and the cost-effectiveness of long-read protocols are issues that require further attention, longer reads enable more accurate quantification of diverse aspects of gene expression, including individual-specific patterns of allele-specific expression and alternative splicing. Two RNA-Seq datasets of differing read lengths (2x262 bp and 2x75 bp).