Project description:We used PacBio data to identify more reliable transcripts from hESCs, from which gene/transcript abundance can be estimated more accurately using Illumina data. PacBio long reads and Illumina short reads were generated from the same hESC cell line, H1. PacBio reads were error-corrected with Illumina reads to identify transcripts. rSeq was used to estimate gene/transcript abundance for the identified transcriptome.
Project description:Genome-wide association studies (GWASs) have revealed thousands of associations in many complex traits and diseases. Previous studies suggest that a subset of associations are due to alterations in splicing; however, interpreting the effects of splicing on protein isoforms is hindered by limitations in defining full-length transcript isoforms using short-read RNA-seq data. Long-read RNA-seq represents a powerful approach to define and quantify transcript isoforms. In this study, we developed a novel approach that integrates information from GWAS, splicing QTL (sQTL), and PacBio long-read RNA-seq in a disease-relevant model to infer the effects of sQTL on the ultimate protein isoform products they encode. Such information enables identification of genes potentially responsible for GWAS associations. As a proof-of-concept, we generated deep coverage (N=~22 million full-length reads) PacBio long-read RNA-seq data on human fetal osteoblasts (hFOBs), a cell line of relevance to the regulation of bone mineral density (BMD). We identified 68,326 protein-coding isoforms, including 17,375 (25%) which were novel. Next, we used Bayesian colocalization to identify 1,863 sQTLs from the Genotype-Tissue Expression (GTEx) project in 732 protein-coding genes which colocalized with BMD associations (H4PP > 0.75). A total of 836 junctions with colocalizing sQTLs in 459 (of the 732) genes were expressed in hFOB long-read RNA-seq data. With these data, we formulated hypotheses regarding the potential mechanism of action of each sQTL. For example, we identified 7 junctions with colocalizing sQTLs (maximum H4PP = 0.98-0.99) in TPM2 for splice junctions between two nearly mutually exclusive exons and two different transcript termination sites, making them impossible to interpret without long-read RNA-seq data. siRNA-mediated knockdown in hFOBs showed two TPM2 isoforms with opposing effects on mineralization.
Our results suggest that splicing is a major mechanism underlying GWAS associations and long-read proteogenomics data is critical to precisely define the protein isoforms that are produced from splicing alterations.
Project description:The human neural retina is enriched for alternative splicing, and it is estimated that more than 10% of variants associated with inherited retinal diseases (IRDs) alter splicing. Previous research mainly used short-read RNA-sequencing techniques to investigate retina-specific splicing and splicing factors. However, this technique provides limited information about transcript isoforms. To gain a deeper understanding of the human neural retina and its isoforms, we generated a proteogenomic atlas that combined PacBio long-read RNA-sequencing data with mass-spectrometry and whole-genome sequencing data from three healthy human neural retina samples. RNA-sequencing revealed that one-third of all transcripts were novel; for IRD-associated genes, the proportion was even higher, at 43%. The most common novel elements of these transcripts were alternative poly(A) sites, exon elongation, and intron retention. Some novel elements affect the non-coding region, but for more than 50% of the novel transcripts a novel open reading frame was predicted. Using proteomics, ten novel peptides confirmed novel isoforms in five genes. Additionally, we found novel isoforms of IMPDH1, an IRD-associated gene, with supporting peptide evidence. This study provides a comprehensive overview of the transcript and protein isoforms expressed in the healthy human neural retina. Moreover, it highlights the importance of studying tissue-specific transcriptomes in greater detail to better understand tissue-specific regulation and to identify disease-causing variants.
Project description:To examine the mechanisms that control flower development, we sequenced the flower bud transcriptomes of ‘High Noon’, a reblooming cultivar of P. suffruticosa × P. lutea. Both full-length isoform sequencing and RNA-seq were performed at 3 floral developmental stages. A total of 15.94 Gb raw data and 457.0 million reads were generated in full-length transcript sequencing and RNA-seq.
Project description:Deregulated gene expression is a hallmark of cancer; however, most studies to date have analyzed short-read RNA-sequencing data, which has inherent limitations. Here, we combine PacBio long-read isoform sequencing (Iso-Seq) and Illumina paired-end short-read RNA sequencing to comprehensively survey the transcriptome of gastric cancer (GC), a leading cause of global cancer mortality. We performed full-length transcriptome analysis across 10 GC cell lines covering four major GC molecular subtypes (chromosomal unstable, Epstein-Barr positive, genome stable and microsatellite unstable). We identify 60,239 non-redundant full-length transcripts, of which >66% are novel compared to current transcriptome databases. Novel isoforms are more likely to be cell-line and subtype specific, and are expressed at lower levels, with a larger number of exons and longer isoform/coding sequence lengths. Most novel isoforms utilize an alternate first exon, and compared to other alternative splicing categories are expressed at higher levels and exhibit higher variability. Collectively, we observe alternate promoter usage in 25% of detected genes, with the majority (84.2%) of known/novel promoter pairs exhibiting potential changes in their coding sequences. Mapping these alternate promoters to TCGA GC samples, we identify several cancer-associated isoforms, including novel variants of oncogenes. Tumor-specific transcript isoforms tend to alter protein coding sequences to a larger extent than other isoforms. Analysis of outcome data suggests that novel isoforms may impart additional prognostic information. Our results provide a rich resource of full-length transcriptome data for deeper studies of GC and other gastrointestinal malignancies.
Project description:Alternative splicing is widely acknowledged to be a crucial regulator of gene expression and is a key contributor to both normal developmental processes and disease states. While cost-effective and accurate for quantification, short-read RNA-seq lacks the ability to resolve full-length transcript isoforms despite increasingly sophisticated computational methods. Long-read sequencing platforms such as Pacific Biosciences (PacBio) and Oxford Nanopore (ONT) bypass the transcript reconstruction challenges of short reads. Here we describe TALON, the ENCODE4 pipeline for analyzing PacBio cDNA and ONT direct-RNA transcriptomes. We apply TALON to three human ENCODE Tier 1 cell lines and show that while both technologies perform well at full-transcript discovery and quantification, each displays distinct artifacts. We further apply TALON to mouse cortical and hippocampal transcriptomes and find that a substantial proportion of neuronal genes have more reads associated with novel isoforms than with annotated ones. These data show that TALON is a technology-agnostic long-read transcriptome discovery and quantification pipeline capable of tracking both known and novel transcript models, as well as their expression levels, across datasets for both simple studies and larger projects. These properties will enable TALON users to move beyond the limitations of short-read data to perform isoform discovery and quantification in a uniform manner on existing and future long-read platforms.
Project description:Monocyte-derived dendritic cells (MDDC) were infected with Leishmania major or Leishmania donovani parasites and collected at 4, 8, and 24 hours post-infection to analyze the differential effects these parasite species have on human host cell gene expression over time. MDDC were generated from blood buffy coats collected from five anonymous healthy human donors and infected at a 10:1 ratio (parasite to host cell) with Leishmania major Friedlin V1 strain or Leishmania donovani 1S strain parasites. After 4, 8, or 24 hours, total RNA was harvested from the cells, cDNA was generated and hybridized to human gene transcript expression arrays, and differences in host cell gene expression were assessed relative to uninfected cells.