Project description:BackgroundThe quality of gene annotation determines the interpretation of results obtained in transcriptomic studies. The growing number of genome sequence information calls for experimental and computational pipelines for de novo transcriptome annotation. Ideally, gene and transcript models should be called from a limited set of key experimental data.ResultsWe developed TranscriptomeReconstructoR, an R package which implements a pipeline for automated transcriptome annotation. It relies on integrating features from independent and complementary datasets: (i) full-length RNA-seq for detection of splicing patterns and (ii) high-throughput 5' and 3' tag sequencing data for accurate definition of gene borders. The pipeline can also take a nascent RNA-seq dataset to supplement the called gene model with transient transcripts. We reconstructed de novo the transcriptional landscape of wild type Arabidopsis thaliana seedlings and Saccharomyces cerevisiae cells as a proof-of-principle. A comparison to the existing transcriptome annotations revealed that our gene model is more accurate and comprehensive than the most commonly used community gene models, TAIR10 and Araport11 for A.thaliana and SacCer3 for S.cerevisiae. In particular, we identify multiple transient transcripts missing from the existing annotations. Our new annotations promise to improve the quality of A.thaliana and S.cerevisiae genome research.ConclusionsOur proof-of-concept data suggest a cost-efficient strategy for rapid and accurate annotation of complex eukaryotic transcriptomes. We combine the choice of library preparation methods and sequencing platforms with the dedicated computational pipeline implemented in the TranscriptomeReconstructoR package. The pipeline only requires prior knowledge on the reference genomic DNA sequence, but not the transcriptome. The package seamlessly integrates with Bioconductor packages for downstream analysis.
Project description:MotivationSingle-cell RNA sequencing (scRNA-seq) analysis reveals heterogeneity and dynamic cell transitions. However, conventional gene-based analyses require intensive manual curation to interpret biological implications of computational results. Hence, a theory for efficiently annotating individual cells remains warranted.ResultsWe present ASURAT, a computational tool for simultaneously performing unsupervised clustering and functional annotation of disease, cell type, biological process and signaling pathway activity for single-cell transcriptomic data, using a correlation graph decomposition for genes in database-derived functional terms. We validated the usability and clustering performance of ASURAT using scRNA-seq datasets for human peripheral blood mononuclear cells, which required fewer manual curations than existing methods. Moreover, we applied ASURAT to scRNA-seq and spatial transcriptome datasets for human small cell lung cancer and pancreatic ductal adenocarcinoma, respectively, identifying previously overlooked subpopulations and differentially expressed genes. ASURAT is a powerful tool for dissecting cell subpopulations and improving biological interpretability of complex and noisy transcriptomic data.Availability and implementationASURAT is published on Bioconductor (https://doi.org/10.18129/B9.bioc.ASURAT). The codes for analyzing data in this article are available at Github (https://github.com/keita-iida/ASURATBI) and figshare (https://doi.org/10.6084/m9.figshare.19200254.v4).Supplementary informationSupplementary data are available at Bioinformatics online.
Project description:During production of the original article [1], there was a technical error that resulted in author corrections not being rendered in the PDF version of the article.
Project description:The HTML version of this Article incorrectly omits Supplementary Movie 1. Supplementary Movie 1 can be found as Supplementary Information associated with this Correction.
Project description:Numerous methods have been developed to analyse RNA sequencing (RNA-seq) data, but most rely on the availability of a reference genome, making them unsuitable for non-model organisms. Here we present superTranscripts, a substitute for a reference genome, where each gene with multiple transcripts is represented by a single sequence. The Lace software is provided to construct superTranscripts from any set of transcripts, including de novo assemblies. We demonstrate how superTranscripts enable visualisation, variant detection and differential isoform detection in non-model organisms. We further use Lace to combine reference and assembled transcriptomes for chicken and recover hundreds of gaps in the reference genome.
Project description:In the originally published version of this Article, the affiliation details for Santi González, Jian'an Luan and Claudia Langenberg were inadvertently omitted. Santi González should have been affiliated with 'Barcelona Supercomputing Center (BSC), Joint BSC-CRG-IRB Research Program in Computational Biology, 08034 Barcelona, Spain', and Jian'an Luan and Claudia Langenberg should have been affiliated with 'MRC Epidemiology Unit, University of Cambridge School of Clinical Medicine, Cambridge Biomedical Campus, Cambridge CB2 0QQ, UK'. Furthermore, the abstract contained an error in the SNP ID for the rare variant in chromosome Xq23, which was incorrectly given as rs146662057 and should have been rs146662075. These errors have now been corrected in both the PDF and HTML versions of the Article.