Project description: New tools are required to improve long-read transcript assembly and its integration with short-read assembly. Using our short- and long-read measurements from different cell lines with spiked-in standards, we systematically compared key parameters and biases in the read alignment and assembly of transcripts. We report a cDNA synthesis artifact in long-read datasets that affects the identity and quantitation of assembled transcripts. We developed a computational pipeline to strand long-read cDNA libraries that markedly improves the assembly of transcripts from long reads. Incorporating stranded long reads in a new hybrid assembly approach, we demonstrate its efficacy for improved characterization of challenging lncRNA transcripts. Our workflow can be applied to a wide range of transcriptomics datasets for superior demarcation of transcript ends and refined isoform structure, which can enable better differential gene expression analyses and molecular manipulations of transcripts.
Project description: We want to develop a transcriptome assembly pipeline that significantly improves the quality of assemblies constructed from stranded and/or unstranded RNA-seq data. The transcriptome of mouse embryonic stem cells (mESCs) was assembled using stranded and unstranded libraries sequenced on an Illumina HiSeq 2000.
Project description: We want to develop a transcriptome assembly pipeline that significantly improves the quality of assemblies constructed from stranded and/or unstranded RNA-seq data.
Project description: We explored changes at the gene level or transcript level in embryonic stem cells, before and after in vitro differentiation with retinoic acid. RNA was sequenced both with Illumina short reads and with Oxford Nanopore Technologies cDNA and direct RNA sequencing.
Project description: Accurate annotation of transcript isoforms is crucial to understand gene functions, but automated methods for reconstructing full-length transcripts from RNA sequencing (RNA-seq) data remain imprecise. We developed Bookend, a software package for transcript assembly that incorporates data from different RNA-seq techniques, with a focus on identifying and utilizing RNA 5′ and 3′ ends. Through end-guided assembly with Bookend we demonstrate that correct modeling of transcript start and end sites is essential for precise transcript assembly. Furthermore, we discovered that utilization of end-labeled reads present in full-length single-cell RNA-seq (scRNA-seq) datasets dramatically improves the precision of transcript assembly in single cells. Finally, we show that hybrid assembly across short-read, long-read, and end-capture RNA-seq datasets from Arabidopsis, as well as meta-assembly of RNA-seq from single mouse embryonic stem cells (mESCs), can produce end-to-end transcript annotations of comparable quality to reference annotations in these model organisms.
Project description: Background: Long-read and short-read sequencing technologies offer competing advantages for eukaryotic genome sequencing projects. Combinations of both may be appropriate for surveys of within-species genomic variation. Methods: We developed a hybrid assembly pipeline called "Alpaca" that can operate on 20X long-read coverage plus about 50X short-insert and 50X long-insert short-read coverage. To preclude collapse of tandem repeats, Alpaca relies on base-call-corrected long reads for contig formation. Results: Compared to two other assembly protocols, Alpaca demonstrated the most reference agreement and repeat capture on the rice genome. On three accessions of the model legume Medicago truncatula, Alpaca generated the most agreement to a conspecific reference and predicted tandemly repeated genes absent from the other assemblies. Conclusion: Our results suggest Alpaca is a useful tool for investigating structural and copy number variation within de novo assemblies of sampled populations.
Project description: As next-generation sequence (NGS) production continues to increase, analysis is becoming a significant bottleneck. However, in situations where information is required only for specific sequence variants, it is not necessary to assemble or align whole genome data sets in their entirety. Rather, NGS data sets can be mined for the presence of sequence variants of interest by localized assembly, which is a faster, easier, and more accurate approach. We present TASR, a streamlined assembler that interrogates very large NGS data sets for the presence of specific variants by only considering reads within the sequence space of input target sequences provided by the user. The NGS data set is searched for reads with an exact match to all possible short words within the target sequence, and these reads are then assembled stringently to generate a consensus of the target and flanking sequence. Typically, variants of a particular locus are provided as different target sequences, and the presence of the variant in the data set being interrogated is revealed by a successful assembly outcome. However, TASR can also be used to find unknown sequences that flank a given target. We demonstrate that TASR has utility in finding or confirming genomic mutations, polymorphisms, fusions and integration events. Targeted assembly is a powerful method for interrogating large data sets for the presence of sequence variants of interest. TASR is a fast, flexible and easy-to-use tool for targeted assembly.
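The read-recruitment idea described above (keep only reads that share an exact short word with a target sequence) can be illustrated with a minimal Python sketch. This is not TASR's code or interface: the function names, the word length `k`, and the decision to also index the reverse complement of the target are all assumptions made for illustration, and the subsequent stringent assembly step is not shown.

```python
# Illustrative sketch only; names and defaults are hypothetical, not TASR's API.

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq, k):
    """Return the set of all length-k words in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def recruit_reads(target, reads, k=15):
    """Keep reads that contain at least one exact k-mer from the target.

    Both strands of the target are indexed (an assumption) so that reads
    sequenced from either strand can be recruited.
    """
    words = kmers(target, k) | kmers(revcomp(target), k)
    return [
        read
        for read in reads
        if any(read[i:i + k] in words for i in range(len(read) - k + 1))
    ]
```

For example, `recruit_reads(target, reads, k=8)` would return only the subset of `reads` that overlap `target` by at least one exact 8-mer on either strand; those reads would then be passed to whatever assembler is being used.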