Dataset Information

Optimization of de novo transcriptome assembly from next-generation sequencing data.

ABSTRACT: Transcriptome analysis has important applications in many biological fields. However, assembling a transcriptome without a known reference remains a challenging task requiring algorithmic improvements. We present two methods for substantially improving transcriptome de novo assembly. The first method relies on the observation that the use of a single k-mer length by current de novo assemblers is suboptimal to assemble transcriptomes where the sequence coverage of transcripts is highly heterogeneous. We present the Multiple-k method in which various k-mer lengths are used for de novo transcriptome assembly. We demonstrate its good performance by assembling de novo a published next-generation transcriptome sequence data set of Aedes aegypti, using the existing genome to check the accuracy of our method. The second method relies on the use of a reference proteome to improve the de novo assembly. We developed the Scaffolding using Translation Mapping (STM) method that uses mapping against the closest available reference proteome for scaffolding contigs that map onto the same protein. In a controlled experiment using simulated data, we show that the STM method considerably improves the assembly, with few errors. We applied these two methods to assemble the transcriptome of the non-model catfish Loricaria gr. cataphracta. Using the Multiple-k and STM methods, the assembly increases in contiguity and in gene identification, showing that our methods clearly improve quality and can be widely used. The new methods were used to assemble successfully the transcripts of the core set of genes regulating tooth development in vertebrates, while classic de novo assembly failed.
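The scaffolding idea behind STM can be illustrated with a minimal sketch. It assumes the contigs have already been mapped against a reference proteome by some translated search, and that each hit is reduced to a contig ID, a protein ID, and the aligned interval on the protein. The `Hit` record, the `scaffold_by_protein` function, and the example IDs are hypothetical names introduced for illustration only; this is not the authors' implementation, and it ignores strand, overlaps, and contigs that hit several proteins.

```python
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One contig-to-protein alignment (hypothetical, simplified record)."""
    contig_id: str
    protein_id: str
    prot_start: int  # start of the aligned interval on the protein
    prot_end: int    # end of the aligned interval on the protein


def scaffold_by_protein(hits):
    """Group contigs by the protein they map to and order them along it.

    Returns a dict mapping protein_id to a list of contig IDs sorted by
    their position on the protein. Proteins hit by a single contig are
    skipped because there is nothing to scaffold.
    """
    by_protein = defaultdict(list)
    for hit in hits:
        by_protein[hit.protein_id].append(hit)

    scaffolds = {}
    for protein_id, group in by_protein.items():
        if len(group) < 2:
            continue
        ordered = sorted(group, key=lambda h: (h.prot_start, h.prot_end))
        scaffolds[protein_id] = [h.contig_id for h in ordered]
    return scaffolds


# Made-up example hits: three contigs map to protA, one to protB.
hits = [
    Hit("contig_7", "protA", 210, 350),
    Hit("contig_2", "protA", 1, 120),
    Hit("contig_5", "protB", 15, 90),
    Hit("contig_9", "protA", 130, 200),
]
scaffolds = scaffold_by_protein(hits)
print(scaffolds)
```

In this toy input only protA receives more than one contig, so it is the only protein that yields a scaffold, with its contigs ordered by where they align on the protein.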

SUBMITTER: Surget-Groba Y

PROVIDER: S-EPMC2945192 | biostudies-literature | 2010 Oct

REPOSITORIES: biostudies-literature

Publications

Optimization of de novo transcriptome assembly from next-generation sequencing data.

Surget-Groba Y (Yann), Montoya-Burgos JI (Juan I)

Genome Research, 2010-08-06, issue 10

PMID: 20693479

Similar Datasets

Project description:

Background: Barnyardgrass (Echinochloa crus-galli) is an important weed that is a menace to rice cultivation and production. Rapid evolution of herbicide resistance in this weed makes it one of the most difficult to manage using herbicides. Since genome-wide sequence data for barnyardgrass is limited, we sequenced the transcriptomes of susceptible and resistant barnyardgrass biotypes using the 454 GS-FLX platform.

Results: 454 pyrosequencing generated 371,281 raw reads with an average length of 341.8 bp, which made a total length of 126.89 Mb (SRX160526). De novo assembly produced 10,142 contigs (∼5.92 Mb) with an average length of 583 bp and 68,940 singletons (∼22.13 Mb) with an average length of 321 bp. About 244,653 GO term assignments to the biological process, cellular component and molecular function categories were obtained. A total of 6,092 contigs and singletons with 2,515 enzyme commission numbers were assigned to 151 predicted KEGG metabolic pathways. Digital abundance analysis using Illumina sequencing identified 78,124 transcripts among susceptible, resistant, herbicide-treated susceptible and herbicide-treated resistant barnyardgrass biotypes. From these analyses, eight herbicide target-site gene groups and four non-target-site gene groups were identified in the resistant biotype. These could be potential candidate genes involved in the herbicide resistance of barnyardgrass and could be used for further functional genomics research. C4 photosynthesis genes including RbcS, RbcL, NADP-me and MDH with complete CDS were identified using PCR and RACE technology.

Conclusions: This is the first large-scale transcriptome sequencing of E. crus-galli performed using the 454 GS-FLX platform. Potential candidate genes involved in the evolution of herbicide resistance were identified from the assembled sequences. This transcriptome data may serve as a reference for further gene expression and functional genomics studies, and will facilitate the study of herbicide resistance at the molecular level in this species as well as other weeds.

Project description: The k-mer hash length is a key factor affecting the output of de novo transcriptome assembly packages using de Bruijn graph algorithms. Assemblies constructed with varying single k-mer choices might result in the loss of unique contiguous sequences (contigs) and relevant biological information. A common solution to this problem is the clustering of single k-mer assemblies. Even though annotation is one of the primary goals of a transcriptome assembly, the success of assembly strategies does not consider the impact of k-mer selection on the annotation output. This study provides an in-depth k-mer selection analysis that is focused on the degree of functional annotation achieved for a non-model organism where no reference genome information is available. Individual k-mers and clustered assemblies (CA) were considered using three representative software packages. Pair-wise comparison analyses (between individual k-mers and CAs) were produced to reveal missing Kyoto Encyclopedia of Genes and Genomes (KEGG) ortholog identifiers (KOIs), and to determine a strategy that maximizes the recovery of biological information in a de novo transcriptome assembly.

Analyses of single k-mer assemblies resulted in the generation of various quantities of contigs and functional annotations within the selection window of k-mers (k-19 to k-63). For each k-mer in this window, generated assemblies contained certain unique contigs and KOIs that were not present in the other k-mer assemblies. Producing a non-redundant CA of k-mers 19 to 63 resulted in a more complete functional annotation than any single k-mer assembly. However, a fraction of unique annotations remained (~0.19 to 0.27% of total KOIs) in the assemblies of individual k-mers (k-19 to k-63) that were not present in the non-redundant CA. A workflow to recover these unique annotations is presented.

This study demonstrated that different k-mer choices result in various quantities of unique contigs per single k-mer assembly which affects biological information that is retrievable from the transcriptome. This undesirable effect can be minimized, but not eliminated, with clustering of multi-k assemblies with redundancy removal. The complete extraction of biological information in de novo transcriptomics studies requires both the production of a CA and efforts to identify unique contigs that are present in individual k-mer assemblies but not in the CA.
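The pair-wise comparison between single k-mer assemblies and the clustered assembly comes down to a set difference on annotation identifiers. The sketch below assumes each assembly has already been annotated and reduced to a set of KOI strings; the function name `kois_missing_from_ca`, the k values, and the identifiers are placeholders chosen for illustration, not data or code from the study.

```python
def kois_missing_from_ca(single_k_kois, ca_kois):
    """Report, per k-mer length, the KOIs absent from the clustered assembly.

    single_k_kois: dict mapping a k-mer length to the set of KOIs found
                   in that single k-mer assembly.
    ca_kois:       set of KOIs found in the non-redundant clustered assembly.
    Only k-mer lengths with at least one missing KOI appear in the result.
    """
    ca = set(ca_kois)
    missing = {}
    for k, kois in single_k_kois.items():
        only_in_single = set(kois) - ca
        if only_in_single:
            missing[k] = only_in_single
    return missing


# Placeholder identifiers, not real annotation results.
single_k_kois = {
    19: {"K00001", "K00002", "K00003"},
    31: {"K00001", "K00004"},
    63: {"K00001", "K00002"},
}
ca_kois = {"K00001", "K00002", "K00004"}

missing = kois_missing_from_ca(single_k_kois, ca_kois)
recoverable = set().union(*missing.values()) if missing else set()
print(missing, recoverable)
```

Contigs carrying the identifiers in `recoverable` are the candidates one would add back to the clustered assembly in a recovery step of this kind.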
