Dataset Information

Feature-by-feature--evaluating de novo sequence assembly.

ABSTRACT: The whole-genome sequence assembly (WGSA) problem is among one of the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: while on the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing the information about the correctness of the assembly, comparisons performed on simulated dataset, on the other hand, can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess the overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality against their sizes. Nevertheless, the relationship among the different features and their relative importance remains unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis we were able to estimate the "excess-dimensionality" of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes the assembly quality. Applying independent component analysis we identified a subset of features that better describe the assemblers performances. We demonstrated that by focusing on a reduced set of highly informative features we can use the FRC curve to better describe and compare the performances of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state of the art simulators, lead to not-so-realistic results.

SUBMITTER: Vezzi F

PROVIDER: S-EPMC3272011 | biostudies-literature | 2012

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Feature-by-feature--evaluating de novo sequence assembly.

Vezzi Francesco F Narzisi Giuseppe G Mishra Bud B

PloS one 20120203 2

The whole-genome sequence assembly (WGSA) problem is among one of the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: while on the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing the informat ...[more]

PMID: 22319599

Similar Datasets

Project description:Venom-gland transcriptomics is a key tool in the study of the evolution, ecology, function, and pharmacology of animal venoms. In particular, gene-expression variation and coding sequences gained through transcriptomics provide key information for explaining functional venom variation over both ecological and evolutionary timescales. The accuracy and usefulness of inferences made through transcriptomics, however, is limited by the accuracy of the transcriptome assembly, which is a bioinformatic problem with several possible solutions. Several methods have been employed to assemble venom-gland transcriptomes, with the Trinity assembler being the most commonly applied among them. Although previous evidence of variation in performance among assembly software exists, particularly regarding recovery of difficult-to-assemble multigene families such as snake venom metalloproteinases, much work to date still employs a single assembly method. We evaluated the performance of several commonly used de novo assembly methods for the recovery of both nontoxin transcripts and complete, high-quality venom-gene transcripts across eleven snake and four scorpion transcriptomes. We varied k-mer sizes used by some assemblers to evaluate the impact of k-mer length on transcript recovery. We showed that the recovery of nontoxin transcripts and toxin transcripts is best accomplished through different assembly software, with SDT at smaller k-mer lengths and Trinity being best for nontoxin recovery and a combination of SeqMan NGen and a seed-and-extend approach implemented in Extender as the best means of recovering a complete set of toxin transcripts. In particular, Extender was the only means tested capable of assembling multiple isoforms of the diverse snake venom metalloproteinase family, while traditional approaches such as Trinity recovered at most one metalloproteinase transcript. Our work demonstrated that traditional metrics of assembly performance are not predictive of performance in the recovery of complete and high quality toxin genes. Instead, effective venom-gland transcriptomic studies should combine and quality-filter the results of several assemblers with varying algorithmic strategies.

Dataset Information

Feature-by-feature--evaluating de novo sequence assembly.

Publications

Feature-by-feature--evaluating de novo sequence assembly.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets