
Dataset Information


An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data.


ABSTRACT: Next-generation sequencing (NGS) approaches rapidly produce millions to billions of short reads, which allow pathogen detection and discovery in human clinical, animal and environmental samples. A major limitation of sequence homology-based identification for highly divergent microorganisms is the short length of reads generated by most highly parallel sequencing technologies. Short reads require a high level of sequence similarity to annotated genes to confidently predict gene function or homology. Recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We describe an ensemble strategy that integrates the sequential use of various de Bruijn graph and overlap-layout-consensus assemblers with a novel partitioned sub-assembly approach. We also propose new quality metrics suitable for evaluating metagenome de novo assembly. We demonstrate that this ensemble strategy, tested on in silico spike-in, clinical and environmental NGS datasets, produced significantly better contigs than current approaches.

SUBMITTER: Deng X 

PROVIDER: S-EPMC4402509 | biostudies-literature | 2015 Apr

REPOSITORIES: biostudies-literature


Publications

An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data.

Deng X, Naccache SN, Ng T, Federman S, Li L, Chiu CY, Delwart EL

Nucleic Acids Research, 2015 Jan 13, issue 7



Similar Datasets

| S-EPMC3441570 | biostudies-literature
| S-EPMC3246234 | biostudies-literature
| S-EPMC2631076 | biostudies-literature
| S-EPMC3528712 | biostudies-literature
| S-EPMC4030574 | biostudies-literature
| S-EPMC9804104 | biostudies-literature
| S-EPMC2864248 | biostudies-literature
| S-EPMC3577829 | biostudies-literature
| S-EPMC3585893 | biostudies-literature
| S-EPMC3767511 | biostudies-literature