Dataset Information

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

ABSTRACT: Long-read single-molecule sequencing has revolutionized de novo genome assembly and enabled the automated reconstruction of reference-quality genomes. However, given the relatively high error rates of such technologies, efficient and accurate assembly of large repeats and closely related haplotypes remains challenging. We address these issues with Canu, a successor of Celera Assembler that is specifically designed for noisy single-molecule sequences. Canu introduces support for nanopore sequencing, halves depth-of-coverage requirements, and improves assembly continuity while simultaneously reducing runtime by an order of magnitude on large genomes versus Celera Assembler 8.2. These advances result from new overlapping and assembly algorithms, including an adaptive overlapping strategy based on tf-idf weighted MinHash and a sparse assembly graph construction that avoids collapsing diverged repeats and haplotypes. We demonstrate that Canu can reliably assemble complete microbial genomes and near-complete eukaryotic chromosomes using either Pacific Biosciences (PacBio) or Oxford Nanopore technologies and achieves a contig NG50 of >21 Mbp on both human and Drosophila melanogaster PacBio data sets. For assembly structures that cannot be linearly represented, Canu provides graph-based assembly outputs in graphical fragment assembly (GFA) format for analysis or integration with complementary phasing and scaffolding techniques. The combination of such highly resolved assembly graphs with long-range scaffolding information promises the complete and automated assembly of complex genomes.
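
The abstract reports assembly continuity as contig NG50 without defining the metric. As a minimal sketch, and not code taken from Canu itself, NG50 can be computed as the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the (estimated) genome size. The function name `ng50` and the toy contig lengths below are illustrative assumptions, not values from the paper.

```python
def ng50(contig_lengths, genome_size):
    """Return the NG50 of an assembly.

    NG50 is the largest length L such that contigs of length >= L
    together cover at least half of genome_size. Returns 0 when the
    assembly covers less than half of the genome.
    """
    half = genome_size / 2
    covered = 0
    # Walk contigs from longest to shortest, accumulating covered bases.
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0


if __name__ == "__main__":
    # Hypothetical contig lengths (arbitrary units) for illustration only.
    toy_contigs = [40, 30, 20, 10]
    print(ng50(toy_contigs, genome_size=160))
```

Unlike N50, which uses the total assembly length as the denominator, NG50 uses the genome size, so it does not reward an assembly for simply omitting sequence.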

SUBMITTER: Koren S

PROVIDER: S-EPMC5411767 | biostudies-literature | 2017 May

REPOSITORIES: biostudies-literature

Publications

Canu: scalable and accurate long-read assembly via adaptive k-mer weighting and repeat separation.

Koren S, Walenz BP, Berlin K, Miller JR, Bergman NH, Phillippy AM

Genome Research, 2017-03-15, issue 5

PMID: 28298431

Similar Datasets

metaFlye: scalable long-read metagenome assembly using repeat graphs.
| S-EPMC10699202 | biostudies-literature

HINGE: long-read assembly achieves optimal repeat resolution.
| S-EPMC5411769 | biostudies-literature

Fast and accurate long-read assembly with wtdbg2.
| S-EPMC7004874 | biostudies-literature

Accurate long-read de novo assembly evaluation with Inspector.
| S-EPMC8590762 | biostudies-literature

Scalable long read self-correction and assembly polishing with multiple sequence alignment.
| S-EPMC7804095 | biostudies-literature

Ultraplexing: increasing the efficiency of long-read sequencing for hybrid assembly with k-mer-based multiplexing.
| S-EPMC7071681 | biostudies-literature

Contiguous and accurate de novo assembly of metazoan genomes with modest long read coverage.
| S-EPMC5100563 | biostudies-literature

Compact representation of k-mer de Bruijn graphs for genome read assembly.
| S-EPMC4015147 | biostudies-literature

Local read haplotagging enables accurate long-read small variant calling.
| S-EPMC10515762 | biostudies-literature

Local read haplotagging enables accurate long-read small variant calling.
| S-EPMC11246426 | biostudies-literature