Dataset Information

Compact representation of k-mer de Bruijn graphs for genome read assembly.

ABSTRACT:

Background

Reads from high-throughput sequencing are often processed in terms of edges in the de Bruijn graph representing all k-mers from the reads. Storing all k-mers in a lookup table can demand a great deal of memory, even after read errors have been removed, but this can be alleviated by using a memory-efficient data structure.
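To make the graph in the Background concrete, here is a minimal sketch of how distinct k-mers (edges) and (k-1)-mers (vertices) could be collected from reads. The function names and the toy reads are illustrative assumptions, not taken from the paper, and reverse complements and read errors are ignored.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Return the set of distinct k-mers (graph edges) found in the reads."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    return kmers


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer with at least one outgoing edge to its successors."""
    graph = defaultdict(set)
    for kmer in de_bruijn_edges(reads, k):
        # An edge joins the k-mer's (k-1)-prefix to its (k-1)-suffix.
        graph[kmer[:-1]].add(kmer[1:])
    return {vertex: sorted(succ) for vertex, succ in graph.items()}


# Two overlapping toy reads: 8 k-mer occurrences, but only 4 distinct k-mers.
reads = ["ACGTAC", "CGTACG"]
edges = de_bruijn_edges(reads, 3)
graph = de_bruijn_graph(reads, 3)
```

Overlapping reads repeat the same k-mers, so the set of distinct edges is much smaller than the total number of k-mer occurrences. That is the redundancy a compact index can exploit.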

Results

The FM-index, which is based on the Burrows-Wheeler transform, is an efficient data structure that provides a searchable index of all substrings from a set of strings. It is used to represent full genomes compactly for mapping reads to a genome: the memory required to store it is of the same order of magnitude as the strings themselves. However, reads from high-throughput sequencing mostly have high coverage, so the same substrings occur many times across different reads. I here present a modification of the FM-index, which I call the kFM-index, for indexing the set of k-mers from the reads. For DNA sequences, this requires 5 bits of information for each vertex of the corresponding de Bruijn subgraph, i.e. for each distinct (k-1)-mer, plus some additional overhead, typically 0.5 to 1 bit per vertex, for storing the equivalent of the FM-index, which is needed to walk the underlying de Bruijn graph and reproduce the actual k-mers efficiently.
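For readers unfamiliar with the FM-index mentioned in the Results, the following is a minimal sketch of a plain FM-index with backward search over a single string. It is not the kFM-index and not the author's Java implementation: the class name, the naive suffix sort, and the uncompressed occurrence table are simplifying assumptions chosen for readability.

```python
class FMIndex:
    """Toy FM-index supporting substring counting via backward search."""

    def __init__(self, text):
        if "$" in text:
            raise ValueError("text must not contain the sentinel '$'")
        # '$' sorts before A, C, G, T and marks the end of the text.
        self.text = text + "$"
        # Naive suffix array: fine for a sketch, too slow for real genomes.
        sa = sorted(range(len(self.text)), key=lambda i: self.text[i:])
        # Burrows-Wheeler transform: the character preceding each sorted suffix.
        self.bwt = "".join(self.text[i - 1] for i in sa)
        alphabet = sorted(set(self.text))
        # C[c] = number of characters in the text strictly smaller than c.
        self.C = {}
        total = 0
        for c in alphabet:
            self.C[c] = total
            total += self.text.count(c)
        # occ[c][i] = number of occurrences of c in bwt[:i] (uncompressed).
        self.occ = {c: [0] for c in alphabet}
        for ch in self.bwt:
            for c in alphabet:
                self.occ[c].append(self.occ[c][-1] + (1 if ch == c else 0))

    def count(self, pattern):
        """Count occurrences of pattern by narrowing a suffix-array interval."""
        lo, hi = 0, len(self.bwt)
        for c in reversed(pattern):
            if c not in self.C:
                return 0
            lo = self.C[c] + self.occ[c][lo]
            hi = self.C[c] + self.occ[c][hi]
            if lo >= hi:
                return 0
        return hi - lo


index = FMIndex("ACGTACGA")
acg_hits = index.count("ACG")
```

A production FM-index would sample or compress the occurrence table. The per-vertex overhead quoted in the abstract refers to the kFM-index's own equivalent of such auxiliary data, which this sketch does not model.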

Conclusions

The kFM-index could replace more memory-demanding data structures for storing the de Bruijn k-mer graph representation of sequence reads. A Java implementation, with additional technical documentation, is provided to demonstrate the applicability of the data structure (http://folk.uio.no/einarro/Projects/KFM-index/).
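As a rough illustration of the memory claim, the per-vertex figures quoted in the abstract (5 bits plus typically 0.5 to 1 bit of overhead) can be turned into a back-of-the-envelope estimate. The comparison baseline below, a table storing each k-mer at 2 bits per base with no hashing or pointer overhead, is an assumption made here for illustration, as are the function names and the example sizes.

```python
def kfm_index_bits(num_vertices, overhead_bits_per_vertex=1.0):
    """Estimated kFM-index size: 5 bits per vertex plus the quoted overhead."""
    return num_vertices * (5 + overhead_bits_per_vertex)


def plain_kmer_table_bits(num_kmers, k):
    """Assumed baseline: every k-mer stored explicitly at 2 bits per base."""
    return num_kmers * 2 * k


def bits_to_gib(bits):
    """Convert a bit count to gibibytes."""
    return bits / 8 / 2**30


# Hypothetical graph: one billion vertices and, for simplicity, as many k-mers.
n = 1_000_000_000
k = 31
kfm_low = bits_to_gib(kfm_index_bits(n, 0.5))
kfm_high = bits_to_gib(kfm_index_bits(n, 1.0))
baseline = bits_to_gib(plain_kmer_table_bits(n, k))
```

Under these assumptions the index needs 5.5 to 6 bits per vertex against 62 bits per stored 31-mer in the baseline. Real savings depend on the ratio of edges to vertices and on the overhead of the structure being replaced.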

SUBMITTER: Rodland EA

PROVIDER: S-EPMC4015147 | biostudies-literature | 2013 Oct

REPOSITORIES: biostudies-literature


Publications

Compact representation of k-mer de Bruijn graphs for genome read assembly.

Rødland Einar Andreas EA

BMC Bioinformatics, 2013-10-23


PMID: 24152242

Similar Datasets

Velvet: algorithms for de novo short read assembly using de Bruijn graphs.
| S-EPMC2336801 | biostudies-literature

Parallelized short read assembly of large genomes using de Bruijn graphs.
| S-EPMC3167803 | biostudies-literature

How to apply de Bruijn graphs to genome assembly.
| S-EPMC5531759 | biostudies-literature

Metagenome SNP calling via read-colored de Bruijn graphs.
| S-EPMC8016496 | biostudies-literature

Simplitigs as an efficient and scalable representation of de Bruijn graphs.
| S-EPMC8025321 | biostudies-literature

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs.
| S-EPMC3421212 | biostudies-literature

De novo assembly and genotyping of variants using colored de Bruijn graphs.
| S-EPMC3272472 | biostudies-literature

cloudSPAdes: assembly of synthetic long reads using de Bruijn graphs.
| S-EPMC6612831 | biostudies-literature

Assembly of long error-prone reads using de Bruijn graphs.
| S-EPMC5206522 | biostudies-literature

Succinct dynamic de Bruijn graphs.
| S-EPMC8337006 | biostudies-literature