Dataset Information

A space and time-efficient index for the compacted colored de Bruijn graph.

ABSTRACT:

Motivation

Indexing reference sequences for search, whether individual genomes or collections of genomes, is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure and to collapse highly repetitive sequence regions. Yet much less attention has been given to how best to index such a structure so that queries can be performed efficiently and memory usage remains practical as the size and number of indexed reference sequences grow large.

Results

We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so has the number of proposed representations of this structure. Existing structures typically fall into two categories: those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment.
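To make the lookup idea concrete, the following is a minimal Python sketch of a hash-based index over unitigs, not the pufferfish implementation. It rests on several assumptions that are not in the abstract: a plain Python dict stands in for the minimum perfect hash function, only the forward strand is indexed (no canonical k-mers), colors are stored per unitig as a set of reference names, and the function names `build_index` and `lookup` are hypothetical.

```python
def build_index(unitigs, k):
    """Map every k-mer to (unitig_id, offset).

    Assumption: a dict replaces the minimum perfect hash function;
    a real index would store only positions, not the k-mer keys.
    """
    index = {}
    for uid, seq in enumerate(unitigs):
        for off in range(len(seq) - k + 1):
            index[seq[off:off + k]] = (uid, off)
    return index


def lookup(index, unitigs, colors, kmer):
    """Return (unitig_id, offset, references) for a k-mer, or None."""
    hit = index.get(kmer)
    if hit is None:
        return None
    uid, off = hit
    # A perfect hash returns some slot even for keys it was not built on,
    # so a hit is confirmed by comparing against the stored sequence.
    if unitigs[uid][off:off + len(kmer)] != kmer:
        return None
    return uid, off, colors[uid]


# Toy compacted graph with k = 3: adjacent unitigs overlap by k - 1 bases.
unitigs = ["ACGT", "GTTA"]
colors = [frozenset({"refA", "refB"}), frozenset({"refA"})]
index = build_index(unitigs, 3)
```

In this toy setup, `lookup(index, unitigs, colors, "CGT")` resolves to unitig 0 at offset 1 together with the references attached to that unitig, while a k-mer absent from every unitig, such as `"GGG"`, yields `None`. Storing positions once per unitig rather than once per reference is what lets the compacted graph collapse repeated sequence.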

Availability and implementation

pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Almodaresi F

PROVIDER: S-EPMC6022659 | biostudies-literature | 2018 Jul

REPOSITORIES: biostudies-literature

Publications

A space and time-efficient index for the compacted colored de Bruijn graph.

Almodaresi F, Sarkar H, Srivastava A, Patro R

Bioinformatics (Oxford, England), 2018 Jul 1, issue 13

PMID: 29949982

Similar Datasets

Graphite: painting genomes using a colored de Bruijn graph. | S-EPMC11497850 | biostudies-literature

Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. | S-EPMC7499882 | biostudies-literature

Extremely fast construction and querying of compacted and colored de Bruijn graphs with GGCAT. | S-EPMC10538363 | biostudies-literature

Space-efficient and exact de Bruijn graph representation based on a Bloom filter. | S-EPMC3848682 | biostudies-literature

Fast and Scalable Parallel External-Memory Construction of Colored Compacted de Bruijn Graphs with Cuttlefish 3. | S-EPMC11838517 | biostudies-literature

Succinct colored de Bruijn graphs. | S-EPMC5872255 | biostudies-literature

Pan-genome de Bruijn graph using the bidirectional FM-index. | S-EPMC10605969 | biostudies-literature

HaVec: An Efficient de Bruijn Graph Construction Algorithm for Genome Assembly. | S-EPMC5591975 | biostudies-literature

Fast and efficient Rmap assembly using the Bi-labelled de Bruijn graph. | S-EPMC8147420 | biostudies-literature

Building large updatable colored de Bruijn graphs via merging. | S-EPMC6612864 | biostudies-literature