Dataset Information

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.

ABSTRACT: Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR by up to 30× with little or no additional memory and with set containment queries that are only 1.3 to 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
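The overlap idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the class and function names (`BloomFilter`, `kmers`, `neighbors`, `one_sided_query`), the use of salted BLAKE2b as the hash family, and the specific "accept only if some overlapping neighbor is also present" rule are assumptions chosen for illustration. The sketch assumes every stored k-mer comes from a sequence longer than k, so that each true k-mer has at least one stored neighbor.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over strings (illustrative, not optimized)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            salt = seed.to_bytes(8, "little")
            digest = hashlib.blake2b(item.encode(), salt=salt).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1
                   for pos in self._positions(item))


def kmers(seq, k):
    """All overlapping substrings of length k, in order of occurrence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build(reads, k, num_bits, num_hashes):
    """Insert every k-mer of every read into a fresh Bloom filter."""
    bf = BloomFilter(num_bits, num_hashes)
    for read in reads:
        for kmer in kmers(read, k):
            bf.add(kmer)
    return bf


def neighbors(kmer):
    """The 8 k-mers that could overlap kmer by k-1 bases on either side."""
    left = [base + kmer[:-1] for base in "ACGT"]
    right = [kmer[1:] + base for base in "ACGT"]
    return left + right


def one_sided_query(bf, kmer):
    """Report kmer as present only if it and at least one neighbor are found.

    Assumption of this sketch: a k-mer from a sequence of length exactly k
    has no stored neighbor and would be wrongly rejected.
    """
    if kmer not in bf:
        return False
    return any(n in bf for n in neighbors(kmer))
```

A plain lookup (`kmer in bf`) costs one filter query, while `one_sided_query` may cost up to nine, which is the kind of trade-off between query time and false positives that the abstract describes. A false positive now requires both the query and one of its neighbors to collide with set bits, which is less likely than a single collision.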

SUBMITTER: Pellow D

PROVIDER: S-EPMC5467106 | biostudies-literature | 2017 Jun

REPOSITORIES: biostudies-literature

Publications

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters.

David Pellow, Darya Filippova, Carl Kingsford

Journal of Computational Biology: a journal of computational molecular cell biology, 2016-11-09, issue 6

PMID: 27828710

Similar Datasets

Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters. | S-EPMC7382288 | biostudies-literature

BioBloom tools: fast, accurate and memory-efficient host species sequence screening using bloom filters. | S-EPMC4816029 | biostudies-literature

Classification of DNA sequences using Bloom filters. | S-EPMC2887045 | biostudies-literature

Hierarchical Interleaved Bloom Filter: enabling ultrafast, approximate sequence queries. | S-EPMC10230713 | biostudies-literature

streammd: fast low-memory duplicate marking using a Bloom filter. | S-EPMC10112951 | biostudies-literature

ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter. | S-EPMC5411771 | biostudies-literature

MetaProFi: an ultrafast chunked Bloom filter for storing and querying protein and nucleotide sequence data for accurate identification of functionally relevant genetic variants. | S-EPMC9994790 | biostudies-literature

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage. | S-EPMC4832552 | biostudies-literature

DNA Bloom Filter enables anti-contamination and file version control for DNA-based data storage. | S-EPMC10981766 | biostudies-literature

Filter forensics: microbiota recovery from residential HVAC filters. | S-EPMC5791358 | biostudies-literature