Dataset Information

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage.

ABSTRACT: BACKGROUND:High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences "colored" by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from the sequences all colored k-mers, strings of length k, and stores them in vertices. RESULTS:In this paper, we present an alignment-free, reference-free and incremental data structure for storing a pan-genome as a C-DBG: the bloom filter trie (BFT). The data structure allows to store and compress a set of colored k-mers, and also to efficiently traverse the graph. Bloom filter trie was used to index and query different pangenome datasets. Compared to another state-of-the-art data structure, BFT was up to two times faster to build while using about the same amount of main memory. For querying k-mers, BFT was about 52-66 times faster while using about 5.5-14.3 times less memory. CONCLUSION:We present a novel succinct data structure called the Bloom Filter Trie for indexing a pan-genome as a colored de Bruijn graph. The trie stores k-mers and their colors based on a new representation of vertices that compress and index shared substrings. Vertices use basic data structures for lightweight substrings storage as well as Bloom filters for efficient trie and graph traversals. Experimental results prove better performance compared to another state-of-the-art data structure. AVAILABILITY:https://www.github.com/GuillaumeHolley/BloomFilterTrie.

SUBMITTER: Holley G

PROVIDER: S-EPMC4832552 | biostudies-literature | 2016

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage.

Holley Guillaume G Wittler Roland R Stoye Jens J

Algorithms for molecular biology : AMB 20160414

<h4>Background</h4>High throughput sequencing technologies have become fast and cheap in the past years. As a result, large-scale projects started to sequence tens to several thousands of genomes per species, producing a high number of sequences sampled from each genome. Such a highly redundant collection of very similar sequences is called a pan-genome. It can be transformed into a set of sequences "colored" by the genomes to which they belong. A colored de Bruijn graph (C-DBG) extracts from th ...[more]

PMID: 27087830

Similar Datasets

Project description:Whole genome sequencing has become fast, accurate, and cheap, paving the way towards the large-scale collection and processing of human genome data. Unfortunately, this dawning genome era does not only promise tremendous advances in biomedical research but also causes unprecedented privacy risks for the many. Handling storage and processing of large genome datasets through cloud services greatly aggravates these concerns. Current research efforts thus investigate the use of strong cryptographic methods and protocols to implement privacy-preserving genomic computations.We propose FHE-BLOOM and PHE-BLOOM, two efficient approaches for genetic disease testing using homomorphically encrypted Bloom filters. Both approaches allow the data owner to securely outsource storage and computation to an untrusted cloud. FHE-BLOOM is fully secure in the semi-honest model while PHE-BLOOM slightly relaxes security guarantees in a trade-off for highly improved performance.We implement and evaluate both approaches on a large dataset of up to 50 patient genomes each with up to 1000000 variations (single nucleotide polymorphisms). For both implementations, overheads scale linearly in the number of patients and variations, while PHE-BLOOM is faster by at least three orders of magnitude. For example, testing disease susceptibility of 50 patients with 100000 variations requires only a total of 308.31 s (σ=8.73 s) with our first approach and a mere 0.07 s (σ=0.00 s) with the second. We additionally discuss security guarantees of both approaches and their limitations as well as possible extensions towards more complex query types, e.g., fuzzy or range queries.Both approaches handle practical problem sizes efficiently and are easily parallelized to scale with the elastic resources available in the cloud. The fully homomorphic scheme, FHE-BLOOM, realizes a comprehensive outsourcing to the cloud, while the partially homomorphic scheme, PHE-BLOOM, trades a slight relaxation of security guarantees against performance improvements by at least three orders of magnitude.

Dataset Information

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage.

Publications

Bloom Filter Trie: an alignment-free and reference-free data structure for pan-genome storage.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets