
Dataset Information


Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression.


ABSTRACT: Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections makes it effectively impossible to search these data using tools such as BLAST and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs, and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids, or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
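The core idea in the abstract, letting evolutionary relatedness decide how genomes are arranged before a standard compressor runs, can be illustrated with a toy sketch. This is not the authors' pipeline: the sequences are synthetic, the clade label stands in for a real phylogeny, `zlib` stands in for whatever compressor is actually used, and every function name and parameter below is hypothetical. The genome length is chosen so that a relative falls inside zlib's 32 KiB window only when related genomes sit next to each other.

```python
import random
import zlib


def mutate(seq, rate, rng):
    """Copy seq, replacing each base with a random base (possibly the same) with probability `rate`."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else base for base in seq)


def make_collection(n_clades=4, per_clade=5, length=20_000, rate=0.01, seed=0):
    """Build synthetic (clade, sequence) pairs, returned in interleaved (unsorted) order.

    Assumption: each clade is a random ancestor plus lightly mutated descendants.
    """
    rng = random.Random(seed)
    clades = []
    for clade in range(n_clades):
        ancestor = "".join(rng.choice("ACGT") for _ in range(length))
        clades.append([(clade, mutate(ancestor, rate, rng)) for _ in range(per_clade)])
    # Round-robin across clades so that relatives are far apart in the stream.
    return [clades[c][i] for i in range(per_clade) for c in range(n_clades)]


def compressed_size(genomes):
    """Size in bytes of the concatenated sequences after zlib compression."""
    data = "".join(seq for _, seq in genomes).encode("ascii")
    return len(zlib.compress(data, 9))


def order_by_clade(genomes):
    """Place related genomes next to each other (stand-in for a phylogenetic ordering)."""
    return sorted(genomes, key=lambda item: item[0])


if __name__ == "__main__":
    collection = make_collection()
    print("interleaved order:", compressed_size(collection), "bytes")
    print("clade order:      ", compressed_size(order_by_clade(collection)), "bytes")
```

In this sketch the reordering is lossless: the same sequences are stored, only their order changes, and the compressor finds long matches against the neighbouring relative. Real collections would need an actual tree or clustering to derive the order.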

SUBMITTER: Brinda K 

PROVIDER: S-EPMC10153118 | biostudies-literature | 2023 Apr

REPOSITORIES: biostudies-literature


Publications

Efficient and Robust Search of Microbial Genomes via Phylogenetic Compression.

Karel Břinda, Leandro Lima, Simone Pignotti, Natalia Quinones-Olvera, Kamil Salikhov, Rayan Chikhi, Gregory Kucherov, Zamin Iqbal, Michael Baym

bioRxiv: the preprint server for biology, 2024-05-11



Similar Datasets

| S-EPMC10680779 | biostudies-literature
| S-EPMC1370640 | biostudies-literature
| S-EPMC7206332 | biostudies-literature
| S-EPMC5243798 | biostudies-literature
| S-EPMC3526167 | biostudies-literature
| S-EPMC10291887 | biostudies-literature
| S-EPMC7237447 | biostudies-literature
| S-EPMC4251999 | biostudies-literature
| S-EPMC8565063 | biostudies-literature
| S-BSST1002 | biostudies-other