Dataset Information

CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices.

ABSTRACT:

Motivation

K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time- and space-efficient.
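The truncation idea can be illustrated with a minimal Python sketch. This is not the CMash implementation: it uses plain Python sets instead of a KTST, applies no sketching, and all function names (`kmers`, `truncate`, `multi_k_estimates`) are hypothetical. It only shows that k-mers built once at kmax can be cut down to their first k characters to approximate the k-mer set for any smaller k.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def truncate(kmer_set, k):
    """Keep only the first k characters of every stored k-mer."""
    return {km[:k] for km in kmer_set}


def jaccard(a, b):
    """Jaccard index |A & B| / |A | B| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(a, b):
    """Containment index |A & B| / |A| (0.0 if A is empty)."""
    return len(a & b) / len(a) if a else 0.0


def multi_k_estimates(seq_a, seq_b, k_max, k_values):
    """Build k_max-mers once, then reuse them for every smaller k."""
    big_a, big_b = kmers(seq_a, k_max), kmers(seq_b, k_max)
    results = {}
    for k in k_values:
        a, b = truncate(big_a, k), truncate(big_b, k)
        results[k] = (jaccard(a, b), containment(a, b))
    return results
```

In this toy version, truncation can miss the k-mers that start within the last kmax - k positions of a sequence. For example, truncating the 4-mers of `ACGTTT` to length 2 gives `AC`, `CG`, `GT` but not `TT`. This kind of discrepancy is one simple way to see why truncation can introduce a bias.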

Results

We derived the theoretical expression of the bias factor due to truncation and showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure.
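For context, a bottom-m sketch estimate of the Jaccard index can be written as follows. This is a generic textbook-style sketch, not the baseline used in the paper: the hash function (truncated SHA-1) and the function names are assumptions chosen for illustration. A sketch of this kind is tied to one k, since it is built from the k-mers of that size.

```python
import hashlib


def hash_kmer(kmer):
    """Deterministic 64-bit hash of a k-mer (illustrative choice of hash)."""
    return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")


def bottom_m_sketch(kmer_set, m):
    """Keep the m smallest distinct hash values of the k-mer set."""
    return sorted({hash_kmer(x) for x in kmer_set})[:m]


def jaccard_estimate(sketch_a, sketch_b, m):
    """Estimate Jaccard from two bottom-m sketches.

    The m smallest hashes of the union are taken from the merged sketches;
    the estimate is the fraction of them present in both sketches.
    """
    union_bottom = sorted(set(sketch_a) | set(sketch_b))[:m]
    if not union_bottom:
        return 0.0
    shared = set(sketch_a) & set(sketch_b)
    return sum(1 for v in union_bottom if v in shared) / len(union_bottom)
```

When m is at least as large as both sets, the estimate equals the exact Jaccard index (barring hash collisions); smaller m trades accuracy for space.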

Availability and implementation

A Python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Liu S

PROVIDER: S-EPMC9235470 | biostudies-literature | 2022 Jun

REPOSITORIES: biostudies-literature

Publications

CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices.

Liu S, Koslicki D

Bioinformatics (Oxford, England), 2022 Jun 1, Suppl 1

PMID: 35758788

Similar Datasets

Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data.
| S-EPMC6929325 | biostudies-literature

Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation.
| S-EPMC10505501 | biostudies-literature

Noninvasive Risk Prediction Models for Heart Failure Using Proportional Jaccard Indices and Comorbidity Patterns.
| S-EPMC11267177 | biostudies-literature

Dataset of Jaccard similarity indices from 1,597 European political manifestos across 27 countries (1945-2017).
| S-EPMC6479077 | biostudies-literature

Observation resolution critically influences movement-based foraging indices.
| S-EPMC6754423 | biostudies-literature

Fast Object Motion Estimation Based on Dynamic Stixels.
| S-EPMC5017348 | biostudies-literature

KAnalyze: a fast versatile pipelined k-mer toolkit.
| S-EPMC4080738 | biostudies-literature

KF-NIPT: K-mer and fetal fraction-based estimation of chromosomal anomaly from NIPT data.
| S-EPMC12100778 | biostudies-literature

Robust k-mer frequency estimation using gapped k-mers.
| S-EPMC3895138 | biostudies-literature

Rapid species-level metagenome profiling and containment estimation with sylph.
| S-EPMC12339375 | biostudies-literature