Dataset Information

Compact and evenly distributed k-mer binning for genomic sequences.

ABSTRACT:

Motivation

The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers (ordered m-mers where m < k) are often used to group k-mers into bins as a first step in such processing. However, minimizers are known to generate bins of very different sizes, which can pose challenges for distributed and parallel processing, as well as generally increase memory requirements. Furthermore, although various minimizer orderings have been proposed, their practical value for improving tool efficiency has not yet been fully explored.
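
To make the binning idea concrete, here is a minimal sketch of minimizer-based binning, assuming plain lexicographic ordering of m-mers. The function names `minimizer` and `bin_kmers` and the toy sequence are illustrative choices and do not come from the article or from Discount.

```python
from collections import defaultdict


def minimizer(kmer: str, m: int) -> str:
    """Return the smallest m-mer inside kmer under lexicographic ordering."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))


def bin_kmers(seq: str, k: int, m: int) -> dict:
    """Group every k-mer of seq into a bin keyed by its minimizer."""
    bins = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        bins[minimizer(kmer, m)].append(kmer)
    return dict(bins)


# Toy input (an assumption for illustration, not data from the article).
bins = bin_kmers("ACGTACGTTGCA", k=5, m=3)
bin_sizes = {key: len(kmers) for key, kmers in bins.items()}
```

Even on this toy input the bins are uneven: the `ACG` bin receives four of the eight k-mers, while `GTT` and `GCA` receive one each. This is the skew the abstract refers to.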

Results

We present Discount, a distributed k-mer counting tool based on Apache Spark, which we use to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. Using this tool, we then introduce the universal frequency ordering, a new combination of frequency-sampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes. We show that this ordering allows Discount to perform distributed k-mer counting on a large dataset in as little as 1/8 of the memory of comparable approaches, making it the most efficient out-of-core distributed k-mer counting method available.
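
The general idea of combining a frequency-based ordering with a restricted candidate set can be sketched as follows. This is not Discount's implementation. It assumes the common convention that rarer m-mers rank first, and the `allowed` parameter is a hypothetical stand-in for a universal hitting set, which this sketch takes as given rather than computing.

```python
from collections import Counter


def mmer_counts(sample: str, m: int) -> Counter:
    """Count m-mer occurrences in a sample of the data."""
    return Counter(sample[i:i + m] for i in range(len(sample) - m + 1))


def frequency_minimizer(kmer: str, m: int, counts: Counter, allowed=None) -> str:
    """Pick the rarest m-mer in kmer, breaking ties lexicographically.

    If `allowed` is given, only m-mers in that set are candidates. A true
    universal hitting set would guarantee at least one candidate per k-mer;
    this toy version falls back to all m-mers when none is present.
    """
    mmers = [kmer[i:i + m] for i in range(len(kmer) - m + 1)]
    candidates = [x for x in mmers if allowed is None or x in allowed] or mmers
    return min(candidates, key=lambda x: (counts[x], x))


# Toy sample (an assumption for illustration, not data from the article).
counts = mmer_counts("ACGTACGTTGCA", 3)
chosen = frequency_minimizer("ACGTA", 3, counts)
chosen_restricted = frequency_minimizer("ACGTA", 3, counts, allowed={"CGT"})
```

Ranking rare m-mers first steers k-mers away from the most common minimizers, which tends to even out bin sizes. Restricting the candidates to a smaller set limits how many distinct bins can occur.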

Availability and implementation

Discount is GPL licensed and available at https://github.com/jtnystrom/discount. The data underlying this article are available in the article and in its online supplementary material.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Nystrom-Persson J

PROVIDER: S-EPMC8428581 | biostudies-literature | 2021 Sep

REPOSITORIES: biostudies-literature

Publications

Compact and evenly distributed k-mer binning for genomic sequences.

Nyström-Persson J, Keeble-Gagnère G, Zawad N

Bioinformatics (Oxford, England), 2021-09-01, issue 17

PMID: 33693556

Similar Datasets

Towards evenly distributed grazing patterns: including social context in sheep management strategies.
| S-EPMC4924134 | biostudies-literature

Precision Oncology Beyond Genomics: The Future Is Here-It Is Just Not Evenly Distributed.
| S-EPMC8072767 | biostudies-literature

Binning unassembled short reads based on k-mer abundance covariance using sparse coding.
| S-EPMC7099633 | biostudies-literature

Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes.
| S-EPMC4828714 | biostudies-literature

High Density Arrayed Ni/NiO Core-shell Nanospheres Evenly Distributed on Graphene for Ultrahigh Performance Supercapacitor.
| S-EPMC5735128 | biostudies-literature

Evenly Distributed Microporous Structure and E7 Peptide Functionalization Synergistically Accelerate Osteogenesis and Angiogenesis in Engineered Periosteum.
| S-EPMC11923966 | biostudies-literature

Binning sequences using very sparse labels within a metagenome.
| S-EPMC2383919 | biostudies-literature

Compact representation of k-mer de Bruijn graphs for genome read assembly.
| S-EPMC4015147 | biostudies-literature

The aminoglycoside 6'-N-acetyltransferase type Ib encoded by Tn1331 is evenly distributed within the cell's cytoplasm.
| S-EPMC182613 | biostudies-literature

Improving dOCT image quality with short sequences and automated binning.
| S-EPMC12532347 | biostudies-literature