Dataset Information

Disk-based k-mer counting on a PC.

ABSTRACT:

Background

The k-mer counting problem, which is to build the histogram of occurrences of every k-symbol-long substring in a given text, is important for many bioinformatics applications. These include developing de Bruijn graph genome assemblers, fast multiple sequence alignment, and repeat detection.
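To make the problem statement concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the definition above, not the method of the paper; the function name `count_kmers` is hypothetical, and the approach assumes the whole text and all counts fit in RAM, which does not hold for genome-scale data.

```python
from collections import Counter


def count_kmers(text: str, k: int) -> Counter:
    """Return a histogram of every k-symbol-long substring of `text`.

    Naive approach: slide a window of width k over the text and tally
    each substring. Memory grows with the number of distinct k-mers.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    # range() is empty when the text is shorter than k, giving an empty histogram.
    return Counter(text[i:i + k] for i in range(len(text) - k + 1))


if __name__ == "__main__":
    # Toy input chosen for illustration only.
    for kmer, n in sorted(count_kmers("ACGTACGT", 3).items()):
        print(kmer, n)
```

For the toy input `ACGTACGT` with k = 3, the six windows are ACG, CGT, GTA, TAC, ACG, CGT, so ACG and CGT each occur twice and GTA and TAC once.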

Results

We propose a simple, yet efficient, parallel disk-based algorithm for counting k-mers. Experiments show that it usually offers the fastest solution to the considered problem, while demanding a relatively small amount of memory. In particular, it can count the k-mer statistics for short-read human genome data, supplied as a gzipped FASTQ file, in less than 40 minutes on a PC with 16 GB of RAM and 6 CPU cores, and for long-read human genome data in less than 70 minutes. On a more powerful machine, using 32 GB of RAM and 32 CPU cores, the tasks are accomplished in less than half the time. For most tested settings of this problem with mammalian-size data, no other algorithm accomplishes the task in comparable time. Our solution is also memory-frugal; most competing algorithms cannot work efficiently on a PC with 16 GB of memory for such massive data.

Conclusions

By making use of cheap disk space and exploiting CPU and I/O parallelism, we propose a very competitive k-mer counting procedure, called KMC. Our results suggest that judicious resource management may make it possible to solve at least some bioinformatics problems with massive data on a commodity personal computer.
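The abstract does not describe KMC's internals, so the following is not KMC's algorithm. It is a generic, single-threaded sketch of the broader idea of trading disk space for memory: k-mers are first distributed into bin files on disk, then each bin is counted on its own, so only one bin's histogram needs to be in memory at a time. The function name `count_kmers_on_disk`, the `n_bins` parameter, and the CRC32-based bin choice are all assumptions made for illustration.

```python
import os
import tempfile
import zlib
from collections import Counter


def count_kmers_on_disk(text: str, k: int, n_bins: int = 4) -> Counter:
    """Two-phase k-mer counting using temporary bin files (illustrative only)."""
    total = Counter()
    with tempfile.TemporaryDirectory() as tmp:
        paths = [os.path.join(tmp, f"bin_{b}.txt") for b in range(n_bins)]

        # Phase 1: distribute. Equal k-mers always hash to the same bin.
        files = [open(p, "w") for p in paths]
        try:
            for i in range(len(text) - k + 1):
                kmer = text[i:i + k]
                b = zlib.crc32(kmer.encode("ascii")) % n_bins
                files[b].write(kmer + "\n")
        finally:
            for f in files:
                f.close()

        # Phase 2: count each bin independently.
        for p in paths:
            with open(p) as f:
                bin_counts = Counter(line.rstrip("\n") for line in f)
            # Merged here only so the toy result is easy to inspect; a real
            # tool would write each bin's counts to an output file instead.
            total.update(bin_counts)
    return total


if __name__ == "__main__":
    print(sorted(count_kmers_on_disk("ACGTACGT", 3).items()))
```

Because each k-mer lands in exactly one bin, the per-bin histograms never overlap, which is what would allow the bins to be processed separately (or, in a parallel implementation, concurrently).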

SUBMITTER: Deorowicz S

PROVIDER: S-EPMC3680041 | biostudies-literature | 2013 May

REPOSITORIES: biostudies-literature

Publications

Disk-based k-mer counting on a PC.

Deorowicz S, Debudaj-Grabysz A, Grabowski S

BMC Bioinformatics, 2013-05-16

PMID: 23679007

Similar Datasets

A benchmark study of k-mer counting methods for high-throughput sequencing.
| S-EPMC6280066 | biostudies-other

The K-mer File Format: a standardized and compact disk representation of sets of k-mers.
| S-EPMC9477520 | biostudies-literature

Counting fluorescently labeled proteins in tissues in the spinning-disk microscope using single-molecule calibrations.
| S-EPMC9265152 | biostudies-literature

STRling: a k-mer counting approach that detects short tandem repeat expansions at known and novel loci.
| S-EPMC9753380 | biostudies-literature

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure.
| S-EPMC4111482 | biostudies-literature

PC Deficiency Testing: Thrombin-Thrombomodulin as PC Activator and Aptamer-Based Enzyme Capturing Increase Diagnostic Accuracy.
| S-EPMC8542722 | biostudies-literature

Calibration tools for PC-based vision assessment.
| S-EPMC3897264 | biostudies-other

Conformation-based refinement of 18-mer DNA structures.
| S-EPMC10306069 | biostudies-literature

FQSqueezer: k-mer-based compression of sequencing data.
| S-EPMC6969201 | biostudies-literature

Development of a Lab-on-a-Disk Platform with Digital Imaging for Identification and Counting of Parasite Eggs in Human and Animal Stool.
| S-EPMC6952989 | biostudies-literature