Dataset Information

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.


ABSTRACT:

Motivation

Counting the number of occurrences of every k-mer (substring of length k) in a long string is a central subproblem in many applications, including genome assembly, error correction of sequencing reads, fast multiple sequence alignment and repeat detection. Recently, the deep sequence coverage generated by next-generation sequencing technologies has caused the amount of sequence to be processed during a genome project to grow rapidly, and has rendered current k-mer counting tools too slow and memory intensive. At the same time, large multicore computers have become commonplace in research facilities allowing for a new parallel computational paradigm.

Results

We propose a new k-mer counting algorithm and associated implementation, called Jellyfish, which is fast and memory efficient. It is based on a multithreaded, lock-free hash table optimized for counting k-mers up to 31 bases in length. Due to their flexibility, suffix arrays have been the data structure of choice for solving many string problems. For the task of k-mer counting, important in many biological applications, Jellyfish offers a much faster and more memory-efficient solution.
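
The 31-base limit mentioned above is consistent with storing each k-mer in a single 64-bit word: at 2 bits per base, 31 bases occupy 62 bits. The sketch below illustrates that packing and a basic count. It is not Jellyfish's implementation. The 2-bit code (A=0, C=1, G=2, T=3), the function names count_kmers and unpack, and the choice to skip windows containing non-ACGT characters are all assumptions made for illustration. A plain single-threaded Python Counter stands in for the multithreaded, lock-free hash table described in the abstract.

```python
from collections import Counter

# Assumed 2-bit code per base; the abstract does not specify the encoding.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq, k):
    """Count every k-mer in seq, keyed by its 2-bit packed integer."""
    if not 1 <= k <= 31:
        raise ValueError("k must be between 1 and 31")
    mask = (1 << (2 * k)) - 1  # keeps only the lowest 2*k bits
    counts = Counter()         # single-threaded stand-in for a hash table
    packed = 0
    valid = 0                  # bases seen since the last non-ACGT character
    for ch in seq.upper():
        code = BASE_CODE.get(ch)
        if code is None:
            # Illustrative choice: restart the window at ambiguous bases.
            packed, valid = 0, 0
            continue
        # Slide the window: shift in the new base, drop the oldest one.
        packed = ((packed << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[packed] += 1
    return counts


def unpack(packed, k):
    """Decode a packed integer back into its k-mer string."""
    return "".join("ACGT"[(packed >> (2 * (k - 1 - i))) & 3] for i in range(k))
```

For example, count_kmers("ACGTACGT", 3) yields four distinct keys which, decoded with unpack, are ACG and CGT (two occurrences each) and GTA and TAC (one occurrence each). Because every key for k up to 31 is below 2^62, it fits in one machine word, which is what makes single-word atomic updates possible in a compiled implementation.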

Availability

The Jellyfish software is written in C++ and is GPL licensed. It is available for download at http://www.cbcb.umd.edu/software/jellyfish.

SUBMITTER: Marcais G 

PROVIDER: S-EPMC3051319 | biostudies-literature | 2011 Mar

REPOSITORIES: biostudies-literature

Publications

A fast, lock-free approach for efficient parallel counting of occurrences of k-mers.

Marçais G, Kingsford C

Bioinformatics (Oxford, England), 2011-01-07, issue 6


Similar Datasets

| S-EPMC3166945 | biostudies-literature
| S-EPMC4248469 | biostudies-literature
| S-EPMC3598636 | biostudies-literature
| S-EPMC10316747 | biostudies-literature
| S-EPMC4111482 | biostudies-literature
2014-04-27 | GSE46151 | GEO
| S-EPMC5962546 | biostudies-literature
2014-04-27 | E-GEOD-46151 | biostudies-arrayexpress
| S-EPMC3464110 | biostudies-other
| S-EPMC3399813 | biostudies-literature