Dataset Information

Data-dependent bucketing improves reference-free compression of sequencing reads.

ABSTRACT:

Motivation

The storage and transmission of high-throughput sequencing data consume significant resources. As our capacity to produce such data continues to increase, this burden will only grow. One approach to reducing storage and transmission requirements is to compress the sequencing data.

Results

We present a novel technique to boost the compression of sequencing reads, based on the concept of bucketing similar reads so that they appear nearby in the file. We demonstrate that, by adopting a data-dependent bucketing scheme and employing a number of encoding ideas, we can achieve substantially better compression ratios than existing de novo sequence compression tools, including other bucketing and reordering schemes. Our method, Mince, achieves up to a 45% reduction in file sizes (28% on average) compared with existing state-of-the-art de novo compression schemes.
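
The bucketing idea can be illustrated with a minimal sketch. This is not Mince's actual algorithm: it assumes a simple stand-in rule in which each read's bucket label is its lexicographically smallest k-mer, it uses simulated reads, and it relies on general-purpose zlib compression instead of Mince's encoding. The function names (`minimizer`, `bucket_reads`, `simulate_reads`, `compressed_size`) and all parameters are hypothetical. The sketch also discards the original read order.

```python
import random
import zlib
from collections import defaultdict


def minimizer(read: str, k: int = 8) -> str:
    """Return the lexicographically smallest k-mer of a read.

    Used here as a hypothetical bucket label; reads shorter than k
    are labelled by the whole read.
    """
    if len(read) < k:
        return read
    return min(read[i:i + k] for i in range(len(read) - k + 1))


def bucket_reads(reads, k: int = 8):
    """Group reads by bucket label and emit them bucket by bucket."""
    buckets = defaultdict(list)
    for read in reads:
        buckets[minimizer(read, k)].append(read)
    ordered = []
    for label in sorted(buckets):
        # Sorting inside a bucket keeps near-identical reads adjacent.
        ordered.extend(sorted(buckets[label]))
    return ordered


def compressed_size(reads) -> int:
    """Size in bytes of the newline-joined reads after zlib compression."""
    return len(zlib.compress("\n".join(reads).encode("ascii"), 9))


def simulate_reads(n_reads=2000, read_len=50, genome_len=5000, seed=0):
    """Sample error-free reads from a random genome (toy data only)."""
    rng = random.Random(seed)
    genome = "".join(rng.choice("ACGT") for _ in range(genome_len))
    starts = [rng.randrange(genome_len - read_len + 1) for _ in range(n_reads)]
    return [genome[s:s + read_len] for s in starts]


if __name__ == "__main__":
    reads = simulate_reads()
    print("original order:", compressed_size(reads), "bytes")
    print("bucketed order:", compressed_size(bucket_reads(reads)), "bytes")
```

Running the script prints the compressed size of the same set of reads in their original and bucketed orders, so the effect of placing similar reads near each other can be checked on the toy data. The sizes depend on the simulated input and say nothing about the figures reported for Mince.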

Availability and implementation

Mince is written in C++11, is open source and has been made available under the GPLv3 license. It is available at http://www.cs.cmu.edu/~ckingsf/software/mince.

Contact

carlk@cs.cmu.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Patro R

PROVIDER: S-EPMC4547610 | biostudies-literature | 2015 Sep

REPOSITORIES: biostudies-literature

Publications

Data-dependent bucketing improves reference-free compression of sequencing reads.

Rob Patro, Carl Kingsford

Bioinformatics (Oxford, England), 2015-04-24, issue 17

PMID: 25910696

Similar Datasets

Reference-free lossless compression of nanopore sequencing reads using an approximate assembly approach. | S-EPMC9902536 | biostudies-literature

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph. | S-EPMC4570262 | biostudies-literature

Using reference-free compressed data structures to analyze sequencing reads from thousands of human genomes. | S-EPMC5287235 | biostudies-literature

Reference-free Association Mapping from Sequencing Reads Using k-mers. | S-EPMC7842384 | biostudies-literature

The Nubeam reference-free approach to analyze metagenomic sequencing reads. | S-EPMC7545149 | biostudies-literature

HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads. | S-EPMC3932469 | biostudies-literature

Reference-free phylogeny from sequencing data. | S-EPMC10045052 | biostudies-literature

PgRC2: engineering the compression of sequencing reads. | S-EPMC11908645 | biostudies-literature

Efficient storage of high throughput DNA sequencing data using reference-based compression. | S-EPMC3083090 | biostudies-literature

Reads Binning Improves Alignment-Free Metagenome Comparison. | S-EPMC6881972 | biostudies-literature