Dataset Information

Reference-based compression of short-read sequences using path encoding.

ABSTRACT:

Motivation

Storing, transmitting and archiving data produced by next-generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed.

Results

We present here an approach to compression that reduces the difficulty of managing large-scale sequencing data. Our novel approach sits between pure reference-based compression and reference-free compression and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. Our method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system to compactly store sets of kmers that is of independent interest. We are able to encode RNA-seq reads using 3-11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. We also show that even if the reference is very poorly matched to the reads that are being encoded, good compression can still be achieved.
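As a rough illustration of the connection the abstract draws between paths in a de Bruijn graph and context-dependent arithmetic coding, the sketch below treats a read as a walk that starts at its first k-mer and then follows one edge per base. It estimates the ideal code length an arithmetic coder would approach if the probability of each next base were conditioned on the current k-mer and estimated from a reference. This is not the authors' implementation: the function names, the add-one smoothing, and the restriction to the alphabet ACGT are assumptions made for illustration, and no actual arithmetic coder or k-mer set structure is included.

```python
import math
from collections import defaultdict

BASES = "ACGT"  # assumption: sequences contain only these four symbols


def build_context_counts(reference, k):
    """For each k-mer in the reference, count which base follows it."""
    counts = defaultdict(lambda: dict.fromkeys(BASES, 0))
    for i in range(len(reference) - k):
        context = reference[i:i + k]
        counts[context][reference[i + k]] += 1
    return counts


def next_base_probs(counts, context, pseudocount=1.0):
    """Smoothed distribution over the next base given the current k-mer.

    The pseudocount (a hypothetical choice) keeps every base encodable,
    so reads that leave the reference graph still get a finite cost.
    """
    seen = counts.get(context, {})
    total = sum(seen.get(b, 0) for b in BASES) + pseudocount * len(BASES)
    return {b: (seen.get(b, 0) + pseudocount) / total for b in BASES}


def path_code_length_bits(read, counts, k):
    """Ideal cost in bits of the read's path after its first k-mer."""
    bits = 0.0
    for i in range(len(read) - k):
        context = read[i:i + k]
        probs = next_base_probs(counts, context)
        bits -= math.log2(probs[read[i + k]])
    return bits


if __name__ == "__main__":
    ref_counts = build_context_counts("ACGTACGTACGT", 3)
    for read in ("ACGTACG", "AAAAAAA"):
        print(read, round(path_code_length_bits(read, ref_counts, 3), 3))
```

Under these assumptions, a read that follows edges seen in the reference costs fewer bits per base than the 2 bits per base paid for a path through unseen contexts, which is the intuition behind reusing a reference even when it is imperfectly matched to the reads.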

Availability and implementation

Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.

SUBMITTER: Kingsford C

PROVIDER: S-EPMC4481695 | biostudies-literature | 2015 Jun

REPOSITORIES: biostudies-literature

Publications

Reference-based compression of short-read sequences using path encoding.

Carl Kingsford, Rob Patro

Bioinformatics (Oxford, England), 2015-02-02, issue 12

PMID: 25649622

Similar Datasets

Long-read mapping to repetitive reference sequences using Winnowmap2.
| S-EPMC10510034 | biostudies-literature

SRComp: short read sequence compression using burstsort and Elias omega coding.
| S-EPMC3862494 | biostudies-literature

Reference-free validation of short read data.
| S-EPMC2943903 | biostudies-literature

RFPlasmid: predicting plasmid sequences from short-read assembly data using machine learning.
| S-EPMC8743549 | biostudies-literature

ECHO: a reference-free short-read error correction algorithm.
| S-EPMC3129260 | biostudies-literature

Access to ultra-long IgG CDRH3 bovine antibody sequences using short read sequencing technology.
| S-EPMC8508064 | biostudies-literature

Bioinformatics Pipeline for Human Papillomavirus Short Read Genomic Sequences Classification Using Support Vector Machine.
| S-EPMC7412107 | biostudies-literature

Efficient storage of high throughput DNA sequencing data using reference-based compression.
| S-EPMC3083090 | biostudies-literature

Deconvolute individual genomes from metagenome sequences through short read clustering.
| S-EPMC7150542 | biostudies-literature

Targeted DNA-seq and RNA-seq of Reference Samples with Short-read and Long-read Sequencing.
| S-EPMC11329654 | biostudies-literature