
Dataset Information


CALQ: compression of quality values of aligned sequencing data.


ABSTRACT: Motivation: Recent advancements in high-throughput sequencing technology have led to a rapid growth of genomic data. Several lossless compression schemes have been proposed for the coding of such data present in the form of raw FASTQ files and aligned SAM/BAM files. However, due to their high entropy, losslessly compressed quality values account for about 80% of the size of compressed files. For the quality values, we present a novel lossy compression scheme named CALQ. By controlling the coarseness of quality value quantization with a statistical genotyping model, we minimize the impact of the introduced distortion on downstream analyses. Results: We analyze the performance of several lossy compressors for quality values in terms of the trade-off between the achieved compressed size (in bits per quality value) and the Precision and Recall achieved after running a variant calling pipeline over sequencing data of the well-known NA12878 individual. By compressing and reconstructing quality values with CALQ, we observe a better average variant calling performance than with the original data while achieving a size reduction of about one order of magnitude with respect to the state-of-the-art lossless compressors. Furthermore, we show that CALQ performs as well as or better than the state-of-the-art lossy compressors in terms of variant calling Recall and Precision for most of the analyzed datasets. Availability and implementation: CALQ is written in C++ and can be downloaded from https://github.com/voges/calq. Contact: voges@tnt.uni-hannover.de or mhernaez@illinois.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
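
ILLUSTRATION: The idea of controlling quantization coarseness per position can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not CALQ's actual implementation: the function names, the quality range 0..41, the bounds of 2 and 8 levels, and the `confidence` input (a stand-in for the output of a genotyping model) are all hypothetical. The mapping "more confident genotype, coarser quantizer" is likewise an assumption made for illustration.

```python
def quantize(q, levels, q_min=0, q_max=41):
    """Uniform scalar quantizer over the integer range [q_min, q_max].

    Maps q to a representative value near the center of one of
    `levels` equal-width bins. Fewer levels means coarser quantization.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    width = (q_max - q_min + 1) / levels
    index = min(levels - 1, int((q - q_min) / width))
    return q_min + int((index + 0.5) * width)


def levels_for_confidence(confidence, min_levels=2, max_levels=8):
    """Hypothetical rule: confident positions get fewer levels.

    `confidence` is assumed to lie in [0, 1] and is clamped to that range.
    """
    c = min(max(confidence, 0.0), 1.0)
    return max_levels - round(c * (max_levels - min_levels))


def quantize_read(qualities, confidences):
    """Quantize each quality value with a position-dependent level count."""
    return [
        quantize(q, levels_for_confidence(c))
        for q, c in zip(qualities, confidences)
    ]
```

With 42 levels over the range 0..41 the quantizer is the identity, so the level count alone sets how much distortion is introduced; fewer distinct output values also lower the entropy that a subsequent entropy coder has to spend bits on.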

SUBMITTER: Voges J 

PROVIDER: S-EPMC5946873 | biostudies-literature | 2018 May

REPOSITORIES: biostudies-literature


Publications

CALQ: compression of quality values of aligned sequencing data.

Jan Voges, Jörn Ostermann, Mikel Hernaez

Bioinformatics (Oxford, England) 20180501 10



Similar Datasets

| S-EPMC3592443 | biostudies-literature
| S-EPMC5568552 | biostudies-literature
| S-EPMC5666573 | biostudies-literature
| S-EPMC5856090 | biostudies-other
| S-EPMC3868316 | biostudies-literature
| S-EPMC6330002 | biostudies-literature
| S-EPMC3832420 | biostudies-literature
| S-EPMC6969201 | biostudies-literature
| S-EPMC3606433 | biostudies-literature
| S-EPMC3932469 | biostudies-literature