Dataset Information

HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads.


ABSTRACT: Short-read sequencing is becoming the standard of practice for the study of structural variants associated with disease. However, with the growth of sequence data largely surpassing reasonable storage capability, the biomedical community is challenged with the management, transfer, archiving, and storage of sequence data. We developed Hierarchical mUlti-reference Genome cOmpression (HUGO), a novel compression algorithm for aligned reads in the sorted Sequence Alignment/Map (SAM) format. We first aligned short reads against a reference genome and stored exactly mapped reads for compression. For the inexactly mapped or unmapped reads, we realigned them against different reference genomes using an adaptive scheme that gradually shortens the read length. For the base quality values, we offer lossy and lossless compression mechanisms. The lossy mechanism uses k-means clustering, where a user can adjust the balance between decompression quality and compression rate. Lossless compression is obtained by setting k (the number of clusters) to the number of different quality values. The proposed method produced a compression ratio in the range 0.5-0.65, which corresponds to 35-50% storage savings based on experimental datasets. The proposed approach achieved 15% more storage savings than CRAM and a compression ratio comparable to that of Samcomp (CRAM and Samcomp are two of the state-of-the-art genome compression algorithms). The software is freely available at https://sourceforge.net/projects/hierachicaldnac/ under the General Public License (GPL). Our method requires having different reference genomes and prolongs the execution time for additional alignments. The proposed multi-reference-based compression algorithm for aligned reads outperforms existing single-reference-based algorithms.
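The quality-value step described in the abstract can be illustrated with a minimal sketch. This is not HUGO's implementation: the function names `kmeans_1d` and `quantize`, the evenly spaced centroid initialization, the iteration cap, and the Phred+33 decoding of the sample string are all assumptions made for illustration. It only shows the general idea that replacing each quality score with its cluster centroid is lossy for small k and becomes lossless once k equals the number of distinct quality values.

```python
from collections import Counter


def kmeans_1d(values, k, iterations=50):
    """Plain Lloyd's k-means on integer quality scores; returns sorted centroids.

    Hypothetical helper, not taken from the HUGO source code.
    """
    counts = Counter(values)
    distinct = sorted(counts)
    n = len(distinct)
    if k >= n:
        # One cluster per distinct value: quantization becomes the identity.
        return [float(v) for v in distinct]
    # Assumption: start from centroids spread evenly over the distinct values.
    centroids = [float(distinct[(2 * i + 1) * n // (2 * k)]) for i in range(k)]
    for _ in range(iterations):
        sums = [0.0] * k
        weights = [0] * k
        for value, count in counts.items():
            # Assign each distinct score to its nearest centroid.
            j = min(range(k), key=lambda idx: abs(value - centroids[idx]))
            sums[j] += value * count
            weights[j] += count
        # Move each centroid to the weighted mean of its cluster.
        updated = [
            sums[j] / weights[j] if weights[j] else centroids[j]
            for j in range(k)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids)


def quantize(values, centroids):
    """Replace every score with its nearest centroid, rounded to an integer."""
    return [round(min(centroids, key=lambda c: abs(v - c))) for v in values]


# Example quality string, decoded under an assumed Phred+33 encoding.
quals = [ord(ch) - 33 for ch in "IIIIHHG@@?5555+"]

lossy = quantize(quals, kmeans_1d(quals, 2))
lossless = quantize(quals, kmeans_1d(quals, len(set(quals))))
```

With a small k the quantized track contains few distinct symbols, which a downstream entropy coder can store more compactly at the cost of precision; raising k toward the number of distinct scores trades compression for fidelity.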

SUBMITTER: Li P 

PROVIDER: S-EPMC3932469 | biostudies-literature | 2014 Mar-Apr

REPOSITORIES: biostudies-literature

Publications

HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads.

Pinghao Li, Xiaoqian Jiang, Shuang Wang, Jihoon Kim, Hongkai Xiong, Lucila Ohno-Machado

Journal of the American Medical Informatics Association : JAMIA, 2013-12-24, issue 2



Similar Datasets

| S-EPMC4547610 | biostudies-literature
| S-EPMC6440077 | biostudies-literature
| S-EPMC5870704 | biostudies-literature
| S-EPMC5666573 | biostudies-literature
2015-10-10 | GSE71191 | GEO
| S-EPMC7459848 | biostudies-literature
2017-07-30 | GSE101815 | GEO
| S-EPMC5946873 | biostudies-literature
| S-EPMC3624798 | biostudies-literature
| S-EPMC3592443 | biostudies-literature