Dataset Information

TRCMGene: A two-step referential compression method for the efficient storage of genetic data.

ABSTRACT:

Background

The massive quantities of genetic data generated by high-throughput sequencing pose challenges for data storage, transmission and analysis. Data compression addresses these problems by reducing storage size and speeding up data transmission. Several options are available for compressing and storing genetic data. However, most of these options either do not provide sufficient compression rates or require considerable time for decompression and loading.

Results

Here, we propose TRCMGene, a lossless genetic data compression method that uses a referential compression scheme. TRCMGene introduces a novel two-step compression method that builds an index structure using K-means and k-nearest neighbours. Evaluation on several real datasets showed that the compression factor of TRCMGene ranges from 9 to 21. TRCMGene strikes a good balance between compression factor and reading time: on average, reading compressed data takes 60% of the time needed to read uncompressed data. Thus, TRCMGene saves not only disk space but also file access time, and it speeds up data loading. Together, these effects improve genetic data storage and transmission in the current hardware environment and render system upgrades unnecessary. TRCMGene, its user manual and demos are freely available at https://github.com/tangyou79/TRCM. The data mentioned in this manuscript can be downloaded from https://github.com/tangyou79/TRCM/wiki.
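To make the idea of lossless referential compression concrete, the following is a minimal Python sketch. It is not TRCMGene's algorithm: the abstract does not describe the two-step K-means and k-nearest-neighbour index in enough detail to reproduce it, so a plain k-mer hash index is substituted here as an assumption. The function names (`build_index`, `compress`, `decompress`, `compression_factor`), the operation format and the toy sequences are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_index(reference, k):
    """Map every k-mer in the reference to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def compress(target, reference, k=4):
    """Encode target as ('M', ref_pos, length) matches and ('L', text) literals.

    Assumption: greedy longest-match against a k-mer index, a generic
    referential scheme rather than the method used by TRCMGene.
    """
    index = build_index(reference, k)
    ops, literal, i = [], [], 0
    while i < len(target):
        best_pos, best_len = -1, 0
        # Try every reference position that shares the next k-mer.
        for pos in index.get(target[i:i + k], ()):
            length = 0
            while (i + length < len(target)
                   and pos + length < len(reference)
                   and target[i + length] == reference[pos + length]):
                length += 1
            if length > best_len:
                best_pos, best_len = pos, length
        if best_len >= k:
            if literal:
                # Flush pending unmatched characters before the match.
                ops.append(("L", "".join(literal)))
                literal = []
            ops.append(("M", best_pos, best_len))
            i += best_len
        else:
            # No usable match: store the character verbatim.
            literal.append(target[i])
            i += 1
    if literal:
        ops.append(("L", "".join(literal)))
    return ops


def decompress(ops, reference):
    """Rebuild the target exactly from the operations and the reference."""
    parts = []
    for op in ops:
        if op[0] == "M":
            _, pos, length = op
            parts.append(reference[pos:pos + length])
        else:
            parts.append(op[1])
    return "".join(parts)


def compression_factor(original_size, compressed_size):
    """Ratio of original to compressed size (larger is better)."""
    return original_size / compressed_size


reference = "ACGTACGTTAGCCGATAGCT"
target = "ACGTACGTTAGGCGATAGCT"  # differs from the reference by one base
ops = compress(target, reference)
restored = decompress(ops, reference)
```

In this toy case the target is stored as two references into the shared sequence plus a single literal base, and `decompress` restores it exactly, which is what "lossless" means here. As a reading of the reported numbers, a compression factor of 21 would correspond to, for example, a hypothetical 2100 MB file shrinking to 100 MB (`compression_factor(2100, 100)`).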

SUBMITTER: Tang Y

PROVIDER: S-EPMC6218042 | biostudies-literature | 2018

REPOSITORIES: biostudies-literature

Publications

TRCMGene: A two-step referential compression method for the efficient storage of genetic data.

Tang Y, Li M, Sun J, Zhang T, Zhang J, Zheng P

PLoS ONE, 2018-11-05, issue 11

PMID: 30395579

Similar Datasets

HRCM: An Efficient Hybrid Referential Compression Method for Genomic Big Data.
| S-EPMC6930768 | biostudies-literature

Comment on: 'ERGC: an efficient referential genome compression algorithm'.
| S-EPMC4907388 | biostudies-literature

A novel compression tool for efficient storage of genome resequencing data.
| S-EPMC3074166 | biostudies-literature

Efficient storage of high throughput DNA sequencing data using reference-based compression.
| S-EPMC3083090 | biostudies-literature

Efficient genotype compression and analysis of large genetic-variation data sets.
| S-EPMC4697868 | biostudies-literature

Two-Step Freezing Polymerization Method for Efficient Synthesis of High-Performance Stimuli-Responsive Hydrogels.
| S-EPMC7098024 | biostudies-literature

On-Demand Indexing for Referential Compression of DNA Sequences.
| S-EPMC4493149 | biostudies-literature

TsImpute: an accurate two-step imputation method for single-cell RNA-seq data.
| S-EPMC10724850 | biostudies-literature

Highly efficient production of soluble proteins from insoluble inclusion bodies by a two-step-denaturing and refolding method.
| S-EPMC3146519 | biostudies-literature

sTarPicker: a method for efficient prediction of bacterial sRNA targets based on a two-step model for hybridization.
| S-EPMC3142192 | biostudies-literature