Dataset Information

SparkGC: Spark based genome compression for large collections of genomes.

ABSTRACT: Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One consequence is that it has become extremely difficult to store, back up, and migrate the enormous amount of genomic data, which continues to expand as the cost of sequencing decreases. Hence, a much more efficient and scalable genome compression program is urgently required. In this manuscript, we propose a new Apache Spark based genome compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark's in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is at least 30% better than that of the best state-of-the-art methods. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on only one worker node and scales quite well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC.

SUBMITTER: Yao H

PROVIDER: S-EPMC9310413 | biostudies-literature | 2022 Jul

REPOSITORIES: biostudies-literature

Publications

SparkGC: Spark based genome compression for large collections of genomes.

Yao Haichang, Hu Guangyong, Liu Shangdong, Fang Houzhi, Ji Yimu

BMC Bioinformatics, 2022-07-25 (1)

PMID: 35879669

Similar Datasets

Project description:
Background: To what degree a string of symbols can be compressed reveals important details about its complexity. For instance, strings that are not compressible are random and carry a low information potential, while the opposite is true for highly compressible strings. We explore to what extent microbial genomes are amenable to compression, as they vary considerably with respect to both size and base composition. For instance, microbial genome sizes vary from less than 100,000 base pairs in symbionts to more than 10 million in soil-dwellers. Genomic base composition, often summarized as genomic AT or GC content due to the similar frequencies of adenine and thymine on one hand and cytosine and guanine on the other, also varies substantially; the most extreme microbes can have genomes with AT content below 25% or above 85%. Base composition determines the frequency of DNA words, consisting of multiple nucleotides or oligonucleotides, and may therefore also influence compressibility. Using 4,713 RefSeq genomes, we examined the association between compressibility, measured with both a DNA-based (MBGC) and a general-purpose (ZPAQ) compression algorithm, and genome size, AT content, and genomic oligonucleotide usage variance (OUV) using generalized additive models.
Results: We find that genome size (p < 0.001) and OUV (p < 0.001) are both strongly associated with genome redundancy for both types of file compressor. The DNA-based MBGC compressor improved compression by approximately 3% on average with respect to ZPAQ. Moreover, MBGC detected a significant (p < 0.001) compression ratio difference between AT-poor and AT-rich genomes which was not detected with ZPAQ.
Conclusion: As lack of compressibility is equivalent to randomness, our findings suggest that smaller and AT-rich genomes may have accumulated more random mutations on average than larger and AT-poor genomes which, in turn, were significantly more redundant. Moreover, we find that OUV is a strong proxy for genome compressibility in microbial genomes. The ZPAQ compressor was found to agree with the MBGC compressor, albeit with poorer performance, except for the compressibility of AT-rich and AT-poor/GC-rich genomes.
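
The link between compressibility and randomness described above can be illustrated with a minimal sketch. It is not the study's method: it uses Python's standard zlib as a stand-in for the MBGC and ZPAQ compressors, synthetic sequences rather than RefSeq genomes, and the helper names `at_content` and `compressed_fraction` are hypothetical.

```python
import random
import zlib


def at_content(seq: str) -> float:
    """Fraction of bases in seq that are A or T."""
    return sum(base in "AT" for base in seq) / len(seq)


def compressed_fraction(seq: str) -> float:
    """Compressed size divided by original size (lower means more redundant)."""
    raw = seq.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)


rng = random.Random(0)  # fixed seed so repeated runs give the same sequences
n = 100_000

# Synthetic stand-ins: one sequence with uniformly random bases,
# and one made of a single 8-base motif repeated end to end.
random_seq = "".join(rng.choice("ACGT") for _ in range(n))
repetitive_seq = "ACGTTGCA" * (n // 8)

for name, seq in [("random", random_seq), ("repetitive", repetitive_seq)]:
    print(name, "AT content:", round(at_content(seq), 3),
          "compressed fraction:", round(compressed_fraction(seq), 3))
```

The repetitive sequence should shrink to a far smaller fraction of its original size than the random one, which is the sense in which an incompressible string is treated as random.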

Project description: This study details a workflow used to accession a large stonefly (Plecoptera) collection resulting from several donations. The eastern North American material of Kenneth W. Stewart (deceased, University of North Texas), the entire collection of Stanley W. Szczytko (deceased, University of Wisconsin, Stevens Point), and a small portion of the Barry C. Poulton collection (active, United States Geological Survey, Columbia, Missouri) were donated to the Illinois Natural History Survey in 2013. These 5,767 vials of specimens were processed to help preserve the specimen legacy of these world-renowned Plecoptera researchers. The workflow used an industrialized approach to organize the specimens taxonomically, image the specimens and labels, and place the specimens into new storage. Using the images as a verbatim data source, we transcribed labels in iterative steps that yielded more information with each pass. The data were normalized, locations georeferenced, all specimen data formatted to meet the Darwin Core Archive format for occurrence data, and a data set created using Pensoft's Integrated Publishing Toolkit. This is the first time that any of the specimen data have been made available electronically. We also provide two important electronic supplements: the Bill P. Stark (active, Mississippi College) Oklahoma field notebook for 1971 and 1972, detailing locations for many coded stonefly specimens in the Stewart collection, and the coded locations of B. C. Poulton's Arkansas and Missouri study. Again, we have linked coded labels in vials to normalized and georeferenced site data. We confirmed that 243 stonefly species were contained within the collections, and the potential for many more species exists among the specimens identified to family and genus level. Twenty-one new state, province, and other significant stonefly records are reported herein, with all identifications verified by the senior author, often through consultation with other stonefly taxonomists. Researchers are encouraged to use the specimen data, form collaborations with the authors, and borrow specimens for research.
