Dataset Information

Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools.


ABSTRACT: This paper introduces sam2bam, a high-throughput software framework that enables users to significantly speed up pre-processing of next-generation sequencing (NGS) data. sam2bam is especially efficient on single-node, multi-core, large-memory systems. On a single-node system, it can reduce the runtime of data pre-processing that marks duplicate reads by 156-186x compared with de facto standard tools. sam2bam consists of parallel software components that can fully utilize multiple processors, available memory, high-bandwidth storage, and, if available, hardware compression accelerators. As a basic feature, sam2bam converts between two well-known genome file formats, from SAM to BAM. Additional features such as analyzing, filtering, and converting input data are provided by plug-in tools (e.g., duplicate marking) that can be attached to sam2bam at runtime. We demonstrated that sam2bam could reduce the runtime of NGS data pre-processing from about two hours to about one minute for a whole-exome data set on a 16-core single-node system using up to 130 GB of memory. On the same system, sam2bam could reduce the runtime from about 20 hours to about nine minutes for a whole-genome sequencing data set using up to 711 GB of memory.
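As a rough check on the end-to-end figures quoted in the abstract, the implied speedups can be worked out directly. This is only an arithmetic sketch: the helper name `speedup` is hypothetical, and the inputs are the rounded "about" values from the abstract, so the results are approximate. They are not the 156-186x range, which the abstract gives separately for duplicate marking against de facto standard tools.

```python
def speedup(before_minutes: float, after_minutes: float) -> float:
    """Return how many times faster the 'after' runtime is than the 'before' runtime."""
    return before_minutes / after_minutes

# Rounded runtimes taken from the abstract (hypothetical variable names).
exome = speedup(before_minutes=2 * 60, after_minutes=1)    # about 2 h -> about 1 min
genome = speedup(before_minutes=20 * 60, after_minutes=9)  # about 20 h -> about 9 min

print(f"whole-exome:  ~{exome:.0f}x")   # ~120x
print(f"whole-genome: ~{genome:.0f}x")  # ~133x
```

Both figures are of the same order as the reported duplicate-marking speedup, but the abstract does not state how the two sets of numbers relate, so no further conclusion should be drawn from this comparison.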

SUBMITTER: Ogasawara T 

PROVIDER: S-EPMC5115855 | biostudies-other | 2016

REPOSITORIES: biostudies-other

Publications

Sam2bam: High-Performance Framework for NGS Data Preprocessing Tools.

Ogasawara T, Cheng Y, Tzeng TK

PLoS ONE, 2016-11-18; 11


Similar Datasets

| S-EPMC6500068 | biostudies-literature
| S-EPMC5896349 | biostudies-literature
| S-EPMC3965850 | biostudies-literature
| S-EPMC7784926 | biostudies-literature
| S-EPMC2944196 | biostudies-literature
| S-EPMC7465801 | biostudies-literature
| S-EPMC3958706 | biostudies-literature
| S-EPMC6954637 | biostudies-literature
| S-EPMC4052814 | biostudies-literature
| S-EPMC6075764 | biostudies-literature