Dataset Information

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data.

ABSTRACT: Next-generation sequencing (NGS) technologies have led to a huge amount of genomic data that need to be analyzed and interpreted. This has a major impact on the DNA sequence alignment process, which nowadays requires the mapping of billions of small DNA sequences onto a reference genome. As a result, sequence alignment remains the most time-consuming stage in the sequence analysis workflow. To deal with this issue, state-of-the-art aligners take advantage of parallelization strategies. However, the existing solutions show limited scalability and have complex implementations. In this work we introduce SparkBWA, a new tool that exploits the capabilities of a big data technology such as Spark to boost the performance of one of the most widely adopted aligners, the Burrows-Wheeler Aligner (BWA). The design of SparkBWA uses two independent software layers in such a way that no modifications to the original BWA source code are required, which ensures its compatibility with any BWA version (future or legacy). SparkBWA is evaluated in different scenarios, showing noticeable results in terms of performance and scalability. A comparison with other parallel BWA-based aligners validates the benefits of our approach. Finally, an intuitive and flexible API is provided to NGS professionals in order to facilitate the acceptance and adoption of the new tool. The source code of the software described in this paper is publicly available at https://github.com/citiususc/SparkBWA, under a GPL3 license.

SUBMITTER: Abuin JM

PROVIDER: S-EPMC4868289 | biostudies-literature | 2016

REPOSITORIES: biostudies-literature


Publications

SparkBWA: Speeding Up the Alignment of High-Throughput DNA Sequencing Data.

José M. Abuín, Juan C. Pichel, Tomás F. Pena, Jorge Amigo

PLoS ONE, 2016-05-16, issue 5


PMID: 27182962

Similar Datasets

SparkEC: speeding up alignment-based DNA error correction tools.
| S-EPMC9639292 | biostudies-literature

RAMICS: trainable, high-speed and biologically relevant alignment of high-throughput sequencing reads to coding DNA.
| S-EPMC4117746 | biostudies-literature

Denoising DNA deep sequencing data-high-throughput sequencing errors and their correction.
| S-EPMC4719071 | biostudies-literature

DiNAMO: highly sensitive DNA motif discovery in high-throughput sequencing data.
| S-EPMC5996464 | biostudies-literature

THetA: inferring intra-tumor heterogeneity from high-throughput DNA sequencing data.
| S-EPMC4054893 | biostudies-literature

MUSSELS: Speeding up the detection of invasive aquatic species using environmental DNA and nanopore sequencing
| S-BSST391 | biostudies-other

Efficient storage of high throughput DNA sequencing data using reference-based compression.
| S-EPMC3083090 | biostudies-literature

Substantial biases in ultra-short read data sets from high-throughput DNA sequencing.
| S-EPMC2532726 | biostudies-literature

High-Throughput DNA sequencing of ancient wood.
| S-EPMC5896730 | biostudies-literature

Compression of structured high-throughput sequencing data.
| S-EPMC3832420 | biostudies-literature