Dataset Information

SAMBLASTER: fast duplicate marking and structural variant read extraction.

ABSTRACT:

Motivation

Illumina DNA sequencing is now the predominant source of raw genomic data, and data volumes are growing rapidly. Bioinformatic analysis pipelines are having trouble keeping pace. A common bottleneck in such pipelines is the requirement to read, write, sort and compress large BAM files multiple times.

Results

We present SAMBLASTER, a tool that reduces the number of times such costly operations are performed. SAMBLASTER is designed to mark duplicates in read-sorted SAM files as a piped post-pass on DNA aligner output before it is compressed to BAM. In addition, it can simultaneously output into separate files the discordant read-pairs and/or split-read mappings used for structural variant calling. As an alignment post-pass, its own runtime overhead is negligible, while dramatically reducing overall pipeline complexity and runtime. As a stand-alone duplicate marking tool, it performs significantly better than PICARD or SAMBAMBA in terms of both speed and memory usage, while achieving nearly identical results.

Availability and implementation

SAMBLASTER is open-source C+ + code and freely available for download from https://github.com/GregoryFaust/samblaster.

SUBMITTER: Faust GG

PROVIDER: S-EPMC4147885 | biostudies-literature | 2014 Sep

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

SAMBLASTER: fast duplicate marking and structural variant read extraction.

Faust Gregory G GG Hall Ira M IM

Bioinformatics (Oxford, England) 20140507 17

<h4>Motivation</h4>Illumina DNA sequencing is now the predominant source of raw genomic data, and data volumes are growing rapidly. Bioinformatic analysis pipelines are having trouble keeping pace. A common bottleneck in such pipelines is the requirement to read, write, sort and compress large BAM files multiple times.<h4>Results</h4>We present SAMBLASTER, a tool that reduces the number of times such costly operations are performed. SAMBLASTER is designed to mark duplicates in read-sorted SAM fi ...[more]

PMID: 24812344

Similar Datasets

Project description:BackgroundStructural variant (SV) calling from DNA sequencing data has been challenging due to several factors, including the ambiguity of short-read alignments, multiple complex SVs in the same genomic region, and the lack of "truth" datasets for benchmarking. Additionally, caller choice, parameter settings, and alignment method are known to affect SV calling. However, the impact of FASTQ read order on SV calling has not been explored for long-read data.ResultsHere, we used PacBio DNA sequencing data from 15 Caenorhabditis elegans strains and four Arabidopsis thaliana ecotypes to evaluate the sensitivity of different SV callers on FASTQ read order. Comparisons of variant call format files generated from the original and permutated FASTQ files demonstrated that the order of input data affected the SVs predicted by each caller. In particular, pbsv was highly sensitive to the order of the input data, especially at the highest depths where over 70% of the SV calls generated from pairs of differently ordered FASTQ files were in disagreement. These demonstrate that read order sensitivity is a complex, multifactorial process, as the differences observed both within and between species varied considerably according to the specific combination of aligner, SV caller, and sequencing depth. In addition to the SV callers being sensitive to the input data order, the SAMtools alignment sorting algorithm was identified as a source of variability following read order randomization.ConclusionThe results of this study highlight the sensitivity of SV calling on the order of reads encoded in FASTQ files, which has not been recognized in long-read approaches. These findings have implications for the replication of SV studies and the development of consistent SV calling protocols. Our study suggests that researchers should pay attention to the input order sensitivity of read alignment sorting methods when analyzing long-read sequencing data for SV calling, as mitigating a source of variability could facilitate future replication work. These results also raise important questions surrounding the relationship between SV caller read order sensitivity and tool performance. Therefore, tool developers should also consider input order sensitivity as a potential source of variability during the development and benchmarking of new and improved methods for SV calling.

Dataset Information

SAMBLASTER: fast duplicate marking and structural variant read extraction.

Motivation

Results

Availability and implementation

Publications

SAMBLASTER: fast duplicate marking and structural variant read extraction.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets