Dataset Information

Aligner optimization increases accuracy and decreases compute times in multi-species sequence data.

ABSTRACT: As sequencing technologies have evolved, the tools to analyze these sequences have made similar advances. However, for multi-species samples, we observed important and adverse differences in alignment specificity and computation time for bwa-mem (Burrows-Wheeler aligner-maximum exact matches) relative to bwa-aln. Therefore, we sought to optimize bwa-mem for alignment of data from multi-species samples in order to reduce alignment time and increase the specificity of alignments. In the multi-species cases examined, there was one majority member (i.e. Plasmodium falciparum or Brugia malayi) and one minority member (i.e. human or the Wolbachia endosymbiont wBm) of the sequence data. Increasing bwa-mem seed length from the default value reduced the number of read pairs from the majority sequence member that incorrectly aligned to the reference genome of the minority sequence member. Combining both source genomes into a single reference genome increased the specificity of mapping, while also reducing the central processing unit (CPU) time. In Plasmodium, at a seed length of 18 nt, 24.1 % of reads mapped to the human genome using 1.7±0.1 CPU hours, while 83.6 % of reads mapped to the Plasmodium genome using 0.2±0.0 CPU hours (total: 107.7 % of reads mapping, in 1.9±0.1 CPU hours). In contrast, 97.1 % of the reads mapped to a combined Plasmodium-human reference in only 0.7±0.0 CPU hours. Overall, the results suggest that combining all references into a single reference database and using a 23 nt seed length reduces the computational time, while maximizing specificity. Similar results were found for simulated sequence reads from a mock metagenomic data set. We found similar improvements to computation time in a publicly available human-only data set.
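The recommendation in the abstract (one combined reference, 23 nt seed length) can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the file names, the header-prefixing scheme, and the helper names `combine_references`, `bwa_commands`, and `summed_percent` are assumptions made for this example. The only bwa options used are `bwa index` and `bwa mem` with `-k` (minimum seed length) and `-t` (threads); the sketch only builds the command lines and does not run bwa.

```python
import shlex
from pathlib import Path


def combine_references(fasta_paths, out_path):
    """Concatenate FASTA files into one reference.

    Each sequence name is prefixed with the stem of its source file
    (an assumed convention) so alignments can be traced back to a genome.
    """
    out_path = Path(out_path)
    with out_path.open("w") as out:
        for path in map(Path, fasta_paths):
            prefix = path.stem
            with path.open() as handle:
                for line in handle:
                    if line.startswith(">"):
                        out.write(f">{prefix}|{line[1:].rstrip()}\n")
                    elif line.strip():
                        out.write(line.rstrip() + "\n")
    return out_path


def bwa_commands(reference, reads_1, reads_2, seed_length=23, threads=4):
    """Build (but do not run) the bwa index and bwa mem command lines."""
    index_cmd = ["bwa", "index", str(reference)]
    mem_cmd = [
        "bwa", "mem",
        "-k", str(seed_length),  # minimum seed length
        "-t", str(threads),      # number of threads
        str(reference), str(reads_1), str(reads_2),
    ]
    return index_cmd, mem_cmd


def summed_percent(values):
    """Sum per-reference mapping percentages, rounded to one decimal."""
    return round(sum(values), 1)


if __name__ == "__main__":
    # Hypothetical file names, used only to show the command shape.
    index_cmd, mem_cmd = bwa_commands(
        "combined.fa", "reads_1.fastq", "reads_2.fastq"
    )
    print(shlex.join(index_cmd))
    print(shlex.join(mem_cmd))
    # Separate references at 18 nt, using the figures from the abstract:
    print(summed_percent([24.1, 83.6]))
```

With separate references the two mapping percentages from the abstract sum to more than 100 %, which means some reads aligned to both genomes; aligning once against the combined file gives each read a single set of candidate locations across both genomes.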

SUBMITTER: Robinson KM

PROVIDER: S-EPMC5643015 | biostudies-literature | 2017 Sep

REPOSITORIES: biostudies-literature

Publications

Aligner optimization increases accuracy and decreases compute times in multi-species sequence data.

Robinson KM, Hawkins AS, Santana-Cruz I, Adkins RS, Shetty AC, Nagaraj S, Sadzewicz L, Tallon LJ, Rasko DA, Fraser CM, Mahurkar A, Silva JC, Dunning Hotopp JC

Microbial Genomics, 2017-07-08, issue 9

PMID: 29114401

Similar Datasets

TM-Aligner: Multiple sequence alignment tool for transmembrane proteins with reduced time and improved accuracy. | S-EPMC5624947 | biostudies-literature

Ferroelectric compute-in-memory annealer for combinatorial optimization problems. | S-EPMC10948773 | biostudies-literature

Short Sequence Aligner Benchmarking for Chromatin Research. | S-EPMC10531285 | biostudies-literature

A multi-sample approach increases the accuracy of transcript assembly. | S-EPMC6825223 | biostudies-literature

Species abundance information improves sequence taxonomy classification accuracy. | S-EPMC6789115 | biostudies-literature

MetaLogo: a heterogeneity-aware sequence logo generator and aligner. | S-EPMC8921662 | biostudies-literature

Tree species richness decreases while species evenness increases with disturbance frequency in a natural boreal forest landscape. | S-EPMC4739566 | biostudies-other

Regional occupancy increases for widespread species but decreases for narrowly distributed species in metacommunity time series. | S-EPMC10020147 | biostudies-literature

Impact of Aligner, Normalization Method, and Sequencing Depth on TempO-seq Accuracy. | S-EPMC9067045 | biostudies-literature

A high-throughput DNA sequence aligner for microbial ecology studies. | S-EPMC2788221 | biostudies-literature