Dataset Information

Fast genotyping of known SNPs through approximate k-mer matching.

ABSTRACT:

Motivation

As the volume of next-generation sequencing (NGS) data increases, faster algorithms become necessary. Although speeding up individual components of a sequence analysis pipeline (e.g. read mapping) can reduce the computational cost of analysis, such approaches do not take full advantage of the particulars of a given problem. One problem of great interest, genotyping a known set of variants (e.g. dbSNP or Affymetrix SNPs), is important for characterization of known genetic traits and causative disease variants within an individual, as well as the initial stage of many ancestral and population genomic pipelines (e.g. GWAS).

Results

We introduce lightweight assignment of variant alleles (LAVA), an NGS-based genotyping algorithm for a given set of SNP loci, which takes advantage of the fact that approximate matching of mid-size k-mers (with k = 32) can typically uniquely identify loci in the human genome without full read alignment. LAVA accurately calls the vast majority of SNPs in dbSNP and Affymetrix's Genome-Wide Human SNP Array 6.0 up to about an order of magnitude faster than standard NGS genotyping pipelines. For Affymetrix SNPs, LAVA has significantly higher SNP calling accuracy than existing pipelines while using as low as ∼5 GB of RAM. As such, LAVA represents a scalable computational method for population-level genotyping studies as well as a flexible NGS-based replacement for SNP arrays.
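The core idea in the Results paragraph, looking up read k-mers in a dictionary of k-mers that overlap known SNP loci instead of aligning whole reads, can be sketched roughly as follows. This is a toy illustration, not the LAVA implementation: the function names, the Hamming-distance-1 neighbor search used as the "approximate" match, and the fraction-based genotype rule are all assumptions made for this example. Only the default k = 32 comes from the abstract.

```python
from collections import defaultdict

BASES = "ACGT"


def build_index(reference, snps, k=32):
    """Map each k-mer overlapping a SNP to (snp_id, allele_label).

    snps: dict of snp_id -> (0-based position, ref_base, alt_base).
    Simplification (assumption): a k-mer shared by two SNPs is overwritten.
    """
    index = {}
    for snp_id, (pos, ref_base, alt_base) in snps.items():
        for label, base in (("ref", ref_base), ("alt", alt_base)):
            seq = reference[:pos] + base + reference[pos + 1:]
            first = max(0, pos - k + 1)
            last = min(pos, len(seq) - k)
            for start in range(first, last + 1):
                index[seq[start:start + k]] = (snp_id, label)
    return index


def neighbors(kmer):
    """Yield every k-mer at Hamming distance exactly 1 from kmer."""
    for i, original in enumerate(kmer):
        for base in BASES:
            if base != original:
                yield kmer[:i] + base + kmer[i + 1:]


def count_alleles(reads, index, k=32):
    """Count ref/alt support per SNP from read k-mers, without alignment."""
    counts = defaultdict(lambda: {"ref": 0, "alt": 0})
    for read in reads:
        for start in range(len(read) - k + 1):
            kmer = read[start:start + k]
            hit = index.get(kmer)  # exact match is preferred
            if hit is None:
                # Fall back to a one-mismatch match; skip ambiguous k-mers.
                hits = {index[n] for n in neighbors(kmer) if n in index}
                if len(hits) != 1:
                    continue
                hit = next(iter(hits))
            snp_id, label = hit
            counts[snp_id][label] += 1
    return counts


def call_genotype(ref_count, alt_count, min_fraction=0.2):
    """Hypothetical threshold rule, not the statistical model of the paper."""
    total = ref_count + alt_count
    if total == 0:
        return "no-call"
    alt_fraction = alt_count / total
    if alt_fraction < min_fraction:
        return "ref/ref"
    if alt_fraction > 1 - min_fraction:
        return "alt/alt"
    return "ref/alt"
```

A caller would build the index once per SNP panel, stream reads through count_alleles, and pass each SNP's two counts to call_genotype. Trying the exact lookup first matters here: the ref and alt k-mers at a locus differ by a single base, so a mismatch-tolerant lookup alone could not tell them apart.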

Availability and implementation

LAVA software is available at http://lava.csail.mit.edu

Contact

bab@mit.edu

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Shajii A

PROVIDER: S-EPMC5013917 | biostudies-literature | 2016 Sep

REPOSITORIES: biostudies-literature

Publications

Fast genotyping of known SNPs through approximate k-mer matching.

Ariya Shajii, Deniz Yorukoglu, Yun William Yu, Bonnie Berger

Bioinformatics (Oxford, England), 2016-09-01, issue 17

PMID: 27587672

Similar Datasets

Columba: fast approximate pattern matching with optimized search schemes.
| S-EPMC12724072 | biostudies-literature

CompAIRR: ultra-fast comparison of adaptive immune receptor repertoires by exact and approximate sequence matching.
| S-EPMC9438946 | biostudies-literature

Fulgor: a fast and compact k-mer index for large-scale matching and color queries.
| S-EPMC10810250 | biostudies-literature

SeArcH schemes for Approximate stRing mAtching.
| S-EPMC11915513 | biostudies-literature

Fast Open Modification Spectral Library Searching through Approximate Nearest Neighbor Indexing.
| S-EPMC6173621 | biostudies-literature

Fulgor: A fast and compact k-mer index for large-scale matching and color queries.
| S-EPMC10197524 | biostudies-literature

Fast open modification spectral library searching through approximate nearest neighbor indexing
2021-05-25 | PXD009861 | Pride

Improved algorithms for approximate string matching (extended abstract).
| S-EPMC2648743 | biostudies-literature

Fast approximate hierarchical clustering using similarity heuristics.
| S-EPMC2561018 | biostudies-literature

SMaSH: Sample matching using SNPs in humans.
| S-EPMC6936078 | biostudies-literature