
Dataset Information


Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments.


ABSTRACT: Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores.
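The pipeline described in the abstract (k-mer sets, min-hash estimates of Jaccard similarity, and a singular value decomposition of a min-hash collision matrix) can be illustrated with a minimal Python sketch. This is not the authors' implementation. All function names here are hypothetical, and the layout of the collision matrix (one row per compared read, one column per hash function) and the use of the leading left singular vector as the score are assumptions, since the abstract does not specify these details.

```python
import hashlib

import numpy as np


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity |A & B| / |A | B| of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hash_kmer(kmer, seed):
    """Deterministic 64-bit hash of a k-mer; the seed selects the hash function."""
    digest = hashlib.blake2b(
        kmer.encode(), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(kmer_set, num_hashes):
    """One min-hash value per seeded hash function."""
    return [min(hash_kmer(km, s) for km in kmer_set) for s in range(num_hashes)]


def collision_matrix(ref_sig, other_sigs):
    """Binary matrix (assumed layout): entry (r, j) is 1 when read r and the
    reference read have the same min-hash under hash function j."""
    return np.array(
        [[float(a == b) for a, b in zip(ref_sig, sig)] for sig in other_sigs]
    )


def minhash_jaccard(collisions):
    """Standard min-hash estimate of Jaccard similarity: the fraction of
    hash functions on which each read collides with the reference."""
    return collisions.mean(axis=1)


def spectral_scores(collisions):
    """Assumed spectral score: absolute value of the leading left singular
    vector of the collision matrix (one entry per compared read)."""
    u, _, _ = np.linalg.svd(collisions, full_matrices=False)
    return np.abs(u[:, 0])


if __name__ == "__main__":
    reads = ["ACGTACGGTCAGT", "ACGTACGGTCTGA", "TTGACCATGGCAA"]
    sets = [kmers(r, 4) for r in reads]
    sigs = [minhash_signature(s, 64) for s in sets]
    C = collision_matrix(sigs[0], sigs[1:])
    print("exact Jaccard:   ", [jaccard(sets[0], s) for s in sets[1:]])
    print("min-hash estimate:", minhash_jaccard(C))
    print("spectral scores:  ", spectral_scores(C))
```

The row mean weights every hash function equally, whereas a rank-one decomposition also produces a per-column factor, which is one way a score could discount hash functions that collide for many reads (for example, those landing on repeated k-mers). Whether this matches the paper's estimator in detail is not something the abstract confirms.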

SUBMITTER: Baharav TZ 

PROVIDER: S-EPMC7660437 | biostudies-literature | 2020 Sep

REPOSITORIES: biostudies-literature


Publications

Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments.

Tavor Z. Baharav, Govinda M. Kamath, David N. Tse, Ilan Shomorony

Patterns (New York, N.Y.), 2020-07-31, issue 6



Similar Datasets

| S-EPMC2850363 | biostudies-literature
| S-EPMC11838578 | biostudies-literature
| S-EPMC8087094 | biostudies-literature
| S-EPMC4748600 | biostudies-literature
| S-EPMC3338010 | biostudies-literature
| S-EPMC3289921 | biostudies-literature
| S-EPMC11398343 | biostudies-literature
| S-EPMC4175719 | biostudies-literature
| S-EPMC4723085 | biostudies-literature
| S-EPMC1131888 | biostudies-literature