Dataset Information

Minimally-overlapping words for sequence similarity search.


ABSTRACT:

Motivation

Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via "seeds": simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence.

Results

Here we study a simple sparse-seeding method: using seeds at positions of certain "words" (e.g. ac, at, gc, or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, occurrences of minimally-overlapping words are anti-clumped, i.e. spaced more evenly than occurrences of words that overlap one another. We provide evidence that this is often superior to acclaimed "minimizer" sparse-seeding methods. Our approach can be unified with the design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it.
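
As a rough illustration of the idea, and not the implementation in the noverlap software, the following Python sketch finds the positions in a sequence where one of a chosen set of words starts, and checks which word pairs can overlap. The function names word_positions, can_overlap and overlapping_pairs are hypothetical, and the sketch assumes all words have the same length.

```python
from itertools import product


def word_positions(seq, words):
    """Return the start positions in seq where any of the given words occurs.

    Assumption for this sketch: all words have the same length.
    """
    seq = seq.lower()
    wordset = {w.lower() for w in words}
    k = len(next(iter(wordset)))
    return [i for i in range(len(seq) - k + 1) if seq[i:i + k] in wordset]


def can_overlap(a, b):
    """True if an occurrence of b can start inside an occurrence of a,
    i.e. some proper suffix of a equals a proper prefix of b."""
    return any(a[-n:] == b[:n] for n in range(1, min(len(a), len(b))))


def overlapping_pairs(words):
    """List every ordered pair of words (including a word with itself)
    whose occurrences can overlap."""
    return [(a, b) for a, b in product(words, repeat=2) if can_overlap(a, b)]


if __name__ == "__main__":
    words = ["ac", "at", "gc", "gt"]       # example word set from the abstract
    print(overlapping_pairs(words))        # no pair can overlap
    print(word_positions("gattaca", words))
```

With the word set ac, at, gc, gt, no word ends with a letter that another word starts with, so overlapping_pairs returns an empty list and two seed positions can never be adjacent. A set such as aa, ac, ca, cc selects the same number of words but allows many overlaps, so its occurrences can form runs of consecutive positions.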

Availability and implementation

Software to design and test minimally-overlapping words is freely available at https://gitlab.com/mcfrith/noverlap.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Frith MC

PROVIDER: S-EPMC8016470 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

S-EPMC4699916 | biostudies-literature
S-EPMC3213098 | biostudies-literature
S-EPMC5666806 | biostudies-literature
S-EPMC2587480 | biostudies-literature
S-EPMC4460465 | biostudies-literature
S-EPMC5274646 | biostudies-literature
S-EPMC8570820 | biostudies-literature
S-EPMC1421445 | biostudies-literature
S-EPMC523596 | biostudies-other
S-EPMC7494772 | biostudies-literature