Dataset Information

Fast alignment-free sequence comparison using spaced-word frequencies.


ABSTRACT:

MOTIVATION: Alignment-free methods for sequence comparison are increasingly used for genome analysis and phylogeny reconstruction; they circumvent various difficulties of traditional alignment-based approaches. In particular, alignment-free methods are much faster than pairwise or multiple alignments. They are, however, less accurate than methods based on sequence alignment. Most alignment-free approaches work by comparing the word composition of sequences. A well-known problem with these methods is that neighbouring word matches are far from independent.

RESULTS: To reduce the statistical dependency between adjacent word matches, we propose to use 'spaced words', defined by patterns of 'match' and 'don't care' positions, for alignment-free sequence comparison. We describe a fast implementation of this approach using recursive hashing and bit operations, and we show that further improvements can be achieved by using multiple patterns instead of single patterns. To evaluate our approach, we use spaced-word frequencies as a basis for fast phylogeny reconstruction. Using real-world and simulated sequence data, we demonstrate that our multiple-pattern approach produces better phylogenies than approaches relying on contiguous words.

AVAILABILITY AND IMPLEMENTATION: Our program is freely available at http://spaced.gobics.de/.
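As an illustration of the spaced-word idea described in the abstract, the following is a minimal sketch and not the authors' implementation: it uses plain dictionary counting instead of recursive hashing and bit operations. The pattern encoding ('1' for a match position, '0' for a don't-care position), the function names, the Euclidean distance on relative frequencies, and the averaging over multiple patterns are all assumptions made for this example.

```python
from collections import Counter
from math import sqrt


def spaced_words(seq, pattern):
    """Yield the spaced word at every start position of seq.

    pattern is a string of '1' (match) and '0' (don't care); only the
    characters under '1' positions are kept. This encoding is an assumption.
    """
    match_positions = [i for i, c in enumerate(pattern) if c == "1"]
    for start in range(len(seq) - len(pattern) + 1):
        yield "".join(seq[start + i] for i in match_positions)


def spaced_word_frequencies(seq, pattern):
    """Return the relative frequency of each spaced word in seq."""
    counts = Counter(spaced_words(seq, pattern))
    total = sum(counts.values())
    if total == 0:  # sequence shorter than the pattern
        return {}
    return {word: n / total for word, n in counts.items()}


def euclidean_distance(freq_a, freq_b):
    """Euclidean distance between two frequency dictionaries (assumed metric)."""
    words = set(freq_a) | set(freq_b)
    return sqrt(sum((freq_a.get(w, 0.0) - freq_b.get(w, 0.0)) ** 2 for w in words))


def multi_pattern_distance(seq_a, seq_b, patterns):
    """Average the per-pattern distances over several patterns (assumed scheme)."""
    distances = [
        euclidean_distance(
            spaced_word_frequencies(seq_a, p),
            spaced_word_frequencies(seq_b, p),
        )
        for p in patterns
    ]
    return sum(distances) / len(distances)
```

For example, `list(spaced_words("ACGTAC", "101"))` gives `['AG', 'CT', 'GA', 'TC']`: the middle character of each window is ignored. Pairwise values from `multi_pattern_distance` could then be collected into a distance matrix and passed to a distance-based tree-building method.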

SUBMITTER: Leimeister CA 

PROVIDER: S-EPMC4080745 | biostudies-literature | 2014 Jul

REPOSITORIES: biostudies-literature

Publications

Fast alignment-free sequence comparison using spaced-word frequencies.

Leimeister CA, Boden M, Horwege S, Lindner S, Morgenstern B

Bioinformatics (Oxford, England), 2014-04-03, issue 14


Similar Datasets

| S-EPMC5409309 | biostudies-literature
| S-EPMC3799466 | biostudies-literature
| S-EPMC6937637 | biostudies-literature
| S-EPMC6330006 | biostudies-literature
| S-EPMC6659240 | biostudies-literature
| S-EPMC10311327 | biostudies-literature
| S-EPMC3123933 | biostudies-literature
| S-EPMC2818754 | biostudies-literature
| S-EPMC5627421 | biostudies-literature
| S-EPMC3704055 | biostudies-literature