Dataset Information

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences.

ABSTRACT:

Background

Counting the k-words shared between two sequences is a simple and efficient alignment-free sequence comparison method. The resulting statistic, D2, has been used for the clustering of EST sequences. Sequence comparison based on D2 is extremely fast: its run time is proportional to the size of the sequences under scrutiny, whereas alignment-based comparisons have a worst-case run time proportional to the square of the size. Recent studies have rigorously examined the statistical distribution of D2 and derived its asymptotic regimes. The distribution of approximate k-word matches has also been studied.
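As an illustration of the statistic described above, here is a minimal Python sketch that assumes D2 is taken as the number of position pairs at which the two sequences carry the same k-word (equivalently, the sum over words of the product of their counts in each sequence). The function names `kword_counts` and `d2` are hypothetical and do not come from the publication. Each sequence is scanned once, so the number of k-word extractions grows linearly with sequence length.

```python
from collections import Counter


def kword_counts(seq, k):
    """Count every overlapping k-word in seq (empty if seq is shorter than k)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def d2(seq_a, seq_b, k):
    """Number of exact k-word matches between seq_a and seq_b.

    Assumed definition: sum over words w of count_a(w) * count_b(w).
    """
    counts_a = kword_counts(seq_a, k)
    counts_b = kword_counts(seq_b, k)
    # Only words present in both sequences contribute to the sum.
    return sum(n * counts_b[w] for w, n in counts_a.items() if w in counts_b)
```

For example, `d2("ACGTACGT", "ACGT", 2)` returns 6: the words AC, CG and GT each occur twice in the first sequence and once in the second.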

Results

We have computed the D2 optimal word size for various sequence lengths, and for both perfect and approximate word matches. Kolmogorov-Smirnov tests show D2 to have a compound Poisson distribution at the optimal word size for small sequence lengths (below 400 letters) and a normal distribution at the optimal word size for large sequence lengths (above 1600 letters). We find that the D2 statistic outperforms BLAST in the comparison of artificially evolved sequences, and performs similarly to other methods based on exact word matches. These results, obtained with randomly generated sequences, also hold for sequences derived from human genomic DNA.
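To make the distinction between perfect and approximate word matches concrete, the sketch below counts pairs of k-words that differ in at most `t` positions. It assumes that "approximate" means a Hamming-distance threshold, which the abstract does not spell out. The names `hamming` and `d2_approx` are hypothetical, and this brute-force version compares every pair of positions, so it is quadratic in sequence length, unlike the exact-match count.

```python
def hamming(u, v):
    """Number of positions at which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))


def d2_approx(seq_a, seq_b, k, t):
    """Count k-word pairs (one word from each sequence) with at most t mismatches.

    Assumption: approximate matching is defined by Hamming distance <= t.
    With t = 0 this reduces to the exact k-word match count.
    """
    total = 0
    for i in range(len(seq_a) - k + 1):
        word_a = seq_a[i:i + k]
        for j in range(len(seq_b) - k + 1):
            if hamming(word_a, seq_b[j:j + k]) <= t:
                total += 1
    return total
```

With `seq_a = "ACGT"`, `seq_b = "ACGA"` and `k = 2`, the count is 2 for `t = 0` and 3 for `t = 1`, because GT and GA differ at a single position.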

Conclusion

We have characterized the distribution of the D2 statistic at optimal word sizes. We find that the best trade-off between computational efficiency and accuracy is obtained with exact word matches. Given that our numerical tests have not included sequence shuffling, transposition or splicing, the improvements over existing methods reported here underestimate those expected with real sequences. Because of their linear run time and known normal asymptotic behavior, D2-based methods are most appropriate for large genomic sequences.

SUBMITTER: Foret S

PROVIDER: S-EPMC1764478 | biostudies-literature | 2006 Dec

REPOSITORIES: biostudies-literature

Publications

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences.

Forêt S, Kantorovitz MR, Burden CJ

BMC Bioinformatics, 2006-12-18

PMID: 17254306

Similar Datasets

Annotating large genomes with exact word matches.
| S-EPMC403711 | biostudies-literature

Pairwise alignment of nucleotide sequences using maximal exact matches.
| S-EPMC6528274 | biostudies-literature

The distribution of word matches between Markovian sequences with periodic boundary conditions.
| S-EPMC3880068 | biostudies-literature

Finding Maximal Exact Matches Using the r-Index.
| S-EPMC8902461 | biostudies-literature

Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points.
| S-EPMC6330006 | biostudies-literature

MONI: A Pangenomic Index for Finding Maximal Exact Matches.
| S-EPMC8892979 | biostudies-literature

Lightweight comparison of RNAs based on exact sequence-structure matches.
| S-EPMC2722993 | biostudies-literature

STing: accurate and ultrafast genomic profiling with exact sequence matches.
| S-EPMC7430640 | biostudies-literature

Dynamic correlations: exact and approximate methods for mutual information.
| S-EPMC10898342 | biostudies-literature

Dynamics of pre- and post-choice behaviour: rats approximate optimal strategy in a discrete-trial decision task.
| S-EPMC4345461 | biostudies-literature