Dataset Information

The distribution of word matches between Markovian sequences with periodic boundary conditions.

ABSTRACT: Word match counts have traditionally been proposed as an alignment-free measure of similarity for biological sequences. The D(2) statistic, which simply counts the number of exact word matches between two sequences, is a useful test bed for developing rigorous mathematical results, which can then be extended to more biologically useful measures. The distributional properties of the D(2) statistic under the null hypothesis of identically and independently distributed letters have been studied extensively, but no comprehensive study of the D(2) distribution for biologically more realistic higher-order Markovian sequences exists. Here we derive exact formulas for the mean and variance of the D(2) statistic for Markovian sequences of any order, and demonstrate through Monte Carlo simulations that the entire distribution is accurately characterized by a Pólya-Aeppli distribution for sequence lengths of biological interest. The approach is novel in that Markovian dependency is defined for sequences with periodic boundary conditions, and this enables exact analytic formulas for the mean and variance to be derived. We also carry out a preliminary comparison between the approximate D(2) distribution computed with the theoretical mean and variance under a Markovian hypothesis and an empirical D(2) distribution from the human genome.
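As a minimal sketch of the quantity described in the abstract, the following Python counts exact k-word matches between two sequences, reading each sequence with periodic (wrap-around) boundaries so that a sequence of length n contributes exactly n words. The function names `circular_words` and `d2_periodic` and the parameter `k` (word length) are illustrative assumptions and are not taken from the paper.

```python
from collections import Counter


def circular_words(seq, k):
    """Return the k-words of seq, reading it as a circular sequence."""
    n = len(seq)
    if not 1 <= k <= n:
        raise ValueError("word length k must satisfy 1 <= k <= len(seq)")
    # Append the first k-1 letters so that words starting near the end wrap around.
    extended = seq + seq[:k - 1]
    return [extended[i:i + k] for i in range(n)]


def d2_periodic(seq_a, seq_b, k):
    """Count exact k-word matches between two circular sequences.

    Every pair (i, j) with the k-word at position i of seq_a equal to
    the k-word at position j of seq_b contributes one match.
    """
    counts_a = Counter(circular_words(seq_a, k))
    counts_b = Counter(circular_words(seq_b, k))
    # Summing products of word counts equals summing over all position pairs.
    return sum(count * counts_b[word] for word, count in counts_a.items())


if __name__ == "__main__":
    print(d2_periodic("ACGT", "GTAC", 2))
    print(d2_periodic("AAAA", "AAAA", 2))
```

With periodic boundaries, two copies of "AAAA" give 4 x 4 = 16 matches for k = 2, whereas the same sequences read linearly would each contain only three 2-words and give 9.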

SUBMITTER: Burden CJ

PROVIDER: S-EPMC3880068 | biostudies-literature | 2014 Jan

REPOSITORIES: biostudies-literature

Publications

The distribution of word matches between Markovian sequences with periodic boundary conditions.

Burden CJ, Leopardi P, Forêt S

Journal of Computational Biology: a journal of computational molecular cell biology, 2013 Oct 26; issue 1

PMID: 24160839

Similar Datasets

Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences.
| S-EPMC1764478 | biostudies-literature

Structural Anisotropy in Polar Fluids Subjected to Periodic Boundary Conditions.
| S-EPMC3269192 | biostudies-other

Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points.
| S-EPMC6330006 | biostudies-literature

EMPIRE: a highly parallel semiempirical molecular orbital program: 2: periodic boundary conditions.
| S-EPMC4435633 | biostudies-literature

Electrically turning periodic structures in cholesteric layer with conical-planar boundary conditions.
| S-EPMC8052423 | biostudies-literature

Annotating large genomes with exact word matches.
| S-EPMC403711 | biostudies-literature

Hirshfeld atom refinement based on projector augmented wave densities with periodic boundary conditions.
| S-EPMC8895013 | biostudies-literature

Lipid and Peptide Diffusion in Bilayers: The Saffman-Delbruck Model and Periodic Boundary Conditions.
| S-EPMC6326097 | biostudies-literature

Crystalline Moduli of Polymers, Evaluated from Density Functional Theory Calculations under Periodic Boundary Conditions.
| S-EPMC6641976 | biostudies-literature

Asymmetric Periodic Boundary Conditions for All-Atom Molecular Dynamics and Coarse-Grained Simulations of Nucleic Acids.
| S-EPMC10544013 | biostudies-literature