Dataset Information

Robust k-mer frequency estimation using gapped k-mers.

ABSTRACT: Oligomers of fixed length, k, commonly known as k-mers, are often used as fundamental elements in the description of DNA sequence features of diverse biological function, or as intermediate elements in the construction of more complex descriptors of sequence features such as position weight matrices. k-mers are very useful as general sequence features because they constitute a complete and unbiased feature set, and do not require parameterization based on incomplete knowledge of biological mechanisms. However, a fundamental limitation in the use of k-mers as sequence features is that as k is increased, larger spatial correlations in DNA sequence elements can be described, but the frequency of observing any specific k-mer becomes very small, and rapidly approaches a sparse matrix of binary counts. Thus any statistical learning approach using k-mers will be susceptible to noisy estimation of k-mer frequencies once k becomes large. Because all molecular DNA interactions have limited spatial extent, gapped k-mers often carry the relevant biological signal. Here we use gapped k-mer counts to more robustly estimate the ungapped k-mer frequencies, by deriving an equation for the minimum norm estimate of k-mer frequencies given an observed set of gapped k-mer frequencies. We demonstrate that this approach provides a more accurate estimate of the k-mer frequencies in real biological sequences using a sample of CTCF binding sites in the human genome.
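
The relationship between ungapped and gapped counts described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' estimator: it only shows that each gapped k-mer count is the sum of the counts of the ungapped k-mers it matches, which is the linear relation a minimum norm estimate would then invert. The function names, the `-` wildcard, the `informative` parameter (number of non-gap positions), and the toy sequence are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def count_kmers(seq, k):
    """Count every contiguous (ungapped) k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def gapped_views(kmer, informative):
    """Yield each gapped version of kmer that keeps `informative`
    positions and replaces the others with the wildcard '-'."""
    for keep in combinations(range(len(kmer)), informative):
        yield "".join(c if i in keep else "-" for i, c in enumerate(kmer))


def count_gapped_kmers(seq, k, informative):
    """Count gapped k-mers by pooling the counts of ungapped k-mers."""
    counts = Counter()
    for kmer, n in count_kmers(seq, k).items():
        for gapped in gapped_views(kmer, informative):
            counts[gapped] += n
    return counts


def matches(kmer, gapped):
    """True if kmer agrees with gapped at every non-wildcard position."""
    return all(g == "-" or g == c for c, g in zip(kmer, gapped))


# Hypothetical toy sequence, not data from the paper.
seq = "ACGTACGTTGCAACGT"
kmers = count_kmers(seq, 6)
gapped = count_gapped_kmers(seq, 6, 4)

# Gapped k-mers pool several ungapped k-mers, so their counts are
# never smaller than those of any single k-mer they match.
print(max(kmers.values()), max(gapped.values()))
```

Because every gapped count aggregates several ungapped k-mers, the gapped counts are less sparse than the ungapped ones, which is the property the abstract relies on for more robust frequency estimates.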

SUBMITTER: Ghandi M

PROVIDER: S-EPMC3895138 | biostudies-literature | 2014 Aug

REPOSITORIES: biostudies-literature

Publications

Robust k-mer frequency estimation using gapped k-mers.

Mahmoud Ghandi, Morteza Mohammad-Noori, Michael A. Beer

Journal of Mathematical Biology, 2013 Jul 17, issue 2

PMID: 23861010

Similar Datasets

Enhanced regulatory sequence prediction using gapped k-mer features. | S-EPMC4102394 | biostudies-literature

Correction: Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features | S-EPMC4250198 | biostudies-literature

Frequency estimation using pool sequencing | PRJEB22590 | ENA

Recombination spot identification Based on gapped k-mers. | S-EPMC4814916 | biostudies-literature

GkmExplain: fast and accurate interpretation of nonlinear gapped k-mer SVMs. | S-EPMC6612808 | biostudies-literature

X-Mapper: fast and accurate sequence alignment via gapped x-mers. | S-EPMC11755882 | biostudies-literature

ScanITD: Detecting internal tandem duplication with robust variant allele frequency estimation. | S-EPMC7450668 | biostudies-literature

Rooibos (Aspalathus linearis) Genome Size Estimation Using Flow Cytometry and K-Mer Analyses. | S-EPMC7076435 | biostudies-literature

Significant reductions in human visual gamma frequency by the gaba reuptake inhibitor tiagabine revealed by robust peak frequency estimation. | S-EPMC5082569 | biostudies-literature

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure. | S-EPMC4111482 | biostudies-literature