Dataset Information

Enhanced regulatory sequence prediction using gapped k-mer features.


ABSTRACT: Oligomers of length k, or k-mers, are convenient and widely used features for modeling the properties and functions of DNA and protein sequences. However, k-mers suffer from the inherent limitation that if the parameter k is increased to resolve longer features, the probability of observing any specific k-mer becomes very small, and k-mer counts approach a binary variable, with most k-mers absent and a few present once. Thus, any statistical learning approach using k-mers as features becomes susceptible to noisy training set k-mer frequencies once k becomes large. To address this problem, we introduce alternative feature sets using gapped k-mers, a new classifier, gkm-SVM, and a general method for robust estimation of k-mer frequencies. To make the method applicable to large-scale genome-wide applications, we develop an efficient tree data structure for computing the kernel matrix. We show that compared to our original kmer-SVM and alternative approaches, our gkm-SVM predicts functional genomic regulatory elements and tissue-specific enhancers with significantly improved accuracy, increasing the precision by up to a factor of two. We then show that gkm-SVM consistently outperforms kmer-SVM on human ENCODE ChIP-seq datasets, and further demonstrate the general utility of our method using a Naïve-Bayes classifier. Although developed for regulatory sequence analysis, these methods can be applied to any sequence classification problem.
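
To make the idea of gapped k-mer features concrete, the following is a minimal sketch and not the authors' gkm-SVM implementation. It assumes a gapped k-mer is a window of length l in which k positions are kept and the remaining l - k are masked with "-". It counts features by brute force rather than with the tree data structure the abstract mentions, and the similarity it returns is an unnormalized inner product of counts. The function names `gapped_kmer_counts` and `gkm_similarity` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def gapped_kmer_counts(seq, l, k):
    """Count gapped k-mers in seq (illustrative brute-force version).

    Assumption: each feature is a length-l window with k positions kept
    and the other l - k positions replaced by '-'.
    """
    if not 0 < k <= l:
        raise ValueError("need 0 < k <= l")
    counts = Counter()
    # Every way of choosing which k of the l positions stay informative.
    masks = [frozenset(m) for m in combinations(range(l), k)]
    for start in range(len(seq) - l + 1):
        window = seq[start:start + l]
        for keep in masks:
            feature = "".join(
                c if i in keep else "-" for i, c in enumerate(window)
            )
            counts[feature] += 1
    return counts


def gkm_similarity(a, b, l, k):
    """Unnormalized inner product of the two gapped k-mer count vectors."""
    ca = gapped_kmer_counts(a, l, k)
    cb = gapped_kmer_counts(b, l, k)
    # Counter returns 0 for features that are absent from b.
    return sum(n * cb[f] for f, n in ca.items())


if __name__ == "__main__":
    print(sorted(gapped_kmer_counts("ACGT", l=3, k=2)))
    print(gkm_similarity("ACGT", "ACGA", l=3, k=2))
```

Because a single mismatch between two windows still leaves several masked features in common, counts of this kind are less sparse than counts of full-length k-mers, which is the limitation the abstract describes for large k.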

SUBMITTER: Ghandi M 

PROVIDER: S-EPMC4102394 | biostudies-literature | 2014 Jul

REPOSITORIES: biostudies-literature

Publications

Enhanced regulatory sequence prediction using gapped k-mer features.

Ghandi M, Lee D, Mohammad-Noori M, Beer MA

PLoS Computational Biology, 2014 Jul 17, issue 7


Similar Datasets

| S-EPMC3895138 | biostudies-literature
| S-EPMC2894511 | biostudies-literature
| S-EPMC6612808 | biostudies-other
| S-EPMC2737730 | biostudies-literature
| S-EPMC2638152 | biostudies-literature
| S-EPMC4087249 | biostudies-literature
| S-EPMC8390822 | biostudies-literature
| S-EPMC6480087 | biostudies-literature
| S-EPMC6402319 | biostudies-literature
| S-EPMC2828123 | biostudies-literature