Dataset Information

Extracting sequence features to predict protein-DNA interactions: a comparative study.

ABSTRACT: Predicting how and where proteins, especially transcription factors (TFs), interact with DNA is an important problem in biology. We present here a systematic study of predictive modeling approaches to the TF-DNA binding problem, which have been frequently shown to be more efficient than those methods only based on position-specific weight matrices (PWMs). In these approaches, a statistical relationship between genomic sequences and gene expression or ChIP-binding intensities is inferred through a regression framework; and influential sequence features are identified by variable selection. We examine a few state-of-the-art learning methods including stepwise linear regression, multivariate adaptive regression splines, neural networks, support vector machines, boosting and Bayesian additive regression trees (BART). These methods are applied to both simulated datasets and two whole-genome ChIP-chip datasets on the TFs Oct4 and Sox2, respectively, in human embryonic stem cells. We find that, with proper learning methods, predictive modeling approaches can significantly improve the predictive power and identify more biologically interesting features, such as TF-TF interactions, than the PWM approach. In particular, BART and boosting show the best and the most robust overall performance among all the methods.

SUBMITTER: Zhou Q

PROVIDER: S-EPMC2475627 | biostudies-literature | 2008 Jul

REPOSITORIES: biostudies-literature

Publications

Extracting sequence features to predict protein-DNA interactions: a comparative study.

Zhou Q, Liu JS

Nucleic Acids Research, 2008 Jun 13, issue 12

PMID: 18556756

Similar Datasets

Extracting DNA words based on the sequence features: non-uniform distribution and integrity.
| S-EPMC4727310 | biostudies-literature

Quantifying sequence and structural features of protein-RNA interactions.
| S-EPMC4150784 | biostudies-literature

TT3D: Leveraging precomputed protein 3D sequence models to predict protein-protein interactions.
| S-EPMC10640393 | biostudies-literature

Sequence Features Accurately Predict Genome-wide MeCP2 Binding in vivo
2016-03-24 | E-GEOD-71126 | biostudies-arrayexpress

A comparative study of protein-ssDNA interactions.
| S-EPMC7902235 | biostudies-literature

Sequence Features Accurately Predict Genome-wide MeCP2 Binding in vivo
2016-03-24 | GSE71126 | GEO

Comparative interactomics for virus-human protein-protein interactions: DNA viruses versus RNA viruses.
| S-EPMC5221455 | biostudies-literature

Extracting interpretable features for pathologists using weakly supervised learning to predict p16 expression in oropharyngeal cancer.
| S-EPMC10894206 | biostudies-literature

Insights into protein-DNA interactions from hydrogen bond energy-based comparative protein-ligand analyses.
| S-EPMC9018545 | biostudies-literature

Divalent Ion-Mediated DNA-DNA Interactions: A Comparative Study of Triplex and Duplex.
| S-EPMC5549645 | biostudies-literature