
Dataset Information

Prediction of DNA-binding residues from protein sequence information using random forests.


ABSTRACT:

BACKGROUND: Protein-DNA interactions are involved in many biological processes essential for cellular function. To understand the molecular mechanism of protein-DNA recognition, it is necessary to identify the DNA-binding residues in DNA-binding proteins. However, structural data are available for only a few hundred protein-DNA complexes. With the rapid accumulation of sequence data, it has become an important but challenging task to accurately predict DNA-binding residues directly from amino acid sequence data.

RESULTS: A new machine learning approach has been developed in this study for predicting DNA-binding residues from amino acid sequence data. The approach used both the labeled data instances collected from the available structures of protein-DNA complexes and the abundant unlabeled data found in protein sequence databases. The evolutionary information contained in the unlabeled sequence data was represented as position-specific scoring matrices (PSSMs) and several new descriptors. The sequence-derived features were then used to train random forests (RFs), which can handle a large number of input variables and avoid model overfitting. The use of evolutionary information was found to significantly improve classifier performance. The RF classifier was further evaluated using a separate test dataset, and the predicted DNA-binding residues were examined in the context of three-dimensional structures.

CONCLUSION: The results suggest that the RF-based approach predicts DNA-binding residues more accurately than the approaches reported in previous studies. A new web server called BindN-RF (http://bioinfo.ggc.org/bindn-rf/) has thus been developed to make the RF classifier accessible to the biological research community.
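The abstract says that PSSM-derived features were used to train random forests, but it does not say how PSSM rows were turned into per-residue inputs. The sketch below illustrates one common choice, assumed here and not confirmed by the record: a sliding window that concatenates the PSSM rows around each residue into one feature vector. The function name `window_features`, the window half-width, the zero padding, and the toy matrix are all hypothetical.

```python
from typing import List, Sequence


def window_features(
    pssm: Sequence[Sequence[float]],
    half_width: int = 1,
    pad: float = 0.0,
) -> List[List[float]]:
    """Build one flat feature vector per residue from a PSSM.

    Assumption (not from the source): each residue is described by the
    PSSM rows inside a window centred on it; positions that fall outside
    the sequence are filled with `pad`.
    """
    n_cols = len(pssm[0]) if pssm else 0
    features: List[List[float]] = []
    for i in range(len(pssm)):
        vector: List[float] = []
        for j in range(i - half_width, i + half_width + 1):
            if 0 <= j < len(pssm):
                vector.extend(pssm[j])          # row of a real residue
            else:
                vector.extend([pad] * n_cols)   # padding beyond the termini
        features.append(vector)
    return features


# Toy 4-residue matrix with 2 columns; a real PSSM has 20 columns per residue.
toy_pssm = [[1.0, -1.0], [0.5, 2.0], [-3.0, 0.0], [4.0, 1.0]]
vectors = window_features(toy_pssm, half_width=1)
```

Each vector could then be paired with a binding or non-binding label and passed to any random forest implementation, for example scikit-learn's `RandomForestClassifier`; that pairing is an illustration rather than the authors' documented pipeline.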

SUBMITTER: Wang L 

PROVIDER: S-EPMC2709252 | biostudies-literature | 2009

REPOSITORIES: biostudies-literature

Publications

Prediction of DNA-binding residues from protein sequence information using random forests.

Wang Liangjiang, Yang Mary Qu, Yang Jack Y

BMC Genomics, 2009-07-07



Similar Datasets

S-EPMC4532418 | biostudies-literature
S-EPMC4271564 | biostudies-literature
S-EPMC3577447 | biostudies-literature
S-EPMC2651179 | biostudies-literature
S-EPMC3530872 | biostudies-literature
S-EPMC3885575 | biostudies-literature
S-EPMC2638931 | biostudies-literature
S-EPMC4329842 | biostudies-literature
S-EPMC7387700 | biostudies-literature
S-EPMC7246089 | biostudies-literature