Dataset Information

Predicting protein-binding regions in RNA using nucleotide profiles and compositions.


ABSTRACT:

Background

Motivated by the growing amount of data on protein-RNA interactions and the availability of complete genome sequences for several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most of these methods find RNA-binding sites in proteins rather than protein-binding sites in RNAs. Predicting protein-binding sites in RNA is more challenging than predicting RNA-binding sites in proteins, and recent computational methods for finding protein-binding sites in RNAs have several drawbacks for practical use.

Results

We developed a new support vector machine (SVM) model for predicting protein-binding regions in mRNA sequences. The model uses sequence profiles constructed from log-odds scores of mono- and di-nucleotides, together with nucleotide compositions. The model was evaluated by standard 10-fold cross-validation, leave-one-protein-out (LOPO) cross-validation and independent testing. Since actual mRNA sequences have more non-binding regions than protein-binding regions, we tested the model on several datasets with different ratios of protein-binding regions to non-binding regions. The model performed best on a balanced dataset of positive and negative instances. 10-fold cross-validation with a balanced dataset achieved a sensitivity of 91.6%, a specificity of 92.4%, an accuracy of 92.0%, a positive predictive value (PPV) of 91.7%, a negative predictive value (NPV) of 92.3% and a Matthews correlation coefficient (MCC) of 0.840. LOPO cross-validation showed lower performance than 10-fold cross-validation, but the performance remained high (87.6% accuracy and 0.752 MCC). On independent datasets, the model achieved an accuracy of 82.2% and an MCC of 0.656. When our model and other state-of-the-art methods were tested on the same dataset, our model outperformed the others.

Conclusions

Sequence profiles of log-odds scores of mono- and di-nucleotides were much more powerful features than nucleotide compositions for finding protein-binding regions in RNA sequences. However, a slight performance gain was obtained when the sequence profiles were used together with nucleotide compositions. These are preliminary results of ongoing research, but they demonstrate the potential of our approach as a powerful predictor of protein-binding regions in RNA. The program and supporting data are available at http://bclab.inha.ac.kr/RBPbinding .
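To make the two feature types concrete, the sketch below builds a log-odds profile and a composition vector for a fixed-length RNA window. The abstract does not give the exact scoring formula, so this is an assumption: here a log-odds score is the log2 ratio of a k-mer's frequency in binding regions to its frequency in non-binding regions, with add-one smoothing. All function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGU"


def kmer_frequencies(seqs, k, pseudocount=1.0):
    """Smoothed relative frequency of every k-mer over ACGU in seqs."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    # Only k-mers made of A, C, G, U contribute to the total.
    total = sum(counts[m] for m in kmers)
    denom = total + pseudocount * len(kmers)
    return {m: (counts[m] + pseudocount) / denom for m in kmers}


def log_odds_table(binding, non_binding, k):
    """Assumed scoring: log2(freq in binding / freq in non-binding)."""
    pos = kmer_frequencies(binding, k)
    neg = kmer_frequencies(non_binding, k)
    return {m: math.log2(pos[m] / neg[m]) for m in pos}


def profile(seq, table, k):
    """Position-wise log-odds scores; unknown k-mers score 0.0."""
    return [table.get(seq[i:i + k], 0.0) for i in range(len(seq) - k + 1)]


def composition(seq):
    """Fraction of A, C, G and U in the window."""
    n = len(seq)
    return [seq.count(b) / n for b in ALPHABET] if n else [0.0] * len(ALPHABET)


def feature_vector(seq, mono, di):
    """Concatenate mono- and di-nucleotide profiles with the composition."""
    return profile(seq, mono, 1) + profile(seq, di, 2) + composition(seq)


# Toy training regions (hypothetical, for illustration only).
mono = log_odds_table(["GGGG"], ["AAAA"], k=1)
di = log_odds_table(["GGGG"], ["AAAA"], k=2)
features = feature_vector("GGAU", mono, di)
```

In this sketch the profile keeps positional information (one score per position), while the composition summarizes the whole window in four numbers. Vectors of this kind could then be passed to any SVM implementation.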

SUBMITTER: Choi D 

PROVIDER: S-EPMC5374631 | biostudies-literature | 2017 Mar

REPOSITORIES: biostudies-literature

Publications

Predicting protein-binding regions in RNA using nucleotide profiles and compositions.

Daesik Choi, Byungkyu Park, Hanju Chae, Wook Lee, Kyungsook Han

BMC Systems Biology, 2017-03-14, Suppl 2


Similar Datasets

| S-EPMC11374027 | biostudies-literature
| S-EPMC2597731 | biostudies-literature
| S-EPMC3760854 | biostudies-literature
| S-EPMC6233526 | biostudies-literature
| S-EPMC8733325 | biostudies-literature
| S-EPMC9539567 | biostudies-literature
| S-EPMC5977759 | biostudies-literature
| S-EPMC554833 | biostudies-literature
| S-EPMC3287504 | biostudies-literature
| S-EPMC2994895 | biostudies-literature