Dataset Information

Predicting protein-binding regions in RNA using nucleotide profiles and compositions.

ABSTRACT:

Background

Motivated by the increased amount of data on protein-RNA interactions and the availability of complete genome sequences of several organisms, many computational methods have been proposed to predict binding sites in protein-RNA interactions. However, most computational methods are limited to finding RNA-binding sites in proteins instead of protein-binding sites in RNAs. Predicting protein-binding sites in RNA is more challenging than predicting RNA-binding sites in proteins. Recent computational methods for finding protein-binding sites in RNAs have several drawbacks for practical use.

Results

We developed a new support vector machine (SVM) model for predicting protein-binding regions in mRNA sequences. The model uses sequence profiles constructed from log-odds scores of mono- and di-nucleotides, together with nucleotide compositions. The model was evaluated by standard 10-fold cross validation, leave-one-protein-out (LOPO) cross validation and independent testing. Since actual mRNA sequences have more non-binding regions than protein-binding regions, we tested the model on several datasets with different ratios of protein-binding regions to non-binding regions. The model performed best on a balanced dataset of positive and negative instances. With a balanced dataset, 10-fold cross validation achieved a sensitivity of 91.6%, a specificity of 92.4%, an accuracy of 92.0%, a positive predictive value (PPV) of 91.7%, a negative predictive value (NPV) of 92.3% and a Matthews correlation coefficient (MCC) of 0.840. LOPO cross validation showed lower performance than 10-fold cross validation, but performance remained high (87.6% accuracy and 0.752 MCC). On independent datasets, the model achieved an accuracy of 82.2% and an MCC of 0.656. Testing our model and other state-of-the-art methods on the same dataset showed that our model outperformed the others.

Conclusions

Sequence profiles of log-odds scores of mono- and di-nucleotides were much more powerful features than nucleotide compositions in finding protein-binding regions in RNA sequences. However, a slight performance gain was obtained when the sequence profiles were used along with nucleotide compositions. These are preliminary results of ongoing research, but they demonstrate the potential of our approach as a powerful predictor of protein-binding regions in RNA. The program and supporting data are available at http://bclab.inha.ac.kr/RBPbinding.
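The abstract does not give the exact formula behind the log-odds profiles, so the following is only an illustrative sketch under stated assumptions: the log-odds score of a k-mer is taken as the base-2 log of its smoothed frequency in binding regions over its smoothed frequency in non-binding regions, a profile is the list of those scores along a sequence, and the composition is the fraction of each of A, C, G and U. All function names, the pseudocount and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from itertools import product

ALPHABET = "ACGU"  # assumes sequences contain only these four symbols

def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def log_odds_table(pos_seqs, neg_seqs, k, pseudo=1.0):
    """Assumed scoring: log2 of smoothed k-mer frequency in binding vs non-binding regions."""
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    pos, neg = kmer_counts(pos_seqs, k), kmer_counts(neg_seqs, k)
    pos_total = sum(pos[m] for m in kmers) + pseudo * len(kmers)
    neg_total = sum(neg[m] for m in kmers) + pseudo * len(kmers)
    return {
        m: math.log2(((pos[m] + pseudo) / pos_total) / ((neg[m] + pseudo) / neg_total))
        for m in kmers
    }

def profile(seq, table, k):
    """Log-odds score of the k-mer starting at each position of the sequence."""
    return [table[seq[i:i + k]] for i in range(len(seq) - k + 1)]

def composition(seq):
    """Fraction of each nucleotide in the sequence."""
    return [seq.count(base) / len(seq) for base in ALPHABET]

def feature_vector(seq, mono_table, di_table):
    """Concatenate mono- and di-nucleotide profiles with the nucleotide composition."""
    return profile(seq, mono_table, 1) + profile(seq, di_table, 2) + composition(seq)

# Toy training regions (hypothetical, not taken from the paper's data).
binding = ["GGGAU", "GGGAA"]
non_binding = ["CCCUU", "ACCCU"]
mono = log_odds_table(binding, non_binding, k=1)
di = log_odds_table(binding, non_binding, k=2)
print(feature_vector("GGGAU", mono, di))
```

For fixed-length regions, vectors built this way have a constant length and could be passed to any SVM implementation; how the authors window, scale or weight the features is not stated in the abstract.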

SUBMITTER: Choi D

PROVIDER: S-EPMC5374631 | biostudies-literature | 2017 Mar

REPOSITORIES: biostudies-literature

Publications

Predicting protein-binding regions in RNA using nucleotide profiles and compositions.

Daesik Choi, Byungkyu Park, Hanju Chae, Wook Lee, Kyungsook Han

BMC Systems Biology, 2017-03-14, Suppl 2

PMID: 28361677

Similar Datasets

Project description: Background: Pathological tissue remodeling such as fibrosis develops in various cardiac diseases. As one of the protein markers of activated cardiac myofibroblasts, CKAP4 may be involved in this process, but the mechanisms have not been explored. Methods: We hypothesized that CKAP4 plays a role in the regulation of cardiac fibrotic remodeling as an RNA-binding protein. Using improved RNA immunoprecipitation and sequencing (iRIP-seq), we sought to analyze the RNAs bound by CKAP4 in normal atrial muscle (IP1 group) and remodeling fibrotic atrial muscle (IP2 group) from patients with cardiac valvular disease. Quantitative PCR and Western blotting were applied to measure CKAP4 mRNA and protein expression levels in human right atrium samples. Results: iRIP-seq was successfully performed and CKAP4-bound RNAs were characterized. By statistically analyzing the distribution of binding peaks in various regions of the human reference genome, we found that the reads of the IP samples were mainly distributed in intergenic and intronic regions, implying that CKAP4 is more inclined to bind non-coding RNAs. There were 913 overlapping binding peaks between the IP1 and IP2 groups. The top five binding motifs were obtained with HOMER, among which GGGAU was the binding sequence that appeared in both IP groups. Enrichment analysis of binding peak-related genes showed that these genes were mainly involved in biological processes such as signal transduction, protein phosphorylation, axonal guidance, and cell connection. The signaling pathways that differed most between the IP2 and IP1 groups were related to the mitotic cell cycle, protein ubiquitination and nerve growth factor receptors. Notably, peak analysis revealed the lncRNA-binding features of CKAP4 in both IP groups. Furthermore, qPCR verified that CKAP4 differentially bound lncRNAs, including LINC00504, FLJ22447, RP11-326N17.2, and HELLPAR, in remodeling myocardial tissues compared with normal myocardial tissues. Finally, the expression of CKAP4 is down-regulated in the human remodeling fibrotic atrium. Conclusions: We reveal certain RNA-binding features of CKAP4, suggesting a relevant role as an unconventional RNA-binding protein in the cardiac remodeling process. Deeper structural and functional analysis will help to enrich the regulatory network of cardiac remodeling and to identify potential therapeutic targets.

Project description: Background: RNA-binding proteins (RBPs) play diverse roles in eukaryotic RNA processing. Despite their pervasive functions in coding and noncoding RNA biogenesis and regulation, elucidating the sequence specificities that define protein-RNA interactions remains a major challenge. Recently, CLIP-seq (cross-linking immunoprecipitation followed by high-throughput sequencing) has been successfully implemented to study the transcriptome-wide binding patterns of SRSF1, PTBP1, NOVA and fox2 proteins. These studies either adopted traditional methods like Multiple EM for Motif Elicitation (MEME) to discover the sequence consensus of RBP binding sites or used Z-score statistics to search for overrepresented nucleotides of a certain size. We argue that most of these methods are not well suited for RNA motif identification, as they are unable to incorporate the RNA structural context of protein-RNA interactions, which may affect binding specificity. Here, we describe a novel model-based approach, RNAMotifModeler, to identify the consensus of protein-RNA binding regions by integrating sequence features and RNA secondary structures. Results: As an example, we implemented RNAMotifModeler on SRSF1 (SF2/ASF) CLIP-seq data. The sequence-structural consensus we identified is a purine-rich octamer 'AGAAGAAG' in a highly single-stranded RNA context. The unpaired probabilities (the probabilities of not forming base pairs) are significantly higher than those of negative controls and of the flanking sequence surrounding the binding site, indicating that SRSF1 proteins tend to bind single-stranded RNA. Further statistical evaluations revealed that the second and fifth bases of the SRSF1 octamer motif have much stronger sequence specificities but weaker single-strandedness, while the third, fourth, sixth and seventh bases are far more likely to be single-stranded but have more degenerate sequence specificities. Therefore, we hypothesize that nucleotide specificity and secondary structure play complementary roles during binding site recognition by SRSF1. Conclusion: In this study, we presented a computational model to predict the sequence consensus and optimal RNA secondary structure for protein-RNA binding regions. The successful implementation on SRSF1 CLIP-seq data demonstrates great potential to improve our understanding of the binding specificity of RNA-binding proteins.
