Dataset Information

Improving protein function prediction methods with integrated literature data.

ABSTRACT:

Background

Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity.

Results

We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges, since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably-sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes, which global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10%, which yield the best trade-off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial.
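As an illustration of the "traditional" baseline mentioned above (an edge is asserted when at least one abstract mentions both proteins), here is a minimal Python sketch. The function names, the toy protein identifiers, and the `min_abstracts` parameter are hypothetical and introduced only for this example; in particular, the raw-count cutoff is a simple stand-in and is not the reliability measure introduced in the paper, which the abstract does not specify.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(abstract_mentions):
    """Count, for each unordered protein pair, how many abstracts mention both.

    abstract_mentions: iterable of protein-name collections, one per abstract.
    """
    counts = Counter()
    for proteins in abstract_mentions:
        # Deduplicate within an abstract and sort so each pair has one canonical key.
        for a, b in combinations(sorted(set(proteins)), 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence_edges(counts, min_abstracts=1):
    """Return the pairs seen together in at least `min_abstracts` abstracts.

    min_abstracts=1 reproduces the baseline described in the abstract.
    Larger values are a hypothetical, cruder filter (an assumption of this
    sketch), not the paper's reliability score.
    """
    return {pair for pair, n in counts.items() if n >= min_abstracts}


# Toy input: each inner list holds the proteins mentioned in one abstract.
abstracts = [
    ["YFG1", "YFG2"],
    ["YFG1", "YFG2", "YFG3"],
    ["YFG3"],
]

counts = cooccurrence_counts(abstracts)
baseline_edges = cooccurrence_edges(counts)                    # at least one shared abstract
stricter_edges = cooccurrence_edges(counts, min_abstracts=2)   # hypothetical stricter cutoff
```

The resulting edge set could then be merged with a protein-protein interaction edge list before running a network-based predictor; the abstract reports that the baseline rule admits too many false positives, which is why some form of reliability filtering is preferred.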

Conclusion

Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily-available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit.

SUBMITTER: Gabow AP

PROVIDER: S-EPMC2375131 | biostudies-literature | 2008 Apr

REPOSITORIES: biostudies-literature

Publications

Improving protein function prediction methods with integrated literature data.

Gabow AP, Leach SM, Baumgartner WA, Hunter LE, Goldberg DS

BMC Bioinformatics, 2008 Apr 15

PMID: 18412966

Similar Datasets

Improving protein function prediction using protein sequence and GO-term similarities.
| S-EPMC6449755 | biostudies-literature

Dynameomics: data-driven methods and models for utilizing large-scale protein structure repositories for improving fragment-based loop prediction.
| S-EPMC4241109 | biostudies-literature

INGA 2.0: improving protein function prediction for the dark proteome.
| S-EPMC6602455 | biostudies-literature

Improving protein function prediction by learning and integrating representations of protein sequences and function labels.
| S-EPMC11374024 | biostudies-literature

SDN2GO: An Integrated Deep Learning Model for Protein Function Prediction.
| S-EPMC7201018 | biostudies-literature

NetGO: improving large-scale protein function prediction with massive network information.
| S-EPMC6602452 | biostudies-literature

Integrated protein function prediction by mining function associations, sequences, and protein-protein and gene-gene interaction networks.
| S-EPMC4894840 | biostudies-literature

Exploring function prediction in protein interaction networks via clustering methods.
| S-EPMC4074043 | biostudies-literature

Probabilistic protein function prediction from heterogeneous genome-wide data.
| S-EPMC1828618 | biostudies-literature

Computational prediction of protein interfaces: A review of data driven methods.
| S-EPMC4655202 | biostudies-literature