Dataset Information

A statistical model of protein sequence similarity and function similarity reveals overly-specific function predictions.

ABSTRACT:

Background

Predicting protein function from primary sequence is an important open problem in modern biology. Not only are there many thousands of proteins of unknown function, but current approaches for predicting function also need improvement. One problem in particular is overly specific function predictions, which we address here with a new statistical model of the relationship between protein sequence similarity and protein function similarity.

Methodology

Our statistical model is based on sets of proteins with experimentally validated functions and on numeric measures of function specificity and function similarity derived from the Gene Ontology. The model predicts the similarity of function between two proteins given their amino acid sequence similarity, as measured by statistics from the BLAST sequence alignment algorithm. A novel aspect of our model is that it predicts the degree of function similarity shared between two proteins over a continuous range of sequence similarity, which facilitates predicting function at an appropriate level of specificity.

Significance

Our model shows nearly exact function similarity for proteins with high sequence similarity (bit score >244.7, e-value <1e-62, non-redundant NCBI protein database (NRDB)) and only a small likelihood of a specific function match for proteins with low sequence similarity (bit score <54.6, e-value >1e-05, NRDB). For sequence similarity ranges in between, our annotation model shows an increasing relationship between function similarity and sequence similarity, but with considerable variability. We applied the model to a large set of proteins of unknown function and predicted functions for thousands of these proteins, ranging from general to very specific. We also applied the model to a data set of proteins with previously assigned, specific functions that had been assigned electronically. We show that, on average, these prior function predictions are more specific (quite possibly overly specific) than predictions from our model, which is based on proteins with experimentally determined function.
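The three sequence-similarity regimes described above can be illustrated with a minimal Python sketch. It uses only the two bit-score thresholds quoted in the abstract; the function names, and the search_space argument (query length times database length), are hypothetical and not taken from the paper. The second function applies the standard BLAST relation E = search_space * 2^(-bit score), under which a higher bit score gives a smaller e-value for a fixed search space. It is not the authors' statistical model, which predicts function similarity over a continuous range.

```python
import math

# Bit-score thresholds quoted in the abstract (searches against NRDB).
HIGH_BITS = 244.7
LOW_BITS = 54.6


def similarity_regime(bit_score: float) -> str:
    """Map a BLAST bit score to one of the three regimes in the abstract.

    Illustrative helper only; the name and return labels are assumptions.
    """
    if bit_score > HIGH_BITS:
        return "high"          # nearly exact function similarity reported
    if bit_score < LOW_BITS:
        return "low"           # small likelihood of a specific function match
    return "intermediate"      # increasing but variable relationship


def log10_evalue(bit_score: float, search_space: float) -> float:
    """Return log10(E) for E = search_space * 2**(-bit_score).

    search_space is a hypothetical input; the abstract does not state it.
    Working in log10 avoids floating-point underflow for large bit scores.
    """
    return math.log10(search_space) - bit_score * math.log10(2)


if __name__ == "__main__":
    for bits in (30.0, 100.0, 300.0):
        print(bits, similarity_regime(bits), round(log10_evalue(bits, 1e11), 2))
```

Scores exactly equal to a threshold fall into the intermediate regime here, since the abstract states both bounds as strict inequalities.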

SUBMITTER: Louie B

PROVIDER: S-EPMC2760442 | biostudies-literature | 2009 Oct

REPOSITORIES: biostudies-literature

Publications

A statistical model of protein sequence similarity and function similarity reveals overly-specific function predictions.

Louie B, Higdon R, Kolker E

PLoS ONE, 2009 Oct 21, issue 10

PMID: 19844580

Similar Datasets

Wei2GO: weighted sequence similarity-based protein function prediction.
| S-EPMC8855713 | biostudies-literature

Effusion: prediction of protein function from sequence similarity networks.
| S-EPMC6361244 | biostudies-literature

GPSFun: geometry-aware protein sequence function predictions with language models.
| S-EPMC11223820 | biostudies-literature

Enhancing TCR specificity predictions by combined pan- and peptide-specific training, loss-scaling, and sequence similarity integration.
| S-EPMC10942633 | biostudies-literature

Modeling sequence and function similarity between proteins for protein functional annotation.
| S-EPMC4120521 | biostudies-literature

Prediction of enzyme function by combining sequence similarity and protein interactions.
| S-EPMC2430716 | biostudies-literature

Quantitative assessment of relationship between sequence similarity and function similarity.
| S-EPMC1949826 | biostudies-literature

Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST): A web tool for generating protein sequence similarity networks.
| S-EPMC4457552 | biostudies-literature

INGA: protein function prediction combining interaction networks, domain assignments and sequence similarity.
| S-EPMC4489281 | biostudies-literature

Testing statistical significance scores of sequence comparison methods with structure similarity.
| S-EPMC1618413 | biostudies-literature