Dataset Information

Automated protein subfamily identification and classification.

ABSTRACT: Function prediction by homology is widely used to provide preliminary functional annotations for genes for which experimental evidence of function is unavailable or limited. This approach has been shown to be prone to systematic error, including percolation of annotation errors through sequence databases. Phylogenomic analysis avoids these errors in function prediction but has been difficult to automate for high-throughput application. To address this limitation, we present a computationally efficient pipeline for phylogenomic classification of proteins. This pipeline uses the SCI-PHY (Subfamily Classification in Phylogenomics) algorithm for automatic subfamily identification, followed by subfamily hidden Markov model (HMM) construction. A simple and computationally efficient scoring scheme using family and subfamily HMMs enables classification of novel sequences to protein families and subfamilies. Sequences representing entirely novel subfamilies are differentiated from those that can be classified to subfamilies in the input training set using logistic regression. Subfamily HMM parameters are estimated using an information-sharing protocol, enabling subfamilies containing even a single sequence to benefit from conservation patterns defining the family as a whole or in related subfamilies. SCI-PHY subfamilies correspond closely to functional subtypes defined by experts and to conserved clades found by phylogenetic analysis. Extensive comparisons of subfamily and family HMM performances show that subfamily HMMs dramatically improve the separation between homologous and non-homologous proteins in sequence database searches. Subfamily HMMs also provide extremely high specificity of classification and can be used to predict entirely novel subtypes. The SCI-PHY Web server at http://phylogenomics.berkeley.edu/SCI-PHY/ allows users to upload a multiple sequence alignment for subfamily identification and subfamily HMM construction. 
Biologists wishing to provide their own subfamily definitions can do so. Source code is available on the Web page. The Berkeley Phylogenomics Group PhyloFacts resource contains pre-calculated subfamily predictions and subfamily HMMs for more than 40,000 protein families and domains at http://phylogenomics.berkeley.edu/phylofacts/.
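
The classification step described in the abstract (score a novel sequence against family and subfamily HMMs, then use logistic regression to decide whether it belongs to a known subfamily or represents a novel one) can be sketched roughly as follows. This is only an illustrative sketch, not the SCI-PHY implementation. The abstract does not specify the regression features or coefficients, so the two features (`margin`, `gap`), the constants `BIAS`, `W_MARGIN`, `W_GAP`, and the function name `classify` are all hypothetical. HMM scores are assumed to be precomputed, with higher meaning a better match.

```python
import math
from typing import Dict, Optional, Tuple

# Hypothetical coefficients, for illustration only. In practice they would be
# fitted on sequences with known subfamily membership.
BIAS = -1.0
W_MARGIN = 0.5   # weight on (best subfamily score - family score)
W_GAP = 0.25     # weight on (best - second-best subfamily score)


def logistic(z: float) -> float:
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(
    family_score: float,
    subfamily_scores: Dict[str, float],
    threshold: float = 0.5,
) -> Tuple[Optional[str], float]:
    """Assign a sequence to its top-scoring subfamily, or flag it as novel.

    Returns (subfamily name or None, estimated probability that the sequence
    belongs to a subfamily in the training set). None means "possibly a
    novel subfamily" under the assumed model.
    """
    if not subfamily_scores:
        raise ValueError("at least one subfamily score is required")

    # Rank subfamilies by score, best first (assumes higher is better).
    ranked = sorted(subfamily_scores.items(), key=lambda kv: kv[1], reverse=True)
    best_name, best_score = ranked[0]
    # With a single subfamily, fall back to the family score as the runner-up.
    second_score = ranked[1][1] if len(ranked) > 1 else family_score

    margin = best_score - family_score   # assumed feature 1
    gap = best_score - second_score      # assumed feature 2
    p_known = logistic(BIAS + W_MARGIN * margin + W_GAP * gap)

    label = best_name if p_known >= threshold else None
    return label, p_known
```

For example, `classify(10.0, {"A": 20.0, "B": 12.0})` assigns the sequence to subfamily `"A"`, whereas `classify(10.0, {"A": 10.5, "B": 10.4})` returns `None` because no subfamily HMM scores clearly better than the family HMM under these assumed weights.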

SUBMITTER: Brown DP

PROVIDER: S-EPMC1950344 | biostudies-literature | 2007 Aug

REPOSITORIES: biostudies-literature

Publications

Automated protein subfamily identification and classification.

Brown DP, Krishnamurthy N, Sjölander K

PLoS Computational Biology, 2007 Aug 1, issue 8

PMID: 17708678

Similar Datasets

pocketZebra: a web-server for automated selection and classification of subfamily-specific binding sites by bioinformatic analysis of diverse protein families. | S-EPMC4086101 | biostudies-literature

Domain-mediated interactions for protein subfamily identification. | S-EPMC6959277 | biostudies-literature

GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains. | S-EPMC2817468 | biostudies-literature

Automated functional classification of experimental and predicted protein structures. | S-EPMC1513613 | biostudies-literature

Top-down clustering for protein subfamily identification. | S-EPMC3653887 | biostudies-other

ViCTree: an automated framework for taxonomic classification from protein sequences. | S-EPMC6022645 | biostudies-literature

Automated classification and identification of slow wave propagation patterns in gastric dysrhythmia. | S-EPMC3911879 | biostudies-literature

A multiresolution approach to automated classification of protein subcellular location images. | S-EPMC1933440 | biostudies-literature

ArchDB: automated protein loop classification as a tool for structural genomics. | S-EPMC308737 | biostudies-literature

Automated classification platform for the identification of otitis media using optical coherence tomography. | S-EPMC6550205 | biostudies-literature