Dataset Information

Clustering of protein families into functional subtypes using Relative Complexity Measure with reduced amino acid alphabets.

ABSTRACT:

Background

Phylogenetic analysis can be used to divide a protein family into subfamilies in the absence of experimental information. Most phylogenetic analysis methods utilize multiple alignment of sequences and are based on an evolutionary model. However, multiple alignment is not an automated procedure and requires human intervention to maintain alignment integrity and to produce phylogenies consistent with the functional splits in underlying sequences. To address this problem, we propose to use the alignment-free Relative Complexity Measure (RCM) combined with reduced amino acid alphabets to cluster protein families into functional subtypes purely on sequence criteria. Comparison with an alignment-based approach was also carried out to test the quality of the clustering.
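The alignment-free idea described above can be illustrated with a minimal sketch. It rests on assumptions that this record does not confirm: the residue grouping in `GROUPS` is a hypothetical reduced alphabet chosen only for illustration, and `rcm_distance` uses one common normalized Lempel-Ziv complexity distance, which may differ from the exact RCM variant used in the paper. The function names are likewise illustrative.

```python
# Hypothetical reduced alphabet: each group of residues collapses to one symbol.
# This grouping is an illustrative assumption, not the one used in the paper.
GROUPS = {"AVLIMC": "a", "FWYH": "f", "STNQ": "s", "KR": "k", "DE": "d", "GP": "g"}
RESIDUE_TO_SYMBOL = {aa: sym for group, sym in GROUPS.items() for aa in group}


def reduce_alphabet(seq):
    """Map a protein sequence onto the reduced alphabet (unknown residues -> 'x')."""
    return "".join(RESIDUE_TO_SYMBOL.get(aa, "x") for aa in seq.upper())


def lz_complexity(s):
    """Number of components in the Lempel-Ziv (1976) exhaustive history of s."""
    i, count, n = 0, 0, len(s)
    while i < n:
        length = 1
        # Extend the candidate while it can be copied from the text seen so far
        # (the prefix that excludes the candidate's last symbol).
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1
        i += length
    return count


def rcm_distance(s, q):
    """Assumed normalized distance: extra complexity needed to build one
    sequence given the other, divided by the larger individual complexity."""
    cs, cq = lz_complexity(s), lz_complexity(q)
    csq, cqs = lz_complexity(s + q), lz_complexity(q + s)
    return max(csq - cs, cqs - cq) / max(cs, cq)


def distance_matrix(seqs):
    """Pairwise distances between protein sequences after alphabet reduction."""
    reduced = [reduce_alphabet(x) for x in seqs]
    return [[rcm_distance(a, b) for b in reduced] for a in reduced]
```

A matrix produced by `distance_matrix` could then be passed to any standard distance-based tree-building or hierarchical clustering routine; no multiple alignment is needed at any step, which is the property the abstract emphasizes.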

Results

We demonstrate the robustness of RCM with reduced alphabets in clustering of protein sequences into families in a simulated dataset and seven well-characterized protein datasets. On protein datasets, crotonases, mandelate racemases, nucleotidyl cyclases and glycoside hydrolase family 2 were clustered into subfamilies with 100% accuracy whereas acyl transferase domains, haloacid dehalogenases, and vicinal oxygen chelates could be assigned to subfamilies with 97.2%, 96.9% and 92.2% accuracies, respectively.

Conclusions

The combination of methods in this paper is useful for clustering protein families into subtypes based solely on protein sequence information. The method is also flexible and computationally fast because it does not require multiple alignment of sequences.

SUBMITTER: Albayrak A

PROVIDER: S-EPMC2936399 | biostudies-literature | 2010 Aug

REPOSITORIES: biostudies-literature

Publications

Clustering of protein families into functional subtypes using Relative Complexity Measure with reduced amino acid alphabets.

Albayrak A, Otu HH, Sezerman UO

BMC Bioinformatics, 2010 Aug 18

PMID: 20718947

Similar Datasets

Project description:

Background

In structural genomics, an important goal is the detection and classification of protein-protein interactions, given the structures of the interacting partners. We have developed empirical energy functions to identify native structures of protein-protein complexes among sets of decoy structures. To understand the role of amino acid diversity, we parameterized a series of functions, using a hierarchy of amino acid alphabets of increasing complexity, with 2, 3, 4, 6, and 20 amino acid groups. Compared to previous work, we used the simplest possible functional form, with residue-residue interactions and a stepwise distance-dependence. We used increased computational resources, however, constructing 290,000 decoys for 219 protein-protein complexes, with a realistic docking protocol where the protein partners are flexible and interact through a molecular mechanics energy function. The energy parameters were optimized to correctly assign as many native complexes as possible. To resolve the multiple minimum problem in parameter space, over 64,000 starting parameter guesses were tried for each energy function. The optimized functions were tested by cross validation on subsets of our native and decoy structures, by blind tests on series of native and decoy structures available on the Web, and on models for 13 complexes submitted to the CAPRI structure prediction experiment.

Results

Performance is similar to several other statistical potentials of the same complexity. For example, the CAPRI target structure is correctly ranked ahead of 90% of its decoys in 6 cases out of 13. The hierarchy of amino acid alphabets leads to a coherent hierarchy of energy functions, with qualitatively similar parameters for similar amino acid types at all levels. Most remarkably, the performance with six amino acid classes is equivalent to that of the most detailed, 20-class energy function.

Conclusion

This suggests that six carefully chosen amino acid classes are sufficient to encode specificity in protein-protein interactions, and provide a starting point to develop more complicated energy functions.
