Dataset Information

Clustering of protein domains for functional and evolutionary studies.

ABSTRACT: BACKGROUND: The number of protein family members defined by DNA sequencing is usually much larger than those characterised experimentally. This paper describes a method to divide protein families into subtypes purely on sequence criteria. Comparison with experimental data allows an independent test of the quality of the clustering. RESULTS: An evolutionary split statistic is calculated for each column in a protein multiple sequence alignment; the statistic has a larger value when a column is better described by an evolutionary model that assumes clustering around two or more amino acids rather than a single amino acid. The user selects columns (typically the top ranked columns) to construct a motif. The motif is used to divide the family into subtypes using a stochastic optimization procedure related to the deterministic annealing EM algorithm (DAEM), which yields a specificity score showing how well each family member is assigned to a subtype. The clustering obtained is not strongly dependent on the number of amino acids chosen for the motif. The robustness of this method was demonstrated using six well characterized protein families: nucleotidyl cyclase, protein kinase, dehydrogenase, two polyketide synthase domains and small heat shock proteins. Phylogenetic trees did not allow accurate clustering for three of the six families. CONCLUSION: The method clustered the families into functional subtypes with an accuracy of 90 to 100%. False assignments usually had a low specificity score.

SUBMITTER: Goldstein P

PROVIDER: S-EPMC2770074 | biostudies-literature | 2009

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Clustering of protein domains for functional and evolutionary studies.

Goldstein Pavle P Zucko Jurica J Vujaklija Dusica D Krisko Anita A Hranueli Daslav D Long Paul F PF Etchebest Catherine C Basrak Bojan B Cullum John J

BMC bioinformatics 20091015

<h4>Background</h4>The number of protein family members defined by DNA sequencing is usually much larger than those characterised experimentally. This paper describes a method to divide protein families into subtypes purely on sequence criteria. Comparison with experimental data allows an independent test of the quality of the clustering.<h4>Results</h4>An evolutionary split statistic is calculated for each column in a protein multiple sequence alignment; the statistic has a larger value when a ...[more]

PMID: 19832975

Similar Datasets

Project description:BACKGROUND:In bacteria, genes with related functions-such as those involved in the metabolism of the same compound or in infection processes-are often physically close on the genome and form groups called clusters. The enrichment of such clusters over various distantly related bacteria can be used to predict the roles of genes of unknown function that cluster with characterised genes. There is no obvious rule to define a cluster, given their variability in size and intergenic distances, and the definition of what comprises a "gene", since genes can gain and lose domains over time. Protein domains can cluster within a gene, or in adjacent genes of related function, and in both cases these are chromosomally clustered. Here, we model the distances between pairs of protein domain coding regions across a wide range of bacteria and archaea via a probabilistic two component mixture model, without imposing arbitrary thresholds in terms of gene numbers or distances. RESULTS:We trained our model using matched gene ontology terms to label functionally related pairs and assess the stability of the parameters of the model across 14,178 archaeal and bacterial strains. We found that the parameters of our mixture model are remarkably stable across bacteria and archaea, except for endosymbionts and obligate intracellular pathogens. Obligate pathogens have smaller genomes, and although they vary, on average do not show noticeably different clustering distances; the main difference in the parameter estimates is that a far greater proportion of the genes sharing ontology terms are clustered. This may reflect that these genomes are enriched for complexes encoded by clustered core housekeeping genes, as a proportion of the total genes. Given the overall stability of the parameter estimates, we then used the mean parameter estimates across the entire dataset to investigate which gene ontology terms are most frequently associated with clustered genes. CONCLUSIONS:Given the stability of the mixture model across species, it may be used to predict bacterial gene clusters that are shared across multiple species, in addition to giving insights into the evolutionary pressures on the chromosomal locations of genes in different species.

Dataset Information

Clustering of protein domains for functional and evolutionary studies.

Publications

Clustering of protein domains for functional and evolutionary studies.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets