
Dataset Information


Integration of phenotypic metadata and protein similarity in Archaea using a spectral bipartitioning approach.


ABSTRACT: In order to simplify and meaningfully categorize large sets of protein sequence data, it is commonplace to cluster proteins based on the similarity of those sequences. However, it quickly becomes clear that the sequence flexibility allowed a given protein varies significantly among different protein families. The degree to which sequences are conserved not only differs for each protein family, but also is affected by the phylogenetic divergence of the source organisms. Clustering techniques that use similarity thresholds for protein families do not always allow for these variations and thus cannot be confidently used for applications such as automated annotation and phylogenetic profiling. In this work, we applied a spectral bipartitioning technique to all proteins from 53 archaeal genomes. Comparisons between different taxonomic levels allowed us to study the effects of phylogenetic distances on cluster structure. Likewise, by associating functional annotations and phenotypic metadata with each protein, we could compare our protein similarity clusters with both protein function and associated phenotype. Our clusters can be analyzed graphically and interactively online.
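The abstract names spectral bipartitioning but this record does not describe the procedure. As a rough illustration of the general idea only, the sketch below splits a small similarity graph in two using the sign of the Fiedler vector (the eigenvector for the second-smallest eigenvalue of the graph Laplacian). The function name `spectral_bipartition`, the unnormalized Laplacian, and the toy similarity values are all assumptions made for illustration; they are not taken from the paper and may differ from the authors' actual method.

```python
import numpy as np


def spectral_bipartition(similarity):
    """Split nodes into two groups by the sign of the Fiedler vector.

    `similarity` is a square matrix of non-negative pairwise similarity
    scores (hypothetical input; e.g. scores between protein sequences).
    Returns a boolean array: True for one side of the split, False for the other.
    """
    W = np.asarray(similarity, dtype=float)
    W = (W + W.T) / 2.0            # enforce symmetry
    np.fill_diagonal(W, 0.0)       # ignore self-similarity
    degree = W.sum(axis=1)
    laplacian = np.diag(degree) - W  # unnormalized graph Laplacian (an assumption)
    # eigh returns eigenvalues in ascending order for symmetric matrices
    _, eigvecs = np.linalg.eigh(laplacian)
    fiedler = eigvecs[:, 1]        # eigenvector of the second-smallest eigenvalue
    return fiedler >= 0


# Toy data (made up): items 0-2 are mutually similar, as are items 3-4.
toy = np.array([
    [0.0, 0.9, 0.8, 0.1, 0.0],
    [0.9, 0.0, 0.7, 0.0, 0.1],
    [0.8, 0.7, 0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1, 0.0, 0.9],
    [0.0, 0.1, 0.0, 0.9, 0.0],
])
labels = spectral_bipartition(toy)
print(labels)
```

Which group receives True is arbitrary, because an eigenvector's sign is not fixed; only the grouping is meaningful. Applying the split recursively to each half would yield a hierarchy of clusters without a single global similarity threshold, which is the kind of limitation the abstract attributes to threshold-based clustering.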

SUBMITTER: Hooper SD 

PROVIDER: S-EPMC2673424 | biostudies-literature | 2009 Apr

REPOSITORIES: biostudies-literature


Publications

Integration of phenotypic metadata and protein similarity in Archaea using a spectral bipartitioning approach.

Sean D Hooper, Iain J Anderson, Amrita Pati, Daniel Dalevi, Konstantinos Mavromatis, Nikos C Kyrpides

Nucleic Acids Research, 2009-02-17, issue 7



Similar Datasets

| S-EPMC4922564 | biostudies-literature
| S-EPMC3542577 | biostudies-literature
| S-EPMC7660437 | biostudies-literature
| S-EPMC1131888 | biostudies-literature
| S-EPMC7094169 | biostudies-literature
| S-EPMC9323741 | biostudies-literature
| S-EPMC9216561 | biostudies-literature
| S-EPMC203378 | biostudies-literature
| S-EPMC41582 | biostudies-other
| S-EPMC6454479 | biostudies-literature