
Dataset Information


Spectral clustering of protein sequences.


ABSTRACT: An important problem in genomics is automatically clustering homologous proteins when only sequence information is available. Most methods for clustering proteins are local, and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance of such methods by analysing the distribution of distances between protein sequences. We then present a global method based on spectral clustering and provide theoretical justification of why it should improve markedly on local methods. We extensively tested our method and compared its performance with that of other methods on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. We consistently observed that the number of clusters we obtain for a given set of proteins is close to the number of superfamilies in that set, that there are fewer singletons, and that the method correctly groups most remote homologs. In our experiments, the quality of the clusters, as quantified by a measure that combines sensitivity and specificity, was consistently better [on average, improvements were 84% over hierarchical clustering, 34% over Connected Component Analysis (CCA) (similar to GeneRAGE) and 72% over another global method, TribeMCL].
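To make the contrast between local thresholding and a global spectral method concrete, here is a minimal sketch of spectral bipartitioning on a pairwise distance matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name `spectral_bipartition`, the Gaussian affinity with width `sigma`, and the two-way split on the sign of the Fiedler vector are all choices made for this example, and the toy distances stand in for real sequence distances.

```python
import numpy as np

def spectral_bipartition(dist, sigma=1.0):
    """Split items into two groups from a symmetric distance matrix.

    Hypothetical helper for illustration only: it uses the sign of the
    second eigenvector (Fiedler vector) of the normalized graph Laplacian.
    """
    dist = np.asarray(dist, dtype=float)
    # Assumed Gaussian affinity: small distances map to weights near 1.
    affinity = np.exp(-dist**2 / (2.0 * sigma**2))
    np.fill_diagonal(affinity, 0.0)  # no self-loops
    degree = affinity.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(degree)
    # Symmetric normalized Laplacian: L = I - D^{-1/2} A D^{-1/2}
    laplacian = np.eye(len(dist)) - inv_sqrt[:, None] * affinity * inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigvecs = np.linalg.eigh(laplacian)
    # Rescale the second eigenvector by D^{-1/2} before taking signs.
    fiedler = eigvecs[:, 1] * inv_sqrt
    return (fiedler > 0).astype(int)

# Toy data: six points on a line forming two loose groups.
points = np.array([0.0, 0.3, 0.5, 3.0, 3.2, 3.6])
distances = np.abs(points[:, None] - points[None, :])
labels = spectral_bipartition(distances)
print(labels)  # first three items share one label, last three the other
```

Because the eigenvector depends on every pairwise affinity at once, the split reflects the global structure of the data rather than a single distance cutoff. Which group receives label 0 or 1 is arbitrary, since an eigenvector's sign is not determined.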

SUBMITTER: Paccanaro A 

PROVIDER: S-EPMC1409676 | biostudies-literature | 2006

REPOSITORIES: biostudies-literature


Publications

Spectral clustering of protein sequences.

Alberto Paccanaro, James A. Casbon, Mansoor A. S. Saqi

Nucleic Acids Research, 17 March 2006, issue 5



Similar Datasets

| S-EPMC2935381 | biostudies-literature
| S-EPMC547898 | biostudies-literature
| S-EPMC4922564 | biostudies-literature
| S-EPMC5793492 | biostudies-literature
| S-EPMC1845149 | biostudies-literature
| S-EPMC3163914 | biostudies-literature
| S-EPMC5798376 | biostudies-literature
| S-EPMC6309053 | biostudies-literature
| S-EPMC6454479 | biostudies-literature
| S-EPMC5635860 | biostudies-literature