Dataset Information

Clustering the annotation space of proteins.


ABSTRACT: BACKGROUND: Current protein clustering methods rely on either sequence or functional similarities between proteins, thereby limiting inferences to one of these areas. RESULTS: Here we report a new approach, named CLAN, which clusters proteins according to both annotation and sequence similarity. This approach is extremely fast, clustering the complete SwissProt database within minutes. It is also accurate, recovering consistent protein families that agree, on average, more than 97% with sequence-based protein families from Pfam. Discrepancies between sequence- and annotation-based clusters were scrutinized and the reasons reported. We demonstrate examples of each of these cases, and thoroughly discuss an example of a propagated error in SwissProt: a vacuolar ATPase subunit M9.2 erroneously annotated as vacuolar ATP synthase subunit H. The CLAN algorithm is available from the authors, and the CLAN database is accessible at http://maine.ebi.ac.uk:8000/cgi-bin/clan/ClanSearch.pl. CONCLUSIONS: CLAN creates refined function-and-sequence specific protein families that can be used for identification and annotation of unknown family members. It also allows easy identification of erroneous annotations by spotting inconsistencies between similarities at the annotation and sequence levels.
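
ILLUSTRATION: The idea of spotting inconsistencies between annotation-level and sequence-level similarity can be sketched with a toy example. This is not the CLAN algorithm, which the abstract does not describe in detail; it is a minimal sketch that assumes each protein already carries an annotation string and a sequence-family label. The record fields, the function name `inconsistent_annotations`, and the identifiers `P1`..`P4` and `FAM_A`..`FAM_C` are all hypothetical.

```python
from collections import defaultdict


def inconsistent_annotations(records):
    """Return annotations whose proteins fall into more than one sequence family.

    Each record is a dict with hypothetical keys "id", "annotation", "family".
    An annotation shared by proteins from several sequence families is only a
    candidate for manual review, not proof of an annotation error.
    """
    families_per_annotation = defaultdict(set)
    for rec in records:
        families_per_annotation[rec["annotation"]].add(rec["family"])
    return {
        annotation: sorted(families)
        for annotation, families in families_per_annotation.items()
        if len(families) > 1
    }


# Toy data (invented for illustration; not taken from SwissProt or Pfam).
records = [
    {"id": "P1", "annotation": "vacuolar ATP synthase subunit H", "family": "FAM_A"},
    {"id": "P2", "annotation": "vacuolar ATP synthase subunit H", "family": "FAM_A"},
    {"id": "P3", "annotation": "vacuolar ATP synthase subunit H", "family": "FAM_B"},
    {"id": "P4", "annotation": "alcohol dehydrogenase", "family": "FAM_C"},
]

flagged = inconsistent_annotations(records)
print(flagged)
```

In this toy data, the only annotation flagged is the one shared by proteins from two different families, which mirrors the kind of discrepancy the abstract says was scrutinized.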

SUBMITTER: Kunin V 

PROVIDER: S-EPMC552314 | biostudies-literature | 2005

REPOSITORIES: biostudies-literature

Publications

Clustering the annotation space of proteins.

Victor Kunin, Christos A. Ouzounis

BMC Bioinformatics, 2005-02-09


Similar Datasets

| S-EPMC6403173 | biostudies-literature
| S-EPMC4803255 | biostudies-literature
| S-EPMC2951670 | biostudies-literature
| S-EPMC6437941 | biostudies-literature
| S-EPMC4213798 | biostudies-other
| S-EPMC7308937 | biostudies-literature
| S-EPMC2147035 | biostudies-literature
| S-EPMC4157666 | biostudies-literature
| S-EPMC3018814 | biostudies-other
2018-11-26 | GSE94027 | GEO