
Dataset Information


Unique function words characterize genomic proteins.


ABSTRACT: Between 2009 and 2016 the number of protein sequences from known species increased 10-fold from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological function but CDART often matches the same region of a protein by two or more profiles. Such synonyms complicate estimates of functional complexity. We do full-linkage clustering of redundant profiles by finding maximum disjoint cliques: Each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs in sequences with multiple domain architectures (MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of "words" or UFWs (57% shared), the "sentences" (MDAs) are different (1.3% shared).
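The clustering step described in the abstract (grouping redundant profiles so that every pair within a cluster is redundant, then keeping one representative per cluster) can be illustrated with a small sketch. This is not the authors' implementation: it assumes redundancy is already given as a list of profile pairs, it uses a greedy largest-clique-first selection as a stand-in for finding maximum disjoint cliques, and it picks the alphabetically first member as the representative. The function names and the toy profile labels are hypothetical.

```python
def maximal_cliques(adj):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(frozenset(r))
            return
        for v in sorted(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


def disjoint_clique_clusters(profiles, redundant_pairs):
    """Greedy sketch: repeatedly remove the largest remaining clique.

    Every cluster is a clique, so all members are pairwise redundant
    (full linkage). Greedy selection is an assumption made here for
    brevity and is not guaranteed to be optimal.
    """
    adj = {p: set() for p in profiles}
    for a, b in redundant_pairs:
        adj[a].add(b)
        adj[b].add(a)

    remaining = set(profiles)
    clusters = []
    while remaining:
        # Restrict the graph to profiles not yet assigned to a cluster.
        sub = {v: adj[v] & remaining for v in remaining}
        # Largest clique first; ties broken by member names for determinism.
        best = min(maximal_cliques(sub),
                   key=lambda c: (-len(c), tuple(sorted(c))))
        clusters.append(sorted(best))
        remaining -= best
    return clusters


# Toy input: A, B, C are mutually redundant; C also overlaps D; E is alone.
profiles = ["A", "B", "C", "D", "E"]
pairs = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
clusters = disjoint_clique_clusters(profiles, pairs)
# Hypothetical rule: the first member of each cluster stands in as its UFW.
representatives = [c[0] for c in clusters]
print(clusters, representatives)
```

In this toy case C and D overlap, but D is not redundant with A or B, so D cannot join the A, B, C cluster under full linkage and ends up as its own word.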

SUBMITTER: Scaiewicz A 

PROVIDER: S-EPMC6042118 | biostudies-literature | 2018 Jun

REPOSITORIES: biostudies-literature


Publications

Unique function words characterize genomic proteins.

Andrea Scaiewicz, Michael Levitt

Proceedings of the National Academy of Sciences of the United States of America, 2018-06-12, issue 26



Similar Datasets

S-EPMC3532174 | biostudies-literature
S-EPMC1592090 | biostudies-other
S-EPMC8397784 | biostudies-literature
S-EPMC2375138 | biostudies-literature
S-EPMC3541209 | biostudies-literature
S-EPMC2865649 | biostudies-literature
S-EPMC3364958 | biostudies-literature
S-EPMC3275550 | biostudies-literature
2012-11-01 | GSE39454 | GEO
S-EPMC5091715 | biostudies-literature