Dataset Information

Unique function words characterize genomic proteins.

ABSTRACT: Between 2009 and 2016 the number of protein sequences from known species increased 10-fold from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological function but CDART often matches the same region of a protein by two or more profiles. Such synonyms complicate estimates of functional complexity. We do full-linkage clustering of redundant profiles by finding maximum disjoint cliques: Each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs in sequences with multiple domain architectures (MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of "words" or UFWs (57% shared), the "sentences" (MDAs) are different (1.3% shared).
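
The clustering step in the abstract (full-linkage grouping of redundant profiles into disjoint cliques, each replaced by one representative) can be illustrated with a small sketch. This is not the authors' implementation: the function name `greedy_disjoint_cliques`, the input format, the greedy heuristic, and the rule for picking a representative are all assumptions made for illustration. It also assumes the pairs of profiles that match the same protein region have already been computed.

```python
def greedy_disjoint_cliques(profiles, overlaps):
    """Partition profiles into disjoint cliques of the redundancy graph.

    profiles: iterable of profile IDs (hypothetical names).
    overlaps: set of frozenset({a, b}) pairs judged redundant (assumed given).

    Full (complete) linkage means every member of a cluster must be
    redundant with every other member, i.e. each cluster is a clique.
    This is a greedy heuristic and is not guaranteed to find the
    largest possible cliques.
    """
    # Build an adjacency map for the undirected redundancy graph.
    adj = {p: set() for p in profiles}
    for pair in overlaps:
        a, b = tuple(pair)
        adj[a].add(b)
        adj[b].add(a)

    remaining = set(adj)
    clusters = []
    while remaining:
        best = None
        for seed in sorted(remaining):
            clique = [seed]
            candidates = adj[seed] & remaining
            # Try well-connected candidates first; ties broken by name.
            ordered = sorted(candidates,
                             key=lambda x: (-len(adj[x] & remaining), x))
            for c in ordered:
                # Admit c only if it is linked to every current member.
                if all(c in adj[m] for m in clique):
                    clique.append(c)
            if best is None or len(clique) > len(best):
                best = clique
        clusters.append(sorted(best))
        remaining -= set(best)
    return clusters


# Toy example: A, B, C are mutually redundant; D overlaps only C; E is alone.
pairs = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
clusters = greedy_disjoint_cliques("ABCDE", {frozenset(p) for p in pairs})

# Assumption: the alphabetically first member stands in as the
# representative; the abstract does not say how it is chosen.
representatives = [members[0] for members in clusters]
print(clusters)
print(representatives)
```

In the toy example, D does not join the A/B/C cluster even though it overlaps C, because full linkage requires it to overlap A and B as well.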

SUBMITTER: Scaiewicz A

PROVIDER: S-EPMC6042118 | biostudies-literature | 2018 Jun

REPOSITORIES: biostudies-literature

Publications

Unique function words characterize genomic proteins.

Scaiewicz A, Levitt M

Proceedings of the National Academy of Sciences of the United States of America, 2018 Jun 12, issue 26

PMID: 29895692

Similar Datasets

Project description: BACKGROUND: Oomycetes are fungal-like microorganisms evolutionarily distinct from true fungi, belonging to the Stramenopile lineage and comprising major plant pathogens. Both oomycetes and fungi express proteins able to interact with cellulose, a major component of plant and oomycete cell walls, through the presence of a carbohydrate-binding module belonging to family 1 (CBM1). Fungal CBM1-containing proteins were implicated in cellulose degradation, whereas in oomycetes the Cellulose Binding Elicitor Lectin (CBEL), a well-characterized CBM1-protein from Phytophthora parasitica, was implicated in cell wall integrity, adhesion to cellulosic substrates and induction of plant immunity. RESULTS: To extend our knowledge of CBM1-containing proteins in oomycetes, we conducted a comprehensive analysis of 60 fungal and 7 oomycete genomes, leading to the identification of 518 CBM1-containing proteins. In plant-interacting microorganisms, the larger number of CBM1-protein coding genes is expressed by necrotrophic and hemibiotrophic pathogens, whereas a strong reduction of these genes is observed in symbionts and biotrophs. In fungi, more than 70% of CBM1-containing proteins correspond to enzymatic proteins in which CBM1 is associated with a catalytic unit involved in cellulose degradation. In oomycetes, more than 90% of proteins are similar to CBEL, in which CBM1 is associated with a non-catalytic PAN/Apple domain known to interact with specific carbohydrates or proteins. Distinct Stramenopile genomes, like those of diatoms and brown algae, are devoid of CBM1 coding genes. A 3D structural model of the CBM1-PAN/Apple association was built, allowing the identification of amino acid residues interacting with cellulose and suggesting the putative interaction of the PAN/Apple domain with another type of glucan. By Surface Plasmon Resonance experiments, we showed that CBEL binds to glycoproteins through galactose or N-acetyl-galactosamine motifs.
CONCLUSIONS: This study provides insight into the evolution and biological roles of CBM1-containing proteins from oomycetes. We show that while CBM1s from fungi and oomycetes are similar, they team up with different protein domains, either in proteins implicated in the degradation of plant cell wall components in the case of fungi or in proteins involved in adhesion to polysaccharidic substrates in the case of oomycetes. This work highlights the unique role and evolution of CBM1 proteins in oomycetes within the Stramenopile lineage.

Project description: BACKGROUND: Leukocyte infiltration plays an important role in the pathogenesis and progression of myositis, and is highly associated with disease severity. Currently, there is a lack of: efficacious therapies for myositis; understanding of the molecular features important for disease pathogenesis; and potential molecular biomarkers for characterizing inflammatory myopathies to aid in clinical development. METHODS: In this study, we developed a simple model and predicted that 1) leukocyte-specific transcripts (including both protein-coding transcripts and microRNAs) should be coherently overexpressed in myositis muscle and 2) the level of over-expression of these transcripts should be correlated with leukocyte infiltration. We applied this model to assess immune cell infiltration in myositis by examining mRNA and microRNA (miRNA) expression profiles in muscle biopsies from 31 myositis patients and 5 normal controls. RESULTS: Several gene signatures, including a leukocyte index, type 1 interferon (IFN), MHC class I, and immunoglobulin signature, were developed to characterize myositis patients at the molecular level. The leukocyte index, consisting of genes predominantly associated with immune function, displayed strong concordance with pathological assessment of immune cell infiltration. This leukocyte index was subsequently utilized to differentiate transcriptional changes due to leukocyte infiltration from other alterations in myositis muscle. Results from this differentiation revealed biologically relevant differences in the relationship between the type 1 IFN pathway, miR-146a, and leukocyte infiltration within various myositis subtypes. CONCLUSIONS: Results indicate that a likely interaction between miR-146a expression and the type 1 IFN pathway is confounded by the level of leukocyte infiltration into muscle tissue.
Although the role of miR-146a in myositis remains uncertain, our results highlight the potential benefit of deconvoluting the source of transcriptional changes in myositis muscle or other heterogeneous tissue samples. Taken together, the leukocyte index and other gene signatures developed in this study may be potential molecular biomarkers to help to further characterize inflammatory myopathies and aid in clinical development. These hypotheses need to be confirmed in separate and sufficiently powered clinical trials.
