Dataset Information

Isofunctional Protein Subfamily Detection Using Data Integration and Spectral Clustering.

ABSTRACT: As increasingly more genomes are sequenced, the vast majority of proteins may only be annotated computationally, given that experimental investigation is extremely costly. This highlights the need for computational methods to determine protein functions quickly and reliably. We believe dividing a protein family into subtypes which share specific functions uncommon to the whole family reduces the function annotation problem's complexity. Hence, this work's purpose is to detect isofunctional subfamilies inside a family of unknown function, while identifying differentiating residues. Similarity between protein pairs according to various properties is interpreted as functional similarity evidence. Data are integrated using genetic programming and provided to a spectral clustering algorithm, which creates clusters of similar proteins. The proposed framework was applied to well-known protein families and to a family of unknown function, then compared to ASMC. Results showed our fully automated technique obtained better clusters than ASMC for two families, and equivalent results for two others, including one whose clusters were manually defined. Clusters produced by our framework showed great correspondence with the known subfamilies, besides being more contrasting than those produced by ASMC. Additionally, for the families whose specificity determining positions are known, such residues were among those our technique considered most important to differentiate a given group. When run with the crotonase and enolase SFLD superfamilies, the results showed great agreement with this gold standard. Best results consistently involved multiple data types, thus confirming our hypothesis that similarities according to different knowledge domains may be used as functional similarity evidence.
Our main contributions are the proposed strategy for selecting and integrating data types, along with the ability to work with noisy and incomplete data; domain knowledge usage for detecting subfamilies in a family with different specificities, thus reducing the complexity of the experimental function characterization problem; and the identification of residues responsible for specificity.
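
ILLUSTRATIVE SKETCH: The pipeline summarized in the abstract (several pairwise similarity matrices, integrated into one, then spectrally clustered) can be sketched roughly as follows. This is not the authors' implementation. It assumes NumPy, toy similarity values invented for illustration, a fixed weighted average standing in for the genetic-programming integration step, and a two-cluster split by the sign of the Fiedler vector standing in for a general spectral clustering algorithm. The names integrate, spectral_bipartition, seq_sim and struct_sim are hypothetical.

```python
import numpy as np


def integrate(similarities, weights):
    """Combine similarity matrices with a weighted average.

    Assumption: a fixed weighted average is only a simple stand-in for the
    genetic-programming integration described in the abstract.
    """
    w = np.asarray(weights, dtype=float)
    stacked = np.stack([np.asarray(s, dtype=float) for s in similarities])
    combined = np.tensordot(w / w.sum(), stacked, axes=1)
    return (combined + combined.T) / 2.0  # enforce symmetry


def spectral_bipartition(affinity):
    """Split items into two clusters using the sign of the Fiedler vector
    of the symmetric normalized graph Laplacian."""
    a = np.array(affinity, dtype=float)
    np.fill_diagonal(a, 0.0)  # ignore self-similarity
    degree = a.sum(axis=1)    # assumed strictly positive for every item
    d_inv_sqrt = 1.0 / np.sqrt(degree)
    laplacian = np.eye(len(a)) - d_inv_sqrt[:, None] * a * d_inv_sqrt[None, :]
    _, eigenvectors = np.linalg.eigh(laplacian)  # eigenvalues ascending
    fiedler = eigenvectors[:, 1]  # eigenvector of second-smallest eigenvalue
    labels = (fiedler > 0).astype(int)
    if labels[0] == 1:  # eigenvector sign is arbitrary; pin item 0 to cluster 0
        labels = 1 - labels
    return labels


# Toy data (invented): six proteins, two hidden subfamilies of three each.
groups = np.array([0, 0, 0, 1, 1, 1])
same = groups[:, None] == groups[None, :]
seq_sim = np.where(same, 0.9, 0.1)     # hypothetical sequence similarity
struct_sim = np.where(same, 0.8, 0.2)  # hypothetical structural similarity

affinity = integrate([seq_sim, struct_sim], weights=[0.6, 0.4])
labels = spectral_bipartition(affinity)
print(labels.tolist())
```

On real families, more than two clusters would typically be needed, which calls for clustering several leading eigenvectors (for example with k-means) rather than thresholding a single one.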

SUBMITTER: Boari de Lima E

PROVIDER: S-EPMC4922564 | biostudies-literature | 2016 Jun

REPOSITORIES: biostudies-literature

Publications

Isofunctional Protein Subfamily Detection Using Data Integration and Spectral Clustering.

Boari de Lima, Elisa; Meira, Wagner; Melo-Minardi, Raquel Cardoso de

PLoS Computational Biology, 2016-06-27, issue 6

PMID: 27348631

Similar Datasets

Revealing functionally coherent subsets using a spectral clustering and an information integration approach.
| S-EPMC3542577 | biostudies-literature

Clustering-independent analysis of genomic data using spectral simplicial theory.
| S-EPMC6897424 | biostudies-literature

Spectral clustering of protein sequences.
| S-EPMC1409676 | biostudies-literature

Determination of biomarkers from microarray data using graph neural network and spectral clustering.
| S-EPMC8668890 | biostudies-literature

Top-down clustering for protein subfamily identification.
| S-EPMC3653887 | biostudies-other

FragHub: A Mass Spectral Library Data Integration Workflow.
| S-EPMC11295123 | biostudies-literature

SMAC: Simultaneous Mapping and Clustering Using Spectral Decompositions.
| S-EPMC7394310 | biostudies-literature

Diffusion model based spectral clustering for protein-protein interaction networks.
| S-EPMC2935381 | biostudies-literature

Rational partitioning of spectral feature space for effective clustering of massive spectral image data.
| S-EPMC11439947 | biostudies-literature

Data reduction for spectral clustering to analyze high throughput flow cytometry data.
| S-EPMC2923634 | biostudies-literature