Dataset Information

Application of an interpretable classification model on Early Folding Residues during protein folding.

ABSTRACT:

Background

Machine learning strategies are prominent tools for data analysis. Especially in the life sciences, they have become increasingly important for handling the growing datasets collected by the scientific community. Meanwhile, algorithms improve in performance but also gain complexity, and tend to neglect the interpretability and comprehensibility of the resulting models.

Results

Generalized Matrix Learning Vector Quantization (GMLVQ) is a supervised, prototype-based machine learning method that provides comprehensive visualization capabilities not present in other classifiers, allowing a fine-grained interpretation of the data. In contrast to commonly used machine learning strategies, GMLVQ is well-suited for imbalanced classification problems, which are frequent in the life sciences. We present a Weka plug-in implementing GMLVQ. The feasibility of GMLVQ is demonstrated on a dataset of Early Folding Residues (EFR), which have been shown to initiate and guide the protein folding process. Using 27 features, an area under the receiver operating characteristic curve of 76.6% was achieved, which is comparable to other state-of-the-art classifiers. The obtained model is accessible at https://biosciences.hs-mittweida.de/efpred/.
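To make the prototype-based idea concrete, here is a minimal sketch of how a GMLVQ-style classifier assigns a label once it has been trained. It assumes the standard adaptive distance d(x, w) = (x - w)^T Omega^T Omega (x - w) and reads the diagonal of Lambda = Omega^T Omega as per-feature relevance. The function names, the two-feature prototypes, and the matrix Omega are hypothetical toy values chosen for illustration; they are not taken from the Weka plug-in or from the trained EFR model, and training itself is not shown.

```python
# Illustrative sketch only: nearest-prototype classification with a
# GMLVQ-style adaptive distance. All values below are made-up toy data.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def gmlvq_distance(x, prototype, omega):
    """Squared distance (x - w)^T Omega^T Omega (x - w)."""
    diff = [xi - wi for xi, wi in zip(x, prototype)]
    projected = matvec(omega, diff)
    return sum(p * p for p in projected)

def classify(x, prototypes, labels, omega):
    """Return the label of the prototype closest to x."""
    distances = [gmlvq_distance(x, w, omega) for w in prototypes]
    return labels[distances.index(min(distances))]

def relevance_diagonal(omega):
    """Diagonal of Lambda = Omega^T Omega, one entry per feature."""
    n_features = len(omega[0])
    return [sum(row[j] ** 2 for row in omega) for j in range(n_features)]

# Hypothetical model: one prototype per class, two features.
prototypes = [[0.0, 0.0], [1.0, 1.0]]
labels = ["non-EFR", "EFR"]
omega = [[1.0, 0.0], [0.0, 0.1]]  # second feature is weighted down

print(classify([0.9, 0.0], prototypes, labels, omega))
print(relevance_diagonal(omega))
```

In this toy setup the first feature dominates the distance, so a sample that matches the "EFR" prototype on that feature is assigned to it even though it differs on the second. Inspecting the relevance diagonal in this way is one route by which such a model can be interpreted feature by feature.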

Conclusions

The application to EFR prediction demonstrates how an easily interpretable classification model can promote the comprehension of biological mechanisms. The results shed light on the features of EFR that were reported as most influential for the classification: EFR are embedded in ordered secondary structure elements, and they participate in networks of hydrophobic residues. The visualization capabilities of GMLVQ are presented as we demonstrate how to interpret the results.

SUBMITTER: Bittrich S

PROVIDER: S-EPMC6321665 | biostudies-literature | 2019

REPOSITORIES: biostudies-literature

Publications

Application of an interpretable classification model on Early Folding Residues during protein folding.

Sebastian Bittrich, Marika Kaden, Christoph Leberecht, Florian Kaiser, Thomas Villmann, Dirk Labudde

BioData Mining, 2019-01-05

PMID: 30627219

Similar Datasets

Quantification of Drive-Response Relationships Between Residues During Protein Folding. | S-EPMC3819712 | biostudies-literature

Early classification of multivariate temporal observations by extraction of interpretable shapelets. | S-EPMC3475011 | biostudies-literature

Interpretable CNN for ischemic stroke subtype classification with active model adaptation. | S-EPMC8729146 | biostudies-literature

Hypothetical in silico model of the early-stage intermediate in protein folding. | S-EPMC3778223 | biostudies-literature

Changes of protein stiffness during folding detect protein folding intermediates. | S-EPMC3923959 | biostudies-literature

Characterizing the relation of functional and Early Folding Residues in protein structures using the example of aminoacyl-tRNA synthetases. | S-EPMC6207335 | biostudies-literature

A Stochastic Landscape Approach for Protein Folding State Classification. | S-EPMC11238538 | biostudies-literature

Protein folding: Funnel model revised. | S-EPMC11550765 | biostudies-literature

SeqRate: sequence-based protein folding type classification and rates prediction. | S-EPMC2863059 | biostudies-literature

Interpretable machine learning with tree-based shapley additive explanations: Application to metabolomics datasets for binary classification. | S-EPMC10159207 | biostudies-literature