Dataset Information

Identify High-Quality Protein Structural Models by Enhanced K-Means.

ABSTRACT: Background. One critical issue in protein three-dimensional structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys. Currently, clustering algorithms are widely used to identify near-native models; however, their performance is dependent upon different conformational decoys, and, for some algorithms, the accuracy declines when the decoy population increases. Results. Here, we proposed two enhanced K-means clustering algorithms capable of robustly identifying high-quality protein structural models. The first one employs the clustering algorithm SPICKER to determine the initial centroids for basic K-means clustering (SK-means), whereas the other employs squared distance to optimize the initial centroids (K-means++). Our results showed that SK-means and K-means++ were more robust as compared with SPICKER alone, detecting 33 (59%) and 42 (75%) of 56 targets, respectively, with template modeling scores better than or equal to those of SPICKER. Conclusions. We observed that the classic K-means algorithm showed a similar performance to that of SPICKER, which is a widely used algorithm for protein-structure identification. Both SK-means and K-means++ demonstrated substantial improvements relative to results from SPICKER and classical K-means.

SUBMITTER: Wu H

PROVIDER: S-EPMC5381204 | biostudies-literature | 2017

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Identify High-Quality Protein Structural Models by Enhanced K-Means.

Wu Hongjie H Li Haiou H Jiang Min M Chen Cheng C Lv Qiang Q Wu Chuang C

BioMed research international 20170322

Background. One critical issue in protein three-dimensional structure prediction using either ab initio or comparative modeling involves identification of high-quality protein structural models from generated decoys. Currently, clustering algorithms are widely used to identify near-native models; however, their performance is dependent upon different conformational decoys, and, for some algorithms, the accuracy declines when the decoy population increases. Results. Here, we propose ...[more]

PMID: 28421198

Similar Datasets

Project description:ObjectiveAcute ischemic stroke (AIS) is a heterogeneous condition. To stratify the heterogeneity, identify novel phenotypes, and develop Clinlabomics models of phenotypes that can conduct more personalized treatments for AIS.MethodsIn a retrospective analysis, consecutive AIS and non-AIS inpatients were enrolled. An unsupervised k-means clustering algorithm was used to classify AIS patients into distinct novel phenotypes. Besides, the intergroup comparisons across the phenotypes were performed in clinical and laboratory data. Next, the least absolute shrinkage and selection operator (LASSO) algorithm was used to select essential variables. In addition, Clinlabomics predictive models of phenotypes were established by a support vector machines (SVM) classifier. We used the area under curve (AUC), accuracy, sensitivity, and specificity to evaluate the performance of the models.ResultsOf the three derived phenotypes in 909 AIS patients [median age 64 (IQR: 17) years, 69% male], in phenotype 1 (N = 401), patients were relatively young and obese and had significantly elevated levels of lipids. Phenotype 2 (N = 463) was associated with abnormal ion levels. Phenotype 3 (N = 45) was characterized by the highest level of inflammation, accompanied by mild multiple-organ dysfunction. The external validation cohort prospectively collected 507 AIS patients [median age 60 (IQR: 18) years, 70% male]. Phenotype characteristics were similar in the validation cohort. After LASSO analysis, Clinlabomics models of phenotype 1 and 2 were constructed by the SVM algorithm, yielding high AUC (0.977, 95% CI: 0.961-0.993 and 0.984, 95% CI: 0.971-0.997), accuracy (0.936, 95% CI: 0.922-0.956 and 0.952, 95% CI: 0.938-0.972), sensitivity (0.984, 95% CI: 0.968-0.998 and 0.958, 95% CI: 0.939-0.984), and specificity (0.892, 95% CI: 0.874-0.926 and 0.945, 95% CI: 0.923-0.969).ConclusionIn this study, three novel phenotypes that reflected the abnormal variables of AIS patients were identified, and the Clinlabomics models of phenotypes were established, which are conducive to individualized treatments.

Project description:The state-of-the-art to assess the structural quality of docking models is currently based on three related yet independent quality measures: Fnat, LRMS, and iRMS as proposed and standardized by CAPRI. These quality measures quantify different aspects of the quality of a particular docking model and need to be viewed together to reveal the true quality, e.g. a model with relatively poor LRMS (>10Å) might still qualify as 'acceptable' with a descent Fnat (>0.50) and iRMS (<3.0Å). This is also the reason why the so called CAPRI criteria for assessing the quality of docking models is defined by applying various ad-hoc cutoffs on these measures to classify a docking model into the four classes: Incorrect, Acceptable, Medium, or High quality. This classification has been useful in CAPRI, but since models are grouped in only four bins it is also rather limiting, making it difficult to rank models, correlate with scoring functions or use it as target function in machine learning algorithms. Here, we present DockQ, a continuous protein-protein docking model quality measure derived by combining Fnat, LRMS, and iRMS to a single score in the range [0, 1] that can be used to assess the quality of protein docking models. By using DockQ on CAPRI models it is possible to almost completely reproduce the original CAPRI classification into Incorrect, Acceptable, Medium and High quality. An average PPV of 94% at 90% Recall demonstrating that there is no need to apply predefined ad-hoc cutoffs to classify docking models. Since DockQ recapitulates the CAPRI classification almost perfectly, it can be viewed as a higher resolution version of the CAPRI classification, making it possible to estimate model quality in a more quantitative way using Z-scores or sum of top ranked models, which has been so valuable for the CASP community. The possibility to directly correlate a quality measure to a scoring function has been crucial for the development of scoring functions for protein structure prediction, and DockQ should be useful in a similar development in the protein docking field. DockQ is available at http://github.com/bjornwallner/DockQ/.

Dataset Information

Identify High-Quality Protein Structural Models by Enhanced K-Means.

Publications

Identify High-Quality Protein Structural Models by Enhanced <i>K</i>-Means.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets