Dataset Information

Predicting protein model correctness in Coot using machine learning.

ABSTRACT: Manually identifying and correcting errors in protein models can be a slow process, but improvements in validation tools and automated model-building software can contribute to reducing this burden. This article presents a new correctness score that is produced by combining multiple sources of information using a neural network. The residues in 639 automatically built models were marked as correct or incorrect by comparing them with the coordinates deposited in the PDB. A number of features were also calculated for each residue using Coot, including map-to-model correlation, density values, B factors, clashes, Ramachandran scores, rotamer scores and resolution. Two neural networks were created using these features as inputs: one to predict the correctness of main-chain atoms and the other for side chains. The 639 structures were split into 511 that were used to train the neural networks and 128 that were used to test performance. The predicted correctness scores could correctly categorize 92.3% of the main-chain atoms and 87.6% of the side chains. A Coot ML Correctness script was written to display the scores in a graphical user interface as well as for the automatic pruning of chains, residues and side chains with low scores. The automatic pruning function was added to the CCP4i2 Buccaneer automated model-building pipeline, leading to significant improvements, especially for high-resolution structures.
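
The 511/128 split described in the abstract can be illustrated with a minimal sketch. The abstract does not say how structures were assigned to each set, so the random shuffle, the seed, the function name `split_structures` and the placeholder IDs below are all assumptions made for illustration. The sketch only shows that splitting at the level of whole structures keeps residues from one model out of both sets.

```python
import random


def split_structures(structure_ids, n_test, seed=0):
    """Shuffle structure IDs and hold out n_test of them for testing.

    Splitting whole structures (rather than individual residues) means no
    structure contributes residues to both the training and the test set.
    """
    ids = list(structure_ids)
    random.Random(seed).shuffle(ids)  # assumed: a simple seeded shuffle
    return ids[n_test:], ids[:n_test]


# Hypothetical placeholder IDs standing in for the 639 built models.
structures = [f"model_{i:03d}" for i in range(639)]
train, test = split_structures(structures, n_test=128)
print(len(train), len(test))  # 511 128
```

The held-out fraction is 128/639, roughly 20% of the structures.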

SUBMITTER: Bond PS

PROVIDER: S-EPMC7397494 | biostudies-literature | 2020 Aug

REPOSITORIES: biostudies-literature


Publications

Predicting protein model correctness in Coot using machine learning.

Bond PS, Wilson KS, Cowtan KD

Acta Crystallographica Section D, Structural Biology, 2020-07-27, Pt 8


PMID: 32744253

Similar Datasets

Project description: Hypertension is a widely prevalent disease, and uncontrolled hypertension predisposes affected individuals to severe adverse effects. Though the importance of controlling hypertension is clear, the multitude of therapeutic regimens and patient factors that affect the success of blood pressure control makes it difficult to predict whether a patient's blood pressure will be controlled. This project investigates whether machine learning can accurately predict the control of a patient's hypertension within 12 months of a clinical encounter. To build the machine learning model, a retrospective review of the electronic medical records of 350,008 patients 18 years of age and older between January 1, 2015 and June 1, 2022 was performed to form model training and testing cohorts. The data used in the model included medication combinations, patient laboratory values, vital sign measurements, comorbidities, healthcare encounters, and demographic information. The mean age of the patient population was 65.6 years; 161,283 (46.1%) were men and 275,001 (78.6%) were white. A sliding time window of data was used both to prevent data leakage from training sets to test sets and to maximize model performance. This sliding window resulted in using the study data to create 287 predictive models, each using 2 years of training data and one week of testing data, for a total study duration of five and a half years. Model performance was combined across all models. The primary outcome, prediction of blood pressure control within 12 months, demonstrated an area under the curve (AUC) of 0.76 (95% confidence interval, 0.75-0.76), sensitivity of 61.52% (61.0-62.03%), specificity of 75.69% (75.25-76.13%), positive predictive value of 67.75% (67.51-67.99%), and negative predictive value of 70.49% (70.32-70.66%). An AUC of 0.756 is considered to be moderately good for machine learning models.
While the accuracy of this model is promising, it is impossible to state with certainty the clinical relevancy of any clinical support ML model without deploying it in a clinical setting and studying its impact on health outcomes. By also incorporating uncertainty analysis for every prediction, the authors believe that this approach offers the best-known solution to predicting hypertension control and that machine learning may be able to improve the accuracy of hypertension control predictions using patient information already available in the electronic health record. This method can serve as a foundation with further research to strengthen the model accuracy and to help determine clinical relevance.
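
The sliding-window scheme described above (287 models, each with 2 years of training data followed by one week of test data) can be sketched as follows. This is an illustration only: the start date, the 730-day approximation of two years, and the function name `sliding_windows` are assumptions, not details taken from the project description.

```python
from datetime import date, timedelta


def sliding_windows(first_test_start, n_models, train_days=730, test_days=7):
    """Return (train_start, train_end, test_start, test_end) per model.

    End dates are exclusive. Each model trains only on data that precedes
    its own test week, so no test-week data leaks into that model's training.
    """
    windows = []
    for k in range(n_models):
        test_start = first_test_start + timedelta(days=test_days * k)
        train_start = test_start - timedelta(days=train_days)
        test_end = test_start + timedelta(days=test_days)
        windows.append((train_start, test_start, test_start, test_end))
    return windows


# Hypothetical start date, chosen only so the example runs.
windows = sliding_windows(date(2017, 1, 1), n_models=287)
test_span_days = (windows[-1][3] - windows[0][2]).days
print(len(windows), test_span_days, round(test_span_days / 365.25, 2))
# 287 2009 5.5
```

The 287 consecutive one-week test windows cover 2009 days, which matches the stated study duration of about five and a half years.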

Project description: Background: The incorporation of machine learning is becoming more prevalent in the clinical setting. By predicting clinical outcomes, machine learning can provide clinicians with a valuable tool for refining precision medicine approaches and improving treatment outcomes. Methods: This was a post hoc analysis of pooled patient-level data from the global, real-world ACTION and ASCORE trials in patients with rheumatoid arthritis (RA) initiating abatacept. Patient demographic and disease characteristics were input across 10 machine learning models used to predict 12-month treatment retention. Retention was defined as treatment for > 365 days or ≤ 365 days in patients who achieved remission or major clinical response (based on European Alliance of Associations for Rheumatology response criteria). The pooled dataset was split into a training/validation cohort for model development and a test cohort for an unbiased evaluation of performance. SHapley Additive exPlanation (SHAP) values determined the level of importance and directionality for key patient features predicting abatacept retention. Results: The pooled ACTION and ASCORE dataset included 5320 patients with RA (mean [standard deviation] age 57.7 [12.7] years; 79% female). The 12-month abatacept retention rate was 61% (n = 3236) with a discontinuation rate of 39% (n = 2037). In the training set (n = 4218), the gradient-boosting classifier model demonstrated the best performance (testing accuracy: 62%). This model had an area under the receiver operating characteristic curve (95% confidence interval) of 0.620 (0.586, 0.653) and F1 score of 0.659 (0.625, 0.689) in the test set of patients (n = 1055). Using this model, the five most important variables predicting 12-month abatacept retention were low body mass index (BMI), low American College of Rheumatology functional status class, anti-citrullinated protein antibody (ACPA) positivity, low Patient Global Assessment, and younger age. Conclusions: The gradient-boosting classifier model identified key patient features predictive of abatacept retention from this large, real-world study population. The SHAP values conveyed the directionality and importance of BMI, functional status, ACPA serostatus, Patient Global Assessment, and age for abatacept retention. Findings are consistent with previous observations and help validate the machine learning approach for predictive modelling in RA treatment, and may help inform clinical decision making. Trial registration: NCT02109666 (ACTION), NCT02090556 (ASCORE).
