Dataset Information

Better models by discarding data?

ABSTRACT: In macromolecular X-ray crystallography, typical data sets have substantial multiplicity. This can be used to calculate the consistency of repeated measurements and thereby assess data quality. Recently, the properties of a correlation coefficient, CC1/2, that can be used for this purpose were characterized and it was shown that CC1/2 has superior properties compared with 'merging' R values. A derived quantity, CC*, links data and model quality. Using experimental data sets, the behaviour of CC1/2 and the more conventional indicators were compared in two situations of practical importance: merging data sets from different crystals and selectively rejecting weak observations or (merged) unique reflections from a data set. In these situations controlled 'paired-refinement' tests show that even though discarding the weaker data leads to improvements in the merging R values, the refined models based on these data are of lower quality. These results show the folly of such data-filtering practices aimed at improving the merging R values. Interestingly, in all of these tests CC1/2 is the one data-quality indicator for which the behaviour accurately reflects which of the alternative data-handling strategies results in the best-quality refined model. Its properties in the presence of systematic error are documented and discussed.
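
As an illustration of the quantities named in the abstract, the following is a minimal Python sketch, not the authors' implementation. It assumes the usual definition of CC1/2 (the Pearson correlation between the mean intensities of two random half-sets of the repeated measurements of each unique reflection) and the commonly cited relation CC* = sqrt(2 CC1/2 / (1 + CC1/2)); neither formula is spelled out in this record. The function names `pearson`, `cc_half` and `cc_star`, and the dictionary layout of the input, are hypothetical choices made for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def cc_half(observations, rng):
    """Estimate CC1/2 from repeated intensity measurements.

    observations: dict mapping a unique-reflection key to a list of
    measured intensities (assumed layout, for illustration only).
    rng: a random.Random instance used to split each list in two.
    """
    half_a, half_b = [], []
    for measurements in observations.values():
        if len(measurements) < 2:
            continue  # a single observation cannot be split into halves
        shuffled = list(measurements)
        rng.shuffle(shuffled)
        mid = len(shuffled) // 2
        half_a.append(sum(shuffled[:mid]) / mid)
        half_b.append(sum(shuffled[mid:]) / (len(shuffled) - mid))
    return pearson(half_a, half_b)


def cc_star(cc12):
    """CC* = sqrt(2*CC1/2 / (1 + CC1/2)); returns NaN for negative CC1/2."""
    if cc12 < 0:
        return math.nan
    return math.sqrt(2.0 * cc12 / (1.0 + cc12))


if __name__ == "__main__":
    # Synthetic data: 500 reflections, 4 noisy measurements of each.
    rng = random.Random(0)
    data = {}
    for h in range(500):
        true_intensity = rng.expovariate(1.0)
        data[h] = [true_intensity + rng.gauss(0.0, 1.0) for _ in range(4)]
    cc12 = cc_half(data, rng)
    print("CC1/2 =", round(cc12, 3), " CC* =", round(cc_star(cc12), 3))
```

Because the half-set split is random, the CC1/2 estimate varies slightly from run to run unless the random generator is seeded, as it is in the demo above.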

SUBMITTER: Diederichs K

PROVIDER: S-EPMC3689524 | biostudies-literature | 2013 Jul

REPOSITORIES: biostudies-literature

Publications

Better models by discarding data?

Diederichs K, Karplus PA

Acta Crystallographica. Section D, Biological Crystallography, 2013-06-15, Pt 7

PMID: 23793147

Similar Datasets

Project description: Advances in bioacoustic technology, such as the use of automatic recording devices, allow wildlife monitoring at large spatial scales. However, such technology can produce enormous amounts of audio data that must be processed and analyzed. One potential solution to this problem is the use of automated sound recognition tools, but we lack a general framework for developing and validating these tools. Recognizers are computer models of an animal sound assembled from "training data" (i.e., actual samples of vocalizations). The settings of variables used to create recognizers can impact performance, and the use of different settings can result in large differences in error rates that can be exploited for different monitoring objectives. We used Song Scope (Wildlife Acoustics Inc.) to build recognizers, and vocalizations of the wood frog (Lithobates sylvaticus) to test how different settings and amounts of training data influence recognizer performance. Performance was evaluated using precision (the probability of a recognizer match being a true match) and sensitivity (the proportion of vocalizations detected) based on a receiver operating characteristic (ROC) curve-determined score threshold. Evaluations were conducted using recordings not used to build the recognizer. Wood frog recognizer performance was sensitive to setting changes in four out of nine variables, and small improvements were achieved by using additional training data from different sites and from the same recording, but not from different recordings from the same site. Overall, the effect of changes to variable settings was much greater than the effect of increasing training data. Additionally, by testing the performance of the recognizer on vocalizations not used to build the recognizer, we discovered that Type I error rates appear idiosyncratic, so we do not recommend extrapolating them from training to new data, whereas Type II error rates were more consistent, so extrapolation can be justified. Optimizing variable settings on independent recordings led to a better match between recognizer performance and monitoring objectives. We provide general recommendations for application of this methodology with other species and make some suggestions for improvements.

Project description: Objective: The accurate prediction of seizure freedom after epilepsy surgery remains challenging. We investigated whether (1) training more complex models, (2) recruiting larger sample sizes, or (3) using data-driven selection of clinical predictors would improve our ability to predict postoperative seizure outcome using clinical features. We also conducted the first substantial external validation of a machine learning model trained to predict postoperative seizure outcome. Methods: We performed a retrospective cohort study of 797 children who had undergone resective or disconnective epilepsy surgery at a tertiary center. We extracted patient information from medical records and trained three models (a logistic regression, a multilayer perceptron, and an XGBoost model) to predict 1-year postoperative seizure outcome on our data set. We evaluated the performance of a recently published XGBoost model on the same patients. We further investigated the impact of sample size on model performance, using learning curve analysis to estimate performance at samples up to N = 2000. Finally, we examined the impact of predictor selection on model performance. Results: Our logistic regression achieved an accuracy of 72% (95% confidence interval [CI] = 68%-75%, area under the curve [AUC] = .72), whereas our multilayer perceptron (MLP) and XGBoost both achieved accuracies of 71% (MLP: 95% CI = 67%-74%, AUC = .70; own XGBoost: 95% CI = 68%-75%, AUC = .70). There was no significant difference in performance between our three models (all p > .4) and they all performed better than the external XGBoost, which achieved an accuracy of 63% (95% CI = 59%-67%, AUC = .62; p = .005 vs. logistic regression, p = .01 vs. MLP, p = .01 vs. own XGBoost) on our data. All models showed improved performance with increasing sample size, but limited improvements beyond our current sample. The best model performance was achieved with data-driven feature selection. Significance: We show that neither the deployment of complex machine learning models nor the assembly of thousands of patients alone is likely to generate significant improvements in our ability to predict postoperative seizure freedom. We instead propose that improved feature selection alongside collaboration, data standardization, and model sharing is required to advance the field.
