Dataset Information

External validation of multivariable prediction models: a systematic review of methodological conduct and reporting.

ABSTRACT:

Background

Before considering whether to use a multivariable (diagnostic or prognostic) prediction model, it is essential that its performance be evaluated in data that were not used to develop the model (referred to as external validation). We critically appraised the methodological conduct and reporting of external validation studies of multivariable prediction models.

Methods

We conducted a systematic review of articles describing some form of external validation of one or more multivariable prediction models indexed in PubMed core clinical journals published in 2010. Study data were extracted in duplicate on design, sample size, handling of missing data, reference to the original study developing the prediction models and predictive performance measures.

Results

A total of 11,826 articles were identified and 78 were included for full review, which described the evaluation of 120 prediction models in participant data that were not used to develop the model. Thirty-three articles described both the development of a prediction model and an evaluation of its performance on a separate dataset, and 45 articles described only the evaluation of an existing published prediction model on another dataset. Fifty-seven percent of the prediction models were presented and evaluated as simplified scoring systems. Sixteen percent of articles failed to report the number of outcome events in the validation datasets. Fifty-four percent of studies made no explicit mention of missing data. Sixty-seven percent did not report evaluating model calibration, whilst most studies evaluated model discrimination. It was often unclear whether the reported performance measures were for the full regression model or for the simplified models.

Conclusions

The vast majority of studies describing some form of external validation of a multivariable prediction model were poorly reported, with key details frequently not presented. The validation studies were characterised by poor design and inappropriate handling and acknowledgement of missing data, and calibration, one of the key performance measures of a prediction model, was often omitted from the publication. It may therefore not be surprising that an overwhelming majority of developed prediction models are not used in practice, when there is a dearth of well-conducted and clearly reported external validation studies describing their performance on independent participant data.
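The distinction the review draws between discrimination and calibration can be illustrated with a minimal Python sketch. It is not taken from the reviewed studies: the function names (`c_statistic`, `observed_expected_ratio`) and the toy outcomes and predicted risks are hypothetical, and the observed/expected ratio is only one simple summary of calibration, chosen here for brevity.

```python
from itertools import product


def c_statistic(outcomes, risks):
    """Discrimination: share of (event, non-event) pairs in which the
    participant with the event has the higher predicted risk (ties count 0.5)."""
    events = [r for y, r in zip(outcomes, risks) if y == 1]
    non_events = [r for y, r in zip(outcomes, risks) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for r_event, r_non_event in product(events, non_events):
        if r_event > r_non_event:
            concordant += 1.0
        elif r_event == r_non_event:
            concordant += 0.5
    return concordant / (len(events) * len(non_events))


def observed_expected_ratio(outcomes, risks):
    """Calibration-in-the-large: observed events divided by the sum of
    predicted risks (1.0 means the average predicted risk is correct)."""
    return sum(outcomes) / sum(risks)


# Hypothetical validation data (illustrative only, not from the review).
outcomes = [0, 0, 1, 1]
risks = [0.10, 0.40, 0.35, 0.80]
halved = [r / 2 for r in risks]  # same ranking, systematically too low

print(c_statistic(outcomes, risks), observed_expected_ratio(outcomes, risks))
print(c_statistic(outcomes, halved), observed_expected_ratio(outcomes, halved))
```

Halving every predicted risk leaves the C statistic unchanged, because the ranking of participants is the same, but doubles the observed/expected ratio. A report of discrimination alone would therefore not reveal that the second set of predictions is systematically too low.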

SUBMITTER: Collins GS

PROVIDER: S-EPMC3999945 | biostudies-literature | 2014 Mar

REPOSITORIES: biostudies-literature


Publications

External validation of multivariable prediction models: a systematic review of methodological conduct and reporting.

Collins GS, de Groot JA, Dutton S, Omar O, Shanyinde M, Tajar A, Voysey M, Wharton R, Yu LM, Moons KG, Altman DG

BMC Medical Research Methodology, 2014-03-19


PMID: 24645774

Similar Datasets

Project description:

Background: Describe and evaluate the methodological conduct of prognostic prediction models developed using machine learning methods in oncology.

Methods: We conducted a systematic review in MEDLINE and Embase between 01/01/2019 and 05/09/2019, for studies developing a prognostic prediction model using machine learning methods in oncology. We used the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement, Prediction model Risk Of Bias ASsessment Tool (PROBAST) and CHecklist for critical Appraisal and data extraction for systematic Reviews of prediction Modelling Studies (CHARMS) to assess the methodological conduct of included publications. Results were summarised by modelling type: regression-, non-regression-based and ensemble machine learning models.

Results: Sixty-two publications met inclusion criteria developing 152 models across all publications. Forty-two models were regression-based, 71 were non-regression-based and 39 were ensemble models. A median of 647 individuals (IQR: 203 to 4059) and 195 events (IQR: 38 to 1269) were used for model development, and 553 individuals (IQR: 69 to 3069) and 50 events (IQR: 17.5 to 326.5) for model validation. A higher number of events per predictor was used for developing regression-based models (median: 8, IQR: 7.1 to 23.5), compared to alternative machine learning (median: 3.4, IQR: 1.1 to 19.1) and ensemble models (median: 1.7, IQR: 1.1 to 6). Sample size was rarely justified (n = 5/62; 8%). Some or all continuous predictors were categorised before modelling in 24 studies (39%). 46% (n = 24/62) of models reporting predictor selection before modelling used univariable analyses, a common method across all modelling types. Ten out of 24 models for time-to-event outcomes accounted for censoring (42%). A split sample approach was the most popular method for internal validation (n = 25/62, 40%). Calibration was reported in 11 studies. Less than half of models were reported or made available.

Conclusions: The methodological conduct of machine learning based clinical prediction models is poor. Guidance is urgently needed, with increased awareness and education of minimum prediction modelling standards. Particular focus is needed on sample size estimation, development and validation analysis methods, and ensuring the model is available for independent validation, to improve quality of machine learning based clinical prediction models.

Project description:

Background and purpose: Prediction models for outcome of patients with acute ischemic stroke who will undergo endovascular treatment have been developed to improve patient management. The aim of the current study is to provide an overview of preintervention models for functional outcome after endovascular treatment and to validate these models with data from daily clinical practice.

Methods: We systematically searched Medline, Embase, Cochrane, and Web of Science for prediction models. Models identified from the search were validated in the MR CLEAN (Multicenter Randomized Clinical Trial of Endovascular Treatment for Acute Ischemic Stroke in the Netherlands) registry, which includes all patients treated with endovascular treatment within 6.5 hours after stroke onset in the Netherlands between March 2014 and November 2017. Predictive performance was evaluated according to discrimination (area under the curve) and calibration (slope and intercept of the calibration curve). Good functional outcome was defined as a score of 0-2 or 0-3 on the modified Rankin Scale depending on the model.

Results: After screening 3468 publications, 19 models were included in this validation. Variables included in the models mainly addressed clinical and imaging characteristics at baseline. In the validation cohort of 3156 patients, discriminative performance ranged from 0.61 (SPAN-100 [Stroke Prognostication Using Age and NIH Stroke Scale]) to 0.80 (MR PREDICTS). Best-calibrated models were THRIVE (The Totaled Health Risks in Vascular Events; intercept -0.06 [95% CI, -0.14 to 0.02]; slope 0.84 [95% CI, 0.75-0.95]), THRIVE-c (intercept 0.08 [95% CI, -0.02 to 0.17]; slope 0.71 [95% CI, 0.65-0.77]), Stroke Checkerboard score (intercept -0.05 [95% CI, -0.13 to 0.03]; slope 0.97 [95% CI, 0.88-1.08]), and MR PREDICTS (intercept 0.43 [95% CI, 0.33-0.52]; slope 0.93 [95% CI, 0.85-1.01]).

Conclusions: The THRIVE-c score and MR PREDICTS both showed a good combination of discrimination and calibration and were, therefore, superior in predicting functional outcome for patients with ischemic stroke after endovascular treatment within 6.5 hours. Since models used different predictors and several models had relatively good predictive performance, the decision on which model to use in practice may also depend on simplicity of the model, data availability, and the comparability of the population and setting.

Project description:

Objective: To validate all diagnostic prediction models for ruling out pulmonary embolism that are easily applicable in primary care.

Design: Systematic review followed by independent external validation study to assess transportability of retrieved models to primary care medicine.

Setting: 300 general practices in the Netherlands.

Participants: Individual patient dataset of 598 patients with suspected acute pulmonary embolism in primary care.

Main outcome measures: Discriminative ability of all models retrieved by systematic literature search, assessed by calculation and comparison of C statistics. After stratification into groups with high and low probability of pulmonary embolism according to pre-specified model cut-offs combined with qualitative D-dimer test, sensitivity, specificity, efficiency (overall proportion of patients with low probability of pulmonary embolism), and failure rate (proportion of pulmonary embolism cases in group of patients with low probability) were calculated for all models.

Results: Ten published prediction models for the diagnosis of pulmonary embolism were found. Five of these models could be validated in the primary care dataset: the original Wells, modified Wells, simplified Wells, revised Geneva, and simplified revised Geneva models. Discriminative ability was comparable for all models (range of C statistic 0.75-0.80). Sensitivity ranged from 88% (simplified revised Geneva) to 96% (simplified Wells) and specificity from 48% (revised Geneva) to 53% (simplified revised Geneva). Efficiency of all models was between 43% and 48%. Differences were observed between failure rates, especially between the simplified Wells and the simplified revised Geneva models (failure rates 1.2% (95% confidence interval 0.2% to 3.3%) and 3.1% (1.4% to 5.9%), respectively; absolute difference -1.98% (-3.33% to -0.74%)). Irrespective of the diagnostic prediction model used, three patients were incorrectly classified as having low probability of pulmonary embolism; pulmonary embolism was diagnosed only after referral to secondary care.

Conclusions: Five diagnostic pulmonary embolism prediction models that are easily applicable in primary care were validated in this setting. Whereas efficiency was comparable for all rules, the Wells rules gave the best performance in terms of lower failure rates.
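The outcome measures defined in this abstract can be written out from a 2x2 table. The sketch below is an illustration under stated assumptions: `rule_out_metrics` and its argument names are hypothetical, "low probability" is taken to mean the negative classification from the model cut-off combined with the D-dimer test, and the counts in the usage lines are invented rather than taken from the study.

```python
def rule_out_metrics(tp, fp, fn, tn):
    """Summarise a rule-out strategy from 2x2 counts.

    tp: high probability, embolism present   fp: high probability, embolism absent
    fn: low probability, embolism present    tn: low probability, embolism absent
    """
    total = tp + fp + fn + tn
    low_probability = fn + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Efficiency: overall proportion of patients classified as low probability.
        "efficiency": low_probability / total,
        # Failure rate: proportion of embolism cases within the low-probability group.
        "failure_rate": fn / low_probability,
    }


# Invented counts for illustration only (not the study's data).
metrics = rule_out_metrics(tp=90, fp=200, fn=10, tn=300)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Efficiency and failure rate pull in opposite directions: a strategy that labels more patients as low probability avoids more referrals but risks missing more cases, which is why the abstract reports both.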

Project description:

Background: Multivariable prediction models are used in oral health care to identify individuals with an increased likelihood of caries increment. The outcomes of the models should help to manage individualized interventions and to determine the periodicity of service. The objective was to review and critically appraise studies of multivariable prediction models of caries increment.

Methods: Longitudinal studies that developed or validated prediction models of caries and expressed caries increment as a function of at least three predictors were included. PubMed, Cochrane Library, and Web of Science supplemented with reference lists of included studies were searched. Two reviewers independently extracted data using CHARMS (Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modelling Studies) and assessed risk of bias and concern regarding applicability using PROBAST (Prediction model Risk Of Bias ASsessment Tool). Predictors were analysed and model performance was recalculated as estimated positive (LR+) and negative likelihood ratios (LR-) based on sensitivity and specificity presented in the studies included.

Results: Among the 765 reports identified, 21 studies providing 66 prediction models fulfilled the inclusion criteria. Over 150 candidate predictors were considered, and 31 predictors remained in studies of final developmental models: caries experience, mutans streptococci in saliva, fluoride supplements, and visible dental plaque being the most common predictors. Predictive performances varied, providing LR+ and LR- ranges of 0.78-10.3 and 0.0-1.1, respectively. Only four models of coronal caries and one root caries model scored LR+ values of at least 5. All studies were assessed as having high risk of bias, generally due to insufficient number of outcomes in relation to candidate predictors and considerable uncertainty regarding predictor thresholds and measurements. Concern regarding applicability was low overall.

Conclusions: The review calls attention to several methodological deficiencies, and the significant heterogeneity observed across the studies ruled out meta-analyses. Flawed or distorted study estimates lead to uncertainty about the prediction, which limits the models' usefulness in clinical decision-making. The modest performance of most models implies that alternative predictors should be considered, such as bacteria with acid tolerant properties.

Trial registration: PROSPERO CRD#152,467 April 28, 2020.
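The recalculation of likelihood ratios from sensitivity and specificity described in this abstract follows the standard definitions, LR+ = sensitivity / (1 - specificity) and LR- = (1 - sensitivity) / specificity. A minimal sketch is given below; the function name `likelihood_ratios` and the input values are hypothetical and are not drawn from the included studies.

```python
import math


def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity given as proportions."""
    if not (0.0 <= sensitivity <= 1.0 and 0.0 <= specificity <= 1.0):
        raise ValueError("sensitivity and specificity must lie in [0, 1]")

    # LR+ = sensitivity / (1 - specificity); unbounded when specificity is 1.
    if specificity < 1.0:
        lr_pos = sensitivity / (1.0 - specificity)
    else:
        lr_pos = math.inf if sensitivity > 0.0 else math.nan

    # LR- = (1 - sensitivity) / specificity; unbounded when specificity is 0.
    if specificity > 0.0:
        lr_neg = (1.0 - sensitivity) / specificity
    else:
        lr_neg = math.inf if sensitivity < 1.0 else math.nan

    return lr_pos, lr_neg


# Made-up inputs for illustration only.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.8, specificity=0.9)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
```

With these made-up inputs LR+ is about 8, which would clear the threshold of 5 that the abstract uses to single out the better-performing models.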

Project description:

Objective: To systematically review the conduct and reporting of formula trials.

Design: Systematic review.

Data sources: Medline, Embase, and Cochrane Central Register of Controlled Trials (CENTRAL) were searched from 1 January 2006 to 31 December 2020.

Review methods: Intervention trials comparing at least two formula products in children less than three years of age were included, but not trials of human breast milk or fortifiers of breast milk. Data were extracted in duplicate and primary outcome data were synthesised for meta-analysis with a random effects model weighted by the inverse variance method. Risk of bias was evaluated with Cochrane risk of bias version 2.0, and risk of undermining breastfeeding was evaluated according to published consensus guidance. Primary outcomes of the trials included in the systematic review were identified from clinical trial registries, protocols, or trial publications.

Results: 22 201 titles were screened and 307 trials were identified that were published between 2006 and 2020, of which 73 (24%) trials in 13 197 children were prospectively registered. Another 111 unpublished but registered trials in 17 411 children were identified. Detailed analysis was undertaken for 125 trials (23 757 children) published since 2015. Seventeen (14%) of these recently published trials were conducted independently of formula companies, 26 (21%) were prospectively registered with a clear aim and primary outcome, and authors or sponsors shared prospective protocols for 11 (9%) trials. Risk of bias was low in five (4%) and high in 100 (80%) recently published trials, mainly because of inappropriate exclusions from analysis and selective reporting. For 68 recently published superiority trials, a pooled standardised mean difference of 0.51 (range -0.43 to 3.29) was calculated with an asymmetrical funnel plot (Egger's test P<0.001), which reduced to 0.19 after correction for asymmetry. Primary outcomes were reported by authors as favourable in 86 (69%) trials, and 115 (92%) abstract conclusions were favourable. One of 38 (3%) trials in partially breastfed infants reported adequate support for breastfeeding and 14 of 87 (16%) trials in non-breastfed infants confirmed the decision not to breastfeed was firmly established before enrolment in the trial.

Conclusions: The results show that formula trials lack independence or transparency, and published outcomes are biased by selective reporting.

Systematic review registration: PROSPERO 2018 CRD42018091928.

Project description:

Objectives: To identify and assess the quality and accuracy of prognostic models for nephropathy and to validate these models in external cohorts of people with type 2 diabetes.

Design: Systematic review and external validation.

Data sources: PubMed and Embase.

Eligibility criteria: Studies describing the development of a model to predict the risk of nephropathy, applicable to people with type 2 diabetes.

Methods: Screening, data extraction, and risk of bias assessment were done in duplicate. Eligible models were externally validated in the Hoorn Diabetes Care System (DCS) cohort (n=11 450) for the same outcomes for which they were developed. Risks of nephropathy were calculated and compared with observed risk over 2, 5, and 10 years of follow-up. Model performance was assessed based on intercept adjusted calibration and discrimination (Harrell's C statistic).

Results: 41 studies included in the systematic review reported 64 models, 46 of which were developed in a population with diabetes and 18 in the general population including diabetes as a predictor. The predicted outcomes included albuminuria, diabetic kidney disease, chronic kidney disease (general population), and end stage renal disease. The reported apparent discrimination of the 46 models varied considerably across the different predicted outcomes, from 0.60 (95% confidence interval 0.56 to 0.64) to 0.99 (not available) for the models developed in a diabetes population and from 0.59 (not available) to 0.96 (0.95 to 0.97) for the models developed in the general population. Calibration was reported in 31 of the 41 studies, and the models were generally well calibrated. 21 of the 64 retrieved models were externally validated in the Hoorn DCS cohort for predicting risk of albuminuria, diabetic kidney disease, and chronic kidney disease, with considerable variation in performance across prediction horizons and models. For all three outcomes, however, at least two models had C statistics >0.8, indicating excellent discrimination. In a secondary external validation in GoDARTS (Genetics of Diabetes Audit and Research in Tayside Scotland), models developed for diabetic kidney disease outperformed those for chronic kidney disease. Models were generally well calibrated across all three prediction horizons.

Conclusions: This study identified multiple prediction models to predict albuminuria, diabetic kidney disease, chronic kidney disease, and end stage renal disease. In the external validation, discrimination and calibration for albuminuria, diabetic kidney disease, and chronic kidney disease varied considerably across prediction horizons and models. For each outcome, however, specific models showed good discrimination and calibration across the three prediction horizons, with clinically accessible predictors, making them applicable in a clinical setting.

Systematic review registration: PROSPERO CRD42020192831.
