Dataset Information

Random forest versus logistic regression: a large-scale benchmark experiment.

ABSTRACT:

Background and goal

The Random Forest (RF) algorithm for regression and classification has gained considerable popularity since its introduction in 2001. It has since grown into a standard classification approach that competes with logistic regression (LR) in many innovation-friendly scientific fields.

Results

In this context, we present a large-scale benchmarking experiment based on 243 real datasets, comparing the prediction performance of the original version of RF with default parameters and of LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of bias.

Conclusion

RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95% CI = [0.022, 0.038]) for the accuracy, 0.041 (95% CI = [0.031, 0.053]) for the Area Under the Curve, and -0.027 (95% CI = [-0.034, -0.021]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side result of our benchmarking experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, which emphasizes the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, carefully designed and based on a high number of datasets, will be necessary in the future to further evaluate variants, implementations or parameters of random forests which may yield improved accuracy compared to the original version with default values.
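The three measures named in the abstract, and an interval for a mean paired difference, can be illustrated with a minimal sketch. It uses only the Python standard library. The function names, the 0.5 classification threshold, the toy inputs, and the choice of a percentile bootstrap are all assumptions made for illustration; the abstract does not state how the study computed its intervals, and this is not the authors' code.

```python
import random
from statistics import mean


def accuracy(y_true, p_hat, threshold=0.5):
    """Share of cases whose thresholded probability matches the 0/1 label."""
    return mean(int((p >= threshold) == bool(y)) for y, p in zip(y_true, p_hat))


def brier_score(y_true, p_hat):
    """Mean squared gap between predicted probability and 0/1 label (lower is better)."""
    return mean((p - y) ** 2 for y, p in zip(y_true, p_hat))


def auc(y_true, p_hat):
    """Probability that a random positive is scored above a random negative (ties count half)."""
    pos = [p for y, p in zip(y_true, p_hat) if y == 1]
    neg = [p for y, p in zip(y_true, p_hat) if y == 0]
    wins = sum((pp > pn) + 0.5 * (pp == pn) for pp in pos for pn in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of per-dataset differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the datasets with replacement and record each resample's mean.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical labels and predicted probabilities for one small dataset.
y = [0, 0, 1, 1]
p_model = [0.1, 0.4, 0.35, 0.8]
acc, area, brier = accuracy(y, p_model), auc(y, p_model), brier_score(y, p_model)

# Hypothetical per-dataset accuracy differences (RF minus LR), not study data.
toy_diffs = [0.05, -0.01, 0.03, 0.00, 0.04, 0.02, -0.02, 0.06]
ci_low, ci_high = bootstrap_ci(toy_diffs)
```

With differences taken as RF minus LR, a positive mean difference in accuracy or AUC favors RF, while a negative mean difference in the Brier score favors RF, because a lower Brier score indicates better probability predictions.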

SUBMITTER: Couronne R

PROVIDER: S-EPMC6050737 | biostudies-literature | 2018 Jul

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Random forest versus logistic regression: a large-scale benchmark experiment.

Couronné R, Probst P, Boulesteix AL

BMC Bioinformatics, 2018 Jul 17, issue 1

PMID: 30016950

Similar Datasets

Project description:

Background: This study illustrates the use of logistic regression and machine learning methods, specifically random forest models, in health services research by analyzing outcomes for a cohort of patients with concomitant peripheral artery disease and diabetes mellitus.

Methods: Cohort study using fee-for-service Medicare beneficiaries in 2015 who were newly diagnosed with peripheral artery disease and diabetes mellitus. Exposure variables include whether patients received preventive measures in the 6 months following their index date: HbA1c test, foot exam, or vascular imaging study. Outcomes include any reintervention, lower extremity amputation, and death. We fit both logistic regression models and random forest models.

Results: There were 88,898 fee-for-service Medicare beneficiaries diagnosed with peripheral artery disease and diabetes mellitus in our cohort. In the first six months following diagnosis, 52% (n = 45,971) received a foot exam, 43% (n = 38,393) had vascular imaging, and 50% (n = 44,181) had an HbA1c test. The direction of influence for all covariates considered matched between the random forest and logistic regression models. The most predictive covariate, as determined by the t-statistics from logistic regression and by variable importance (VI) in the random forest model, differs between the two approaches. For amputation, the most predictive covariates were age 85+ in logistic regression (t = 53.17) and urban residence in the random forest (VI = 83.42); for death (t = 65.84, VI = 88.76) and reintervention (t = 34.40, VI = 81.22), both models indicate that age is most predictive.

Conclusions: The use of random forest models to analyze data and provide predictions for patients holds great potential for identifying modifiable patient-level and health-system factors, as well as cohorts for increased surveillance and intervention, to improve outcomes for patients. Random forests are high-performing models that are difficult to interpret. They are best suited to settings where accurate prediction is the priority, and they can be used in tandem with more common approaches to provide a more thorough analysis of observational data.
