Dataset Information

The behaviour of random forest permutation-based variable importance measures under predictor correlation.


ABSTRACT:

Background

Random forests (RF) have been increasingly used in applications such as genome-wide association and microarray studies where predictor correlation is frequently observed. Recent works on permutation-based variable importance measures (VIMs) used in RF have come to apparently contradictory conclusions. We present an extended simulation study to synthesize results.

Results

When predictor correlation was present and predictors were associated with the outcome (HA), the unconditional RF VIM attributed a higher share of importance to correlated predictors, while under the null hypothesis that no predictors are associated with the outcome (H0) the unconditional RF VIM was unbiased. Conditional VIMs showed lower values for correlated predictors than the unconditional VIMs under HA and were unbiased under H0. Scaled VIMs were clearly biased under both HA and H0.
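To make the distinction between unscaled and scaled VIMs concrete, the following is a minimal sketch rather than the authors' implementation. It assumes that each "tree" is a hypothetical callable mapping a row to a predicted class, that the importance of a predictor is the per-tree drop in accuracy after shuffling that predictor's column, and that the scaled VIM divides the mean drop by its standard error across trees. A real random forest would evaluate each tree on its own out-of-bag samples; the sketch uses one shared dataset for brevity. The names `permutation_drops`, `unscaled_vim`, and `scaled_vim` are illustrative only.

```python
import random
import statistics
from math import sqrt


def permutation_drops(trees, X, y, col, seed=0):
    """Per-tree drop in accuracy after shuffling column `col` of X.

    trees: list of callables (hypothetical stand-ins for fitted trees).
    X: list of rows, each row a list of predictor values.
    y: list of observed class labels.
    """
    rng = random.Random(seed)
    drops = []
    for predict in trees:
        # Accuracy on the intact data.
        base = sum(predict(row) == t for row, t in zip(X, y)) / len(y)
        # Shuffle one predictor, breaking its link to the outcome.
        shuffled = [row[col] for row in X]
        rng.shuffle(shuffled)
        X_perm = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, shuffled)]
        perm = sum(predict(row) == t for row, t in zip(X_perm, y)) / len(y)
        drops.append(base - perm)
    return drops


def unscaled_vim(drops):
    """Unscaled VIM: mean accuracy drop across trees."""
    return statistics.mean(drops)


def scaled_vim(drops):
    """Scaled VIM: mean drop divided by its standard error across trees.

    Assumes at least two trees and a non-zero spread of the drops.
    """
    standard_error = statistics.stdev(drops) / sqrt(len(drops))
    return unscaled_vim(drops) / standard_error
```

A short usage example with a single toy "tree" that predicts from the first predictor only: shuffling the unused second predictor leaves accuracy unchanged, so its unscaled VIM is zero.

```python
X = [[0, 5], [1, 6], [0, 7], [1, 8]]
y = [0, 1, 0, 1]
trees = [lambda row: row[0]]
print(unscaled_vim(permutation_drops(trees, X, y, col=1)))
```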

Conclusions

Unconditional unscaled VIMs are a computationally tractable choice for large datasets and are unbiased under the null hypothesis. Whether the increased VIMs observed for correlated predictors should be considered a "bias" (because they do not directly reflect the coefficients in the generating model) or a beneficial attribute of these VIMs depends on the application. For example, in genetic association studies, where correlation between markers may help to localize the functionally relevant variant, the increased importance of correlated predictors may be an advantage. On the other hand, we show examples where this increased importance may result in spurious signals.

SUBMITTER: Nicodemus KK 

PROVIDER: S-EPMC2848005 | biostudies-literature | 2010 Feb

REPOSITORIES: biostudies-literature

Publications

The behaviour of random forest permutation-based variable importance measures under predictor correlation.

Nicodemus KK, Malley JD, Strobl C, Ziegler A

BMC Bioinformatics, 2010 Feb 27


Similar Datasets

| S-EPMC1796903 | biostudies-other
| S-EPMC3626572 | biostudies-literature
| S-EPMC2638262 | biostudies-literature
| S-EPMC6051549 | biostudies-literature
| S-EPMC10497997 | biostudies-literature
| S-EPMC4822295 | biostudies-literature
| S-EPMC7054385 | biostudies-literature
| S-EPMC4955772 | biostudies-literature
| S-EPMC8554859 | biostudies-literature
| S-EPMC7508310 | biostudies-literature