Dataset Information

The revival of the Gini importance?

ABSTRACT: Motivation: Random forests are fast and flexible, and they offer a robust approach to analyzing high-dimensional data. A key advantage over alternative machine learning algorithms is their variable importance measures, which can be used to identify relevant features or perform variable selection. Measures based on the impurity reduction of splits, such as the Gini importance, are popular because they are simple and fast to compute. However, they are biased in favor of variables with many possible split points and high minor allele frequency. Results: We set up a fast approach to debias impurity-based variable importance measures for classification, regression and survival forests. We show that it creates a variable importance measure which is unbiased with regard to the number of categories and minor allele frequency and almost as fast as the standard impurity importance. As a result, it is now possible to compute reliable importance estimates without the extra computing cost of permutations. Further, we combine the importance measure with a fast testing procedure, producing p-values for variable importance with almost no computational overhead beyond growing the random forest. Applications to gene expression and genome-wide association data show that the proposed method is powerful and computationally efficient. Availability and implementation: The procedure is included in the ranger package, available at https://cran.r-project.org/package=ranger and https://github.com/imbs-hl/ranger. Supplementary information: Supplementary data are available at Bioinformatics online.
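
The bias described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' debiasing procedure and does not use ranger; it only shows, under simplifying assumptions (a single binary-outcome split, a predictor that is pure noise, and hypothetical helper names such as `best_gini_decrease` and `mean_null_decrease`), that a predictor with more candidate split points tends to achieve a larger best Gini decrease by chance alone.

```python
import random


def gini(pos, n):
    """Gini impurity of a binary node with `pos` positives among `n` samples."""
    if n == 0:
        return 0.0
    p = pos / n
    return 2.0 * p * (1.0 - p)


def best_gini_decrease(x, y):
    """Largest impurity decrease over all threshold splits of the form x <= t."""
    n = len(y)
    total_pos = sum(y)
    parent = gini(total_pos, n)
    pairs = sorted(zip(x, y))
    best = 0.0
    left_n = 0
    left_pos = 0
    for i in range(n - 1):
        left_n += 1
        left_pos += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # equal neighbouring values: no threshold separates them
        right_n = n - left_n
        right_pos = total_pos - left_pos
        children = (left_n * gini(left_pos, left_n)
                    + right_n * gini(right_pos, right_n)) / n
        best = max(best, parent - children)
    return best


def mean_null_decrease(n_levels, n_samples=200, n_reps=200, seed=1):
    """Average best Gini decrease for a noise predictor with `n_levels` values.

    The outcome and the predictor are drawn independently, so any decrease
    found here is spurious. All parameter values are illustrative assumptions.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        y = [rng.randint(0, 1) for _ in range(n_samples)]
        x = [rng.randrange(n_levels) for _ in range(n_samples)]
        total += best_gini_decrease(x, y)
    return total / n_reps


if __name__ == "__main__":
    # Compare noise predictors with 2, 3 and 20 distinct values.
    for levels in (2, 3, 20):
        print(levels, mean_null_decrease(levels))
```

Running the script prints the average spurious decrease for each number of levels; the value for 20 levels should come out larger than the value for 2 levels, which is the kind of bias the proposed correction targets. In ranger itself, the package documentation lists an `importance = "impurity_corrected"` option for the debiased measure.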

SUBMITTER: Nembrini S

PROVIDER: S-EPMC6198850 | biostudies-other | 2018 Nov

REPOSITORIES: biostudies-other

Publications

The revival of the Gini importance?

Stefano Nembrini, Inke R. König, Marvin N. Wright

Bioinformatics (Oxford, England), 2018-11-01, issue 21

PMID: 29757357

Similar Datasets

Importance of lymphatic and vascular invasion in colorectal cancers
| 2591177 | ecrin-mdr-crc

Clinicopathological Importance of Colorectal Medullary Carcinoma: Retrospective Cohort Study
| 2292108 | ecrin-mdr-crc

Importance of histone demethylation in adipogenic differentiation and function
2012-04-05 | GSE18600 | GEO

Biomedical importance of indoles.
| S-EPMC6270133 | biostudies-literature

Importance of histone demethylation in adipogenic differentiation and function
2012-04-04 | E-GEOD-18600 | biostudies-arrayexpress

Unique features and clinical importance of acute alloreactive immune responses
2018-06-27 | GSE111377 | GEO

Functional Importance of eRNAs for Estrogen-dependent Gene Transcriptional Activation
2013-06-04 | GSE45822 | GEO

Importance of AMPK in metformin suppression of liver glucose production
2019-11-13 | GSE114234 | GEO

Assessing the importance of target type for oligonucleotide microarray experiments
2010-07-27 | GSE23168 | GEO