Dataset Information

A balanced iterative random forest for gene selection from microarray data.

ABSTRACT:

Background

The wealth of gene expression values generated by high-throughput microarray technologies leads to complex, high-dimensional datasets. Moreover, many cohorts have imbalanced classes, where the number of patients belonging to each class is not the same. With this kind of dataset, biologists need to identify a small number of informative genes that can be used as biomarkers for a disease.

Results

This paper introduces a Balanced Iterative Random Forest (BIRF) algorithm to select the most relevant genes for a disease from imbalanced high-throughput gene expression microarray data. BIRF is applied to four cancer microarray datasets: a childhood leukaemia dataset collected from The Children's Hospital at Westmead, which is the main target of this paper, the NCI 60 dataset, a Colon dataset and a Lung cancer dataset. The results obtained by BIRF are compared to those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE), Multi-class SVM-RFE (MSVM-RFE), Random Forest (RF) and Naive Bayes (NB) classifiers. BIRF outperforms these state-of-the-art methods, especially on imbalanced datasets. Experiments on the childhood leukaemia dataset show that BIRF achieves 7% to 12% better accuracy than MSVM-RFE, along with the ability to predict patients in the minor class. The informative biomarkers selected by the BIRF algorithm were validated by repeating the training experiments three times to see whether they are globally informative or just selected by chance. The results show that 64% of the top genes consistently appear in the three lists, and the top 20 genes remain near the top in the other three lists.
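The validation idea in the last two sentences, repeating training and checking how many top-ranked genes recur across runs, can be illustrated with a small sketch. The function name `consistent_fraction`, the gene identifiers and the cut-off `top_k` are hypothetical and chosen only for illustration; the abstract does not state how the paper computes its overlap figure.

```python
def consistent_fraction(ranked_lists, top_k):
    """Return (fraction, genes) for genes found in the top_k of every list.

    ranked_lists: list of gene lists, each ordered from most to least important.
    top_k: how many top-ranked genes to compare (an assumed cut-off).
    """
    if not ranked_lists or top_k <= 0:
        raise ValueError("need at least one ranked list and a positive top_k")
    # Keep only the top_k genes of each run, ignoring their order.
    top_sets = [set(genes[:top_k]) for genes in ranked_lists]
    # Genes that survive in every run are treated as consistently selected.
    shared = set.intersection(*top_sets)
    return len(shared) / top_k, sorted(shared)


# Hypothetical ranked gene lists from three repeated training runs.
run_a = ["G1", "G2", "G3", "G4", "G5"]
run_b = ["G2", "G1", "G4", "G6", "G3"]
run_c = ["G1", "G3", "G2", "G7", "G8"]

fraction, shared = consistent_fraction([run_a, run_b, run_c], top_k=5)
print(f"{fraction:.0%} of the top genes appear in all three lists: {shared}")
```

Genes that recur in every run are more plausibly informative than genes that appear once, which is the distinction the validation step is meant to draw.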

Conclusion

The BIRF algorithm is an appropriate choice for selecting genes from imbalanced high-throughput gene expression microarray data. BIRF outperforms the state-of-the-art methods, especially in its ability to handle class-imbalanced data. Moreover, the analysis of the selected genes provides a way to distinguish between genes that are predictive and those that only appear to be predictive.
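The abstract does not describe how BIRF balances the classes. As a rough illustration only, one common device in balanced random forests is to build each tree from a bootstrap sample containing the same number of cases from every class. The sketch below assumes that approach; the function `balanced_bootstrap` and the class labels are hypothetical and are not taken from the paper.

```python
import random
from collections import defaultdict


def balanced_bootstrap(labels, rng):
    """Return sample indices with the same number of draws from every class.

    Assumption: each class contributes as many draws as the smallest class
    has members. This is one common balancing scheme, not necessarily BIRF's.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    minority_size = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        # Draw with replacement, as in an ordinary bootstrap.
        sample.extend(rng.choices(indices, k=minority_size))
    return sample


# Hypothetical imbalanced cohort: 18 patients in one class, 4 in the other.
labels = ["major"] * 18 + ["minor"] * 4
sample = balanced_bootstrap(labels, random.Random(0))
counts = {c: sum(labels[i] == c for i in sample) for c in set(labels)}
print(counts)
```

A tree trained on such a sample sees both classes equally often, so the minor class is not swamped by the larger one.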

SUBMITTER: Anaissi A

PROVIDER: S-EPMC3766035 | biostudies-literature | 2013 Aug

REPOSITORIES: biostudies-literature

Publications

A balanced iterative random forest for gene selection from microarray data.

Anaissi A, Kennedy PJ, Goyal M, Catchpoole DR

BMC Bioinformatics, 2013-08-27

PMID: 23981907

Similar Datasets

Random forest for gene selection and microarray data classification.
| S-EPMC3218317 | biostudies-literature

Gene selection and classification of microarray data using random forest.
| S-EPMC1363357 | biostudies-literature

Feature selection and classification of urinary mRNA microarray data by iterative random forest to diagnose renal fibrosis: a two-stage study.
| S-EPMC5206620 | biostudies-literature

Robustness of Random Forest-based gene selection methods.
| S-EPMC3897925 | biostudies-literature

Gene selection using iterative feature elimination random forests for survival outcomes.
| S-EPMC3495190 | biostudies-literature

Iterative rank-order normalization of gene expression microarray data.
| S-EPMC3651355 | biostudies-literature

Forward variable selection for random forest models.
| S-EPMC10503461 | biostudies-literature

A stable gene selection in microarray data analysis.
| S-EPMC1524991 | biostudies-literature

Assessing stability of gene selection in microarray data analysis.
| S-EPMC1403808 | biostudies-literature

Using iterative random forest to find geospatial environmental and Sociodemographic predictors of suicide attempts.
| S-EPMC10433206 | biostudies-literature