Dataset Information

Gene selection and classification of microarray data using random forest.

ABSTRACT:

Background

Selection of relevant genes for sample classification is a common task in most gene expression studies, where researchers try to identify the smallest possible set of genes that can still achieve good predictive performance (for instance, for future use with diagnostic purposes in clinical practice). Many gene selection approaches use univariate (gene-by-gene) rankings of gene relevance and arbitrary thresholds to select the number of genes, can only be applied to two-class problems, and use gene selection ranking criteria unrelated to the classification algorithm. In contrast, random forest is a classification algorithm well suited for microarray data: it shows excellent performance even when most predictive variables are noise, can be used when the number of variables is much larger than the number of observations and in problems involving more than two classes, and returns measures of variable importance. Thus, it is important to understand the performance of random forest with microarray data and its possible use for gene selection.

Results

We investigate the use of random forest for classification of microarray data (including multi-class problems) and propose a new method of gene selection in classification problems based on random forest. Using simulated and nine microarray data sets we show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM, and that the new gene selection procedure yields very small sets of genes (often smaller than alternative methods) while preserving predictive accuracy.
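
The abstract does not spell out the steps of the gene selection procedure. One common way to turn variable importance scores into a small gene set is backward elimination, sketched below. This is an illustrative sketch, not the authors' algorithm: the function name `backward_eliminate`, the `fit` callback (assumed to train a classifier such as a random forest and return an error estimate plus per-gene importances), and the defaults `drop_fraction=0.2` and `min_genes=2` are all assumptions made for this example.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical callback type: given a list of gene names, train a classifier
# (for example a random forest) and return (error_estimate, {gene: importance}).
FitFn = Callable[[Sequence[str]], Tuple[float, Dict[str, float]]]


def backward_eliminate(
    genes: Sequence[str],
    fit: FitFn,
    drop_fraction: float = 0.2,  # assumed value, not taken from the abstract
    min_genes: int = 2,          # assumed value, not taken from the abstract
) -> Tuple[List[str], List[Tuple[List[str], float]]]:
    """Repeatedly drop the least important genes and track the error.

    Returns the smallest gene set that reached the lowest error seen,
    together with the full (gene_set, error) history.
    """
    current = list(genes)
    history: List[Tuple[List[str], float]] = []
    while True:
        error, importance = fit(current)
        history.append((list(current), error))
        if len(current) <= min_genes:
            break
        # Drop at least one gene per round, but never go below min_genes.
        n_drop = max(1, int(len(current) * drop_fraction))
        n_keep = max(min_genes, len(current) - n_drop)
        ranked = sorted(current, key=lambda g: importance[g], reverse=True)
        current = ranked[:n_keep]
    best_error = min(err for _, err in history)
    candidates = [subset for subset, err in history if err == best_error]
    return min(candidates, key=len), history
```

In practice `fit` would wrap whatever classifier and error estimate are in use (for a random forest, the out-of-bag error is a natural choice), and the selection rule at the end could be relaxed to accept any subset whose error is within some tolerance of the minimum.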

Conclusion

Because of its performance and features, random forest and gene selection using random forest should probably become part of the "standard tool-box" of methods for class prediction and gene selection with microarray data.

SUBMITTER: Diaz-Uriarte R

PROVIDER: S-EPMC1363357 | biostudies-literature | 2006 Jan

REPOSITORIES: biostudies-literature

Publications

Gene selection and classification of microarray data using random forest.

Díaz-Uriarte R, Alvarez de Andrés S

BMC Bioinformatics, 2006 Jan 6

PMID: 16398926

Similar Datasets

Project description: Background: The wealth of gene expression values being generated by high-throughput microarray technologies leads to complex high-dimensional datasets. Moreover, many cohorts have the problem of imbalanced classes, where the number of patients belonging to each class is not the same. With this kind of dataset, biologists need to identify a small number of informative genes that can be used as biomarkers for a disease. Results: This paper introduces a Balanced Iterative Random Forest (BIRF) algorithm to select the most relevant genes for a disease from imbalanced high-throughput gene expression microarray data. Balanced iterative random forest is applied to four cancer microarray datasets: a childhood leukaemia dataset collected from The Children's Hospital at Westmead, which represents the main target of this paper; NCI 60; a Colon dataset; and a Lung cancer dataset. The results obtained by BIRF are compared to those of Support Vector Machine-Recursive Feature Elimination (SVM-RFE), Multi-class SVM-RFE (MSVM-RFE), Random Forest (RF) and Naive Bayes (NB) classifiers. The BIRF approach outperforms these state-of-the-art methods, especially in the case of imbalanced datasets. Experiments on the childhood leukaemia dataset show that BIRF achieves 7% to 12% better accuracy than MSVM-RFE, with the ability to predict patients in the minor class. The informative biomarkers selected by the BIRF algorithm were validated by repeating training experiments three times to see whether they are globally informative or just selected by chance. The results show that 64% of the top genes consistently appear in the three lists, and the top 20 genes remain near the top in the other three lists. Conclusion: The designed BIRF algorithm is an appropriate choice to select genes from imbalanced high-throughput gene expression microarray data. BIRF outperforms the state-of-the-art methods, especially in its ability to handle class-imbalanced data.
Moreover, the analysis of the selected genes also provides a way to distinguish between the predictive genes and those that only appear to be predictive.

Project description: Background: Psoriasis is a chronic, immune-mediated skin disorder that currently has no cure. Pyroptosis has been shown to be involved in the pathogenesis and progression of psoriasis. However, the role pyroptosis plays in psoriasis remains elusive. Methods: RNA-sequencing data of psoriasis patients were obtained from the Gene Expression Omnibus (GEO) database, and differentially expressed pyroptosis-related genes (PRGs) between psoriasis patients and normal individuals were obtained. A principal component analysis (PCA) was conducted to determine whether PRGs could be used to distinguish the samples. The correlation between PRGs and immune cells was also investigated. Subsequently, a novel diagnostic model comprising PRGs for psoriasis was constructed using a random forest algorithm (ntree = 400). A receiver operating characteristic (ROC) analysis was used to evaluate the classification performance through both internal and external validation. Consensus clustering analysis was used to investigate whether there was a difference in biological functions within PRG-based subtypes. Finally, the expression of the kernel PRGs was validated in vivo by qRT-PCR. Results: We identified a total of 39 PRGs, which could distinguish psoriasis samples from normal samples. Activated CD4 memory T cells and resting mast cells were correlated with PRGs. Ten PRGs, IL-1β, AIM2, CASP5, DHX9, CASP4, CYCS, CASP1, GZMB, CHMP2B, and CASP8, were subsequently screened using a random forest diagnostic model. ROC analysis revealed that our model has good diagnostic performance in both internal validation (area under the curve [AUC] = 0.930 [95% CI 0.877-0.984]) and external validation (mean AUC = 0.852). PRG subtypes indicated differences in metabolic processes and the MAPK signaling pathway. Finally, the qRT-PCR results demonstrated the apparent dysregulation of PRGs in psoriasis, especially AIM2 and GZMB.
Conclusion: Pyroptosis may play a crucial role in psoriasis and could provide new insights into the diagnosis and underlying mechanisms of psoriasis.
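
The AUC values quoted above summarize how well a model's scores separate the two classes. As a rough illustration of what the statistic measures, the sketch below computes AUC as the fraction of (positive, negative) sample pairs in which the positive sample receives the higher score, counting ties as half. The function name `roc_auc` and the toy labels and scores are hypothetical and are not taken from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for the positive class, 0 for the negative class.
    scores: classifier scores, higher meaning "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data for illustration only (not from the study).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
print(roc_auc(toy_labels, toy_scores))  # 3 of 4 pairs ordered correctly: 0.75
```

The pairwise loop is quadratic in the number of samples, which is fine for small cohorts; a rank-based formulation gives the same value more efficiently on large ones.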
