Dataset Information

EM-random forest and new measures of variable importance for multi-locus quantitative trait linkage analysis.

ABSTRACT:

Motivation

We developed an EM-random forest (EMRF) for Haseman-Elston quantitative trait linkage analysis that accounts for marker ambiguity and weights each sib-pair according to the posterior identical-by-descent (IBD) distribution. The usual random forest (RF) variable importance (VI) index used to rank markers for variable selection is not optimal when applied to linkage data because of correlation between markers. We define new VI indices that borrow information from linked markers using the correlation structure inherent in IBD linkage data.
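For orientation, the classical Haseman-Elston regression that EMRF builds on can be sketched as follows. This is a minimal illustration, not the authors' EMRF code: it regresses the squared sib-pair trait difference on the expected proportion of alleles shared IBD at a single marker, where that proportion is computed from the posterior IBD distribution as 0.5 * P(IBD=1) + P(IBD=2). The function names and the toy data layout are assumptions made for this example, and neither the EM weighting nor the forest is reproduced here.

```python
def expected_ibd_proportion(posterior):
    """Expected proportion of alleles shared IBD for one sib pair.

    posterior is a tuple (P(IBD=0), P(IBD=1), P(IBD=2)) at one marker.
    """
    p0, p1, p2 = posterior
    return 0.5 * p1 + p2


def haseman_elston_slope(trait_pairs, ibd_posteriors):
    """Ordinary least-squares slope of squared trait difference on
    expected IBD proportion. A negative slope is the classical
    signature of linkage at the marker.

    trait_pairs and ibd_posteriors are hypothetical inputs: one
    (trait_sib1, trait_sib2) tuple and one posterior tuple per sib pair.
    """
    y = [(a - b) ** 2 for a, b in trait_pairs]
    x = [expected_ibd_proportion(p) for p in ibd_posteriors]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


# Toy data: pairs sharing more alleles IBD have more similar traits.
pairs = [(1.0, 3.0), (2.0, 3.0), (2.0, 2.0)]
posteriors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
slope = haseman_elston_slope(pairs, posteriors)
```

When the marker is ambiguous, the posterior spreads over several IBD states instead of putting all mass on one, which is the situation the abstract says EMRF handles by weighting sib-pairs.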

Results

Using simulations, we find that the new VI indices in EMRF performed better than the original RF VI index and performed similarly to or better than the EM-Haseman-Elston regression LOD score for various genetic models. Moreover, tree size and the size of the marker subset evaluated at each node are important considerations in RFs.

Availability

The source code for EMRF, written in C, is available at www.infornomics.utoronto.ca/downloads/EMRF.

SUBMITTER: Lee SS

PROVIDER: S-EPMC2638262 | biostudies-literature | 2008 Jul

REPOSITORIES: biostudies-literature


Publications

EM-random forest and new measures of variable importance for multi-locus quantitative trait linkage analysis.

Sophia S F Lee, Lei Sun, Rafal Kustra, Shelley B Bull

Bioinformatics (Oxford, England), 2008 May 21, issue 14


PMID: 18499695

Similar Datasets

Project description:

Background

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.

Results

Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on one hand, and effects induced by bootstrap sampling with replacement on the other hand.

Conclusion

We propose to employ an alternative implementation of random forests, that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories. The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore the suggested method can be applied straightforwardly by scientists in bioinformatics research.
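The contrast between bootstrap sampling with replacement and subsampling without replacement, which the description above identifies as one source of bias, can be sketched as follows. This is an illustration only and not the implementation referred to in the description: the function names are hypothetical, and the default subsample fraction of 0.632 is an assumption chosen for the example.

```python
import random


def bootstrap_indices(n, rng):
    """Draw n row indices with replacement (classical bagging).
    Some rows appear several times and others not at all."""
    return [rng.randrange(n) for _ in range(n)]


def subsample_indices(n, rng, fraction=0.632):
    """Draw a subset of row indices without replacement.
    fraction is an assumed tuning value; every drawn row is distinct."""
    k = max(1, round(fraction * n))
    return rng.sample(range(n), k)


rng = random.Random(0)
boot = bootstrap_indices(100, rng)
sub = subsample_indices(100, rng)
```

Each tree in a forest would be grown on the rows selected by one such draw; with subsampling, no observation is duplicated within a tree's training set.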

