Dataset Information

Correction for population stratification in random forest analysis.


ABSTRACT: Population structure (PS), including population stratification and admixture, is a significant confounder in genome-wide association studies (GWAS), as it may produce spurious associations. Random forest (RF) has been increasingly applied in GWAS data analysis because of its advantage in analysing high-dimensional genetic data. RF creates importance measures for single nucleotide polymorphisms (SNPs), which are helpful for feature selection. However, if PS is not appropriately corrected, RF tends to give high importance to disease-unrelated SNPs whose allele or genotype frequencies differ among subpopulations, leading to inaccurate results.

In this study, the authors propose to correct for the confounding effect of PS by including the information of PS in RF analysis. The correction procedure starts by extracting the information of PS from a large number of structure inference SNPs, using EIGENSTRAT or a multi-dimensional scaling clustering procedure. The phenotype and genotypes, adjusted by the information of PS, are then used as the outcome and predictors in RF analysis.

Extensive simulations indicate that the importance measure of the causal SNP is increased following the PS correction. In an analysis of a real dataset, the proposed correction removes the spurious association between the lactase gene and height.

The authors propose a simple method to correct for PS in RF analysis on GWAS data. Further studies in real GWAS datasets are required to validate the robustness of the proposed approach.
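The adjustment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that PS information is summarised by the top principal components of the structure inference SNPs (a simplified stand-in for EIGENSTRAT, without its allele-frequency scaling) and that "adjusted" means taking least-squares residuals. The helper names `top_principal_components` and `residualize`, the simulated data, and all parameter values are hypothetical.

```python
import numpy as np


def top_principal_components(structure_snps, n_components=2):
    """Top PC scores per individual from an (individuals x SNPs) genotype matrix.

    Assumption: plain column-centred PCA, a simplification of EIGENSTRAT.
    """
    centered = structure_snps - structure_snps.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def residualize(values, covariates):
    """Residuals from a least-squares fit of values on covariates plus an intercept."""
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef


# Simulated example (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
pop = np.repeat([0, 1], 200)  # two subpopulations of 200 individuals each

# Structure inference SNPs: allele frequencies differ between subpopulations.
freq_a = rng.uniform(0.1, 0.9, size=200)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, size=200), 0.05, 0.95)
freqs = np.where(pop[:, None] == 0, freq_a, freq_b)
structure_snps = rng.binomial(2, freqs)

# Phenotype depends on subpopulation only; the candidate SNP is not causal
# but its allele frequency differs between subpopulations.
phenotype = 2.0 * pop + rng.normal(size=pop.size)
candidate_snp = rng.binomial(2, np.where(pop == 0, 0.2, 0.8)).astype(float)

pcs = top_principal_components(structure_snps, n_components=2)
adj_phenotype = residualize(phenotype, pcs)
adj_snp = residualize(candidate_snp, pcs)

raw_corr = np.corrcoef(candidate_snp, phenotype)[0, 1]
adj_corr = np.corrcoef(adj_snp, adj_phenotype)[0, 1]
print(f"correlation before adjustment: {raw_corr:.3f}")
print(f"correlation after adjustment:  {adj_corr:.3f}")
```

In a full analysis, the adjusted phenotype and a matrix of adjusted genotypes would then be passed as outcome and predictors to an RF implementation of the analyst's choice, and the resulting SNP importance measures compared with those from the unadjusted data.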

SUBMITTER: Zhao Y 

PROVIDER: S-EPMC3535752 | biostudies-literature | 2012 Dec

REPOSITORIES: biostudies-literature

Publications

Correction for population stratification in random forest analysis.

Zhao Y, Chen F, Zhai R, Lin X, Wang Z, Su L, Christiani DC

International Journal of Epidemiology, 2012-11-12, issue 6


Similar Datasets

| S-EPMC11373406 | biostudies-literature
| S-EPMC8263609 | biostudies-literature
| S-EPMC2198793 | biostudies-literature
| S-EPMC3671578 | biostudies-literature
| S-EPMC10794901 | biostudies-literature
| S-EPMC5775496 | biostudies-literature
| S-EPMC1852732 | biostudies-literature
| S-EPMC8575902 | biostudies-literature
2012-05-10 | GSE37858 | GEO