Dataset Information

An integrated approach to reduce the impact of minor allele frequency and linkage disequilibrium on variable importance measures for genome-wide data.


ABSTRACT:

Motivation

There is growing momentum to develop statistical learning (SL) methods as an alternative to conventional genome-wide association studies (GWAS). Methods such as random forests (RF) and gradient boosting machine (GBM) result in variable importance measures that indicate how well each single-nucleotide polymorphism (SNP) predicts the phenotype. For RF, it has been shown that variable importance measures are systematically affected by minor allele frequency (MAF) and linkage disequilibrium (LD). To establish RF and GBM as viable alternatives for analyzing genome-wide data, it is necessary to address this potential bias and show that SL methods do not significantly under-perform conventional GWAS methods.

Results

Both LD and MAF have a significant impact on the variable importance measures commonly used in RF and GBM. Dividing SNPs into overlapping subsets with approximate linkage equilibrium and applying SL methods to each subset successfully reduces the impact of LD. A welcome side effect of this approach is a dramatic reduction in parallel computing time, increasing the feasibility of applying SL methods to large datasets. The created subsets also facilitate a potential correction for the effect of MAF using pseudocovariates. Simulations using simulated SNPs embedded in empirical data (assessing varying effect sizes, minor allele frequencies and LD patterns) suggest that the sensitivity to detect effects is often improved by subsetting and does not significantly under-perform the Armitage trend test, even under ideal conditions for the trend test.
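The idea of dividing SNPs into overlapping subsets in approximate linkage equilibrium can be illustrated with a minimal sketch. This is not the authors' algorithm (their code is at the URL under Availability). It is a simple greedy grouping under stated assumptions: LD is measured as the squared Pearson correlation of 0/1/2 allele counts, and the names `r_squared`, `ld_subsets`, `max_r2`, the 0.1 threshold and the toy genotypes are all hypothetical.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Monomorphic SNP: correlation is undefined, so treat it as no LD.
        return 0.0
    return (sxy * sxy) / (sxx * syy)


def ld_subsets(genotypes, max_r2=0.1):
    """Greedy grouping (an assumption, not the published method).

    A SNP joins every existing subset in which its r^2 with all current
    members is below max_r2, so subsets may overlap. If no subset accepts
    it, the SNP starts a new subset.
    """
    subsets = []
    for name, geno in genotypes.items():
        placed = False
        for subset in subsets:
            if all(r_squared(geno, genotypes[other]) < max_r2 for other in subset):
                subset.append(name)
                placed = True
        if not placed:
            subsets.append([name])
    return subsets


# Toy data: snp1 and snp2 are in perfect LD; snp3 is uncorrelated with both.
genotypes = {
    "snp1": [0, 1, 2, 0, 1, 2],
    "snp2": [0, 1, 2, 0, 1, 2],
    "snp3": [0, 0, 0, 2, 2, 2],
}

print(ld_subsets(genotypes))
# → [['snp1', 'snp3'], ['snp2', 'snp3']]
```

In this toy run the two SNPs in perfect LD land in different subsets, while the uncorrelated SNP appears in both. An SL method such as RF or GBM could then be fitted to each subset separately, which is also what would make the per-subset runs easy to parallelize.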

Availability

Code for the LD subsetting algorithm and pseudocovariate correction is available at http://www.nd.edu/~glubke/code.html.

SUBMITTER: Walters R 

PROVIDER: S-EPMC3467741 | biostudies-literature | 2012 Oct

REPOSITORIES: biostudies-literature

Publications

An integrated approach to reduce the impact of minor allele frequency and linkage disequilibrium on variable importance measures for genome-wide data.

Walters R, Laurin C, Lubke GH

Bioinformatics (Oxford, England), 2012-07-30, issue 20


Similar Datasets

S-EPMC7212447 | biostudies-literature
S-EPMC1665459 | biostudies-literature
S-EPMC9143480 | biostudies-literature
S-EPMC1560400 | biostudies-literature
S-EPMC2638262 | biostudies-literature
S-EPMC1855121 | biostudies-literature
S-EPMC4143691 | biostudies-literature
S-EPMC6082240 | biostudies-literature
S-EPMC4012494 | biostudies-literature
S-EPMC130040 | biostudies-literature