Dataset Information

Efficient identification of context dependent subgroups of risk from genome-wide association studies.

ABSTRACT: We have developed a modified Patient Rule-Induction Method (PRIM) as an alternative strategy for analyzing representative samples of non-experimental human data to estimate and test the role of genomic variations as predictors of disease risk in etiologically heterogeneous sub-samples. A computational limit of the proposed strategy is encountered when the number of genomic variations (predictor variables) under study is large (>500) because permutations are used to generate a null distribution to test the significance of a term (defined by values of particular variables) that characterizes a sub-sample of individuals through the peeling and pasting processes. As an alternative, in this paper we introduce a theoretical strategy that facilitates the quick calculation of Type I and Type II errors in the evaluation of terms in the peeling and pasting processes carried out in the execution of a PRIM analysis that are under-estimated and non-existent, respectively, when a permutation-based hypothesis test is employed. The resultant savings in computational time makes possible the consideration of larger numbers of genomic variations (an example genome-wide association study is given) in the selection of statistically significant terms in the formulation of PRIM prediction models.
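The permutation step that the abstract identifies as the computational bottleneck can be illustrated with a generic sketch. This is not the authors' modified PRIM or their theoretical alternative. It is a plain one-sided permutation test for the mean outcome inside one candidate subgroup, and the function name, arguments, and toy data are all hypothetical.

```python
import random

def permutation_p_value(outcomes, in_subgroup, n_perm=1000, seed=0):
    """One-sided permutation p-value for the mean outcome inside a subgroup.

    outcomes    -- list of 0/1 disease indicators (hypothetical toy input)
    in_subgroup -- list of booleans marking individuals selected by a term
    """
    rng = random.Random(seed)
    size = sum(in_subgroup)
    observed = sum(y for y, m in zip(outcomes, in_subgroup) if m) / size
    shuffled = list(outcomes)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Under the null, any `size` individuals are exchangeable with the subgroup.
        if sum(shuffled[:size]) / size >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_perm + 1)

# Toy data: all 10 cases fall inside the 10-person subgroup.
outcomes = [1] * 10 + [0] * 90
in_subgroup = [True] * 10 + [False] * 90
p_small = permutation_p_value(outcomes, in_subgroup)
```

Each call performs n_perm shuffles for a single term, so repeating it for every candidate term over hundreds or thousands of predictor variables multiplies the cost. That scaling is the limit the abstract describes.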

SUBMITTER: Dyson G

PROVIDER: S-EPMC4171947 | biostudies-literature | 2014 Apr

REPOSITORIES: biostudies-literature


Publications

Efficient identification of context dependent subgroups of risk from genome-wide association studies.

Greg Dyson, Charles F. Sing

Statistical Applications in Genetics and Molecular Biology, 2014-04-01 (2)


PMID: 24570412

Similar Datasets

Project description: Until recently, genome-wide association studies (GWAS) have been restricted to research groups with the budget necessary to genotype hundreds, if not thousands, of samples. Replacing individual genotyping with genotyping of DNA pools in Phase I of a GWAS has proven successful, and dramatically altered the financial feasibility of this approach. When conducting a pool-based GWAS, how well SNP allele frequency is estimated from a DNA pool will influence a study's power to detect associations. Here we address how to control the variance in allele frequency estimation when DNAs are pooled, and how to plan and conduct the most efficient well-powered pool-based GWAS.

By examining the variation in allele frequency estimation on SNP arrays between and within DNA pools we determine how array variance [var(e_array)] and pool-construction variance [var(e_construction)] contribute to the total variance of allele frequency estimation. This information is useful in deciding whether replicate arrays or replicate pools are most useful in reducing variance. Our analysis is based on 27 DNA pools ranging in size from 74 to 446 individual samples, genotyped on a collective total of 128 Illumina beadarrays: 24 1M-Single, 32 1M-Duo, and 72 660-Quad.

For all three Illumina SNP array types our estimates of var(e_array) were similar, between 3 × 10^-4 and 4 × 10^-4 for normalized data. Var(e_construction) accounted for between 20% and 40% of pooling variance across 27 pools in normalized data.

We conclude that relative to var(e_array), var(e_construction) is of less importance in reducing the variance in allele frequency estimation from DNA pools; however, our data suggest that on average it may be more important than previously thought. We have prepared a simple online tool, PoolingPlanner (available at http://www.kchew.ca/PoolingPlanner/), which calculates the effective sample size (ESS) of a DNA pool given a range of replicate array values. ESS can be used in a power calculator to perform pool-adjusted calculations. This allows one to quickly calculate the loss of power associated with a pooling experiment to make an informed decision on whether a pool-based GWAS is worth pursuing.
