Dataset Information

Predicting the difficulty of pure, strict, epistatic models: metrics for simulated model selection.

ABSTRACT:

Background

Algorithms designed to detect complex genetic disease associations are initially evaluated using simulated datasets. Typical evaluations vary constraints that influence the correct detection of underlying models (i.e. number of loci, heritability, and minor allele frequency). Such studies neglect to account for model architecture (i.e. the unique specification and arrangement of penetrance values comprising the genetic model), which alone can influence the detectability of a model. In order to design a simulation study which efficiently takes architecture into account, a reliable metric is needed for model selection.

Results

We evaluate three metrics, derived from previous works, as predictors of relative model detection difficulty: (1) penetrance table variance (PTV), (2) customized odds ratio (COR), and (3) our own Ease of Detection Measure (EDM), calculated from the penetrance values and respective genotype frequencies of each simulated genetic model. We evaluate the reliability of these metrics across three very different data search algorithms, each with the capacity to detect epistatic interactions. We find that a model's EDM and COR are each stronger predictors of model detection success than heritability.
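To make the quantities named above concrete, the following is a minimal sketch of how prevalence, heritability, and a simple penetrance table variance could be computed from a penetrance table and its genotype frequencies. It is not the GAMETES implementation. It assumes Hardy-Weinberg equilibrium, independent loci, the commonly used heritability formula (frequency-weighted variance of penetrance divided by K(1 - K), where K is prevalence), and an unweighted variance of the table entries as a stand-in for PTV. The function names and the example table are hypothetical, and EDM and COR are deliberately not reproduced because their definitions are not given here.

```python
from itertools import product


def hardy_weinberg(maf):
    """Genotype frequencies (AA, Aa, aa) at one locus, assuming Hardy-Weinberg equilibrium."""
    q = maf
    p = 1.0 - q
    return [p * p, 2 * p * q, q * q]


def genotype_frequencies(mafs):
    """Joint genotype frequencies for several loci, assuming the loci are independent."""
    per_locus = [hardy_weinberg(m) for m in mafs]
    freqs = {}
    for combo in product(range(3), repeat=len(mafs)):
        f = 1.0
        for locus, genotype in enumerate(combo):
            f *= per_locus[locus][genotype]
        freqs[combo] = f
    return freqs


def prevalence(penetrance, freqs):
    """Population prevalence K: frequency-weighted mean penetrance."""
    return sum(freqs[g] * penetrance[g] for g in freqs)


def heritability(penetrance, freqs):
    """Frequency-weighted variance of penetrance divided by K(1 - K)."""
    k = prevalence(penetrance, freqs)
    weighted_var = sum(freqs[g] * (penetrance[g] - k) ** 2 for g in freqs)
    return weighted_var / (k * (1.0 - k))


def penetrance_table_variance(penetrance):
    """Unweighted variance of the table entries (an assumed, simplified PTV)."""
    values = list(penetrance.values())
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Hypothetical two-locus, XOR-like penetrance table with both MAFs at 0.5.
table = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]
penetrance = {(i, j): table[i][j] for i in range(3) for j in range(3)}
freqs = genotype_frequencies([0.5, 0.5])

k = prevalence(penetrance, freqs)
h2 = heritability(penetrance, freqs)
ptv = penetrance_table_variance(penetrance)
print(f"K={k:.4f}  heritability={h2:.4f}  PTV={ptv:.6f}")
```

Two tables with the same heritability can still differ in how their penetrance values are arranged, which is the architecture effect the abstract describes.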

Conclusions

This study formally identifies and evaluates metrics which quantify model detection difficulty. We utilize these metrics to intelligently select models from a population of potential architectures. This allows for an improved simulation study design which accounts for differences in detection difficulty attributed to model architecture. We implement the calculation and utilization of EDM and COR into GAMETES, an algorithm which rapidly and precisely generates pure, strict, n-locus epistatic models.

SUBMITTER: Urbanowicz RJ

PROVIDER: S-EPMC3549792 | biostudies-literature | 2012 Sep

REPOSITORIES: biostudies-literature

Publications

Predicting the difficulty of pure, strict, epistatic models: metrics for simulated model selection.

Urbanowicz RJ, Kiralis J, Fisher JM, Moore JH

BioData Mining, 2012-09-26, 1

PMID: 23014095

Similar Datasets

Project description:

Background

The statistical genetics phenomenon of epistasis is widely acknowledged to confound disease etiology. In order to evaluate strategies for detecting these complex multi-locus disease associations, simulation studies are required. The development of the GAMETES software for the generation of complex genetic models has provided the means to randomly generate an architecturally diverse population of epistatic models that are both pure and strict, i.e. all n loci, but no fewer, are predictive of phenotype. Previous theoretical work characterizing complex genetic models has yet to examine pure, strict epistasis, which should be the most challenging to detect. This study addresses three goals: (1) classify and characterize pure, strict, two-locus epistatic models, (2) investigate the effect of model 'architecture' on detection difficulty, and (3) explore how adjusting GAMETES constraints influences diversity in the generated models.

Results

In this study we utilized a geometric approach to classify pure, strict, two-locus epistatic models by "shape". In total, 33 unique shape symmetry classes were identified. Using a detection difficulty metric, we found that model shape was consistently a significant predictor of model detection difficulty. Additionally, after categorizing shape classes by the number of edges in their shape projections, we found that this edge number was also significantly predictive of detection difficulty. Analysis of constraints within GAMETES indicated that increasing model population size can expand model class coverage but does little to change the range of observed difficulty metric scores. A variable population prevalence significantly increased the range of observed difficulty metric scores and, for certain constraints, also improved model class coverage.

Conclusions

These analyses further our theoretical understanding of epistatic relationships and uncover guidelines for the effective generation of complex models using GAMETES. Specifically, (1) we have characterized 33 shape classes by edge number, detection difficulty, and observed frequency, (2) our results support the claim that model architecture directly influences detection difficulty, and (3) we found that GAMETES will generate a maximally diverse set of models with a variable population prevalence and a larger model population size. However, a model population size as small as 1,000 is likely to be sufficient.
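As an illustration of what "pure" means for a two-locus model, the sketch below checks that neither locus is predictive of phenotype on its own, i.e. that every single-locus marginal penetrance equals the population prevalence. It is a hedged example rather than the GAMETES procedure: it assumes Hardy-Weinberg equilibrium and independent loci, and the function names, tolerance, and example tables are hypothetical.

```python
def hardy_weinberg(maf):
    """Genotype frequencies (AA, Aa, aa) at one locus, assuming Hardy-Weinberg equilibrium."""
    q = maf
    p = 1.0 - q
    return [p * p, 2 * p * q, q * q]


def marginal_penetrances(table, maf_a, maf_b):
    """Marginal penetrance per genotype of locus A (rows) and of locus B (columns)."""
    fa = hardy_weinberg(maf_a)
    fb = hardy_weinberg(maf_b)
    rows = [sum(fb[j] * table[i][j] for j in range(3)) for i in range(3)]
    cols = [sum(fa[i] * table[i][j] for i in range(3)) for j in range(3)]
    return rows, cols


def is_pure(table, maf_a, maf_b, tol=1e-9):
    """True if no single locus shifts disease risk away from the prevalence K."""
    fa = hardy_weinberg(maf_a)
    rows, cols = marginal_penetrances(table, maf_a, maf_b)
    k = sum(fa[i] * rows[i] for i in range(3))  # population prevalence
    return all(abs(m - k) <= tol for m in rows + cols)


# Hypothetical XOR-like table: risk depends only on the genotype combination.
xor_like = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]
# Hypothetical table where locus A alone determines risk (a main effect).
main_effect = [
    [0.1, 0.1, 0.1],
    [0.2, 0.2, 0.2],
    [0.3, 0.3, 0.3],
]

print(is_pure(xor_like, 0.5, 0.5))
print(is_pure(main_effect, 0.5, 0.5))
```

For models with more than two loci, checking strictness would additionally require testing every proper multi-locus subset, which this sketch does not attempt.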

Project description:

Background

Item response theory (IRT; Lord & Novick, 1968) is a psychometric framework that can be used to model the likelihood that an individual will respond correctly to an item. Using archival data (Mirman et al., 2010), Fergadiotis, Kellough, and Hula (2015) estimated difficulty parameters for the Philadelphia Naming Test (PNT; Roach, Schwartz, Martin, Grewal, & Brecher, 1996) using the 1-parameter logistic IRT model. Although the use of IRT in test development is advantageous, its reliance on sample sizes exceeding 200 participants makes it difficult to implement in aphasiology. Therefore, alternate means of estimating the item difficulty of confrontation naming test items warrant investigation. In a preliminary study aimed at automatic item calibration, Swiderski, Fergadiotis, and Hula (2016) regressed the difficulty parameters from the PNT on word length, age of acquisition (Kuperman et al., 2012), lexical frequency as quantified by the Log10CD index (Brysbaert & New, 2009), and naming latency (Székely et al., 2003). Although this model successfully explained a substantial proportion of variance in the PNT difficulty parameters, a substantial proportion (20%) of the response time data were missing. Further, only 39% of the picture stimuli from Székely and colleagues (2003) were identical to those on the PNT. Given that the IRT sample size requirements limit traditional calibration approaches in aphasiology and that the initial attempts at predicting IRT difficulty parameters in our pilot study were based on incomplete response time data, this study has two specific aims.

Aims

To estimate naming latencies for the 175 items on the PNT, and to assess the utility of psycholinguistic variables and naming latencies for predicting item difficulty.

Methods and procedures

Using a speeded picture naming task, we estimated mean naming latencies for the 175 items of the Philadelphia Naming Test in 44 cognitively healthy adults. We then re-estimated the model reported by Swiderski et al. (2016) with the new naming latency data.

Results

The predictor variables described above accounted for a substantial proportion of the variance in the item difficulty parameters (adjusted R² = .692).

Conclusions

In this study we demonstrated that word length, age of acquisition, lexical frequency, and naming latency from neurotypical young adults usefully predict picture naming item difficulty in people with aphasia. These variables are readily available or easily obtained, and the regression model reported may be useful for estimating confrontation naming item difficulty without the need to collect response data from large samples of people with aphasia.
