Dataset Information

Power of data mining methods to detect genetic associations and interactions.

ABSTRACT: Genetic association studies, thus far, have focused on the analysis of individual main effects of SNP markers. Nonetheless, there is a clear need for modeling epistasis or gene-gene interactions to better understand the biologic basis of existing associations. Tree-based methods have been widely studied as tools for building prediction models based on complex variable interactions. An understanding of the power of such methods for the discovery of genetic associations in the presence of complex interactions is of great importance. Here, we systematically evaluate the power of three leading algorithms: random forests (RF), Monte Carlo logic regression (MCLR), and multifactor dimensionality reduction (MDR). We use the algorithm-specific variable importance measures (VIMs) as statistics and employ permutation-based resampling to generate the null distribution and associated p values. The power of the three is assessed via simulation studies. Additionally, in a data analysis, we evaluate the associations between individual SNPs in pro-inflammatory and immunoregulatory genes and the risk of non-Hodgkin lymphoma. The power of RF is highest in all simulation models, that of MCLR is similar to RF in half, and that of MDR is consistently the lowest. Our study indicates that the power of RF VIMs is most reliable. However, in addition to tuning parameters, the power of RF is notably influenced by the type of variable (continuous vs. categorical) and the chosen VIM.
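
The permutation-based resampling mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `importance` function below is a hypothetical stand-in statistic (absolute difference in mean genotype between cases and controls), used here in place of an RF, MCLR, or MDR variable importance measure, and the function names, `n_perm`, and `seed` are assumptions made for illustration. The sketch only shows the general idea: shuffle the outcome labels to break any association, recompute the statistic each time to build a null distribution, and report the fraction of permuted statistics at least as large as the observed one.

```python
import random


def importance(genotypes, labels):
    """Hypothetical stand-in for a variable importance measure (NOT an RF VIM):
    absolute difference in mean genotype between cases (1) and controls (0)."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    return abs(sum(cases) / len(cases) - sum(controls) / len(controls))


def permutation_p_value(genotypes, labels, n_perm=1000, seed=0):
    """Estimate a p value for one SNP by permuting the outcome labels."""
    rng = random.Random(seed)
    observed = importance(genotypes, labels)
    shuffled = list(labels)  # copy so the caller's labels are not modified
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-outcome association
        if importance(genotypes, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)


# Toy data: 20 cases followed by 20 controls.
labels = [1] * 20 + [0] * 20
associated_snp = [2] * 20 + [0] * 20   # genotype tracks case status
unrelated_snp = [0, 1] * 20            # same genotype mix in both groups

p_associated = permutation_p_value(associated_snp, labels)
p_unrelated = permutation_p_value(unrelated_snp, labels)
```

In a power study of the kind described, a procedure like this would be repeated over many simulated datasets, with power estimated as the proportion of datasets in which the p value falls below the chosen significance threshold.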

SUBMITTER: Molinaro AM

PROVIDER: S-EPMC3222116 | biostudies-literature | 2011

REPOSITORIES: biostudies-literature

Publications

Power of data mining methods to detect genetic associations and interactions.

Molinaro AM, Carriero N, Bjornson R, Hartge P, Rothman N, Chatterjee N

Human Heredity, 2011-09-17, issue 2

PMID: 21934324

Similar Datasets

Project description: Background: There is growing interest in examining the simultaneous effects of multiple exposures and, more generally, the effects of mixtures of exposures, as part of the exposome concept (being defined as the totality of human environmental exposures from conception onwards). Uncovering such combined effects is challenging owing to the large number of exposures, several of them being highly correlated. We performed a simulation study in an exposome context to compare the performance of several statistical methods that have been proposed to detect statistical interactions. Methods: Simulations were based on an exposome including 237 exposures with a realistic correlation structure. We considered several statistical regression-based methods, including two-step Environment-Wide Association Study (EWAS2), the Deletion/Substitution/Addition (DSA) algorithm, the Least Absolute Shrinkage and Selection Operator (LASSO), Group-Lasso INTERaction-NET (GLINTERNET), a three-step method based on regression trees and finally Boosted Regression Trees (BRT). We assessed the performance of each method in terms of model size, predictive ability, sensitivity and false discovery rate. Results: GLINTERNET and DSA had better overall performance than the other methods, with GLINTERNET having better properties in terms of selecting the true predictors (sensitivity) and of predictive ability, while DSA had a lower number of false positives. In terms of ability to capture interaction terms, GLINTERNET and DSA had again the best performances, with the same trade-off between sensitivity and false discovery proportion. When GLINTERNET and DSA failed to select an exposure truly associated with the outcome, they tended to select a highly correlated one. When interactions were not present in the data, using variable selection methods that allowed for interactions had only slight costs in performance compared to methods that only searched for main effects. Conclusions: GLINTERNET and DSA provided better performance in detecting two-way interactions, compared to other existing methods.
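
The sensitivity and false discovery proportion used to compare the methods above can be computed from a selected variable set and the set of true predictors. The following is a small sketch under stated assumptions, not code from the study: the function name `selection_metrics` and the exposure labels are hypothetical, and the conventions for empty sets (NaN sensitivity when there are no true predictors, zero false discovery proportion when nothing is selected) are choices made here for illustration.

```python
def selection_metrics(selected, true_predictors):
    """Return (sensitivity, false discovery proportion) for one simulated dataset.

    sensitivity = true positives / number of true predictors
    false discovery proportion = false positives / number of selected variables
    """
    selected_set = set(selected)
    truth = set(true_predictors)
    true_pos = len(selected_set & truth)
    # Assumed convention: sensitivity is undefined (NaN) with no true predictors.
    sensitivity = true_pos / len(truth) if truth else float("nan")
    # Assumed convention: an empty selection makes no false discoveries.
    fdp = (len(selected_set) - true_pos) / len(selected_set) if selected_set else 0.0
    return sensitivity, fdp


# Hypothetical example: 4 true exposures, 3 selected, 2 of them correct.
sens, fdp = selection_metrics(["x1", "x2", "x9"], ["x1", "x2", "x3", "x4"])
```

Averaging the false discovery proportion over many simulated datasets gives an estimate of the false discovery rate.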

Project description: To dissect common human diseases such as obesity and diabetes, a systematic approach is needed to study how genes interact with one another, and with genetic and environmental factors, to determine clinical end points or disease phenotypes. Bayesian networks provide a convenient framework for extracting relationships from noisy data and are frequently applied to large-scale data to derive causal relationships among variables of interest. Given the complexity of molecular networks underlying common human disease traits, and the fact that biological networks can change depending on environmental conditions and genetic factors, large datasets, generally involving multiple perturbations (experiments), are required to reconstruct and reliably extract information from these networks. With limited resources, the balance of coverage of multiple perturbations and multiple subjects in a single perturbation needs to be considered in the experimental design. Increasing the number of experiments, or the number of subjects in an experiment, is an expensive and time-consuming way to improve network reconstruction. Integrating multiple types of data from existing subjects might be more efficient. For example, it has recently been demonstrated that combining genotypic and gene expression data in a segregating population leads to improved network reconstruction, which in turn may lead to better predictions of the effects of experimental perturbations on any given gene. Here we simulate data based on networks reconstructed from biological data collected in a segregating mouse population and quantify the improvement in network reconstruction achieved using genotypic and gene expression data, compared with reconstruction using gene expression data alone. We demonstrate that networks reconstructed using the combined genotypic and gene expression data achieve a level of reconstruction accuracy that exceeds networks reconstructed from expression data alone, and that fewer subjects may be required to achieve this superior reconstruction accuracy. We conclude that this integrative genomics approach to reconstructing networks not only leads to more predictive network models, but also may save time and money by decreasing the amount of data that must be generated under any given condition of interest to construct predictive network models.

Project description: BACKGROUND: The local connectivity and global position of a protein in a protein interaction network are known to correlate with some of its functional properties, including its essentiality or dispensability. It is therefore of interest to extend this observation and examine whether network properties of two proteins considered simultaneously can determine their joint dispensability, i.e., their propensity for synthetic sick/lethal interaction. Accordingly, we examine the predictive power of protein interaction networks for synthetic genetic interaction in Saccharomyces cerevisiae, an organism in which high confidence protein interaction networks are available and synthetic sick/lethal gene pairs have been extensively identified. RESULTS: We design a support vector machine system that uses graph-theoretic properties of two proteins in a protein interaction network as input features for prediction of synthetic sick/lethal interactions. The system is trained on interacting and non-interacting gene pairs culled from large scale genetic screens as well as literature-curated data. We find that the method is capable of predicting synthetic genetic interactions with sensitivity and specificity both exceeding 85%. We further find that the prediction performance is reasonably robust with respect to errors in the protein interaction network and with respect to changes in the features of test datasets. Using the prediction system, we carried out novel predictions of synthetic sick/lethal gene pairs at a genome-wide scale. These pairs appear to have functional properties that are similar to those that characterize the known synthetic lethal gene pairs. CONCLUSION: Our analysis shows that protein interaction networks can be used to predict synthetic lethal interactions with accuracies on par with or exceeding that of other computational methods that use a variety of input features, including functional annotations. This indicates that protein interaction networks could plausibly be rich sources of information about epistatic effects among genes.

Project description: Uncovering the roles of biotic interactions in assembling and maintaining species-rich communities remains a major challenge in ecology. In plant communities, interactions between individuals of different species are expected to generate positive or negative spatial interspecific associations over short distances. Recent studies using individual-based point pattern datasets have concluded that (a) detectable interspecific interactions are generally rare, but (b) are most common in communities with fewer species; and (c) the most abundant species tend to have the highest frequency of interactions. However, it is unclear how the detection of spatial interactions may change with the abundances of each species, or the scale and intensity of interactions. We ask if statistical power is sufficient to explain all three key results. We use a simple two-species model, assuming no habitat associations, and where the abundances, scale and intensity of interactions are controlled to simulate point pattern data. In combination with an approximation to the variance of the spatial summary statistics that we sample, we investigate the power of current spatial point pattern methods to correctly reject the null model of pairwise species independence. We show the power to detect interactions is positively related to both the abundances of the species tested, and the intensity and scale of interactions, but negatively related to imbalance in abundances. Differences in detection power in combination with the abundance distributions found in natural communities are sufficient to explain all the three key empirical results, even if all pairwise interactions are identical. Critically, many hundreds of individuals of both species may be required to detect even intense interactions, implying current abundance thresholds for including species in the analyses are too low. Synthesis. The widespread failure to reject the null model of spatial interspecific independence could be due to low power of the tests rather than any key biological process. Since we do not model habitat associations, our results represent a first step in quantifying sample sizes required to make strong statements about the role of biotic interactions in diverse plant communities. However, power should be factored into analyses and considered when designing empirical studies.
