Dataset Information

Application of two machine learning algorithms to genetic association studies in the presence of covariates.

ABSTRACT:

Background

Population-based investigations aimed at uncovering genotype-trait associations often involve high-dimensional genetic polymorphism data as well as information on multiple environmental and clinical parameters. Machine learning (ML) algorithms offer a straightforward analytic approach for selecting subsets of these inputs that are most predictive of a pre-defined trait. The performance of these algorithms, however, in the presence of covariates is not well characterized.

Methods and results

In this manuscript, we investigate two approaches: Random Forests (RFs) and Multivariate Adaptive Regression Splines (MARS). Through multiple simulation studies, the performance under several underlying models is evaluated. An application to a cohort of HIV-1 infected individuals receiving anti-retroviral therapies is also provided.

Conclusion

Consistent with more traditional regression modeling theory, our findings highlight the importance of considering the nature of underlying gene-covariate-trait relationships before applying ML algorithms, particularly when there is potential confounding or effect mediation.

SUBMITTER: Nonyane BA

PROVIDER: S-EPMC2620353 | biostudies-literature | 2008 Nov

REPOSITORIES: biostudies-literature

Publications

Application of two machine learning algorithms to genetic association studies in the presence of covariates.

Nonyane, Bareng A S; Foulkes, Andrea S

BMC Genetics, 2008-11-14

PMID: 19014573

Similar Datasets

Project description:Magnetic resonance imaging (MRI) allows non-invasive evaluation of inflammatory bowel disease (IBD) by assessing pathologically altered gut. Besides morphological changes, relaxation times and diffusion capacity of involved bowel segments can be obtained by MRI. The aim of this study was to assess the use of multiparametric MRI in the diagnosis of experimentally induced colitis in mice, and evaluate the diagnostic benefit of parameter combinations using machine learning. This study relied on colitis induction by Dextran Sodium Sulfate (DSS) and investigated the colon of mice in vivo as well as ex vivo. Receiver Operating Characteristics were used to calculate sensitivity, specificity, positive- and negative-predictive values (PPV and NPV) of these single values in detecting DSS-treatment as a reference condition. A Model Averaged Neural Network (avNNet) was trained on the multiparametric combination of the measured values, and its predictive capacity was compared to those of the single parameters using exact binomial tests. Within the in vivo subgroup (n = 19), the avNNet featured a sensitivity of 91.3% (95% CI: 86.6-96.0%), specificity of 92.3% (95% CI: 85.1-99.6%), PPV of 96.9% (94.0-99.9%) and NPV of 80.0% (95% CI: 69.9-90.1%), significantly outperforming all single parameters in at least 2 accuracy measures (p < 0.003) and performing significantly worse compared to none of the single values. Within the ex vivo subgroup (n = 30), the avNNet featured a sensitivity of 87.4% (95% CI: 82.6-92.2%), specificity of 82.9% (95% CI: 76.1-89.7%), PPV of 88.9% (84.3-93.5%) and NPV of 80.8% (95% CI: 73.8-87.9%), significantly outperforming all single parameters in at least 2 accuracy measures (p < 0.015), exceeded by none of the single parameters. In experimental mouse colitis, multiparametric MRI and the combination of several single measured values to an avNNet can significantly increase diagnostic accuracy compared to the single parameters alone. 
This pilot study will provide new avenues for the development of an MR-derived colitis score for optimized diagnosis and surveillance of inflammatory bowel disease.

Project description:Background: Machine learning methods and conventions are increasingly employed for the analysis of large, complex biomedical data sets, including genome-wide association studies (GWAS). Reproducibility of machine learning analyses of GWAS can be hampered by biological and statistical factors, particularly so for the investigation of non-additive genetic interactions. Application of traditional cross validation to a GWAS data set may result in poor consistency between the training and testing data set splits due to an imbalance of the interaction genotypes relative to the data as a whole. We propose a new cross validation method, proportional instance cross validation (PICV), that preserves the original distribution of an independent variable when splitting the data set into training and testing partitions. Results: We apply PICV to simulated GWAS data with epistatic interactions of varying minor allele frequencies and prevalences and compare performance to that of a traditional cross validation procedure in which individuals are randomly allocated to training and testing partitions. Sensitivity and positive predictive value are significantly improved across all tested scenarios for PICV compared to traditional cross validation. We also apply PICV to GWAS data from a study of primary open-angle glaucoma to investigate a previously-reported interaction, which fails to significantly replicate; PICV however improves the consistency of testing and training results. Conclusions: Application of traditional machine learning procedures to biomedical data may require modifications to better suit intrinsic characteristics of the data, such as the potential for highly imbalanced genotype distributions in the case of epistasis detection. The reproducibility of genetic interaction findings can be improved by considering this variable imbalance in cross validation implementation, such as with PICV. This approach may be extended to problems in other domains in which imbalanced variable distributions are a concern.
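The idea of preserving a variable's distribution across cross validation partitions can be illustrated with a minimal sketch. This is not the authors' PICV implementation: the function name `proportional_folds`, the toy genotype labels, and the round-robin dealing strategy are assumptions made purely for illustration. The sketch assigns samples to folds so that each fold keeps roughly the same distribution of one categorical variable as the full data set.

```python
import random
from collections import defaultdict


def proportional_folds(values, k, seed=0):
    """Split sample indices into k folds, keeping the distribution of
    `values` (a hypothetical categorical variable, e.g. an interaction
    genotype) roughly equal across folds."""
    rng = random.Random(seed)

    # Group sample indices by the value of the variable to be balanced.
    by_value = defaultdict(list)
    for idx, value in enumerate(values):
        by_value[value].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0  # carries over between groups so leftovers spread evenly
    for value in sorted(by_value):
        members = by_value[value]
        rng.shuffle(members)  # random allocation within each group
        for j, idx in enumerate(members):
            folds[(offset + j) % k].append(idx)
        offset += len(members)
    return folds


# Toy, made-up data: a rare genotype class that a purely random split
# could easily leave out of some folds.
genotypes = ["AA"] * 80 + ["Aa"] * 15 + ["aa"] * 5
folds = proportional_folds(genotypes, k=5)
```

Each fold can then serve once as the testing partition while the remaining folds form the training partition. Dealing each group round-robin guarantees that per-fold counts of any class differ by at most one, which is one simple way to avoid folds that lack a rare class entirely.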

Project description:Recent substantial advances in high-throughput field phenotyping have provided plant breeders with affordable and efficient tools for evaluating a large number of genotypes for important agronomic traits at early growth stages. Nevertheless, using the large datasets generated by high-throughput phenotyping tools such as hyperspectral reflectance in cultivar development programs is still challenging, because it requires intensive knowledge of computational and statistical analyses. In this study, the robustness of three common machine learning (ML) algorithms, multilayer perceptron (MLP), support vector machine (SVM), and random forest (RF), was evaluated for predicting soybean (Glycine max) seed yield using hyperspectral reflectance. To this end, hyperspectral reflectance data covering the whole spectrum from 395 to 1005 nm were collected at the R4 and R5 growth stages on 250 soybean genotypes grown in four environments. The recursive feature elimination (RFE) approach was used to reduce the dimensionality of the hyperspectral reflectance data and select the variables with the largest importance values. The results indicated that R5 is the more informative stage for measuring hyperspectral reflectance to predict seed yield. The 395 nm reflectance band was also identified as a highly ranked band for predicting soybean seed yield. Using either the full or the selected variables as inputs, the ML algorithms were evaluated individually and in combination, via the ensemble-stacking (E-S) method, for predicting soybean yield. The RF algorithm had the highest performance among the individually tested algorithms, with a yield classification accuracy of 84%. When RF was therefore selected as the metaClassifier for the E-S method, the prediction accuracy increased to 0.93 using all variables and 0.87 using selected variables, showing the success of E-S as an ensemble technique. This study demonstrated that soybean breeders could implement the E-S algorithm, using either the full or the selected spectral reflectance, to select high-yielding soybean genotypes from among a large number of genotypes at early growth stages.

