Dataset Information

Multi-model inference using mixed effects from a linear regression based genetic algorithm.

ABSTRACT: BACKGROUND: Different high-dimensional regression methodologies exist for the selection of variables to predict a continuous variable. To improve the variable selection in case clustered observations are present in the training data, an extension towards mixed-effects modeling (MM) is requested, but may not always be straightforward to implement. In this article, we developed such an MM extension (GA-MM-MMI) for the automated variable selection by a linear regression based genetic algorithm (GA) using multi-model inference (MMI). We exemplify our approach by training a linear regression model for prediction of resistance to the integrase inhibitor Raltegravir (RAL) on a genotype-phenotype database, with many integrase mutations as candidate covariates. The genotype-phenotype pairs in this database were derived from a limited number of subjects, with presence of multiple data points from the same subject, and with an intra-class correlation of 0.92.
RESULTS: In generation of the RAL model, we took computational efficiency into account by optimizing the GA parameters one by one, and by using tournament selection. To derive the main GA parameters we used 3 times 5-fold cross-validation. The number of integrase mutations to be used as covariates in the mixed effects models was 25 (chrom.size). A GA solution was found when R2MM > 0.95 (goal.fitness). We tested three different MMI approaches to combine the results of 100 GA solutions into one GA-MM-MMI model. When evaluating the GA-MM-MMI performance on two unseen data sets, a more parsimonious and interpretable model was found (GA-MM-MMI TOP18: mixed-effects model containing the 18 most prevalent mutations in the GA solutions, refitted on the training data) with better predictive accuracy (R2) in comparison to GA-ordinary least squares (GA-OLS) and Least Absolute Shrinkage and Selection Operator (LASSO).
CONCLUSIONS: We have demonstrated improved performance when using GA-MM-MMI for selection of mutations on a genotype-phenotype data set. As we largely automated setting the GA parameters, the method should be applicable to similar datasets with clustered observations.
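
ILLUSTRATION: The abstract describes the TOP18 model as containing the 18 most prevalent mutations across the GA solutions. A minimal sketch of that prevalence ranking step is shown below. It is not the authors' code: the function name `top_k_prevalent`, the mutation labels, and the alphabetical tie-breaking rule are all assumptions made for illustration, and the final refit of a mixed-effects model on the training data is not shown.

```python
from collections import Counter
from typing import Iterable, List, Set


def top_k_prevalent(solutions: Iterable[Set[str]], k: int) -> List[str]:
    """Return the k covariates that occur in the most GA solutions.

    Ties are broken alphabetically; this rule is an assumption made
    here for determinism and is not taken from the article.
    """
    counts: Counter = Counter()
    for solution in solutions:
        # Count each covariate at most once per solution.
        counts.update(set(solution))
    # Sort by descending prevalence, then by name for stable ties.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Toy input: three hypothetical GA solutions with made-up mutation labels.
toy_solutions = [
    {"mutA", "mutB", "mutC"},
    {"mutA", "mutB", "mutD"},
    {"mutA", "mutE", "mutC"},
]
top2 = top_k_prevalent(toy_solutions, k=2)
print(top2)
```

In the setting of the abstract, `solutions` would hold the 100 GA solutions (each with 25 mutations) and `k` would be 18; the selected covariates would then be refitted in a mixed-effects model with a random effect for subject.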

SUBMITTER: Van der Borght K

PROVIDER: S-EPMC3987104 | biostudies-literature | 2014

REPOSITORIES: biostudies-literature

Publications

Multi-model inference using mixed effects from a linear regression based genetic algorithm.

Van der Borght Koen, Verbeke Geert, van Vlijmen Herman

BMC Bioinformatics, 2014-03-27

PMID: 24669828

Similar Datasets

Project description: BACKGROUND: Self-contained tests estimate and test the association between a phenotype and mean expression level in a gene set defined a priori. Many self-contained gene set analysis methods have been developed but the performance of these methods for phenotypes that are continuous rather than discrete and with multiple nuisance covariates has not been well studied. Here, I use Monte Carlo simulation to evaluate the performance of both novel and previously published (and readily available via R) methods for inferring effects of a continuous predictor on mean expression in the presence of nuisance covariates. The motivating data are a high-profile dataset which was used to show opposing effects of hedonic and eudaimonic well-being (or happiness) on the mean expression level of a set of genes that has been correlated with social adversity (the CTRA gene set). The original analysis of these data used a linear model (GLS) of fixed effects with correlated error to infer effects of Hedonia and Eudaimonia on mean CTRA expression. METHODS: The standardized effects of Hedonia and Eudaimonia on CTRA gene set expression estimated by GLS were compared to estimates using multivariate (OLS) linear models and generalized estimating equation (GEE) models. The OLS estimates were tested using O'Brien's OLS test, Anderson's permutation [Formula: see text]-test, two permutation F-tests (including GlobalAncova), and a rotation z-test (Roast). The GEE estimates were tested using a Wald test with robust standard errors. The performance (Type I, II, S, and M errors) of all tests was investigated using a Monte Carlo simulation of data explicitly modeled on the re-analyzed dataset. RESULTS: GLS estimates are inconsistent between data sets, and, in each dataset, at least one coefficient is large and highly statistically significant. By contrast, effects estimated by OLS or GEE are very small, especially relative to the standard errors.
Bootstrap and permutation GLS distributions suggest that the GLS results in downward biased standard errors and inflated coefficients. The Monte Carlo simulation of error rates shows highly inflated Type I error from the GLS test and slightly inflated Type I error from the GEE test. By contrast, Type I error for all OLS tests is at the nominal level. The permutation F-tests have ~1.9X the power of the other OLS tests. This increased power comes at a cost of high sign error (~10%) if tested on small effects. DISCUSSION: The apparently replicated pattern of well-being effects on gene expression is most parsimoniously explained as "correlated noise" due to the geometry of multiple regression. The GLS for fixed effects with correlated error, or any linear mixed model for estimating fixed effects in designs with many repeated measures or outcomes, should be used cautiously because of the inflated Type I and M error. By contrast, all OLS tests perform well, and the permutation F-tests have superior performance, including moderate power for very small effects.

Project description: Context or problem: Quantification of nutrient concentrations in rice grain is essential for evaluating nutrient uptake, use efficiency, and balance to develop fertilizer recommendation guidelines. Accurate estimation of nutrient concentrations without relying on plant laboratory analysis is needed in sub-Saharan Africa (SSA), where farmers do not generally have access to laboratories. Objective or research question: The objectives are to 1) examine if the concentrations of macro- (N, P, K, Ca, Mg, S) and micronutrients (Fe, Mn, B, Cu) in rice grain can be estimated using agro-ecological zones (AEZ), production systems, soil properties, and mineral fertilizer application (N, P, and K) rates as predictor variables, and 2) to identify if nutrient uptakes estimated by best-fitted models with above variables provide improved prediction of actual nutrient uptakes (predicted nutrient concentrations x grain yield) compared to average-based uptakes (average nutrient concentrations in SSA x grain yield). Methods: Cross-sectional data from 998 farmers' fields across 20 countries across 4 AEZs (arid/semi-arid, humid, sub-humid, and highlands) in SSA and 3 different production systems: irrigated lowland, rainfed lowland, and rainfed upland were used to test hypotheses of nutrient concentration being estimable with a set of predictor variables among above-cited factors using linear mixed-effects regression models. Results: All 10 nutrients were reasonably predicted [Nakagawa's R2 ranging from 0.27 (Ca) to 0.79 (B), and modeling efficiency ranging from 0.178 (Ca) to 0.584 (B)]. However, only the estimation of K and B concentrations was satisfactory with a modeling efficiency superior to 0.5. The country variable contributed more to the variation of concentrations of these nutrients than AEZ and production systems in our best predictive models.
There were greater positive relationships (up to 0.18 of difference in correlation coefficient R) between actual nutrient uptakes and model estimation-based uptakes than those between actual nutrient uptakes and average-based uptakes. Nevertheless, only the estimation of B uptake had significant improvement among all nutrients investigated. Conclusions: Our findings suggest that with the exception of B associated with high model EF and an improved uptake over the average-based uptake, estimates of the macronutrient and micronutrient uptakes in rice grain can be obtained simply by using average concentrations of each nutrient at the regional scale for SSA. Implications: Further investigation of other factors such as the timing of fertilizer applications, rice variety, occurrence of drought periods, and atmospheric CO2 concentration is warranted for improved prediction accuracy of nutrient concentrations.
