
Dataset Information


Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections.


ABSTRACT:

Background

The use of big data is growing steadily in epidemiology. We explored the performance of methods dedicated to big data analysis for detecting independent associations between exposures and a health outcome.

Methods

We searched for associations between 303 covariates and influenza infection in 498 subjects (14% infected) sampled from a dedicated cohort. Independent associations were detected using four approaches: two data mining methods, Random Forests (RF) and Boosted Regression Trees (BRT); the conventional logistic regression framework (Univariate Followed by Multivariate Logistic Regression, UFMLR); and multivariate logistic regression with a Least Absolute Shrinkage and Selection Operator (LASSO) penalty to achieve a sparse selection of covariates. We developed permutation tests to assess the statistical significance of associations. We simulated 500 similarly sized datasets to estimate the true positive rate (TPR) and false positive rate (FPR) associated with these methods.
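The general idea of a permutation test for one covariate can be sketched as follows. This is a minimal illustration, not the authors' procedure: the association statistic (a difference in group means), the function names, the number of permutations, and the toy data are all assumptions made for the example. The abstract does not specify which statistic was permuted for each method.

```python
import random


def mean_difference(values, labels):
    """Mean covariate value among infected (label 1) minus uninfected (label 0)."""
    infected = [v for v, y in zip(values, labels) if y == 1]
    uninfected = [v for v, y in zip(values, labels) if y == 0]
    return sum(infected) / len(infected) - sum(uninfected) / len(uninfected)


def permutation_p_value(values, labels, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the (assumed) statistic above."""
    rng = random.Random(seed)
    observed = abs(mean_difference(values, labels))
    shuffled = list(labels)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any link between covariate and outcome
        if abs(mean_difference(values, shuffled)) >= observed:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Toy data (hypothetical): 10 infected and 40 uninfected subjects.
toy_labels = [1] * 10 + [0] * 40
toy_titers = [5.0 + 0.1 * i for i in range(10)] + [1.0 + 0.05 * i for i in range(40)]
p_value = permutation_p_value(toy_titers, toy_labels)
print(f"permutation p-value: {p_value:.4f}")
```

Shuffling the outcome labels rather than the covariate keeps the proportion of infected subjects fixed in every permuted dataset, so each permutation is comparable to the observed data.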

Results

Depending on the method, between 3 and 24 covariates (1%-8%) were identified as associated with influenza infection. The pre-seasonal haemagglutination inhibition antibody titer was the only covariate selected by all methods, while 266 (87%) covariates were not selected by any method. At the 5% nominal significance level, the TPR was 85% with RF, 80% with BRT, 26% to 49% with UFMLR, and 71% to 78% with LASSO. Conversely, the FPR was 4% with RF and BRT, 9% to 2% with UFMLR, and 9% to 4% with LASSO.
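To make the two rates concrete, the sketch below shows one way they could be computed across simulated datasets. It assumes that the TPR is the fraction of truly associated covariates that a method selects and that the FPR is the fraction of unassociated covariates it selects, averaged over datasets. The function names and the toy selections are hypothetical and do not reproduce the study's simulations.

```python
def positive_rates(selected, truly_associated, all_covariates):
    """Return (TPR, FPR) for one simulated dataset."""
    truth = set(truly_associated)
    null = set(all_covariates) - truth  # covariates with no true association
    chosen = set(selected)
    tpr = len(chosen & truth) / len(truth)
    fpr = len(chosen & null) / len(null)
    return tpr, fpr


def average_rates(selections, truly_associated, all_covariates):
    """Average TPR and FPR over the selections made on several simulated datasets."""
    rates = [positive_rates(s, truly_associated, all_covariates) for s in selections]
    mean_tpr = sum(t for t, _ in rates) / len(rates)
    mean_fpr = sum(f for _, f in rates) / len(rates)
    return mean_tpr, mean_fpr


# Toy setup (hypothetical): 10 covariates, 4 of them truly associated.
covariates = [f"x{i}" for i in range(10)]
truth = ["x0", "x1", "x2", "x3"]
selections = [
    ["x0", "x1", "x2", "x7"],              # one false positive
    ["x0", "x1"],                          # conservative selection
    ["x0", "x1", "x2", "x3", "x8", "x9"],  # all true positives, two false positives
]
mean_tpr, mean_fpr = average_rates(selections, truth, covariates)
print(f"mean TPR = {mean_tpr:.2f}, mean FPR = {mean_fpr:.2f}")
```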

Conclusions

Data mining methods and LASSO should be considered valuable methods for detecting independent associations in large epidemiologic datasets.

SUBMITTER: Mansiaux Y 

PROVIDER: S-EPMC4146451 | biostudies-literature | 2014 Aug

REPOSITORIES: biostudies-literature


Publications

Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections.

Mansiaux Y, Carrat F

BMC Medical Research Methodology, 2014 Aug 26



Similar Datasets

| S-EPMC2732298 | biostudies-literature
| S-EPMC2567351 | biostudies-literature
| S-EPMC3842118 | biostudies-other
| S-EPMC8674730 | biostudies-literature
| S-EPMC6527211 | biostudies-literature
| S-EPMC10105299 | biostudies-literature
| S-EPMC4743660 | biostudies-literature
| S-EPMC7763457 | biostudies-literature
| S-EPMC7716883 | biostudies-literature
| S-EPMC2367457 | biostudies-literature