Dataset Information

Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression.

ABSTRACT:

Background

Atheoretical large-scale data mining techniques using machine learning algorithms have promise in the analysis of large epidemiological datasets. This study illustrates the use of a hybrid methodology for variable selection that took account of missing data and complex survey design to identify key biomarkers associated with depression from a large epidemiological study.

Methods

The study used a three-step methodology amalgamating multiple imputation, a machine learning boosted regression algorithm and logistic regression, to identify key biomarkers associated with depression in the National Health and Nutrition Examination Survey (2009-2010). Depression was measured using the Patient Health Questionnaire-9 and 67 biomarkers were analysed. Covariates in this study included gender, age, race, smoking, food security, Poverty Income Ratio, Body Mass Index, physical activity, alcohol use, medical conditions and medications. The final imputed weighted multiple logistic regression model included possible confounders and moderators.
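
As an illustration of how a single regression coefficient might be combined across imputed data sets, the sketch below applies Rubin's rules, the standard pooling approach for multiple imputation. The abstract does not say which software or pooling routine the authors used, so the function name `pool_rubin` and all numbers are hypothetical, and the sketch ignores survey weights and uses a normal approximation (1.96) rather than the t-based interval.

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one coefficient across m imputed data sets with Rubin's rules.

    estimates: per-imputation point estimates (here, log-odds).
    variances: per-imputation squared standard errors.
    Returns the pooled estimate and its pooled standard error.
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations with matching variances")
    pooled = mean(estimates)        # pooled point estimate
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divides by m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total)


# Hypothetical log-odds and standard errors from five imputed data sets
# (the study created 20; five keeps the example short).
log_odds = [0.12, 0.15, 0.14, 0.13, 0.16]
std_errors = [0.06, 0.07, 0.06, 0.065, 0.07]

beta, se = pool_rubin(log_odds, [s ** 2 for s in std_errors])

# Convert the pooled log-odds to an odds ratio with an approximate 95% CI.
odds_ratio = math.exp(beta)
ci = (math.exp(beta - 1.96 * se), math.exp(beta + 1.96 * se))
print(f"pooled OR {odds_ratio:.2f}, approx. 95% CI {ci[0]:.2f} to {ci[1]:.2f}")
```

The between-imputation term is what widens the interval to reflect uncertainty about the missing values; with complete data it would be zero and the pooled variance would equal the ordinary within-model variance.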

Results

After the creation of 20 imputation data sets from multiple chained regression sequences, machine learning boosted regression initially identified 21 biomarkers associated with depression. Using traditional logistic regression methods, including controlling for possible confounders and moderators, a final set of three biomarkers was selected. The final three biomarkers from the novel hybrid variable selection methodology were red cell distribution width (OR 1.15; 95% CI 1.01, 1.30), serum glucose (OR 1.01; 95% CI 1.00, 1.01) and total bilirubin (OR 0.12; 95% CI 0.05, 0.28). Significant interactions were found between total bilirubin and the Mexican American/Hispanic group (p = 0.016), and between total bilirubin and current smokers (p<0.001).
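
The reported odds ratios are per one unit of each biomarker (the abstract does not state the units). As a rough reading aid, the sketch below rescales a per-unit odds ratio and its interval to a k-unit change, assuming the effect is linear on the log-odds scale as in a standard logistic regression. The helper name `rescale_odds_ratio` is hypothetical, and because the published figures are rounded, the rescaled values are approximate.

```python
import math


def rescale_odds_ratio(odds_ratio, ci_low, ci_high, k):
    """Rescale a per-unit odds ratio and its CI to a k-unit change.

    Uses OR_k = exp(k * ln(OR)), which holds when the predictor enters
    the logistic model linearly.
    """
    point, a, b = (math.exp(k * math.log(x)) for x in (odds_ratio, ci_low, ci_high))
    # A negative k flips the interval, so re-order the bounds.
    return point, min(a, b), max(a, b)


# Red cell distribution width: reported OR 1.15 (95% CI 1.01, 1.30) per unit.
rdw_two_units = rescale_odds_ratio(1.15, 1.01, 1.30, k=2)

# Total bilirubin: reported OR 0.12 (95% CI 0.05, 0.28) per unit.
# k=-1 expresses the same association per one-unit decrease.
bilirubin_decrease = rescale_odds_ratio(0.12, 0.05, 0.28, k=-1)

for label, (point, low, high) in [
    ("RDW, +2 units", rdw_two_units),
    ("Total bilirubin, -1 unit", bilirubin_decrease),
]:
    print(f"{label}: OR {point:.2f} ({low:.2f}, {high:.2f})")
```

Reading the bilirubin result per one-unit decrease makes the direction easier to see: an odds ratio well below 1 per unit increase corresponds to an odds ratio well above 1 per unit decrease.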

Conclusion

The systematic use of a hybrid methodology for variable selection, fusing data mining techniques using a machine learning algorithm with traditional statistical modelling, accounted for missing data and complex survey sampling methodology and was demonstrated to be a useful tool for detecting three biomarkers associated with depression for future hypothesis generation: red cell distribution width, serum glucose and total bilirubin.

SUBMITTER: Dipnall JF

PROVIDER: S-EPMC4744063 | biostudies-literature | 2016

REPOSITORIES: biostudies-literature

Publications

Fusing Data Mining, Machine Learning and Traditional Statistics to Detect Biomarkers Associated with Depression.

Dipnall JF, Pasco JA, Berk M, Williams LJ, Dodd S, Jacka FN, Meyer D

PLoS One, 2016-02-05, issue 2

PMID: 26848571

Similar Datasets

Project description:

Objective

Depression is a mental disorder characterized by persistent feelings of sadness, decreased interest or pleasure in activities and reduced energy. As a highly prevalent disorder, it seriously endangers the psychosocial functioning of patients. Many scholars have conducted clinical studies on the treatment of depression using different herbal remedies, but no study has integrated these remedies to explore the general medication rule. This study aims to explore the medication pattern of Traditional Chinese Medicine (TCM) treatment for depression through data mining methods, so as to provide a scientific theoretical basis and reference for clinical treatment and new prescription development.

Methods

Following the PRISMA principle, 121 articles involving 10,810 patients with depression who received TCM treatment were collected. We then performed frequency, association rule, and hierarchical clustering analysis of Chinese herbs using Microsoft Excel 2016, SPSS Modeler 18.0 and IBM SPSS Statistics 23.

Results

Among the 270 herbs collected, the three most frequently occurring herbs are Gancao, Chaihu, and Shaoyao. The high-frequency herbs are mainly deficiency-tonifying, Qi-regulating, and blood-activating and stasis-eliminating herbs. Through the Apriori algorithm, we mined 21 herbal groups of association rules, among which the combination of Chaihu-Shaoyao-Gancao has the highest level of support. Furthermore, five novel clustering combinations were identified, predominantly derived from Xiaoyao-San, Chaihu-Shugan-San, Sini powder, Kaixin-San and Chaihu-Jia-Longgu-Muli Decoction.

Conclusion

The current study not only summarized the frequent combinations but also identified five new drug cluster combinations for depression, which can provide evidence-based references for future clinical treatment and help in understanding the potential pharmaceutical mechanism from the properties, tastes, meridian tropisms and categories of the herbs. The clinical effectiveness of these combinations needs to be verified by future study.
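
To make the association-rule terms in the project description above concrete, here is a minimal sketch of Apriori-style frequent itemset mining with support and confidence. The study itself used SPSS Modeler; this pure-Python version, the function name `frequent_itemsets`, the five prescriptions and the placeholder herbs `OtherHerbA` and `OtherHerbB` are all made up for illustration and are not the study's data.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support.

    Works level by level: frequent k-itemsets are joined into (k+1)-item
    candidates, and a candidate is kept only if all of its k-item subsets
    were frequent (the Apriori pruning property).
    """
    n = len(transactions)
    current = [frozenset([i]) for i in sorted({i for t in transactions for i in t})]
    result = {}
    while current:
        # Count how many transactions contain each candidate.
        support = {c: sum(1 for t in transactions if c <= t) / n for c in current}
        kept = {c: s for c, s in support.items() if s >= min_support}
        result.update(kept)
        keys = list(kept)
        size = len(keys[0]) + 1 if keys else 0
        # Join step: unions of two frequent itemsets that grow by one item.
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
        # Prune step: drop candidates with any infrequent subset.
        current = [
            c for c in candidates
            if all(frozenset(s) in kept for s in combinations(c, size - 1))
        ]
    return result


# Hypothetical prescriptions, each a set of herbs.
prescriptions = [
    {"Chaihu", "Shaoyao", "Gancao"},
    {"Chaihu", "Shaoyao", "Gancao", "OtherHerbA"},
    {"Chaihu", "Gancao"},
    {"Shaoyao", "Gancao"},
    {"OtherHerbA", "OtherHerbB"},
]

itemsets = frequent_itemsets(prescriptions, min_support=0.4)

# Confidence of the rule {Chaihu, Shaoyao} -> Gancao:
# support of the full set divided by support of the antecedent.
triple = frozenset({"Chaihu", "Shaoyao", "Gancao"})
pair = frozenset({"Chaihu", "Shaoyao"})
confidence = itemsets[triple] / itemsets[pair]
print(f"support={itemsets[triple]:.2f}, confidence={confidence:.2f}")
```

Support measures how often a herb combination appears across all prescriptions, while confidence measures how often the consequent herb appears given the antecedent herbs, which is why a combination can have modest support but high confidence.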

