Dataset Information

A comparison of machine learning methods for classification using simulation with multiple real data examples from mental health studies.

ABSTRACT:

Background

Recent literature on the comparison of machine learning methods has raised questions about the neutrality, unbiasedness and utility of many comparative studies. Reporting results on favourable datasets and sampling error in performance measures estimated from single samples are thought to be the major sources of bias in such comparisons. Better performance in one or a few instances does not necessarily imply better performance on average or at the population level, so simulation studies may be a better alternative for objectively comparing the performance of machine learning algorithms.

Methods

We compare the classification performance of a number of important and widely used machine learning algorithms, namely Random Forests (RF), Support Vector Machines (SVM), Linear Discriminant Analysis (LDA) and k-Nearest Neighbour (kNN). Using massively parallel processing on high-performance supercomputers, we compare the generalisation errors at various combinations of levels of several factors: number of features, training sample size, biological variation, experimental variation, effect size, replication and correlation between features.
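
The design described above can be illustrated with a much smaller sketch. This is not the authors' code: it covers only kNN, varies only the effect size, and assumes independent unit-variance Gaussian features with class 1 shifted by the effect size in every feature. All function names and default values are hypothetical. It shows the core idea of averaging the test error over many simulated datasets and reporting its spread, rather than relying on a single sample.

```python
import random
import statistics


def simulate(n_per_class, n_features, effect_size, rng):
    """Two-class data with independent unit-variance Gaussian features.

    Assumption (not from the paper): class 1 is shifted by effect_size
    in every feature; class 0 is centred at zero.
    """
    data = []
    for label in (0, 1):
        for _ in range(n_per_class):
            x = [rng.gauss(label * effect_size, 1.0) for _ in range(n_features)]
            data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    def sq_dist(point):
        return sum((a - b) ** 2 for a, b in zip(point[0], x))

    nearest = sorted(train, key=sq_dist)[:k]
    votes_for_1 = sum(label for _, label in nearest)
    return 1 if 2 * votes_for_1 > k else 0


def error_rate(train, test, k):
    """Fraction of test points that kNN misclassifies."""
    wrong = sum(knn_predict(train, x, k) != y for x, y in test)
    return wrong / len(test)


def run_study(effect_sizes, n_replicates=20, n_train_per_class=20,
              n_test_per_class=100, n_features=5, k=5, seed=0):
    """Return {effect_size: (mean error, SD of error)} over replicates."""
    rng = random.Random(seed)
    summary = {}
    for effect in effect_sizes:
        errors = []
        for _ in range(n_replicates):
            # A fresh training and test set is drawn for every replicate.
            train = simulate(n_train_per_class, n_features, effect, rng)
            test = simulate(n_test_per_class, n_features, effect, rng)
            errors.append(error_rate(train, test, k))
        summary[effect] = (statistics.mean(errors), statistics.stdev(errors))
    return summary


if __name__ == "__main__":
    for effect, (mean_err, sd_err) in run_study([0.0, 0.5, 1.0]).items():
        print(f"effect={effect}: mean error={mean_err:.3f}, SD={sd_err:.3f}")
```

The mean over replicates estimates the average generalisation error, and the standard deviation gives a rough view of the stability of the estimate. Extending the sketch to the other factors (number of features, sample size, correlation between features) would mean adding further loops of the same shape.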

Results

For a smaller number of correlated features, with the number of features not exceeding approximately half the sample size, LDA was found to be the method of choice in terms of average generalisation error as well as stability (precision) of error estimates. SVM (with RBF kernel) outperforms LDA as well as RF and kNN by a clear margin as the feature set gets larger, provided the sample size is not too small (at least 20). The performance of kNN also improves as the number of features grows and surpasses that of LDA and RF unless the data variability is too high and/or effect sizes are too small. RF was found to outperform only kNN in some instances where the data are more variable and have smaller effect sizes, in which cases it also provides more stable error estimates than kNN and LDA. Applications to a number of real datasets supported the findings from the simulation study.

SUBMITTER: Khondoker M

PROVIDER: S-EPMC5081132 | biostudies-literature | 2016 Oct

REPOSITORIES: biostudies-literature

Publications

A comparison of machine learning methods for classification using simulation with multiple real data examples from mental health studies.

Mizanur Khondoker, Richard Dobson, Caroline Skirrow, Andrew Simmons, Daniel Stahl

Statistical Methods in Medical Research, 2013-09-18, issue 5

PMID: 24047600

Similar Datasets

Comparison of Proteomic Assessment Methods in Multiple Cohort Studies. | S-EPMC7425176 | biostudies-literature

Misclassification Errors in Unsupervised Classification Methods. Comparison Based on the Simulation of Targeted Proteomics Data. | S-EPMC4982549 | biostudies-literature

Statistical Methods for Unusual Count Data: Examples From Studies of Microchimerism. | S-EPMC5141948 | biostudies-literature

Comparison of machine learning methods for estimating case fatality ratios: An Ebola outbreak simulation study. | S-EPMC8443081 | biostudies-literature

A comparison of multiple imputation methods for missing data in longitudinal studies. | S-EPMC6292063 | biostudies-literature

Detecting Causality by Combined Use of Multiple Methods: Climate and Brain Examples. | S-EPMC4933387 | biostudies-literature

Brain simulation augments machine-learning-based classification of dementia. | S-EPMC9107774 | biostudies-literature

A comparison of gene region simulation methods. | S-EPMC3399793 | biostudies-literature

Using simulation studies to evaluate statistical methods. | S-EPMC6492164 | biostudies-literature

Real Time Influenza Monitoring Using Hospital Big Data in Combination with Machine Learning Methods: Comparison Study. | S-EPMC6320394 | biostudies-literature