Dataset Information

Careful feature selection is key in classification of Alzheimer's disease patients based on whole-genome sequencing data.

ABSTRACT: Despite the great increase in the amount of data from genome-wide association studies (GWAS) and whole-genome sequencing (WGS), the genetic background of partially heritable Alzheimer's disease (AD) is not yet fully understood. Machine learning methods are expected to help researchers analyze the large number of SNPs possibly associated with disease onset. To date, a number of such approaches have been applied to genotype-based classification of AD patients and healthy controls using GWAS data, with reported accuracies of 0.65-0.975. However, since the estimated influence of genotype on sporadic AD occurrence is lower than that, these very high classification accuracies may be a result of overfitting. We have explored the possibilities of applying feature selection and classification using random forests to WGS and GWAS data from two datasets. Our results suggest that this approach is prone to overfitting if feature selection is performed before the data are divided into training and testing sets. Therefore, we recommend against selecting the features used to build the model based on data included in the testing set. We suggest that, for currently available dataset sizes, the expected classifier performance is between 0.55 and 0.7 (AUC), and that higher accuracies reported in the literature are likely a result of overfitting.
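
The pitfall described in the abstract (selecting features before splitting the data into training and testing sets) can be illustrated with a small, purely synthetic sketch. This is not the authors' pipeline. It assumes Gaussian noise features with no real signal, a simple mean-difference filter for feature selection, and a nearest-centroid classifier scored by accuracy, in place of the random forests and AUC used in the study. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import random


def select_top_k(X, y, rows, k):
    """Rank features by absolute difference in class means, using only `rows`."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]
    scores = []
    for j in range(len(X[0])):
        mean_pos = sum(X[i][j] for i in pos) / len(pos)
        mean_neg = sum(X[i][j] for i in neg) / len(neg)
        scores.append((abs(mean_pos - mean_neg), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]


def nearest_centroid_accuracy(X, y, train, test, features):
    """Fit class centroids on `train` only, then report accuracy on `test`."""
    centroids = {}
    for label in (0, 1):
        members = [i for i in train if y[i] == label]
        centroids[label] = [
            sum(X[i][j] for i in members) / len(members) for j in features
        ]
    correct = 0
    for i in test:
        dist = {
            label: sum(
                (X[i][j] - c) ** 2 for j, c in zip(features, centroids[label])
            )
            for label in (0, 1)
        }
        predicted = 0 if dist[0] <= dist[1] else 1
        correct += predicted == y[i]
    return correct / len(test)


def run_experiment(n_samples=60, n_features=500, k=10, n_test=20,
                   repeats=20, seed=0):
    """Return mean test accuracy for (leaky, honest) feature selection."""
    rng = random.Random(seed)
    leaky, honest = [], []
    for _ in range(repeats):
        # Assumption: pure noise, so no feature is truly related to the labels.
        X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
             for _ in range(n_samples)]
        y = [i % 2 for i in range(n_samples)]  # balanced labels
        rng.shuffle(y)
        idx = list(range(n_samples))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        # Leaky: features are chosen using ALL samples, including the test set.
        leaky_feats = select_top_k(X, y, idx, k)
        # Honest: features are chosen using the training samples only.
        honest_feats = select_top_k(X, y, train, k)
        leaky.append(nearest_centroid_accuracy(X, y, train, test, leaky_feats))
        honest.append(nearest_centroid_accuracy(X, y, train, test, honest_feats))
    return sum(leaky) / repeats, sum(honest) / repeats


if __name__ == "__main__":
    leaky_acc, honest_acc = run_experiment()
    print(f"selection on all data:      {leaky_acc:.2f}")
    print(f"selection on training only: {honest_acc:.2f}")
```

Because the features are noise, an unbiased estimate of accuracy should sit near chance (0.5). In this sketch, the leaky variant is expected to score noticeably higher than the honest one even though nothing can actually be learned, which is the kind of optimistic bias the abstract attributes to selecting features on data that include the testing set.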

SUBMITTER: Osipowicz M

PROVIDER: S-EPMC8315124 | biostudies-literature | 2021 Sep

REPOSITORIES: biostudies-literature

Publications

Careful feature selection is key in classification of Alzheimer's disease patients based on whole-genome sequencing data.

Osipowicz Marlena, Wilczynski Bartek, Machnicka Magdalena A

NAR Genomics and Bioinformatics, 2021-07-27, 3

PMID: 34327330

Similar Datasets

Key variants via Alzheimer's Disease Sequencing Project whole genome sequence data. | S-EPMC10491364 | biostudies-literature

Whole Genome Mapping with Feature Sets from High-Throughput Sequencing Data. | S-EPMC5017645 | biostudies-literature

Efficient feature selection and classification for microarray data. | S-EPMC6101392 | biostudies-literature

Antimicrobial resistance genetic factor identification from whole-genome sequence data using deep feature selection. | S-EPMC6929425 | biostudies-literature

A Systematic Evaluation of Feature Selection and Classification Algorithms Using Simulated and Real miRNA Sequencing Data. | S-EPMC4609795 | biostudies-literature

Investigation of selection signatures of dairy goats using whole-genome sequencing data. | S-EPMC11899394 | biostudies-literature

Copy Number Variation Identification on 3,800 Alzheimer's Disease Whole Genome Sequencing Data from the Alzheimer's Disease Sequencing Project. | S-EPMC8599981 | biostudies-literature

A kernel-based multivariate feature selection method for microarray data classification. | S-EPMC4105478 | biostudies-literature

Interaction-based feature selection and classification for high-dimensional biological data. | S-EPMC3577111 | biostudies-literature

Voxel-Wise Feature Selection Method for CNN Binary Classification of Neuroimaging Data. | S-EPMC8093438 | biostudies-literature