Dataset Information

Clustering and variable selection in the presence of mixed variable types and missing data.

ABSTRACT: We consider the problem of model-based clustering in the presence of many correlated variables of mixed continuous and discrete types, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach, and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential in determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder on the basis of many cognitive and/or behavioral test scores. The data set contains a modest number of patients (486) and many test score variables (55), many of which are discrete valued and/or have missing values. The goal of the work is to (1) cluster these patients into groups with similar clinical presentation and (2) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. In simulations of problems of this type, the proposed approach compares very favorably with other methods. The autism spectrum disorder analysis suggested that 3 clusters were most likely, while only 4 test scores had a high (>0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.
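
As a rough illustration of two ideas named in the abstract, the sketch below draws mixture weights with a truncated stick-breaking construction (one standard way to represent a Dirichlet process) and maps a latent continuous value to an ordinal category through cutpoints. It is not the authors' implementation: the abstract does not say which construction they use, and the function names, the concentration parameter `alpha`, the truncation level, and the cutpoints are all assumptions chosen for illustration.

```python
import random


def stick_breaking_weights(alpha, truncation, rng):
    """Draw mixture weights from a truncated stick-breaking construction.

    alpha and truncation are illustrative assumptions, not values from the paper.
    """
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to a component
    for _ in range(truncation - 1):
        v = rng.betavariate(1.0, alpha)  # fraction broken off the remaining stick
        weights.append(remaining * v)
        remaining *= 1.0 - v
    weights.append(remaining)  # the last component takes whatever is left
    return weights


def discretize(latent, cutpoints):
    """Map a latent continuous value to an ordinal category (hypothetical cutpoints)."""
    return sum(latent > c for c in cutpoints)


rng = random.Random(0)
weights = stick_breaking_weights(alpha=1.0, truncation=10, rng=rng)
category = discretize(0.3, cutpoints=[-1.0, 0.0, 1.0])
```

Smaller values of `alpha` concentrate the weight on fewer components, which is how a model of this kind can leave the number of occupied clusters to be inferred from the data rather than fixed in advance.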

SUBMITTER: Storlie CB

PROVIDER: S-EPMC6240391 | biostudies-literature | 2018 May

REPOSITORIES: biostudies-literature

Publications

Clustering and variable selection in the presence of mixed variable types and missing data.

Storlie CB, Myers SM, Katusic SK, Weaver AL, Voigt RG, Croarkin PE, Stoeckel RE, Port JD

Statistics in Medicine, 2018 May 17

PMID: 29774571

Similar Datasets

Flexible variable selection in the presence of missing data. | S-EPMC11323294 | biostudies-literature

Variable selection in the presence of missing data: resampling and imputation. | S-EPMC5156376 | biostudies-literature

VARIABLE SELECTION FOR REGRESSION MODELS WITH MISSING DATA. | S-EPMC2844735 | biostudies-literature

PUlasso: High-Dimensional Variable Selection With Presence-Only Data. | S-EPMC7133715 | biostudies-literature

Imputation-Based Variable Selection Method for Block-Wise Missing Data When Integrating Multiple Longitudinal Studies. | S-EPMC11804884 | biostudies-literature

Simultaneous clustering and variable selection: A novel algorithm and model selection procedure. | S-EPMC10439051 | biostudies-literature

VARIABLE SELECTION IN LINEAR MIXED EFFECTS MODELS. | S-EPMC4026175 | biostudies-literature

Accounting for clustering in automated variable selection using hospital data: a comparison of different LASSO approaches. | S-EPMC10675967 | biostudies-literature

A spatio-temporal nonparametric Bayesian variable selection model of fMRI data for clustering correlated time courses. | S-EPMC4076058 | biostudies-literature

Assessing Fairness in the Presence of Missing Data. | S-EPMC9043798 | biostudies-literature