Dataset Information

Distance-based clustering challenges for unbiased benchmarking studies.

ABSTRACT: Benchmark datasets with predefined cluster structures and high-dimensional biomedical datasets outline the challenges of cluster analysis: clustering algorithms are limited in their clustering ability in the presence of clusters defining distance-based structures resulting in a biased clustering solution. Data sets might not have cluster structures. Clustering yields arbitrary labels and often depends on the trial, leading to varying results. Moreover, recent research indicated that all partition comparison measures can yield the same results for different clustering solutions. Consequently, algorithm selection and parameter optimization by unsupervised quality measures (QM) are always biased and misleading. Only if the predefined structures happen to meet the particular clustering criterion and QM, can the clusters be recovered. Results are presented based on 41 open-source algorithms which are particularly useful in biomedical scenarios. Furthermore, comparative analysis with mirrored density plots provides a significantly more detailed benchmark than that with the typically used box plots or violin plots.

SUBMITTER: Thrun MC

PROVIDER: S-EPMC8460803 | biostudies-literature | 2021 Sep

REPOSITORIES: biostudies-literature

Publications

Distance-based clustering challenges for unbiased benchmarking studies.

Thrun MC

Scientific Reports, 2021-09-23, issue 1

PMID: 34556686

Similar Datasets

Geometry-based distance for clustering amino acids. | S-EPMC9041948 | biostudies-literature

Multidimensional scaling improves distance-based clustering for microbiome data. | S-EPMC11814494 | biostudies-literature

EvANI benchmarking workflow for evolutionary distance estimation. | S-EPMC11870633 | biostudies-literature

Challenges in benchmarking metagenomic profilers. | S-EPMC8184642 | biostudies-literature

Novel trajectory clustering method based on distance dependent Chinese restaurant process. | S-EPMC7924552 | biostudies-literature

Automated calibration of consensus weighted distance-based clustering approaches using sharp. | S-EPMC10627366 | biostudies-literature

Sketch distance-based clustering of chromosomes for large genome database compression. | S-EPMC6939838 | biostudies-literature

An unbiased method to build benchmarking sets for ligand-based virtual screening and its application to GPCRs. | S-EPMC4038372 | biostudies-literature

Using Hamming Distance as Information for SNP-Sets Clustering and Testing in Disease Association Studies. | S-EPMC4547758 | biostudies-literature

Identification of adult spinal Shox2 neuronal subpopulations based on unbiased computational clustering of electrophysiological properties. | S-EPMC9385948 | biostudies-literature