Dataset Information

Fast approximate hierarchical clustering using similarity heuristics.

ABSTRACT: BACKGROUND: Agglomerative hierarchical clustering (AHC) is a common unsupervised data analysis technique used in several biological applications. Standard AHC methods require all pairwise distances between data objects to be known. With ever-increasing data sizes, this quadratic complexity poses problems that cannot be overcome by simply waiting for faster computers. RESULTS: We propose an approximate AHC algorithm, HappieClust, which can output a biologically meaningful clustering of a large dataset more than an order of magnitude faster than full AHC algorithms. The key to the algorithm is to limit the number of calculated pairwise distances to a carefully chosen subset of all possible distances. We choose distances using a similarity heuristic based on a small set of pivot objects. The heuristic efficiently finds pairs of similar objects, and these help to mimic the greedy choices of full AHC. The quality of approximate AHC compared to full AHC is studied with three measures. The first measure evaluates the global quality of the achieved clustering, while the second compares biological relevance using enrichment of biological functions in every subtree of the clusterings. The third measure studies how well the contents of subtrees are conserved between the clusterings. CONCLUSION: The HappieClust algorithm is well suited for large-scale gene expression visualization and analysis, both on personal computers and in public online web applications. The software is available from the URL http://www.quretec.com/HappieClust.
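
The pivot idea in the abstract can be illustrated with a minimal Python sketch. It is not the HappieClust implementation: the function name `candidate_pairs`, the parameters `n_pivots` and `eps`, the use of Euclidean distance, and the triangle-inequality filter are all assumptions made for illustration. Each object gets a short profile of distances to a few pivots; since |d(x,p) - d(y,p)| <= d(x,y) for any metric, a pair whose profiles differ by more than `eps` at any pivot cannot be closer than `eps`, so its exact distance is never computed.

```python
import math
import random
from itertools import combinations


def candidate_pairs(points, n_pivots=3, eps=0.5, seed=0):
    """Return (i, j, distance) for pairs with exact distance <= eps.

    Hypothetical illustration only: pivots are chosen at random and
    Euclidean distance is assumed; the published algorithm may differ.
    """
    rng = random.Random(seed)
    pivots = rng.sample(range(len(points)), n_pivots)

    # Distance profile: n * n_pivots exact distance computations.
    profile = [[math.dist(p, points[q]) for q in pivots] for p in points]

    pairs = []
    # NOTE: this sketch still visits every pair with a cheap profile
    # comparison; a real implementation would index the profiles so that
    # distant pairs are never enumerated at all.
    for i, j in combinations(range(len(points)), 2):
        # Triangle inequality gives a lower bound on d(i, j) per pivot.
        if all(abs(a - b) <= eps for a, b in zip(profile[i], profile[j])):
            d = math.dist(points[i], points[j])  # exact distance, computed rarely
            if d <= eps:
                pairs.append((i, j, d))
    return pairs


# Two tight groups and one outlier; only within-group pairs should survive.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (10.0, 0.0)]
similar = candidate_pairs(points, n_pivots=2, eps=0.5)
```

The surviving pairs could then be passed to an agglomerative procedure in place of the full distance matrix. Because the filter relies only on a lower bound, it discards no pair that is truly within `eps`; the approximation in this sketch comes from ignoring all pairs beyond that threshold.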

SUBMITTER: Kull M

PROVIDER: S-EPMC2561018 | biostudies-literature | 2008 Sep

REPOSITORIES: biostudies-literature

Publications

Fast approximate hierarchical clustering using similarity heuristics.

Meelis Kull, Jaak Vilo

BioData Mining, 2008-09-22, vol. 1

PMID: 18822115

Similar Datasets

Data integration by fuzzy similarity-based hierarchical clustering.
| S-EPMC7446192 | biostudies-literature

Fast tree aggregation for consensus hierarchical clustering.
| S-EPMC7085155 | biostudies-literature

AnatomiCuts: Hierarchical clustering of tractography streamlines based on anatomical similarity.
| S-EPMC6152885 | biostudies-literature

R/BHC: fast Bayesian hierarchical clustering for microarray data.
| S-EPMC2736174 | biostudies-literature

Fast R Functions for Robust Correlations and Hierarchical Clustering.
| S-EPMC3465711 | biostudies-literature

Similarity maps and hierarchical clustering for annotating FT-IR spectral images.
| S-EPMC4225570 | biostudies-literature

Ultra-Fast Approximate Inference Using Variational Functional Mixed Models.
| S-EPMC10441618 | biostudies-literature

Adjacency-constrained hierarchical clustering of a band similarity matrix with application to genomics.
| S-EPMC6857244 | biostudies-literature

Fast structure similarity searches among protein models: efficient clustering of protein fragments.
| S-EPMC3403935 | biostudies-literature

Detecting concerted demographic response across community assemblages using hierarchical approximate Bayesian computation.
| S-EPMC4137712 | biostudies-literature