Dataset Information

Global considerations in hierarchical clustering reveal meaningful patterns in data.

ABSTRACT:

Background

A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied.

Methodology/principal findings

We show that hierarchical clustering algorithms that involve global considerations, such as top-down (TD, divisive) or glocal (global-local) algorithms, are better suited to reveal meaningful patterns in the data. This is demonstrated by testing the correspondence between the results of several algorithms (TD, glocal and BU) and the correct annotations provided by experts. The correspondence was tested in multiple domains, including gene expression experiments, stock trade records and functional protein families. The performance of each algorithm is evaluated by statistical criteria that are assigned to clusters (nodes of the hierarchy tree) based on expert-labeled data. Whereas TD algorithms perform better on global patterns, BU algorithms perform well and are advantageous when finer granularity of the data is sought. In addition, a novel TD algorithm that is based on genuine density of the data points is presented and is shown to outperform other divisive and agglomerative methods. Application of the algorithm to more than 500 protein sequences belonging to ion channels illustrates the potential of the method for inferring overlooked functional annotations. ClustTree, a graphical Matlab toolbox for applying various hierarchical clustering algorithms and testing their quality, is made available.
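The idea of scoring tree nodes against expert labels can be illustrated with a small sketch. The abstract does not name the statistical criteria used, so the Jaccard score below is an assumption chosen for illustration, not necessarily the criterion used in the paper or in ClustTree. The nested-tuple tree encoding, the function names and the toy trees are likewise hypothetical.

```python
from typing import FrozenSet, Iterable, Iterator


def leaves(tree) -> FrozenSet[str]:
    """Return the set of leaf labels under a node.

    Assumed encoding: a leaf is a string, an internal node is a
    2-tuple (left, right) of subtrees.
    """
    if isinstance(tree, tuple):
        return leaves(tree[0]) | leaves(tree[1])
    return frozenset([tree])


def nodes(tree) -> Iterator[FrozenSet[str]]:
    """Yield the leaf set of every node in the tree, root first."""
    yield leaves(tree)
    if isinstance(tree, tuple):
        yield from nodes(tree[0])
        yield from nodes(tree[1])


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard similarity |a & b| / |a | b| (0.0 for two empty sets)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_node_score(tree, labeled_class: Iterable[str]) -> float:
    """Score a hierarchy by its best-matching node for one expert-labeled class."""
    target = frozenset(labeled_class)
    return max(jaccard(node, target) for node in nodes(tree))


# Two hypothetical hierarchies over the same five items.
tree_a = ((("a", "b"), "c"), ("d", "e"))
tree_b = (("a", "d"), (("b", "e"), "c"))
expert_class = {"a", "b", "c"}

score_a = best_node_score(tree_a, expert_class)  # one node matches the class exactly
score_b = best_node_score(tree_b, expert_class)  # only the root covers the whole class
```

Under this assumed criterion, a hierarchy that keeps an expert-defined class together in a single node scores higher than one that scatters it, which is one simple way to compare the output of TD, glocal and BU algorithms on the same labeled data.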

Conclusions

Although currently rarely used, global approaches, in particular TD or glocal algorithms, should be considered in the exploratory process of clustering. In general, applying unsupervised clustering methods can improve the quality of manually created mappings of protein families. As demonstrated, it can also provide insights into erroneous and missed annotations.

SUBMITTER: Varshavsky R

PROVIDER: S-EPMC2375056 | biostudies-literature | 2008 May

REPOSITORIES: biostudies-literature

Publications

Global considerations in hierarchical clustering reveal meaningful patterns in data.

Roy Varshavsky, David Horn, Michal Linial

PLoS ONE, 2008-05-21, issue 5

PMID: 18493326

Similar Datasets

Hierarchical clustering of shotgun proteomics data.
| S-EPMC3108832 | biostudies-literature

Non-supervised hierarchical clustering of gene expression data
2008-08-30 | GSE12627 | GEO

Data integration by fuzzy similarity-based hierarchical clustering.
| S-EPMC7446192 | biostudies-literature

R/BHC: fast Bayesian hierarchical clustering for microarray data.
| S-EPMC2736174 | biostudies-literature

Clustering on hierarchical heterogeneous data with prior pairwise relationships.
| S-EPMC10807103 | biostudies-literature

Hierarchical clustering of PDAC cell lines
2022-11-19 | E-MTAB-8173 | biostudies-arrayexpress

CLAG: an unsupervised non hierarchical clustering algorithm handling biological data.
| S-EPMC3519615 | biostudies-literature

clusterMLD: An Efficient Hierarchical Clustering Method for Multivariate Longitudinal Data.
| S-EPMC10584088 | biostudies-literature

Hierarchical clustering of high-throughput expression data based on general dependences.
| S-EPMC3905248 | biostudies-literature

A HIERARCHICAL BAYESIAN MODEL FOR SINGLE-CELL CLUSTERING USING RNA-SEQUENCING DATA.
| S-EPMC8168892 | biostudies-literature