Dataset Information

PFClust: a novel parameter free clustering algorithm.

ABSTRACT:

Background

We present the algorithm PFClust (Parameter Free Clustering), which automatically clusters data and identifies a suitable number of clusters without requiring the user to specify any parameters. The algorithm partitions a dataset into a number of clusters that share some common attributes, such as their minimum expectation value and variance of intra-cluster similarity. A set of n objects can be clustered into any number of clusters from one to n, and many clustering methodologies are available for doing so: hierarchical and partitional, agglomerative and divisive. Nonetheless, automatically determining the number of clusters present in a dataset remains a significant challenge for clustering algorithms. Identifying a putative optimum number of clusters involves computing and evaluating a range of clusterings with different numbers of clusters. However, there is no agreed or unique definition of optimum in this context. We therefore test PFClust on datasets for which an external gold standard of 'correct' cluster definitions exists, noting that this division into clusters may be suboptimal according to other reasonable criteria. PFClust is heuristic in the sense that it cannot be described in terms of optimising any single simply-expressed metric over the space of possible clusterings.
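The cluster attributes mentioned above, the expectation value and variance of intra-cluster similarity, can be illustrated with a minimal Python sketch. It is not the PFClust implementation: the function name `intra_cluster_stats`, the square similarity-matrix input, and the use of the population variance are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import mean, pvariance


def intra_cluster_stats(similarity, labels):
    """Mean and variance of pairwise similarities inside each cluster.

    similarity: square, symmetric list of lists (assumed input format).
    labels: one cluster label per object, in the same order.
    Clusters with fewer than two members have no pairs and are skipped.
    """
    # Group object indices by their cluster label.
    members = {}
    for index, label in enumerate(labels):
        members.setdefault(label, []).append(index)

    stats = {}
    for label, indices in members.items():
        # Collect the similarity of every unordered pair in this cluster.
        pairs = [similarity[i][j] for i, j in combinations(indices, 2)]
        if pairs:
            stats[label] = (mean(pairs), pvariance(pairs))
    return stats


if __name__ == "__main__":
    sim = [
        [1.0, 0.9, 0.8, 0.1],
        [0.9, 1.0, 0.7, 0.2],
        [0.8, 0.7, 1.0, 0.3],
        [0.1, 0.2, 0.3, 1.0],
    ]
    print(intra_cluster_stats(sim, [0, 0, 0, 1]))
```

Comparing these per-cluster statistics across candidate clusterings is one way to check whether clusters "share common attributes"; how PFClust itself uses them is not specified in this abstract.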

Results

We validate PFClust firstly with reference to a number of synthetic datasets consisting of 2D vectors, showing that its clustering performance is at least equal to that of six other leading methodologies - even though five of the other methods are told in advance how many clusters to use. We also demonstrate the ability of PFClust to classify the three dimensional structures of protein domains, using a set of folds taken from the structural bioinformatics database CATH.
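Agreement between a computed clustering and a gold-standard assignment can be scored in several ways. The sketch below uses the Rand index, the fraction of object pairs on which two partitions agree. The abstract does not name the metric used, so this choice is an assumption, as is the function name `rand_index`.

```python
from itertools import combinations


def rand_index(predicted, reference):
    """Fraction of object pairs treated the same way by both partitions.

    A pair counts as an agreement when both partitions place the two
    objects in the same cluster, or both place them in different clusters.
    Label values themselves do not need to match between partitions.
    """
    if len(predicted) != len(reference):
        raise ValueError("partitions must label the same number of objects")

    pairs = list(combinations(range(len(predicted)), 2))
    if not pairs:
        return 1.0  # zero or one object: partitions agree trivially

    agreements = sum(
        (predicted[i] == predicted[j]) == (reference[i] == reference[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    # Same grouping under different label names scores 1.0.
    print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
```

A score of 1.0 means the two partitions group the objects identically; lower values indicate more pairs on which they disagree.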

Conclusions

We show that PFClust is able to cluster the test datasets a little better, on average, than any of the other algorithms, and furthermore is able to do this without the need to specify any external parameters. Results on the synthetic datasets demonstrate that PFClust generates meaningful clusters, while our algorithm also shows excellent agreement with the correct assignments for a dataset extracted from the CATH part-manually curated classification of protein domain structures.

SUBMITTER: Mavridis L

PROVIDER: S-EPMC3747858 | biostudies-literature | 2013 Jul

REPOSITORIES: biostudies-literature


Publications

PFClust: a novel parameter free clustering algorithm.

Lazaros Mavridis, Neetika Nath, John B O Mitchell

BMC Bioinformatics, 2013-07-03


PMID: 23819480

Similar Datasets

Project description: Single-molecule localization microscopy (SMLM) enables the production of high-resolution images by imaging spatially isolated fluorescent particles. Although challenging, the result of SMLM analysis lists the position of individual molecules, leading to a valuable quantification of the stoichiometry and spatial organization of molecular actors. Both the signal/noise ratio and the density (Dframe), i.e., the number of fluorescent particles per µm² per frame, have previously been identified as determining factors for reaching a given SMLM precision. Establishing a comprehensive theoretical study relying on these two parameters is therefore of central interest to delineate the achievable limits for accurate SMLM observations. Our study reports that, in the absence of prior knowledge of the signal intensity, the density effect on particle localization is more prominent than that anticipated from theoretical studies performed at known signal intensity. A first limit appears when, under a low-density hypothesis (i.e., one-Gaussian fitting hypothesis), any fluorescent particle distant by less than approximately 600 nm from the particle of interest biases its localization. In fact, all particles should be accounted for, even those dimly fluorescent, to ascertain unbiased localization of any surrounding particles. Moreover, even under a high-density hypothesis (i.e., multi-Gaussian fitting hypothesis), a second limit arises because of the impossible distinction of particles located too closely. An increase in Dframe is thus likely to deteriorate the localization precision, the image reconstruction, and more generally the quantification accuracy. Our study firstly provides a density-signal/noise ratio space diagram for use as a guide in data recording toward reaching an achievable SMLM resolution. Additionally, it leads to the identification of the essential requirements for implementing UNLOC, a parameter-free and fast computing algorithm approaching the Cramér-Rao bound for particles at high-density per frame and without any prior knowledge of their intensity. UNLOC is available as an ImageJ plugin.

