
Dataset Information


Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics.


ABSTRACT: Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses a normal-gamma distribution as a conjugate prior on the mean and precision of each Gaussian component. We tested GBHC on 11 cancer datasets and 3 synthetic datasets. The results on the cancer datasets show that, in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers a number of clusters that is close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than those of several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc/
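
To make the role of the normal-gamma conjugate prior concrete, the following is a minimal sketch, not the authors' implementation. It assumes univariate data and hypothetical hyperparameters (mu0, kappa0, alpha0, beta0, with beta0 a rate parameter), and it computes the closed-form log marginal likelihood of a cluster under a Gaussian model with unknown mean and precision. The function log_merge_score is an illustrative simplification: it only compares "one merged cluster" against "two separate clusters" and omits the prior weighting over tree-consistent partitions that the full BHC procedure uses.

```python
import math


def log_marginal_likelihood(xs, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0):
    """Log p(xs) for a 1-D Gaussian with a normal-gamma prior on (mean, precision).

    Hyperparameter names and default values are assumptions for illustration.
    """
    n = len(xs)
    if n == 0:
        return 0.0
    mean = sum(xs) / n
    sum_sq = sum((x - mean) ** 2 for x in xs)
    # Standard conjugate posterior updates for the normal-gamma prior.
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = (beta0 + 0.5 * sum_sq
              + kappa0 * n * (mean - mu0) ** 2 / (2.0 * kappa_n))
    return (math.lgamma(alpha_n) - math.lgamma(alpha0)
            + alpha0 * math.log(beta0) - alpha_n * math.log(beta_n)
            + 0.5 * (math.log(kappa0) - math.log(kappa_n))
            - 0.5 * n * math.log(2.0 * math.pi))


def log_merge_score(a, b, **prior):
    """Simplified log Bayes factor: merged cluster vs. two separate clusters."""
    return (log_marginal_likelihood(a + b, **prior)
            - log_marginal_likelihood(a, **prior)
            - log_marginal_likelihood(b, **prior))


if __name__ == "__main__":
    near = log_merge_score([0.0, 0.1], [0.05, 0.15])
    far = log_merge_score([0.0, 0.1], [10.0, 10.1])
    print("merge score, similar clusters:", near)
    print("merge score, distant clusters:", far)
```

Under these assumed hyperparameters, a positive score favours merging and a negative score favours keeping the clusters apart, which is the kind of model-selection decision an agglomerative Bayesian scheme repeats at each step.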

SUBMITTER: Sirinukunwattana K 

PROVIDER: S-EPMC3806770 | biostudies-literature | 2013

REPOSITORIES: biostudies-literature


Publications

Bayesian hierarchical clustering for studying cancer gene expression data with unknown statistics.

Korsuk Sirinukunwattana, Richard S. Savage, Muhammad F. Bari, David R. J. Snead, Nasir M. Rajpoot

PLoS ONE, 2013-10-23 (issue 10)



Similar Datasets

| S-EPMC2736174 | biostudies-literature
2008-08-30 | GSE12627 | GEO
| S-EPMC8168892 | biostudies-literature
| S-EPMC2799515 | biostudies-literature
| S-EPMC3228548 | biostudies-literature
| S-EPMC2367475 | biostudies-literature
| S-EPMC7268989 | biostudies-literature
| S-EPMC3536024 | biostudies-literature
| S-ECPF-GEOD-12627 | biostudies-other
| S-EPMC3905248 | biostudies-literature