Dataset Information

MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering.

ABSTRACT: BACKGROUND: Uncovering subtypes of disease from microarray samples has important clinical implications such as survival time and sensitivity of individual patients to specific therapies. Unsupervised clustering methods have been used to classify this type of data. However, most existing methods focus on clusters with compact shapes and do not reflect the geometric complexity of the high dimensional microarray clusters, which limits their performance. RESULTS: We present a cluster-number-based ensemble clustering algorithm, called MULTI-K, for microarray sample classification, which demonstrates remarkable accuracy. The method amalgamates multiple k-means runs by varying the number of clusters and identifies clusters that manifest the most robust co-memberships of elements. In addition to the original algorithm, we newly devised the entropy-plot to control the separation of singletons or small clusters. MULTI-K, unlike the simple k-means or other widely used methods, was able to capture clusters with complex and high-dimensional structures accurately. MULTI-K outperformed other methods including a recently developed ensemble clustering algorithm in tests with five simulated and eight real gene-expression data sets. CONCLUSION: The geometric complexity of clusters should be taken into account for accurate classification of microarray data, and ensemble clustering applied to the number of clusters tackles the problem very well. The C++ code and the data sets tested are available from the authors.
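
The abstract's central idea, combining k-means runs over several cluster numbers and keeping the groups whose members co-cluster most consistently, can be illustrated with a generic co-membership sketch. This is not the authors' MULTI-K code (which is in C++ and available from them) and it does not reproduce the entropy-plot. The function names, the fixed `threshold` cut, and the connected-components step are assumptions made purely for illustration.

```python
import random


def kmeans(points, k, rng, n_iter=50):
    """Plain Lloyd's k-means on a list of equal-length tuples; returns labels."""
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an emptied cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def co_membership(points, k_values, runs_per_k=10, seed=0):
    """Fraction of all runs (across every k) in which each pair shares a cluster."""
    rng = random.Random(seed)
    n = len(points)
    counts = [[0] * n for _ in range(n)]
    total = 0
    for k in k_values:
        for _ in range(runs_per_k):
            labels = kmeans(points, k, rng)
            for i in range(n):
                for j in range(n):
                    counts[i][j] += labels[i] == labels[j]
            total += 1
    return [[c / total for c in row] for row in counts]


def robust_clusters(co, threshold=0.5):
    """Assumed cut rule: connected components of pairs with co-membership > threshold."""
    n = len(co)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and co[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels
```

Calling `co_membership(points, k_values=(2, 3, 4))` on a list of sample vectors and passing the result to `robust_clusters` returns one integer label per sample. Because groups are formed by linking consistently co-clustered pairs rather than by distance to a centroid, a group recovered this way need not be compact, which is the property the abstract emphasises.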

SUBMITTER: Kim EY

PROVIDER: S-EPMC2743671 | biostudies-literature | 2009 Aug

REPOSITORIES: biostudies-literature

Publications

MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering.

Kim Eun-Youn, Kim Seon-Young, Ashlock Daniel, Nam Dougu

BMC Bioinformatics, 2009-08-22

PMID: 19698124

Similar Datasets

Project description: Background: Arterial hypertension is a major cardiovascular risk factor. Identification of secondary hypertension in its various forms is key to preventing and targeting treatment of cardiovascular complications. Simplified diagnostic tests are urgently required to distinguish primary and secondary hypertension to address the current underdiagnosis of the latter. Methods: This study uses Machine Learning (ML) to classify subtypes of endocrine hypertension (EHT) in a large cohort of hypertensive patients using multidimensional omics analysis of plasma and urine samples. We measured 409 multi-omics (MOmics) features including plasma miRNAs (PmiRNA: 173), plasma catechol O-methylated metabolites (PMetas: 4), plasma steroids (PSteroids: 16), urinary steroid metabolites (USteroids: 27), and plasma small metabolites (PSmallMB: 189) in primary hypertension (PHT) patients, EHT patients with either primary aldosteronism (PA), pheochromocytoma/functional paraganglioma (PPGL) or Cushing syndrome (CS) and normotensive volunteers (NV). Biomarker discovery involved selection of disease combination, outlier handling, feature reduction, 8 ML classifiers, class balancing and consideration of different age- and sex-based scenarios. Classifications were evaluated using balanced accuracy, sensitivity, specificity, AUC, F1, and Kappa score. Findings: Complete clinical and biological datasets were generated from 307 subjects (PA=113, PPGL=88, CS=41 and PHT=112). The random forest classifier provided ∼92% balanced accuracy (∼11% improvement on the best mono-omics classifier), with 96% specificity and 0.95 AUC to distinguish one of the four conditions in multi-class ALL-ALL comparisons (PPGL vs PA vs CS vs PHT) on an unseen test set, using 57 MOmics features. For discrimination of EHT (PA + PPGL + CS) vs PHT, the simple logistic classifier achieved 0.96 AUC with 90% sensitivity, and ∼86% specificity, using 37 MOmics features. One PmiRNA (hsa-miR-15a-5p) and two PSmallMB (C9 and PC ae C38:1) features were found to be most discriminating for all disease combinations. Overall, the MOmics-based classifiers were able to provide better classification performance in comparison to mono-omics classifiers. Interpretation: We have developed a ML pipeline to distinguish different EHT subtypes from PHT using multi-omics data. This innovative approach to stratification is an advancement towards the development of a diagnostic tool for EHT patients, significantly increasing testing throughput and accelerating administration of appropriate treatment. Funding: European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 633983, Clinical Research Priority Program of the University of Zurich for the CRPP HYRENE (to Z.E. and F.B.), and Deutsche Forschungsgemeinschaft (CRC/Transregio 205/1).
