Retro: concept-based clustering of biomedical topical sets.
ABSTRACT:
MOTIVATION: Clustering methods can be useful for automatically grouping documents into meaningful clusters, improving human comprehension of a document collection. Although there are clustering algorithms that can achieve the goal for relatively large document collections, they do not always work well for small and homogenous datasets.
METHODS: In this article, we present Retro, a novel clustering algorithm that extracts meaningful clusters along with concise and descriptive titles from small and homogenous document collections. Unlike common clustering approaches, our algorithm predicts cluster titles before clustering. It relies on the hypergeometric distribution model to discover key phrases, and generates candidate clusters by assigning documents to these phrases. Further, the statistical significance of candidate clusters is tested using supervised learning methods, and a multiple testing correction technique is used to control the overall quality of clustering.
RESULTS: We test our system on five disease datasets from OMIM® and evaluate the results based on MeSH® term assignments. We further compare our method with several baseline and state-of-the-art methods, including K-means, expectation maximization, latent Dirichlet allocation-based clustering, Lingo, OPTIMSRC and adapted GK-means. The experimental results on the 20-Newsgroup and ODP-239 collections demonstrate that our method is successful at extracting significant clusters and is superior to existing methods in terms of quality of clusters. Finally, we apply our system to a collection of 6248 topical sets from the HomoloGene® database, a resource in PubMed®. Empirical evaluation confirms the method is useful for small homogenous datasets in producing meaningful clusters with descriptive titles.
AVAILABILITY AND IMPLEMENTATION: A web-based demonstration of the algorithm applied to a collection of sets from the HomoloGene database is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Wilbur/IRET/CLUSTERING_HOMOLOGENE/index.html.
CONTACT: lana.yeganova@nih.gov
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
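ILLUSTRATION: The abstract states that key phrases are discovered with a hypergeometric distribution model but gives no further detail. The following is a minimal sketch of one common way such a test can be set up, not the authors' implementation. It assumes each document is represented as a set of phrases, that the background collection includes the topical set, and that a phrase is scored by the upper-tail probability of seeing it at least as often in the topical set as observed. The names `hypergeom_tail` and `rank_phrases` are hypothetical.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(X >= k) when drawing n documents from N, of which K contain the phrase."""
    total = comb(N, n)
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(max(k, 0), upper + 1))
    return hits / total


def rank_phrases(topical_docs, background_docs):
    """Rank phrases of a topical set by hypergeometric p-value (smallest first).

    Assumption: background_docs contains every document in topical_docs.
    """
    N = len(background_docs)
    n = len(topical_docs)
    phrases = set().union(*topical_docs)
    scored = []
    for phrase in phrases:
        K = sum(phrase in doc for doc in background_docs)  # background frequency
        k = sum(phrase in doc for doc in topical_docs)     # frequency in the set
        scored.append((phrase, hypergeom_tail(N, K, n, k)))
    # Sort by p-value, then alphabetically so ties are ordered deterministically
    return sorted(scored, key=lambda item: (item[1], item[0]))


if __name__ == "__main__":
    topical = [
        {"gene", "retina"},
        {"gene", "retina", "degeneration"},
        {"gene", "retina", "degeneration"},
    ]
    background = topical + [
        {"gene", "mutation"},
        {"gene", "protein"},
        {"gene", "retina"},
        {"gene"},
        {"gene", "mutation"},
    ]
    for phrase, p_value in rank_phrases(topical, background):
        print(f"{phrase}\t{p_value:.4f}")
```

In this toy data, a phrase that occurs in every background document ("gene") receives a p-value of 1 and so would not be selected as a key phrase, while phrases concentrated in the topical set rank higher. The supervised significance testing of candidate clusters and the multiple testing correction mentioned in METHODS are not covered by this sketch.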
SUBMITTER: Yeganova L
PROVIDER: S-EPMC4221121 | biostudies-literature | 2014 Nov
REPOSITORIES: biostudies-literature