Dataset Information

Soft document clustering using a novel graph covering approach.

ABSTRACT:

Background

In text mining, document clustering describes the efforts to assign unstructured documents to clusters, which in turn usually refer to topics. Clustering is widely used in science for data retrieval and organisation.

Results

In this paper we present and discuss a novel graph-theoretical approach for document clustering and its application on a real-world data set. We will show that the well-known graph partition to stable sets or cliques can be generalized to pseudostable sets or pseudocliques. This allows to perform a soft clustering as well as a hard clustering. The software is freely available on GitHub.

Conclusions

The presented integer linear programming as well as the greedy approach for this NP -complete problem lead to valuable results on random instances and some real-world data for different similarity measures. We could show that PS-Document Clustering is a remarkable approach to document clustering and opens the complete toolbox of graph theory to this field.

SUBMITTER: Dorpinghaus J

PROVIDER: S-EPMC6047369 | biostudies-literature | 2018

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Soft document clustering using a novel graph covering approach.

Dörpinghaus Jens J Schaaf Sebastian S Jacobs Marc M

BioData mining 20180614

<h4>Background</h4>In text mining, document clustering describes the efforts to assign unstructured documents to clusters, which in turn usually refer to topics. Clustering is widely used in science for data retrieval and organisation.<h4>Results</h4>In this paper we present and discuss a novel graph-theoretical approach for document clustering and its application on a real-world data set. We will show that the well-known graph partition to stable sets or cliques can be generalized to pseudostab ...[more]

PMID: 30026812

Similar Datasets

Project description:BackgroundGlucosinolates (GSLs) are plant secondary metabolites that contain nitrogen-containing compounds. They are important in the plant defense system and known to provide protection against cancer in humans. Currently, increasing the amount of data generated from various omics technologies serves as a hotspot for new gene discovery. However, sometimes sequence similarity searching approach is not sufficiently effective to find these genes; hence, we adapted a network clustering approach to search for potential GSLs genes from the Arabidopsis thaliana co-expression dataset.MethodsWe used known GSL genes to construct a comprehensive GSL co-expression network. This network was analyzed with the DPClusOST algorithm using a density of 0.5. 0.6. 0.7, 0.8, and 0.9. Generating clusters were evaluated using Fisher's exact test to identify GSL gene co-expression clusters. A significance score (SScore) was calculated for each gene based on the generated p-value of Fisher's exact test. SScore was used to perform a receiver operating characteristic (ROC) study to classify possible GSL genes using the ROCR package. ROCR was used in determining the AUC that measured the suitable density value of the cluster for further analysis. Finally, pathway enrichment analysis was conducted using ClueGO to identify significant pathways associated with the GSL clusters.ResultsThe density value of 0.8 showed the highest area under the curve (AUC) leading to the selection of thirteen potential GSL genes from the top six significant clusters that include IMDH3, MVP1, T19K24.17, MRSA2, SIR, ASP4, MTO1, At1g21440, HMT3, At3g47420, PS1, SAL1, and At3g14220. A total of Four potential genes (MTO1, SIR, SAL1, and IMDH3) were identified from the pathway enrichment analysis on the significant clusters. These genes are directly related to GSL-associated pathways such as sulfur metabolism and valine, leucine, and isoleucine biosynthesis. This approach demonstrates the ability of the network clustering approach in identifying potential GSL genes which cannot be found from the standard similarity search.

Dataset Information

Soft document clustering using a novel graph covering approach.

Background

Results

Conclusions

Publications

Soft document clustering using a novel graph covering approach.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets