Dataset Information

CASS: A distributed network clustering algorithm based on structure similarity for large-scale network.

ABSTRACT: As the size of networks increases, it is becoming important to analyze large-scale network data. A network clustering algorithm is useful for analysis of network data. Conventional network clustering algorithms in a single machine environment rather than a parallel machine environment are actively being researched. However, these algorithms cannot analyze large-scale network data because of memory size issues. As a solution, we propose a network clustering algorithm for large-scale network data analysis using Apache Spark by changing the paradigm of the conventional clustering algorithm to improve its efficiency in the Apache Spark environment. We also apply optimization approaches such as Bloom filter and shuffle selection to reduce memory usage and execution time. By evaluating our proposed algorithm based on an average normalized cut, we confirmed that the algorithm can analyze diverse large-scale network datasets such as biological, co-authorship, internet topology and social networks. Experimental results show that the proposed algorithm can develop more accurate clusters than comparative algorithms with less memory usage. Furthermore, we confirm the proposed optimization approaches and the scalability of the proposed algorithm. In addition, we validate that clusters found from the proposed algorithm can represent biologically meaningful functions.
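The abstract evaluates clusters by an average normalized cut but does not define it here. The following is a minimal sketch under one common definition, assumed rather than taken from the paper: for each cluster, the number of edges leaving the cluster divided by the sum of the degrees of its nodes, averaged over all clusters. The function name `average_normalized_cut` and the input shapes (an undirected edge list and a node-to-cluster mapping) are hypothetical choices for illustration.

```python
from collections import defaultdict


def average_normalized_cut(edges, membership):
    """Average of cut(C) / vol(C) over clusters (assumed definition).

    edges: iterable of (u, v) pairs of an undirected, unweighted graph.
    membership: dict mapping each node to its cluster id.
    """
    cut = defaultdict(int)  # number of edges leaving each cluster
    vol = defaultdict(int)  # sum of node degrees inside each cluster
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        # Each edge adds one to the degree of both endpoints.
        vol[cu] += 1
        vol[cv] += 1
        if cu != cv:
            # The edge crosses a cluster boundary on both sides.
            cut[cu] += 1
            cut[cv] += 1
    clusters = set(membership.values())
    # Skip clusters with no incident edges to avoid division by zero.
    scores = [cut[c] / vol[c] for c in clusters if vol[c] > 0]
    return sum(scores) / len(scores) if scores else 0.0


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
membership = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
score = average_normalized_cut(edges, membership)
```

Under this definition a lower value means the clusters are better separated from the rest of the network: each triangle above has one outgoing edge against a total degree of seven.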

SUBMITTER: Kim J

PROVIDER: S-EPMC6179193 | biostudies-literature | 2018

REPOSITORIES: biostudies-literature

Publications

CASS: A distributed network clustering algorithm based on structure similarity for large-scale network.

Jungrim Kim, Mincheol Shin, Jeongwoo Kim, Chihyun Park, Sujin Lee, Jaemin Woo, Hyerim Kim, Dongmin Seo, Seokjong Yu, Sanghyun Park

PLoS ONE, 2018-10-10, issue 10

PMID: 30303961

Similar Datasets

cOSPREY: A Cloud-Based Distributed Algorithm for Large-Scale Computational Protein Design. | S-EPMC5586165 | biostudies-literature

Algorithm for large-scale clustering across multiple genomes. | S-EPMC3218420 | biostudies-other

Parallel clustering algorithm for large-scale biological data sets. | S-EPMC3976248 | biostudies-literature

SProt: sphere-based protein structure similarity algorithm. | S-EPMC3289081 | biostudies-literature

Fuzzy-Logic Based Distributed Energy-Efficient Clustering Algorithm for Wireless Sensor Networks. | S-EPMC5539863 | biostudies-other

Algorithm-based large-scale screening for blood cancer | S-BSST207 | biostudies-other

Large-Scale Recurrent Neural Network Based Modelling of Gene Regulatory Network Using Cuckoo Search-Flower Pollination Algorithm. | S-EPMC4771889 | biostudies-literature

A fast topological analysis algorithm for large-scale similarity evaluations of ligands and binding pockets. | S-EPMC4631714 | biostudies-literature

ClueNet: Clustering a temporal network based on topological similarity rather than denseness. | S-EPMC5940177 | biostudies-literature

HipMCL: a high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. | S-EPMC5888241 | biostudies-literature