Dataset Information

A highly efficient multi-core algorithm for clustering extremely large datasets.


ABSTRACT:

Background

In recent years, the demand for computational power in computational biology has increased due to rapidly growing data sets from microarray and other high-throughput technologies. This demand is likely to increase. Standard algorithms for analyzing data, such as cluster algorithms, need to be parallelized for fast processing. Unfortunately, most approaches for parallelizing algorithms rely largely on network communication protocols and therefore require multiple connected computers. One answer to this problem is to utilize the intrinsic capabilities of current multi-core hardware to distribute the tasks among the different cores of one computer.

Results

We introduce a multi-core parallelization of the k-means and k-modes cluster algorithms based on the design principles of transactional memory for clustering gene expression microarray type data and categorical SNP data. Our new shared-memory parallel algorithms prove to be highly efficient. We demonstrate their computational power and show their utility in cluster stability and sensitivity analysis employing repeated runs with slightly changed parameters. The computation speed of our Java-based algorithm was increased by a factor of 10 for large data sets while preserving computational accuracy compared to single-core implementations and a recently published network-based parallelization.
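To illustrate the general idea of distributing k-means over the cores of one machine, the following is a minimal sketch in Java. It is not the authors' implementation and does not use transactional memory: it assumes Java parallel streams as a stand-in for the shared-memory parallelization, and the class name `ParallelKMeansSketch`, its method names, and the toy data are hypothetical. Only the assignment step (finding the nearest centroid for each point) runs in parallel, since each point can be processed independently.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

// Hypothetical sketch: k-means with a parallel assignment step.
// Assumption: parallel streams replace the transactional-memory design of the paper.
public class ParallelKMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    public static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Assignment step: every point is independent of the others, so the loop
    // over points is spread across the available cores by the parallel stream.
    public static int[] assign(double[][] data, double[][] centroids) {
        return IntStream.range(0, data.length).parallel().map(i -> {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double dist = squaredDistance(data[i], centroids[c]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            return best;
        }).toArray();
    }

    // Update step (sequential here for simplicity): each centroid becomes the
    // mean of its assigned points; an empty cluster keeps its old centroid.
    public static double[][] update(double[][] data, int[] labels, double[][] old) {
        int k = old.length;
        int dim = old[0].length;
        double[][] sums = new double[k][dim];
        int[] counts = new int[k];
        for (int i = 0; i < data.length; i++) {
            counts[labels[i]]++;
            for (int j = 0; j < dim; j++) {
                sums[labels[i]][j] += data[i][j];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int j = 0; j < dim; j++) {
                sums[c][j] = counts[c] == 0 ? old[c][j] : sums[c][j] / counts[c];
            }
        }
        return sums;
    }

    // Alternates update and assignment until the labels stop changing
    // or maxIterations is reached.
    public static int[] cluster(double[][] data, double[][] initialCentroids, int maxIterations) {
        double[][] centroids = initialCentroids;
        int[] labels = assign(data, centroids);
        for (int iter = 0; iter < maxIterations; iter++) {
            centroids = update(data, labels, centroids);
            int[] next = assign(data, centroids);
            if (Arrays.equals(next, labels)) {
                break;
            }
            labels = next;
        }
        return labels;
    }

    public static void main(String[] args) {
        // Toy data: two well-separated groups of two points each.
        double[][] data = {{0.0, 0.1}, {0.2, 0.0}, {5.0, 5.1}, {5.2, 4.9}};
        double[][] init = {{0.0, 0.0}, {1.0, 1.0}};
        System.out.println(Arrays.toString(cluster(data, init, 100))); // prints [0, 0, 1, 1]
    }
}
```

For a stability or sensitivity analysis of the kind mentioned above, `cluster` could be called repeatedly with different initial centroids or values of k, and the resulting label vectors compared.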

Conclusions

Most desktop computers and even notebooks provide at least dual-core processors. Our multi-core algorithms show that, with modern algorithmic concepts, parallelization makes it possible to perform even such laborious tasks as cluster sensitivity analysis and cluster number estimation on a laboratory computer.

SUBMITTER: Kraus JM 

PROVIDER: S-EPMC2865495 | biostudies-literature | 2010 Apr

REPOSITORIES: biostudies-literature

Publications

A highly efficient multi-core algorithm for clustering extremely large datasets.

Kraus JM, Kestler HA

BMC Bioinformatics, 2010 Apr 6

Similar Datasets

| S-EPMC2672630 | biostudies-literature
| S-EPMC5519076 | biostudies-literature
| S-EPMC8086011 | biostudies-literature
| S-EPMC4681989 | biostudies-literature
| S-EPMC6631606 | biostudies-literature
| S-EPMC2896182 | biostudies-literature
| S-EPMC6019535 | biostudies-literature
| S-EPMC3218420 | biostudies-other
| S-EPMC11326248 | biostudies-literature
| S-EPMC3262844 | biostudies-literature