Dataset Information

ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.

ABSTRACT: The rapid development of sequencing technology has led to an explosive accumulation of genomic sequence data. Clustering is often the first step to perform in sequence analysis, and hierarchical clustering is one of the most commonly used approaches for this purpose. However, it is currently computationally expensive to perform hierarchical clustering of extremely large sequence datasets due to its quadratic time and space complexities. In this paper we developed a new algorithm called ESPRIT-Forest for parallel hierarchical clustering of sequences. The algorithm achieves subquadratic time and space complexity and maintains a high clustering accuracy comparable to the standard method. The basic idea is to organize sequences into a pseudo-metric based partitioning tree for sub-linear time searching of nearest neighbors, and then use a new multiple-pair merging criterion to construct clusters in parallel using multiple threads. The new algorithm was tested on the human microbiome project (HMP) dataset, currently one of the largest published microbial 16S rRNA sequence datasets. Our experiment demonstrated that with the power of parallel computing it is now computationally feasible to perform hierarchical clustering analysis of tens of millions of sequences. The software is available at http://www.acsu.buffalo.edu/~yijunsun/lab/ESPRIT-Forest.html.
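
The multiple-pair merging idea in the abstract can be illustrated with a small sketch. This is not the ESPRIT-Forest criterion or its implementation: it swaps in a simple reciprocal-nearest-neighbour rule, a brute-force neighbour search in place of the partitioning tree, and a toy Hamming distance on equal-length strings. The names `hamming`, `cluster_distance`, `nearest`, and `multi_pair_cluster` are hypothetical, and the code runs sequentially, although the merges chosen in one round are independent of each other and could in principle be handed to separate threads.

```python
def hamming(a, b):
    """Toy distance (assumption): fraction of mismatched positions."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def cluster_distance(c1, c2):
    """Average-linkage distance between two clusters (tuples of sequences)."""
    return sum(hamming(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))


def nearest(i, clusters):
    """Brute-force nearest neighbour of cluster i: (distance, index).

    A tree-based search would replace this scan in a real implementation.
    """
    return min((cluster_distance(clusters[i], clusters[j]), j)
               for j in range(len(clusters)) if j != i)


def multi_pair_cluster(seqs, threshold):
    """Merge every mutual-nearest-neighbour pair per round, up to threshold."""
    clusters = [(s,) for s in seqs]
    while len(clusters) > 1:
        nn = [nearest(i, clusters) for i in range(len(clusters))]
        # Pairs (i, j) that name each other as nearest neighbour and are
        # close enough; these pairs are disjoint, so they can merge together.
        pairs = [(i, j) for i, (d, j) in enumerate(nn)
                 if i < j and nn[j][1] == i and d <= threshold]
        if not pairs:
            break
        merged = {k for pair in pairs for k in pair}
        clusters = ([clusters[i] + clusters[j] for i, j in pairs] +
                    [c for k, c in enumerate(clusters) if k not in merged])
    return sorted(tuple(sorted(c)) for c in clusters)


result = multi_pair_cluster(["AAAA", "AAAT", "GGGG", "GGGC"], threshold=0.3)
print(result)
```

In this toy input, both close pairs are merged in the same round, which is the point of merging several pairs at once rather than one pair per iteration.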

SUBMITTER: Cai Y

PROVIDER: S-EPMC5421816 | biostudies-literature | 2017 Apr

REPOSITORIES: biostudies-literature

Publications

ESPRIT-Forest: Parallel clustering of massive amplicon sequence data in subquadratic time.

Cai Yunpeng, Zheng Wei, Yao Jin, Yang Yujie, Mai Volker, Mao Qi, Sun Yijun

PLoS Computational Biology, 2017-04-24, issue 4

PMID: 28437450

Similar Datasets

ESPRIT-Tree: hierarchical clustering analysis of millions of 16S rRNA pyrosequences in quasilinear computational time. | S-EPMC3152367 | biostudies-literature

Massive data clustering by multi-scale psychological observations. | S-EPMC8889001 | biostudies-literature

Massive fungal biodiversity data re-annotation with multi-level clustering. | S-EPMC4213798 | biostudies-literature

Parallel clustering algorithm for large-scale biological data sets. | S-EPMC3976248 | biostudies-literature

Gclust: A Parallel Clustering Tool for Microbial Genomic Data. | S-EPMC7056916 | biostudies-literature

A parallel computational framework for ultra-large-scale sequence clustering analysis. | S-EPMC6931356 | biostudies-literature

Algorithm for post-clustering curation of DNA amplicon data yields reliable biodiversity estimates. | S-EPMC5662604 | biostudies-literature

Clustering huge protein sequence sets in linear time. | S-EPMC6026198 | biostudies-literature

Characterization of MazF-Mediated Sequence-Specific RNA Cleavage in Pseudomonas putida Using Massive Parallel Sequencing. | S-EPMC4757574 | biostudies-literature

The origin of biased sequence depth in sequence-independent nucleic acid amplification and optimization for efficient massive parallel sequencing. | S-EPMC3784409 | biostudies-literature