Dataset Information

A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences.

ABSTRACT:

Background

We propose a sequence clustering algorithm and compare the partition quality and execution time of the proposed algorithm with those of a popular existing algorithm. The proposed clustering algorithm uses a grammar-based distance metric to determine partitioning for a set of biological sequences. The algorithm performs clustering in which new sequences are compared with cluster-representative sequences to determine membership. If comparison fails to identify a suitable cluster, a new cluster is created.
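The incremental scheme described above (compare each new sequence with the cluster representatives, and open a new cluster when none is close enough) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `phrase_count`, `distance`, `greedy_cluster`, and the `threshold` parameter are hypothetical names, and the distance is a simple LZ78-style phrase-count measure standing in for the paper's grammar-based metric, which the abstract does not specify.

```python
from typing import List, Tuple


def phrase_count(seq: str) -> int:
    """Count phrases in an LZ78-style parse of seq (a crude proxy for grammar size)."""
    seen = set()
    phrase = ""
    count = 0
    for ch in seq:
        phrase += ch
        if phrase not in seen:
            # First time this phrase appears: record it and start a new one.
            seen.add(phrase)
            count += 1
            phrase = ""
    # A trailing phrase that was already seen still counts as one phrase.
    return count + (1 if phrase else 0)


def distance(a: str, b: str) -> float:
    """Stand-in distance (assumption): new phrases each sequence adds to the other."""
    ca, cb = phrase_count(a), phrase_count(b)
    if max(ca, cb) == 0:
        return 0.0
    extra = max(phrase_count(a + b) - ca, phrase_count(b + a) - cb)
    return extra / max(ca, cb)


def greedy_cluster(seqs: List[str], threshold: float) -> Tuple[List[str], List[int]]:
    """Assign each sequence to the nearest representative, or start a new cluster."""
    representatives: List[str] = []
    assignment: List[int] = []
    for s in seqs:
        best_idx, best_d = -1, float("inf")
        for i, rep in enumerate(representatives):
            d = distance(rep, s)
            if d < best_d:
                best_idx, best_d = i, d
        if best_idx >= 0 and best_d <= threshold:
            assignment.append(best_idx)
        else:
            # No suitable cluster: this sequence becomes a new representative.
            representatives.append(s)
            assignment.append(len(representatives) - 1)
    return representatives, assignment
```

Calling `greedy_cluster(sequences, threshold)` returns the representative sequences and, for each input sequence, the index of its cluster. The first sequence placed in a cluster serves as its representative here; that choice, like the threshold value, is an assumption of this sketch.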

Results

The proposed algorithm is validated against the popular DNA/RNA sequence clustering tool CD-HIT-EST and the more recently developed UCLUST, using two different sets of 16S rDNA sequences from 2,255 genera. Its CPU execution time is comparable to that of CD-HIT-EST, which is much slower than UCLUST, and it generates clusters with higher statistical accuracy than both CD-HIT-EST and UCLUST. The validation results are especially striking for large datasets.

Conclusions

We introduce a fast and accurate clustering algorithm that relies on a grammar-based sequence distance. Its statistical clustering quality is validated by clustering large datasets containing 16S rDNA sequences.

SUBMITTER: Russell DJ

PROVIDER: S-EPMC3022630 | biostudies-literature | 2010 Dec

REPOSITORIES: biostudies-literature


Publications

A grammar-based distance metric enables fast and accurate clustering of large sets of 16S sequences.

David J Russell, Samuel F Way, Andrew K Benson, Khalid Sayood

BMC Bioinformatics, 2010 Dec 17


PMID: 21167044


Similar Datasets

SCRAPT: an iterative algorithm for clustering large 16S rRNA gene data sets. | S-EPMC10164572 | biostudies-literature

HPC-CLUST: distributed hierarchical clustering for large sets of nucleotide sequences. | S-EPMC3892691 | biostudies-literature

A comparison of methods for clustering 16S rRNA sequences into OTUs. | S-EPMC3742672 | biostudies-literature

Deep Learning Enables Fast and Accurate Imputation of Gene Expression. | S-EPMC8076954 | biostudies-literature

Fast and Accurate Calculation of Protein Depth by Euclidean Distance Transform. | S-EPMC4098708 | biostudies-literature

Accurate and fast graph-based pangenome annotation and clustering with ggCaller. | S-EPMC10620059 | biostudies-literature

Bartender: a fast and accurate clustering algorithm to count barcode reads. | S-EPMC6049041 | biostudies-literature

Some new sets of sequences of fuzzy numbers with respect to the partial metric. | S-EPMC4324933 | biostudies-other

UPP2: fast and accurate alignment of datasets with fragmentary sequences. | S-EPMC9846425 | biostudies-literature

Fast and accurate taxonomic assignments of metagenomic sequences using MetaBin. | S-EPMC3319535 | biostudies-literature