Dataset Information

Massive fungal biodiversity data re-annotation with multi-level clustering.


ABSTRACT: With the availability of newer and cheaper sequencing methods, genomic data are being generated at an increasingly fast pace. In spite of the high degree of complexity of currently available search routines, the massive number of sequences available virtually prohibits quick and correct identification of large groups of sequences sharing common traits. Hence, there is a need for clustering tools for automatic knowledge extraction enabling the curation of large-scale databases. Current sophisticated approaches to sequence clustering are based on pairwise similarity matrices. This is impractical for databases of hundreds of thousands of sequences, as such a similarity matrix alone would exceed the available memory. In this paper, a new approach called MultiLevel Clustering (MLC) is proposed which avoids the majority of sequence comparisons and therefore significantly reduces the total runtime for clustering. An implementation of the algorithm allowed clustering of all 344,239 ITS (Internal Transcribed Spacer) fungal sequences from GenBank on a normal desktop computer within 22 CPU-hours, whereas the greedy clustering method took up to 242 CPU-hours.
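The memory argument in the abstract can be checked with a little arithmetic, and the greedy baseline it mentions can be illustrated with a toy example. The sketch below is not the authors' implementation: the function names, the 4-byte score size, the positional-identity similarity, and the 0.75 threshold are all assumptions made for illustration, and the paper's greedy method may differ in its details.

```python
def pairwise_matrix_bytes(n_sequences: int, bytes_per_score: int = 4) -> int:
    """Bytes needed to store the upper triangle of an n x n similarity matrix.

    bytes_per_score=4 assumes one 32-bit float per pair (an assumption).
    """
    n_pairs = n_sequences * (n_sequences - 1) // 2
    return n_pairs * bytes_per_score


def toy_similarity(a: str, b: str) -> float:
    """Fraction of matching positions; a stand-in for a real alignment score."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(seqs, similarity, threshold):
    """Greedy representative-based clustering (generic sketch).

    Each sequence joins the first cluster whose representative is at least
    `threshold` similar; otherwise it starts a new cluster. No pairwise
    matrix is stored, but every sequence may be compared against every
    existing representative.
    """
    representatives = []  # first member of each cluster
    clusters = []         # list of member lists, parallel to representatives
    comparisons = 0
    for seq in seqs:
        for i, rep in enumerate(representatives):
            comparisons += 1
            if similarity(seq, rep) >= threshold:
                clusters[i].append(seq)
                break
        else:
            # No representative was similar enough: open a new cluster.
            representatives.append(seq)
            clusters.append([seq])
    return clusters, comparisons


if __name__ == "__main__":
    size = pairwise_matrix_bytes(344_239)
    print(f"pairwise matrix: {size / 1e9:.0f} GB")
    toy = ["AAAA", "AAAT", "CCCC", "CCCG", "AAAA"]
    print(greedy_cluster(toy, toy_similarity, threshold=0.75))
```

Under these assumptions, the 344,239 sequences give about 5.9 × 10^10 pairs, or roughly 237 GB of scores, which is far beyond the memory of a desktop machine. The greedy sketch avoids storing the matrix but still compares each sequence against potentially every representative, which is the cost a multi-level scheme aims to cut.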

SUBMITTER: Vu D 

PROVIDER: S-EPMC4213798 | biostudies-literature | 2014 Oct

REPOSITORIES: biostudies-literature

Publications

Massive fungal biodiversity data re-annotation with multi-level clustering.

Vu D, Szöke S, Wiwie C, Baumbach J, Cardinali G, Röttger R, Robert V

Scientific Reports, 2014 Oct 30


Similar Datasets

S-EPMC2547102 | biostudies-literature
S-EPMC7161108 | biostudies-literature
S-EPMC5411077 | biostudies-literature
S-EPMC10033899 | biostudies-literature
S-EPMC7305234 | biostudies-literature
S-EPMC5421816 | biostudies-literature
S-EPMC7423957 | biostudies-literature
S-EPMC8629388 | biostudies-literature
S-EPMC8415361 | biostudies-literature
S-EPMC9805570 | biostudies-literature