
Dataset Information


UMAP-assisted K-means clustering of large-scale SARS-CoV-2 mutation datasets.


ABSTRACT: Coronavirus disease 2019 (COVID-19), caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), has had a devastating worldwide effect. Understanding the evolution and transmission of SARS-CoV-2 is of paramount importance for controlling, combating and preventing COVID-19. Because both the number of SARS-CoV-2 genome sequences and the number of unique mutations are growing rapidly, the phylogenetic analysis of SARS-CoV-2 genome isolates faces an emergent large-data challenge. We introduce a dimension-reduced K-means clustering strategy to tackle this challenge. We examine the performance and effectiveness of three dimension-reduction algorithms: principal component analysis (PCA), t-distributed stochastic neighbor embedding (t-SNE), and uniform manifold approximation and projection (UMAP). Using four benchmark datasets, we found that UMAP is the best-suited technique because of its stable, reliable, and efficient performance, its ability to improve clustering accuracy, especially for large Jaccard distance-based datasets, and its superior clustering visualization. UMAP-assisted K-means clustering enables us to shed light on increasingly large datasets from SARS-CoV-2 genome isolates.
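As a rough illustration of the kind of pipeline the abstract describes, the sketch below builds a Jaccard distance-based representation of samples and clusters it with K-means. It is not the authors' implementation. Several things here are assumptions: each sample is modelled as a set of mutation positions, each sample is represented by its row of Jaccard distances to all samples, the toy data are invented, and the helper names (`jaccard_features`, `farthest_first_centers`, `kmeans`) are hypothetical. The dimension-reduction step (PCA, t-SNE, or UMAP) is deliberately omitted to keep the example dependency-free; in practice it would be applied to the feature rows before K-means.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two sets: 1 - |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


def jaccard_features(samples):
    """Represent each sample by its Jaccard distances to every sample."""
    return [[jaccard_distance(s, t) for t in samples] for s in samples]


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(p, q))


def farthest_first_centers(points, k):
    """Deterministic initialization (a simplification, not k-means++):
    start from the first point, then repeatedly add the point that is
    farthest from the centers chosen so far."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centers),
        )
        centers.append(list(farthest))
    return centers


def kmeans(points, k, max_iterations=100):
    """Plain Lloyd-style K-means; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented toy data: each set holds hypothetical mutation positions.
samples = [
    {1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3},
    {7, 8}, {7, 8, 9}, {7, 8, 10},
]
features = jaccard_features(samples)
labels = kmeans(features, k=2)
print(labels)
```

In a fuller pipeline, `features` would first be passed through a dimension-reduction routine and the reduced coordinates handed to K-means; the pure-Python K-means here is only meant to make the two steps explicit.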

SUBMITTER: Hozumi Y 

PROVIDER: S-EPMC7897976 | biostudies-literature | 2021 Feb

REPOSITORIES: biostudies-literature


Publications

UMAP-assisted K-means clustering of large-scale SARS-CoV-2 mutation datasets.

Yuta Hozumi, Rui Wang, Changchuan Yin, Guo-Wei Wei

Computers in Biology and Medicine, 2021-02-22



Similar Datasets

| S-EPMC8492016 | biostudies-literature
| S-EPMC2896182 | biostudies-literature
| S-EPMC9016156 | biostudies-literature
| S-EPMC4138177 | biostudies-literature
| S-EPMC2672630 | biostudies-literature
| S-EPMC547898 | biostudies-literature
| S-EPMC3218420 | biostudies-other
| S-EPMC4493645 | biostudies-literature
| S-EPMC7603405 | biostudies-literature