
Dataset Information

Centroid based clustering of high throughput sequencing reads based on n-mer counts.


ABSTRACT: Many problems in computational biology require alignment-free sequence comparisons. One of the common tasks involving sequence comparison is sequence clustering. Here we apply methods of alignment-free comparison (in particular, comparison using sequence composition) to the challenge of sequence clustering. We study several centroid based algorithms for clustering sequences based on word counts. Study of their performance shows that using the k-means algorithm, with or without data whitening, is computationally efficient. A higher clustering accuracy can be achieved using the soft expectation maximization method, whereby each sequence is attributed to each cluster with a specific probability. We implement an open source tool for alignment-free clustering. It is publicly available from GitHub: https://github.com/luscinius/afcluster. We show the utility of alignment-free sequence clustering for high throughput sequencing analysis despite its limitations. In particular, it allows one to perform assembly with reduced resources and a minimal loss of quality. The major factor affecting performance of alignment-free read clustering is the length of the read.
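
The idea of centroid based clustering on word counts can be illustrated with a minimal sketch. This is not the afcluster implementation: the function names (`nmer_frequencies`, `kmeans`), the squared Euclidean distance, the farthest-point initialization, and the toy reads are all assumptions chosen for illustration, and no whitening or soft expectation maximization is performed.

```python
from itertools import product


def nmer_frequencies(seq, n=2):
    """Normalized n-mer counts over the alphabet ACGT (hypothetical helper)."""
    words = ["".join(p) for p in product("ACGT", repeat=n)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0] * len(words)
    for i in range(len(seq) - n + 1):
        j = index.get(seq[i:i + n])  # words with other symbols (e.g. N) are skipped
        if j is not None:
            counts[j] += 1
    total = sum(counts) or 1  # avoid division by zero for very short reads
    return [c / total for c in counts]


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, max_iter=100):
    """Plain hard k-means with deterministic farthest-point initialization."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Next centroid: the vector farthest from all centroids chosen so far.
        far = max(vectors, key=lambda v: min(sq_dist(v, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy reads (made up): three AT-rich and three GC-rich sequences.
reads = [
    "ATATATATAT", "TATATATATA", "AATTAATTAA",
    "GCGCGCGCGC", "CGCGCGCGCG", "GGCCGGCCGG",
]
vectors = [nmer_frequencies(r, n=2) for r in reads]
labels, centroids = kmeans(vectors, k=2)
```

In this toy case the AT-rich and GC-rich reads share no 2-mers, so they land in separate clusters. Real reads overlap far more in composition, which is where the abstract's remarks about read length and soft assignment become relevant.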

SUBMITTER: Solovyov A 

PROVIDER: S-EPMC3848435 | biostudies-literature | 2013 Sep

REPOSITORIES: biostudies-literature

Publications

Centroid based clustering of high throughput sequencing reads based on n-mer counts.

Solovyov Alexander, Lipkin W Ian

BMC Bioinformatics, 2013 Sep 8


Similar Datasets

| S-EPMC6365934 | biostudies-literature
| S-EPMC4895285 | biostudies-literature
| S-EPMC3348557 | biostudies-literature
| S-EPMC6280066 | biostudies-other
| S-EPMC6274891 | biostudies-literature
| S-EPMC5645146 | biostudies-literature
| S-EPMC8383893 | biostudies-literature
| S-EPMC3119603 | biostudies-literature
| S-EPMC4168710 | biostudies-literature
| S-EPMC6022594 | biostudies-literature