
Dataset Information


MeShClust: an intelligent tool for clustering DNA sequences.


ABSTRACT: Sequence clustering is a fundamental step in analyzing DNA sequences. Widely used software tools for sequence clustering rely on greedy approaches that are not guaranteed to produce the best results. These tools are sensitive to one parameter that determines the similarity among sequences in a cluster. Often, a biologist may not know the exact sequence similarity. Therefore, if the provided parameter is inaccurate, the clusters produced by these tools are unlikely to match the real clusters in the data. To overcome this limitation, we adapted the mean shift algorithm, an unsupervised machine-learning algorithm that has been used successfully thousands of times in fields such as image processing and computer vision. The theory behind the mean shift algorithm, unlike the greedy approaches, guarantees convergence to the modes, i.e. the cluster centers. Here we describe the first application of the mean shift algorithm to clustering DNA sequences. MeShClust is one of the few applications of the mean shift algorithm in bioinformatics. Further, we applied supervised machine learning to predict the identity score produced by global alignment using alignment-free methods. We demonstrate MeShClust's ability to cluster DNA sequences with high accuracy even when the sequence similarity parameter provided by the user is not very accurate.
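
To make the idea of mean shift converging to modes concrete, the following is a minimal sketch and not MeShClust's implementation. It assumes sequences are represented as k-mer frequency vectors, uses Euclidean distance with a flat kernel, and takes a hand-picked `bandwidth`; all three are illustrative assumptions, and the function names (`kmer_histogram`, `mean_shift`, `label_by_mode`) are hypothetical. The abstract's learned identity score is not reproduced here.

```python
from itertools import product
from math import dist  # Python 3.8+


def kmer_histogram(seq, k=2):
    """Return overlapping k-mer frequencies over the ACGT alphabet (assumed feature)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing non-ACGT symbols
            counts[word] += 1
    total = sum(counts.values()) or 1
    return [counts[km] / total for km in kmers]


def mean_shift(points, bandwidth, max_iter=100, tol=1e-9):
    """Flat-kernel mean shift: repeatedly move each estimate to the mean of
    the original points within `bandwidth` of it, until nothing moves."""
    modes = [list(p) for p in points]
    for _ in range(max_iter):
        largest_move = 0.0
        shifted = []
        for m in modes:
            neighbours = [p for p in points if dist(m, p) <= bandwidth]
            if not neighbours:  # defensive guard; keep the estimate in place
                shifted.append(m)
                continue
            mean = [sum(col) / len(neighbours) for col in zip(*neighbours)]
            largest_move = max(largest_move, dist(m, mean))
            shifted.append(mean)
        modes = shifted
        if largest_move < tol:
            break
    return modes


def label_by_mode(modes, merge_radius):
    """Give points the same label when their converged modes nearly coincide."""
    labels, centers = [], []
    for m in modes:
        for idx, c in enumerate(centers):
            if dist(m, c) <= merge_radius:
                labels.append(idx)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels, centers


# Toy input: two A-rich and two G-rich sequences (made up for illustration).
seqs = ["AAAAAAAAAA", "AAAAAAAAAC", "GGGGGGGGGG", "GGGGGGGGGT"]
points = [kmer_histogram(s) for s in seqs]
labels, centers = label_by_mode(mean_shift(points, bandwidth=0.5), merge_radius=0.25)
print(labels)
```

In this sketch the number of clusters is not supplied up front: it falls out of how many distinct modes the points converge to, which is the property the abstract contrasts with greedy, threshold-driven clustering.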

SUBMITTER: James BT 

PROVIDER: S-EPMC6101578 | biostudies-literature | 2018 Aug

REPOSITORIES: biostudies-literature


Publications

MeShClust: an intelligent tool for clustering DNA sequences.

James BT, Luczak BB, Girgis HZ

Nucleic Acids Research, 2018-08-01, issue 14



Similar Datasets

S-EPMC6187223 | biostudies-literature
S-EPMC8782307 | biostudies-literature
S-EPMC55324 | biostudies-literature
S-EPMC2575518 | biostudies-literature
S-EPMC3041550 | biostudies-literature
S-EPMC1409676 | biostudies-literature
S-EPMC4292933 | biostudies-literature
S-EPMC7088265 | biostudies-literature
S-EPMC3994846 | biostudies-literature
S-EPMC6122555 | biostudies-literature