Dataset Information

Centroid based clustering of high throughput sequencing reads based on n-mer counts.

ABSTRACT:

Background

Many problems in computational biology require alignment-free sequence comparisons. One of the common tasks involving sequence comparison is sequence clustering. Here we apply methods of alignment-free comparison (in particular, comparison using sequence composition) to the challenge of sequence clustering.
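Comparison by sequence composition starts from a vector of word (n-mer) counts for each read. The following is a minimal sketch of that step, not code from the afcluster tool; the function names `nmer_counts` and `nmer_frequencies`, the default word length, and the choice to skip words containing ambiguous bases are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def nmer_counts(seq, n=2, alphabet="ACGT"):
    """Count overlapping n-mers in seq.

    Returns a list of counts in fixed lexicographic order over the
    alphabet, so vectors from different reads are directly comparable.
    Words containing characters outside the alphabet (e.g. N) are
    ignored (an assumption of this sketch).
    """
    seq = seq.upper()
    observed = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    words = ["".join(p) for p in product(alphabet, repeat=n)]
    return [observed.get(w, 0) for w in words]


def nmer_frequencies(seq, n=2, alphabet="ACGT"):
    """Normalize n-mer counts to frequencies that sum to 1."""
    counts = nmer_counts(seq, n, alphabet)
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c / total for c in counts]
```

Normalizing to frequencies makes reads of different lengths comparable, at the cost of noisier estimates for short reads.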

Results

We study several centroid based algorithms for clustering sequences based on word counts. Study of their performance shows that the k-means algorithm, with or without data whitening, is computationally efficient. A higher clustering accuracy can be achieved using the soft expectation maximization method, whereby each sequence is attributed to each cluster with a specific probability. We implement an open source tool for alignment-free clustering. It is publicly available from GitHub: https://github.com/luscinius/afcluster.
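The difference between hard and soft centroid-based assignment can be sketched as follows. This is an illustrative example, not the afcluster implementation: it assumes squared Euclidean distance between word-frequency vectors, caller-supplied initial centroids, and a generic soft assignment with a hypothetical `temperature` parameter. The distance measures and the expectation maximization scheme used in the paper may differ.

```python
import math


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def kmeans(points, centroids, max_iter=100):
    """Hard k-means: each point belongs to exactly one cluster.

    Returns (labels, centroids). Iteration stops when the labels no
    longer change or after max_iter rounds.
    """
    centroids = [list(c) for c in centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = mean(members)
    return labels, centroids


def soft_assign(point, centroids, temperature=1.0):
    """Soft assignment: probability of the point belonging to each cluster.

    Weights are proportional to exp(-distance / temperature); this form
    is an assumption of the sketch, not the paper's likelihood model.
    """
    dists = [sq_dist(point, c) for c in centroids]
    d_min = min(dists)  # subtract the minimum for numerical stability
    weights = [math.exp(-(d - d_min) / temperature) for d in dists]
    total = sum(weights)
    return [w / total for w in weights]
```

In the soft variant a read lying between two centroids contributes to both clusters, whereas hard k-means forces a single label; lowering `temperature` makes the soft assignment approach the hard one.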

Conclusions

We show the utility of alignment-free sequence clustering for high throughput sequencing analysis despite its limitations. In particular, it allows one to perform assembly with reduced resources and a minimal loss of quality. The major factor affecting performance of alignment-free read clustering is the length of the read.

SUBMITTER: Solovyov A

PROVIDER: S-EPMC3848435 | biostudies-literature | 2013 Sep

REPOSITORIES: biostudies-literature


Publications

Centroid based clustering of high throughput sequencing reads based on n-mer counts.

Solovyov A, Lipkin WI

BMC Bioinformatics, 2013-09-08


PMID: 24011402

Similar Datasets

Project description:

Motivation: B cells derive their antigen-specificity through the expression of Immunoglobulin (Ig) receptors on their surface. These receptors are initially generated stochastically by somatic re-arrangement of the DNA and further diversified following antigen-activation by a process of somatic hypermutation, which introduces mainly point substitutions into the receptor DNA at a high rate. Recent advances in next-generation sequencing have enabled large-scale profiling of the B cell Ig repertoire from blood and tissue samples. A key computational challenge in the analysis of these data is partitioning the sequences to identify descendants of a common B cell (i.e. a clone). Current methods group sequences using a fixed distance threshold, or a likelihood calculation that is computationally-intensive. Here, we propose a new method based on spectral clustering with an adaptive threshold to determine the local sequence neighborhood. Validation using simulated and experimental datasets demonstrates that this method has high sensitivity and specificity compared to a fixed threshold that is optimized for these measures. In addition, this method works on datasets where choosing an optimal fixed threshold is difficult and is more computationally efficient in all cases. The ability to quickly and accurately identify members of a clone from repertoire sequencing data will greatly improve downstream analyses. Clonally-related sequences cannot be treated independently in statistical models, and clonal partitions are used as the basis for the calculation of diversity metrics, lineage reconstruction and selection analysis. Thus, the spectral clustering-based method here represents an important contribution to repertoire analysis.

Availability and implementation: Source code for this method is freely available in the SCOPe (Spectral Clustering for clOne Partitioning) R package in the Immcantation framework: www.immcantation.org under the CC BY-SA 4.0 license.

Supplementary information: Supplementary data are available at Bioinformatics online.
