Dataset Information

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition.

ABSTRACT: Estimation of bacterial community composition from high-throughput sequenced 16S rRNA gene amplicons is a key task in microbial ecology. Since the sequence data from each sample typically consist of a large number of reads and are adversely impacted by different levels of biological and technical noise, accurate analysis of such large datasets is challenging. There has been a recent surge of interest in using compressed-sensing-inspired and convex-optimization-based methods to solve the estimation problem for bacterial community composition. These methods typically rely on summarizing the sequence data by frequencies of low-order k-mers and matching this information statistically with a taxonomically structured database. Here we show that the accuracy of the resulting community composition estimates can be substantially improved by aggregating the reads from a sample with an unsupervised machine learning approach prior to the estimation phase. The aggregation of reads is a pre-processing step in which a standard K-means clustering algorithm partitions a large set of reads into subsets at reasonable computational cost, providing several vectors of first-order statistics instead of a single statistical summary in terms of k-mer frequencies. The output of the clustering is then processed further to obtain the final estimate for each sample. The resulting method is called Aggregation of Reads by K-means (ARK), and it is based on a statistical argument via a mixture density formulation. ARK is found to improve the fidelity and robustness of several recently introduced methods, with only a modest increase in computational complexity. An open source, platform-independent implementation of the method in the Julia programming language is freely available at https://github.com/dkoslicki/ARK. A Matlab implementation is available at http://www.ee.kth.se/ctsoftware.
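
The pre-processing idea in the abstract (cluster the reads, then summarize each cluster by its own k-mer frequency vector) can be illustrated with a minimal Python sketch. This is not the authors' Julia or Matlab implementation. All function names here are hypothetical, the K-means is a plain Lloyd's iteration written out for self-containment, and the final step assumes that per-cluster estimates are combined with weights proportional to cluster size, which is one natural reading of the mixture density formulation. The per-cluster estimator itself (the k-mer-based method that ARK wraps) is left as a caller-supplied function.

```python
import random
from itertools import product


def kmer_frequencies(read, k=2):
    """Return the normalized k-mer frequency vector of one read (alphabet ACGT)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0.0] * len(index)
    for i in range(len(read) - k + 1):
        j = index.get(read[i:i + k])
        if j is not None:  # skip k-mers containing ambiguous bases such as N
            counts[j] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, n_clusters, n_iter=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster label per input vector."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_clusters)
    labels = [0] * len(vectors)
    for _ in range(n_iter):
        labels = [
            min(range(n_clusters), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        for c in range(n_clusters):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = mean_vector(members)
    return labels


def aggregate_reads(reads, k=2, n_clusters=2, seed=0):
    """Cluster reads; return (weight, mean k-mer vector) per non-empty cluster."""
    vectors = [kmer_frequencies(r, k) for r in reads]
    labels = kmeans(vectors, n_clusters, seed=seed)
    summaries = []
    for c in range(n_clusters):
        members = [v for v, lab in zip(vectors, labels) if lab == c]
        if members:
            summaries.append((len(members) / len(vectors), mean_vector(members)))
    return summaries


def combine_estimates(summaries, estimate):
    """Combine per-cluster estimates, weighted by cluster size (assumed rule).

    `estimate` is a hypothetical stand-in for the downstream composition
    estimator: it maps a k-mer frequency vector to a composition vector.
    """
    per_cluster = [(w, estimate(mu)) for w, mu in summaries]
    dim = len(per_cluster[0][1])
    return [sum(w * x[i] for w, x in per_cluster) for i in range(dim)]
```

Calling `aggregate_reads(reads, k=2, n_clusters=2)` on a list of read strings yields one weighted k-mer summary per cluster, and `combine_estimates(summaries, estimate)` merges the per-cluster results into a single vector for the sample. In practice k and the number of clusters would be chosen much larger than in this toy setting.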

SUBMITTER: Koslicki D

PROVIDER: S-EPMC4619776 | biostudies-literature | 2015

REPOSITORIES: biostudies-literature

Publications

ARK: Aggregation of Reads by K-Means for Estimation of Bacterial Community Composition.

Koslicki D, Chatterjee S, Shahrivar D, Walker AW, Francis SC, Fraser LJ, Vehkaperä M, Lan Y, Corander J

PLoS ONE, 2015-10-23, 10

PMID: 26496191

Similar Datasets

ARK: Aggregate of Reads by K-Means for Estimation of Bacterial Community Composition
| PRJEB9828 | ENA

Bayesian estimation of bacterial community composition from 454 sequencing data.
| S-EPMC3384343 | biostudies-literature

bacterial community composition
| PRJNA1065110 | ENA

Allele specific expression and bacterial community composition in the Drosophila gut
2024-07-31 | GSE263264 | GEO

bacterial community Raw sequence reads
| PRJNA530507 | ENA

Bacterial community composition (16S rRNA)
| PRJEB73275 | ENA

bacterial diversity and community composition
| PRJEB19195 | ENA

Bacterial community composition in Panchagavya
| PRJEB43256 | ENA

Haplotype estimation using sequencing reads.
| S-EPMC3791270 | biostudies-literature

bacterial community composition of Hyalesthes obsoletus
| PRJEB13010 | ENA