
Dataset Information


The Amordad database engine for metagenomics.


ABSTRACT: Several technical challenges in metagenomic data analysis, including assembling metagenomic sequence data or identifying operational taxonomic units, are both significant and well known. These forms of analysis are increasingly cited as conceptually flawed, given the extreme variation within traditionally defined species and rampant horizontal gene transfer. Furthermore, computational requirements of such analysis have hindered content-based organization of metagenomic data at large scale. In this article, we introduce the Amordad database engine for alignment-free, content-based indexing of metagenomic datasets. Amordad places the metagenome comparison problem in a geometric context, and uses an indexing strategy that combines random hashing with a regular nearest neighbor graph. This framework allows refinement of the database over time by continual application of random hash functions, with the effect of each hash function encoded in the nearest neighbor graph. This eliminates the need to explicitly maintain the hash functions in order for query efficiency to benefit from the accumulated randomness. Results on real and simulated data show that Amordad can support logarithmic query time for identifying similar metagenomes even as the database size reaches into the millions. Source code, licensed under the GNU General Public License (version 3), is freely available for download from http://smithlabresearch.org/amordad. Contact: andrewds@usc.edu. Supplementary data are available at Bioinformatics online.

SUBMITTER: Behnam E 

PROVIDER: S-EPMC4184256 | biostudies-literature | 2014 Oct

REPOSITORIES: biostudies-literature


Publications

The Amordad database engine for metagenomics.

Ehsan Behnam, Andrew D. Smith

Bioinformatics (Oxford, England), 2014-06-27, issue 20



Similar Datasets

| S-EPMC6379032 | biostudies-literature
| S-EPMC1780044 | biostudies-literature
| S-EPMC2808862 | biostudies-literature
| S-EPMC2665244 | biostudies-literature
| S-EPMC8720894 | biostudies-literature
| S-EPMC3679982 | biostudies-literature
| S-EPMC2238875 | biostudies-literature
| S-EPMC11165154 | biostudies-literature
| S-EPMC4907353 | biostudies-literature
| S-EPMC4531314 | biostudies-other