Dataset Information

Large scale microbiome profiling in the cloud.


ABSTRACT:

MOTIVATION: Bacterial metagenomics profiling for metagenomic whole sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale very well to larger reference genome collections. However, large reference genome collections are capable of providing a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources.

RESULTS: We developed Flint, a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43 552 bacterial genomes from Ensembl. Flint runs on Amazon's Elastic MapReduce service, and is able to profile 1 million Illumina paired-end reads against over 40 K genomes on 64 machines in 67 s, an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments.

AVAILABILITY AND IMPLEMENTATION: Flint is open source software, available under the MIT License (MIT). Source code is available at https://github.com/camilo-v/flint.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
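The throughput and cost figures quoted in the abstract can be related with a little arithmetic. The sketch below is illustrative only: the constants are taken from the abstract, while the function names and the derived "cost per million reads" metric are assumptions introduced here and are not reported by the authors.

```python
# Figures quoted in the abstract.
READS_PER_RUN = 1_000_000               # paired-end reads in the timed run
SECONDS_PER_RUN = 67                    # wall-clock time for that run
MACHINES = 64                           # cluster size for that run
SUSTAINED_READS_PER_HOUR = 55_000_000   # streaming mapping rate
CLUSTER_COST_PER_HOUR_USD = 8.00        # hourly cluster cost


def reads_per_hour(reads: int, seconds: float) -> float:
    """Extrapolate a timed run to an hourly rate (hypothetical helper)."""
    return reads * 3600 / seconds


def cost_per_million_reads(cost_per_hour: float, hourly_reads: float) -> float:
    """Derived metric, not stated in the abstract (hypothetical helper)."""
    return cost_per_hour / (hourly_reads / 1_000_000)


# Rate implied by the 67 s run, for comparison with the sustained streaming rate.
batch_rate = reads_per_hour(READS_PER_RUN, SECONDS_PER_RUN)
per_machine_rate = batch_rate / MACHINES
streaming_cost = cost_per_million_reads(
    CLUSTER_COST_PER_HOUR_USD, SUSTAINED_READS_PER_HOUR
)

print(f"Implied rate from the timed run: {batch_rate:,.0f} reads/hour")
print(f"Implied rate per machine:        {per_machine_rate:,.0f} reads/hour")
print(f"Cost at the sustained rate:      ${streaming_cost:.3f} per million reads")
```

Under these assumptions, the 67 s run extrapolates to roughly 54 million reads per hour, which is close to the sustained 55 million reads per hour, and the hourly cluster cost works out to about $0.15 per million reads.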

SUBMITTER: Valdes C 

PROVIDER: S-EPMC6612844 | biostudies-literature | 2019 Jul

REPOSITORIES: biostudies-literature

Publications

Large scale microbiome profiling in the cloud.

Valdes Camilo, Stebliankin Vitalii, Narasimhan Giri

Bioinformatics (Oxford, England), 2019-07-01, issue 14



Similar Datasets

| S-EPMC5925781 | biostudies-literature
| S-EPMC8143397 | biostudies-literature
| S-EPMC5586165 | biostudies-literature
| S-EPMC7207147 | biostudies-literature
| S-EPMC7886284 | biostudies-literature
| S-EPMC7299438 | biostudies-literature
| S-EPMC6290780 | biostudies-literature
| S-EPMC2775068 | biostudies-literature
| S-EPMC1950539 | biostudies-literature
| S-EPMC1524917 | biostudies-literature