Dataset Information

Large scale microbiome profiling in the cloud.

ABSTRACT:

Motivation

Bacterial metagenomics profiling for metagenomic whole genome sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale well to larger reference genome collections. However, large reference genome collections can provide a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources.

Results

We developed Flint, a metagenomics profiling pipeline that is built on top of the Apache Spark framework and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43 552 bacterial genomes from Ensembl. Flint runs on Amazon's Elastic MapReduce service, and is able to profile 1 million Illumina paired-end reads against over 40 K genomes on 64 machines in 67 s, an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments.
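As a rough consistency check on the figures quoted above, the following back-of-the-envelope sketch converts the 67 s batch result into an hourly rate and derives a cost per million reads from the sustained streaming rate. The cost-per-million figure is a derived quantity that the abstract does not state; it assumes the cluster is kept fully busy for the whole billed hour, and the function names are illustrative only.

```python
# Figures quoted in the Results paragraph.
READS_PROFILED = 1_000_000              # Illumina paired-end reads in the batch run
WALL_SECONDS = 67                       # wall-clock time on 64 machines
SUSTAINED_READS_PER_HOUR = 55_000_000   # streaming mapping rate
CLUSTER_COST_PER_HOUR_USD = 8.00        # hourly cluster cost


def reads_per_hour(reads: int, seconds: float) -> float:
    """Extrapolate a batch measurement to an hourly rate."""
    return reads * 3600 / seconds


def cost_per_million_reads(cost_per_hour: float, rate_per_hour: float) -> float:
    """Derived figure (assumption: the cluster is fully utilized for the hour)."""
    return cost_per_hour / (rate_per_hour / 1_000_000)


batch_rate = reads_per_hour(READS_PROFILED, WALL_SECONDS)
streaming_cost = cost_per_million_reads(
    CLUSTER_COST_PER_HOUR_USD, SUSTAINED_READS_PER_HOUR
)

print(f"Batch run extrapolated: {batch_rate / 1e6:.1f} million reads/hour")
print(f"Streaming cost: ${streaming_cost:.3f} per million reads")
```

Under these assumptions the batch run extrapolates to roughly 53.7 million reads per hour, in line with the quoted 55 million, and the streaming cost works out to about $0.15 per million reads.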

Availability and implementation

Flint is open source software, available under the MIT License (MIT). Source code is available at https://github.com/camilo-v/flint.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Valdes C

PROVIDER: S-EPMC6612844 | biostudies-literature | 2019 Jul

REPOSITORIES: biostudies-literature

Publications

Large scale microbiome profiling in the cloud.

Camilo Valdes, Vitalii Stebliankin, Giri Narasimhan

Bioinformatics (Oxford, England), 2019-07-01, issue 14

PMID: 31510682

Similar Datasets

warpDOCK: Large-Scale Virtual Drug Discovery Using Cloud Infrastructure. | S-EPMC10433467 | biostudies-literature

Analyzing large scale genomic data on the cloud with Sparkhit. | S-EPMC5925781 | biostudies-literature

Swarm: A federated cloud framework for large-scale variant analysis. | S-EPMC8143397 | biostudies-literature

cOSPREY: A Cloud-Based Distributed Algorithm for Large-Scale Computational Protein Design. | S-EPMC5586165 | biostudies-literature

Practical considerations for large-scale gut microbiome studies. | S-EPMC7207147 | biostudies-literature

Studying Scale Dependency of Aerosol Cloud Interactions using Multi-Scale Cloud Formulations. | S-EPMC7886284 | biostudies-literature

Large-scale Characteristics of Tropical Convective Systems through the Prism of Cloud Regime. | S-EPMC7299438 | biostudies-literature

Large-scale microbiome data integration enables robust biomarker identification. | S-EPMC10766547 | biostudies-literature

Inference of Large-scale Time-delayed Gene Regulatory Network with Parallel MapReduce Cloud Platform. | S-EPMC6290780 | biostudies-literature

Design and implementation of a hybrid cloud system for large-scale human genomic research. | S-EPMC9908893 | biostudies-literature