Analyzing large scale genomic data on the cloud with Sparkhit.
ABSTRACT: Motivation: The increasing amount of next-generation sequencing data poses a fundamental challenge for large-scale genomic analytics. Existing tools use different distributed computational platforms to scale out bioinformatics workloads. However, these tools do not scale efficiently, and they incur substantial runtime overheads when pre-processing large amounts of data. To address these limitations, we have developed Sparkhit: a distributed bioinformatics framework built on top of the Apache Spark platform. Results: Sparkhit integrates a variety of analytical methods. It is implemented in the Spark extended MapReduce model. It runs 92-157 times faster than MetaSpark on metagenomic fragment recruitment and 18-32 times faster than Crossbow on data pre-processing. We analyzed 100 terabytes of data across four genomic projects in the cloud in 21 h, including the run times of cluster deployment and data downloading. Furthermore, our application on the entire Human Microbiome Project shotgun sequencing data was completed in 2 h, presenting an approach to easily associate large amounts of public datasets with reference data. Availability and implementation: Sparkhit is freely available at: https://rhinempi.github.io/sparkhit/. Contact: asczyrba@cebitec.uni-bielefeld.de. Supplementary information: Supplementary data are available at Bioinformatics online.
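The abstract describes Sparkhit as a framework implemented in Spark's extended MapReduce model. As a rough illustration only (not Sparkhit's actual API), the sketch below shows what a generic Spark map/reduce job over sequencing reads looks like; the input and output paths, the k-mer size, and the simplified FASTQ handling are all hypothetical assumptions for the example.

```scala
import org.apache.spark.sql.SparkSession

object MapReduceReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-mapreduce-read-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Load reads as plain text lines (hypothetical S3 path); the filter is a
    // crude, illustrative way to keep only sequence lines of a FASTQ file.
    val reads = sc.textFile("s3://example-bucket/reads/*.fastq")
      .filter(line => line.nonEmpty && !line.startsWith("@") && !line.startsWith("+"))

    // Map step: emit (k-mer, 1) pairs; a stand-in for mapping read fragments
    // against a reference, not Sparkhit's actual recruitment algorithm.
    val k = 31
    val kmerCounts = reads
      .flatMap(seq => seq.sliding(k).map(kmer => (kmer, 1L)))
      .reduceByKey(_ + _) // Reduce step: aggregate counts across the cluster

    kmerCounts.saveAsTextFile("s3://example-bucket/output/kmer-counts")
    spark.stop()
  }
}
```

The point of the sketch is the execution model: the map and reduce steps are expressed as RDD transformations that Spark distributes across the cluster, which is the mechanism the abstract credits for Sparkhit's scalability on cloud deployments.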
SUBMITTER: Huang L
PROVIDER: S-EPMC5925781 | biostudies-literature | 2018 May
REPOSITORIES: biostudies-literature