Dataset Information

DolphinNext: a distributed data processing platform for high throughput genomics.

ABSTRACT:

Background

The emergence of high throughput technologies that produce vast amounts of genomic data, such as next-generation sequencing (NGS), is transforming biological research. The dramatic increase in the volume of data, together with the variety and continuous change of data processing tools, algorithms, and databases, makes analysis the main bottleneck for scientific discovery. The processing of high throughput datasets typically involves many different computational programs, each of which performs a specific step in a pipeline. Given the wide range of applications and organizational infrastructures, there is a great need for highly parallel, flexible, portable, and reproducible data processing frameworks. Several platforms currently exist for the design and execution of complex pipelines. Unfortunately, current platforms lack the combination of parallelism, portability, flexibility, and/or reproducibility that the current research environment requires. To address these shortcomings, workflow frameworks that provide a platform to develop and share portable pipelines have recently arisen. We complement these new platforms by providing a graphical user interface to create, maintain, and execute complex pipelines. Such a platform will simplify robust and reproducible workflow creation for non-technical users and give large organizations a robust platform for maintaining pipelines.

Results

To simplify development, maintenance, and execution of complex pipelines, we created DolphinNext. DolphinNext facilitates building and deployment of complex pipelines using a modular approach implemented in a graphical interface that relies on the powerful Nextflow workflow framework. It provides:

1. A drag-and-drop user interface that visualizes pipelines and allows users to create pipelines without familiarity with the underlying programming languages.
2. Modules to execute and monitor pipelines in distributed computing environments such as high-performance clusters and/or the cloud.
3. Reproducible pipelines with version tracking and stand-alone versions that can be run independently.
4. Modular process design with process revisioning support to increase reusability and pipeline development efficiency.
5. Pipeline sharing with GitHub and automated testing.
6. Extensive reports with R Markdown and Shiny support for interactive data visualization and analysis.
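The idea of modular processes with revisioning (item 4 above) can be illustrated with a minimal Python sketch. It is not DolphinNext's or Nextflow's implementation: the names `Process`, `Registry`, and `run_pipeline`, and the toy string-processing steps, are hypothetical and chosen only to show how pipelines can pin specific process revisions while sharing the remaining steps.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Process:
    """One reusable pipeline step, identified by name and revision (hypothetical)."""
    name: str
    revision: int
    run: Callable[[str], str]


class Registry:
    """Keeps every revision of every process so older pipelines stay runnable."""

    def __init__(self) -> None:
        self._processes: Dict[Tuple[str, int], Process] = {}

    def add(self, process: Process) -> None:
        self._processes[(process.name, process.revision)] = process

    def get(self, name: str, revision: int) -> Process:
        return self._processes[(name, revision)]


def run_pipeline(registry: Registry, steps: List[Tuple[str, int]], data: str) -> str:
    """Run the pinned (name, revision) steps in order, feeding each output forward."""
    for name, revision in steps:
        data = registry.get(name, revision).run(data)
    return data


registry = Registry()
registry.add(Process("trim", 1, lambda s: s.strip()))
# Revision 2 of the same process also drops trailing ambiguous bases ("N").
registry.add(Process("trim", 2, lambda s: s.strip().rstrip("N")))
registry.add(Process("upper", 1, lambda s: s.upper()))

# Two pipelines reuse the "upper" step but pin different "trim" revisions.
pipeline_v1 = [("trim", 1), ("upper", 1)]
pipeline_v2 = [("trim", 2), ("upper", 1)]

result_v1 = run_pipeline(registry, pipeline_v1, "  acgtNN  ")
result_v2 = run_pipeline(registry, pipeline_v2, "  acgtNN  ")
print(result_v1, result_v2)
```

Because each pipeline refers to a process by name and revision, adding a new revision does not change the behavior of pipelines that pinned the old one, which is one way revisioning can support reproducibility.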

Conclusion

DolphinNext is a flexible, intuitive, web-based data processing and analysis platform that enables creating, deploying, sharing, and executing complex Nextflow pipelines, with extensive revisioning and interactive reporting to enhance the reproducibility of results.

SUBMITTER: Yukselen O

PROVIDER: S-EPMC7168977 | biostudies-literature | 2020 Apr

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

DolphinNext: a distributed data processing platform for high throughput genomics.

Yukselen O, Turkyilmaz O, Ozturk AR, Garber M, Kucukural A

BMC Genomics, 2020 Apr 19, issue 1


PMID: 32306927

Similar Datasets

Project description: Modern, high-throughput animal tracking increasingly yields 'big data' at very fine temporal scales. At these scales, location error can exceed the animal's step size, leading to mis-estimation of behaviours inferred from movement. 'Cleaning' the data to reduce location errors is one of the main ways to deal with position uncertainty. Although data cleaning is widely recommended, inclusive, uniform guidance on this crucial step, and on how to organise the cleaning of massive datasets, is relatively scarce. A pipeline for cleaning massive high-throughput datasets must balance ease of use and computational efficiency, rejecting location errors while preserving valid animal movements. Another useful feature of a pre-processing pipeline is efficient segmentation and clustering of location data for statistical methods, while remaining scalable to large datasets and robust to imperfect sampling. Because manual methods are prohibitively time-consuming, and to boost reproducibility, pre-processing pipelines must be automated. We provide guidance on building pipelines for pre-processing high-throughput animal tracking data to prepare it for subsequent analyses. We apply our proposed pipeline to simulated movement data with location errors, and also show how large volumes of cleaned data can be transformed into biologically meaningful 'residence patches', for exploratory inference on animal space use. We use tracking data from the Wadden Sea ATLAS system (WATLAS) to show how pre-processing improves its quality, and to verify the usefulness of the residence patch method. Finally, with tracks from Egyptian fruit bats Rousettus aegyptiacus, we demonstrate the pre-processing pipeline and residence patch method in a fully worked out example. To help with fast implementation of standardised methods, we developed the R package atlastools, which we also introduce here.
Our pre-processing pipeline and atlastools can be used with any high-throughput animal movement data in which the high data-volume combined with knowledge of the tracked individuals' movement capacity can be used to reduce location errors. atlastools is easy to use for beginners while providing a template for further development. The common use of simple yet robust pre-processing steps promotes standardised methods in the field of movement ecology and leads to better inferences from data.
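As an illustration of how knowledge of an animal's movement capacity can be used to reject location errors, here is a minimal Python sketch of a speed-based filter. It is not the atlastools API (atlastools is an R package). The function name `filter_by_speed`, the `(time, x, y)` tuple layout, the units, and the assumption that the first fix is valid are all hypothetical simplifications.

```python
import math
from typing import List, Tuple

# Hypothetical fix layout: (time in seconds, x in metres, y in metres).
Fix = Tuple[float, float, float]


def filter_by_speed(track: List[Fix], max_speed: float) -> List[Fix]:
    """Drop fixes that imply a speed above max_speed from the last kept fix.

    Assumes the track is sorted by time and that the first fix is valid.
    """
    if not track:
        return []
    kept = [track[0]]
    for t, x, y in track[1:]:
        t0, x0, y0 = kept[-1]
        dt = t - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order timestamps
        speed = math.hypot(x - x0, y - y0) / dt
        if speed <= max_speed:
            kept.append((t, x, y))
    return kept


# Toy track: the third fix jumps about 500 m in one second, which is
# implausible for an animal assumed to move at no more than 5 m/s.
track = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 500.0, 0.0), (3.0, 3.0, 0.0)]
cleaned = filter_by_speed(track, max_speed=5.0)
print(cleaned)
```

Comparing each fix with the last kept fix, rather than with its immediate predecessor, prevents a single outlier from also causing the valid fix that follows it to be rejected.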

