Dataset Information

A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines.

ABSTRACT: BACKGROUND: Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts. RESULTS: To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, allowing one to tune the trade-off between parallelism and lazy-evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain-specific data-containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats). CONCLUSIONS: PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy can also be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing.
PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples.
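
The batching idea described in the abstract (nested map functions evaluated on a worker pool, with batch size trading parallelism against memory use) can be illustrated with a minimal sketch. This is not PaPy's API: it uses only the Python standard library, and the names `batched_map`, `clean`, `gc_content`, and `run_pipeline` are hypothetical, chosen for illustration. A thread pool stands in for PaPy's pooled local or remote compute resources.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched_map(func, items, pool, batch_size):
    """Lazily apply func to items, evaluating one batch in parallel at a time.

    Hypothetical helper (not part of PaPy): a larger batch_size gives more
    parallelism, a smaller one keeps fewer items in memory at once.
    """
    iterator = iter(items)
    while True:
        # Pull at most batch_size items from the upstream iterator.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # Evaluate the batch on the pool; results keep their input order.
        yield from pool.map(func, batch)


def clean(seq):
    """Example component: normalise a raw sequence string."""
    return seq.strip().upper()


def gc_content(seq):
    """Example component: fraction of G and C characters in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def run_pipeline(raw_sequences, batch_size=2, workers=2):
    """Chain two components into a linear pipeline (the simplest possible DAG)."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cleaned = batched_map(clean, raw_sequences, pool, batch_size)
        scored = batched_map(gc_content, cleaned, pool, batch_size)
        # Consume the lazy pipeline while the pool is still open.
        return list(scored)
```

For example, `run_pipeline([" ggcc\n", "atat", "GCAT "])` chains the two components over three inputs. Because each stage is a generator, upstream items are only read when a downstream stage asks for the next batch.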

SUBMITTER: Cieslik M

PROVIDER: S-EPMC3051902 | biostudies-other | 2011

REPOSITORIES: biostudies-other


Publications

A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines.

Marcin Cieślik, Cameron Mura

BMC Bioinformatics, 2011-02-25


PMID: 21352538

Similar Datasets

Project description: Background: Automated bioinformatics workflows are more robust and easier to maintain, and their results more reproducible, when built with command-line utilities than with custom-coded scripts. Command-line utilities further benefit bioinformatics developers by relieving them of the need to learn, or interact directly with, biological software libraries. There is, however, a lack of command-line utilities that leverage popular open-source biological software toolkits such as BioPerl ( http://bioperl.org ) to make many of the well-designed, robust, and routinely used biological classes available to a wider base of end users. Results: Designed as standard utilities for UNIX-family operating systems, BpWrapper makes the functionality of some of the most popular BioPerl modules readily accessible on the command line to novice as well as experienced bioinformatics practitioners. The initial release of BpWrapper includes four utilities with concise command-line user interfaces, bioseq, bioaln, biotree, and biopop, specialized for manipulation of molecular sequences, sequence alignments, phylogenetic trees, and DNA polymorphisms, respectively. Over a hundred methods are currently available as command-line options, and new methods are easily incorporated. Performance of BpWrapper utilities lags that of precompiled utilities while being equivalent to that of other utilities based on BioPerl. BpWrapper has been tested on BioPerl Release 1.6, Perl versions 5.10.1 to 5.25.10, and operating systems including Apple macOS, Microsoft Windows, and GNU/Linux. Release code is available from the Comprehensive Perl Archive Network (CPAN) at https://metacpan.org/pod/Bio::BPWrapper . Source code is available on GitHub at https://github.com/bioperl/p5-bpwrapper . Conclusions: BpWrapper improves on existing sequence utilities by following the design principles of Unix text utilities, such as a concise user interface, extensive command-line options, and standard input/output for serialized operations. Further, dozens of novel methods for manipulation of sequences, alignments, and phylogenetic trees, unavailable in existing utilities (e.g., EMBOSS, Newick Utilities, and FAST), are provided. Bioinformaticians should find BpWrapper useful for rapid prototyping of workflows on the command line without creating custom scripts for comparative genomics and other bioinformatics applications.

Project description: Background: This article addresses the problem of interoperation of heterogeneous bioinformatics databases. Results: We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) but also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and JAVA languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research. Conclusion: BioWarehouse embodies significant progress on the database integration problem for bioinformatics.

Project description: Objective: The objectives of this paper are to 1) construct a new network model compatible with distributed computation, 2) construct the full optimal power flow (OPF) in a distributed fashion so that an effective, non-inferior solution can be found, and 3) develop a scalable algorithm that guarantees convergence to a local minimum. Existing challenges: Due to the nonconvexity of the problem, the search for a solution to OPF problems is not scalable, which makes the OPF highly limited for the system operation of large-scale real-world power grids ("the curse of dimensionality"). Recent attempts at distributed computation aim for a scalable and efficient algorithm by reducing the computational cost per iteration in exchange for increased communication costs. Motivation: A new network model allows for efficient computation without increasing communication costs. With the network model, recent advancements in distributed computation make it possible to develop an efficient and scalable algorithm suitable for large-scale OPF optimizations. Methods: We propose a new network model in which all nodes are directly connected to the center node to keep the communication costs manageable. Based on the network model, we suggest a nodal distributed algorithm and direct communication to all nodes through the center node. We demonstrate that the suggested algorithm converges to a local minimum rather than a point, satisfying the first optimality condition. Results: The proposed algorithm identifies solutions to OPF problems in various IEEE model systems. The solutions are identical to those using a centrally optimized and heuristic approach. The computation time at each node does not depend on the system size, and Niter does not increase significantly with the system size. Conclusion: Our proposed network model is a star network for maintaining the shortest node-to-node distances to allow a linear information exchange. The proposed algorithm guarantees convergence to a local minimum rather than a maximum or a saddle point, and it maintains computational efficiency for large-scale OPF as a scalable algorithm.

