Dataset Information

Sequence database versioning for command line and Galaxy bioinformatics servers.

ABSTRACT:

Motivation

There are various reasons for rerunning bioinformatics tools and pipelines on sequencing data, including reproducing a past result, validating a new tool or workflow using a known dataset, or tracking the impact of database changes. For identical results to be achieved, regularly updated reference sequence databases must be versioned and archived. Database administrators have tried to meet this requirement by supplying users with one-off versions of databases, but these are time consuming to set up and are inconsistent across resources. Disk storage and data backup performance have also discouraged maintaining multiple versions of databases, since databases such as NCBI nr can consume 50 GB or more of disk space per version, with growth rates that parallel Moore's law.

Results

Our end-to-end solution combines our own Kipper software package (a simple key-value large file versioning system) with BioMAJ (software for downloading sequence databases) and Galaxy (a web-based bioinformatics data processing platform). Available versions of databases can be recalled and used by command-line and Galaxy users. The Kipper data store format makes publishing curated FASTA databases convenient, since in most cases it can store a range of versions in a file only marginally larger than the latest version.
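How a range of versions can fit in a file only marginally larger than the latest version can be illustrated with a minimal Python sketch of key-value versioning. This is not Kipper's actual code or file format: the `Record` and `VersionedStore` names, and the idea of tagging each record with the version that created it and the version that deleted it, are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    """One key-value pair plus the version range in which it is valid (hypothetical layout)."""
    key: str
    value: str
    created: int                    # first version that contains this record
    deleted: Optional[int] = None   # first version that no longer contains it


class VersionedStore:
    """Illustrative in-memory store; each unchanged record is kept only once."""

    def __init__(self) -> None:
        self.records: List[Record] = []
        self.latest = 0

    def commit(self, snapshot: Dict[str, str]) -> int:
        """Record a full snapshot as a new version, storing only the differences."""
        self.latest += 1
        version = self.latest
        live = {r.key: r for r in self.records if r.deleted is None}
        # Retire records that are missing or changed in the new snapshot.
        for key, rec in live.items():
            if snapshot.get(key) != rec.value:
                rec.deleted = version
        # Append records that are new or changed.
        for key, value in snapshot.items():
            old = live.get(key)
            if old is None or old.value != value:
                self.records.append(Record(key, value, version))
        return version

    def checkout(self, version: int) -> Dict[str, str]:
        """Rebuild the key-value content as it was at the given version."""
        return {
            r.key: r.value
            for r in self.records
            if r.created <= version and (r.deleted is None or version < r.deleted)
        }


# Toy sequence identifiers and sequences, not real database entries.
store = VersionedStore()
v1 = store.commit({"seq1": "ACGT", "seq2": "GGCC"})
v2 = store.commit({"seq1": "ACGT", "seq2": "GGCA", "seq3": "TTAA"})
print(store.checkout(v1))
print(store.checkout(v2))
```

In this sketch the unchanged record `seq1` is stored once and shared by both versions, so the store grows with the number of changed records rather than with the number of versions.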

Availability and implementation

Kipper v1.0.0 and the Galaxy Versioned Data tool are written in Python and released as free and open source software available at https://github.com/Public-Health-Bioinformatics/kipper and https://github.com/Public-Health-Bioinformatics/versioned_data, respectively; detailed setup instructions can be found at https://github.com/Public-Health-Bioinformatics/versioned_data/blob/master/doc/setup.md

Contact

Damion.Dooley@Bccdc.Ca or William.Hsiao@Bccdc.Ca

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Dooley DM

PROVIDER: S-EPMC4824126 | biostudies-literature | 2016 Apr

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Sequence database versioning for command line and Galaxy bioinformatics servers.

Damion M Dooley, Aaron J Petkau, Gary Van Domselaar, William W L Hsiao

Bioinformatics (Oxford, England), 2015-12-12, issue 8

PMID: 26656932

Similar Datasets

From command-line bioinformatics to bioGUI. | S-EPMC6875409 | biostudies-literature

Bionitio: demonstrating and facilitating best practices for bioinformatics command-line software. | S-EPMC6755254 | biostudies-literature

Resequencing of Microbial Isolates: A Lab Module to Introduce Novices to Command-Line Bioinformatics. | S-EPMC8008064 | biostudies-literature

Exploring COVID-19 pathogenesis on command-line: A bioinformatics pipeline for handling and integrating omics data. | S-EPMC9095070 | biostudies-literature

CIAlign: A highly customisable command line tool to clean, interpret and visualise multiple sequence alignments. | S-EPMC8932311 | biostudies-literature

Hybkit: a Python API and command-line toolkit for hybrid sequence data from chimeric RNA methods. | S-EPMC10701094 | biostudies-literature

RadAA: A Command-line Tool for Identification of Radical Amino Acid Changes in Multiple Sequence Alignments. | S-EPMC6585820 | biostudies-literature

Increased coverage of protein families with the blocks database servers. | S-EPMC102407 | biostudies-literature

Galaxy as a gateway to bioinformatics: Multi-Interface Galaxy Hands-on Training Suite (MIGHTS) for scRNA-seq. | S-EPMC11707610 | biostudies-literature

Comprehensive Review of Web Servers and Bioinformatics Tools for Cancer Prognosis Analysis. | S-EPMC7013087 | biostudies-literature