Dataset Information

Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures.

ABSTRACT: A fundamental problem in bioinformatics is genome assembly. Next-generation sequencing (NGS) technologies produce large volumes of fragmented genome reads, which require large amounts of memory to assemble the complete genome efficiently. With recent improvements in DNA sequencing technologies, it is expected that the memory footprint required for the assembly process will increase dramatically and will emerge as a limiting factor in processing widely available NGS-generated reads. In this report, we compare current memory-efficient techniques for genome assembly with respect to quality, memory consumption and execution time. Our experiments prove that it is possible to generate draft assemblies of reasonable quality on conventional multi-purpose computers with very limited available memory by choosing suitable assembly methods. Our study reveals the minimum memory requirements for different assembly programs even when data volume exceeds memory capacity by orders of magnitude. By combining existing methodologies, we propose two general assembly strategies that can improve short-read assembly approaches and result in reduction of the memory footprint. Finally, we discuss the possibility of utilizing cloud infrastructures for genome assembly and we comment on some findings regarding suitable computational resources for assembly.

SUBMITTER: Kleftogiannis D

PROVIDER: S-EPMC3785575 | biostudies-literature | 2013

REPOSITORIES: biostudies-literature

Publications

Comparing memory-efficient genome assemblers on stand-alone and cloud infrastructures.

Kleftogiannis D, Kalnis P, Bajic VB

PLoS One, 2013-09-27, issue 9

PMID: 24086547

Similar Datasets

Project description: Roche 454 pyrosequencing has become a method of choice for generating transcriptome data from non-model organisms. Once the tens to hundreds of thousands of short (250-450 base) reads have been produced, it is important to correctly assemble these to estimate the sequence of all the transcripts. Most transcriptome assembly projects use only one program for assembling 454 pyrosequencing reads, but there is no evidence that the programs used to date are optimal. We have carried out a systematic comparison of five assemblers (CAP3, MIRA, Newbler, SeqMan and CLC) to establish best practices for transcriptome assemblies, using a new dataset from the parasitic nematode Litomosoides sigmodontis. Although no single assembler performed best on all our criteria, Newbler 2.5 gave longer contigs, better alignments to some reference sequences, and was fast and easy to use. SeqMan assemblies performed best on the criterion of recapitulating known transcripts, and had more novel sequence than the other assemblers, but generated an excess of small, redundant contigs. The remaining assemblers all performed almost as well, with the exception of Newbler 2.3 (the version currently used by most assembly projects), which generated assemblies that had significantly lower total length. As different assemblers use different underlying algorithms to generate contigs, we also explored merging of assemblies and found that the merged datasets not only aligned better to reference sequences than individual assemblies, but were also more consistent in the number and size of contigs. Transcriptome assemblies are smaller than genome assemblies and thus should be more computationally tractable, but are often harder because individual contigs can have highly variable read coverage. Comparing single assemblers, Newbler 2.5 performed best on our trial data set, but other assemblers were closely comparable. Combining differently optimal assemblies from different programs however gave a more credible final product, and this strategy is recommended.

Project description: Introduction: Newly available, smartphone-enabled carbon monoxide (CO) monitors are lower in cost than traditional stand-alone monitors and represent a marked advancement for smoking research. New products are promising, but data are needed to compare breath CO readings between smartphone-enabled and stand-alone monitors. The purpose of this study was to (1) determine the agreement between the mobile iCO (Bedfont Scientific Ltd) with two other monitors from the same manufacturer (Micro+ pro and Micro+ basic) and (2) determine optimal, monitor-specific, cotinine-confirmed abstinence cutoff values. Methods: Adult (≥18) smokers (n = 26) and nonsmokers (n = 21) provided three breath CO samples (using three different monitors) in each of 10 sessions, and urine cotinine was measured for gold standard determination of abstinence. CO comparisons (N = 437) were analyzed using regression-based Bland-Altman Analysis of Agreement; receiver operating characteristics curves were used to determine optimal abstinence cutoffs. Results: Bland-Altman analyses indicated that the iCO monitor provided higher CO results than both Micro+ monitors. Sensitivity and specificity analyses showed that the optimal CO cutoff for determining abstinence was <3 ppm for the Micro+ pro (88% sensitivity, 93% specificity) and Micro+ basic (83% sensitivity, 98% specificity), but was higher for the iCO (<6 ppm; 73% sensitivity, 100% specificity). Conclusions: Relative to both Micro+ monitors, the smartphone-enabled iCO provided systematically higher CO values and required a higher cutoff to reliably determine smoking abstinence. This does not indicate that CO values obtained using the iCO are not valid; instead, these results suggest that monitor-specific abstinence cutoffs are needed to ensure accurate bioverification of smoking status. Implications: Results from this study indicate that CO values from the smartphone-enabled iCO should not be used interchangeably with the stand-alone Micro+ pro and Micro+ basic, particularly when lower CO values (<10 ppm) are critical (ie, determination of abstinence vs confirming smoking status for study inclusion). Optimal CO cutoffs recommended for determining abstinence on Micro+ and iCO monitors are at <3 and <6 ppm, respectively.

Project description: While the popular workflow manager Galaxy is currently made available through several publicly accessible servers, there are scenarios where users can be better served by full administrative control over a private Galaxy instance, including, but not limited to, concerns about data privacy, customisation needs, prioritisation of particular job types, tools development, and training activities. In such cases, a cloud-based Galaxy virtual instance represents an alternative that equips the user with complete control over the Galaxy instance itself without the burden of the hardware and software infrastructure involved in running and maintaining a Galaxy server. We present Laniakea, a complete software solution to set up a "Galaxy on-demand" platform as a service. Building on the INDIGO-DataCloud software stack, Laniakea can be deployed over common cloud architectures usually supported both by public and private e-infrastructures. The user interacts with a Laniakea-based service through a simple front-end that allows a general setup of a Galaxy instance, and then Laniakea takes care of the automatic deployment of the virtual hardware and the software components. At the end of the process, the user gains access with full administrative privileges to a private, production-grade, fully customisable, Galaxy virtual instance and to the underlying virtual machine (VM). Laniakea features deployment of single-server or cluster-backed Galaxy instances, sharing of reference data across multiple instances, data volume encryption, and support for VM image-based, Docker-based, and Ansible recipe-based Galaxy deployments. A Laniakea-based Galaxy on-demand service, named Laniakea@ReCaS, is currently hosted at the ELIXIR-IT ReCaS cloud facility. Laniakea offers to scientific e-infrastructures a complete and easy-to-use software solution to provide a Galaxy on-demand service to their users. Laniakea-based cloud services will help in making Galaxy more accessible to a broader user base by removing most of the burdens involved in deploying and running a Galaxy service. In turn, this will facilitate the adoption of Galaxy in scenarios where classic public instances do not represent an optimal solution. Finally, the implementation of Laniakea can be easily adapted and expanded to support different services and platforms beyond Galaxy.
