Project description:We present elPrep 4, a reimplementation from scratch of the elPrep framework for processing sequence alignment/map (SAM) files, written in the Go programming language. elPrep 4 includes multiple new features that allow it to process all of the preparation steps defined by the GATK Best Practices pipelines for variant calling. This includes new and improved functionality for sorting, (optical) duplicate marking, base quality score recalibration, BED and VCF parsing, and various filtering options. The implementations of these options in elPrep 4 faithfully reproduce the outcomes of their counterparts in GATK 4, SAMtools, and Picard, even though the underlying algorithms are redesigned to take advantage of elPrep's parallel execution framework, which vastly improves runtime and resource use compared to these tools. Our benchmarks show that elPrep executes the preparation steps of the GATK Best Practices up to 13x faster on WES data and up to 7.4x faster on WGS data than the same pipeline run with GATK 4, while using fewer compute resources.
Project description:We present elPrep 5, which updates the elPrep framework for processing sequence alignment/map files with support for variant calling. elPrep 5 can now execute the full pipeline described by the GATK Best Practices for variant calling, which consists of PCR and optical duplicate marking, sorting by coordinate order, base quality score recalibration, and variant calling using the haplotype caller algorithm. elPrep 5 produces BAM and VCF output identical to that of GATK4 while significantly reducing runtime by parallelizing and merging the execution of the pipeline steps. Our benchmarks show that elPrep 5 speeds up the variant calling pipeline by a factor of 8-16x on both whole-exome and whole-genome data while using the same hardware resources as GATK4. This makes elPrep 5 a suitable drop-in replacement for GATK4 when faster execution times are needed.
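To make the merged, single-pass execution described above concrete, the sketch below shows in Python how several per-read preparation steps can be composed into a single traversal of the input instead of one pass per tool. All names here (Read, filter_unmapped, mark_duplicate) are hypothetical stand-ins; elPrep itself is written in Go and its actual filters are far more involved.

# Illustrative sketch (not elPrep's actual code): merging several preparation
# steps into one pass over the reads by composing per-read filters.
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional

@dataclass
class Read:
    name: str
    flag: int
    pos: int
    qualities: list

Filter = Callable[[Read], Optional[Read]]  # return None to drop the read

def filter_unmapped(read: Read) -> Optional[Read]:
    return None if read.flag & 0x4 else read  # 0x4 is the SAM unmapped flag

def mark_duplicate(read: Read) -> Optional[Read]:
    # Placeholder for duplicate detection; a real implementation keeps a table
    # of fragment positions and sets flag bit 0x400 on duplicate reads.
    return read

def compose(filters: list) -> Filter:
    def combined(read: Read) -> Optional[Read]:
        for f in filters:
            read = f(read)
            if read is None:
                return None
        return read
    return combined

def run_pipeline(reads: Iterable[Read], filters: list) -> Iterator[Read]:
    step = compose(filters)
    for read in reads:  # a single traversal instead of one pass per step
        out = step(read)
        if out is not None:
            yield out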
Project description:Land managers increasingly rely upon landscape assessments to understand the status of natural resources and identify conservation priorities. Many of these landscape planning efforts rely on geospatial models that characterize the ecological integrity of the landscape. These general models use measures of habitat disturbance and human activity to map indices of ecological integrity. We built upon these modeling frameworks by developing a Landscape Integrity Index (LII) model that uses geospatial datasets of the human footprint and incorporates other indicators of ecological integrity, such as biodiversity and vegetation departure. Our LII model serves as a general indicator of ecological integrity in a regional context of human activity, biodiversity, and change in habitat composition. We also discuss the application of the LII framework in two related coarse-filter landscape conservation approaches aimed at expanding the size and connectedness of protected areas as regional mitigation for anticipated land-use changes.
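As an illustration only, the following sketch shows how an integrity index of this general kind can be assembled as a weighted combination of normalized indicator rasters. The layer names, weights, and inversion choices are hypothetical and are not the ones used in the LII model described above.

# Hypothetical sketch of a landscape integrity index built from normalized
# indicator rasters; weights and layers are illustrative only.
import numpy as np

def normalize(layer: np.ndarray) -> np.ndarray:
    """Rescale an indicator raster to the 0-1 range."""
    lo, hi = np.nanmin(layer), np.nanmax(layer)
    return (layer - lo) / (hi - lo) if hi > lo else np.zeros_like(layer)

def landscape_integrity_index(human_footprint, biodiversity, veg_departure,
                              weights=(0.5, 0.3, 0.2)):
    # Human footprint and vegetation departure reduce integrity, so invert them.
    layers = [1.0 - normalize(human_footprint),
              normalize(biodiversity),
              1.0 - normalize(veg_departure)]
    w = np.asarray(weights) / np.sum(weights)
    return sum(wi * li for wi, li in zip(w, layers))

# Example on toy 3x3 rasters:
hf = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]], dtype=float)
bio = np.array([[8, 7, 6], [5, 4, 3], [2, 1, 0]], dtype=float)
vd = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]], dtype=float)
print(landscape_integrity_index(hf, bio, vd))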
Project description:Structured RNAs can be hard to search for, as they often are not well conserved in their primary structure and are local in their genomic or transcriptomic context. Thus, the need for tools that can make local structural alignments of RNAs in particular is only increasing. To meet the demand for both large-scale screens and hands-on analysis through web servers, we present a new multithreaded version of Foldalign. We substantially improve execution time while maintaining all previous functionalities, including carrying out local structural alignments of sequences with low similarity. Furthermore, the improvements allow longer RNAs and longer sequence lengths to be compared; for sequences in the range of 2000-6000 nucleotides, for example, execution time improves by up to a factor of five. The Foldalign software and the web server are available at http://rth.dk/resources/foldalign. Contact: gorodkin@rth.dk. Supplementary data are available at Bioinformatics online.
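Foldalign's own speed-up comes from multithreading inside the alignment computation; the sketch below instead shows the complementary, embarrassingly parallel part of a large-scale screen, dispatching independent pairwise comparisons to a worker pool. The function local_structural_alignment is a placeholder, not Foldalign's API; in practice each job would invoke the foldalign program on a sequence pair.

# Sketch of screen-level parallelism: each pairwise comparison is independent,
# so pairs can be farmed out to a process pool.
from concurrent.futures import ProcessPoolExecutor
from itertools import combinations

def local_structural_alignment(pair):
    (name_a, seq_a), (name_b, seq_b) = pair
    # Placeholder score: a real job would perform simultaneous folding and
    # alignment restricted to local regions of the two sequences.
    score = -abs(len(seq_a) - len(seq_b))
    return name_a, name_b, score

def screen_all_pairs(sequences, workers=8):
    pairs = list(combinations(sequences.items(), 2))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(local_structural_alignment, pairs))

if __name__ == "__main__":
    seqs = {"rna1": "GGGAAACCC", "rna2": "GGGCAAAGCCC", "rna3": "AUAUAU"}
    for a, b, s in screen_all_pairs(seqs, workers=2):
        print(a, b, s)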
Project description:Background: Striking a balance between the degree of model complexity and parameter identifiability, while still producing biologically feasible simulations, is a major challenge in computational biology. While these two elements of model development are closely coupled, parameter fitting from measured data and analysis of model mechanisms have traditionally been performed separately and sequentially. This process produces potential mismatches between model and data complexities that can compromise the ability of computational frameworks to reveal mechanistic insights or predict new behaviour. In this study we address this issue by presenting a generic framework for combined model parameterisation, comparison of model alternatives and analysis of model mechanisms. Results: The presented methodology is based on a combination of multivariate metamodelling (statistical approximation of the input-output relationships of deterministic models) and a systematic zooming into biologically feasible regions of the parameter space by iterative generation of new experimental designs and look-up of simulations in the proximity of the measured data. The parameter fitting pipeline includes an implicit sensitivity analysis and analysis of parameter identifiability, making it suitable for testing hypotheses for model reduction. Using this approach, under-constrained model parameters, as well as the coupling between parameters within the model, are identified. The methodology is demonstrated by refitting the parameters of a published model of cardiac cellular mechanics using a combination of measured data and synthetic data from an alternative model of the same system. Using this approach, reduced models with simplified expressions for the tropomyosin/crossbridge kinetics were found by identifying model components that can be omitted without affecting the fit to the parameterising data. Our analysis revealed that model parameters could be constrained to a standard deviation of, on average, 15% of the mean values over the succeeding parameter sets. Conclusions: Our results indicate that the presented approach is effective for comparing model alternatives and reducing models to the minimum complexity that replicates the measured data. We therefore believe that this approach has significant potential for reparameterising existing frameworks, for identifying redundant components of large biophysical models, and for increasing their predictive capacity.
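A minimal sketch of the iterative zooming idea, assuming a toy two-parameter model: sample the current parameter bounds, simulate, keep the parameter sets whose outputs lie closest to the measured data, and tighten the bounds around them. The full methodology additionally fits multivariate metamodels and designs new experiments at each iteration, which this sketch omits.

# Iterative "zooming" into the feasible region of a toy parameter space.
import numpy as np

rng = np.random.default_rng(0)

def simulate(params):
    a, b = params
    t = np.linspace(0, 1, 20)
    return a * np.exp(-b * t)  # toy deterministic model output

def zoom_fit(measured, bounds, n_samples=200, n_keep=20, n_iterations=5):
    for _ in range(n_iterations):
        samples = rng.uniform(bounds[:, 0], bounds[:, 1],
                              size=(n_samples, len(bounds)))
        outputs = np.array([simulate(p) for p in samples])
        distances = np.linalg.norm(outputs - measured, axis=1)
        best = samples[np.argsort(distances)[:n_keep]]
        # New, tighter bounds around the retained parameter sets.
        bounds = np.column_stack([best.min(axis=0), best.max(axis=0)])
    return best  # surviving parameter sets; their spread indicates identifiability

measured = simulate((2.0, 3.0)) + rng.normal(0, 0.01, 20)
initial_bounds = np.array([[0.1, 10.0], [0.1, 10.0]])
fitted = zoom_fit(measured, initial_bounds)
print(fitted.mean(axis=0), fitted.std(axis=0))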
Project description:The value of models that link organism-level impacts to the responses of a population in ecological risk assessments (ERAs) has been demonstrated extensively over the past few decades. There is little debate about the utility of these models for translating multiple organism-level endpoints into a holistic interpretation of effects on the population; however, there continues to be a struggle for actual application of these models as a common practice in ERA. Although general frameworks for developing models for ERA have been proposed, there is limited guidance on when models should be used, in what form, and how to interpret model output to inform the risk manager's decision. We propose a framework for developing and applying population models in regulatory decision making that focuses on trade-offs of generality, realism, and precision for both ERAs and models. We approach the framework development from the perspective of regulators, aiming to define the needs of specific models commensurate with the assessment objective. We explore why models are not widely used by comparing their requirements and limitations with the needs of regulators. Using a series of case studies under specific regulatory frameworks, we classify ERA objectives by trade-offs of generality, realism, and precision, and demonstrate how the output of population models developed with these same trade-offs informs the ERA objective. We examine attributes of both assessments and models that aid in the discussion of these trade-offs. The proposed framework will assist risk assessors and managers in identifying models of appropriate complexity and in understanding the utility and limitations of a model's output and associated uncertainty in the context of their assessment goals. Integr Environ Assess Manag 2018;14:369-380.
Project description:Cytoreductive surgery (CRS) followed by hyperthermic intra-operative peritoneal chemotherapy (HIPEC) is a relatively new treatment for selected patients with peritoneal metastases of colorectal origin (PMCR). Data from outside trials suggest that CRS and HIPEC improve survival compared with the current standard of care (chemotherapy).
The big challenge is to conduct trials in this setting, as the intervention is complex and there are wide variations in the process and in the recording of outcomes. If trials can confirm the findings from non-randomised studies, an estimated 1000 to 2000 patients may benefit from this intervention in the UK each year. The aim of this study is to develop a framework that can be used to undertake a randomised trial in patients with PMCR suitable for CRS with or without HIPEC.
The investigators will address this using the principles of the IDEAL (Idea, Development, Exploration, Assessment and Long-term study) framework. A pre-trial feasibility study will be performed between the two national peritoneal tumour treatment centres (Manchester and Basingstoke).
This study is designed to take place over the following four stages:
Stage 1. Comparing treatment data from 100 operations at each of the two centres to identify which key components of the intervention differ, and testing for differences in overall survival and recurrence-free survival.
Stage 2. Identifying sources of these differences by selecting up to 25 patients and investigating the variation in the way surgeons score key aspects of the procedure.
Stage 3. Developing a ‘trial manual’ with standardised definitions (to minimise any differences).
Stage 4. Testing how well the manual is followed in practice.
After this study is complete, the resulting trial manual can be used to design future randomised trials addressing the most suitable clinical question.
Project description:Background: The prediction of the structure of large RNAs remains a particular challenge in bioinformatics, due to the computational complexity and low levels of accuracy of state-of-the-art algorithms. The pfold model couples a stochastic context-free grammar to phylogenetic analysis for high accuracy in predictions, but the time complexity of the algorithm and underflow errors have prevented its use for long alignments. Here we present PPfold, a multithreaded version of pfold, which is capable of predicting the structure of large RNA alignments accurately on practical timescales. Results: We have distributed both the phylogenetic calculations and the inside-outside algorithm in PPfold, resulting in a significant reduction of runtime on multicore machines. We have addressed the floating-point underflow problems of pfold by implementing an extended-exponent datatype, enabling PPfold to be used for large-scale RNA structure predictions. We have also improved the user interface and portability: alongside the standalone executable and Java source code of the program, PPfold is also available as a free plugin to the CLC Workbenches. We have evaluated the accuracy of PPfold using BRaliBase I tests, and demonstrated its practical use by predicting the secondary structure of an alignment of 24 complete HIV-1 genomes in 65 minutes on an 8-core machine, identifying several known structural elements in the prediction. Conclusions: PPfold is the first parallelized comparative RNA structure prediction algorithm to date. Based on the pfold model, PPfold is capable of fast, high-quality predictions of large RNA secondary structures, such as the genomes of RNA viruses or long genomic transcripts. The techniques used in the parallelization of this algorithm may be of general applicability to other bioinformatics algorithms.
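The extended-exponent idea mentioned above can be illustrated in a few lines: the value is stored as a double-precision fraction plus a separate integer exponent, and the fraction is rescaled whenever it approaches the underflow threshold, so that products of many small probabilities keep their magnitude. This is a generic sketch of the technique, not PPfold's Java implementation.

# Extended-exponent number: value = fraction * 2**(SCALE * exp).
import math

SCALE = 256                  # rescale in chunks of 2**256
LOWER = 2.0 ** -SCALE

class ExtendedFloat:
    def __init__(self, fraction, exp=0):
        self.fraction = fraction
        self.exp = exp
        self._normalize()

    def _normalize(self):
        # Keep the fraction away from the double-precision underflow range.
        while 0.0 < abs(self.fraction) < LOWER:
            self.fraction *= 2.0 ** SCALE
            self.exp -= 1

    def __mul__(self, other):
        return ExtendedFloat(self.fraction * other.fraction, self.exp + other.exp)

    def __add__(self, other):
        # Align the smaller operand to the larger exponent before adding.
        if self.exp < other.exp:
            return other + self
        shifted = other.fraction * (2.0 ** (SCALE * (other.exp - self.exp)))
        return ExtendedFloat(self.fraction + shifted, self.exp)

    def log(self):
        return math.log(abs(self.fraction)) + self.exp * SCALE * math.log(2.0)

# Multiplying 10 000 probabilities of 1e-30 underflows an ordinary double,
# but the extended exponent keeps track of the magnitude:
p = ExtendedFloat(1.0)
for _ in range(10_000):
    p = p * ExtendedFloat(1e-30)
print(p.fraction, p.exp, p.log())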
Project description:Sequential region labelling, also known as connected components labelling, is a standard image segmentation problem that joins contiguous foreground pixels into blobs. Despite its long development history and widespread use across diverse domains such as bone biology, materials science and geology, connected components labelling can still form a bottleneck in image processing pipelines. Here, I describe a multithreaded implementation of classical two-pass sequential region labelling and introduce an efficient collision resolution step, 'bucket fountain'. Code was validated on test images and against commercial software (Avizo). It was performance tested on images from 2 MB (161 particles) to 6.5 GB (437 508 particles) to determine whether theoretical linear scaling (O(n)) had been achieved, and on 1-40 CPU threads to measure speed improvements due to multithreading. The new implementation achieves linear scaling (b = 0.905-1.052, time ∝ pixels^b; R² = 0.985-0.996), which improves with increasing thread number up to 8-16 threads, suggesting that it is memory bandwidth limited. This new implementation of sequential region labelling reduces the time required from hours to a few tens of seconds for images of several GB, and is limited only by hardware scale. It is available open source and free of charge in BoneJ.
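For reference, the classical two-pass algorithm that this work parallelizes can be sketched as follows, using union-find to resolve label collisions. This is not the multithreaded BoneJ code and does not include the 'bucket fountain' step, but it shows the structure of the provisional-labelling and relabelling passes.

# Classical two-pass connected components labelling (4-connectivity) on a 2D
# binary image, with union-find for label equivalences.
import numpy as np

def find(parent, x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving
        x = parent[x]
    return x

def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)

def label(image: np.ndarray) -> np.ndarray:
    labels = np.zeros(image.shape, dtype=np.int32)
    parent = [0]          # index 0 is background and unused
    next_label = 1
    h, w = image.shape
    # First pass: assign provisional labels and record collisions.
    for y in range(h):
        for x in range(w):
            if not image[y, x]:
                continue
            left = labels[y, x - 1] if x > 0 else 0
            up = labels[y - 1, x] if y > 0 else 0
            if left == 0 and up == 0:
                labels[y, x] = next_label
                parent.append(next_label)
                next_label += 1
            else:
                labels[y, x] = min(v for v in (left, up) if v > 0)
                if left and up:
                    union(parent, left, up)
    # Second pass: replace provisional labels with their equivalence-class roots.
    for y in range(h):
        for x in range(w):
            if labels[y, x]:
                labels[y, x] = find(parent, labels[y, x])
    return labels

img = np.array([[1, 1, 0, 1],
                [0, 1, 0, 1],
                [1, 0, 0, 1]], dtype=bool)
print(label(img))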
Project description:Motivation: Genome scale metabolic models (GSMMs) are increasingly important for systems biology and metabolic engineering research as they are capable of simulating complex steady-state behaviour. Constraint-based models of this form can include thousands of reactions and metabolites, with many crucial pathways that only become activated in specific simulation settings. However, despite their widespread use, power, and the availability of tools to aid with the construction and analysis of large scale models, little methodology is suggested for their continued management. For example, when genome annotations are updated or new understanding regarding behaviour is discovered, models often need to be altered to reflect this. This is quickly becoming an issue for industrial systems and synthetic biotechnology applications, which require good quality reusable models integral to the design, build, test and learn cycle. Results: As part of an ongoing effort to improve genome scale metabolic analysis, we have developed a test-driven development methodology for the continuous integration of validation data from different sources. Contributing to the open source technology based around COBRApy, we have developed the gsmodutils modelling framework, placing an emphasis on test-driven design of models through defined test cases. Crucially, different conditions are configurable, allowing users to examine how different designs or curation impact a wide range of system behaviours, minimizing error between model versions. Availability and implementation: The software framework described within this paper is open source and freely available from http://github.com/SBRCNottingham/gsmodutils. Supplementary information: Supplementary data are available at Bioinformatics online.
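The test-driven idea can be illustrated with a plain COBRApy/pytest sketch; this is not the gsmodutils API, and the model file, reaction identifier, and expected growth threshold are placeholders. Each test encodes a condition and an expected model behaviour, so that model edits which break that behaviour are caught automatically when the test suite runs.

# Sketch of test cases for a genome scale metabolic model using COBRApy + pytest.
import cobra
import pytest

@pytest.fixture
def model():
    return cobra.io.read_sbml_model("e_coli_core.xml")  # placeholder model file

def test_growth_on_glucose(model):
    # Condition: glucose uptake allowed; the expected rate is a placeholder value.
    model.reactions.get_by_id("EX_glc__D_e").lower_bound = -10.0
    solution = model.optimize()
    assert solution.status == "optimal"
    assert solution.objective_value > 0.1

def test_no_growth_without_uptake(model):
    # Condition: closing all exchange reactions should abolish growth.
    for rxn in model.exchanges:
        rxn.lower_bound = 0.0
    assert model.slim_optimize(error_value=0.0) == pytest.approx(0.0, abs=1e-6)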