Project description:As the scale of biological data generation has increased, the bottleneck of research has shifted from data generation to analysis. Researchers commonly need to build computational workflows that include multiple analytic tools and require incremental development as experimental insights demand tool and parameter modifications. These workflows can produce hundreds to thousands of intermediate files and results that must be integrated for biological insight. Data-centric workflow systems that internally manage computational resources, software, and conditional execution of analysis steps are reshaping the landscape of biological data analysis and empowering researchers to conduct reproducible analyses at scale. Adoption of these tools can facilitate and expedite robust data analysis, but knowledge of these techniques is still lacking. Here, we provide a series of strategies for leveraging workflow systems with structured project, data, and resource management to streamline large-scale biological analysis. We present these practices in the context of high-throughput sequencing data analysis, but the principles are broadly applicable to biologists working beyond this field.
Project description:Different experimental technologies measure different aspects of a system and to differing depth and breadth. High-throughput assays have inherently high false-positive and false-negative rates. Moreover, each technology includes systematic biases of a different nature. These differences make network reconstruction from multiple data sets difficult and error-prone. Additionally, because of the rapid rate of progress in biotechnology, there is usually no curated exemplar data set from which one might estimate data integration parameters. To address these concerns, we have developed data integration methods that can handle multiple data sets differing in statistical power, type, size, and network coverage without requiring a curated training data set. Our methodology is general in purpose and may be applied to integrate data from any existing and future technologies. Here we outline our methods and then demonstrate their performance by applying them to simulated data sets. The results show that these methods select true-positive data elements much more accurately than classical approaches. In a companion paper, we demonstrate the applicability of our approach to biological data. We have integrated our methodology into a free, open-source software package named POINTILLIST.
Project description:Increasingly sophisticated experiments, coupled with large-scale computational models, have the potential to systematically test biological hypotheses to drive our understanding of multicellular systems. In this short review, we explore key challenges that must be overcome to achieve robust, repeatable data-driven multicellular systems biology. If these challenges can be solved, we can grow beyond the current state of isolated tools and datasets to a community-driven ecosystem of interoperable data, software utilities, and computational modeling platforms. Progress is within our grasp, but it will take community (and financial) commitment.
Project description:Cognitive computing is revolutionizing the way big data are processed and integrated, with artificial intelligence (AI) natural language processing (NLP) platforms helping researchers to efficiently search and digest the vast scientific literature. Most available platforms have been developed for biomedical researchers, but new NLP tools are emerging for biologists in other fields; an important example is metabolomics. NLP provides literature-based contextualization of metabolic features that decreases the time and expert-level subject knowledge required during the prioritization, identification and interpretation steps in the metabolomics data analysis pipeline. Here, we describe and demonstrate four workflows that combine metabolomics data with NLP-based literature searches of scientific databases to aid in the analysis of metabolomics data and their biological interpretation. The four procedures can be used in isolation or consecutively, depending on the research questions. The first, used for initial metabolite annotation and prioritization, creates a list of metabolites that would be interesting for follow-up. The second workflow finds literature evidence of the activity of metabolites and metabolic pathways in governing the biological condition on a systems biology level. The third is used to identify candidate biomarkers, and the fourth looks for metabolic conditions or drug-repurposing targets that two diseases have in common. The protocol can take 1-4 h or more to complete, depending on the processing time of the software used.
Project description:Background: Modern biomedical research is often organized in collaborations involving labs worldwide. Particularly in systems biology, the complex molecular systems under analysis require the generation and interpretation of heterogeneous data for their explanation, ranging, for example, from gene expression studies and mass spectrometry measurements to experimental techniques for detecting molecular interactions and functional assays. XML has become the most prominent format for representing and exchanging these data. However, besides the development of standards there is still a fundamental lack of data integration systems that are able to utilize these exchange formats, organize the data in an integrative way and link it with applications for data interpretation and analysis. Results: We have developed DIPSBC, an interactive data integration platform supporting collaborative research projects, based on Foswiki, Solr/Lucene, and specific helper applications. We describe the main features of the implementation and highlight the performance of the system with several use cases. All components of the system are platform independent and open-source developments and thus can be easily adopted by researchers. An exemplary installation of the platform which also provides several helper applications and detailed instructions for system usage and setup is available at http://dipsbc.molgen.mpg.de. Conclusions: DIPSBC is a data integration platform for medium-scale collaboration projects that has been tested already within several research collaborations. Because of its modular design and the incorporation of XML data formats it is highly flexible and easy to use.
Project description:Systems biology approaches have been applied over the last two decades to study plant sulphur metabolism. These 'sulphur-omics' approaches have been developed in parallel with the advancing field of systems biology, which is characterized by continual improvements of high-throughput methods to obtain system-wide data. The aim is to obtain a holistic view of sulphur metabolism and to generate models that allow predictions of metabolic and physiological responses. Besides known sulphur-responsive genes derived from previous studies, numerous genes have been identified in transcriptomics studies. This has not only increased our knowledge of sulphur metabolism but has also revealed links between metabolic processes, thus indicating a previously unexpected complex interconnectivity. The identification of response and control networks has been supported through metabolomics and proteomics studies. Due to the complex interlacing nature of biological processes, experimental validation using targeted or systems approaches is ongoing. There is still room for improvement in integrating the findings from studies of metabolomes, proteomes, and metabolic fluxes into a single unifying concept and in generating consistent models. We therefore suggest a joint effort of the sulphur research community to standardize data acquisition. Furthermore, focusing on a few different model plant systems would help overcome the problem of fragmented data, and would allow us to provide a standard data set against which future experiments can be designed and compared.
Project description:The integration of data from multiple global assays is essential to understanding dynamic spatiotemporal interactions within cells. In a companion paper, we reported a data integration methodology, designated Pointillist, that can handle multiple data types from technologies with different noise characteristics. Here we demonstrate its application to the integration of 18 data sets relating to galactose utilization in yeast. These data include global changes in mRNA and protein abundance, genome-wide protein-DNA interaction data, database information, and computational predictions of protein-DNA and protein-protein interactions. We divided the integration task to determine three network components: key system elements (genes and proteins), protein-protein interactions, and protein-DNA interactions. Results indicate that the reconstructed network efficiently focuses on and recapitulates the known biology of galactose utilization. It also provided new insights, some of which were verified experimentally. The methodology described here addresses a critical need across all domains of molecular and cell biology: to effectively integrate large and disparate data sets.
Project description:Background: Many ontologies have been developed in biology and these ontologies increasingly contain large volumes of formalized knowledge commonly expressed in the Web Ontology Language (OWL). Computational access to the knowledge contained within these ontologies relies on the use of automated reasoning. Results: We have developed the Aber-OWL infrastructure that provides reasoning services for bio-ontologies. Aber-OWL consists of an ontology repository, a set of web services and web interfaces that enable ontology-based semantic access to biological data and literature. Aber-OWL is freely available at http://aber-owl.net. Conclusions: Aber-OWL provides a framework for automatically accessing information that is annotated with ontologies or contains terms used to label classes in ontologies. When using Aber-OWL, access to ontologies and data annotated with them is not merely based on class names or identifiers but rather on the knowledge the ontologies contain and the inferences that can be drawn from it.