Dataset Information

A data integration methodology for systems biology.

ABSTRACT: Different experimental technologies measure different aspects of a system and to differing depth and breadth. High-throughput assays have inherently high false-positive and false-negative rates. Moreover, each technology includes systematic biases of a different nature. These differences make network reconstruction from multiple data sets difficult and error-prone. Additionally, because of the rapid rate of progress in biotechnology, there is usually no curated exemplar data set from which one might estimate data integration parameters. To address these concerns, we have developed data integration methods that can handle multiple data sets differing in statistical power, type, size, and network coverage without requiring a curated training data set. Our methodology is general in purpose and may be applied to integrate data from any existing and future technologies. Here we outline our methods and then demonstrate their performance by applying them to simulated data sets. The results show that these methods select true-positive data elements much more accurately than classical approaches. In a companion paper, we demonstrate the applicability of our approach to biological data. We have integrated our methodology into a free open source software package named POINTILLIST.

SUBMITTER: Hwang D

PROVIDER: S-EPMC1297682 | biostudies-literature | 2005 Nov

REPOSITORIES: biostudies-literature

Publications

A data integration methodology for systems biology.

Daehee Hwang, Alistair G. Rust, Stephen Ramsey, Jennifer J. Smith, Deena M. Leslie, Andrea D. Weston, Pedro de Atauri, John D. Aitchison, Leroy Hood, Andrew F. Siegel, Hamid Bolouri

Proceedings of the National Academy of Sciences of the United States of America, 2005 Nov 21, issue 48

PMID: 16301537

Similar Datasets

Project description: Modern, high-throughput biological experiments generate copious, heterogeneous, interconnected data sets. Research is dynamic, with frequently changing protocols, techniques, instruments, and file formats. Because of these factors, systems designed to manage and integrate modern biological data sets often end up as large, unwieldy databases that become difficult to maintain or evolve. The novel rule-based approach of the Ultra-Structure design methodology presents a potential solution to this problem. By representing both data and processes as formal rules within a database, an Ultra-Structure system constitutes a flexible framework that enables users to explicitly store domain knowledge in both a machine- and human-readable form. End users themselves can change the system's capabilities without programmer intervention, simply by altering database contents; no computer code or schemas need be modified. This provides flexibility in adapting to change, and allows integration of disparate, heterogeneous data sets within a small core set of database tables, facilitating joint analysis and visualization without becoming unwieldy. Here, we examine the application of Ultra-Structure to our ongoing research program for the integration of large proteomic and genomic data sets (proteogenomic mapping). We transitioned our proteogenomic mapping information system from a traditional entity-relationship design to one based on Ultra-Structure. Our system integrates tandem mass spectrum data, genomic annotation sets, and spectrum/peptide mappings, all within a small, general framework implemented within a standard relational database system. General software procedures driven by user-modifiable rules can perform tasks such as logical deduction and location-based computations. The system is not tied specifically to proteogenomic research, but is rather designed to accommodate virtually any kind of biological research. We find Ultra-Structure offers substantial benefits for biological information systems, the largest being the integration of diverse information sources into a common framework. This facilitates systems biology research by integrating data from disparate high-throughput techniques. It also enables us to readily incorporate new data types, sources, and domain knowledge with no change to the database structure or associated computer code. Ultra-Structure may be a significant step towards solving the hard problem of data management and integration in the systems biology era.

Project description: Background: The Bioinformatics Resource Manager (BRM) is a web-based tool developed to facilitate identifier conversion and data integration for Homo sapiens (human), Mus musculus (mouse), Rattus norvegicus (rat), Danio rerio (zebrafish), and Macaca mulatta (macaque), as well as perform orthologous conversions among the supported species. In addition to providing a robust means of identifier conversion, BRM also incorporates a suite of microRNA (miRNA)-target databases upon which to query target genes or to perform reverse target lookups using gene identifiers. Results: BRM has the capability to perform cross-species identifier lookups across common identifier types, directly integrate datasets across platform or species by performing identifier retrievals in the background, and retrieve miRNA targets from multiple databases simultaneously and integrate the resulting gene targets with experimental mRNA data. Here we use workflows provided in BRM to integrate RNA sequencing data across species to identify common biomarkers of exposure after treatment of human lung cells and zebrafish with benzo[a]pyrene (BAP). We further use the miRNA Target workflow to experimentally determine the role of miRNAs as regulators of BAP toxicity and identify the predicted functional consequences of miRNA-target regulation in our system. The output from BRM can easily and directly be uploaded to freely available visualization tools for further analysis. From these examples, we were able to identify an important role for several miRNAs as potential regulators of BAP toxicity in human lung cells associated with cell migration, cell communication, cell junction assembly and regulation of cell death. Conclusions: Overall, BRM provides bioinformatics tools to assist biologists having minimal programming skills with analysis and integration of high-content omics data from various transcriptomic and proteomic platforms. BRM workflows were developed in Java and other open-source technologies and are served publicly using Apache Tomcat at https://cbb.pnnl.gov/brm/.
