Dataset Information

A linear programming approach for estimating the structure of a sparse linear genetic network from transcript profiling data.

ABSTRACT:

BACKGROUND: A genetic network can be represented as a directed graph in which a node corresponds to a gene and a directed edge specifies the direction of influence of one gene on another. The reconstruction of such networks from transcript profiling data remains an important yet challenging endeavor. A transcript profile specifies the abundances of many genes in a biological sample of interest. Prevailing strategies for learning the structure of a genetic network from high-dimensional transcript profiling data assume sparsity and linearity. Many methods consider relatively small directed graphs, inferring graphs with up to a few hundred nodes. This work examines large undirected graph representations of genetic networks, that is, graphs with many thousands of nodes in which an undirected edge between two nodes does not indicate the direction of influence, and the problem of estimating the structure of such a sparse linear genetic network (SLGN) from transcript profiling data.

RESULTS: The structure learning task is cast as a sparse linear regression problem, which is then posed as a LASSO (l1-constrained fitting) problem and finally solved by formulating a Linear Program (LP). A bound on the Generalization Error of this approach is given in terms of the Leave-One-Out Error. The accuracy and utility of LP-SLGNs are assessed quantitatively and qualitatively using simulated and real data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative provides gold standard data sets and evaluation metrics that enable and facilitate the comparison of algorithms for deducing the structure of networks. The structures of LP-SLGNs estimated from the INSILICO1, INSILICO2 and INSILICO3 simulated DREAM2 data sets are comparable to those proposed by the first and/or second ranked teams in the DREAM2 competition. The structures of LP-SLGNs estimated from two published Saccharomyces cerevisiae cell cycle transcript profiling data sets capture known regulatory associations. In each S. cerevisiae LP-SLGN, the number of nodes with a particular degree follows an approximate power law, suggesting that its degree distribution is similar to that observed in real-world networks. Inspection of these LP-SLGNs suggests biological hypotheses amenable to experimental verification.

CONCLUSION: A statistically robust and computationally efficient LP-based method for estimating the topology of a large sparse undirected graph from high-dimensional data yields representations of genetic networks that are biologically plausible and useful abstractions of the structures of real genetic networks. Analysis of the statistical and topological properties of learned LP-SLGNs may have practical value; for example, genes with high random walk betweenness, a measure of the centrality of a node in a graph, are good candidates for intervention studies and hence for integrated computational-experimental investigations designed to infer more realistic and sophisticated probabilistic directed graphical model representations of genetic networks. The LP-based solutions of the sparse linear regression problem described here may provide a method for learning the structure of transcription factor networks from transcript profiling and transcription factor binding motif data.
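The step from sparse regression to a Linear Program, which the abstract only names, can be sketched as follows. This is one standard reformulation and not necessarily the exact program used in the publication. It assumes an absolute-error loss (with a squared-error loss the l1-penalized problem would be a quadratic program instead). All symbols are illustrative assumptions: y_n is the abundance of the target gene in sample n, x_n is the vector of abundances of the other genes in that sample, C > 0 is a trade-off constant, and the weight vector w is split into nonnegative parts u and v.

```latex
% Sparse regression of one gene on all others, written as an LP.
% Assumed notation (not taken from the source): w = u - v, slack \xi_n >= |residual_n|.
\begin{aligned}
\min_{u,\,v,\,\xi}\quad & \sum_{j} (u_j + v_j) \;+\; C \sum_{n=1}^{N} \xi_n \\
\text{subject to}\quad  & y_n - x_n^{\top}(u - v) \le \xi_n, \qquad n = 1,\dots,N, \\
                        & x_n^{\top}(u - v) - y_n \le \xi_n, \qquad n = 1,\dots,N, \\
                        & u \ge 0, \quad v \ge 0, \quad \xi \ge 0 .
\end{aligned}
```

At an optimum, u_j and v_j are never both positive, so the first sum equals the l1 norm of w = u - v, and each slack equals the absolute residual for its sample. Solving one such program per gene and placing an undirected edge between the target gene and every gene j with a nonzero w_j would give a sparse undirected graph of the kind described above.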

SUBMITTER: Bhadra S

PROVIDER: S-EPMC2654898 | biostudies-other | 2009

REPOSITORIES: biostudies-other

Publications

A linear programming approach for estimating the structure of a sparse linear genetic network from transcript profiling data.

Sahely Bhadra, Chiranjib Bhattacharyya, Nagasuma R. Chandra, I. Saira Mian

Algorithms for Molecular Biology: AMB, 2009-02-24

PMID: 19239685

Similar Datasets

Project description:Cross-referencing experimental data with our current knowledge of signaling network topologies is one central goal of mathematical modeling of cellular signal transduction networks. We present a new methodology for data-driven interrogation and training of signaling networks. While most published methods for signaling network inference operate on Bayesian, Boolean, or ODE models, our approach uses integer linear programming (ILP) on interaction graphs to encode constraints on the qualitative behavior of the nodes. These constraints are posed by the network topology and their formulation as ILP allows us to predict the possible qualitative changes (up, down, no effect) of the activation levels of the nodes for a given stimulus. We provide four basic operations to detect and remove inconsistencies between measurements and predicted behavior: (i) find a topology-consistent explanation for responses of signaling nodes measured in a stimulus-response experiment (if none exists, find the closest explanation); (ii) determine a minimal set of nodes that need to be corrected to make an inconsistent scenario consistent; (iii) determine the optimal subgraph of the given network topology which can best reflect measurements from a set of experimental scenarios; (iv) find possibly missing edges that would improve the consistency of the graph with respect to a set of experimental scenarios the most. We demonstrate the applicability of the proposed approach by interrogating a manually curated interaction graph model of EGFR/ErbB signaling against a library of high-throughput phosphoproteomic data measured in primary hepatocytes. Our methods detect interactions that are likely to be inactive in hepatocytes and provide suggestions for new interactions that, if included, would significantly improve the goodness of fit. Our framework is highly flexible and the underlying model requires only easily accessible biological knowledge. 
All related algorithms are implemented in SigNetTrainer, a freely available toolbox, which makes the approach appealing for a variety of applications.

Project description:Horizontal behavior of highly migratory marine species is difficult to decipher because animals are wide-ranging, spend minimal time at the ocean surface, and utilize remote habitats. Satellite telemetry enables researchers to track individual movements, but population level inferences are rare due to data limitations that result from difficulty of capture and sporadic tag reporting. We introduce a Bayesian modeling framework to address population level questions with satellite telemetry data when data are sparse. We also outline an approach for identifying informative variables for use within the model. We tested our modeling approach using a large telemetry dataset for Shortfin Makos (Isurus oxyrinchus), which allowed us to assess the effects of various degrees of data paucity. First, a permuted Random Forest analysis is implemented to determine which variables are most informative. Next, a generalized additive mixed model is used to help define the relationship of each remaining variable with the response variable. Using jags and rjags for the analysis of Bayesian hierarchical models using Markov Chain Monte Carlo simulation, we then developed a movement model to generate parameter estimates for each of the variables of interest. By randomly reducing the tagging dataset by 25, 50, 75, and 90 percent and recalculating the parameter estimates, we demonstrate that the proposed Bayesian approach can be applied in data-limited situations. We also demonstrate how two commonly used linear mixed models with maximum likelihood estimation (MLE) can be similarly applied. Additionally, we simulate data from known parameter values to test each model's ability to recapture those values. 
Although the two approaches perform similarly, we advocate the Bayesian over the MLE approach because later studies can easily use the results of past studies to inform working models, and because prior knowledge can be incorporated via informed priors in systems where such information is available.

Project description:In modern scientific research, data are often collected from multiple modalities. Since different modalities could provide complementary information, statistical prediction methods using multi-modality data could deliver better prediction performance than using single modality data. However, one special challenge for using multi-modality data is related to block-missing data. In practice, due to dropouts or the high cost of measures, the observations of a certain modality can be missing completely for some subjects. In this paper, we propose a new DIrect Sparse regression procedure using COvariance from Multi-modality data (DISCOM). Our proposed DISCOM method includes two steps to find the optimal linear prediction of a continuous response variable using block-missing multi-modality predictors. In the first step, rather than deleting or imputing missing data, we make use of all available information to estimate the covariance matrix of the predictors and the cross-covariance vector between the predictors and the response variable. The proposed new estimate of the covariance matrix is a linear combination of the identity matrix, the estimates of the intra-modality covariance matrix and the cross-modality covariance matrix. Flexible estimates for both the sub-Gaussian and heavy-tailed cases are considered. In the second step, based on the estimated covariance matrix and the estimated cross-covariance vector, an extended Lasso-type estimator is used to deliver a sparse estimate of the coefficients in the optimal linear prediction. The number of samples that are effectively used by DISCOM is the minimum number of samples with available observations from two modalities, which can be much larger than the number of samples with complete observations from all modalities. The effectiveness of the proposed method is demonstrated by theoretical studies, simulated examples, and a real application from the Alzheimer's Disease Neuroimaging Initiative. 
The comparison between DISCOM and some existing methods also indicates the advantages of our proposed method.
