Dataset Information

VARIABLE SELECTION AND REGRESSION ANALYSIS FOR GRAPH-STRUCTURED COVARIATES WITH AN APPLICATION TO GENOMICS.

ABSTRACT: Graphs and networks are common ways of depicting information. In biology, many different biological processes are represented by graphs, such as regulatory networks, metabolic pathways and protein-protein interaction networks. This kind of a priori use of graphs is a useful supplement to the standard numerical data such as microarray gene expression data. In this paper, we consider the problem of regression analysis and variable selection when the covariates are linked on a graph. We study a graph-constrained regularization procedure and its theoretical properties for regression analysis to take into account the neighborhood information of the variables measured on a graph, where a smoothness penalty on the coefficients is defined as a quadratic form of the Laplacian matrix associated with the graph. We establish estimation and model selection consistency results and provide estimation bounds for both fixed and diverging numbers of parameters in regression models. We demonstrate by simulations and a real dataset that the proposed procedure can lead to better variable selection and prediction than existing methods that ignore the graph information associated with the covariates.
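As a minimal sketch of the smoothness penalty the abstract describes, the Python below builds a graph Laplacian and evaluates the quadratic form beta' L beta. It assumes the unnormalized Laplacian L = D - A on an unweighted, undirected graph; the abstract does not say which Laplacian variant or edge weights the paper uses, and the function names here are hypothetical.

```python
def graph_laplacian(n_nodes, edges):
    """Build the unnormalized Laplacian L = D - A of an undirected graph.

    Assumption: unweighted edges given as (u, v) index pairs, no duplicates.
    """
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        lap[u][u] += 1.0   # degree of u
        lap[v][v] += 1.0   # degree of v
        lap[u][v] -= 1.0   # adjacency entries enter with a minus sign
        lap[v][u] -= 1.0
    return lap


def laplacian_penalty(beta, lap):
    """Quadratic form beta' L beta."""
    n = len(beta)
    return sum(beta[i] * lap[i][j] * beta[j] for i in range(n) for j in range(n))


def edge_sum_penalty(beta, edges):
    """Equivalent edge-wise form: sum over edges of (beta_u - beta_v)^2."""
    return sum((beta[u] - beta[v]) ** 2 for u, v in edges)


# Toy path graph 0 - 1 - 2 (illustrative only, not data from the paper).
edges = [(0, 1), (1, 2)]
lap = graph_laplacian(3, edges)

smooth = [1.0, 1.0, 1.0]    # neighbours share the same coefficient
rough = [1.0, -1.0, 1.0]    # neighbours have opposite signs

print(laplacian_penalty(smooth, lap))  # 0.0
print(laplacian_penalty(rough, lap))   # 8.0
```

Coefficients that agree across linked variables incur no penalty, while coefficients that differ across an edge are penalized by their squared difference, which is why the quadratic form acts as a smoothness constraint over the graph.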

SUBMITTER: Li C

PROVIDER: S-EPMC3423227 | biostudies-literature | 2010 Sep

REPOSITORIES: biostudies-literature


Publications

VARIABLE SELECTION AND REGRESSION ANALYSIS FOR GRAPH-STRUCTURED COVARIATES WITH AN APPLICATION TO GENOMICS.

Li Caiyan, Li Hongzhe

The Annals of Applied Statistics, 2010-09-01, issue 3


PMID: 22916087

Similar Datasets

Project description:

Background: Evaluation of gene interaction models in cancer genomics is challenging, as the true distribution is uncertain. Previous analyses have benchmarked models using synthetic data or databases of experimentally verified interactions - approaches which are susceptible to misrepresentation and incompleteness, respectively. The objectives of this analysis are to (1) provide a real-world data-driven approach for comparing performance of genomic model inference algorithms, (2) compare the performance of LASSO, elastic net, best-subset selection, L0L1 penalisation and L0L2 penalisation in real genomic data and (3) compare algorithmic preselection according to performance in our benchmark datasets to algorithmic selection by internal cross-validation.

Methods: Five large (n4000) genomic datasets were extracted from Gene Expression Omnibus. 'Gold-standard' regression models were trained on subspaces of these datasets (n4000, p = 500). Penalised regression models were trained on small samples from these subspaces (n ∈ {25, 75, 150}, p = 500) and validated against the gold-standard models. Variable selection performance and out-of-sample prediction were assessed. Penalty 'preselection' according to test performance in the other 4 datasets was compared to selection by internal cross-validation error minimisation.

Results: L1L2-penalisation achieved the highest cosine similarity between estimated coefficients and those of gold-standard models. L0L2-penalised models explained the greatest proportion of variance in test responses, though performance was unreliable in low signal:noise conditions. L0L2 also attained the highest overall median variable selection F1 score. Penalty preselection significantly outperformed selection by internal cross-validation in each of 3 examined metrics.

Conclusions: This analysis explores a novel approach for comparisons of model selection approaches in real genomic data from 5 cancers. Our benchmarking datasets have been made publicly available for use in future research. Our findings support the use of L0L2 penalisation for structural selection and L1L2 penalisation for coefficient recovery in genomic data. Evaluation of learning algorithms according to observed test performance in external genomic datasets yields valuable insights into actual test performance, providing a data-driven complement to internal cross-validation in genomic regression tasks.
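The two coefficient-level metrics named in the project description, cosine similarity and variable selection F1, can be sketched as below. This is an illustration under stated assumptions, not the authors' code: a variable is treated as "selected" when its coefficient exceeds a hypothetical tolerance `tol` in absolute value, and the function names and toy coefficient vectors are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two coefficient vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_a * norm_b)


def selection_f1(estimated, reference, tol=0.0):
    """F1 score of the estimated support against the reference support.

    Assumption: a variable counts as selected when |coefficient| > tol.
    """
    est = {i for i, c in enumerate(estimated) if abs(c) > tol}
    ref = {i for i, c in enumerate(reference) if abs(c) > tol}
    true_pos = len(est & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(est)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy vectors (illustrative only, not values from the study).
reference = [1.0, 0.0, 2.0, 0.0]   # stands in for gold-standard coefficients
estimated = [0.5, 0.3, 0.0, 0.0]   # stands in for a small-sample penalised fit

print(selection_f1(estimated, reference))  # 0.5
print(cosine_similarity(estimated, reference))
```

In the toy case the estimate selects variables 0 and 1 while the reference selects 0 and 2, so precision and recall are both one half and the F1 score is 0.5.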


