Dataset Information

Kernel based methods for accelerated failure time model with ultra-high dimensional data.

ABSTRACT:

Background

Most genomic data have ultra-high dimensions with more than 10,000 genes (probes). Regularization methods with L₁ and L(p) penalty have been extensively studied in survival analysis with high-dimensional genomic data. However, when the sample size n << m (the number of genes), directly identifying a small subset of genes from ultra-high (m > 10, 000) dimensional data is time-consuming and not computationally efficient. In current microarray analysis, what people really do is select a couple of thousands (or hundreds) of genes using univariate analysis or statistical tests, and then apply the LASSO-type penalty to further reduce the number of disease associated genes. This two-step procedure may introduce bias and inaccuracy and lead us to miss biologically important genes.

Results

The accelerated failure time (AFT) model is a linear regression model and a useful alternative to the Cox model for survival analysis. In this paper, we propose a nonlinear kernel based AFT model and an efficient variable selection method with adaptive kernel ridge regression. Our proposed variable selection method is based on the kernel matrix and dual problem with a much smaller n x n matrix. It is very efficient when the number of unknown variables (genes) is much larger than the number of samples. Moreover, the primal variables are explicitly updated and the sparsity in the solution is exploited.

Conclusions

Our proposed methods can simultaneously identify survival associated prognostic factors and predict survival outcomes with ultra-high dimensional genomic data. We have demonstrated the performance of our methods with both simulation and real data. The proposed method performs superbly with limited computational studies.

SUBMITTER: Liu Z

PROVIDER: S-EPMC3019227 | biostudies-literature | 2010 Dec

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Kernel based methods for accelerated failure time model with ultra-high dimensional data.

Liu Zhenqiu Z Chen Dechang D Tan Ming M Jiang Feng F Gartenhaus Ronald B RB

BMC bioinformatics 20101221

<h4>Background</h4>Most genomic data have ultra-high dimensions with more than 10,000 genes (probes). Regularization methods with L₁ and L(p) penalty have been extensively studied in survival analysis with high-dimensional genomic data. However, when the sample size n << m (the number of genes), directly identifying a small subset of genes from ultra-high (m > 10, 000) dimensional data is time-consuming and not computationally efficient. In current microarray analysis, what people really do is s ...[more]

PMID: 21176134

Similar Datasets

Project description:Background:Building prognostic models of clinical outcomes is an increasingly important research task and will remain a vital area in genomic medicine. Prognostic models of clinical outcomes are usually built and validated utilizing variable selection methods and machine learning tools. The challenges, however, in ultra-high dimensional space are not only to reduce the dimensionality of the data, but also to retain the important variables which predict the outcome. Screening approaches, such as the sure independence screening (SIS), iterative SIS (ISIS) and principled SIS (PSIS) have been developed to overcome the challenge of high dimensionality. We are interested in identifying important single-nucleotide polymorphisms (SNPs) and integrating them into a validated prognostic model of overall survival in patients with metastatic prostate cancer. While the abovementioned variable selection approaches have theoretical justification in selecting SNPs, the comparison and the performance of these combined methods in predicting time-to-event outcomes have not been previously studied in ultra-high dimensional space with hundreds of thousands of variables. Methods:We conducted a series of simulations to compare the performance of different combinations of variable selection approaches and classification trees, such as the least absolute shrinkage and selection operator (LASSO), adaptive least absolute shrinkage and selection operator (ALASSO) and random survival forest (RSF), in ultra-high dimensional setting data for the purpose of developing prognostic models for a time-to-event outcome that is subject to censoring. The variable selection methods were evaluated for discrimination (Harrell's concordance statistic), calibration and overall performance. In addition, we applied these approaches to 498,081 SNPs from 623 Caucasian patients with prostate cancer. Results:When n=300, ISIS-LASSO and ISIS-ALASSO chose all the informative variables which resulted in the highest Harrell's c-index (>0.80). On the other hand, with a small sample size (n=150), ALASSO performed better than any other combinations as demonstrated by the highest c-index and/or overall performance, although there was evidence of overfitting. In analyzing the prostate cancer data, ISIS-ALASSO, SIS-LASSO, and SIS-ALASSO combinations achieved the highest discrimination with c-index of 0.67. Conclusions:Choosing the appropriate variable selection method for training a model is a critical step in developing a robust prognostic model. Based on the simulation studies, the effective use of ALASSO or a combination of methods, such as ISIS-LASSO and ISIS-ALASSO, allows both for the development of prognostic models with high predictive accuracy and a low risk of overfitting assuming moderate sample sizes.

Dataset Information

Kernel based methods for accelerated failure time model with ultra-high dimensional data.

Background

Results

Conclusions

Publications

Kernel based methods for accelerated failure time model with ultra-high dimensional data.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets