Dataset Information

Unified methods for feature selection in large-scale genomic studies with censored survival outcomes.

ABSTRACT:

Motivation

One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous datasets where the number of genomic features is far greater than the number of subjects. Methods based on univariate Cox regression are often used to select genomic features related to survival outcome; however, the Cox model assumes proportional hazards (PH), which is unlikely to hold for each feature. When applied to genomic features exhibiting some form of non-proportional hazards (NPH), these methods could lead to an under- or over-estimation of the effects. We propose a broad array of marginal screening techniques that aid in feature ranking and selection by accommodating various forms of NPH. First, we develop an approach based on Kullback-Leibler information divergence and the Yang-Prentice model that includes methods for the PH and proportional odds (PO) models as special cases. Next, we propose R2 measures for the PH and PO models that can be interpreted in terms of explained randomness. Lastly, we propose a generalized pseudo-R2 index that includes PH, PO, crossing hazards and crossing odds models as special cases and can be interpreted as the percentage of separability between subjects experiencing the event and not experiencing the event according to feature measurements.

Results

We evaluate the performance of our measures using extensive simulation studies and publicly available datasets in cancer genomics. We demonstrate that the proposed methods successfully address the issue of NPH in genomic feature selection and outperform existing methods.

Availability and implementation

R code for the proposed methods is available at github.com/lburns27/Feature-Selection.

Contact

karthik.devarajan@fccc.edu.

Supplementary information

Supplementary data are available at Bioinformatics online.

SUBMITTER: Spirko-Burns L

PROVIDER: S-EPMC7267818 | biostudies-literature | 2020 Jun

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Unified methods for feature selection in large-scale genomic studies with censored survival outcomes.

Spirko-Burns Lauren L Devarajan Karthik K

Bioinformatics (Oxford, England) 20200601 11

<h4>Motivation</h4>One of the major goals in large-scale genomic studies is to identify genes with a prognostic impact on time-to-event outcomes which provide insight into the disease process. With rapid developments in high-throughput genomic technologies in the past two decades, the scientific community is able to monitor the expression levels of tens of thousands of genes and proteins resulting in enormous datasets where the number of genomic features is far greater than the number of subject ...[more]

PMID: 32154833

Similar Datasets

Project description:Multi-omics data, that is, datasets containing different types of high-dimensional molecular variables, are increasingly often generated for the investigation of various diseases. Nevertheless, questions remain regarding the usefulness of multi-omics data for the prediction of disease outcomes such as survival time. It is also unclear which methods are most appropriate to derive such prediction models. We aim to give some answers to these questions through a large-scale benchmark study using real data. Different prediction methods from machine learning and statistics were applied on 18 multi-omics cancer datasets (35 to 1000 observations, up to 100 000 variables) from the database 'The Cancer Genome Atlas' (TCGA). The considered outcome was the (censored) survival time. Eleven methods based on boosting, penalized regression and random forest were compared, comprising both methods that do and that do not take the group structure of the omics variables into account. The Kaplan-Meier estimate and a Cox model using only clinical variables were used as reference methods. The methods were compared using several repetitions of 5-fold cross-validation. Uno's C-index and the integrated Brier score served as performance metrics. The results indicate that methods taking into account the multi-omics structure have a slightly better prediction performance. Taking this structure into account can protect the predictive information in low-dimensional groups-especially clinical variables-from not being exploited during prediction. Moreover, only the block forest method outperformed the Cox model on average, and only slightly. This indicates, as a by-product of our study, that in the considered TCGA studies the utility of multi-omics data for prediction purposes was limited. Contact:moritz.herrmann@stat.uni-muenchen.de, +49 89 2180 3198 Supplementary information: Supplementary data are available at Briefings in Bioinformatics online. All analyses are reproducible using R code freely available on Github.

Dataset Information

Unified methods for feature selection in large-scale genomic studies with censored survival outcomes.

Motivation

Results

Availability and implementation

Contact

Supplementary information

Publications

Unified methods for feature selection in large-scale genomic studies with censored survival outcomes.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets