Dataset Information

Improved high-dimensional prediction with Random Forests by the use of co-data.

ABSTRACT: BACKGROUND: Prediction in high-dimensional settings is difficult due to the large number of variables relative to the sample size. We demonstrate how auxiliary 'co-data' can be used to improve the performance of a Random Forest in such a setting. RESULTS: Co-data are incorporated in the Random Forest by replacing the uniform sampling probabilities that are used to draw candidate variables with co-data moderated sampling probabilities. Co-data here are defined as any type of information that is available on the variables of the primary data but does not use its response labels. These moderated sampling probabilities are, inspired by empirical Bayes, learned from the data at hand. We demonstrate the co-data moderated Random Forest (CoRF) with two examples. In the first example we aim to predict the presence of a lymph node metastasis with gene expression data. We demonstrate how a set of external p-values, a gene signature, and the correlation between gene expression and DNA copy number can improve the predictive performance. In the second example we demonstrate how the prediction of cervical (pre-)cancer with methylation data can be improved by including the location of the probe relative to the known CpG islands, the number of CpG sites targeted by a probe, and a set of p-values from a related study. CONCLUSION: The proposed method is able to utilize auxiliary co-data to improve the performance of a Random Forest.
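
The core idea in the abstract, drawing candidate variables with non-uniform probabilities, can be illustrated with a minimal sketch. This is not the CoRF implementation: the function names, the `scores` vector, and the simple normalisation are hypothetical, and the step in which the probabilities are learned from the data is not shown.

```python
import numpy as np


def codata_sampling_probs(codata_score):
    """Normalise non-negative per-variable scores into sampling probabilities.

    The scores are a hypothetical stand-in for whatever the co-data model
    produces; the abstract does not specify this mapping.
    """
    score = np.asarray(codata_score, dtype=float)
    if np.any(score < 0):
        raise ValueError("co-data scores must be non-negative")
    total = score.sum()
    if total == 0:
        # No co-data signal at all: fall back to uniform sampling.
        return np.full(score.shape, 1.0 / score.size)
    return score / total


def draw_candidates(probs, mtry, rng):
    """Draw `mtry` distinct candidate variables for one tree split."""
    return rng.choice(len(probs), size=mtry, replace=False, p=probs)


rng = np.random.default_rng(0)
scores = np.array([5.0, 1.0, 1.0, 1.0, 0.0, 2.0])  # illustrative values only
probs = codata_sampling_probs(scores)
candidates = draw_candidates(probs, mtry=3, rng=rng)
```

A standard Random Forest corresponds to all scores being equal; here a variable with a zero score is never offered as a split candidate.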

SUBMITTER: Te Beest DE

PROVIDER: S-EPMC5745983 | biostudies-literature | 2017 Dec

REPOSITORIES: biostudies-literature


Publications

Improved high-dimensional prediction with Random Forests by the use of co-data.

Te Beest DE, Mes SW, Wilting SM, Brakenhoff RH, van de Wiel MA

BMC Bioinformatics, 2017-12-28, issue 1


PMID: 29281963

Similar Datasets

Project description: BACKGROUND: High-dimensional prediction considers data with more variables than samples. Generic research goals are to find the best predictor or to select variables. Results may be improved by exploiting prior information in the form of co-data, providing complementary data not on the samples, but on the variables. We consider adaptive ridge penalised generalised linear and Cox models, in which the variable-specific ridge penalties are adapted to the co-data to give a priori more weight to more important variables. The R-package ecpc originally accommodated various and possibly multiple co-data sources, including categorical co-data, i.e. groups of variables, and continuous co-data. Continuous co-data, however, were handled by adaptive discretisation, potentially inefficiently modelling and losing information. As continuous co-data such as external p values or correlations often arise in practice, more generic co-data models are needed. RESULTS: Here, we present an extension to the method and software for generic co-data models, particularly for continuous co-data. At the basis lies a classical linear regression model, regressing prior variance weights on the co-data. Co-data variables are then estimated with empirical Bayes moment estimation. After placing the estimation procedure in the classical regression framework, extension to generalised additive and shape constrained co-data models is straightforward. Besides, we show how ridge penalties may be transformed to elastic net penalties. In simulation studies we first compare various co-data models for continuous co-data from the extension to the original method. Secondly, we compare variable selection performance to other variable selection methods. The extension is faster than the original method and shows improved prediction and variable selection performance for non-linear co-data relations. Moreover, we demonstrate use of the package in several genomics examples throughout the paper. CONCLUSIONS: The R-package ecpc accommodates linear, generalised additive and shape constrained additive co-data models for the purpose of improved high-dimensional prediction and variable selection. The extended version of the package as presented here (version number 3.1.1 and higher) is available on CRAN (https://cran.r-project.org/web/packages/ecpc/).
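
To make "variable-specific ridge penalties" concrete, the sketch below solves a linear ridge problem with a separate penalty per variable using the standard closed form. It is not the ecpc package or its estimation procedure: the penalty values are hand-picked hypothetical numbers standing in for penalties that would be derived from co-data, and the simulated data are illustrative only.

```python
import numpy as np


def adaptive_ridge(X, y, penalties):
    """Solve min ||y - X b||^2 + sum_j penalties[j] * b[j]^2 in closed form.

    The solution is b = (X'X + diag(penalties))^{-1} X'y. A smaller penalty
    gives a variable more prior weight (less shrinkage).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    penalties = np.asarray(penalties, dtype=float)
    gram = X.T @ X + np.diag(penalties)
    return np.linalg.solve(gram, X.T @ y)


# Simulated high-dimensional data: more variables (p) than samples (n).
rng = np.random.default_rng(1)
n, p = 20, 50
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:5] = 2.0
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Ordinary ridge: one shared penalty for every variable.
uniform = adaptive_ridge(X, y, np.full(p, 10.0))

# Hypothetical co-data suggest the first five variables matter more,
# so they receive a smaller penalty than the rest.
penalties = np.where(np.arange(p) < 5, 1.0, 100.0)
adapted = adaptive_ridge(X, y, penalties)
```

With equal penalties the function reduces to ordinary ridge regression, which is the baseline that co-data adapted penalties are meant to improve on.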

Project description: Combined focused ion beam and scanning electron microscope (FIB-SEM) tomography is a well-established technique for high resolution imaging and reconstruction of the microstructure of a wide range of materials. Segmentation of FIB-SEM data is complicated due to a number of factors; the most prominent is that for porous materials, the scanning electron microscope image slices contain information not only from the planar cross-section of the material but also from underlying, exposed subsurface pores. In this work, we develop a segmentation method for FIB-SEM data from ethyl cellulose porous films made from ethyl cellulose and hydroxypropyl cellulose (EC/HPC) polymer blends. These materials are used for coating pharmaceutical oral dosage forms (tablets or pellets) to control drug release. We study three samples of ethyl cellulose and hydroxypropyl cellulose with different volume fractions where the hydroxypropyl cellulose phase has been leached out, resulting in a porous material. The data are segmented using scale-space features and a random forest classifier. We demonstrate good agreement with manual segmentations. The method enables quantitative characterization and subsequent optimization of material structure for controlled release applications. Although the methodology is demonstrated on porous polymer films, it is applicable to other soft porous materials imaged by FIB-SEM. We make the data and software used publicly available to facilitate further development of FIB-SEM segmentation methods. LAY DESCRIPTION: For imaging of very fine structures in materials, the resolution limits of, e.g., X-ray computed tomography quickly become a bottleneck. Scanning electron microscopy (SEM) provides a way out, but it is essentially a two-dimensional imaging technique. One way to extend it to three dimensions is to combine a focused ion beam (FIB) with a scanning electron microscope and acquire tomography data. In FIB-SEM tomography, ions are used to perform serial sectioning and the electron beam is used to image the cross-section surface. This is a well-established method for a wide range of materials. However, image analysis of FIB-SEM data is complicated for a variety of reasons, in particular for porous media. In this work, we analyse FIB-SEM data from ethyl cellulose porous films made from ethyl cellulose and hydroxypropyl cellulose (EC/HPC) polymer blends. These films are used as coatings for controlled drug release. The aim is to perform image segmentation, i.e. to identify which parts of the image data constitute the pores and the solid, respectively. Manual segmentation, i.e. when a trained operator manually identifies areas constituting pores and solid, is too time-consuming to do in full for our very large data sets. However, by performing manual segmentation on a set of small, random regions of the data, we can train a machine learning algorithm to perform automatic segmentation on the entire data sets. The method agrees well with the manual segmentations and yields porosities for the entire data sets in very good agreement with expected values. The method facilitates understanding and quantitative characterization of the geometrical structure of the materials, and ultimately understanding of how to tailor the drug release.

Project description: BACKGROUND: Clinical research and medical practice can be advanced through the prediction of an individual's health state, trajectory, and responses to treatments. However, the majority of current clinical risk prediction models are based on regression approaches or machine learning algorithms that are static, rather than dynamic. To benefit from the increasing emergence of large, heterogeneous data sets, such as electronic health records (EHRs), novel tools are necessary to support improved clinical decision making through methods for individual-level risk prediction that can handle multiple variables, their interactions, and time-varying values. METHODS: We introduce a novel dynamic approach to clinical risk prediction for survival, longitudinal, and multivariate (SLAM) outcomes, called random forest for SLAM data analysis (RF-SLAM). RF-SLAM is a continuous-time, random forest method for survival analysis that combines the strengths of existing statistical and machine learning methods to produce individualized Bayes estimates of piecewise-constant hazard rates. We also present a method-agnostic approach for time-varying evaluation of model performance. RESULTS: We derive and illustrate the method by predicting sudden cardiac arrest (SCA) in the Left Ventricular (LV) Structural Predictors of Sudden Cardiac Death (SCD) Registry. We demonstrate superior performance relative to standard random forest methods for survival data. We illustrate the importance of the number of preceding heart failure hospitalizations as a time-dependent predictor in SCA risk assessment. CONCLUSIONS: RF-SLAM is a novel statistical and machine learning method that improves risk prediction by incorporating time-varying information and accommodating a large number of predictors, their interactions, and missing values. RF-SLAM is designed to easily extend to simultaneous predictions of multiple, possibly competing, events and/or repeated measurements of discrete or continuous variables over time. TRIAL REGISTRATION: LV Structural Predictors of SCD Registry (clinicaltrials.gov, NCT01076660), retrospectively registered 25 February 2010.
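
As a small worked illustration of what a piecewise-constant hazard rate implies, the sketch below converts such a hazard into a survival probability using the standard relation S(t) = exp(-cumulative hazard up to t). It is not part of RF-SLAM: the function name, the interval boundaries, and the hazard values are hypothetical.

```python
import numpy as np


def survival_from_piecewise_hazard(t, interval_starts, hazards):
    """Return S(t) = exp(-integral_0^t h(u) du) for a piecewise-constant hazard.

    hazards[k] applies from interval_starts[k] up to interval_starts[k + 1];
    the last hazard applies from the last start time onwards.
    """
    starts = np.asarray(interval_starts, dtype=float)
    hazards = np.asarray(hazards, dtype=float)
    ends = np.append(starts[1:], np.inf)
    # Time spent at risk inside each interval before time t.
    exposure = np.clip(np.minimum(t, ends) - starts, 0.0, None)
    return float(np.exp(-np.sum(hazards * exposure)))


# Hypothetical hazards per year for three follow-up intervals.
starts = [0.0, 1.0, 2.0]
rates = [0.1, 0.2, 0.3]
s_half_way = survival_from_piecewise_hazard(1.5, starts, rates)
s_three_years = survival_from_piecewise_hazard(3.0, starts, rates)
```

At t = 1.5 the cumulative hazard is 0.1 * 1 + 0.2 * 0.5 = 0.2, so the survival probability is exp(-0.2).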

