Project description:Radiomics involves the study of tumor images to identify quantitative markers explaining cancer heterogeneity. The predominant approach is to extract hundreds to thousands of image features, including histogram features comprised of summaries of the marginal distribution of pixel intensities, which leads to multiple testing problems and can miss out on insights not contained in the selected features. In this paper, we present methods to model the entire marginal distribution of pixel intensities via the quantile function as functional data, regressed on a set of demographic, clinical, and genetic predictors to investigate their effects of imaging-based cancer heterogeneity. We call this approach quantile functional regression, regressing subject-specific marginal distributions across repeated measurements on a set of covariates, allowing us to assess which covariates are associated with the distribution in a global sense, as well as to identify distributional features characterizing these differences, including mean, variance, skewness, heavy-tailedness, and various upper and lower quantiles. To account for smoothness in the quantile functions, account for intrafunctional correlation, and gain statistical power, we introduce custom basis functions we call quantlets that are sparse, regularized, near-lossless, and empirically defined, adapting to the features of a given data set and containing a Gaussian subspace so non-Gaussianness can be assessed. We fit this model using a Bayesian framework that uses nonlinear shrinkage of quantlet coefficients to regularize the functional regression coefficients and provides fully Bayesian inference after fitting a Markov chain Monte Carlo. We demonstrate the benefit of the basis space modeling through simulation studies, and apply the method to Magnetic resonance imaging (MRI) based radiomic dataset from Glioblastoma Multiforme to relate imaging-based quantile functions to various demographic, clinical, and genetic predictors, finding specific differences in tumor pixel intensity distribution between males and females and between tumors with and without DDIT3 mutations.
Project description:This article develops a novel spatial quantile function-on-scalar regression model, which studies the conditional spatial distribution of a high-dimensional functional response given scalar predictors. With the strength of both quantile regression and copula modeling, we are able to explicitly characterize the conditional distribution of the functional or image response on the whole spatial domain. Our method provides a comprehensive understanding of the effect of scalar covariates on functional responses across different quantile levels and also gives a practical way to generate new images for given covariate values. Theoretically, we establish the minimax rates of convergence for estimating coefficient functions under both fixed and random designs. We further develop an efficient primal-dual algorithm to handle high-dimensional image data. Simulations and real data analysis are conducted to examine the finite-sample performance.
Project description:BackgroundMass spectrometry (MS) data are often generated from various biological or chemical experiments and there may exist outlying observations, which are extreme due to technical reasons. The determination of outlying observations is important in the analysis of replicated MS data because elaborate pre-processing is essential for successful analysis with reliable results and manual outlier detection as one of pre-processing steps is time-consuming. The heterogeneity of variability and low replication are often obstacles to successful analysis, including outlier detection. Existing approaches, which assume constant variability, can generate many false positives (outliers) and/or false negatives (non-outliers). Thus, a more powerful and accurate approach is needed to account for the heterogeneity of variability and low replication.FindingsWe proposed an outlier detection algorithm using projection and quantile regression in MS data from multiple experiments. The performance of the algorithm and program was demonstrated by using both simulated and real-life data. The projection approach with linear, nonlinear, or nonparametric quantile regression was appropriate in heterogeneous high-throughput data with low replication.ConclusionVarious quantile regression approaches combined with projection were proposed for detecting outliers. The choice among linear, nonlinear, and nonparametric regressions is dependent on the degree of heterogeneity of the data. The proposed approach was illustrated with MS data with two or more replicates.
Project description:We propose a class of estimation techniques for scalar-on-function regression where both outcomes and functional predictors may be observed at multiple visits. Our methods are motivated by a longitudinal brain diffusion tensor imaging tractography study. One of the study's primary goals is to evaluate the contemporaneous association between human function and brain imaging over time. The complexity of the study requires the development of methods that can simultaneously incorporate: (1) multiple functional (and scalar) regressors; (2) longitudinal outcome and predictor measurements per patient; (3) Gaussian or non-Gaussian outcomes; and (4) missing values within functional predictors. We propose two versions of a new method, longitudinal functional principal components regression (PCR). These methods extend the well-known functional PCR and allow for different effects of subject-specific trends in curves and of visit-specific deviations from that trend. The new methods are compared with existing approaches, and the most promising techniques are used for analyzing the tractography data.
Project description:Prediction of subject age from brain anatomical MRI has the potential to provide a sensitive summary of brain changes, indicative of different neurodegenerative diseases. However, existing studies typically neglect the uncertainty of these predictions. In this work we take into account this uncertainty by applying methods of functional data analysis. We propose a penalised functional quantile regression model of age on brain structure with cognitively normal (CN) subjects in the Alzheimer's Disease Neuroimaging Initiative (ADNI), and use it to predict brain age in Mild Cognitive Impairment (MCI) and Alzheimer's Disease (AD) subjects. Unlike the machine learning approaches available in the literature of brain age prediction, which provide only point predictions, the outcome of our model is a prediction interval for each subject.
Project description:For regression models with functional responses and scalar predictors, it is common for the number of predictors to be large. Despite this, few methods for variable selection exist for function-on-scalar models, and none account for the inherent correlation of residual curves in such models. By expanding the coefficient functions using a B-spline basis, we pose the function-on-scalar model as a multivariate regression problem. Spline coefficients are grouped within coefficient function, and group-minimax concave penalty (MCP) is used for variable selection. We adapt techniques from generalized least squares to account for residual covariance by "pre-whitening" using an estimate of the covariance matrix, and establish theoretical properties for the resulting estimator. We further develop an iterative algorithm that alternately updates the spline coefficients and covariance; simulation results indicate that this iterative algorithm often performs as well as pre-whitening using the true covariance, and substantially outperforms methods that neglect the covariance structure. We apply our method to two-dimensional planar reaching motions in a study of the effects of stroke severity on motor control, and find that our method provides lower prediction errors than competing methods.
Project description:Many bioinformatics tools are available for the quantitative analysis of proteomics experiments. Most of these tools use a dedicated statistical model to derive absolute quantitative protein values from mass spectrometry (MS) data. Here, we present iSanXoT, a standalone application that processes relative abundances between MS signals and then integrates them sequentially to upper levels using the previously published Generic Integration Algorithm (GIA). iSanXoT offers unique capabilities that complement conventional quantitative software applications, including statistical weighting and independent modeling of error distributions in each integration, aggregation of technical or biological replicates, quantification of posttranslational modifications, and analysis of coordinated protein behavior. iSanXoT is a standalone, user-friendly application that accepts output from popular proteomics pipelines and enables unrestricted creation of quantification workflows and fully customizable reports that can be reused across projects or shared among users. Numerous publications attest the successful application of diverse integrative workflows constructed using the GIA for the analysis of high-throughput quantitative proteomics experiments. iSanXoT has been tested with the main operating systems. Download links for the corresponding distributions are available at https://github.com/CNIC-Proteomics/iSanXoT/releases.
Project description:Additive models are flexible regression tools that handle linear as well as non-linear terms. The latter are typically modelled via smoothing splines. Additive mixed models extend additive models to include random terms when the data are sampled according to cluster designs (e.g. longitudinal).These models find applications in the study of phenomena like growth, certain disease mechanisms and energy expenditure in humans, when repeated measurements are available. We propose a novel additive mixed model for quantile regression. Our methods are motivated by an application to physical activity based on a data set with more than half a million accelerometer measurements in children of the UK Millennium Cohort Study. In a simulation study, we assess the proposed methods against existing alternatives.
Project description:The open XML format mzML, used for representation of MS data, is pivotal for the development of platform-independent MS analysis software. Although conversion from vendor formats to mzML must take place on a platform on which the vendor libraries are available (i.e. Windows), once mzML files have been generated, they can be used on any platform. However, the mzML format has turned out to be less efficient than vendor formats. In many cases, the naïve mzML representation is fourfold or even up to 18-fold larger compared with the original vendor file. In disk I/O limited setups, a larger data file also leads to longer processing times, which is a problem given the data production rates of modern mass spectrometers. In an attempt to reduce this problem, we here present a family of numerical compression algorithms called MS-Numpress, intended for efficient compression of MS data. To facilitate ease of adoption, the algorithms target the binary data in the mzML standard, and support in main proteomics tools is already available. Using a test set of 10 representative MS data files we demonstrate typical file size decreases of 90% when combined with traditional compression, as well as read time decreases of up to 50%. It is envisaged that these improvements will be beneficial for data handling within the MS community.