Project description:Background: Normalization is essential in dual-labelled microarray data analysis to remove non-biological variation and systematic biases. Many normalization methods have been used to remove such biases within slides (Global, Lowess) and across slides (Scale, Quantile and VSN). However, all these popular approaches rely on critical assumptions about the data distribution that are often not valid in practice. Results: In this study, we propose a novel assumption-free normalization method based on the Generalized Procrustes Analysis (GPA) algorithm. Using experimental and simulated normal microarray data and boutique array data, we systematically evaluate the normalization performance of the GPA method against six other popular normalization methods: Global, Lowess, Scale, Quantile, VSN, and a boutique array-specific housekeeping gene method. The assessment is based on three empirical criteria: across-slide variability, the Kolmogorov-Smirnov (K-S) statistic and the mean square error (MSE). Compared with the other methods, the GPA method performs consistently better in reducing across-slide variability and removing systematic bias. Conclusion: The GPA method is an effective normalization approach for microarray data analysis. In particular, it is free from the statistical and biological assumptions inherent in other normalization methods, which are often difficult to validate. The GPA method therefore has a major advantage in that it can be applied to diverse types of array sets, especially to boutique arrays where the majority of genes may be differentially expressed.
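As an illustration of the algorithm behind the method, the sketch below aligns a set of slide configurations by iterative centring, scaling, and rotation toward a consensus. It is a generic base-R GPA routine with hypothetical names (gpa_align, slides), not the authors' normalization code.

```r
# Generic Generalized Procrustes Analysis (GPA) alignment, minimal base-R sketch.
# Not the authors' implementation; 'configs' is a list of n x p matrices,
# e.g. one configuration per slide.
gpa_align <- function(configs, max_iter = 100, tol = 1e-8) {
  # remove translation: centre each configuration
  configs <- lapply(configs, scale, center = TRUE, scale = FALSE)
  # remove size differences: scale each to unit Frobenius norm
  configs <- lapply(configs, function(X) X / sqrt(sum(X^2)))
  consensus <- Reduce(`+`, configs) / length(configs)
  for (it in seq_len(max_iter)) {
    # rotate each configuration onto the current consensus (orthogonal Procrustes)
    configs <- lapply(configs, function(X) {
      s <- svd(crossprod(X, consensus))
      X %*% (s$u %*% t(s$v))
    })
    new_consensus <- Reduce(`+`, configs) / length(configs)
    if (sum((new_consensus - consensus)^2) < tol) break
    consensus <- new_consensus
  }
  list(aligned = configs, consensus = consensus)
}

# toy example: three simulated two-channel 'slides'
set.seed(1)
slides <- replicate(3, matrix(rnorm(200), ncol = 2), simplify = FALSE)
fit <- gpa_align(slides)
```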
Project description:In prognostic studies, the lasso technique is attractive because it improves the quality of predictions by shrinking regression coefficients, compared with predictions based on a model fitted via unpenalized maximum likelihood. Since some coefficients are set to zero, parsimony is achieved as well. It is unclear whether the performance of a model fitted using the lasso still shows optimism. Bootstrap methods have been advocated to quantify optimism and generalize model performance to new subjects. It is also unclear how resampling should be performed in the presence of multiply imputed data. The data were based on a cohort of Chronic Obstructive Pulmonary Disease patients. We constructed models to predict Chronic Respiratory Questionnaire dyspnea 6 months ahead. Optimism of the lasso model was investigated by comparing four approaches to handling multiply imputed data in the bootstrap procedure, using the study data and simulated data sets. In the first three approaches, data sets that had been completed via multiple imputation (MI) were resampled, while the fourth approach resampled the incomplete data set and then performed MI. The discriminative performance of the lasso model was optimistic. There was suboptimal calibration due to over-shrinkage. The estimate of optimism was sensitive to how imputed data were handled in the bootstrap resampling procedure. Resampling the completed data sets underestimates optimism, especially if, within a bootstrap step, the selected individuals differ across the imputed data sets. Incorporating the MI procedure in the validation yields estimates of optimism that are closer to the true value, albeit still slightly too large. Performance of prognostic models constructed using the lasso technique can thus be optimistic as well. Results of the internal validation are sensitive to how bootstrap resampling is performed.
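A schematic sketch of the fourth approach (resample the incomplete data, then impute within each bootstrap sample) is given below, using the mice and glmnet packages. The outcome name 'y', the R-squared performance measure, and the helper functions are illustrative; they are not the study's CRQ dyspnea model or its discrimination and calibration measures.

```r
# Bootstrap optimism with MI nested inside the bootstrap (approach 4), sketch only.
library(mice)
library(glmnet)

# apparent/test performance of a lasso fit, here measured as R-squared
r2 <- function(fit, dat) {
  x <- as.matrix(dat[, setdiff(names(dat), "y")])
  cor(dat$y, as.vector(predict(fit, newx = x, s = "lambda.min")))^2
}

boot_optimism <- function(incomplete, B = 50, m = 5) {
  orig_imp <- mice(incomplete, m = m, printFlag = FALSE)   # completed originals as test sets
  optimism <- numeric(B)
  for (b in seq_len(B)) {
    boot_dat <- incomplete[sample(nrow(incomplete), replace = TRUE), ]
    boot_imp <- mice(boot_dat, m = m, printFlag = FALSE)    # MI inside the bootstrap
    diffs <- sapply(seq_len(m), function(i) {
      train <- complete(boot_imp, i)
      test  <- complete(orig_imp, i)
      x     <- as.matrix(train[, setdiff(names(train), "y")])
      fit   <- cv.glmnet(x, train$y)                        # lasso, CV-chosen penalty
      r2(fit, train) - r2(fit, test)                        # apparent minus test
    })
    optimism[b] <- mean(diffs)
  }
  mean(optimism)   # subtract from apparent performance to correct for optimism
}
```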
Project description:This manuscript considers regression models for generalized, multilevel functional responses: functions are generalized in that they follow an exponential family distribution and multilevel in that they are clustered within groups or subjects. This data structure is increasingly common across scientific domains and is exemplified by our motivating example, in which binary curves indicating physical activity or inactivity are observed for nearly 600 subjects over 5 days. We use a generalized linear model to incorporate scalar covariates into the mean structure, and decompose subject-specific and subject-day-specific deviations using multilevel functional principal components analysis. Thus, functional fixed effects are estimated while accounting for within-function and within-subject correlations, and major directions of variability within and between subjects are identified. Fixed effect coefficient functions and principal component basis functions are estimated using penalized splines; model parameters are estimated in a Bayesian framework using Stan, a programming language that implements a Hamiltonian Monte Carlo sampler. Simulations designed to mimic the application show good estimation and inferential properties with reasonable computation times for moderate datasets, in both cross-sectional and multilevel scenarios; code is publicly available. In the application we identify effects of age and BMI on the time-specific change in probability of being active over a 24-hour period; in addition, the principal components analysis identifies the patterns of activity that distinguish subjects and days within subjects.
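The model itself is fitted in a Bayesian framework with Stan; as a much-simplified frequentist analogue (not the authors' implementation), the sketch below fits binary activity curves with penalized-spline fixed effects and a subject-level random effect using mgcv. The data frame 'activity' and its columns are hypothetical.

```r
# Simplified analogue of a generalized multilevel functional model via mgcv;
# not the paper's Stan implementation. 'activity' is long-format toy data with
# columns: active (0/1), time (minute of day), age, bmi, subject (factor).
library(mgcv)

fit <- gam(active ~ s(time, bs = "ps", k = 20) +        # smooth time effect
             s(time, by = age, bs = "ps", k = 20) +     # age-varying time effect
             s(time, by = bmi, bs = "ps", k = 20) +     # BMI-varying time effect
             s(subject, bs = "re"),                     # subject-level deviation
           family = binomial, data = activity, method = "REML")

# population-level probability of being active over a 24-hour period
newdat <- data.frame(time = 0:1439, age = 50, bmi = 25,
                     subject = activity$subject[1])
p_hat  <- predict(fit, newdat, type = "response", exclude = "s(subject)")
```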
Project description:Whenever statistical analyses are applied to multiply imputed datasets, specific formulas are needed to combine the results into one overall analysis, also called combination rules. In the context of regression analysis, combination rules for the unstandardized regression coefficients, the t-tests of the regression coefficients, and the F-tests for testing R² for significance have long been established. However, there is still no general agreement on how to combine the point estimators of R² in multiple regression applied to multiply imputed datasets. Additionally, no combination rules for standardized regression coefficients and their confidence intervals seem to have been developed at all. In the current article, two sets of combination rules for the standardized regression coefficients and their confidence intervals are proposed, and their statistical properties are discussed. Additionally, two improved point estimators of R² in multiply imputed data are proposed, which in their computation use the pooled standardized regression coefficients. Simulations show that the proposed pooled standardized coefficients produce only small bias and that their 95% confidence intervals produce coverage close to the theoretical 95%. Furthermore, the simulations show that the newly proposed pooled estimates of R² are less biased than two earlier proposed pooled estimates.
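The sketch below illustrates the general idea of pooling: standardize the variables in each completed data set, average the resulting coefficients (Rubin's rule for point estimates), and build an R² estimate from the pooled standardized coefficients. It uses the toy 'nhanes' data shipped with mice; the article's specific estimators and interval rules differ in detail.

```r
# Pooling standardized coefficients and an R^2 estimate over imputations, sketch only.
library(mice)

imp <- mice(nhanes, m = 10, printFlag = FALSE)     # toy data from the mice package
m   <- imp$m

std_coefs <- sapply(seq_len(m), function(i) {
  comp <- complete(imp, i)
  z    <- as.data.frame(scale(comp))               # standardize all variables
  coef(lm(bmi ~ age + hyp + chl, data = z))[-1]    # standardized slopes, drop intercept
})

beta_pooled <- rowMeans(std_coefs)                 # pooled standardized coefficients

# R^2 built from pooled standardized betas and pooled predictor-outcome correlations
r_xy <- sapply(seq_len(m), function(i) {
  comp <- complete(imp, i)
  cor(comp[, c("age", "hyp", "chl")], comp$bmi)
})
R2_pooled <- sum(beta_pooled * rowMeans(r_xy))
```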
Project description:Many approaches for variable selection with multiply imputed data in the development of a prognostic model have been proposed. However, no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) model selection on bootstrap data, using backward elimination based on AIC or the lasso, fitting the final model from the most frequently (e.g. ≥ 50%) selected variables over all MI and bootstrap data sets; (II) model selection on the original MI data, using the lasso. The final model is obtained by (i) averaging estimates of variables that were selected in any MI data set or (ii) in at least 50% of the MI data sets; (iii) performing the lasso on the stacked MI data; and (iv) as in (iii) but using individual weights determined by the fraction of missingness. In all lasso models, we used both the optimal penalty and the 1-se rule. We considered recalibrating models to correct for over-shrinkage due to the suboptimal penalty by refitting the linear predictor or all individual variables. We applied the methods to a real dataset of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, applying lasso selection with the 1-se penalty shows the best performance in both approach I and approach II. Stacking the MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.
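As a sketch of the stacking idea in (iii) and (iv), the code below stacks the completed data sets, fits a single lasso with the 1-se penalty, and weights each row by the fraction of observed values in the original record divided by the number of imputations. It uses the toy 'nhanes' data with a continuous outcome, whereas the study used a binary outcome and a logistic lasso; the weighting scheme shown is one common choice, not necessarily the exact one used in the study.

```r
# Lasso on stacked multiply imputed data with per-row weights, sketch only.
library(mice)
library(glmnet)

imp     <- mice(nhanes, m = 20, printFlag = FALSE)
stacked <- complete(imp, action = "long")          # all m completed sets stacked

# weight each row by (fraction observed in the original record) / m
frac_obs <- rowMeans(!is.na(nhanes))
w        <- rep(frac_obs, times = imp$m) / imp$m

x   <- as.matrix(stacked[, c("age", "hyp", "chl")])
fit <- cv.glmnet(x, stacked$bmi, weights = w)

coef(fit, s = "lambda.1se")                        # selection under the 1-se rule
```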
Project description:This Teaching Resource provides lecture notes, slides, and a problem set for a series of lectures from a course entitled "Systems Biology: Biomedical Modeling." The materials accompany a lecture introducing the mathematical concepts behind principal components analysis (PCA) and describe how to handle large data sets with correlation methods and unsupervised clustering using this popular method of analysis.
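A minimal example of the kind of computation the lecture covers is shown below: centre and scale the data, extract principal components, and inspect how much variance each explains.

```r
# Minimal PCA illustration in base R.
set.seed(1)
X <- matrix(rnorm(100 * 5), ncol = 5)            # toy data: 100 samples, 5 variables
X[, 2] <- X[, 1] + rnorm(100, sd = 0.1)          # introduce correlation between variables

pca <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)                                     # proportion of variance explained per PC
scores <- pca$x[, 1:2]                           # sample coordinates on PC1 and PC2
plot(scores, xlab = "PC1", ylab = "PC2")         # unsupervised view of sample structure
```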
Project description:The APPRIS database (http://appris-tools.org) uses protein structural and functional features and information from cross-species conservation to annotate splice isoforms in protein-coding genes. APPRIS selects a single protein isoform, the 'principal' isoform, as the reference for each gene based on these annotations. A single main splice isoform reflects the biological reality for most protein-coding genes, and APPRIS principal isoforms are the best predictors of these main protein isoforms. Here, we present updates to the database and new developments, including the addition of three new species (chimpanzee, Drosophila melanogaster and Caenorhabditis elegans), the expansion of APPRIS to cover the RefSeq gene set and the UniProtKB proteome for six species, and refinements in the core methods that make up the annotation pipeline. In addition, APPRIS now provides a measure of reliability for individual principal isoforms and updates with each release of the GENCODE/Ensembl and RefSeq reference sets. The individual GENCODE/Ensembl, RefSeq and UniProtKB reference gene sets for six organisms have been merged to produce common sets of splice variants.
Project description:Background: A key question when analyzing high-throughput data is whether the information provided by the measured biological entities (gene or metabolite expression, for example) is related to the experimental conditions or, rather, to interfering signals such as experimental bias or artefacts. Visualization tools are therefore useful for understanding the underlying structure of the data in a 'blind' (unsupervised) way. A well-established technique for doing so is Principal Component Analysis (PCA). PCA is particularly powerful if the biological question is related to the highest variance. Independent Component Analysis (ICA) has been proposed as an alternative to PCA, as it optimizes an independence condition to give more meaningful components. However, neither PCA nor ICA can overcome both the high dimensionality and noisy characteristics of biological data. Results: We propose Independent Principal Component Analysis (IPCA), which combines the advantages of both PCA and ICA. It uses ICA as a denoising process on the loading vectors produced by PCA to better highlight the important biological entities and reveal insightful patterns in the data. The result is a better clustering of the biological samples in graphical representations. In addition, a sparse version is proposed that performs an internal variable selection to identify biologically relevant features (sIPCA). Conclusions: On simulation studies and real data sets, we show that IPCA offers a better visualization of the data than ICA and with a smaller number of components than PCA. Furthermore, a preliminary investigation of the list of genes selected with sIPCA demonstrates that the approach is able to highlight relevant genes in the data with respect to the biological experiment. IPCA and sIPCA are both implemented in the R package mixomics, dedicated to the analysis and exploration of high-dimensional biological data sets, and on the mixomics web interface.
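A usage sketch is given below, assuming the ipca()/sipca() interface of the mixOmics package (argument names may differ between package versions); 'expr' is a hypothetical samples-by-genes expression matrix.

```r
# IPCA and sparse IPCA via mixOmics, usage sketch with hypothetical data 'expr'.
library(mixOmics)

ipca_fit <- ipca(expr, ncomp = 3)                    # ICA-denoised PCA loadings
plotIndiv(ipca_fit, comp = c(1, 2))                  # sample clustering on the first components

# sparse version: keepX gives how many genes contribute to each component
sipca_fit <- sipca(expr, ncomp = 3, keepX = c(50, 50, 50))
selectVar(sipca_fit, comp = 1)$name                  # genes retained on component 1
```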
Project description:We present Procrustean Approach to Cophylogeny (PACo), a novel statistical tool to test for congruence between phylogenetic trees, or between phylogenetic distance matrices of associated taxa. Unlike previous tests, PACo evaluates the dependence of one phylogeny upon the other. This makes it especially appropriate for testing the classical coevolutionary model, which assumes that parasites that spend part of their life in or on their hosts track the phylogeny of their hosts. The new method does not require fully resolved phylogenies and allows for multiple host-parasite associations. PACo produces a Procrustes superimposition plot, enabling a graphical assessment of the fit of the parasite phylogeny onto the host phylogeny, and a goodness-of-fit statistic whose significance is established by randomization of the host-parasite association data. The contribution of each individual host-parasite association to the global fit is measured by means of jackknife estimation of their respective squared residuals, with confidence intervals associated with each host-parasite link. We carried out different simulations to evaluate the performance of PACo in terms of Type I and Type II errors with respect to two similar published tests. In most instances, PACo performed at least as well as the other tests and showed higher overall statistical power. In addition, the jackknife estimation of squared residuals enabled more detailed assessment of the nature of individual links than the ParaFitLink1 test of the program ParaFit. To demonstrate how it can be used in real biological situations, we applied PACo to two published studies using a script written in the public-domain statistical software R.
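The sketch below shows an illustrative Procrustes test of host-parasite congruence using base R and the vegan package; it is not the published PACo script, it assumes one parasite per host with rows of the two distance matrices matched by association, and it omits the jackknifed residuals and multiple-association handling described above.

```r
# Illustrative Procrustes congruence test with vegan; not the PACo script itself.
library(vegan)

# host_dist, para_dist: hypothetical patristic distance matrices (class 'dist'),
# with rows matched according to the host-parasite associations
host_xy <- cmdscale(host_dist, k = 4)                    # principal coordinates of each tree
para_xy <- cmdscale(para_dist, k = 4)

fit <- procrustes(host_xy, para_xy, symmetric = FALSE)   # superimpose parasite onto host
plot(fit)                                                # superimposition plot
residuals(fit)                                           # per-association residuals

protest(host_xy, para_xy, permutations = 999)            # significance by permutation
```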
Project description:The influence of livestock production system [Freedom Extensive System (FES) versus Semi-Extensive System (SES)] and finishing diet (1.5 kg versus 3.0 kg of commercial feed during the finishing period) on the sensory properties of foal meat was studied using Generalized Procrustes Analysis (GPA). A total of 24 foals (8 from FES and 16 from SES) were used. Samples were evaluated by eight panelists for eight sensory attributes: colour, marbling, odour intensity, sweetness, springiness, hardness, chewiness, and juiciness. Data were analyzed using GPA to minimize differences among panelists. The most highly appreciated sensory properties (odour intensity, red colour, marbling, and juiciness) were mostly associated with foals from the Semi-Extensive System. Moreover, the three groups studied (FES, 1.5SES, and 3SES) were clearly distinguished by panelists in the consensus configuration and were well separated on the map. The study concluded that the sensory characteristics of foal meat from the Semi-Extensive Production System with a finishing diet of 3 kg of concentrate were preferred over those of the other groups.
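A sketch of this kind of panel analysis is shown below, assuming the GPA() interface of the FactoMineR package; 'sensory' is a hypothetical data frame with the eight attribute scores of each of the eight panelists laid out side by side, one row per meat sample.

```r
# Panel-data GPA via FactoMineR, usage sketch with hypothetical data 'sensory'.
library(FactoMineR)

res <- GPA(sensory,
           group = rep(8, 8),                        # 8 attributes scored by each panelist
           name.group = paste0("panelist", 1:8),
           graph = TRUE)                             # consensus map of the samples

res$consensus                                        # consensus configuration (FES, 1.5SES, 3SES)
```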