Project description:Continuous threshold regression is a common type of nonlinear regression that is attractive to many practitioners for its easy interpretability. More widespread adoption of threshold regression faces two challenges: (i) the computational complexity of fitting threshold regression models and (ii) obtaining correct coverage of confidence intervals under model misspecification. Both challenges result from the non-smooth and non-convex nature of the threshold regression likelihood function. In this paper we first show that, taken together, these two issues make the ideal approach to model-robust inference in continuous threshold linear regression impractical. The need for a faster way of fitting continuous threshold linear models motivated us to develop a fast grid search method. The new method, based on the simple yet powerful dynamic programming principle, improves performance by several orders of magnitude.
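To make the fitting problem concrete, below is a minimal sketch of plain grid-search fitting for a continuous threshold (broken-stick) linear model, y = b0 + b1*x + b2*(x - t)_+. This is the naive baseline that the dynamic-programming method described above accelerates, not the paper's algorithm itself; all names and the simulated data are illustrative.

```python
# Naive grid search for a continuous threshold linear model: for each
# candidate threshold t, fit OLS with a hinge term and keep the best RSS.
import numpy as np

def fit_threshold_grid(x, y):
    """Return (threshold, coefficients, rss) minimizing residual sum of squares."""
    best = (None, None, np.inf)
    for t in np.sort(x)[1:-1]:  # candidate thresholds at interior observed x values
        X = np.column_stack([np.ones_like(x), x, np.clip(x - t, 0.0, None)])
        beta, rss, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = rss[0] if rss.size else np.sum((y - X @ beta) ** 2)
        if rss < best[2]:
            best = (t, beta, rss)
    return best

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200)
y = 1.0 + 0.5 * x + 2.0 * np.clip(x - 6.0, 0, None) + rng.normal(0, 1, x.size)
t_hat, beta_hat, rss = fit_threshold_grid(x, y)
print(t_hat, beta_hat)
```

Each candidate threshold requires a fresh least-squares fit, which is what makes the naive search slow; the dynamic-programming idea is precisely to reuse computation across neighbouring thresholds.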
Project description:Several research fields frequently deal with the analysis of diverse classification results for the same entities, which requires objective detection of the overlaps and divergences between the formed clusters. The congruence between classifications can be quantified by clustering agreement measures, including pairwise agreement measures. Several measures have been proposed, and the importance of obtaining confidence intervals for the point estimate when comparing these measures has been highlighted. A broad range of methods can be used for the estimation of confidence intervals. However, evidence is lacking about which methods are appropriate for calculating confidence intervals for most clustering agreement measures. Here we evaluate the resampling techniques of the bootstrap and the jackknife for calculating confidence intervals for clustering agreement measures. Contrary to what has been shown for some statistics, simulations showed that the jackknife performs better than the bootstrap at accurately estimating confidence intervals for pairwise agreement measures, especially when the agreement between partitions is low. The coverage of the jackknife confidence interval is robust to changes in cluster number and cluster size distribution.
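As an illustration of the jackknife approach evaluated here, the following sketch computes a leave-one-out jackknife confidence interval for a pairwise agreement measure. The adjusted Rand index and the normal-approximation interval are illustrative choices under stated assumptions, not necessarily the exact measures or interval forms used in the study.

```python
# Leave-one-out jackknife CI for a pairwise clustering agreement measure
# (here the adjusted Rand index, ARI).
import numpy as np
from scipy.stats import norm
from sklearn.metrics import adjusted_rand_score

def jackknife_ci(labels_a, labels_b, alpha=0.05):
    """Normal-approximation jackknife CI for the ARI between two partitions."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    n = labels_a.size
    theta_hat = adjusted_rand_score(labels_a, labels_b)
    # Leave-one-out replicates: drop entity i from both partitions.
    reps = np.array([
        adjusted_rand_score(np.delete(labels_a, i), np.delete(labels_b, i))
        for i in range(n)
    ])
    se = np.sqrt((n - 1) / n * np.sum((reps - reps.mean()) ** 2))
    z = norm.ppf(1 - alpha / 2)
    return theta_hat, (theta_hat - z * se, theta_hat + z * se)

# e.g. two partitions of the same ten entities
a = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
b = [0, 0, 1, 1, 1, 1, 2, 2, 0, 2]
print(jackknife_ci(a, b))
```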
Project description:Estimation of heritability is fundamental in genetic studies. Recently, heritability estimation using linear mixed models (LMMs) has gained popularity because these estimates can be obtained from unrelated individuals collected in genome-wide association studies. Typically, heritability estimation under LMMs uses the restricted maximum likelihood (REML) approach. Existing methods for constructing confidence intervals and estimating SEs for REML rely on asymptotic assumptions. However, these assumptions are often violated because of the bounded parameter space, statistical dependencies, and limited sample size, leading to biased estimates and inflated or deflated confidence intervals. Here, we show that the estimation of confidence intervals by state-of-the-art methods is inaccurate, especially when the true heritability is relatively low or relatively high. We further show that these inaccuracies occur in datasets including thousands of individuals. Such biases are present, for example, in estimates of the heritability of gene expression in the Genotype-Tissue Expression project and of lipid profiles in the Ludwigshafen Risk and Cardiovascular Health study. We also show that the probability that the genetic component is estimated as 0 is often high even when the true heritability is bounded away from 0, emphasizing the need for accurate confidence intervals. We propose a computationally efficient method, ALBI (accurate LMM-based heritability bootstrap confidence intervals), for estimating the distribution of the heritability estimator and for constructing accurate confidence intervals. Our method can be used as an add-on to existing methods for estimating heritability and variance components, such as GCTA, FaST-LMM, GEMMA, or EMMAX.
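A minimal sketch of the parametric-bootstrap idea behind such heritability confidence intervals follows: simulate phenotypes under the LMM at a fixed true heritability, re-estimate each replicate, and record the distribution of the estimator, including its mass at the h2 = 0 boundary. Everything here (grid maximum likelihood instead of REML, the toy kinship eigenvalues) is a simplifying assumption for illustration, not ALBI's implementation.

```python
# Parametric bootstrap of the heritability estimator in the eigenbasis of the
# kinship matrix, where the LMM covariance is diagonal: var_i = h2*lam_i + (1-h2).
import numpy as np

def ml_h2(y_rot, lam, grid=np.linspace(0, 1, 101)):
    """Profile maximum-likelihood h2 on a grid (total variance profiled out)."""
    best_h2, best_ll = 0.0, -np.inf
    for h2 in grid:
        v = h2 * lam + (1 - h2)
        ll = -0.5 * (np.sum(np.log(v)) + y_rot.size * np.log(np.sum(y_rot**2 / v)))
        if ll > best_ll:
            best_h2, best_ll = h2, ll
    return best_h2

rng = np.random.default_rng(1)
lam = rng.chisquare(df=1, size=2000) + 0.5    # stand-in kinship eigenvalues
h2_true = 0.3
reps = []
for _ in range(200):                          # parametric bootstrap replicates
    y_rot = rng.normal(0, np.sqrt(h2_true * lam + (1 - h2_true)))
    reps.append(ml_h2(y_rot, lam))
# Fraction of replicates estimated exactly at the h2 = 0 boundary.
print(np.mean(np.array(reps) == 0.0))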
Project description:In many scientific studies, the underlying data-generating process is unknown and multiple statistical models are considered to describe it. For example, in a factorial experiment we might consider models involving just main effects, as well as those that include interactions. Model-averaging is a commonly used statistical technique to allow for model uncertainty in parameter estimation. In the frequentist setting, the model-averaged estimate of a parameter is a weighted mean of the estimates from the individual models, with weights typically based on an information criterion, cross-validation, or bootstrapping. One approach to building a model-averaged confidence interval is to use a Wald interval, based on the model-averaged estimate and its standard error. This has been the default method in many application areas, particularly those in the life sciences. The MA-Wald interval, however, assumes that the studentized model-averaged estimate has a normal distribution, which can be far from true in practice due to the random, data-driven model weights. Recently, the model-averaged tail area Wald interval (MATA-Wald) has been proposed as an alternative; it assumes only that the studentized estimate from each model has a N(0, 1) or t-distribution when that model is true. This alternative to the MA-Wald interval has been shown to have better coverage in simulation studies. However, when the response variable is skewed, even these relaxed assumptions may not be valid, and use of these intervals might therefore result in poor coverage. We propose a new interval (MATA-SBoot) which uses a parametric bootstrap approach to estimate the distribution of the studentized estimate for each model, when that model is true. This method only requires that the studentized estimate from each model is approximately pivotal, an assumption that will often be true in practice, even for skewed data. We illustrate use of this new interval in the analysis of a three-factor marine global change experiment in which the response variable is assumed to have a lognormal distribution. We also perform a simulation study, based on the example, to compare the lower and upper error rates of this interval with those for existing methods. The results suggest that the MATA-SBoot interval can provide better error rates than existing intervals when we have skewed data, particularly for the upper error rate when the sample size is small.
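For concreteness, the sketch below implements the model-averaged tail area construction that MATA-SBoot builds on: the interval endpoint L solves sum_k w_k * G_k((theta_k - L) / se_k) = 1 - alpha/2, where G_k is the distribution of the studentized estimate under model k. Here G_k is taken as standard normal (the MATA-Wald case); MATA-SBoot would replace it with a parametric-bootstrap estimate of G_k. The inputs are illustrative.

```python
# MATA interval by root-finding: the model-averaged tail area, evaluated at a
# candidate endpoint, must equal the nominal tail probability.
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

def mata_interval(estimates, ses, weights, alpha=0.05):
    estimates, ses, weights = map(np.asarray, (estimates, ses, weights))

    def tail(limit, target):
        # Weighted average of per-model tail areas, minus the target level.
        return np.sum(weights * norm.cdf((estimates - limit) / ses)) - target

    lo_bracket = estimates.min() - 10 * ses.max()
    hi_bracket = estimates.max() + 10 * ses.max()
    lower = brentq(tail, lo_bracket, hi_bracket, args=(1 - alpha / 2,))
    upper = brentq(tail, lo_bracket, hi_bracket, args=(alpha / 2,))
    return lower, upper

# e.g. two candidate models with information-criterion-based weights
print(mata_interval([1.2, 1.8], [0.3, 0.4], [0.6, 0.4]))
```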
Project description:Interval estimates (estimates of parameters that include an allowance for sampling uncertainty) have long been touted as a key component of statistical analyses. There are several kinds of interval estimates, but the most popular are confidence intervals (CIs): intervals that contain the true parameter value in some known proportion of repeated samples, on average. The width of confidence intervals is thought to index the precision of an estimate; CIs are thought to be a guide to which parameter values are plausible or reasonable; and the confidence coefficient of the interval (e.g., 95%) is thought to index the plausibility that the true parameter is included in the interval. We show in a number of examples that CIs do not necessarily have any of these properties, and can lead to unjustified or arbitrary inferences. For this reason, we caution against relying upon confidence interval theory to justify interval estimates, and suggest that other theories of interval estimation should be used instead.
Project description:Self-self hybridisations were used to establish a 99% confidence interval, using RNA from non-tethered cell lines labelled with both Cy3 and Cy5 (theoretically identical cDNA populations), in order to assess the technical errors associated with such experiments. Keywords: comparative hybridization to assess expression profiles between Cy3- and Cy5-uniformly labelled templates.
Project description:In the analysis of networks we frequently require the statistical significance of some network statistic, such as a measure of similarity for the properties of interacting nodes. The structure of the network may introduce dependencies among the nodes, and it will in general be necessary to account for these dependencies in the statistical analysis. To this end we require some form of null model of the network: generally, rewired replicates of the network are generated which preserve only the degree (number of interactions) of each node. We show that this can fail to capture important features of network structure, and may result in unrealistic significance levels, when potentially confounding additional information is available. We present a new network resampling null model which takes into account the degree sequence as well as available biological annotations. Using gene ontology information as an illustration, we show how this information can be accounted for in the resampling approach, and the impact such information has on the assessment of the statistical significance of correlations and motif abundances in the Saccharomyces cerevisiae protein interaction network. An algorithm, GOcardShuffle, is introduced to allow for the efficient construction of an improved null model for network data. Using the protein interaction network of S. cerevisiae, correlations between the evolutionary rates and expression levels of interacting proteins, and their statistical significance, were assessed under null models conditioning on different aspects of the available data. The novel GOcardShuffle approach results in a null model for annotated network data which appears to describe the properties of real biological networks better. An improved statistical approach for the analysis of biological network data, which conditions on the available biological information, leads to qualitatively different results compared with approaches that ignore such annotations; in particular, we demonstrate that the effects of the biological organization of the network can be sufficient to explain the observed similarity of interacting proteins.
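For reference, the sketch below implements the standard degree-preserving rewiring null model that the abstract argues can be insufficient: generate rewired replicates, recompute the statistic of interest on each, and compare with the observed value. The annotation-aware GOcardShuffle step is not reproduced here, and the test statistic (degree assortativity) is an illustrative choice.

```python
# Degree-preserving rewiring null model via double-edge swaps, which keep
# every node's degree exactly while randomizing who connects to whom.
import networkx as nx
import numpy as np

def rewired_null(G, statistic, n_replicates=100, seed=0):
    rng = np.random.default_rng(seed)
    observed = statistic(G)
    null = []
    for _ in range(n_replicates):
        H = G.copy()
        nx.double_edge_swap(H, nswap=4 * H.number_of_edges(),
                            max_tries=100 * H.number_of_edges(),
                            seed=int(rng.integers(2**31)))
        null.append(statistic(H))
    null = np.array(null)
    # One-sided empirical p-value with the usual +1 correction.
    p = (np.sum(null >= observed) + 1) / (n_replicates + 1)
    return observed, null, p

G = nx.erdos_renyi_graph(200, 0.05, seed=1)
obs, null, p = rewired_null(G, nx.degree_assortativity_coefficient)
print(obs, p)
```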
Project description:Supporting decision making in drug development is a key purpose of pharmacometric models. Pharmacokinetic models predict exposures under alternative posologies or in different populations. Pharmacodynamic models predict drug effects based on drug exposure, disease, or other patient characteristics. Estimation uncertainty is commonly reported for model parameters; however, prediction uncertainty is the key quantity for clinical decision making. This tutorial reviews confidence and prediction intervals with associated calculation methods, encouraging pharmacometricians to report these routinely.
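The distinction the tutorial draws can be illustrated with ordinary linear regression standing in for a pharmacometric model: the confidence interval covers the mean prediction, while the prediction interval additionally carries residual variability for a new observation. The dose-exposure data below are purely illustrative.

```python
# Confidence interval (mean prediction) vs. prediction interval (new
# observation) for a simple linear dose-exposure model fit by OLS.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
dose = rng.uniform(10, 100, 40)
exposure = 0.8 * dose + rng.normal(0, 5, dose.size)

X = np.column_stack([np.ones_like(dose), dose])
beta, rss, *_ = np.linalg.lstsq(X, exposure, rcond=None)
n, p = X.shape
sigma2 = rss[0] / (n - p)                          # residual variance estimate
XtX_inv = np.linalg.inv(X.T @ X)

x_new = np.array([1.0, 55.0])                      # predict at dose = 55
mean_pred = x_new @ beta
var_mean = sigma2 * (x_new @ XtX_inv @ x_new)      # uncertainty in the mean only
var_obs = sigma2 * (1 + x_new @ XtX_inv @ x_new)   # plus residual variability
t = stats.t.ppf(0.975, df=n - p)
print("95% CI:", mean_pred - t * np.sqrt(var_mean), mean_pred + t * np.sqrt(var_mean))
print("95% PI:", mean_pred - t * np.sqrt(var_obs), mean_pred + t * np.sqrt(var_obs))
```

The prediction interval is always wider than the confidence interval at the same level, because it adds the residual variance to the variance of the mean prediction.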
Project description:The univariate bootstrap is a relatively recently developed version of the bootstrap (Lee and Rodgers in Psychol Methods 3(1): 91, 1998). DeFries-Fulker (DF) analysis is a regression model used to estimate parameters in behavioral genetic models (DeFries and Fulker in Behav Genet 15(5): 467-473, 1985). It is appealing for its simplicity; however, it violates certain regression assumptions, such as homogeneity of variance and independence of errors, which makes calculation of standard errors and confidence intervals problematic. Methods have been developed to account for these issues (Kohler and Rodgers in Behav Genet 31(2): 179-191, 2001); however, the univariate bootstrap represents a unique means of doing so that is presaged by suggestions from previous DF research (e.g., Cherny et al. in Behav Genet 22(2): 153-162, 1992). In the present study we use simulations to examine the performance of the univariate bootstrap in the context of DF analysis. We compare a number of possible bootstrap schemes as well as more traditional confidence interval methods. We follow up with an empirical demonstration, applying the results of the simulation to models estimated to investigate changes in body mass index in adults, using data from the National Longitudinal Survey of Youth 1979.
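As a simplified illustration of bootstrap interval construction for DF analysis, the sketch below resamples kinship pairs (the ordinary case bootstrap, not the univariate bootstrap studied in the paper) to form a percentile interval for the heritability coefficient. The DF model form is standard, but the simulated data and all other choices are illustrative.

```python
# Percentile bootstrap CI for the heritability coefficient in a DeFries-Fulker
# regression, resampling whole kinship pairs.
import numpy as np

def df_h2(p1, p2, r):
    """DF regression P1 ~ P2 + R + P2*R; the interaction coefficient estimates h2."""
    X = np.column_stack([np.ones_like(p1), p2, r, p2 * r])
    beta, *_ = np.linalg.lstsq(X, p1, rcond=None)
    return beta[3]

rng = np.random.default_rng(3)
n = 300
r = rng.choice([0.5, 1.0], size=n)                # DZ and MZ pairs
shared = rng.normal(0, 1, n)                      # shared genetic component
p1 = np.sqrt(r) * shared + np.sqrt(1 - r) * rng.normal(0, 1, n)
p2 = np.sqrt(r) * shared + np.sqrt(1 - r) * rng.normal(0, 1, n)

# Resample pairs with replacement and re-estimate h2 on each replicate.
boot = np.array([df_h2(p1[i], p2[i], r[i])
                 for i in (rng.integers(0, n, n) for _ in range(1000))])
ci = np.percentile(boot, [2.5, 97.5])
print(df_h2(p1, p2, r), ci)
```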