Dataset Information

Statistical Uncertainty Analysis for Small-Sample, High Log-Variance Data: Cautions for Bootstrapping and Bayesian Bootstrapping.

ABSTRACT: Recent advances in molecular simulations allow the evaluation of previously unattainable observables, such as rate constants for protein folding. However, these calculations are usually computationally expensive, and even significant computing resources may result in a small number of independent estimates spread over many orders of magnitude. Such small-sample, high "log-variance" data are not readily amenable to analysis using the standard uncertainty (i.e., "standard error of the mean") because unphysical negative limits of confidence intervals result. Bootstrapping, a natural alternative guaranteed to yield a confidence interval within the minimum and maximum values, also exhibits a striking systematic bias of the lower confidence limit in log space. As we show, bootstrapping artifactually assigns high probability to improbably low mean values. A second alternative, the Bayesian bootstrap strategy, does not suffer from the same deficit and is more logically consistent with the type of confidence interval desired. The Bayesian bootstrap provides uncertainty intervals that are more reliable than those from the standard bootstrap method but must be used with caution nevertheless. Neither standard nor Bayesian bootstrapping can overcome the intrinsic challenge of underestimating the mean from small-size, high log-variance samples. Our conclusions are based on extensive analysis of model distributions and reanalysis of multiple independent atomistic simulations. Although we only analyze rate constants, similar considerations will apply to related calculations, potentially including highly nonlinear averages like the Jarzynski relation.
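The two resampling schemes contrasted in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the percentile-based interval, the flat Dirichlet(1, ..., 1) weights for the Bayesian bootstrap, and the five sample values are all assumptions chosen for the example.

```python
import numpy as np


def bootstrap_means(data, n_resamples=10_000, rng=None):
    """Standard bootstrap: resample with replacement, take each resample's mean."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data, dtype=float)
    # Each row holds the indices of one resample of the same size as the data.
    idx = rng.integers(0, data.size, size=(n_resamples, data.size))
    return data[idx].mean(axis=1)


def bayesian_bootstrap_means(data, n_resamples=10_000, rng=None):
    """Bayesian bootstrap: weight the observed points with Dirichlet(1, ..., 1) draws."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data, dtype=float)
    # Each row is one set of non-negative weights summing to 1.
    weights = rng.dirichlet(np.ones(data.size), size=n_resamples)
    return weights @ data


def interval(samples, level=0.95):
    """Central percentile interval of the resampled means."""
    tail = 100.0 * (1.0 - level) / 2.0
    lo, hi = np.percentile(samples, [tail, 100.0 - tail])
    return lo, hi


# Hypothetical small sample spread over several orders of magnitude
# (illustrative values only, not data from the paper).
rates = np.array([1e-6, 3e-5, 2e-4, 5e-3, 1e-1])

std_lo, std_hi = interval(bootstrap_means(rates, rng=0))
bb_lo, bb_hi = interval(bayesian_bootstrap_means(rates, rng=0))

print(f"sample mean:        {rates.mean():.3e}")
print(f"standard bootstrap: [{std_lo:.3e}, {std_hi:.3e}]")
print(f"Bayesian bootstrap: [{bb_lo:.3e}, {bb_hi:.3e}]")
```

Both intervals stay within the minimum and maximum of the sample, since every resampled mean is a convex combination of the observed values. For this hypothetical sample the standard bootstrap's lower limit falls below the Bayesian one, because a sizable fraction of resamples omit the largest value entirely.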

SUBMITTER: Mostofian B

PROVIDER: S-EPMC6754704 | biostudies-literature | 2019 Jun

REPOSITORIES: biostudies-literature

Publications

Statistical Uncertainty Analysis for Small-Sample, High Log-Variance Data: Cautions for Bootstrapping and Bayesian Bootstrapping.

Barmak Mostofian, Daniel M. Zuckerman

Journal of Chemical Theory and Computation, 2019-05-07, issue 6

PMID: 31002504

Similar Datasets

Project description:

Background: Statistical inference based on small datasets, commonly found in precision oncology, is subject to low power and high uncertainty. In these settings, drawing strong conclusions about future research utility is difficult when using standard inferential measures. It is therefore important to better quantify the uncertainty associated with both significant and non-significant results based on small sample sizes.

Methods: We developed a new method, Bayesian Additional Evidence (BAE), that determines (1) how much additional supportive evidence is needed for a non-significant result to reach Bayesian posterior credibility, or (2) how much additional opposing evidence is needed to render a significant result non-credible. Although based in Bayesian analysis, a prior distribution is not needed; instead, the tipping point output is compared to reasonable effect ranges to draw conclusions. We demonstrate our approach in a comparative effectiveness analysis comparing two treatments in a real world biomarker-defined cohort, and provide guidelines for how to apply BAE in practice.

Results: Our initial comparative effectiveness analysis results in a hazard ratio of 0.31 with 95% confidence interval (0.09, 1.1). Applying BAE to this result yields a tipping point of 0.54; thus, an observed hazard ratio of 0.54 or smaller in a replication study would result in posterior credibility for the treatment association. Given that effect sizes in this range are not extreme, and that supportive evidence exists from a similar published study, we conclude that this problem is worthy of further research.

Conclusions: Our proposed method provides a useful framework for interpreting analytic results from small datasets. This can assist researchers in deciding how to interpret and continue their investigations based on an initial analysis that has high uncertainty. Although we illustrated its use in estimating parameters based on time-to-event outcomes, BAE easily applies to any normally-distributed estimator, such as those used for analyzing binary or continuous outcomes.

Project description: Uncertainty analysis is the process of identifying limitations in scientific knowledge and evaluating their implications for scientific conclusions. In the context of microbial risk assessment, the uncertainty in the predicted microbial behavior can be an important component of the overall uncertainty. Conventional deterministic modeling approaches, which provide point estimates of pathogen levels, cannot quantify the uncertainty around the predictions. The objective of this study was to use Bayesian statistical modeling to describe the uncertainty in the predicted thermal inactivation of Salmonella enterica Typhimurium DT104. A set of thermal inactivation data in broth with water activity adjusted to 0.75 at 9 different temperature conditions, obtained from the ComBase database (www.combase.cc), was used. A log-linear microbial inactivation model was used as the primary model, while for secondary modeling a linear relation between the logarithm of the inactivation rate and temperature was assumed. For comparison, data were fitted with a two-step and a global Bayesian regression. Posterior distributions of the model parameters were used to predict Salmonella thermal inactivation. Combining the joint posterior distributions of the model parameters allowed the prediction of cell density over time, total reduction time, and inactivation rate as probability distributions at different time and temperature conditions. For example, for the time required to eliminate a Salmonella population of about 10^7 CFU/ml at 65°C, the model predicted a time distribution with a median of 0.40 min and 5th and 95th percentiles of 0.24 and 0.60 min, respectively. Validation showed that the model can successfully describe the uncertainty in predicted thermal inactivation, with most observed data falling within the model's 95% prediction intervals. The global regression approach resulted in less uncertain predictions than the two-step regression. The developed model could be used to quantify uncertainty in thermal inactivation in risk-based processing design as well as in risk assessment studies.
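One common way to write the primary and secondary models described above is sketched below. The exact parameterization is not given in this description, so the symbols are assumptions: N_0 is the initial cell density, k(T) a base-10 inactivation rate at temperature T, a and b the secondary-model coefficients, and t_n the time for an n-log10 reduction.

```latex
\begin{align}
  \log_{10} N(t) &= \log_{10} N_0 - k(T)\, t   && \text{(primary, log-linear inactivation)} \\
  \log_{10} k(T) &= a + b\, T                  && \text{(secondary, linear in temperature)} \\
  t_n(T)         &= \frac{n}{k(T)}             && \text{(time for an $n$-log$_{10}$ reduction)}
\end{align}
```

Under these assumptions, drawing (a, b) from their joint posterior and propagating each draw through the three relations yields a distribution for the reduction time rather than a single point estimate.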
