Dataset Information

Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data.

ABSTRACT: In recent mutation studies, analyses based on protein domain positions are gaining popularity over gene-centric approaches, since the latter are limited in accounting for the functional context that the position of a mutation provides. This presents a large-scale simultaneous inference problem, with hundreds of hypothesis tests to consider at the same time. This article aims to select significant mutation counts while controlling Type I error at a given level via False Discovery Rate (FDR) procedures. One main assumption is that the mutation counts follow a zero-inflated model, which accounts for both the true zeros in the count model and the excess zeros. The class of models considered is the Zero-inflated Generalized Poisson (ZIGP) distribution. Furthermore, we assume that there exists a cut-off value such that counts smaller than this value are generated from the null distribution. We present several data-dependent methods to determine the cut-off value. We also consider a two-stage procedure based on a screening process, so that mutation counts exceeding a certain value are considered significant. Simulated and protein domain data sets are used to illustrate this procedure for estimating the empirical null using a mixture of discrete distributions. Overall, while maintaining control of the FDR, the proposed two-stage testing procedure has superior empirical power.
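To make the zero-inflated model in the abstract more concrete, the following is a minimal sketch of a ZIGP probability mass function. It assumes Consul's parameterization of the generalized Poisson distribution (rate `theta > 0`, dispersion `0 <= lam < 1`) and a zero-inflation weight `pi0`; the article may use a different parameterization, and the function and parameter names here are illustrative only.

```python
import math


def gp_pmf(y, theta, lam):
    """Generalized Poisson pmf (Consul form); assumes theta > 0 and 0 <= lam < 1."""
    if y < 0:
        return 0.0
    # Work on the log scale to avoid overflow for large counts.
    log_p = (
        math.log(theta)
        + (y - 1) * math.log(theta + lam * y)
        - theta
        - lam * y
        - math.lgamma(y + 1)
    )
    return math.exp(log_p)


def zigp_pmf(y, pi0, theta, lam):
    """Zero-inflated GP: extra mass pi0 at zero, (1 - pi0) * GP elsewhere."""
    base = (1.0 - pi0) * gp_pmf(y, theta, lam)
    # A zero can be an "excess" zero (weight pi0) or a true zero of the count model.
    return pi0 + base if y == 0 else base


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only for illustration.
    for count in range(5):
        print(count, round(zigp_pmf(count, pi0=0.3, theta=2.0, lam=0.3), 4))
```

With `lam = 0` the count component reduces to an ordinary Poisson distribution, so the same function also covers the zero-inflated Poisson case.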

SUBMITTER: Gauran IIM

PROVIDER: S-EPMC5862774 | biostudies-other | 2018 Jun

REPOSITORIES: biostudies-other


Publications

Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data.

Iris Ivy M. Gauran, Junyong Park, Johan Lim, DoHwan Park, John Zylstra, Thomas Peterson, Maricel Kann, John L. Spouge

Biometrics, 2017-09-22, issue 2


PMID: 28940296

Similar Datasets

Project description: Microorganisms play critical roles in human health and disease. They live in diverse communities in which they interact synergistically or antagonistically. Thus, for estimating microbial associations with clinical covariates, such as treatment effects, joint (multivariate) statistical models are preferred. Multivariate models allow one to estimate and exploit complex interdependencies among multiple taxa, yielding more powerful tests of exposure or treatment effects than taxon-specific univariate analyses. Analysis of microbial count data also requires special attention because the data commonly exhibit zero inflation, i.e., more zeros than expected from a standard count distribution. To meet these needs, we developed a Bayesian variable selection model for multivariate count data with excess zeros that incorporates information on the covariance structure of the outcomes (counts for multiple taxa), while estimating associations with the mean levels of these outcomes. Though there has been much work on zero-inflated models for longitudinal data, little attention has been given to high-dimensional multivariate zero-inflated data modeled via a general correlation structure. Through simulation, we compared the performance of the proposed method to that of existing univariate approaches, for both the binary ("excess zero") and count parts of the model. When outcomes were correlated, the proposed variable selection method maintained type I error while boosting the ability to identify true associations in the binary component of the model. For the count part of the model, in some scenarios the univariate method had higher power than the multivariate approach. This higher power came at the cost of a highly inflated false discovery rate not observed with the proposed multivariate method. We applied the approach to oral microbiome data from the Pediatric HIV/AIDS Cohort Oral Health Study and identified five (of 44) species associated with HIV infection.

Project description: Background: The assessment of methods for analyzing over-dispersed zero-inflated count outcomes has received very little or no attention in stratified cluster randomized trials. In this study, we performed sensitivity analyses to empirically compare eight methods for analyzing a zero-inflated over-dispersed count outcome from the Vitamin D and Osteoporosis Study (ViDOS), originally designed to assess the feasibility of a knowledge translation intervention in the long-term care home setting. Method: Forty long-term care (LTC) homes were stratified and then randomized into knowledge translation (KT) intervention (19 homes) and control (21 homes) groups. The homes/clusters were stratified by home size (<250 / ≥250) and profit status (profit/non-profit). The outcome of this study was the number of falls measured at 6 months post-intervention. The following methods were used to assess the effect of the KT intervention on the number of falls: i) standard Poisson and negative binomial regression; ii) mixed-effects method with Poisson and negative binomial distribution; iii) generalized estimating equation (GEE) with Poisson and negative binomial; iv) zero-inflated Poisson and negative binomial, with the latter used as the primary approach. All these methods were compared with or without adjusting for stratification. Results: A total of 5,478 older people from 40 LTC homes were included in this study. The mean (=1) of the number of falls was smaller than the variance (=6). Also, 72% and 46% of the number of falls were zero in the control and intervention groups, respectively. The direction of the estimated incidence rate ratios (IRRs) was similar for all methods. The zero-inflated negative binomial yielded the lowest IRRs and narrowest 95% confidence intervals when adjusted for stratification compared to GEE and mixed-effect methods. Further, the widths of the 95% confidence intervals were narrower when the methods adjusted for stratification compared to the same method not adjusted for stratification. Conclusion: The overall conclusions from the GEE, mixed-effect and zero-inflated methods were similar. However, these methods differ in terms of effect estimate and widths of the confidence interval. Trial registration: ClinicalTrials.gov: NCT01398527. Registered: 19 July 2011.
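As a rough illustration of why these fall counts are described as zero-inflated, the sketch below compares the zero fraction a plain Poisson model would predict at the reported mean of 1 with the observed zero fractions quoted above. It assumes the rounded, pooled mean applies to both groups, which the description does not state, so the numbers are indicative only.

```python
import math

# Reported (rounded) mean number of falls; assumed here to hold for both groups.
mean_falls = 1.0

# Under a plain Poisson model, P(Y = 0) = exp(-mean).
poisson_zero = math.exp(-mean_falls)

# Observed zero fractions quoted in the description above.
observed_zero = {"control": 0.72, "intervention": 0.46}

for group, frac in observed_zero.items():
    excess = frac - poisson_zero
    print(f"{group}: observed {frac:.2f}, Poisson {poisson_zero:.2f}, excess {excess:+.2f}")
```

A Poisson model with mean 1 predicts about 37% zeros, below the observed fraction in both groups, which is consistent with the use of zero-inflated models in the study.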
