
Dataset Information


Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data.


ABSTRACT: In recent mutation studies, analyses based on protein domain positions are gaining popularity over gene-centric approaches, because the latter are limited in how well they capture the functional context that the position of a mutation provides. This presents a large-scale simultaneous inference problem, with hundreds of hypothesis tests to consider at the same time. This article aims to select significant mutation counts while controlling a given level of Type I error via False Discovery Rate (FDR) procedures. One main assumption is that the mutation counts follow a zero-inflated model, which accounts for both the true zeros of the count model and the excess zeros. The class of models considered is the Zero-inflated Generalized Poisson (ZIGP) distribution. Furthermore, we assume that there exists a cut-off value such that counts smaller than this value are generated from the null distribution. We present several data-dependent methods to determine the cut-off value. We also consider a two-stage procedure based on a screening process, so that mutation counts exceeding a certain value are considered significant. Simulated and protein domain data sets are used to illustrate this procedure for estimating the empirical null using a mixture of discrete distributions. Overall, while maintaining control of the FDR, the proposed two-stage testing procedure has superior empirical power.
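
As an illustration of the model class named in the abstract, the following is a minimal sketch of a ZIGP probability mass function. It assumes Consul's parametrization of the generalized Poisson distribution (rate `theta`, dispersion `lam`) and a zero-inflation weight `pi0`; the abstract does not state which parametrization the authors use, so these names and this form are assumptions, not the paper's definition.

```python
import math


def gp_pmf(x, theta, lam):
    """Generalized Poisson pmf, Consul parametrization (an assumption here).

    P(X = x) = theta * (theta + lam * x)**(x - 1) * exp(-theta - lam * x) / x!
    for x = 0, 1, 2, ..., with theta > 0 and 0 <= lam < 1.
    """
    if x < 0:
        return 0.0
    # Work on the log scale to avoid overflow for large counts.
    log_p = (
        math.log(theta)
        + (x - 1) * math.log(theta + lam * x)
        - theta
        - lam * x
        - math.lgamma(x + 1)
    )
    return math.exp(log_p)


def zigp_pmf(y, pi0, theta, lam):
    """Zero-inflated generalized Poisson pmf.

    With probability pi0 the count is an excess zero; otherwise it is
    drawn from the generalized Poisson, which can itself produce zeros.
    """
    base = (1.0 - pi0) * gp_pmf(y, theta, lam)
    return pi0 + base if y == 0 else base
```

Setting `lam = 0` reduces the generalized Poisson to the ordinary Poisson, so the same sketch covers a zero-inflated Poisson as a special case.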

SUBMITTER: Gauran IIM 

PROVIDER: S-EPMC5862774 | biostudies-other | 2018 Jun

REPOSITORIES: biostudies-other


Publications

Empirical null estimation using zero-inflated discrete mixture distributions and its application to protein domain data.

Iris Ivy M. Gauran, Junyong Park, Johan Lim, DoHwan Park, John Zylstra, Thomas Peterson, Maricel Kann, John L. Spouge

Biometrics, 2017-09-22, issue 2



Similar Datasets

S-EPMC7768662 | biostudies-literature
S-EPMC6763381 | biostudies-literature
S-EPMC7594114 | biostudies-literature
S-EPMC8319482 | biostudies-literature
S-EPMC7308073 | biostudies-literature