Dataset Information

Naught all zeros in sequence count data are the same.

ABSTRACT: Genomic studies feature multivariate count data from high-throughput DNA sequencing experiments, which often contain many zero values. These zeros can cause artifacts in statistical analyses, and multiple modeling approaches have been developed in response. Here, we apply different zero-handling models to gene-expression and microbiome datasets and show that models can disagree substantially in identifying the most differentially expressed sequences. Next, to rationally examine how different zero-handling models behave, we developed a conceptual framework outlining four types of processes that may give rise to zero values in sequence count data. Last, we performed simulations to test how zero-handling models behave in the presence of these different zero-generating processes. Our simulations showed that simple count models are sufficient across multiple processes, even when the true underlying process is unknown. On the other hand, a common zero-handling technique known as "zero-inflation" was suitable only under a zero-generating process associated with an unlikely set of biological and experimental conditions. Together, our work suggests several specific guidelines for developing and choosing state-of-the-art models for analyzing sparse sequence count data.
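To make the contrast between a simple count model and "zero-inflation" concrete, the following minimal sketch compares the probability of observing a zero under a plain Poisson model and under a zero-inflated Poisson. It is an illustration under stated assumptions, not the authors' simulation code: the Poisson choice, the function names, and the `rate` and `inflation` parameters are all hypothetical and do not come from the paper.

```python
import math


def poisson_zero_prob(rate: float) -> float:
    """P(count == 0) under a Poisson model with mean `rate`."""
    return math.exp(-rate)


def zip_zero_prob(rate: float, inflation: float) -> float:
    """P(count == 0) under a zero-inflated Poisson.

    With probability `inflation` the count is forced to zero;
    otherwise it is drawn from a Poisson with mean `rate`.
    """
    return inflation + (1.0 - inflation) * math.exp(-rate)


def zip_pmf(k: int, rate: float, inflation: float) -> float:
    """Full probability mass function of the zero-inflated Poisson."""
    base = math.exp(-rate) * rate ** k / math.factorial(k)
    extra_zero_mass = inflation if k == 0 else 0.0
    return extra_zero_mass + (1.0 - inflation) * base


if __name__ == "__main__":
    # Illustrative values only: a low-abundance sequence and a 30% inflation weight.
    for rate in (0.1, 1.0, 5.0):
        print(rate, poisson_zero_prob(rate), zip_zero_prob(rate, 0.3))
```

At low rates the plain Poisson model already assigns most of its mass to zero, so a sparse count table does not by itself imply that an extra inflation component is needed.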

SUBMITTER: Silverman JD

PROVIDER: S-EPMC7568192 | biostudies-literature | 2020

REPOSITORIES: biostudies-literature

Publications

Naught all zeros in sequence count data are the same.

Silverman JD, Roche K, Mukherjee S, David LA

Computational and Structural Biotechnology Journal, 2020-09-28

PMID: 33101615

Similar Datasets

Bayesian Correlation Analysis for Sequence Count Data. | S-EPMC5049778 | biostudies-literature

Differential expression analysis for sequence count data. | S-EPMC3218662 | biostudies-literature

A new Bayesian joint model for longitudinal count data with many zeros, intermittent missingness, and dropout with applications to HIV prevention trials. | S-EPMC6891130 | biostudies-literature

Sequence count data are poorly fit by the negative binomial distribution. | S-EPMC7192467 | biostudies-literature

Analysis of Microbiome Data in the Presence of Excess Zeros. | S-EPMC5682008 | biostudies-literature

baySeq: empirical Bayesian methods for identifying differential expression in sequence count data. | S-EPMC2928208 | biostudies-literature

A powerful and flexible approach to the analysis of RNA sequence count data. | S-EPMC3179656 | biostudies-literature

Spatial and Spatio-Temporal Models for Modeling Epidemiological Data with Excess Zeros. | S-EPMC4586626 | biostudies-literature

Spatial modeling of data with excessive zeros applied to reindeer pellet-group counts. | S-EPMC5513232 | biostudies-literature

Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. | S-EPMC6581436 | biostudies-literature