Dataset Information

Assessment of statistical methods from single cell, bulk RNA-seq, and metagenomics applied to microbiome data.

ABSTRACT:

Background

The correct identification of differentially abundant microbial taxa between experimental conditions is a methodological and computational challenge. Recent work has produced methods to deal with the high sparsity and compositionality characteristic of microbiome data, but independent benchmarks comparing these to alternatives developed for RNA-seq data analysis are lacking.
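To make the two data characteristics named above concrete, the following is a minimal sketch, not taken from the paper. It computes sparsity as the fraction of zero counts and applies a centered log-ratio (CLR) transform, one common way to handle compositional data. The toy table, the function names, and the pseudocount of 0.5 are all illustrative assumptions.

```python
import math

def sparsity(counts):
    """Fraction of zero entries in a samples-by-taxa count table."""
    total = sum(len(row) for row in counts)
    zeros = sum(1 for row in counts for value in row if value == 0)
    return zeros / total

def clr(sample, pseudocount=0.5):
    """Centered log-ratio transform of one sample's counts.

    The pseudocount is an assumption made here to avoid log(0)
    for taxa that are absent from the sample.
    """
    logs = [math.log(value + pseudocount) for value in sample]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]

# Hypothetical toy table: 3 samples (rows) by 4 taxa (columns).
toy_counts = [
    [10, 0, 5, 0],
    [0, 0, 20, 1],
    [3, 7, 0, 0],
]

table_sparsity = sparsity(toy_counts)          # half of the entries are zero
clr_values = [clr(row) for row in toy_counts]  # each row is centered on zero
```

After the transform, each sample's values sum to zero, which reflects that only relative abundances are informative.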

Results

We compare methods developed for single-cell and bulk RNA-seq, and specifically for microbiome data, in terms of suitability of distributional assumptions, ability to control false discoveries, concordance, power, and correct identification of differentially abundant genera. We benchmark these methods using 100 manually curated datasets from 16S and whole metagenome shotgun sequencing.

Conclusions

The multivariate and compositional methods developed specifically for microbiome analysis did not outperform univariate methods developed for differential expression analysis of RNA-seq data. We recommend a careful exploratory data analysis prior to application of any inferential model and we present a framework to help scientists make an informed choice of analysis methods in a dataset-specific manner.

SUBMITTER: Calgaro M

PROVIDER: S-EPMC7398076 | biostudies-literature | 2020 Aug

REPOSITORIES: biostudies-literature


Publications

Assessment of statistical methods from single cell, bulk RNA-seq, and metagenomics applied to microbiome data.

Matteo Calgaro, Chiara Romualdi, Levi Waldron, Davide Risso, Nicola Vitulo

Genome Biology, 2020-08-03, issue 1


PMID: 32746888

Similar Datasets

Project description:

Background

Single-cell RNA sequencing has gained popularity in recent years. Compared to bulk RNA-seq, single-cell RNA sequencing measures gene expression within individual cells instead of mean gene expression levels across all cells in the sample, so cell-to-cell variation in gene expression can be examined. Differential gene expression analysis remains the main purpose of most single-cell RNA sequencing experiments, and many methods have been developed in recent years to perform it on single-cell RNA sequencing data.

Results

Through simulation studies and real data examples, we evaluated the performance of five popular open-source methods for differential gene expression analysis of single-cell RNA sequencing data. The five methods were DEsingle (zero-inflated negative binomial model), Linnorm (empirical Bayes method on transformed count data using the limma package), monocle (an approximate chi-square likelihood ratio test), MAST (a generalized linear hurdle model), and DESeq2 (a generalized linear model with an empirical Bayes approach, also commonly used for bulk RNA sequencing differential expression analyses). We assessed false discovery rate (FDR) control, sensitivity, specificity, accuracy, and area under the receiver operating characteristic (AUROC) curve for all five methods under different sample sizes, distributional assumptions, and proportions of zeros in the data.

Conclusions

When the data followed negative binomial distributions, MAST performed best among the five methods, with the largest AUROC values across all tested sample sizes and proportions of truly differentially expressed genes. When the sample size increased to 100 in each group, MAST showed the best performance, with the highest AUROC regardless of the data distribution. If the excess zeros were filtered out before the differential expression analyses, DEsingle, Linnorm, and DESeq2 performed relatively better than MAST and monocle, with higher AUROC values.
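The evaluation measures listed in this description can be illustrated with a small sketch. It is not the study's code: the toy truth labels, scores, and the 0.5 calling threshold are hypothetical, and AUROC is computed here by its rank interpretation (the probability that a truly differential gene scores higher than a non-differential one, with ties counted as one half).

```python
def confusion_metrics(truth, called):
    """Sensitivity, specificity, observed FDR, and accuracy from boolean lists."""
    tp = sum(1 for t, c in zip(truth, called) if t and c)
    fp = sum(1 for t, c in zip(truth, called) if not t and c)
    fn = sum(1 for t, c in zip(truth, called) if t and not c)
    tn = sum(1 for t, c in zip(truth, called) if not t and not c)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # By convention the FDR is taken as 0 when nothing is called.
        "fdr": fp / (tp + fp) if (tp + fp) > 0 else 0.0,
        "accuracy": (tp + tn) / len(truth),
    }

def auroc(truth, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(truth, scores) if t]
    neg = [s for t, s in zip(truth, scores) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical example: 3 truly differential genes out of 8.
truth = [True, True, True, False, False, False, False, False]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.05, 0.4]  # higher = stronger evidence
called = [s >= 0.5 for s in scores]                  # illustrative threshold

metrics = confusion_metrics(truth, called)
area = auroc(truth, scores)
```

The threshold-based measures depend on the chosen cutoff, whereas AUROC summarizes ranking quality across all cutoffs, which is one reason benchmarks often report both.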

Project description: Finding the right balance of quality and quantity can be important, and it is essential that project quality does not drop below the level where important main conclusions are missed or misstated. We use knock-out and over-expression studies as a simplification to test recovery of a known causal gene in RNA-Seq cell line experiments. When single-end RNA-Seq reads were aligned with STAR and quantified with htseq-count, we found potential value in more frequently testing the Generalized Linear Model (GLM) implementation of edgeR with robust dispersion estimation for either single-variate or multi-variate 2-group comparisons (with the possibility of defining criteria less stringent than |fold-change| > 1.5 and FDR < 0.05). When considering a limited number of patient sample comparisons with larger sample sizes, there might be somewhat less variability between methods (except for DESeq1). At the same time, the ranking of the gene identified using immunohistochemistry (for ER/PR/HER2 in breast cancer samples from The Cancer Genome Atlas) showed a possible shift in performance compared to the cell line comparisons, potentially highlighting the utility of standard statistical tests and/or limma-based analysis with larger sample sizes. If this continues to hold in additional studies and comparisons, it would be consistent with the possibility that genomics projects should allocate time for methods troubleshooting. The analysis of public data presented in this study does not consider all experimental designs, and the presentation of downstream analysis is limited, so any estimate from this simplification would underestimate the true need for some methods testing in every project. Additionally, this set of independent cell line experiments is limited in its ability to determine how often a highly important gene is missed if the problem is rare (such as 10% or lower).
For example, under the assumption that only one method can be tested for "initial" analysis, it is not completely clear to what extent edgeR-robust might perform better than DESeq2 in the cell line experiments. Importantly, we do not wish to cause undue concern, and we believe that it should often be possible to define a differential gene expression workflow that is suitable for some purposes for many samples. Nevertheless, we provide a variety of measures that we believe emphasize the need to critically assess every individual project and maximize confidence in published results.
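The selection criteria mentioned in this description (|fold-change| > 1.5 and FDR < 0.05) can be sketched as follows. This is an illustration under stated assumptions, not the authors' workflow: fold changes are assumed to be supplied on the log2 scale, so the cutoff becomes |log2 fold-change| > log2(1.5); FDR values are obtained with the Benjamini-Hochberg procedure; and the gene names and numbers are hypothetical.

```python
import math

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical per-gene results from some differential expression test.
genes = ["geneA", "geneB", "geneC", "geneD", "geneE"]
log2_fold_changes = [2.0, 0.2, -1.0, 3.0, 0.1]
pvalues = [0.001, 0.01, 0.02, 0.2, 0.5]

fdr = benjamini_hochberg(pvalues)
min_abs_log2fc = math.log2(1.5)  # assumed reading of |fold-change| > 1.5

selected = [
    gene
    for gene, lfc, q in zip(genes, log2_fold_changes, fdr)
    if abs(lfc) > min_abs_log2fc and q < 0.05
]
```

Relaxing either cutoff, as the description suggests may sometimes be appropriate, only requires changing `min_abs_log2fc` or the 0.05 threshold.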
