Dataset Information

The misuse of distributional assumptions in functional class scoring gene-set and pathway analysis.

ABSTRACT: Gene-set analysis (GSA) is a standard procedure for exploring potential biological functions of a group of genes. The development of its methodology has been an active research topic in recent decades. Many GSA methods, when newly proposed, rely on simulation studies to evaluate their performance, with an implicit assumption that the multivariate expression values are normally distributed. This assumption is commonly adopted in GSAs, particularly those in the group of functional class scoring (FCS) methods. The validity of the normality assumption has been disputed in several studies, yet no systematic analysis has been carried out to assess its effect. Our goal in this study is not to propose a new GSA method but to first examine whether the multi-dimensional gene expression data in gene sets follow a multivariate normal (MVN) distribution. Six statistical methods in three categories of MVN tests were applied to a total of 24 RNA data sets. These RNA values were collected from cancer patients as well as normal subjects, and were derived from microarray experiments, RNA sequencing, and single-cell RNA sequencing. Our first finding is that the MVN assumption is not always satisfied: it fails in many of the applications tested here. In the second part of this research, we evaluated the influence of non-normality on the statistical power of current FCS methods, both parametric and nonparametric. Specifically, we considered the scenario of mixture distributions, in which the RNA values represent more than one population. This second investigation demonstrates that a non-normal distribution of the RNA values causes a loss in the statistical power of these GSA tests, especially when subtypes exist. Among the FCS GSA tools and scenarios examined in this research, the N-statistics outperform the others.
Based on the results from these two investigations, we conclude that the assumption of MVN should be used with caution when evaluating new GSA tools, since this assumption cannot be guaranteed and violation may lead to spurious results, loss of power, and incorrect comparison between methods. If a newly proposed GSA tool is to be evaluated, we recommend the incorporation of a wide range of multivariate non-normal distributions or sampling from large databases if available.
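
To make the kind of check described in the abstract concrete, the following is a minimal sketch of one MVN test applied to a simulated gene set. It uses Mardia's multivariate skewness and kurtosis tests purely as an illustrative choice; this record does not state which six tests the authors applied. The sample size, number of genes, mixing proportion, and mean shift are all hypothetical values chosen for the illustration, and the two-component mixture stands in for the "subtypes" scenario.

```python
import numpy as np
from scipy import stats


def mardia_test(X):
    """Mardia's skewness and kurtosis tests for an (n samples, p genes) matrix."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    Xc = X - X.mean(axis=0)                # center each gene
    S = Xc.T @ Xc / n                      # maximum-likelihood covariance estimate
    D = Xc @ np.linalg.solve(S, Xc.T)      # n x n matrix of Mahalanobis cross-products
    b1 = (D ** 3).sum() / n ** 2           # multivariate skewness
    b2 = (np.diag(D) ** 2).mean()          # multivariate kurtosis

    # Under MVN, n*b1/6 is asymptotically chi-square with p(p+1)(p+2)/6 df.
    skew_stat = n * b1 / 6.0
    skew_df = p * (p + 1) * (p + 2) / 6.0
    skew_p = stats.chi2.sf(skew_stat, skew_df)

    # Under MVN, b2 is asymptotically normal with mean p(p+2), variance 8p(p+2)/n.
    kurt_z = (b2 - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n)
    kurt_p = 2.0 * stats.norm.sf(abs(kurt_z))

    return {"skew_stat": skew_stat, "skew_p": skew_p,
            "kurt_z": kurt_z, "kurt_p": kurt_p}


# Hypothetical simulation settings (not taken from the study).
rng = np.random.default_rng(0)
n_samples, n_genes = 600, 5

# Case 1: a single multivariate normal population.
normal_data = rng.standard_normal((n_samples, n_genes))

# Case 2: a 70/30 mixture of two populations ("subtypes") differing by a mean shift.
is_subtype = rng.random(n_samples) < 0.3
mixture_data = rng.standard_normal((n_samples, n_genes))
mixture_data[is_subtype] += 3.0

normal_result = mardia_test(normal_data)
mixture_result = mardia_test(mixture_data)

print("single population:", normal_result)
print("two-subtype mixture:", mixture_result)
```

An unequal mixture is used because a 50/50 mixture of two shifted normals is symmetric and would mainly affect the kurtosis rather than the skewness statistic. In practice, rows would be the samples and columns the genes belonging to one gene set.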

SUBMITTER: Ho CH

PROVIDER: S-EPMC8728032 | biostudies-literature | 2022 Jan

REPOSITORIES: biostudies-literature

Publications

The misuse of distributional assumptions in functional class scoring gene-set and pathway analysis.

Ho Chi-Hsuan, Huang Yu-Jyun, Lai Ying-Ju, Mukherjee Rajarshi, Hsiao Chuhsing Kate

G3 (Bethesda, Md.), 2022-01-01, 1

PMID: 34791175

Similar Datasets

Addressing erroneous scale assumptions in microbe and gene set enrichment analysis.
| S-EPMC10695402 | biostudies-literature

Robust extraction of functional signals from gene set analysis using a generalized threshold free scoring function.
| S-EPMC2761411 | biostudies-literature

Resolving missing protein problems using functional class scoring.
| S-EPMC9256666 | biostudies-literature

Functional-network-based gene set analysis using gene-ontology.
| S-EPMC3572115 | biostudies-literature

Pathway-targeting gene matrix for Drosophila gene set enrichment analysis.
| S-EPMC8553153 | biostudies-literature

CEA: Combination-based gene set functional enrichment analysis.
| S-EPMC6117355 | biostudies-literature

GAGE: generally applicable gene set enrichment for pathway analysis.
| S-EPMC2696452 | biostudies-literature

Gene set analysis exploiting the topology of a pathway.
| S-EPMC2945950 | biostudies-literature

Characterizing transcriptional heterogeneity through pathway and gene set overdispersion analysis.
| S-EPMC4772672 | biostudies-literature

Comparison of gene set scoring methods for reproducible evaluation of tuberculosis gene signatures.
| S-EPMC11191245 | biostudies-literature