Dataset Information

Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias.

ABSTRACT: Data normalization is a critical step in RNA sequencing (RNA-seq) analysis, aiming to remove systematic effects from the data to ensure that technical biases have minimal impact on the results. Analyzing numerous RNA-seq datasets, we detected a prevalent sample-specific length effect that leads to a strong association between gene length and fold-change estimates between samples. This stochastic sample-specific effect is not corrected by common normalization methods, including reads per kilobase of transcript length per million reads (RPKM), Trimmed Mean of M values (TMM), relative log expression (RLE), and quantile and upper-quartile normalization. Importantly, we demonstrate that this bias causes recurrent false positive calls by gene-set enrichment analysis (GSEA) methods, thereby leading to frequent functional misinterpretation of the data. Gene sets characterized by markedly short genes (e.g., ribosomal protein genes) or long genes (e.g., extracellular matrix genes) are particularly prone to such false calls. This sample-specific length bias is effectively removed by the conditional quantile normalization (cqn) and EDASeq methods, which allow the integration of gene length as a sample-specific covariate. Consequently, using these normalization methods led to substantial reduction in GSEA false results while retaining true ones. In addition, we found that application of gene-set tests that take into account gene-gene correlations attenuates false positive rates caused by the length bias, but statistical power is reduced as well. Our results advocate the inspection and correction of sample-specific length biases as default steps in RNA-seq analysis pipelines and reiterate the need to account for intergene correlations when performing gene-set enrichment tests to lessen false interpretation of transcriptomic data.
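
The length effect described in the abstract can be checked per sample pair by correlating gene length with the log fold change. The following is a minimal sketch of such a check, not the authors' procedure: the function names (`ranks`, `pearson`, `spearman`, `length_bias`), the pseudocount, and the toy values are all assumptions made for illustration, and the expression values are assumed to be already normalized (e.g., RPKM). A Spearman correlation far from zero would suggest that the pair carries a sample-specific length bias.

```python
import math


def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input has no variation."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: treat a constant input as "no association"
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def length_bias(lengths, expr_a, expr_b, pseudocount=1.0):
    """Correlate gene length with log2 fold change of sample b over sample a.

    The pseudocount (hypothetical default of 1.0) avoids division by zero
    for genes with no expression in one of the samples.
    """
    log_fc = [
        math.log2((b + pseudocount) / (a + pseudocount))
        for a, b in zip(expr_a, expr_b)
    ]
    return spearman(lengths, log_fc)


# Toy data (invented): longer genes appear progressively up-regulated in b.
gene_lengths = [500, 1000, 2000, 4000, 8000]
sample_a = [100, 100, 100, 100, 100]
sample_b = [80, 90, 100, 110, 125]
print(length_bias(gene_lengths, sample_a, sample_b))
```

A rank-based correlation is used here because the relationship between length and fold change need not be linear; the sign of the result indicates whether longer genes appear up- or down-regulated in the second sample.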

SUBMITTER: Mandelboum S

PROVIDER: S-EPMC6850523 | biostudies-literature | 2019 Nov

REPOSITORIES: biostudies-literature

Publications

Recurrent functional misinterpretation of RNA-seq data caused by sample-specific gene length bias.

Shir Mandelboum, Zohar Manber, Orna Elroy-Stein, Ran Elkon

PLoS Biology, 2019 Nov 12, issue 11

PMID: 31714939

Similar Datasets

Length bias correction for RNA-seq data in gene set analyses. S-EPMC3042188 | biostudies-literature

Gene set analysis controlling for length bias in RNA-seq experiments. S-EPMC5294840 | biostudies-literature

Modeling Exon-Specific Bias Distribution Improves the Analysis of RNA-Seq Data. S-EPMC4598124 | biostudies-literature

IVT-seq reveals extreme bias in RNA sequencing. S-EPMC4197826 | biostudies-literature

iMapSplice: Alleviating reference bias through personalized RNA-seq alignment. S-EPMC6086400 | biostudies-literature

A new approach to bias correction in RNA-Seq. S-EPMC3315719 | biostudies-literature

Bias caused by sampling error in meta-analysis with small sample sizes. S-EPMC6136825 | biostudies-literature

Freeze-quenched maize mesophyll and bundle sheath separation uncovers bias in previous tissue-specific RNA-Seq data. S-EPMC5853576 | biostudies-literature

BCseq: accurate single cell RNA-seq quantification with bias correction. S-EPMC6101504 | biostudies-literature

Improving RNA-Seq expression estimates by correcting for fragment bias. S-EPMC3129672 | biostudies-literature