Dataset Information

Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data.

ABSTRACT: BACKGROUND: Although ultrahigh-throughput RNA-Sequencing has become the dominant technology for genome-wide transcriptional profiling, the vast majority of RNA-Seq studies profile only tens of samples, and most analytical pipelines are optimized for these smaller studies. However, projects are generating ever-larger data sets comprising RNA-Seq data from hundreds or thousands of samples, often collected at multiple centers and from diverse tissues. These complex data sets present significant analytical challenges due to batch and tissue effects, but provide the opportunity to revisit the assumptions and methods that we use to preprocess, normalize, and filter RNA-Seq data, which are critical first steps for any subsequent analysis. RESULTS: We find that analysis of large RNA-Seq data sets requires both careful quality control and accounting for the sparsity that arises from the heterogeneity intrinsic to multi-group studies. We developed the Yet Another RNA Normalization software pipeline (YARN), which includes quality control and preprocessing, gene filtering, and normalization steps designed to facilitate downstream analysis of large, heterogeneous RNA-Seq data sets, and we demonstrate its use with data from the Genotype-Tissue Expression (GTEx) project. CONCLUSIONS: An R package instantiating YARN is available at http://bioconductor.org/packages/yarn.

SUBMITTER: Paulson JN

PROVIDER: S-EPMC5627434 | biostudies-literature | 2017 Oct

REPOSITORIES: biostudies-literature

Publications

Tissue-aware RNA-Seq processing and normalization for heterogeneous and sparse data.

Joseph N Paulson, Cho-Yi Chen, Camila M Lopes-Ramos, Marieke L Kuijjer, John Platig, Abhijeet R Sonawane, Maud Fagny, Kimberly Glass, John Quackenbush

BMC Bioinformatics, 2017-10-03, issue 1

PMID: 28974199

Similar Datasets

GC-content normalization for RNA-Seq data. | S-EPMC3315510 | biostudies-literature

Cell type-aware analysis of RNA-seq data. | S-EPMC8697413 | biostudies-literature

An Integrated Approach for RNA-seq Data Normalization. | S-EPMC4924883 | biostudies-literature

Scater: pre-processing, quality control, normalization and visualization of single-cell RNA-seq data in R. | S-EPMC5408845 | biostudies-literature

SCnorm: robust normalization of single-cell RNA-seq data. | S-EPMC5473255 | biostudies-literature

A graph-based algorithm for RNA-seq data normalization. | S-EPMC6980396 | biostudies-literature

The Impact of Normalization Methods on RNA-Seq Data Analysis. | S-EPMC4484837 | biostudies-literature

RUV-III-NB: normalization of single cell RNA-seq data. | S-EPMC9458465 | biostudies-literature

PsiNorm: a scalable normalization for single-cell RNA-seq data. | S-EPMC8696108 | biostudies-literature

Normalization Methods on Single-Cell RNA-seq Data: An Empirical Survey. | S-EPMC7019105 | biostudies-literature