Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset.


ABSTRACT:

Background

Correcting a heterogeneous dataset that contains artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects can also remove some biologically meaningful signal. Thus, a central challenge is assessing whether removing unwanted technical variation harms the biological signal that is of interest to the researcher.

Results

We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach compares the co-expression of gene-gene pairs in the adjusted data to a-priori knowledge of high-confidence gene-gene associations, drawn from an external reference built on thousands of unrelated experiments. Our framework includes three steps: (1) adjusting the data with the desired methods; (2) calculating gene-gene co-expression measurements for the adjusted datasets; and (3) evaluating the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data from six representative tissue datasets derived from the GTEx project.
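The three steps above can be illustrated with a minimal Python sketch. This is not the B-CeF R package: the function names, the per-batch mean-centering used as the adjustment, the absolute Pearson correlation used as the co-expression measure, and the AUC used as the score are all assumptions chosen for illustration, and the data are synthetic.

```python
import numpy as np

def center_by_batch(expr, batch):
    """Step 1 (stand-in adjustment): subtract each batch's per-gene mean.
    expr is samples x genes; batch holds one label per sample."""
    adjusted = np.asarray(expr, dtype=float).copy()
    batch = np.asarray(batch)
    for b in np.unique(batch):
        rows = batch == b
        adjusted[rows] -= adjusted[rows].mean(axis=0)
    return adjusted

def coexpression(expr):
    """Step 2: absolute Pearson correlation between every pair of genes."""
    return np.abs(np.corrcoef(expr, rowvar=False))

def gold_standard_auc(coexpr, gold_pairs):
    """Step 3: probability that a gold-standard pair outscores a
    non-gold pair (ties count as one half)."""
    gold = {tuple(sorted(pair)) for pair in gold_pairs}
    rows, cols = np.triu_indices(coexpr.shape[0], k=1)
    scores = coexpr[rows, cols]
    is_gold = np.array([(int(i), int(j)) in gold for i, j in zip(rows, cols)])
    pos, neg = scores[is_gold], scores[~is_gold]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)

def demo(seed=0):
    """Synthetic example: genes 0-1 and 2-3 are truly associated, while
    genes 4-5 are correlated only through a shared batch shift."""
    rng = np.random.default_rng(seed)
    n = 200
    batch = np.repeat([0, 1], n // 2)
    expr = rng.normal(size=(n, 6))
    expr[:, [0, 1]] += rng.normal(size=n)[:, None]  # shared biological signal
    expr[:, [2, 3]] += rng.normal(size=n)[:, None]  # shared biological signal
    expr[:, [4, 5]] += 5.0 * batch[:, None]         # batch artefact only
    gold = [(0, 1), (2, 3)]
    auc_raw = gold_standard_auc(coexpression(expr), gold)
    auc_adjusted = gold_standard_auc(
        coexpression(center_by_batch(expr, batch)), gold)
    return auc_raw, auc_adjusted
```

Calling `demo()` returns the score before and after adjustment. Comparing such scores across correction methods is the kind of evaluation the framework describes, with a curated external reference in place of the toy `gold` list.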

Conclusions

Our framework enables evaluating how well batch correction methods preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package.
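As a rough illustration of correcting for known confounders with multiple linear regression, the sketch below fits an ordinary least squares model per gene and keeps the residuals. The function name `regress_out` and the layout of the covariate matrix are hypothetical. This record does not list the covariates or the exact model used in the paper, so this is a generic sketch, not the authors' implementation.

```python
import numpy as np

def regress_out(expr, covariates):
    """Return residuals of expr (samples x genes) after an ordinary
    least squares fit on known covariates (samples x k).
    An intercept is included, so each gene's residuals have zero mean."""
    expr = np.asarray(expr, dtype=float)
    design = np.column_stack(
        [np.ones(expr.shape[0]), np.asarray(covariates, dtype=float)])
    # One least-squares solve fits every gene (column) at once.
    beta, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ beta
```

The residuals are uncorrelated with every supplied covariate by construction, so any co-expression that remains cannot be explained linearly by those known confounders. Categorical covariates such as a batch label would need to be one-hot encoded before being passed in.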

SUBMITTER: Somekh J 

PROVIDER: S-EPMC6537327 | biostudies-literature | 2019 May

REPOSITORIES: biostudies-literature


Publications

Batch correction evaluation framework using a-priori gene-gene associations: applied to the GTEx dataset.

Judith Somekh, Shai S Shen-Orr, Isaac S Kohane

BMC Bioinformatics, 2019-05-28, issue 1



Similar Datasets

| S-EPMC9985330 | biostudies-literature
| S-EPMC9985174 | biostudies-literature
| S-EPMC7530651 | biostudies-literature
| S-EPMC4567538 | biostudies-literature
| S-EPMC10155362 | biostudies-literature
| S-EPMC6706913 | biostudies-literature
| S-EPMC9050722 | biostudies-literature
2016-12-12 | GSE53355 | GEO
| S-EPMC5788449 | biostudies-literature
| S-EPMC7145015 | biostudies-literature