Dataset Information

Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis.

ABSTRACT: Genomic data production is at its highest level and continues to increase, making available novel primary data and existing public data to researchers for exploration. Here we explore the consequences of "batch" correction for biological discovery in two publicly available expression datasets. We consider this to include the estimation of and adjustment for wide-spread systematic heterogeneity in genomic measurements that is unrelated to the effects under study, whether it be technical or biological in nature. We present three illustrative data analyses using surrogate variable analysis (SVA) and describe how to perform artifact discovery in light of natural heterogeneity within biological groups, secondary biological questions of interest, and non-linear treatment effects in a dataset profiling differentiating pluripotent cells (GSE32923) and another from human brain tissue (GSE30272). Careful specification of biological effects of interest is very important to factor-based approaches like SVA. We demonstrate greatly sharpened global and gene-specific differential expression across treatment groups in stem cell systems. Similarly, we demonstrate how to preserve major non-linear effects of age across the lifespan in the brain dataset. However, the gains in precisely defining known effects of interest come at the cost of much other information in the "cleaned" data, including sex, common copy number effects and sample or cell line-specific molecular behavior. Our analyses indicate that data "cleaning" can be an important component of high-throughput genomic data analysis when interrogating explicitly defined effects in the context of data affected by robust technical artifacts. However, caution should be exercised to avoid removing biological signal of interest. It is also important to note that open data exploration is not possible after such supervised "cleaning", because effects beyond those stipulated by the researcher may have been removed.
With the goal of making these statistical algorithms more powerful and transparent to researchers in the biological sciences, we provide exploratory plots and accompanying R code for identifying and guiding the "cleaning" process (https://github.com/andrewejaffe/StemCellSVA). The impact of these methods is significant enough that we have made newly processed data available for the brain data set at http://braincloud.jhmi.edu/plots/ and GSE30272.
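The trade-off described in the abstract can be illustrated with a toy calculation. The sketch below is not the SVA algorithm and is not taken from the authors' R code. It is a simplified stand-in that assumes a single hidden factor, estimates it as the leading component of the residuals left after fitting the effect of interest, and then subtracts it. The data are simulated, and every name in the block (`residualize`, `leading_sample_factor`, `remove_factor`, `hidden`, and so on) is hypothetical.

```python
import random

def group_means(row, labels):
    """Mean of one gene's values within each label value."""
    sums, counts = {}, {}
    for x, g in zip(row, labels):
        sums[g] = sums.get(g, 0.0) + x
        counts[g] = counts.get(g, 0) + 1
    return {g: sums[g] / counts[g] for g in sums}

def residualize(data, labels):
    """Subtract each gene's within-group mean (protects the effect of interest)."""
    out = []
    for row in data:
        means = group_means(row, labels)
        out.append([x - means[g] for x, g in zip(row, labels)])
    return out

def leading_sample_factor(resid, iters=100, seed=0):
    """Power iteration for the top right singular vector (one value per sample)."""
    rng = random.Random(seed)
    n = len(resid[0])
    v = [rng.gauss(0, 1) for _ in range(n)]
    for _ in range(iters):
        u = [sum(r * x for r, x in zip(row, v)) for row in resid]  # R v
        v = [sum(resid[i][j] * u[i] for i in range(len(resid)))    # R^T u
             for j in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    return v

def remove_factor(data, factor):
    """Subtract each gene's projection onto the unit-length factor."""
    cleaned = []
    for row in data:
        coef = sum(x * f for x, f in zip(row, factor))
        cleaned.append([x - coef * f for x, f in zip(row, factor)])
    return cleaned

def mean_abs_diff(data, labels):
    """Average over genes of |mean(label 1) - mean(label 0)|."""
    total = 0.0
    for row in data:
        m = group_means(row, labels)
        total += abs(m[1] - m[0])
    return total / len(data)

# Simulated data (assumption): 50 genes, 8 samples, a known treatment and an
# unmodeled two-level variable that could be technical (batch) or biological (sex).
rng = random.Random(1)
treatment = [0, 0, 0, 0, 1, 1, 1, 1]
hidden = [0, 1, 0, 1, 0, 1, 0, 1]
data = []
for _ in range(50):
    t_eff, h_eff = rng.gauss(0, 1), rng.gauss(0, 2)
    data.append([t_eff * t + h_eff * h + rng.gauss(0, 0.3)
                 for t, h in zip(treatment, hidden)])

factor = leading_sample_factor(residualize(data, treatment))
cleaned = remove_factor(data, factor)

print("treatment signal before/after:",
      mean_abs_diff(data, treatment), mean_abs_diff(cleaned, treatment))
print("hidden signal before/after:",
      mean_abs_diff(data, hidden), mean_abs_diff(cleaned, hidden))
```

Because the factor is estimated from residuals that sum to zero within each treatment group, removing it leaves the treatment group means unchanged, while the signal from the unmodeled variable is largely removed. In this toy setting that mirrors the abstract's caution: the stipulated effect is preserved, but other structure, whether artifact or biology, is no longer available for exploration.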

SUBMITTER: Jaffe AE

PROVIDER: S-EPMC4636836 | biostudies-literature | 2015 Nov

REPOSITORIES: biostudies-literature

Publications

Practical impacts of genomic data "cleaning" on biological discovery using surrogate variable analysis.

Jaffe AE, Hyde T, Kleinman J, Weinberger DR, Chenoweth JG, McKay RD, Leek JT, Colantuoni C

BMC Bioinformatics, 2015 Nov 6

PMID: 26545828

Similar Datasets

Discovery of biological networks from diverse functional genomic data.
| S-EPMC1414113 | biostudies-literature

Biological and practical implications of genome-wide association study of schizophrenia using Bayesian variable selection.
| S-EPMC6863898 | biostudies-literature

Impacts of exhaust gas cleaning systems (EGCS) discharge waters on planktonic biological indicators.
| S-EPMC10152311 | biostudies-literature

Federated discovery and sharing of genomic data using Beacons.
| S-EPMC6728157 | biostudies-literature

Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction.
| S-EPMC4173013 | biostudies-literature

Responsible, practical genomic data sharing that accelerates research.
| S-EPMC7974070 | biostudies-literature

Interpreting Microbial Biosynthesis in the Genomic Age: Biological and Practical Considerations.
| S-EPMC5484115 | biostudies-literature

Evaluating surrogate marker information using censored data.
| S-EPMC5413393 | biostudies-literature

Variable selection in omics data: A practical evaluation of small sample sizes.
| S-EPMC6013185 | biostudies-literature

FACTERA: a practical method for the discovery of genomic rearrangements at breakpoint resolution.
| S-EPMC4296148 | biostudies-literature