Dataset Information

Measuring the reproducibility and quality of Hi-C data.

ABSTRACT:

Background

Hi-C is currently the most widely used assay to investigate the 3D organization of the genome and to study its role in gene regulation, DNA replication, and disease. However, Hi-C experiments are costly to perform and involve multiple complex experimental steps; thus, accurate methods for measuring the quality and reproducibility of Hi-C data are essential to determine whether the output should be used further in a study.

Results

Using real and simulated data, we profile the performance of several recently proposed methods for assessing reproducibility of population Hi-C data, including HiCRep, GenomeDISCO, HiC-Spector, and QuASAR-Rep. By explicitly controlling noise and sparsity through simulations, we demonstrate the deficiencies of performing simple correlation analysis on pairs of matrices, and we show that methods developed specifically for Hi-C data produce better measures of reproducibility. We also show how to use established measures, such as the ratio of intra- to interchromosomal interactions, and novel ones, such as QuASAR-QC, to identify low-quality experiments.
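As an illustration of the established quality measure mentioned above, the ratio of intra- to interchromosomal interactions can be sketched in a few lines. This is a minimal sketch and not the authors' implementation: the `(chrom_a, chrom_b, count)` input layout, the function name `intra_inter_ratio`, and the toy counts are all assumptions made for the example, and no cutoff for a "low-quality" experiment is implied.

```python
from typing import Iterable, Tuple

# Hypothetical input layout (an assumption for this sketch):
# (chromosome of read end 1, chromosome of read end 2, read-pair count)
Contact = Tuple[str, str, int]


def intra_inter_ratio(contacts: Iterable[Contact]) -> float:
    """Return total intrachromosomal counts divided by total interchromosomal counts."""
    intra = 0
    inter = 0
    for chrom_a, chrom_b, count in contacts:
        if chrom_a == chrom_b:
            intra += count  # both ends on the same chromosome
        else:
            inter += count  # ends on different chromosomes
    if inter == 0:
        raise ValueError("no interchromosomal contacts; ratio is undefined")
    return intra / inter


# Toy data, invented for illustration only.
toy_contacts = [
    ("chr1", "chr1", 600),
    ("chr2", "chr2", 300),
    ("chr1", "chr2", 100),
]

print(intra_inter_ratio(toy_contacts))
```

The ratio is only meaningful relative to other experiments processed the same way, so it is best compared across replicates rather than read as an absolute score.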

Conclusions

In this work, we assess reproducibility and quality measures by varying sequencing depth, resolution and noise levels in Hi-C data from 13 cell lines, with two biological replicates each, as well as 176 simulated matrices. Through this extensive validation and benchmarking of Hi-C data, we describe best practices for reproducibility and quality assessment of Hi-C experiments. We make all software publicly available at http://github.com/kundajelab/3DChromatin_ReplicateQC to facilitate adoption in the community.

SUBMITTER: Yardımcı GG

PROVIDER: S-EPMC6423771 | biostudies-literature | 2019 Mar

REPOSITORIES: biostudies-literature


Publications

Measuring the reproducibility and quality of Hi-C data.

Yardımcı, Galip Gürkan; Ozadam, Hakan; Sauria, Michael E G; Ursu, Oana; Yan, Koon-Kiu; Yang, Tao; Chakraborty, Abhijit; Kaul, Arya; Lajoie, Bryan R; Song, Fan; Zhan, Ye; Ay, Ferhat; Gerstein, Mark; Kundaje, Anshul; Li, Qunhua; Taylor, James; Yue, Feng; Dekker, Job; Noble, William S

Genome Biology, 2019 Mar 19, issue 1


PMID: 30890172

Similar Datasets

Project description:

Background

Correlation metrics are widely utilized in genomics analysis and often implemented with little regard to assumptions of normality, homoscedasticity, and independence of values. This is especially true when comparing values between replicated sequencing experiments that probe chromatin accessibility, such as assays for transposase-accessible chromatin via sequencing (ATAC-seq). Such data can possess several regions across the human genome with little to no sequencing depth and are thus non-normal with a large portion of zero values. Despite distributed use in the epigenomics field, few studies have evaluated and benchmarked how correlation and association statistics behave across ATAC-seq experiments with known differences or the effects of removing specific outliers from the data. Here, we developed a computational simulation of ATAC-seq data to elucidate the behavior of correlation statistics and to compare their accuracy under set conditions of reproducibility.

Results

Using these simulations, we monitored the behavior of several correlation statistics, including the Pearson's R and Spearman's ρ coefficients as well as Kendall's τ and Top-Down correlation. We also test the behavior of association measures, including the coefficient of determination R², Kendall's W, and normalized mutual information. Our experiments reveal an insensitivity of most statistics, including Spearman's ρ, Kendall's τ, and Kendall's W, to increasing differences between simulated ATAC-seq replicates. The removal of co-zeros (regions lacking mapped sequenced reads) between simulated experiments greatly improves the estimates of correlation and association. After removing co-zeros, the R² coefficient and normalized mutual information display the best performance, having a closer one-to-one relationship with the known portion of shared, enhanced loci between simulated replicates. When comparing values between experimental ATAC-seq data using a random forest model, mutual information best predicts ATAC-seq replicate relationships.

Conclusions

Collectively, this study demonstrates how measures of correlation and association can behave in epigenomics experiments. We provide improved strategies for quantifying relationships in these increasingly prevalent and important chromatin accessibility assays.
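The effect of co-zeros on a correlation estimate can be illustrated with a small sketch. It assumes two replicates summarized as equal-length lists of per-region read counts; the helper names `pearson` and `drop_co_zeros` and the toy counts are hypothetical and are not taken from the study's simulation code.

```python
import math
from typing import List, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def drop_co_zeros(
    x: Sequence[float], y: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Remove regions where BOTH replicates have zero counts (co-zeros)."""
    kept = [(a, b) for a, b in zip(x, y) if not (a == 0 and b == 0)]
    return [a for a, _ in kept], [b for _, b in kept]


# Toy per-region counts, invented for illustration: the replicates agree
# only on the empty regions and disagree on the regions with signal.
rep_a = [0, 0, 0, 0, 5, 1, 3, 0]
rep_b = [0, 0, 0, 0, 1, 5, 3, 0]

r_all = pearson(rep_a, rep_b)
r_filtered = pearson(*drop_co_zeros(rep_a, rep_b))
print(f"with co-zeros: {r_all:.3f}, without co-zeros: {r_filtered:.3f}")
```

In this toy case the shared empty regions pull the coefficient upward even though the covered regions disagree, which is the kind of distortion the abstract attributes to co-zeros.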
