Dataset Information

A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection.

ABSTRACT: BACKGROUND: The application of high-throughput sequencing in a broad range of quantitative genomic assays (e.g., DNA-seq, ChIP-seq) has created a high demand for the analysis of large-scale read-count data. Typically, the genome is divided into tiling windows and windowed read-count data is generated for the entire genome, from which genomic signals are detected (e.g., copy number changes in DNA-seq, enrichment peaks in ChIP-seq). For accurate analysis of read-count data, many state-of-the-art statistical methods use generalized linear models (GLM) coupled with the negative-binomial (NB) distribution, leveraging its ability to perform bias correction and signal detection simultaneously. However, although statistically powerful, the GLM+NB method has quadratic computational complexity and therefore runs slowly when applied to large-scale windowed read-count data. In this study, we aimed to substantially speed up the GLM+NB method by using a randomized algorithm, and we demonstrate the utility of our approach in detecting copy number variants (CNVs) using a real example. RESULTS: We propose an efficient estimator, the randomized GLM+NB coefficients estimator (RGE), for speeding up the GLM+NB method. RGE samples the read-count data and solves the estimation problem on a smaller scale. We first theoretically validated the consistency and the variance properties of RGE. We then applied RGE to GENSENG, a GLM+NB-based method for detecting CNVs, and named the resulting method "R-GENSENG". Based on extensive evaluation using both simulated and empirical data, we concluded that R-GENSENG is ten times faster than the original GENSENG while maintaining GENSENG's accuracy in CNV detection. CONCLUSIONS: Our results suggest that the RGE strategy developed here could be applied to other GLM+NB-based read-count analyses, e.g., ChIP-seq data analysis, to substantially improve their computational efficiency while preserving the analytic power.
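The idea of sampling the windowed read counts and estimating the GLM+NB coefficients on the smaller sample can be illustrated with a minimal sketch. This is not the RGE or R-GENSENG implementation: the function names (`fit_nb_glm`, `randomized_fit`, `simulate`), the single covariate, the fixed and known dispersion `phi`, and the uniform sampling without replacement are all assumptions made for illustration only.

```python
import math
import random


def fit_nb_glm(x, y, phi, iters=25):
    """Fit log(mu) = b0 + b1*x under an NB model with fixed dispersion phi.

    Uses iteratively reweighted least squares (IRLS). The single covariate
    and the known dispersion are simplifying assumptions of this sketch.
    """
    b0, b1 = math.log(sum(y) / len(y) + 1e-9), 0.0
    for _ in range(iters):
        s_w = s_wx = s_wxx = s_wz = s_wxz = 0.0
        for xi, yi in zip(x, y):
            eta = b0 + b1 * xi
            mu = math.exp(eta)
            w = mu / (1.0 + phi * mu)      # IRLS weight for NB with log link
            z = eta + (yi - mu) / mu       # working response
            s_w += w
            s_wx += w * xi
            s_wxx += w * xi * xi
            s_wz += w * z
            s_wxz += w * xi * z
        det = s_w * s_wxx - s_wx * s_wx    # solve the 2x2 normal equations
        b0 = (s_wxx * s_wz - s_wx * s_wxz) / det
        b1 = (s_w * s_wxz - s_wx * s_wz) / det
    return b0, b1


def randomized_fit(x, y, phi, m, seed=0):
    """Estimate coefficients from m windows sampled uniformly (assumed scheme)."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(x)), m)
    return fit_nb_glm([x[i] for i in idx], [y[i] for i in idx], phi)


def simulate(n, b0, b1, phi, seed=1):
    """Simulate n windows of NB read counts via a gamma-Poisson mixture."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        xi = rng.uniform(-1.0, 1.0)        # hypothetical bias covariate
        mu = math.exp(b0 + b1 * xi)
        lam = rng.gammavariate(1.0 / phi, phi * mu)
        # Knuth's method for a Poisson draw with mean lam
        limit, k, p = math.exp(-lam), 0, rng.random()
        while p > limit:
            k += 1
            p *= rng.random()
        xs.append(xi)
        ys.append(k)
    return xs, ys


if __name__ == "__main__":
    xs, ys = simulate(20000, b0=2.0, b1=0.5, phi=0.1)
    print("full data :", fit_nb_glm(xs, ys, phi=0.1))
    print("10% sample:", randomized_fit(xs, ys, phi=0.1, m=2000))
```

In this toy setting the subsample fit touches a tenth of the windows per iteration, which is where the saving comes from; how the sample size trades off against estimator variance is what the abstract's consistency and variance analysis addresses.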

SUBMITTER: Wang W

PROVIDER: S-EPMC5831535 | biostudies-literature | 2018 Mar

REPOSITORIES: biostudies-literature

Publications

A randomized approach to speed up the analysis of large-scale read-count data in the application of CNV detection.

WeiBo Wang, Wei Sun, Wei Wang, Jin Szatkiewicz

BMC Bioinformatics, 2018-03-01, issue 1

PMID: 29490610

Similar Datasets

Project description: Similar to other droplet-based single-cell assays, single-nucleus ATAC-seq (snATAC-seq) data harbor multiplets that confound downstream analyses. Detecting multiplets in snATAC-seq data is particularly challenging because of data sparsity and limited dynamic range (0 reads: closed chromatin; 1: open on one parental chromosome; 2: open on both chromosomes). Yet these unique data features offer an opportunity to identify multiplets. ATAC-DoubletDetector (https://ucarlab.github.io/ATAC-DoubletDetector/), or AMULET (Atac MULtiplet Estimation Tool), exploits them by enumerating the regions with >2 uniquely aligned reads across the genome, an effective alternative to methods based on artificially generated multiplets (e.g., the state-of-the-art ArchR). For benchmarking, snATAC-seq data were generated in two primary human tissues: peripheral blood mononuclear cells (PBMCs) and pancreatic islet samples. The tool detected multiplets with high precision (estimated via donor-based multiplexing) and high recall (estimated via simulated doublets) compared to alternatives, with 0.57 precision and 0.85 recall reported. It was most effective when samples were sequenced deeply (e.g., median read count per nucleus >20K25K): in PBMCs, ATAC-DoubletDetector captured 85% of simulated doublets (i.e., recall), significantly outperforming ArchR (24%). At lower read depth, ATAC-DoubletDetector and ArchR produced complementary results. Moreover, ATAC-DoubletDetector was equally effective in identifying homotypic multiplets (i.e., multiplets from the same cell type), which are missed by simulation-based methods.
Cell-specific marker peaks enabled accurate (85%) tracing of cellular origins of snATAC-seq multiplets. Accordingly, more abundant cells within a tissue are more likely to form multiplets and the majority of multiplets are homotypic. ATAC-DoubletDetector is a fast and effective multiplet detection/annotation tool for improved single cell epigenomic data analyses across diverse biological systems and conditions.
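
The counting step described above, enumerating regions covered by more than two uniquely aligned reads, can be sketched with a simple sweep over read intervals. This is only an illustration of that one step, not the ATAC-DoubletDetector/AMULET code: the function name `count_overlap_regions`, the half-open `(start, end)` interval representation, and the restriction to one nucleus on one chromosome are assumptions, and how the resulting count is turned into a multiplet call is not covered here.

```python
def count_overlap_regions(fragments, max_expected=2):
    """Count maximal regions where read depth exceeds max_expected.

    fragments: list of half-open (start, end) intervals for one nucleus on
    one chromosome (a hypothetical input format for this sketch).
    """
    events = []
    for start, end in fragments:
        events.append((start, 1))    # a read begins: depth goes up
        events.append((end, -1))     # a read ends: depth goes down
    # Tuple sort puts -1 before +1 at equal positions, so reads that merely
    # touch end-to-start are not counted as overlapping.
    events.sort()

    depth = regions = 0
    for _, delta in events:
        previous = depth
        depth += delta
        # A new region opens each time depth crosses above the threshold.
        if previous <= max_expected < depth:
            regions += 1
    return regions


if __name__ == "__main__":
    reads = [(0, 100), (50, 150), (80, 120), (200, 300), (210, 310)]
    print(count_overlap_regions(reads))
```

With the default threshold of 2, a diploid nucleus is expected to yield few such regions, so a large count is the signal of interest in the description above.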
