Dataset Information

A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments.

ABSTRACT: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data, and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from "natural" read duplicates that represent independent DNA fragments and therefore overestimate the PCR duplication rate for DNA-seq and RNA-seq experiments. In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets that contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45-50% of read duplicates correspond to natural read duplicates, likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70-95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression, and identified outlier samples with a 2-fold greater PCR duplication rate than other samples. The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates .
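The abstract says the method uses heterozygous variants to separate natural duplicates from PCR duplicates. The sketch below illustrates one way that idea can work; it is a simplified assumption-laden illustration, not the code from the PCRduplicates repository. It assumes only duplicate clusters of exactly two reads, both covering a known heterozygous site, with both alleles equally likely to be sampled and no sequencing errors. Under those assumptions, two PCR copies of one fragment always show the same allele, while two independent fragments show different alleles about half the time, so the natural-duplicate fraction is roughly twice the observed discordant fraction. The function name `estimate_pcr_fraction` and its input format are hypothetical.

```python
def estimate_pcr_fraction(pair_alleles):
    """Estimate the fraction of duplicate read pairs that are PCR duplicates.

    pair_alleles: list of (allele_a, allele_b) tuples, one per duplicate
    read pair that overlaps a heterozygous site (hypothetical input format).

    Assumptions (not taken from the source): clusters of size two only,
    equal sampling of both alleles, no sequencing errors.
    """
    n = len(pair_alleles)
    if n == 0:
        raise ValueError("need at least one duplicate pair over a heterozygous site")

    # Pairs carrying different alleles must come from independent fragments.
    discordant = sum(1 for a, b in pair_alleles if a != b)

    # Independent fragments are discordant about half the time, so the
    # natural-duplicate fraction is about twice the discordant fraction.
    natural_fraction = min(1.0, 2.0 * discordant / n)

    # Whatever is not explained by natural duplicates is attributed to PCR.
    return 1.0 - natural_fraction
```

For example, with 100 duplicate pairs of which 20 show different alleles, this sketch attributes about 40% of the pairs to natural duplication and about 60% to PCR. A real estimator would also need to handle larger clusters, allelic imbalance, and sequencing errors.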

SUBMITTER: Bansal V

PROVIDER: S-EPMC5374682 | biostudies-literature | 2017 Mar

REPOSITORIES: biostudies-literature

Publications

A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments.

Bansal Vikas V

BMC Bioinformatics, 2017 Mar 14, Suppl 3

PMID: 28361665

Similar Datasets

A computational method to aid the design and analysis of single cell RNA-seq experiments for cell type identification. | S-EPMC6551246 | biostudies-literature

Impact of adaptive filtering on power and false discovery rate in RNA-seq experiments. | S-EPMC9509565 | biostudies-literature

A novel min-cost flow method for estimating transcript expression with RNA-Seq. | S-EPMC3622638 | biostudies-literature

iRNA-seq: computational method for genome-wide assessment of acute transcriptional regulation from total RNA-seq data. | S-EPMC4381047 | biostudies-literature

SeqOthello: querying RNA-seq experiments at scale. | S-EPMC6194578 | biostudies-literature

Vicinal: a method for the determination of ncRNA ends using chimeric reads from RNA-seq experiments. | S-EPMC4027162 | biostudies-literature

Synthetic spike-in standards for RNA-seq experiments. | S-EPMC3166838 | biostudies-literature

Guidelines for reporting single-cell RNA-seq experiments. | S-EPMC9302581 | biostudies-literature

Computational method for estimating progression saturation of analog series. | S-EPMC9078142 | biostudies-literature

Computational analysis of bacterial RNA-Seq data. | S-EPMC3737546 | biostudies-literature