Dataset Information

Primer ID Validates Template Sampling Depth and Greatly Reduces the Error Rate of Next-Generation Sequencing of HIV-1 Genomic RNA Populations.

ABSTRACT: Validating the sampling depth and reducing sequencing errors are critical for studies of viral populations using next-generation sequencing (NGS). We previously described the use of Primer ID to tag each viral RNA template with a block of degenerate nucleotides in the cDNA primer. We now show that low-abundance Primer IDs (offspring Primer IDs) are generated due to PCR/sequencing errors. These artifactual Primer IDs can be removed using a cutoff model for the number of reads required to make a template consensus sequence. We have modeled the fraction of sequences lost due to Primer ID resampling. For a typical sequencing run, less than 10% of the raw reads are lost to offspring Primer ID filtering and resampling. The remaining raw reads are used to correct for PCR resampling and sequencing errors. We also demonstrate that Primer ID reveals bias intrinsic to PCR, especially at low template input or utilization. cDNA synthesis and PCR convert ca. 20% of RNA templates into recoverable sequences, and 30-fold sequence coverage recovers most of these template sequences. We have directly measured the residual error rate to be around 1 in 10,000 nucleotides. We use this error rate and the Poisson distribution to define the cutoff to identify preexisting drug resistance mutations at low abundance in an HIV-infected subject. Collectively, these studies show that >90% of the raw sequence reads can be used to validate template sampling depth and to dramatically reduce the error rate in assessing a genetically diverse viral population using NGS.

IMPORTANCE: Although next-generation sequencing (NGS) has revolutionized sequencing strategies, it suffers from serious limitations in defining sequence heterogeneity in a genetically diverse population, such as HIV-1, due to PCR resampling and PCR/sequencing errors. The Primer ID approach reveals the true sampling depth and greatly reduces errors. Knowing the sampling depth allows the construction of a model of how to maximize the recovery of sequences from input templates and to reduce resampling of the Primer ID so that appropriate multiplexing can be included in the experimental design. With the defined sampling depth and measured error rate, we are able to assign cutoffs for the accurate detection of minority variants in viral populations. This approach allows the power of NGS to be realized without having to guess about sampling depth or to ignore the problem of PCR resampling, while also being able to correct most of the errors in the data set.
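
The Poisson-based cutoff for minority variants can be illustrated with a minimal sketch. It assumes that residual errors at a given nucleotide position occur independently across template consensus sequences, so that the number of erroneous consensus sequences at that position follows a Poisson distribution whose mean is the number of template consensus sequences times the residual error rate (about 1 in 10,000, as reported above). The function names and the 0.05 significance level are hypothetical choices made for illustration; this is not the authors' published procedure.

```python
import math


def poisson_upper_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    cdf_below_k = sum(
        math.exp(-lam) * lam**i / math.factorial(i) for i in range(k)
    )
    return 1.0 - cdf_below_k


def min_variant_count(n_templates, error_rate=1e-4, alpha=0.05):
    """Smallest count of template consensus sequences carrying a variant
    that residual error alone would produce with probability below alpha.

    Assumptions (not from the source): errors are independent across
    templates, and alpha = 0.05 is an arbitrary illustrative threshold.
    """
    expected_errors = n_templates * error_rate  # Poisson mean
    k = 1
    while poisson_upper_tail(k, expected_errors) >= alpha:
        k += 1
    return k


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        print(n, min_variant_count(n))
```

Under these assumptions, a variant would need to appear in at least 2 of 1,000 template consensus sequences, or at least 4 of 10,000, before residual error becomes an unlikely explanation.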

SUBMITTER: Zhou S

PROVIDER: S-EPMC4524263 | biostudies-literature | 2015 Aug

REPOSITORIES: biostudies-literature


Publications

Primer ID Validates Template Sampling Depth and Greatly Reduces the Error Rate of Next-Generation Sequencing of HIV-1 Genomic RNA Populations.

Shuntai Zhou, Corbin Jones, Piotr Mieczkowski, Ronald Swanstrom

Journal of Virology, 2015-06-03, issue 16


PMID: 26041299

Similar Datasets

Project description: BACKGROUND: Clinical implementation of Next-Generation Sequencing (NGS) is challenged by poor control for stochastic sampling, library preparation biases and qualitative sequencing error. To address these challenges we developed and tested two hypotheses.

METHODS: Hypothesis 1: Analytical variation in quantification is predicted by stochastic sampling effects at input of a) amplifiable nucleic acid target molecules into the library preparation, b) amplicons from library into sequencer, or c) both. We derived equations using Monte Carlo simulation to predict assay coefficient of variation (CV) based on these three working models and tested them against NGS data from specimens with well characterized molecule inputs and sequence counts prepared using a competitive multiplex-PCR amplicon-based NGS library preparation method comprising synthetic internal standards (IS). Hypothesis 2: Frequencies of technically-derived qualitative sequencing errors (i.e., base substitution, insertion and deletion) observed at each base position in each target native template (NT) are concordant with those observed in respective competitive synthetic IS present in the same reaction. We measured error frequencies at each base position within amplicons from each of 30 target NT, then tested whether they correspond to those within the 30 respective IS.

RESULTS: For hypothesis 1, the Monte Carlo model derived from both sampling events best predicted CV and explained 74% of observed assay variance. For hypothesis 2, observed frequency and type of sequence variation at each base position within each IS was concordant with that observed in respective NTs (R² = 0.93).

CONCLUSION: In targeted NGS, synthetic competitive IS control for stochastic sampling at input of both target into library preparation and of target library product into sequencer, and control for qualitative errors generated during library preparation and sequencing. These controls enable accurate clinical diagnostic reporting of confidence limits and limit of detection for copy number measurement, and of frequency for each actionable mutation.
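
The idea of predicting CV from two successive sampling events can be sketched with a small Monte Carlo simulation. This is a deliberately simplified illustration, not the equations derived in the study: it assumes the number of target molecules entering library preparation is Poisson-distributed, and that the number of reads obtained is Poisson-distributed around a fixed, hypothetical `reads_per_molecule` yield. Under those assumptions the squared CV is the sum of the two sampling terms, 1/(mean molecules) + 1/(mean reads), which the simulation should reproduce. All names and parameter values are invented for the example.

```python
import math
import random
import statistics


def poisson_draw(lam, rng):
    """Draw from Poisson(lam) with Knuth's multiplication method.

    Adequate for modest lam (well below ~700, where exp(-lam) underflows).
    """
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def simulate_cv(mean_molecules, reads_per_molecule, trials=4000, seed=0):
    """Monte Carlo estimate of read-count CV under two sampling stages."""
    rng = random.Random(seed)
    counts = []
    for _ in range(trials):
        # Stage a: molecules sampled into the library preparation.
        molecules = poisson_draw(mean_molecules, rng)
        # Stage b: library product sampled into the sequencer.
        counts.append(poisson_draw(molecules * reads_per_molecule, rng))
    return statistics.pstdev(counts) / statistics.fmean(counts)


def analytic_cv(mean_molecules, reads_per_molecule):
    """Closed-form CV for the same simplified two-stage Poisson model."""
    mean_reads = mean_molecules * reads_per_molecule
    return math.sqrt(1.0 / mean_molecules + 1.0 / mean_reads)


if __name__ == "__main__":
    print("simulated:", round(simulate_cv(50, 4), 3))
    print("analytic: ", round(analytic_cv(50, 4), 3))
```

In this toy model, the molecule-input term dominates whenever reads per molecule is well above 1, so additional sequencing depth cannot recover precision lost to a low template input.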

Project description: The polymerase chain reaction (PCR) is sensitive to mismatches between primer and template, and mismatches can lead to inefficient amplification of targeted regions of DNA template. In PCRs in which a degenerate primer pool is employed, each primer can behave differently. Therefore, inefficiencies due to different primer melting temperatures within a degenerate primer pool, in addition to mismatches between primer binding sites and primers, can lead to a distortion of the true relative abundance of targets in the original DNA pool. A theoretical analysis indicated that a combination of primer-template and primer-amplicon interactions during PCR cycles 3-12 is potentially responsible for this distortion. To test this hypothesis, we developed a novel amplification strategy, entitled "Polymerase-exonuclease (PEX) PCR", in which primer-template interactions and primer-amplicon interactions are separated. The PEX PCR method substantially and significantly improved the evenness of recovery of sequences from a mock community of known composition, and allowed for amplification of templates with introduced mismatches near the 3' end of the primer annealing sites. When the PEX PCR method was applied to genomic DNA extracted from complex environmental samples, a significant shift in the observed microbial community was detected. Furthermore, the PEX PCR method provides a mechanism to identify which primers in a primer pool are annealing to target gDNA. Primer utilization patterns revealed that at high annealing temperatures in the PEX PCR method, perfect match annealing predominates, while at lower annealing temperatures, primers with up to four mismatches with templates can contribute substantially to amplification. The PEX PCR method is simple to perform, is limited to PCR mixes and a single exonuclease step which can be performed without reaction cleanup, and is recommended for reactions in which degenerate primer pools are used or when mismatches between primers and template are possible.
