Dataset Information

Distinguishing low frequency mutations from RT-PCR and sequence errors in viral deep sequencing data.

ABSTRACT: RNA viruses have high mutation rates and exist within their hosts as large, complex and heterogeneous populations, comprising a spectrum of related but non-identical genome sequences. Next generation sequencing is revolutionising the study of viral populations by enabling the ultra deep sequencing of their genomes, and the subsequent identification of the full spectrum of variants within the population. Identification of low frequency variants is important for our understanding of mutational dynamics, disease progression, immune pressure, and for the detection of drug resistant or pathogenic mutations. However, the current challenge is to accurately model the errors in the sequence data and distinguish real viral variants, particularly those that exist at low frequency, from errors introduced during sequencing and sample processing, which can both be substantial.

We have created a novel set of laboratory control samples that are derived from a plasmid containing a full-length viral genome with extremely limited diversity in the starting population. One sample was sequenced without PCR amplification whilst the other samples were subjected to increasing amounts of RT and PCR amplification prior to ultra-deep sequencing. This enabled the level of error introduced by the RT and PCR processes to be assessed and minimum frequency thresholds to be set for true viral variant identification. We developed a genome-scale computational model of the sample processing and NGS calling process to gain a detailed understanding of the errors at each step, which predicted that RT and PCR errors are more likely to occur at some genomic sites than others. The model can also be used to investigate whether the number of observed mutations at a given site of interest is greater than would be expected from processing errors alone in any NGS data set. After providing basic sample processing information and the site's coverage and quality scores, the model utilises the fitted RT-PCR error distributions to simulate the number of mutations that would be observed from processing errors alone.

These data sets and models provide an effective means of separating true viral mutations from those erroneously introduced during sample processing and sequencing.
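The site-level test described in the abstract (simulate the mutation count expected from processing errors alone, then compare it with the observed count) can be illustrated with a minimal sketch. This is not the authors' model: it assumes a single constant per-read RT-PCR error probability (the hypothetical parameter `rt_pcr_error_rate`) in place of the fitted RT-PCR error distributions, converts each read's Phred quality score to a sequencing error probability with the standard formula 10^(-Q/10), and counts every error as a mismatch. All function names are illustrative.

```python
import random


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to a base-call error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)


def simulate_error_counts(quals, rt_pcr_error_rate, n_sims=2000, seed=0):
    """Simulate mismatch counts at one site that arise from errors alone.

    quals: Phred score of each read covering the site (coverage = len(quals)).
    rt_pcr_error_rate: ASSUMED constant per-read probability of an RT-PCR
        error at this site (a stand-in for fitted error distributions).
    Returns a list with one simulated mismatch count per simulation.
    """
    rng = random.Random(seed)
    # Probability that a read shows an error from either source,
    # treating sequencing and RT-PCR errors as independent.
    probs = [
        1 - (1 - phred_to_error_prob(q)) * (1 - rt_pcr_error_rate)
        for q in quals
    ]
    counts = []
    for _ in range(n_sims):
        counts.append(sum(1 for p in probs if rng.random() < p))
    return counts


def empirical_p_value(observed, simulated):
    """Fraction of simulations with at least as many mismatches as observed.

    Adding 1 to numerator and denominator keeps the estimate above zero.
    """
    exceed = sum(1 for c in simulated if c >= observed)
    return (1 + exceed) / (1 + len(simulated))


if __name__ == "__main__":
    # Hypothetical site: 2,000 reads, half at Q30 and half at Q20.
    quals = [30] * 1000 + [20] * 1000
    sims = simulate_error_counts(quals, rt_pcr_error_rate=1e-4)
    print(empirical_p_value(observed=40, simulated=sims))
```

A small p-value suggests that the observed count exceeds what this simplified error process would produce. A site-specific error model, as the abstract describes, would replace the constant rate with values that vary across the genome.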

SUBMITTER: Orton RJ

PROVIDER: S-EPMC4425905 | biostudies-literature | 2015 Mar

REPOSITORIES: biostudies-literature

Publications

Distinguishing low frequency mutations from RT-PCR and sequence errors in viral deep sequencing data.

Richard J Orton, Caroline F Wright, Marco J Morelli, David J King, David J Paton, Donald P King, Daniel T Haydon

BMC Genomics, 2015 Mar 24

PMID: 25886445

Similar Datasets

Project description: Wastewater surveillance for pathogens using reverse transcription-polymerase chain reaction (RT-PCR) is an effective and resource-efficient tool for gathering community-level public health information, including the incidence of coronavirus disease-19 (COVID-19). Surveillance of Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) in wastewater can potentially provide an early warning signal of COVID-19 infections in a community. The capacity of the world's environmental microbiology and virology laboratories for SARS-CoV-2 RNA characterization in wastewater is increasing rapidly. However, there are no standardized protocols or harmonized quality assurance and quality control (QA/QC) procedures for SARS-CoV-2 wastewater surveillance. This paper is a technical review of factors that can cause false-positive and false-negative errors in the surveillance of SARS-CoV-2 RNA in wastewater, culminating in recommended strategies that can be implemented to identify and mitigate some of these errors. Recommendations include stringent QA/QC measures, representative sampling approaches, effective virus concentration and efficient RNA extraction, PCR inhibition assessment, inclusion of sample processing controls, and considerations for RT-PCR assay selection and data interpretation. Clear data interpretation guidelines (e.g., determination of positive and negative samples) are critical, particularly when the incidence of SARS-CoV-2 in wastewater is low. Corrective and confirmatory actions must be in place for inconclusive results or results diverging from current trends (e.g., initial onset or reemergence of COVID-19 in a community). It is also prudent to perform interlaboratory comparisons to ensure results' reliability and interpretability for prospective and retrospective analyses. The strategies that are recommended in this review aim to improve SARS-CoV-2 characterization and detection for wastewater surveillance applications.
A silver lining of the COVID-19 pandemic is that the efficacy of wastewater surveillance continues to be demonstrated during this global crisis. In the future, wastewater should also play an important role in the surveillance of a range of other communicable diseases.
