Dataset Information

Efficient error correction for next-generation sequencing of viral amplicons.

ABSTRACT:

Background

Next-generation sequencing allows the analysis of an unprecedented number of viral sequence variants from infected patients, presenting a novel opportunity for understanding virus evolution, drug resistance and immune escape. However, sequencing in bulk is error prone. Thus, the generated data require error identification and correction. Most error-correction methods to date are not optimized for amplicon analysis and assume that the error rate is randomly distributed. Recent quality assessment of amplicon sequences obtained using 454-sequencing showed that the error rate is strongly linked to the presence and size of homopolymers, position in the sequence and length of the amplicon. All these parameters are strongly sequence specific and should be incorporated into the calibration of error-correction algorithms designed for amplicon sequencing.
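Because the abstract ties error rates to the presence and size of homopolymers, it can help to see how such runs are located in a read. The following is a minimal sketch, not part of the published methods: the function name `homopolymer_runs` and the `min_len` cutoff are assumptions chosen for illustration.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for every run of identical bases
    that is at least min_len long. min_len is an illustrative cutoff."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = sum(1 for _ in group)  # size of this run of identical bases
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


if __name__ == "__main__":
    # Hypothetical read, used only to demonstrate the function.
    print(homopolymer_runs("ACGTTTTGAAAC"))
```

Positions reported this way could be used to flag the regions of an amplicon where, according to the abstract, 454 reads are most likely to contain errors.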

Results

In this paper, we present two new efficient error correction algorithms optimized for viral amplicons: (i) k-mer-based error correction (KEC) and (ii) empirical frequency threshold (ET). Both were compared to a previously published clustering algorithm (SHORAH), in order to evaluate their relative performance on 24 experimental datasets obtained by 454-sequencing of amplicons with known sequences. All three algorithms show similar accuracy in finding true haplotypes. However, KEC and ET were significantly more efficient than SHORAH in removing false haplotypes and estimating the frequency of true ones.
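The abstract does not describe how KEC and ET work internally, so the sketch below only illustrates the two general ideas their names suggest: counting k-mers across reads, and discarding haplotypes whose observed frequency falls below a cutoff. It is not the authors' implementation. The function names, the value of `k`, and the `min_fraction` threshold are all assumptions; in particular, a fixed cutoff stands in for whatever empirically calibrated threshold ET uses.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer over all reads. Rare k-mers are candidate errors."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def filter_haplotypes(reads, min_fraction):
    """Keep distinct reads (haplotypes) whose share of all reads is at
    least min_fraction. min_fraction is a hypothetical fixed cutoff."""
    total = len(reads)
    counts = Counter(reads)
    return {hap: n / total for hap, n in counts.items() if n / total >= min_fraction}


if __name__ == "__main__":
    # Hypothetical toy data: nine identical reads and one variant read.
    reads = ["ACGT"] * 9 + ["ACGA"]
    print(kmer_counts(reads, 3))
    print(filter_haplotypes(reads, 0.2))
```

In this toy input the variant read contributes a k-mer seen only once, and the variant haplotype falls below the cutoff, which is the kind of low-frequency signal an error-correction method would need to separate from true minor variants.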

Conclusions

Both algorithms, KEC and ET, are highly suitable for rapid recovery of error-free haplotypes obtained by 454-sequencing of amplicons from heterogeneous viruses. The implementations of the algorithms and the data sets used for their testing are available at: http://alan.cs.gsu.edu/NGS/?q=content/pyrosequencing-error-correction-algorithm.

SUBMITTER: Skums P

PROVIDER: S-EPMC3382444 | biostudies-literature | 2012 Jun

REPOSITORIES: biostudies-literature

Publications

Efficient error correction for next-generation sequencing of viral amplicons.

Pavel Skums, Zoya Dimitrova, David S. Campo, Gilberto Vaughan, Livia Rossi, Joseph C. Forbi, Jonny Yokosawa, Alex Zelikovsky, Yury Khudyakov

BMC Bioinformatics, 2012-06-25

PMID: 22759430

Similar Datasets

Project description: The analysis of HIV-1 sequences has helped to understand the viral molecular epidemiology, monitor the development of antiretroviral drug resistance, and design candidate vaccines. The introduction of single genome amplification (SGA) has been a major advancement in the field, allowing for the characterization of multiple sequences per patient while preserving linkage among polymorphisms in the same viral genome copy. Sequencing of SGA amplicons is performed by capillary Sanger sequencing, which has low throughput, requires a large amount of template, and is highly sensitive to template/primer mismatching. In order to meet the increasing demand for HIV-1 SGA amplicon sequencing, we have developed a platform based on benchtop next-generation sequencing (NGS) (IonTorrent) accompanied by a bioinformatics pipeline capable of running on computer resources commonly available at research laboratories. During assay validation, the NGS-based sequencing of 10 HIV-1 env SGA amplicons was fully concordant with Sanger sequencing. The field test was conducted on plasma samples from 10 US Navy and Marine service members with recent HIV-1 infection (sampling interval: 2005-2010; plasma viral load: 5,884-194,984 copies/ml). The NGS analysis of 101 SGA amplicons (median: 10 amplicons/individual) showed within-individual viral sequence profiles expected in individuals at this disease stage, including individuals with highly homogeneous quasispecies, individuals with two highly homogeneous viral lineages, and individuals with heterogeneous viral populations. In a scalability assessment using the Ion Chef automated system, 41/43 tested env SGA amplicons (95%) multiplexed on a single Ion 318 chip showed consistent gene-wide coverage >50×. With lower sample requirements and higher throughput, this approach is suitable to support the increasing demand for high-quality and cost-effective HIV-1 sequences in fields such as molecular epidemiology and the development of preventive and therapeutic strategies.
