Dataset Information

Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver.

ABSTRACT: Studying the evolution of viruses and their molecular epidemiology relies on accurate viral sequence data, so that small differences between similar viruses can be meaningfully interpreted. Despite its higher throughput and more detailed minority variant data, next-generation sequencing has yet to be widely adopted for HIV. The difficulty of accurately reconstructing the consensus sequence of a quasispecies from reads (short fragments of DNA) in the presence of large between- and within-host diversity, including frequent indels, may have presented a barrier. In particular, mapping (aligning) reads to a reference sequence leads to biased loss of information; this bias can distort epidemiological and evolutionary conclusions. De novo assembly avoids this bias by aligning the reads to themselves, producing a set of sequences called contigs. However, contigs provide only a partial summary of the reads, misassembly may result in their having an incorrect structure, and no information is available at parts of the genome where contigs could not be assembled. To address these problems we developed the tool shiver to pre-process reads for quality and contamination, then map them to a reference tailored to the sample using corrected contigs supplemented with the user's choice of existing reference sequences. Run with two commands per sample, it can easily be used for large heterogeneous data sets. We used shiver to reconstruct the consensus sequence and minority variant information from paired-end short-read whole-genome data produced with the Illumina platform, for sixty-five existing publicly available samples and fifty new samples. We show the systematic superiority of mapping to shiver's constructed reference compared with mapping the same reads to the closest of 3,249 real references: median values of 13 bases called differently and more accurately, 0 bases called differently and less accurately, and 205 bases of missing sequence recovered. We also successfully applied shiver to whole-genome samples of Hepatitis C Virus and Respiratory Syncytial Virus. shiver is publicly available from https://github.com/ChrisHIV/shiver.
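As a rough illustration of what "consensus sequence and minority variant information" means once reads have been mapped, the following is a minimal Python sketch. It is not shiver's implementation: the function name `call_consensus`, the toy pileup representation (one string of observed bases per genome position), and the `min_depth` and `minority_threshold` parameters are all assumptions made for this example.

```python
from collections import Counter


def call_consensus(pileup, min_depth=2, minority_threshold=0.05):
    """Call a consensus base and minority variants at each position.

    pileup: list of strings; element i holds every base observed in
            reads covering position i (a hypothetical toy format).
    Returns (consensus_string, {position: {base: frequency}}).
    """
    consensus = []
    minority = {}
    for pos, bases in enumerate(pileup):
        counts = Counter(bases)
        depth = sum(counts.values())
        if depth < min_depth:
            # Too few reads to trust a call: report missing sequence.
            consensus.append("N")
            continue
        # Most frequent base; ties resolve to the alphabetically first base.
        top_base, _ = max(sorted(counts.items()), key=lambda kv: kv[1])
        consensus.append(top_base)
        # Record non-consensus bases at or above the frequency threshold.
        variants = {
            base: n / depth
            for base, n in counts.items()
            if base != top_base and n / depth >= minority_threshold
        }
        if variants:
            minority[pos] = variants
    return "".join(consensus), minority


if __name__ == "__main__":
    seq, variants = call_consensus(["AAAA", "AAAC", "G", ""])
    print(seq, variants)
```

Positions with too little coverage come out as `N` in this sketch, which corresponds to the "missing sequence" the abstract refers to; a better-matched reference lets more reads map and so leaves fewer such positions.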

SUBMITTER: Wymant C

PROVIDER: S-EPMC5961307 | biostudies-literature | 2018 Jan

REPOSITORIES: biostudies-literature

Publications

Easy and accurate reconstruction of whole HIV genomes from short-read sequence data with shiver.

Chris Wymant, François Blanquart, Tanya Golubchik, Astrid Gall, Margreet Bakker, Daniela Bezemer, Nicholas J Croucher, Matthew Hall, Mariska Hillebregt, Swee Hoe Ong, Oliver Ratmann, Jan Albert, Norbert Bannert, Jacques Fellay, Katrien Fransen, Annabelle Gourlay, M Kate Grabowski, Barbara Gunsenheimer-Bartmeyer, Huldrych F Günthard, Pia Kivelä, Roger Kouyos, Oliver Laeyendecker, Kirsi Liitsola, Laurence Meyer, Kholoud Porter, Matti Ristola, Ard van Sighem, Ben Berkhout, Marion Cornelissen, Paul Kellam, Peter Reiss, Christophe Fraser

Virus Evolution, 2018-01-01, issue 1

PMID: 29876136

Similar Datasets

Project description:

Background: Insertion sequences (IS) are small transposable elements, commonly found in bacterial genomes. Identifying the location of IS in bacterial genomes can be useful for a variety of purposes including epidemiological tracking and predicting antibiotic resistance. However IS are commonly present in multiple copies in a single genome, which complicates genome assembly and the identification of IS insertion sites. Here we present ISMapper, a mapping-based tool for identification of the site and orientation of IS insertions in bacterial genomes, directly from paired-end short read data.

Results: ISMapper was validated using three types of short read data: (i) simulated reads from a variety of species, (ii) Illumina reads from 5 isolates for which finished genome sequences were available for comparison, and (iii) Illumina reads from 7 Acinetobacter baumannii isolates for which predicted IS locations were tested using PCR. A total of 20 genomes, including 13 species and 32 distinct IS, were used for validation. ISMapper correctly identified 97 % of known IS insertions in the analysis of simulated reads, and 98 % in real Illumina reads. Subsampling of real Illumina reads to lower depths indicated ISMapper was able to correctly detect insertions for average genome-wide read depths >20x, although read depths >50x were required to obtain confident calls that were highly-supported by evidence from reads. All ISAba1 insertions identified by ISMapper in the A. baumannii genomes were confirmed by PCR. In each A. baumannii genome, ISMapper successfully identified an IS insertion upstream of the ampC beta-lactamase that could explain phenotypic resistance to third-generation cephalosporins. The utility of ISMapper was further demonstrated by profiling genome-wide IS6110 insertions in 138 publicly available Mycobacterium tuberculosis genomes, revealing lineage-specific insertions and multiple insertion hotspots.

Conclusions: ISMapper provides a rapid and robust method for identifying IS insertion sites directly from short read data, with a high degree of accuracy demonstrated across a wide range of bacteria.
