Project description:The sequential paradigm of data acquisition and analysis in next-generation sequencing leads to high turnaround times for the generation of interpretable results. We combined a novel real-time read mapping algorithm with fast variant calling to obtain reliable variant calls still during the sequencing process. Thereby, our new algorithm allows for accurate read mapping results for intermediate cycles and supports large reference genomes such as the complete human reference. This enables the combination of real-time read mapping results with complex follow-up analysis. In this study, we showed the accuracy and scalability of our approach by applying real-time read mapping and variant calling to seven publicly available human whole exome sequencing datasets. Thereby, up to 89% of all detected SNPs were already identified after 40 sequencing cycles while showing similar precision as at the end of sequencing. Final results showed similar accuracy to those of conventional post-hoc analysis methods. When compared to standard routines, our live approach enables considerably faster interventions in clinical applications and infectious disease outbreaks. Besides variant calling, our approach can be adapted for a plethora of other mapping-based analyses.
Project description:Multi-sample pooling and Illumina Genome Analyzer (GA) sequencing allows high throughput sequencing of multiple samples to determine population sequence variation. A preliminary experiment, using the RET proto-oncogene as a model, predicted ? 30 samples could be pooled to reliably detect singleton variants without requiring additional confirmation testing. This report used 30 and 50 sample pools to test the hypothesized pooling limit and also to test recent protocol improvements, Illumina GAIIx upgrades, and longer read chemistry. The SequalPrep(TM) method was used to normalize amplicons before pooling. For comparison, a single 'control' sample was run in a different flow cell lane. Data was evaluated by variant read percentages and the subtractive correction method which utilizes the control sample. In total, 59 variants were detected within the pooled samples, which included all 47 known true variants. The 15 known singleton variants due to Sanger sequencing had an average of 1.62 ± 0.26% variant reads for the 30 pool (expected 1.67% for a singleton variant [unique variant within the pool]) and 1.01 ± 0.19% for the 50 pool (expected 1%). The 76 base read lengths had higher error rates than shorter read lengths (33 and 50 base reads), which eliminated the distinction of true singleton variants from background error. This report demonstrated pooling limits from 30 up to 50 samples (depending on error rates and coverage), for reliable singleton variant detection. The presented pooling protocols and analysis methods can be used for variant discovery in other genes, facilitating molecular diagnostic test design and interpretation.
Project description:MotivationNext-generation deep sequencing of viral genomes, particularly on the Illumina platform, is increasingly applied in HIV research. Yet, there is no standard protocol or method used by the research community to account for measurement errors that arise during sample preparation and sequencing. Correctly calling high and low-frequency variants while controlling for erroneous variants is an important precursor to downstream interpretation, such as studying the emergence of HIV drug-resistance mutations, which in turn has clinical applications and can improve patient care.ResultsWe developed a new variant-calling pipeline, hivmmer, for Illumina sequences from HIV viral genomes. First, we validated hivmmer by comparing it to other variant-calling pipelines on real HIV plasmid datasets. We found that hivmmer achieves a lower rate of erroneous variants, and that all methods agree on the frequency of correctly called variants. Next, we compared the methods on an HIV plasmid dataset that was sequenced using Primer ID, an amplicon-tagging protocol, which is designed to reduce errors and amplification bias during library preparation. We show that the Primer ID consensus exhibits fewer erroneous variants compared to the variant-calling pipelines, and that hivmmer more closely approaches this low error rate compared to the other pipelines. The frequency estimates from the Primer ID consensus do not differ significantly from those of the variant-calling pipelines.Availability and implementationhivmmer is freely available for non-commercial use from https://github.com/kantorlab/hivmmer.Supplementary informationSupplementary data are available at Bioinformatics online.
Project description:The ongoing COVID-19 pandemic, caused by SARS-CoV-2, constitutes a tremendous global health issue. Continuous monitoring of the virus has become a cornerstone to make rational decisions on implementing societal and sanitary measures to curtail the virus spread. Additionally, emerging SARS-CoV-2 variants have increased the need for genomic surveillance to detect particular strains because of their potentially increased transmissibility, pathogenicity and immune escape. Targeted SARS-CoV-2 sequencing of diagnostic and wastewater samples has been explored as an epidemiological surveillance method for the competent authorities. Currently, only the consensus genome sequence of the most abundant strain is taken into consideration for analysis, but multiple variant strains are now circulating in the population. Consequently, in diagnostic samples, potential co-infection(s) by several different variants can occur or quasispecies can develop during an infection in an individual. In wastewater samples, multiple variant strains will often be simultaneously present. Currently, quality criteria are mainly available for constructing the consensus genome sequence, and some guidelines exist for the detection of co-infections and quasispecies in diagnostic samples. The performance of detection and quantification of low-frequency variants using whole genome sequencing (WGS) of SARS-CoV-2 remains largely unknown. Here, we evaluated the detection and quantification of mutations present at low abundances using the mutations defining the SARS-CoV-2 lineage B.1.1.7 (alpha variant) as a case study. Real sequencing data were in silico modified by introducing mutations of interest into raw wild-type sequencing data, or by mixing wild-type and mutant raw sequencing data, to construct mixed samples subjected to WGS using a tiling amplicon-based targeted metagenomics approach and Illumina sequencing. As anticipated, higher variation and lower sensitivity were observed at lower coverages and allelic frequencies. We found that detection of all low-frequency variants at an abundance of 10, 5, 3, and 1%, requires at least a sequencing coverage of 250, 500, 1500, and 10,000×, respectively. Although increasing variability of estimated allelic frequencies at decreasing coverages and lower allelic frequencies was observed, its impact on reliable quantification was limited. This study provides a highly sensitive low-frequency variant detection approach, which is publicly available at https://galaxy.sciensano.be, and specific recommendations for minimum sequencing coverages to detect clade-defining mutations at certain allelic frequencies. This approach will be useful to detect and quantify low-frequency variants in both diagnostic (e.g., co-infections and quasispecies) and wastewater [e.g., multiple variants of concern (VOCs)] samples.
Project description:BACKGROUND: Resequencing of deafness related genes using GS FLX massive parallel sequencing of PCR amplicons spanning selected genes has previously been reported as a successful strategy to discover causal variants. The amplicon lengths were designed to be smaller than the sequencing read length of GS FLX technology, but are longer than Illumina sequencing technology read lengths. Fragmentation is thus required to sequence these amplicons using high throughput Illumina technology. METHODS: We performed Illumina sequencing in 4 patients on 563 multiplexed amplicons covering the exons of 15 genes involved in the hearing process. After exploring several fragmentation strategies, the amplicons were fragmented using Covaris sonication prior to library preparation. CLC genomic workbench was used to analyze the data. RESULTS: We achieve an excellent coverage with more than 99% of the amplicons bases covered. All variants that were previously validated using Sanger sequencing, were also called in this study. Variant calling revealed less false positive and false negative results compared to the previous study. For each patient, several variants were found that are reported by ClinVar as possible hearing loss variants. CONCLUSION: Migration from GS FLX amplicon sequencing to Illumina amplicon sequencing is straightforward and leads to more accurate results.