Dataset Information

Optimized Next-Generation Sequencing Genotype-Haplotype Calling for Genome Variability Analysis.

ABSTRACT: The accurate estimation of nucleotide variability using next-generation sequencing data is challenged by the high number of sequencing errors produced by new sequencing technologies, especially for nonmodel species, where reference sequences may not be available and the read depth may be low due to limited budgets. The most popular single-nucleotide polymorphism (SNP) callers are designed to obtain a high SNP recovery and low false discovery rate but are not designed to account appropriately the frequency of the variants. Instead, algorithms designed to account for the frequency of SNPs give precise results for estimating the levels and the patterns of variability. These algorithms are focused on the unbiased estimation of the variability and not on the high recovery of SNPs. Here, we implemented a fast and optimized parallel algorithm that includes the method developed by Roesti et al and Lynch, which estimates the genotype of each individual at each site, considering the possibility to call both bases from the genotype, a single one or none. This algorithm does not consider the reference and therefore is independent of biases related to the reference nucleotide specified. The pipeline starts from a BAM file converted to pileup or mpileup format and the software outputs a FASTA file. The new program not only reduces the running times but also, given the improved use of resources, it allows its usage with smaller computers and large parallel computers, expanding its benefits to a wider range of researchers. The output file can be analyzed using software for population genetics analysis, such as the R library PopGenome, the software VariScan, and the program mstatspop for analysis considering positions with missing data.

SUBMITTER: Navarro J

PROVIDER: S-EPMC5582667 | biostudies-literature | 2017

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Optimized Next-Generation Sequencing Genotype-Haplotype Calling for Genome Variability Analysis.

Navarro Javier J Nevado Bruno B Hernández Porfidio P Vera Gonzalo G Ramos-Onsins Sebastián E SE

Evolutionary bioinformatics online 20170824

The accurate estimation of nucleotide variability using next-generation sequencing data is challenged by the high number of sequencing errors produced by new sequencing technologies, especially for nonmodel species, where reference sequences may not be available and the read depth may be low due to limited budgets. The most popular single-nucleotide polymorphism (SNP) callers are designed to obtain a high SNP recovery and low false discovery rate but are not designed to account appropriately the ...[more]

PMID: 28894353

Similar Datasets

Project description:A fundamental challenge in analyzing next-generation sequencing (NGS) data is to determine an individual's genotype accurately, as the accuracy of the inferred genotype is essential to downstream analyses. Correctly estimating the base-calling error rate is critical to accurate genotype calls. Phred scores that accompany each call can be used to decide which calls are reliable. Some genotype callers, such as GATK and SAMtools, directly calculate the base-calling error rates from phred scores or recalibrated base quality scores. Others, such as SeqEM, estimate error rates from the read data without using any quality scores. It is also a common quality control procedure to filter out reads with low phred scores. However, choosing an appropriate phred score threshold is problematic as a too high threshold may lose data, while a too low threshold may introduce errors. We propose a new likelihood-based genotype-calling approach that exploits all reads and estimates the per-base error rates by incorporating phred scores through a logistic regression model. The approach, which we call PhredEM, uses the expectation-maximization (EM) algorithm to obtain consistent estimates of genotype frequencies and logistic regression parameters. It also includes a simple, computationally efficient screening algorithm to identify loci that are estimated to be monomorphic, so that only loci estimated to be nonmonomorphic require application of the EM algorithm. Like GATK, PhredEM can be used together with a linkage-disequilibrium-based method such as Beagle, which can further improve genotype calling as a refinement step. We evaluate the performance of PhredEM using both simulated data and real sequencing data from the UK10K project and the 1000 Genomes project. The results demonstrate that PhredEM performs better than either GATK or SeqEM, and that PhredEM is an improved, robust, and widely applicable genotype-calling approach for NGS studies. The relevant software is freely available.

Project description:Targeted genomic sequencing (TS) greatly benefits precision oncology by rapidly detecting genetic variations with better accuracy and sensitivity owing to its high sequencing depth. Multiple sequencing platforms and variant calling tools are available for TS, making it excruciating for researchers to choose. Therefore, benchmarking study across different platforms and pipelines available for TS is imperative. In this study, we performed a TS of Reference OncoSpan FFPE (HD832) sample enriched by TSO500 panel using four commercially available sequencers, and analyzed the output 50 datasets using five commonly-used bioinformatics pipelines. We systematically investigated the sequencing quality and variant detection sensitivity, expecting to provide optimal recommendations for future research. Four sequencing platforms returned highly concordant results in terms of base quality (Q20 > 94%), sequencing coverage (>97%) and depth (>2000×). Benchmarking revealed good concordance of variant calling across different platforms and pipelines, among which, FASTASeq 300 platform showed the highest sensitivity (100%) and precision (100%) in high-confidence variants calling when analyzed by SNVer and VarScan 2 algorithms. Furthermore, this sequencer demonstrated the shortest sequencing time (∼21 h) at the sequencing mode PE150. Through the intersection of 50 datasets generated in this study, we recommended a novel set of variant genes outside the truth set published by HD832, expecting to replenish HD832 for future research on tumor variant diagnosis. Besides, we applied these five tools to another panel (TargetSeq One) for Twist cfDNA Pan-cancer Reference Standard, comprehensive consideration of SNP and InDel sensitivity, SNVer and VarScan 2 performed best among them. Furthermore, SNVer and VarScan 2 also performed best for six cancer cell lines samples regarding SNP and InDel sensitivity. Considering the dissimilarity of variant calls across different pipelines for datasets from the same platform, we recommended an integration of multiple tools to improve variant calling sensitivity and accuracy for the cancer genome. Illumina and GeneMind technologies can be used independently or together by public health laboratories performing tumor TS. SNVer and VarScan 2 perform better regarding variant detection sensitivity for three typical tumor samples. Our study provides a standardized target sequencing resource to benchmark new bioinformatics protocols and sequencing platforms.

Dataset Information

Optimized Next-Generation Sequencing Genotype-Haplotype Calling for Genome Variability Analysis.

Publications

Optimized Next-Generation Sequencing Genotype-Haplotype Calling for Genome Variability Analysis.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets