Comparison of variant calling pipelines using Illumina CanineHD BeadChip array as the truth dataset
ABSTRACT: Next generation sequencing platforms have become essential tools for understanding DNA in a wide range of contexts. Their success relies heavily on the accuracy, sensitivity and specificity of the methods used to discern differences between the reference genome and the genomes under investigation. Here we compare the relative performance of five popular single nucleotide variant callers, with and without their associated recommended hard filtering criteria. We compare: FreeBayes; the Genome Analysis Toolkit’s Haplotype Caller and Unified Genotyper; SAMtools; and VarScan. We tailor this comparison to suit smaller projects with modest sample numbers (n = 10) and coverage (~10X) to fill a current gap in the literature. Other comparison studies are generally applicable only to larger projects in model species, where there is access to large amounts of sequencing data and curated callsets for base and variant quality score recalibration. We estimated the accuracy, sensitivity and specificity of each pipeline from the rate and number of genotypes concordant with the “truth” dataset for 10 canine samples. The truth dataset was defined as genotypes obtained from the CanineHD BeadChip array. Whole genome sequencing was performed on the Illumina HiSeq2000 or HiSeq2500 platform as 100-101 base pair, paired-end reads to an average sample coverage of 10.3X. Apart from GATK Haplotype Caller, applying the recommended hard filters did not improve genotype concordance at the tested levels of minimum coverage. The default VarScan pipeline with no additional filters applied (VarScan uses SAMtools mpileup, without base alignment quality computation) generally outperformed the other callers in terms of accuracy, sensitivity and specificity.
The results of this study demonstrate that hard filtering of variant calls from low-powered genome studies can impair the accuracy, sensitivity and specificity of callsets, and they provide some benchmark performance metrics across a range of low coverage levels.
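The comparison against the array "truth" genotypes described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function name `compare_to_truth`, the encoding of genotypes as alternate-allele counts (0, 1, 2), the `min_depth` parameter, and the definitions of sensitivity and specificity used here (treating any non-reference truth genotype as a positive) are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Dict, Optional, Tuple

Site = Tuple[str, int]  # (chromosome, position); hypothetical site key


def compare_to_truth(
    truth: Dict[Site, int],
    calls: Dict[Site, Tuple[int, int]],
    min_depth: int = 0,
) -> Dict[str, Optional[float]]:
    """Compare sequencing calls with array genotypes at shared sites.

    truth: site -> alternate-allele count (0, 1 or 2) from the array.
    calls: site -> (alternate-allele count, read depth) from a caller.
    min_depth: calls with lower depth are excluded (assumed filter).
    """
    tp = fp = tn = fn = concordant = compared = 0
    for site, truth_gt in truth.items():
        call = calls.get(site)
        if call is None:
            continue  # site not genotyped by the caller
        call_gt, depth = call
        if depth < min_depth:
            continue  # below the minimum coverage being tested
        compared += 1
        if call_gt == truth_gt:
            concordant += 1
        # Assumed definition: a "positive" is any non-reference genotype.
        truth_var, call_var = truth_gt > 0, call_gt > 0
        if truth_var and call_var:
            tp += 1
        elif truth_var:
            fn += 1
        elif call_var:
            fp += 1
        else:
            tn += 1

    def ratio(num: int, den: int) -> Optional[float]:
        return num / den if den else None

    return {
        "compared": float(compared),
        "concordance": ratio(concordant, compared),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, compared),
    }


# Toy data, invented purely to show the call shape.
truth = {("chr1", 100): 0, ("chr1", 200): 1, ("chr1", 300): 2,
         ("chr1", 400): 0, ("chr1", 500): 1}
calls = {("chr1", 100): (0, 12), ("chr1", 200): (1, 9), ("chr1", 300): (1, 15),
         ("chr1", 400): (1, 11), ("chr1", 500): (0, 3)}
metrics = compare_to_truth(truth, calls, min_depth=5)
```

Running the same function over several `min_depth` values, once on filtered and once on unfiltered calls, would give the kind of per-coverage comparison the abstract refers to, under the assumptions stated above.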
ORGANISM(S): Canis lupus familiaris
PROVIDER: GSE117010 | GEO | 2018/08/31
REPOSITORIES: GEO