Dataset Information

Evaluation of variant identification methods for whole genome sequencing data in dairy cattle.

ABSTRACT: BACKGROUND: Advances in human genomics have allowed unprecedented productivity in terms of algorithms, software, and literature available for translating raw next-generation sequence data into high-quality information. The challenges of variant identification in organisms with lower quality reference genomes are less well documented. We explored the consequences of commonly recommended preparatory steps and the effects of single and multi sample variant identification methods using four publicly available software applications (Platypus, HaplotypeCaller, Samtools and UnifiedGenotyper) on whole genome sequence data of 65 key ancestors of Swiss dairy cattle populations. Accuracy of calling next-generation sequence variants was assessed by comparison to the same loci from medium and high-density single nucleotide variant (SNV) arrays. RESULTS: The total number of SNVs identified varied by software and method, with single (multi) sample results ranging from 17.7 to 22.0 (16.9 to 22.0) million variants. Computing time varied considerably between software. Preparatory realignment of insertions and deletions and subsequent base quality score recalibration had only minor effects on the number and quality of SNVs identified by different software, but increased computing time considerably. Average concordance for single (multi) sample results with high-density chip data was 58.3% (87.0%) and average genotype concordance in correctly identified SNVs was 99.2% (99.2%) across software. The average quality of SNVs identified, measured as the ratio of transitions to transversions, was higher using single sample methods than multi sample methods. A consensus approach using results of different software generally provided the highest variant quality in terms of transition/transversion ratio. 
CONCLUSIONS: Our findings serve as a reference for variant identification pipeline development in non-human organisms and help assess the implication of preparatory steps in next-generation sequencing pipelines for organisms with incomplete reference genomes (pipeline code is included). Benchmarking this information should prove particularly useful in processing next-generation sequencing data for use in genome-wide association studies and genomic selection.
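
The quality measures named in the abstract (transition/transversion ratio, a consensus across callers, and concordance with array genotypes) can be illustrated with a minimal Python sketch. This is not the pipeline code referred to above. The variant key `(chrom, pos, ref, alt)`, the locus key `(chrom, pos)`, the `min_callers` threshold, and both concordance definitions are assumptions made for illustration; the publication may define these quantities differently.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple

Variant = Tuple[str, int, str, str]  # hypothetical key: (chrom, pos, ref, alt)
Locus = Tuple[str, int]              # hypothetical key: (chrom, pos)

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def titv_ratio(variants: Iterable[Variant]) -> float:
    """Ratio of transitions (A<->G, C<->T) to transversions among biallelic SNVs."""
    ti = tv = 0
    for _chrom, _pos, ref, alt in variants:
        ref, alt = ref.upper(), alt.upper()
        bases = {ref, alt}
        # Skip indels, non-variant records and ambiguous bases such as N.
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if not bases <= (PURINES | PYRIMIDINES):
            continue
        if bases <= PURINES or bases <= PYRIMIDINES:
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("nan")


def consensus(call_sets: Iterable[Set[Variant]], min_callers: int = 2) -> Set[Variant]:
    """Variants reported by at least `min_callers` call sets (assumed rule)."""
    counts = Counter(v for calls in call_sets for v in set(calls))
    return {v for v, n in counts.items() if n >= min_callers}


def concordance(seq: Dict[Locus, str], array: Dict[Locus, str]) -> Tuple[float, float]:
    """Return (site concordance, genotype concordance) under assumed definitions.

    Site concordance: share of array loci that also have a sequence call.
    Genotype concordance: share of those shared loci with the same unordered genotype.
    """
    shared = [loc for loc in array if loc in seq]
    site = len(shared) / len(array) if array else float("nan")
    same = sum(sorted(seq[loc]) == sorted(array[loc]) for loc in shared)
    geno = same / len(shared) if shared else float("nan")
    return site, geno
```

Genotypes are compared as unordered allele pairs, so "AG" and "GA" count as identical. Real data would normally be read from VCF files and array exports rather than built by hand, which this sketch leaves out.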

SUBMITTER: Baes CF

PROVIDER: S-EPMC4289218 | biostudies-literature | 2014

REPOSITORIES: biostudies-literature

Publications

Evaluation of variant identification methods for whole genome sequencing data in dairy cattle.

Baes CF, Dolezal MA, Koltes JE, Bapst B, Fritz-Waters E, Jansen S, Flury C, Signer-Hasler H, Stricker C, Fernando R, Fries R, Moll J, Garrick DJ, Reecy JM, Gredler B

BMC Genomics, 2014-11-01

PMID: 25361890

Similar Datasets

Project description: BACKGROUND: Quantitative genetic studies suggest the existence of variation at the genome level that affects the ability of cattle to resist parasitic diseases. The objective of the current study was to identify regions of the bovine genome that are associated with resistance to endo-parasites. METHODS: Individual cattle records were available for Fasciola hepatica-damaged liver from 18 abattoirs. Deregressed estimated breeding values (EBV) for F. hepatica-damaged liver were generated for genotyped animals with a record for F. hepatica-damaged liver and for genotyped sires with at least one progeny record for F. hepatica-damaged liver; 3702 animals were available. In addition, individual cow records for antibody response to F. hepatica on 6388 genotyped dairy cows, antibody response to Ostertagia ostertagi on 8334 genotyped dairy cows and antibody response to Neospora caninum on 4597 genotyped dairy cows were adjusted for non-genetic effects. Genotypes were imputed to whole-sequence; after edits, 14,190,141 single nucleotide polymorphisms (SNPs) and 16,603,644 SNPs were available for cattle with deregressed EBV for F. hepatica-damaged liver and cows with an antibody response to a parasitic disease, respectively. Association analyses were undertaken using linear regression on one SNP at a time, in which a genomic relationship matrix accounted for the relationships between animals. RESULTS: Genomic regions for F. hepatica-damaged liver were located on Bos taurus autosomes (BTA) 1, 8, 11, 16, 17 and 18; each region included at least one SNP with a p value lower than 10^-6. Five SNPs were identified as significant (q value < 0.05) for antibody response to N. caninum and were located on BTA21 or 25. For antibody response to F. hepatica and O. ostertagi, six and nine quantitative trait loci (QTL) regions that included at least one SNP with a p value lower than 10^-6 were identified, respectively. Gene set enrichment analysis revealed a significant association between functional annotations related to the olfactory system and QTL that were suggestively associated with endo-parasite phenotypes. CONCLUSIONS: A number of novel genomic regions were suggestively associated with endo-parasite phenotypes across the bovine genome and two genomic regions on BTA21 and 25 were associated with antibody response to N. caninum.
