Dataset Information

Association analysis using next-generation sequence data from publicly available control groups: the robust variance score statistic.

ABSTRACT: Sufficiently powered case-control studies with next-generation sequence (NGS) data remain prohibitively expensive for many investigators. If feasible, a more efficient strategy would be to include publicly available sequenced controls. However, these studies can be confounded by differences in sequencing platform; alignment, single nucleotide polymorphism and variant calling algorithms; read depth; and selection thresholds. Assuming one can match cases and controls on the basis of ethnicity and other potential confounding factors, and one has access to the aligned reads in both groups, we investigate the effect of systematic differences in read depth and selection threshold when comparing allele frequencies between cases and controls. We propose a novel likelihood-based method, the robust variance score (RVS), that substitutes genotype calls by their expected values given observed sequence data. We show theoretically that the RVS eliminates read depth bias in the estimation of minor allele frequency. We also demonstrate, using simulated and real NGS data, that the RVS method controls Type I error and has comparable power to the 'gold standard' analysis with the true underlying genotypes for both common and rare variants. An RVS R script and instructions can be found at strug.research.sickkids.ca, and at https://github.com/strug-lab/RVS. Contact: lisa.strug@utoronto.ca. Supplementary data are available at Bioinformatics online.
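The substitution step described in the abstract, replacing a hard genotype call with its expected value given the observed sequence data, can be illustrated with a minimal sketch. This is not the authors' R script. The function names, the Hardy-Weinberg prior and the example likelihoods are all assumptions made for illustration, and the record does not state how the published method estimates genotype frequencies.

```python
from typing import List, Sequence


def hwe_prior(maf: float) -> List[float]:
    """P(G = 0, 1, 2) under Hardy-Weinberg equilibrium (an assumption of this sketch)."""
    return [(1 - maf) ** 2, 2 * maf * (1 - maf), maf ** 2]


def expected_genotype(likelihoods: Sequence[float], prior: Sequence[float]) -> float:
    """Return E[G | D] = sum_g g * P(D | G=g) * P(G=g) / sum_g P(D | G=g) * P(G=g).

    likelihoods: P(D | G = g) for g = 0, 1, 2 minor alleles.
    prior:       P(G = g) for g = 0, 1, 2.
    """
    weights = [lik * p for lik, p in zip(likelihoods, prior)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("likelihoods and prior give zero total probability")
    return sum(g * w for g, w in enumerate(weights)) / total


# Hypothetical likelihoods for one low-depth sample: the reads favour a
# heterozygote but do not rule out either homozygote.
example = expected_genotype([0.2, 0.7, 0.1], hwe_prior(0.5))

# With unambiguous reads the expectation collapses to the hard call.
certain = expected_genotype([0.0, 0.0, 1.0], hwe_prior(0.5))
```

Unlike a thresholded call, the expected value stays fractional when read depth is low, so uncertainty in the reads is carried into the downstream comparison rather than discarded.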

SUBMITTER: Derkach A

PROVIDER: S-EPMC4103600 | biostudies-literature | 2014 Aug

REPOSITORIES: biostudies-literature

Publications

Association analysis using next-generation sequence data from publicly available control groups: the robust variance score statistic.

Andriy Derkach, Theodore Chiang, Jiafen Gong, Laura Addis, Sara Dobbins, Ian Tomlinson, Richard Houlston, Deb K Pal, Lisa J Strug

Bioinformatics (Oxford, England), 2014-04-14, issue 15

PMID: 24733292

Similar Datasets

Project description: ABSTRACT: Clostridium perfringens is a spore-forming anaerobic pathogen responsible for a variety of histotoxic and intestinal infections in humans and animals. High-resolution genotyping aiming to identify bacteria at strain level has become increasingly important in modern microbiology to understand pathogen transmission pathways and to tackle infection sources. This study aimed at establishing a publicly available genome-wide multilocus sequence-typing (MLST) scheme for C. perfringens. A total of 1,431 highly conserved core genes (1.34 megabases; 50% of the reference genome genes) were indexed for a core genome-based MLST (cgMLST) scheme for C. perfringens. The scheme was applied to 282 ecologically and geographically diverse genomes, showing that the genotyping results of cgMLST were highly congruent with the core genome-based single-nucleotide-polymorphism typing in terms of resolution and tree topology. In addition, the cgMLST provided a greater discrimination than classical MLST methods for C. perfringens. The usability of the scheme for outbreak analysis was confirmed by reinvestigating published outbreaks of C. perfringens-associated infections in the United States and the United Kingdom. In summary, a publicly available scheme and an allele nomenclature database for genomic typing of C. perfringens have been established and can be used for broad-based and standardized epidemiological studies. IMPORTANCE: Global epidemiological surveillance of bacterial pathogens is enhanced by the availability of standard tools and sharing of typing data. The use of whole-genome sequencing has opened the possibility for high-resolution characterization of bacterial strains down to the clonal and subclonal levels. Core genome multilocus sequence typing is a robust system that uses highly conserved core genes for deep genotyping. The method has been successfully and widely used to describe the epidemiology of various bacterial species. Nevertheless, a cgMLST typing scheme for Clostridium perfringens is currently not publicly available. In this study, we (i) developed a cgMLST typing scheme for C. perfringens, (ii) evaluated the performance of the scheme on different sets of C. perfringens genomes from different hosts and geographic regions as well as from different outbreak situations, and, finally, (iii) made this scheme publicly available supported by an allele nomenclature database for global and standard genomic typing.
