Project description:The intensities from genotyping array data can be used to detect copy number variants (CNVs), but a high level of noise in the data and overlap between different copy-number intensity distributions produce unreliable calls, particularly when only a few probes are covered by the CNV. We present a novel pipeline (CamCNV) with a series of steps to reduce noise and more reliably detect CNVs covering as few as three probes. The pipeline aims to detect rare CNVs (below 1% frequency) for association tests in large cohorts. The method uses the information from all samples to convert intensities to z-scores, thus adjusting for variance between probes. We tested the sensitivity of our pipeline by looking for known CNVs from the 1000 Genomes Project in our genotyping of 1000 Genomes samples. We also compared the CNV calls for 1661 pairs of genotyped replicate samples. At the chosen mean z-score cut-off, sensitivity to detect the 1000 Genomes CNVs was approximately 85% for deletions and 65% for duplications. From the replicates, we estimate the false discovery rate is controlled at ∼10% for deletions (falling to below 3% with more than five probes) and ∼28% for duplications. The pipeline demonstrates improved sensitivity when compared to calling with PennCNV, particularly for short deletions covering only a few probes. For each called CNV, the mean z-score is a useful metric for controlling the false discovery rate.
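The z-score idea in the abstract above can be illustrated with a minimal sketch. It is not the CamCNV implementation, which the abstract describes as a series of noise-reduction steps; it only shows how standardising each probe across all samples puts probes with different variances on one scale, and how a mean z-score over a run of probes could be computed. The function names `probe_z_scores` and `mean_z` and the toy intensities are hypothetical.

```python
from statistics import mean, pstdev


def probe_z_scores(intensities):
    """Standardise each probe (column) across all samples (rows).

    intensities[s][p] is the intensity of sample s at probe p.
    Returns a matrix of the same shape holding z-scores.
    """
    n_probes = len(intensities[0])
    columns = [[row[p] for row in intensities] for p in range(n_probes)]
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [
        [(row[p] - mus[p]) / sds[p] if sds[p] > 0 else 0.0 for p in range(n_probes)]
        for row in intensities
    ]


def mean_z(z_row, start, end):
    """Mean z-score of one sample over probes start..end-1 (a candidate CNV segment)."""
    return mean(z_row[start:end])


# Toy data (hypothetical): four samples, three probes; the last sample has
# lowered intensity at every probe, and the third probe is noisier than the others.
toy = [
    [0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0],
    [-1.0, -0.9, -2.0],
]
z = probe_z_scores(toy)
segment_score = mean_z(z[3], 0, 3)
```

In this toy example the last sample gets the same z-score at all three probes even though the raw drop at the third probe is twice as large, which is the sense in which z-scores adjust for variance between probes.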
Project description:Although copy number variants (CNVs) are important in genomic medicine, CNVs have not been systematically assessed for many complex traits. Several large rare CNVs increase risk for schizophrenia (SCZ) and autism and often demonstrate pleiotropic effects; however, their frequencies in the general population and other complex traits are unknown. Genotyping large numbers of samples is essential for progress. Large cohorts from many different diseases are being genotyped using exome-focused arrays designed to detect uncommon or rare protein-altering sequence variation. Although these arrays were not designed for CNV detection, the hybridization intensity data generated in each experiment could, in principle, be used for gene-focused CNV analysis. Our goal was to evaluate the extent to which CNVs can be detected using data from one particular exome array (the Illumina Human Exome Bead Chip). We genotyped 9100 Swedish subjects (3962 cases with SCZ and 5138 controls) using both standard genome-wide association study (GWAS) and exome arrays. In comparison with CNVs detected using GWAS arrays, we observed high sensitivity and specificity for detecting genic CNVs ≥400 kb including known pathogenic CNVs along with replicating the literature finding that cases with SCZ had greater enrichment for genic CNVs. Our data confirm the association of SCZ with 16p11.2 duplications and 22q11.2 deletions, and suggest a novel association with deletions at 11q12.2. Our results suggest the utility of exome-focused arrays in surveying large genic CNVs in very large samples; and thereby open the door for new opportunities such as conducting well-powered CNV assessment and comparisons between different diseases. The use of a single platform also minimizes potential confounding factors that could impact accurate detection.
Project description:Genome-wide association studies have identified many common genetic variants which are associated with certain diseases. The identified common variants, however, explain only a small portion of the heritability of a complex disease phenotype. The missing heritability motivated researchers to test the hypothesis that rare variants influence common diseases. Next-generation sequencing technologies have made the studies of rare variants practicable. Quite a few statistical tests have been developed for exploiting the cumulative effect of a set of rare variants on a phenotype. The best-known sequence kernel association tests (SKATs) were developed for rare variants analysis of homogeneous genomes. In this chapter, we illustrate applications of the SKATs and offer several caveats regarding them. In particular, we address how to modify the SKATs to integrate local allele ancestries and calibrate the cryptic relatedness and population structure of admixed genomes.
Project description:Rare and low frequency variants are not well covered in most germline genotyping arrays and are understudied in relation to epithelial ovarian cancer (EOC) risk. To address this gap, we used genotyping arrays targeting rarer protein-coding variation in 8,165 EOC cases and 11,619 controls from the international Ovarian Cancer Association Consortium (OCAC). Pooled association analyses were conducted at the variant and gene level for 98,543 variants directly genotyped through two exome genotyping projects. Only common variants that represent or are in strong linkage disequilibrium (LD) with previously-identified signals at established loci reached traditional thresholds for exome-wide significance (P < 5.0 × 10⁻⁷). One of the most significant signals (P_all histologies = 1.01 × 10⁻¹³; P_serous = 3.54 × 10⁻¹⁴) occurred at 3q25.31 for rs62273959, a missense variant mapping to the LEKR1 gene that is in LD (r² = 0.90) with a previously identified 'best hit' (rs7651446) mapping to an intron of TIPARP. Suggestive associations (5.0 × 10⁻⁵ > P ≥ 5.0 × 10⁻⁷) were detected for rare and low-frequency variants at 16 novel loci. Four rare missense variants were identified (ACTBL2 rs73757391 (5q11.2), BTD rs200337373 (3p25.1), KRT13 rs150321809 (17q21.2) and MC2R rs104894658 (18p11.21)), but only MC2R rs104894668 had a large effect size (OR = 9.66). Genes most strongly associated with EOC risk included ACTBL2 (P_AML = 3.23 × 10⁻⁵; P_SKAT-o = 9.23 × 10⁻⁴) and KRT13 (P_AML = 1.67 × 10⁻⁴; P_SKAT-o = 1.07 × 10⁻⁵), reaffirming variant-level analysis. In summary, this large study identified several rare and low-frequency variants and genes that may contribute to EOC susceptibility, albeit with possible small effects. Future studies that integrate epidemiology, sequencing, and functional assays are needed to further unravel the unexplained heritability and biology of this disease.
Project description:Tracking genetic variations from positive SARS-CoV-2 samples yields crucial information about the number of variants circulating in an outbreak and the possible lines of transmission, but sequencing every positive SARS-CoV-2 sample would be prohibitively costly for population-scale test and trace operations. Genotyping is a rapid, high-throughput and low-cost alternative for screening positive SARS-CoV-2 samples in many settings. We have designed a SNP identification pipeline to identify genetic variation using sequenced SARS-CoV-2 samples. Our pipeline identifies a minimal marker panel that can define distinct genotypes. To evaluate the system, we developed a genotyping panel to detect variants identified from SARS-CoV-2 sequences surveyed between March and May 2020 and tested this on 50 stored qRT-PCR positive SARS-CoV-2 clinical samples that had been collected across the South West of the UK in April 2020. The 50 samples split into 15 distinct genotypes and there was a 61.9% probability that any two randomly chosen samples from our set of 50 would have a distinct genotype. In a high throughput laboratory, qRT-PCR positive samples pooled into 384-well plates could be screened with a marker panel at a cost of < £1.50 per sample. Our results demonstrate the usefulness of a SNP genotyping panel to provide a rapid, cost-effective, and reliable way to monitor SARS-CoV-2 variants circulating in an outbreak. Our analysis pipeline is publicly available and will allow for marker panels to be updated periodically as viral genotypes arise or disappear from circulation.
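The probability that two randomly chosen samples have distinct genotypes can be computed from the number of samples in each genotype group. The sketch below assumes sampling without replacement (the form often called the Hunter-Gaston discriminatory index); the abstract does not say which convention was used, and the per-genotype counts behind the 61.9% figure are not given, so the counts here are hypothetical and the function name `distinct_pair_probability` is illustrative only.

```python
def distinct_pair_probability(genotype_counts):
    """Probability that two samples drawn without replacement differ in genotype.

    genotype_counts: number of samples observed for each distinct genotype.
    Computes 1 - sum(n_i * (n_i - 1)) / (N * (N - 1)), with N the total sample count.
    """
    total = sum(genotype_counts)
    if total < 2:
        raise ValueError("need at least two samples")
    same_pairs = sum(n * (n - 1) for n in genotype_counts)
    return 1.0 - same_pairs / (total * (total - 1))


# Hypothetical counts: 4 samples split into genotypes of size 2, 1 and 1.
example = distinct_pair_probability([2, 1, 1])
```

A panel that splits samples into many small groups pushes this value towards 1, while one dominant genotype pulls it down regardless of how many rare genotypes exist.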
Project description:Purpose: Cohort-based germline variant characterization is the standard approach for pathogenic variant discovery in clinical and research samples. However, the impact of cohort size on the molecular diagnostic yield of joint genotyping is largely unknown. Methods: Head-to-head comparison of the molecular diagnostic yield of joint genotyping in two cohorts of 239 cancer patients in the absence and then in the presence of 100 additional germline exomes. Results: In 239 testicular cancer patients, 4 (7.4%, 95% confidence interval [CI]: 2.1-17.9) of 54 pathogenic variants in the cancer predisposition and American College of Medical Genetics and Genomics (ACMG) genes were missed by one or both computational runs of joint genotyping. Similarly, 8 (12.1%, 95% CI: 5.4-22.5) of 66 pathogenic variants in these genes were undetected by joint genotyping in another independent cohort of 239 breast cancer patients. An exome-wide analysis of putative loss-of-function (pLOF) variants in the testicular cancer cohort showed that 162 (8.2%, 95% CI: 7.1-9.6) pLOF variants were only detected in one analysis run but not the other, while 433 (22.0%, 95% CI: 20.2-23.9%) pLOF variants were filtered out by both analyses despite having sufficient sequencing coverage. Conclusion: Our analysis of the standard germline variant detection method highlighted a substantial impact of concurrently analyzing additional genomic data sets on the ability to detect clinically relevant germline pathogenic variants.
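Confidence intervals for proportions such as "4 of 54" can be obtained with an exact binomial (Clopper-Pearson) interval. The abstract does not state which interval method the authors used, so the following is only a sketch of one common choice, written with the standard library and a simple bisection; `binom_cdf` and `clopper_pearson` are illustrative names.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) confidence interval for a binomial proportion k/n."""

    def solve(k_cdf, target):
        # binom_cdf(k_cdf, n, p) decreases as p grows, so bisect on p.
        lo, hi = 0.0, 1.0
        for _ in range(100):
            mid = (lo + hi) / 2
            if binom_cdf(k_cdf, n, mid) > target:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    # Lower bound: p with P(X >= k) = alpha/2, i.e. P(X <= k-1) = 1 - alpha/2.
    lower = 0.0 if k == 0 else solve(k - 1, 1 - alpha / 2)
    # Upper bound: p with P(X <= k) = alpha/2.
    upper = 1.0 if k == n else solve(k, alpha / 2)
    return lower, upper


# Interval for 4 missed variants out of 54, to compare with the one quoted above.
low, high = clopper_pearson(4, 54)
```

The interval is asymmetric around the point estimate because the count is small, which is typical of exact intervals for rare events.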
Project description:Despite their unprecedented density, current SNP genotyping arrays contain large amounts of redundancy, with up to 40 oligonucleotide features used to query each SNP. By using publicly available reference genotype data from the International HapMap, we show that 93.6% sensitivity at <5% false positive rate can be obtained with only four probes per SNP, compared with 98.3% with the full data set. Removal of this redundancy will allow for more comprehensive whole-genome association studies with increased SNP density and larger sample sizes.
Project description:Haplotype-based methods are a cost-effective alternative to characterize unobserved rare variants and map disease-associated alleles. Moreover, they can be used to reconstruct recent population history, which shaped the distribution of rare variants and thus can be used to guide gene mapping studies. In this study, we analysed an Illumina 650k genotyping dataset from three underrepresented populations from Eastern Europe, where ancestors of Russians came into contact with two indigenous ethnic groups, Bashkirs and Tatars. Using the IBD mapping approach, we identified two rare IBD haplotypes strongly enriched in asthma patients of distinct ethnic background. We reconstructed recent population history using haplotype-based methods to reconcile this contradictory finding. Our ChromoPainter analysis showed that these haplotypes each descend from a single ancestor coming from one of the ethnic groups studied. Next, we used the DoRIS approach and showed that source populations for patients exchanged recent (<60 generations) asymmetric gene flow, which supported the ChromoPainter-based scenario that patients share haplotypes through inter-ethnic admixture. Finally, we show that these IBD haplotypes overlap with asthma-associated genomic regions ascertained in European populations. This finding is consistent with the fact that the two donor populations for the rare IBD haplotypes, Russians and Tatars, have European ancestry.
Project description:Large genome-wide association studies (GWAS) have been performed to detect common genetic variants involved in common diseases, but most of the variants found this way account for only a small portion of the trait variance. Furthermore, candidate gene-based resequencing suggests that many rare genetic variants contribute to the trait variance of common diseases. Here we propose two designs, sibpair and unrelated-case designs, to detect rare genetic variants in either a candidate gene-based or genome-wide association analysis. First we show that we can detect and classify together rare risk haplotypes using a relatively small sample with either of these designs, and then have increased power to test association in a larger case-control sample. This method can also be applied to resequencing data. Next we apply the method to the Wellcome Trust Case Control Consortium (WTCCC) coronary artery disease (CAD) and hypertension (HT) data, the latter being the only trait for which no genome-wide association evidence was reported in the original WTCCC study, and identify one interesting gene associated with HT and four associated with CAD at a genome-wide significance level of 5%. These results suggest that searching for rare genetic variants is feasible and can be fruitful in current GWAS, candidate gene studies or resequencing studies.
Project description:In recent years, a myriad of new statistical methods have been proposed for detecting associations of rare single-nucleotide variants (SNVs) with common diseases. These methods can be generally classified as 'collapsing' or 'haplotyping' based. The former is the predominant class, composed of most of the rare variant association methods proposed to date. However, recent works have suggested that haplotyping-based methods may offer advantages and can even be more powerful than collapsing methods in certain situations. In this article, we review and compare collapsing- versus haplotyping-based methods/software in terms of both power and type I error. For collapsing methods, we consider three approaches: Combined Multivariate and Collapsing, Sequence Kernel Association Test and Family-Based Association Test (FBAT): the first two are population based and are among the most popular; the last test is family based, a modification from the popular FBAT to accommodate rare SNVs. For haplotyping-based methods, we include Logistic Bayesian Lasso (LBL) for population data and family-based LBL (famLBL) for family (trio) data. These two methods are selected, as they can be used to test association for specific rare and common haplotypes. Our results show that haplotype methods can be more powerful than collapsing methods if there are interacting SNVs leading to larger haplotype effects. Even if only common SNVs are genotyped, haplotype methods can still detect specific rare haplotypes that tag rare causal SNVs. As expected, family-based methods are robust, whereas population-based methods are susceptible, to population substructure. However, the population-based haplotype approach appears to have smaller inflation of type I error than its collapsing counterparts.