Dataset Information

Explicit Modeling of Ancestry Improves Polygenic Risk Scores and BLUP Prediction.

ABSTRACT: Polygenic prediction using genome-wide SNPs can provide high prediction accuracy for complex traits. Here, we investigate the question of how to account for genetic ancestry when conducting polygenic prediction. We show that the accuracy of polygenic prediction in structured populations may be partly due to genetic ancestry. However, we hypothesized that explicitly modeling ancestry could improve polygenic prediction accuracy. We analyzed three GWAS of hair color (HC), tanning ability (TA), and basal cell carcinoma (BCC) in European Americans (sample size from 7,440 to 9,822) and considered two widely used polygenic prediction approaches: polygenic risk scores (PRSs) and best linear unbiased prediction (BLUP). We compared polygenic prediction without correction for ancestry to polygenic prediction with ancestry as a separate component in the model. In 10-fold cross-validation using the PRS approach, the R^2 for HC increased by 66% (0.0456 to 0.0755; P < 10^-16), the R^2 for TA increased by 123% (0.0154 to 0.0344; P < 10^-16), and the liability-scale R^2 for BCC increased by 68% (0.0138 to 0.0232; P < 10^-16) when explicitly modeling ancestry, which prevents ancestry effects from entering into each SNP effect and being overweighted. Surprisingly, explicitly modeling ancestry produces a similar improvement when using the BLUP approach, which fits all SNPs simultaneously in a single variance component and causes ancestry to be underweighted. We validate our findings via simulations, which show that the differences in prediction accuracy will increase in magnitude as sample sizes increase. In summary, our results show that explicitly modeling ancestry can be important in both PRS and BLUP prediction.
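
The comparison described in the abstract (a PRS built without ancestry correction versus a PRS with ancestry as a separate model component) can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions, not the authors' pipeline: the simulated data, the helper names (residualize, marginal_effects, r_squared), the sample sizes, and the effect sizes are all hypothetical, and the simulated ancestry axis stands in for a principal component that would normally be estimated from genotypes.

```python
import numpy as np


def residualize(x, covar):
    """Remove the least-squares fit of covar (plus an intercept) from x."""
    design = np.column_stack([np.ones(len(covar)), covar])
    coef, *_ = np.linalg.lstsq(design, x, rcond=None)
    return x - design @ coef


def marginal_effects(geno, pheno):
    """Per-SNP simple-regression slopes of pheno on each genotype column."""
    g = geno - geno.mean(axis=0)
    y = pheno - pheno.mean()
    return (g.T @ y) / (g ** 2).sum(axis=0)


def r_squared(pred, obs):
    """Squared Pearson correlation between prediction and observation."""
    return float(np.corrcoef(pred, obs)[0, 1] ** 2)


# --- Toy structured sample (all values are illustrative assumptions) ---
rng = np.random.default_rng(0)
n, m = 3000, 200
ancestry = rng.normal(size=n)              # stand-in for an ancestry PC
base_freq = rng.uniform(0.2, 0.8, size=m)  # baseline allele frequencies
freq_shift = rng.normal(scale=0.05, size=m)
freq = np.clip(base_freq + np.outer(ancestry, freq_shift), 0.05, 0.95)
geno = rng.binomial(2, freq).astype(float)
effects = rng.normal(scale=0.05, size=m)
pheno = 0.5 * ancestry + geno @ effects + rng.normal(size=n)

# Three-way split: SNP effects, combination weights, evaluation.
train = np.arange(0, 1500)
tune = np.arange(1500, 2250)
test = np.arange(2250, 3000)

# (a) PRS without ancestry correction: marginal SNP effects absorb
#     whatever ancestry signal is correlated with each SNP.
beta_raw = marginal_effects(geno[train], pheno[train])
r2_uncorrected = r_squared(geno[test] @ beta_raw, pheno[test])

# (b) Ancestry as a separate component: SNP effects are estimated after
#     regressing ancestry out, then ancestry and the PRS get their own
#     weights, fitted on the tuning split.
beta_adj = marginal_effects(
    residualize(geno[train], ancestry[train]),
    residualize(pheno[train], ancestry[train]),
)
prs_adj = geno @ beta_adj
design_tune = np.column_stack([np.ones(len(tune)), ancestry[tune], prs_adj[tune]])
weights, *_ = np.linalg.lstsq(design_tune, pheno[tune], rcond=None)
design_test = np.column_stack([np.ones(len(test)), ancestry[test], prs_adj[test]])
r2_with_ancestry = r_squared(design_test @ weights, pheno[test])

print(f"R^2 without ancestry correction: {r2_uncorrected:.4f}")
print(f"R^2 with ancestry as a separate component: {r2_with_ancestry:.4f}")
```

The sketch uses a single held-out split rather than the 10-fold cross-validation reported in the abstract, and it does not attempt the BLUP comparison.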

SUBMITTER: Chen CY

PROVIDER: S-EPMC4734143 | biostudies-literature | 2015 Sep

REPOSITORIES: biostudies-literature


Publications

Explicit Modeling of Ancestry Improves Polygenic Risk Scores and BLUP Prediction.

Chia-Yen Chen, Jiali Han, David J. Hunter, Peter Kraft, Alkes L. Price

Genetic Epidemiology, 2015-05-21, issue 6


PMID: 25995153

Similar Datasets

Project description: Background: Cardiovascular diseases (CVD) are a major health concern in Africa. Improved identification and treatment of high-risk individuals can reduce adverse health outcomes. Current CVD risk calculators are largely unvalidated in African populations and overlook genetic factors. Polygenic scores (PGS) can enhance risk prediction by measuring genetic susceptibility to CVD, but their effectiveness in genetically diverse populations is limited by a European-ancestry bias. To address this, we developed models integrating genetic data and conventional risk factors to assess the risk of developing cardiometabolic outcomes in African populations. Methods: We used summary statistics from a genome-wide association meta-analysis (n = 14,126) in African populations to derive novel genome-wide PGS for 14 cardiometabolic traits in an independent African target sample (Africa Wits-INDEPTH Partnership for Genomic Research (AWI-Gen), n = 10,603). Regression analyses assessed relationships between each PGS and corresponding cardiometabolic trait, and seven CVD outcomes (CVD, heart attack, stroke, diabetes mellitus, dyslipidaemia, hypertension, and obesity). The predictive utility of the genetic data was evaluated using elastic net models containing multiple PGS (MultiPGS) and reference-projected principal components of ancestry (PPCs). An integrated risk prediction model incorporating genetic and conventional risk factors was developed. Nested cross-validation was used when deriving elastic net models to enhance generalisability. Results: Our African-specific PGS displayed significant but variable within- and cross-trait prediction (max. R^2 = 6.8%, p = 1.86 × 10^-173). Significantly associated PGS with dyslipidaemia included the PGS for total cholesterol (logOR = 0.210, SE = 0.022, p = 2.18 × 10^-21) and low-density lipoprotein (logOR = -0.141, SE = 0.022, p = 1.30 × 10^-20); with hypertension, the systolic blood pressure PGS (logOR = 0.150, SE = 0.045, p = 8.34 × 10^-4); and multiple PGS associated with obesity: body mass index (max. logOR = 0.131, SE = 0.031, p = 2.22 × 10^-5), hip circumference (logOR = 0.122, SE = 0.029, p = 2.28 × 10^-5), waist circumference (logOR = 0.013, SE = 0.098, p = 8.13 × 10^-4) and weight (logOR = 0.103, SE = 0.029, p = 4.89 × 10^-5). Elastic net models incorporating MultiPGS and PPCs significantly improved prediction over MultiPGS alone. Models including genetic data and conventional risk factors were more predictive than conventional risk models alone (dyslipidaemia: R^2 increase = 2.6%, p = 4.45 × 10^-12; hypertension: R^2 increase = 2.6%, p = 2.37 × 10^-13; obesity: R^2 increase = 5.5%, p = 1.33 × 10^-34). Conclusions: In African populations, CVD and associated cardiometabolic trait prediction models can be improved by incorporating ancestry-aligned PGS and accounting for ancestry. Combining PGS with conventional risk factors further enhances prediction over traditional models based on conventional factors. Incorporating data from target populations can improve the generalisability of international predictive models for CVD and associated traits in African populations.

Project description: Importance: Researchers commonly use counts of diagnostic codes from EHR-linked biobanks to infer phenotypic status. However, these approaches overlook temporal changes in EHR data, such as the discontinuation or "dropout" of diagnostic codes, which may exacerbate disparities in genomics research, as EHR data quality can be confounded with demographic attributes. Objective: To address this, we propose modeling diagnostic code dropout in EHR data to inform phenotyping for schizophrenia in genomic analyses. Design: We develop and test our diagnostic dropout model by analyzing EHR data from individuals with prior schizophrenia diagnoses. We further validate model performance on a subset of patients whose diagnoses were attained through chart review. Using PRS-CS and existing GWAS summary statistics, we first extrapolate polygenic weights. Then, we apply our dropout model's outputs to construct a data-driven filter defining our target cohort for measuring polygenic score performance. Setting: Our analysis utilizes EHR and genomic data from the Million Veteran Program. Participants: To model diagnostic dropout in schizophrenia, we leverage data from 12,739 patients with a history of schizophrenia, after excluding outliers. For polygenic score analyses, we incorporate data from a potential pool of 8,385 European ancestry and 6,806 African ancestry patients with a history of schizophrenia. Main outcomes and measures: We compare the performance of our diagnostic dropout model with alternative methodologies both in predicting diagnostic dropout on a holdout set, as well as on chart review labeled data. Using the top differential diagnosis predictors in our model, we select relevant cases by filtering out patients with a prior history of mood or anxiety disorders. We then test the impact of applying different filters for measuring polygenic score performance. Results: When evaluated on chart review-labeled data, our model improves the area under the precision-recall curve (AUPRC) by 9.6% compared to competing methods. By applying our data-driven filter for schizophrenia, we achieve a 62% increase in the association effect size when transferring a European polygenic score to an African ancestry target cohort. Conclusions and relevance: These findings highlight the potential of modeling diagnostic code dropout to enhance the phenotypic quality of EHR-linked biobank data, advancing more equitable and accurate genomics research across diverse populations.
