Dataset Information

PopCluster: an algorithm to identify genetic variants with ethnicity-dependent effects.

ABSTRACT:

MOTIVATION: Over the last decade, more diverse populations have been included in genome-wide association studies. If a genetic variant has a varying effect on a phenotype in different populations, genome-wide association studies applied to a dataset as a whole may not pinpoint such differences. It is especially important to be able to identify population-specific effects of genetic variants in studies that would eventually lead to development of diagnostic tests or drug discovery.

RESULTS: In this paper, we propose PopCluster: an algorithm to automatically discover subsets of individuals in which the genetic effects of a variant are statistically different. PopCluster provides a simple framework to directly analyze genotype data without prior knowledge of subjects' ethnicities. PopCluster combines logistic regression modeling, principal component analysis, hierarchical clustering and a recursive bottom-up tree parsing procedure. The evaluation of PopCluster suggests that the algorithm has a stable low false positive rate (~4%) and high true positive rate (>80%) in simulations with large differences in allele frequencies between cases and controls. Application of PopCluster to data from genetic studies of longevity discovers ethnicity-dependent heterogeneity in the association of rs3764814 (USP42) with the phenotype.

AVAILABILITY AND IMPLEMENTATION: PopCluster was implemented using the R programming language, PLINK and Eigensoft software, and can be found at the following GitHub repository: https://github.com/gurinovich/PopCluster with instructions on its installation and usage.

SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
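The motivation in the abstract, that a pooled analysis can hide effects that differ between populations, can be illustrated with a small numerical sketch. This is not PopCluster's procedure (which is implemented in R with PLINK and Eigensoft and discovers the groups itself); it is a simplified stand-in that assumes the two groups are already known and compares their allelic odds ratios with a z-test on the log odds ratios. The allele counts, the group names and the helper functions `log_odds_ratio` and `heterogeneity_test` are all hypothetical and chosen only for illustration.

```python
import math


def log_odds_ratio(a, b, c, d):
    """Log odds ratio and its variance for a 2x2 allele-count table.

    a, b: minor / major allele counts in cases
    c, d: minor / major allele counts in controls
    """
    log_or = math.log((a * d) / (b * c))
    variance = 1 / a + 1 / b + 1 / c + 1 / d  # Woolf's variance estimate
    return log_or, variance


def heterogeneity_test(table_1, table_2):
    """Two-sided z-test for a difference in log odds ratios between two groups."""
    lor_1, var_1 = log_odds_ratio(*table_1)
    lor_2, var_2 = log_odds_ratio(*table_2)
    z = (lor_1 - lor_2) / math.sqrt(var_1 + var_2)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


# Hypothetical counts (not from the paper): the variant raises the odds of
# the phenotype in group A and lowers them in group B.
group_a = (60, 40, 40, 60)
group_b = (40, 60, 60, 40)

# Pooling the two groups adds the tables cell by cell.
pooled = tuple(x + y for x, y in zip(group_a, group_b))

pooled_or = math.exp(log_odds_ratio(*pooled)[0])
or_a = math.exp(log_odds_ratio(*group_a)[0])
or_b = math.exp(log_odds_ratio(*group_b)[0])
z, p_value = heterogeneity_test(group_a, group_b)

print(f"pooled OR = {pooled_or:.2f}")
print(f"group A OR = {or_a:.2f}, group B OR = {or_b:.2f}")
print(f"heterogeneity z = {z:.2f}, p = {p_value:.2g}")
```

With these made-up counts the pooled odds ratio is exactly 1, so a single analysis of the whole dataset would show no association, while the two groups have opposite effects that the heterogeneity test separates clearly.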

SUBMITTER: Gurinovich A

PROVIDER: S-EPMC6735784 | biostudies-literature | 2019 Sep

REPOSITORIES: biostudies-literature

Publications

PopCluster: an algorithm to identify genetic variants with ethnicity-dependent effects.

Gurinovich A, Bae H, Farrell JJ, Andersen SL, Monti S, Puca A, Atzmon G, Barzilai N, Perls TT, Sebastiani P

Bioinformatics (Oxford, England), 2019-09-01, issue 17

PMID: 30624692

Similar Datasets

Project description:

Background: Hypophosphatasia (HPP) is a rare and underdiagnosed condition characterized by deficient mineralization of bone and teeth. The aim of this study was, first, to evaluate the diagnostic utility of alkaline phosphatase (ALP) threshold levels for identifying adults with variants in ALPL among individuals with persistently low ALP levels and, second, to determine the value of also including its substrates (serum pyridoxal-5'-phosphate, PLP, and urinary phosphoethanolamine, PEA) in order to create a biochemical algorithm that could facilitate the diagnostic work-up of HPP.

Results: The study population comprised 77 subjects with persistent hypophosphatasaemia. They were divided into two groups according to the presence (+GT) or absence (-GT) of pathogenic ALPL variants: 40 +GT and 37 -GT. Diagnostic utility measures were calculated for different ALP thresholds, and Receiver Operating Characteristic (ROC) curves were employed to determine the optimal PLP and PEA cut-off levels for predicting the presence of variants. The optimal threshold for ALP was 25 IU/L; for PLP, 180 nmol/L; and for PEA, 30 µmol/g creatinine. Biochemical predictive models were assessed using binary logistic regression analysis and a bootstrapping machine learning technique, and the results were then validated. For ALP < 25 IU/L (model 1), the area under the curve (AUC) was 0.68 (95% CI 0.63-0.72); it improved to 0.87 (95% CI 0.8-0.9) when PEA or PLP threshold levels were added (models 2 and 3), reaching 0.94 (0.91-0.97) when both substrates were included (model 4). The internal validation showed that adding serum PLP threshold levels to the model that included only ALP significantly improved sensitivity (S) and negative predictive value (NPV), both to 100%, with an accuracy (AC) of 93%, in comparison to adding urinary PEA (S: 71%; NPV: 75%; AC: 79%). Diagnostic utility measures similar to those observed in model 3 were detected when both substrates were added.

Conclusions: In this study, we propose a biochemical predictive model based on the threshold levels of the main biochemical markers of HPP (ALP < 25 IU/L and PLP > 180 nmol/L) that, when combined, seem to be very useful for identifying individuals with ALPL variants.
