Dataset Information

Estimating Identification Disclosure Risk Using Mixed Membership Models.

ABSTRACT: Statistical agencies and other organizations that disseminate data are obligated to protect data subjects' confidentiality. For example, ill-intentioned individuals might link data subjects to records in other databases by matching on common characteristics (keys). Successful links are particularly problematic for data subjects with combinations of keys that are unique in the population. Hence, as part of their assessments of disclosure risks, many data stewards estimate the probabilities that sample uniques on sets of discrete keys are also population uniques on those keys. This is typically done using log-linear modeling on the keys. However, log-linear models can yield biased estimates of cell probabilities for sparse contingency tables with many zero counts, which often occurs in databases with many keys. This bias can result in unreliable estimates of probabilities of uniqueness and, hence, misrepresentations of disclosure risks. We propose an alternative to log-linear models for datasets with sparse keys based on a Bayesian version of grade of membership (GoM) models. We present a Bayesian GoM model for multinomial variables and offer an MCMC algorithm for fitting the model. We evaluate the approach by treating data from a recent US Census Bureau public use microdata sample as a population, taking simple random samples from that population, and benchmarking estimated probabilities of uniqueness against population values. Compared to log-linear models, GoM models provide more accurate estimates of the total number of uniques in the samples. Additionally, they offer record-level predictions of uniqueness that dominate those based on log-linear models.
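The benchmarking step described in the abstract (treat a file as the population, draw a simple random sample, and check which sample uniques are also population uniques) can be sketched roughly as follows. This is only an illustration of the benchmark quantity, not the authors' GoM or log-linear estimator; the function name, the toy records, and the choice of keys are all hypothetical.

```python
import random
from collections import Counter


def unique_benchmark(population, n, seed=0):
    """Draw a simple random sample of size n (without replacement) and
    return (sample_uniques, population_uniques_among_them).

    `population` is a list of hashable key combinations, e.g. tuples.
    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    sample = rng.sample(population, n)

    pop_counts = Counter(population)   # true cell counts in the population
    sample_counts = Counter(sample)    # observed cell counts in the sample

    # Key combinations that occur exactly once in the sample.
    sample_uniques = [k for k, c in sample_counts.items() if c == 1]
    # The subset of those that also occur exactly once in the population.
    pop_uniques = [k for k in sample_uniques if pop_counts[k] == 1]
    return sample_uniques, pop_uniques


# Toy population (made up): each record is an (age group, sex, region) tuple.
toy_population = (
    [("30-39", "F", "north")] * 5
    + [("40-49", "M", "south")] * 3
    + [("80+", "F", "east")]  # the only population unique
)
s_uniq, p_uniq = unique_benchmark(toy_population, n=4, seed=1)
share = len(p_uniq) / len(s_uniq) if s_uniq else float("nan")
```

A model-based approach would estimate `share` (and the record-level probabilities behind it) from the sample alone; the population counts used here are available only in a benchmarking exercise of this kind.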

SUBMITTER: Manrique-Vallier D

PROVIDER: S-EPMC4159106 | biostudies-literature | 2012 Dec

REPOSITORIES: biostudies-literature

Publications

Estimating Identification Disclosure Risk Using Mixed Membership Models.

Manrique-Vallier D, Reiter JP

Journal of the American Statistical Association, 2012 Dec 1; issue 500

PMID: 25214699

Similar Datasets

Project description: Age-related macular degeneration (AMD) is the leading cause of irreversible visual loss in the elderly in developed countries and typically affects more than 10% of individuals over age 80. AMD has a large genetic component, with heritability estimated to be between 45% and 70%. Numerous variants have been identified and implicate various molecular mechanisms and pathways for AMD pathogenesis but those variants only explain a portion of AMD's heritability. The goal of our study was to estimate the cumulative genetic contribution of common variants on AMD risk for multiple pathways related to the etiology of AMD, including angiogenesis, antioxidant activity, apoptotic signaling, complement activation, inflammatory response, response to nicotine, oxidative phosphorylation, and the tricarboxylic acid cycle. While these mechanisms have been associated with AMD in literature, the overall extent of the contribution to AMD risk for each is unknown. In a case-control dataset with 1,813 individuals genotyped for over 600,000 SNPs we used Genome-wide Complex Trait Analysis (GCTA) to estimate the proportion of AMD risk explained by SNPs in genes associated with each pathway. SNPs within a 50 kb region flanking each gene were also assessed, as well as more distant, putatively regulatory SNPs, based on DNaseI hypersensitivity data from ocular tissue in the ENCODE project. We found that 19 previously associated AMD risk SNPs contributed to 13.3% of the risk for AMD in our dataset, while the remaining genotyped SNPs contributed to 36.7% of AMD risk. Adjusting for the 19 risk SNPs, the complement activation and inflammatory response pathways still explained a statistically significant proportion of additional risk for AMD (9.8% and 17.9%, respectively), with other pathways showing no significant effects (0.3% - 4.4%). Our results show that SNPs associated with complement activation and inflammation significantly contribute to AMD risk, separately from the risk explained by the 19 known risk SNPs. We found that SNPs within 50 kb regions flanking genes explained additional risk beyond genic SNPs, suggesting a potential regulatory role, but that more distant SNPs explained less than 0.5% additional risk for each pathway. From these analyses we find that the impact of complement SNPs on risk for AMD extends beyond the established genome-wide significant SNPs.

Project description: Importance: The clinical value of current multifactorial algorithms for individualized assessment of dementia risk remains unclear. Objective: To evaluate the clinical value associated with 4 widely used dementia risk scores in estimating 10-year dementia risk. Design, setting, and participants: This prospective population-based UK Biobank cohort study assessed 4 dementia risk scores at baseline (2006-2010) and ascertained incident dementia during the following 10 years. Replication with a 20-year follow-up was based on the British Whitehall II study. For both analyses, participants who had no dementia at baseline, had complete data on at least 1 dementia risk score, and were linked to electronic health records from hospitalizations or mortality were included. Data analysis was conducted from July 5, 2022, to April 20, 2023. Exposures: Four existing dementia risk scores: the Cardiovascular Risk Factors, Aging and Dementia (CAIDE)-Clinical score, the CAIDE-APOE-supplemented score, the Brief Dementia Screening Indicator (BDSI), and the Australian National University Alzheimer Disease Risk Index (ANU-ADRI). Main outcomes and measures: Dementia was ascertained from linked electronic health records. To evaluate how well each score predicted the 10-year risk of dementia, concordance (C) statistics, detection rate, false-positive rate, and the ratio of true to false positives were calculated for each risk score and for a model including age alone. Results: Of 465 929 UK Biobank participants without dementia at baseline (mean [SD] age, 56.5 [8.1] years; range, 38-73 years; 252 778 [54.3%] female participants), 3421 were diagnosed with dementia at follow-up (7.5 per 10 000 person-years). If the threshold for a positive test result was calibrated to achieve a 5% false-positive rate, all 4 risk scores detected 9% to 16% of incident dementia and therefore missed 84% to 91% (failure rate). The corresponding failure rate was 84% for a model that included age only. For a positive test result calibrated to detect at least half of future incident dementia, the ratio of true to false positives ranged between 1 to 66 (for CAIDE-APOE-supplemented) and 1 to 116 (for ANU-ADRI). For age alone, the ratio was 1 to 43. The C statistic was 0.66 (95% CI, 0.65-0.67) for the CAIDE clinical version, 0.73 (95% CI, 0.72-0.73) for the CAIDE-APOE-supplemented, 0.68 (95% CI, 0.67-0.69) for BDSI, 0.59 (95% CI, 0.58-0.60) for ANU-ADRI, and 0.79 (95% CI, 0.79-0.80) for age alone. Similar C statistics were seen for 20-year dementia risk in the Whitehall II study cohort, which included 4865 participants (mean [SD] age, 54.9 [5.9] years; 1342 [27.6%] female participants). In a subgroup analysis of same-aged participants aged 65 (±1) years, discriminatory capacity of risk scores was low (C statistics between 0.52 and 0.60). Conclusions and relevance: In these cohort studies, individualized assessments of dementia risk using existing risk prediction scores had high error rates. These findings suggest that the scores were of limited value in targeting people for dementia prevention. Further research is needed to develop more accurate algorithms for estimation of dementia risk.
