Dataset Information

Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation.

ABSTRACT: BACKGROUND: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. OBJECTIVE: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. METHODS: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this "meaningful identity disclosure risk." The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. RESULTS: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. CONCLUSIONS: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.

SUBMITTER: El Emam K

PROVIDER: S-EPMC7704280 | biostudies-literature | 2020 Nov

REPOSITORIES: biostudies-literature


Publications

Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation.

El Emam Khaled K, Mosquera Lucy L, Bass Jason J

Journal of Medical Internet Research, 2020 Nov 16, issue 11


PMID: 33196453

Similar Datasets

Project description: Importance: Systems-level barriers to diabetes care could be improved with population health planning tools that accurately discriminate between high- and low-risk groups to guide investments and targeted interventions. Objective: To develop and validate a population-level machine learning model for predicting type 2 diabetes 5 years before diabetes onset using administrative health data. Design, setting, and participants: This decision analytical model study used linked administrative health data from the diverse, single-payer health system in Ontario, Canada, between January 1, 2006, and December 31, 2016. A gradient boosting decision tree model was trained on data from 1 657 395 patients, validated on 243 442 patients, and tested on 236 506 patients. Costs associated with each patient were estimated using a validated costing algorithm. Data were analyzed from January 1, 2006, to December 31, 2016. Exposures: A random sample of 2 137 343 residents of Ontario without type 2 diabetes was obtained at study start time. More than 300 features from data sets capturing demographic information, laboratory measurements, drug benefits, health care system interactions, social determinants of health, and ambulatory care and hospitalization records were compiled over 2-year patient medical histories to generate quarterly predictions. Main outcomes and measures: Discrimination was assessed using the area under the receiver operating characteristic curve statistic, and calibration was assessed visually using calibration plots. Feature contribution was assessed with Shapley values. Costs were estimated in 2020 US dollars. Results: This study trained a gradient boosting decision tree model on data from 1 657 395 patients (12 900 257 instances; 6 666 662 women [51.7%]). The developed model achieved a test area under the curve of 80.26 (range, 80.21-80.29), demonstrated good calibration, and was robust to sex, immigration status, area-level marginalization with regard to material deprivation and race/ethnicity, and low contact with the health care system. The top 5% of patients predicted as high risk by the model represented 26% of the total annual diabetes cost in Ontario. Conclusions and relevance: In this decision analytical model study, a machine learning model approach accurately predicted the incidence of diabetes in the population using routinely collected health administrative data. These results suggest that the model could be used to inform decision-making for population health planning and diabetes prevention.

Project description: Objectives: We tested the hypothesis that routine monitoring data could describe a detailed and distinct pathophysiologic phenotype of impending hypoglycemia in adult ICU patients. Design: Retrospective analysis leading to model development and validation. Setting: All ICU admissions wherein patients received insulin therapy during a 4-year period at the University of Virginia Medical Center. Each ICU was equipped with continuous physiologic monitoring systems whose signals were archived in an electronic data warehouse along with the entire medical record. Patients: Eleven thousand eight hundred forty-seven ICU patient admissions. Interventions: The primary outcome was hypoglycemia, defined as any episode of blood glucose less than 70 mg/dL where 50% dextrose injection was administered within 1 hour. We used 61 physiologic markers (including vital signs, laboratory values, demographics, and continuous cardiorespiratory monitoring variables) to inform the model. Measurements and main results: Our dataset consisted of 11,847 ICU patient admissions, 721 (6.1%) of which had one or more hypoglycemic episodes. Multivariable logistic regression analysis revealed a pathophysiologic signature of 41 independent variables that best characterized ICU hypoglycemia. The final model had a cross-validated area under the receiver operating characteristic curve of 0.83 (95% CI, 0.78-0.87) for prediction of impending ICU hypoglycemia. We externally validated the model in the Medical Information Mart for Intensive Care III critical care dataset, where it also demonstrated good performance with an area under the receiver operating characteristic curve of 0.79 (95% CI, 0.77-0.81). Conclusions: We used data from a large number of critically ill inpatients to develop and externally validate a predictive model of impending ICU hypoglycemia. Future steps include incorporating this model into a clinical decision support system and testing its effects in a multicenter randomized controlled clinical trial.
