Dataset Information

Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation.


ABSTRACT:

BACKGROUND: There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them.

OBJECTIVE: The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data.

METHODS: A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this "meaningful identity disclosure risk." The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data.

RESULTS: The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively.

CONCLUSIONS: We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.

SUBMITTER: El Emam K 

PROVIDER: S-EPMC7704280 | biostudies-literature | 2020 Nov

REPOSITORIES: biostudies-literature

Publications

Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation.

Khaled El Emam, Lucy Mosquera, Jason Bass

Journal of Medical Internet Research, 2020-11-16 (11)


Similar Datasets

| S-EPMC9553223 | biostudies-literature
| S-EPMC7688496 | biostudies-literature
| S-EPMC10978408 | biostudies-literature
| S-EPMC5101035 | biostudies-literature
| S-EPMC7936723 | biostudies-literature
| S-EPMC8150694 | biostudies-literature
| S-EPMC7183129 | biostudies-literature
| S-EPMC7400044 | biostudies-literature
| S-EPMC6483879 | biostudies-literature
| S-EPMC8633456 | biostudies-literature