Dataset Information

Evaluating the utility of synthetic COVID-19 case data.

ABSTRACT:

Background

Concerns about patient privacy have limited access to COVID-19 datasets. Data synthesis is one approach for making such data broadly available to the research community in a privacy protective manner.

Objectives

Evaluate the utility of synthetic data by comparing analysis results between real and synthetic data.

Methods

A gradient boosted classification tree was built to predict death using Ontario's 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics. Model accuracy and relationships were evaluated, as well as privacy risks. The same model was then developed on a synthesized dataset and compared with the model built on the original data.
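The comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the abstract does not name the software, the synthesizer, or the validation scheme, so the scikit-learn estimator, the toy data from `make_classification`, the held-out split, and the noise-based stand-in for a synthesizer are all hypothetical choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split


def fit_and_score(X_train, y_train, X_test, y_test, seed=0):
    """Fit a gradient boosted classifier and score it on held-out data."""
    model = GradientBoostingClassifier(random_state=seed)
    model.fit(X_train, y_train)
    prob = model.predict_proba(X_test)[:, 1]
    return {
        "auroc": roc_auc_score(y_test, prob),
        # Average precision is a common estimate of the area under the PR curve.
        "auprc": average_precision_score(y_test, prob),
    }


# Toy stand-in for the case records (ASSUMPTION): a rare positive outcome.
X, y = make_classification(
    n_samples=2000, n_features=8, n_informative=4,
    weights=[0.9, 0.1], random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# ASSUMPTION: a crude placeholder for a real data synthesizer. It only jitters
# the training features; a real synthesizer would generate new records.
rng = np.random.default_rng(0)
X_synth = X_train + rng.normal(scale=0.1, size=X_train.shape)
y_synth = y_train.copy()

# Both models are scored on the same held-out "real" records.
real_scores = fit_and_score(X_train, y_train, X_test, y_test)
synth_scores = fit_and_score(X_synth, y_synth, X_test, y_test)
print("real:     ", real_scores)
print("synthetic:", synth_scores)
```

Scoring both models on the same held-out real records keeps the two sets of metrics directly comparable; whether the study evaluated its models this way is not stated in the abstract.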

Results

The AUROC and AUPRC for the real data model were 0.945 (95% confidence interval [CI], 0.941-0.948) and 0.34 (95% CI, 0.313-0.368), respectively. The synthetic data model had an AUROC and AUPRC of 0.94 (95% CI, 0.936-0.944) and 0.313 (95% CI, 0.286-0.342), with confidence interval overlaps of 45.05% and 52.02%, respectively, when compared with the real data. The most important predictors of death for both the real and synthetic models were, in descending order: age, days since January 1, 2020, type of exposure, and gender. The functional relationships were similar between the two datasets. The attribute disclosure risk was 0.0585, and the membership disclosure risk was low.
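The abstract does not say how the confidence interval overlap was computed. One common definition averages the shared length of the two intervals as a fraction of each interval's width; the sketch below assumes that definition, and the function name `ci_overlap` is hypothetical. Applied to the rounded interval endpoints printed above, it gives roughly 40% for AUROC and 52% for AUPRC, close to but not identical with the reported 45.05% and 52.02%, as would be expected if the published figures were computed from unrounded endpoints.

```python
def ci_overlap(real_ci, synth_ci):
    """Average fraction of each interval covered by the shared region.

    ASSUMPTION: this is one widely used overlap definition; the abstract
    does not confirm it is the one the authors applied.
    """
    lo_r, hi_r = real_ci
    lo_s, hi_s = synth_ci
    # Length of the region common to both intervals (zero if disjoint;
    # clamping at zero is a choice made for this sketch).
    shared = max(0.0, min(hi_r, hi_s) - max(lo_r, lo_s))
    return 0.5 * (shared / (hi_r - lo_r) + shared / (hi_s - lo_s))


# Rounded 95% CI endpoints as printed in the Results paragraph.
auroc_overlap = ci_overlap((0.941, 0.948), (0.936, 0.944))
auprc_overlap = ci_overlap((0.313, 0.368), (0.286, 0.342))

print(f"AUROC CI overlap: {auroc_overlap:.1%}")
print(f"AUPRC CI overlap: {auprc_overlap:.1%}")
```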

Conclusions

This synthetic dataset could be used as a proxy for the real dataset.

SUBMITTER: El Emam K

PROVIDER: S-EPMC7936723 | biostudies-literature | 2021 Jan

REPOSITORIES: biostudies-literature

Publications

Evaluating the utility of synthetic COVID-19 case data.

El Emam K, Mosquera L, Jonker E, Sood H

JAMIA Open 2021-01-01 (1)

PMID: 33709065

Similar Datasets

Evaluating the utility of data integration with synthetic data and statistical matching
S-EPMC12402339 | biostudies-literature

Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study.
S-EPMC9030990 | biostudies-literature

Evaluating the accuracy of survey data: a case study of COVID-19 vaccination rates in Germany.
S-EPMC12541969 | biostudies-literature

Synthea™ Novel coronavirus (COVID-19) model and synthetic data set.
S-EPMC7531559 | biostudies-literature

Data monitoring committees for clinical trials evaluating treatments of COVID-19.
S-EPMC7833551 | biostudies-literature

Demonstration COVID-19 Data Hub
PRJEB47132 | ENA

A Bioconductor workflow for processing, evaluating and interpreting expression proteomics data: Case data
2023-06-29 | PXD041794 | Pride

Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: Results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C).
S-EPMC8282114 | biostudies-literature

Demonstrating an approach for evaluating synthetic geospatial and temporal epidemiologic data utility: results from analyzing >1.8 million SARS-CoV-2 tests in the United States National COVID Cohort Collaborative (N3C).
S-EPMC8992357 | biostudies-literature

Evaluating AUC estimators across complex sampling designs: insights from COVID-19 patient data
S-EPMC12333297 | biostudies-literature