Dataset Information

Size matters: how population size influences genotype-phenotype association studies in anonymized data.

ABSTRACT:

Objective

Electronic medical record (EMR) data are increasingly incorporated into genome-phenome association studies. Investigators hope to share data, but there are concerns that it may be "re-identified" through the exploitation of various features, such as combinations of standardized clinical codes. Formal anonymization algorithms (e.g., k-anonymization) can prevent such violations, but prior studies suggest that the size of the population available for anonymization may influence the utility of the resulting data. We systematically investigate this issue using a large-scale biorepository and EMR system, through which we evaluate the ability of researchers to learn from anonymized data for genome-phenome association studies under various conditions.
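As a rough illustration of the k-anonymity property mentioned above, the following Python sketch checks whether every combination of clinical codes in a table is shared by at least k records. The function name, the toy codes, and the choice to treat a patient's full code set as the identifying feature are assumptions made for illustration; the abstract does not describe the study's actual anonymization algorithm.

```python
from collections import Counter


def is_k_anonymous(records, k):
    """Return True if every distinct code combination occurs in at least k records.

    `records` is an iterable of code collections, one per patient (hypothetical layout).
    """
    # Order within a patient's codes is irrelevant, so compare them as sets.
    counts = Counter(frozenset(codes) for codes in records)
    return all(n >= k for n in counts.values())


# Hypothetical toy data: each record is one patient's set of clinical codes.
patients = [
    {"250.00", "401.9"},
    {"401.9", "250.00"},
    {"272.4"},
]

# The lone {"272.4"} record is unique, so the table is not 2-anonymous.
print(is_k_anonymous(patients, 2))
```

A check like this only tests the property; an anonymization algorithm would additionally generalize or suppress codes until the check passes.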

Methods

We use a k-anonymization strategy to simulate a data protection process (on data sets containing clinical codes) for resources of similar size to those found at nine academic medical institutions within the United States. Following the protection process, we replicate an existing genome-phenome association study and compare the discoveries using the protected data and the original data through the correlation (r²) of the p-values of association significance.
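The comparison step described above can be sketched as a squared Pearson correlation between two p-value vectors, one from the original data and one from the protected data. This is a minimal sketch under stated assumptions: the function name and the p-values are hypothetical, and the abstract does not say whether the study correlated raw or transformed (e.g., log-scaled) p-values.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return (sxy * sxy) / (sxx * syy)


# Hypothetical p-values for the same associations, before and after anonymization.
p_original = [0.001, 0.02, 0.30, 0.75]
p_anonymized = [0.002, 0.03, 0.28, 0.80]

print(r_squared(p_original, p_anonymized))
```

A value near 1 would indicate that the association results on the protected data closely track those on the original data.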

Results

Our investigation shows that anonymizing an entire dataset with respect to the population from which it is derived yields significantly more utility than anonymizing small study-specific datasets on their own. When evaluated using the correlation of genome-phenome association strengths on anonymized versus original data, across all nine simulated sites, results from the largest-scale anonymizations (population ∼100,000) retained better utility than those from smaller populations (∼6000-75,000). We observed a general trend of increasing r² for larger data set sizes: r²=0.9481 for small-sized datasets, r²=0.9493 for moderately-sized datasets, r²=0.9934 for large-sized datasets.

Conclusions

This research implies that regardless of the overall size of an institution's data, there may be significant benefits to anonymization of the entire EMR, even if the institution is planning on releasing only data about a specific cohort of patients.

SUBMITTER: Heatherly R

PROVIDER: S-EPMC4260994 | biostudies-literature | 2014 Dec

REPOSITORIES: biostudies-literature

Publications

Size matters: how population size influences genotype-phenotype association studies in anonymized data.

Heatherly R, Denny JC, Haines JL, Roden DM, Malin BA

Journal of biomedical informatics, 2014 Jul 16

PMID: 25038554

Similar Datasets

Autoencoder-transformed transcriptome improves genotype-phenotype association studies
| S-EPMC12356107 | biostudies-literature

A novel method for multiple phenotype association studies based on genotype and phenotype network.
| S-EPMC11111089 | biostudies-literature

GWAS Central: an expanding resource for finding and visualising genotype and phenotype data from genome-wide association studies.
| S-EPMC9825503 | biostudies-literature

PopMLvis: a tool for analysis and visualization of population structure using genotype data from genome-wide association studies.
| S-EPMC11389123 | biostudies-literature

Constructing genotype and phenotype network helps reveal disease heritability and phenome-wide association studies.
| S-EPMC12817677 | biostudies-literature

Multivariate Analysis of Genotype-Phenotype Association.
| S-EPMC4905550 | biostudies-literature

Backward genotype-transcript-phenotype association mapping.
| S-EPMC6743326 | biostudies-literature

Amelogenesis imperfecta: genotype-phenotype studies in 71 families.
| S-EPMC3178091 | biostudies-literature

Genotype and Phenotype Studies in Autosomal Dominant Retinitis Pigmentosa (adRP) of the French Canadian Founder Population.
| S-EPMC4699406 | biostudies-literature

Size matters! Association between journal size and longitudinal variability of the Journal Impact Factor.
| S-EPMC6874322 | biostudies-literature