Dataset Information

Fold-stratified cross-validation for unbiased and privacy-preserving federated learning.

ABSTRACT:

Objective

We introduce fold-stratified cross-validation, a validation methodology that is compatible with privacy-preserving federated learning and that prevents data leakage caused by duplicates of electronic health records (EHRs).

Materials and methods

Fold-stratified cross-validation complements cross-validation with an initial stratification of EHRs in folds containing patients with similar characteristics, thus ensuring that duplicates of a record are jointly present either in training or in validation folds. Monte Carlo simulations are performed to investigate the properties of fold-stratified cross-validation in the case of a model data analysis using both synthetic data and MIMIC-III (Medical Information Mart for Intensive Care-III) medical records.
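The fold assignment described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each record carries a stratifying covariate that is identical across duplicates (here a hypothetical `dob` field), and it maps that covariate to a fold with a deterministic hash. The names `fold_of` and `fold_stratified_splits` are invented for this example.

```python
import hashlib
from collections import defaultdict


def fold_of(covariate: str, n_folds: int) -> int:
    """Map a covariate value to a fold index, deterministically.

    hashlib is used instead of the built-in hash(), which is salted
    per process for strings and would differ between sites.
    """
    digest = hashlib.sha256(covariate.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


def fold_stratified_splits(records, key, n_folds):
    """Yield (train_indices, validation_indices) pairs, one per fold.

    Records sharing the same value of `key` always land in the same
    fold, so duplicates are never split between training and validation.
    """
    folds = defaultdict(list)
    for idx, rec in enumerate(records):
        folds[fold_of(rec[key], n_folds)].append(idx)
    for k in range(n_folds):
        val = folds.get(k, [])
        train = [i for j in range(n_folds) if j != k for i in folds.get(j, [])]
        yield train, val


# Illustrative, made-up records; indices 0/1 and 2/4 are duplicates.
records = [
    {"dob": "1950-01-02"},
    {"dob": "1950-01-02"},
    {"dob": "1980-07-15"},
    {"dob": "1971-03-30"},
    {"dob": "1980-07-15"},
]
splits = list(fold_stratified_splits(records, "dob", n_folds=3))
```

Because the fold depends only on the covariate value, each site in a federated setting could compute the assignment locally without exchanging records. With few distinct covariate values, some folds may be empty or unbalanced.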

Results

In situations in which duplicated EHRs could induce overoptimistic estimations of accuracy, applying fold-stratified cross-validation prevented this bias, while not requiring full deduplication. However, a pessimistic bias might appear if the covariate used for the stratification was strongly associated with the outcome.

Discussion

Although fold-stratified cross-validation presents low computational overhead, to be efficient it requires the preliminary identification of a covariate that is both shared by duplicated records and weakly associated with the outcome. When available, the hash of a personal identifier or a patient's date of birth provides such a covariate. Conversely, pseudonymization interferes with fold-stratified cross-validation, as it may break the equality of the stratifying covariate among duplicates.
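The contrast between a shared hash and pseudonymization can be sketched as follows. The setup is an assumption made for illustration only: two sites hold a duplicate of the same patient, and pseudonymization is modelled as a salted hash with a different salt per site. The identifier, the salts, and the function names are all hypothetical.

```python
import hashlib


def stable_fold(value: str, n_folds: int) -> int:
    """Derive a fold index from a covariate with an unsalted, deterministic hash."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


def site_pseudonym(identifier: str, site_salt: str) -> str:
    """Hypothetical per-site pseudonymization: a hash salted with a site secret."""
    salted = (site_salt + ":" + identifier).encode("utf-8")
    return hashlib.sha256(salted).hexdigest()


N_FOLDS = 5
patient_id = "patient-0001"  # made-up identifier present at both sites

# Same identifier, same unsalted hash: both sites derive the same fold.
fold_site_a = stable_fold(patient_id, N_FOLDS)
fold_site_b = stable_fold(patient_id, N_FOLDS)

# After site-specific pseudonymization the two copies no longer share a value,
# so nothing guarantees that they end up in the same fold.
pseudo_a = site_pseudonym(patient_id, "site-A")
pseudo_b = site_pseudonym(patient_id, "site-B")
```

Here the unsalted hash gives both sites the same fold, whereas the two pseudonyms differ, so a fold derived from them would match only by chance.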

Conclusion

Fold-stratified cross-validation is an easy-to-implement methodology that prevents data leakage when a model is trained on distributed EHRs that contain duplicates, while preserving privacy.

SUBMITTER: Bey R

PROVIDER: S-EPMC7647321 | biostudies-literature | 2020 Aug

REPOSITORIES: biostudies-literature

Publications

Fold-stratified cross-validation for unbiased and privacy-preserving federated learning.

Romain Bey, Romain Goussault, François Grolleau, Mehdi Benchoufi, Raphaël Porcher

Journal of the American Medical Informatics Association (JAMIA), 2020-08-01, issue 8

PMID: 32620945

Similar Datasets

Privacy-preserving patient clustering for personalized federated learning | S-EPMC11376435 | biostudies-literature

Collaborative and privacy-preserving cross-vendor united diagnostic imaging via server-rotating federated machine learning | S-EPMC12335533 | biostudies-literature

Splitting chemical structure data sets for federated privacy-preserving machine learning. | S-EPMC8650276 | biostudies-literature

Privacy-preserving federated neural network learning for disease-associated cell classification. | S-EPMC9122966 | biostudies-literature

Distributed cross-learning for equitable federated models - privacy-preserving prediction on data from five California hospitals. | S-EPMC11799213 | biostudies-literature

PPML-Omics: A privacy-preserving federated machine learning method protects patients' privacy in omic data. | S-EPMC10830108 | biostudies-literature

Federated deep learning for detecting COVID-19 lung abnormalities in CT: a privacy-preserving multinational validation study. | S-EPMC8007806 | biostudies-literature

Privacy-Preserving Patient Similarity Learning in a Federated Environment: Development and Analysis. | S-EPMC5924379 | biostudies-literature

Privacy-Preserving Glycemic Management in Type 1 Diabetes: Development and Validation of a Multiobjective Federated Reinforcement Learning Framework. | S-EPMC12248133 | biostudies-literature

The development and validation of a privacy-preserving model based on federated learning for diagnosing severe pediatric pneumonia. | S-EPMC12268547 | biostudies-literature