Dataset Information

Splitting chemical structure data sets for federated privacy-preserving machine learning.

ABSTRACT: As machine learning methods are increasingly applied in drug design and related fields, designing sound test sets becomes more and more important. The goal is a realistic split of chemical structures (compounds) between training, validation and test sets, such that performance on the test set is a meaningful indicator of performance in a prospective application. This challenge is interesting and relevant on its own, but it becomes even more complex in federated machine learning, where multiple partners jointly train a model under privacy-preserving conditions and chemical structures must not be shared between the participating parties. In this work we discuss three methods that split a data set and are applicable in a federated privacy-preserving setting: (a) locality-sensitive hashing (LSH), (b) sphere exclusion clustering, and (c) scaffold-based binning (scaffold network). To evaluate these splitting methods we consider the following quality criteria, each compared to random splitting: bias in prediction performance, classification label and data imbalance, and similarity distance between the test and training set compounds. The main findings of the paper are that (a) both sphere exclusion clustering and scaffold-based binning result in high-quality splits of the data sets, and (b) sphere exclusion clustering is computationally very expensive in the federated privacy-preserving setting.
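To make the LSH idea concrete, here is a minimal sketch of how a hash-based fold assignment could work without partners exchanging structures. It is not the paper's implementation. It assumes each compound is already represented by a binary fingerprint (given here as the set of its on-bit indices), and that all partners agree on the same bit positions, fold count and salt. The names `lsh_key`, `assign_fold`, `lsh_split`, the salt value and the toy fingerprints are hypothetical.

```python
import hashlib
from typing import Dict, FrozenSet, Sequence


def lsh_key(on_bits: FrozenSet[int], selected_bits: Sequence[int]) -> str:
    """Build the LSH signature: the values of the selected fingerprint bits."""
    return "".join("1" if bit in on_bits else "0" for bit in selected_bits)


def assign_fold(key: str, n_folds: int, salt: str = "shared-salt") -> int:
    """Map a signature to a fold deterministically.

    Every partner that uses the same salt and fold count gets the same
    fold for the same signature, without seeing other partners' data.
    """
    digest = hashlib.sha256((salt + key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_folds


def lsh_split(
    fingerprints: Dict[str, FrozenSet[int]],
    selected_bits: Sequence[int],
    n_folds: int = 5,
) -> Dict[str, int]:
    """Assign each compound ID to a fold based on its LSH signature."""
    return {
        compound_id: assign_fold(lsh_key(fp, selected_bits), n_folds)
        for compound_id, fp in fingerprints.items()
    }


# Toy fingerprints (assumed data): "a" and "b" agree on all selected bits.
fps = {
    "a": frozenset({1, 5, 9}),
    "b": frozenset({1, 5, 20}),
    "c": frozenset({2, 7}),
}
selected = [1, 2, 5, 7]  # illustrative bit positions shared by all partners
folds = lsh_split(fps, selected, n_folds=3)
```

Compounds that agree on every selected bit share a signature and therefore always land in the same fold, which keeps similar structures from being spread across training and test sets. The choice of bits controls how coarse the grouping is.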

SUBMITTER: Simm J

PROVIDER: S-EPMC8650276 | biostudies-literature | 2021 Dec

REPOSITORIES: biostudies-literature


Publications

Splitting chemical structure data sets for federated privacy-preserving machine learning.

Jaak Simm, Lina Humbeck, Adam Zalewski, Noe Sturm, Wouter Heyndrickx, Yves Moreau, Bernd Beck, Ansgar Schuffenhauer

Journal of Cheminformatics, 2021-12-07, issue 1


PMID: 34876230

Similar Datasets

Project description:

Background: Machine learning (ML) has demonstrated great potential in medical data analysis. Large datasets collected from diverse sources and settings are essential for ML models in healthcare to achieve better accuracy and generalizability. Sharing data across different healthcare institutions or jurisdictions is challenging because of complex and varying privacy and regulatory requirements. Hence, it is hard but crucial to allow multiple parties to collaboratively train an ML model on the private datasets available at each party, without sharing those datasets directly or compromising their privacy through collaboration.

Methods: In this paper, we address this challenge by proposing Decentralized, Collaborative, and Privacy-preserving ML for Multi-Hospital Data (DeCaPH). This framework offers the following key benefits: (1) it allows different parties to collaboratively train an ML model without transferring their private datasets (i.e., no data centralization); (2) it safeguards patients' privacy by limiting the potential privacy leakage arising from any contents shared across the parties during the training process; and (3) it facilitates ML model training without relying on a centralized party or server.

Findings: We demonstrate the generalizability and power of DeCaPH on three distinct tasks using real-world distributed medical datasets: patient mortality prediction using electronic health records, cell-type classification using single-cell human genomes, and pathology identification using chest radiology images. The ML models trained with the DeCaPH framework show a drop in model performance of less than 3.2% compared to those trained with the non-privacy-preserving collaborative framework. Meanwhile, the average vulnerability to privacy attacks of the models trained with DeCaPH decreased by up to 16%. In addition, models trained with the DeCaPH framework outperform models trained solely on the private datasets of individual parties without collaboration by up to 70%, and models trained with the previous privacy-preserving collaborative training framework under the same privacy guarantee by up to 18.2%.

Interpretation: We demonstrate that the ML models trained with the DeCaPH framework have an improved utility-privacy trade-off, showing that DeCaPH enables the models to perform well while preserving the privacy of the training data points. In addition, the ML models trained with the DeCaPH framework in general outperform those trained solely on the private datasets of individual parties, showing that DeCaPH enhances model generalizability.

Funding: This work was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC, RGPIN-2020-06189 and DGECR-2020-00294), Canadian Institute for Advanced Research (CIFAR) AI Catalyst Grants, CIFAR AI Chair programs, Temerty Professor of AI Research and Education in Medicine, University of Toronto, Amazon, Apple, DARPA through the GARD project, Intel, Meta, the Ontario Early Researcher Award, and the Sloan Foundation. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.

Project description:

Background: The use of wearables facilitates data collection at a previously unobtainable scale, enabling the construction of complex predictive models with the potential to improve health. However, the highly personal nature of these data requires strong privacy protection against data breaches and the use of data in a way that users do not intend. One method to protect user privacy while taking advantage of sharing data across users is federated learning, a technique that allows a machine learning model to be trained using data from all users while only storing a user's data on that user's device. By keeping data on users' devices, federated learning protects users' private data from data leaks and breaches on the researcher's central server and provides users with more control over how and when their data are used. However, there are few rigorous studies on the effectiveness of federated learning in the mobile health (mHealth) domain.

Objective: We review federated learning and assess whether it can be useful in the mHealth field, especially for addressing common mHealth challenges such as privacy concerns and user heterogeneity. The aims of this study are to describe federated learning in an mHealth context, apply a simulation of federated learning to an mHealth data set, and compare the performance of federated learning with the performance of other predictive models.

Methods: We applied a simulation of federated learning to predict the affective state of 15 subjects using physiological and motion data collected from a chest-worn device for approximately 36 minutes. We compared the results from this federated model with those from a centralized or server model and with the results from training individual models for each subject.

Results: In a 3-class classification problem using physiological and motion data to predict whether the subject was undertaking a neutral, amusing, or stressful task, the federated model achieved 92.8% accuracy on average, the server model achieved 93.2% accuracy on average, and the individual model achieved 90.2% accuracy on average.

Conclusions: Our findings support the potential for using federated learning in mHealth. The results showed that the federated model performed better than a model trained separately on each individual and nearly as well as the server model. As federated learning offers more privacy than a server model, it may be a valuable option for designing sensitive data collection methods.
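The idea of training one model while each user's data stays on that user's device can be illustrated with federated averaging (FedAvg), one common scheme. The description above does not say which algorithm the study simulated, so this is only a sketch under that assumption. The one-parameter linear model, the function names `local_update` and `fed_avg`, the learning rate and the toy client data are all hypothetical.

```python
from typing import List, Tuple

# One client's private data: (x, y) pairs that never leave the client.
Client = List[Tuple[float, float]]


def local_update(w: float, data: Client, lr: float = 0.01, epochs: int = 5) -> float:
    """Run a few gradient steps on the model y = w * x using local data only."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def fed_avg(clients: List[Client], rounds: int = 50, lr: float = 0.01) -> float:
    """Server loop: broadcast w, collect local weights, average by sample count."""
    w = 0.0
    total = sum(len(c) for c in clients)
    for _ in range(rounds):
        # Only the updated weight and the sample count are sent back,
        # never the raw (x, y) pairs.
        updates = [(local_update(w, c, lr), len(c)) for c in clients]
        w = sum(w_i * n for w_i, n in updates) / total
    return w


# Toy clients (assumed data), all generated from y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)],
]
w_global = fed_avg(clients)
```

Weighting each client's update by its sample count means clients with more data influence the global model more, which is the usual FedAvg convention. Sharing weights instead of raw data reduces exposure but does not by itself guarantee privacy.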
