Project description:The human genetics community needs robust protocols that enable secure sharing of genomic data from participants in genetic research. Beacons are web servers that answer allele-presence queries--such as "Do you have a genome that has a specific nucleotide (e.g., A) at a specific genomic position (e.g., position 11,272 on chromosome 1)?"--with either "yes" or "no." Here, we show that individuals in a beacon are susceptible to re-identification even if the only data shared include presence or absence information about alleles in a beacon. Specifically, we propose a likelihood-ratio test of whether a given individual is present in a given genetic beacon. Our test is not dependent on allele frequencies and is the most powerful test for a specified false-positive rate. Through simulations, we showed that in a beacon with 1,000 individuals, re-identification is possible with just 5,000 queries. Relatives can also be identified in the beacon. Re-identification is possible even in the presence of sequencing errors and variant-calling differences. In a beacon constructed with 65 European individuals from the 1000 Genomes Project, we demonstrated that it is possible to detect membership in the beacon with just 250 SNPs. With just 1,000 SNP queries, we were able to detect the presence of an individual genome from the Personal Genome Project in an existing beacon. Our results show that beacons can disclose membership and implied phenotypic information about participants and do not protect privacy a priori. We discuss risk mitigation through policies and standards such as not allowing anonymous pings of genetic beacons and requiring minimum beacon sizes.
Project description:Several studies underscore the potential of deep learning in identifying complex patterns, leading to diagnostic and prognostic biomarkers. Identifying sufficiently large and diverse datasets, required for training, is a significant challenge in medicine and can rarely be found in individual institutions. Multi-institutional collaborations based on centrally-shared patient data face privacy and ownership challenges. Federated learning is a novel paradigm for data-private multi-institutional collaborations, where model-learning leverages all available data without sharing data between institutions, by distributing the model-training to the data-owners and aggregating their results. We show that federated learning among 10 institutions results in models reaching 99% of the model quality achieved with centralized data, and evaluate generalizability on data from institutions outside the federation. We further investigate the effects of data distribution across collaborating institutions on model quality and learning patterns, indicating that increased access to data through data private multi-institutional collaborations can benefit model quality more than the errors introduced by the collaborative method. Finally, we compare with other collaborative-learning approaches demonstrating the superiority of federated learning, and discuss practical implementation considerations. Clinical adoption of federated learning is expected to lead to models trained on datasets of unprecedented size, hence have a catalytic impact towards precision/personalized medicine.
Project description:Making data broadly accessible is essential to creating a medical information commons (MIC). Transparency about data-sharing practices can cultivate trust among prospective and existing MIC participants. We present an analysis of 34 initiatives sharing DNA-derived data based on public information. We describe data-sharing practices captured, including practices related to consent, privacy and security, data access, oversight, and participant engagement. Our results reveal that data-sharing initiatives have some distance to go in achieving transparency.
Project description:The Cross-Institutional Clinical Translational Research project explored a federated query tool and looked at how this tool can facilitate clinical trial cohort discovery by managing access to aggregate patient data located within unaffiliated academic medical centers.The project adapted software from the Informatics for Integrating Biology and the Bedside (i2b2) program to connect three Clinical Translational Research Award sites: University of Washington, Seattle, University of California, Davis, and University of California, San Francisco. The project developed an iterative spiral software development model to support the implementation and coordination of this multisite data resource.By standardizing technical infrastructures, policies, and semantics, the project enabled federated querying of deidentified clinical datasets stored in separate institutional environments and identified barriers to engaging users for measuring utility.The authors discuss the iterative development and evaluation phases of the project and highlight the challenges identified and the lessons learned.The common system architecture and translational processes provide high-level (aggregate) deidentified access to a large patient population (>5 million patients), and represent a novel and extensible resource. Enhancing the network for more focused disease areas will require research-driven partnerships represented across all partner sites.
Project description:As sequencing prices drop, genomic data accumulates-seemingly at a steadily increasing pace. Most genomic data potentially have value beyond the initial purpose-but only if shared with the scientific community. This, of course, is often easier said than done. Some of the challenges in sharing genomic data include data volume (raw file sizes and number of files), complexities, formats, nomenclatures, metadata descriptions, and the choice of a repository. In this paper, we describe 10 quick tips for sharing open genomic data.
Project description:Data sharing anchors reproducible science, but expectations and best practices are often nebulous. Communities of funders, researchers and publishers continue to grapple with what should be required or encouraged. To illuminate the rationales for sharing data, the technical challenges and the social and cultural challenges, we consider the stakeholders in the scientific enterprise. In biomedical research, participants are key among those stakeholders. Ethical sharing requires considering both the value of research efforts and the privacy costs for participants. We discuss current best practices for various types of genomic data, as well as opportunities to promote ethical data sharing that accelerates science by aligning incentives.
Project description:The sharing of genomic data holds great promise in advancing precision medicine and providing personalized treatments and other types of interventions. However, these opportunities come with privacy concerns, and data misuse could potentially lead to privacy infringement for individuals and their blood relatives. With the rapid growth and increased availability of genomic datasets, understanding the current genome privacy landscape and identifying the challenges in developing effective privacy-protecting solutions are imperative. In this work, we provide an overview of major privacy threats identified by the research community and examine the privacy challenges in the context of emerging direct-to-consumer genetic-testing applications. We additionally present general privacy-protection techniques for genomic data sharing and their potential applications in direct-to-consumer genomic testing and forensic analyses. Finally, we discuss limitations in current privacy-protection methods, highlight possible mitigation strategies and suggest future research opportunities for advancing genomic data sharing.
Project description:Visualization is frequently used to aid our interpretation of complex datasets. Within microbial genomics, visualizing the relationships between multiple genomes as a tree provides a framework onto which associated data (geographical, temporal, phenotypic and epidemiological) are added to generate hypotheses and to explore the dynamics of the system under investigation. Selected static images are then used within publications to highlight the key findings to a wider audience. However, these images are a very inadequate way of exploring and interpreting the richness of the data. There is, therefore, a need for flexible, interactive software that presents the population genomic outputs and associated data in a user-friendly manner for a wide range of end users, from trained bioinformaticians to front-line epidemiologists and health workers. Here, we present Microreact, a web application for the easy visualization of datasets consisting of any combination of trees, geographical, temporal and associated metadata. Data files can be uploaded to Microreact directly via the web browser or by linking to their location (e.g. from Google Drive/Dropbox or via API), and an integrated visualization via trees, maps, timelines and tables provides interactive querying of the data. The visualization can be shared as a permanent web link among collaborators, or embedded within publications to enable readers to explore and download the data. Microreact can act as an end point for any tool or bioinformatic pipeline that ultimately generates a tree, and provides a simple, yet powerful, visualization method that will aid research and discovery and the open sharing of datasets.
Project description:While federated learning (FL) is promising for efficient collaborative learning without revealing local data, it remains vulnerable to white-box privacy attacks, suffers from high communication overhead, and struggles to adapt to heterogeneous models. Federated distillation (FD) emerges as an alternative paradigm to tackle these challenges, which transfers knowledge among clients instead of model parameters. Nevertheless, challenges arise due to variations in local data distributions and the absence of a well-trained teacher model, which leads to misleading and ambiguous knowledge sharing that significantly degrades model performance. To address these issues, this paper proposes a selective knowledge sharing mechanism for FD, termed Selective-FD, to identify accurate and precise knowledge from local and ensemble predictions, respectively. Empirical studies, backed by theoretical insights, demonstrate that our approach enhances the generalization capabilities of the FD framework and consistently outperforms baseline methods. We anticipate our study to enable a privacy-preserving, communication-efficient, and heterogeneity-adaptive federated training framework.
Project description:Effective data sharing is key to accelerating research to improve diagnostic precision, treatment efficacy, and long-term survival in pediatric cancer and other childhood catastrophic diseases. We present St. Jude Cloud (https://www.stjude.cloud), a cloud-based data-sharing ecosystem for accessing, analyzing, and visualizing genomic data from >10,000 pediatric patients with cancer and long-term survivors, and >800 pediatric sickle cell patients. Harmonized genomic data totaling 1.25 petabytes are freely available, including 12,104 whole genomes, 7,697 whole exomes, and 2,202 transcriptomes. The resource is expanding rapidly, with regular data uploads from St. Jude's prospective clinical genomics programs. Three interconnected apps within the ecosystem-Genomics Platform, Pediatric Cancer Knowledgebase, and Visualization Community-enable simultaneously performing advanced data analysis in the cloud and enhancing the Pediatric Cancer knowledgebase. We demonstrate the value of the ecosystem through use cases that classify 135 pediatric cancer subtypes by gene expression profiling and map mutational signatures across 35 pediatric cancer subtypes. SIGNIFICANCE: To advance research and treatment of pediatric cancer, we developed St. Jude Cloud, a data-sharing ecosystem for accessing >1.2 petabytes of raw genomic data from >10,000 pediatric patients and survivors, innovative analysis workflows, integrative multiomics visualizations, and a knowledgebase of published data contributed by the global pediatric cancer community.This article is highlighted in the In This Issue feature, p. 995.