Dataset Information

Resilience of clinical text de-identified with "hiding in plain sight" to hostile reidentification attacks by human readers.

ABSTRACT:

Objective

Effective, scalable de-identification of personally identifying information (PII) for information-rich clinical text is critical to support secondary use, but no method is 100% effective. The hiding-in-plain-sight (HIPS) approach attempts to solve this "residual PII problem." HIPS replaces PII tagged by a de-identification system with realistic but fictitious (resynthesized) content, making it harder to detect remaining unredacted PII.
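
The replacement step described above can be illustrated with a minimal sketch. This is not the study's system: the function name `hips_resynthesize`, the `SURROGATES` pools, the span format, and the sample note are all hypothetical and chosen only for illustration. The sketch shows why a leaked name becomes harder to spot once the tagged names around it are replaced by realistic surrogates.

```python
import random

# Hypothetical surrogate pools (assumption); a real system would draw on
# much larger, realistic dictionaries.
SURROGATES = {
    "PATIENT": ["Maria Lopez", "John Avery", "Ellen Park"],
    "DOCTOR": ["Dr. Nguyen", "Dr. Patel", "Dr. Brooks"],
}

def hips_resynthesize(text, tagged_spans, rng):
    """Replace each tagged (start, end, pii_type) span with a fictitious surrogate."""
    out = []
    cursor = 0
    for start, end, pii_type in sorted(tagged_spans):
        out.append(text[cursor:start])                   # untouched text before the span
        out.append(rng.choice(SURROGATES[pii_type]))     # surrogate for the tagged PII
        cursor = end
    out.append(text[cursor:])                            # untouched text after the last span
    return "".join(out)

note = "Jane Roe was seen by Dr. Smith; Jane Roe reports no pain."
# Assume the tagger found the first patient name and the doctor name,
# but missed the second "Jane Roe" (the residual, leaked PII).
spans = [(0, 8, "PATIENT"), (21, 30, "DOCTOR")]
result = hips_resynthesize(note, spans, random.Random(0))
print(result)
```

In the output, the leaked "Jane Roe" sits next to surrogate names of the same kind, so a reader cannot tell from the text alone which names are real.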

Materials and methods

From 2000 representative clinical documents in each of 2 healthcare settings (4000 total), we used a novel method to generate 2 de-identified 100-document corpora (200 documents total) in which PII tagged by a typical automated machine-learned tagger was replaced by HIPS-resynthesized content. Four readers conducted aggressive reidentification attacks to isolate leaked PII: 2 readers from within the originating institution and 2 external readers.

Results

Overall, mean recall of leaked PII was 26.8% and mean precision was 37.2%. Mean recall was 9% (mean precision = 37%) for patient ages, 32% (mean precision = 26%) for dates, 25% (mean precision = 37%) for doctor names, 45% (mean precision = 55%) for organization names, and 23% (mean precision = 57%) for patient names. Recall was 32% (precision = 40%) for internal readers and 22% (precision = 33%) for external readers.

Discussion and conclusions

Approximately 70% of leaked PII "hiding" in a corpus de-identified with HIPS resynthesis is resilient to detection by human readers in a realistic, aggressive reidentification attack scenario. This is more than double the rate reported in previous studies but less than the rate reported for an attack assisted by machine learning methods.
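
For readers less familiar with the metrics, the sketch below uses the standard definitions of precision and recall. The counts `tp`, `fp`, and `fn` are hypothetical, not study data, and the abstract does not spell out how the "approximately 70%" figure was derived. Reading it as the complement of the reported 26.8% mean recall is an assumption, though the arithmetic is consistent with it.

```python
def precision(tp, fp):
    """Share of items a reader flagged as leaked PII that really were leaked PII."""
    return tp / (tp + fp)

def recall(tp, fn):
    """Share of all leaked PII items that the reader managed to flag."""
    return tp / (tp + fn)

# Hypothetical counts for one reader (assumption, not from the study):
# 30 correct flags, 50 false alarms, 70 leaked items missed.
tp, fp, fn = 30, 50, 70
p = precision(tp, fp)
r = recall(tp, fn)

# Assumed reading of the headline figure: leaked PII that escapes detection
# is the complement of the reported mean recall of 26.8%.
reported_mean_recall = 0.268
resilient_share = 1 - reported_mean_recall

print(f"precision={p:.3f} recall={r:.3f} resilient_share={resilient_share:.3f}")
```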

SUBMITTER: Carrell DS

PROVIDER: S-EPMC7647331 | biostudies-literature | 2020 Jul

REPOSITORIES: biostudies-literature

Publications

Resilience of clinical text de-identified with "hiding in plain sight" to hostile reidentification attacks by human readers.

Carrell DS, Malin BA, Cronkite DJ, Aberdeen JS, Clark C, Li MR, Bastakoty D, Nyemba S, Hirschman L

Journal of the American Medical Informatics Association : JAMIA, 2020-07-01, no. 9

PMID: 32930712

Similar Datasets

Project description:Cultivation in the laboratory is essential for understanding the phenotypic characteristics and environmental preferences of bacteria. However, basic phenotypic information is not readily accessible. Here, we compiled phenotypic and environmental tolerance information for >5,000 bacterial strains described in the International Journal of Systematic and Evolutionary Microbiology (IJSEM) with all information made publicly available in an updatable database. Although the data span 23 different bacterial phyla, most entries described aerobic, mesophilic, neutrophilic strains from Proteobacteria (mainly Alpha- and Gammaproteobacteria), Actinobacteria, Firmicutes, and Bacteroidetes isolated from soils, marine habitats, and plants. Most of the routinely measured traits tended to show a significant phylogenetic signal, although this signal was weak for environmental preferences. We demonstrated how this database could be used to link genomic attributes to differences in pH and salinity optima. We found that adaptations to high salinity or high-pH conditions are related to cell surface transporter genes, along with previously uncharacterized genes that might play a role in regulating environmental tolerances. Together, this work highlights the utility of this database for associating bacterial taxonomy, phylogeny, or specific genes to measured phenotypic traits and emphasizes the need for more comprehensive and consistent measurements of traits across a broader diversity of bacteria. IMPORTANCE Cultivation in the laboratory is key for understanding the phenotypic characteristics, growth requirements, metabolism, and environmental preferences of bacteria. However, oftentimes, phenotypic information is not easily accessible. Here, we compiled phenotypic and environmental tolerance information for >5,000 bacterial strains described in the International Journal of Systematic and Evolutionary Microbiology (IJSEM). 
We demonstrate how this database can be used to link bacterial taxonomy, phylogeny, or specific genes to measured phenotypic traits and environmental preferences. The phenotypic database can be freely accessed (https://doi.org/10.6084/m9.figshare.4272392), and we have included instructions for researchers interested in adding new entries or curating existing ones.
