Dataset Information

Building a best-in-class automated de-identification tool for electronic health records through ensemble learning.

ABSTRACT: The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries.
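
The detect-then-replace flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' system: the two detectors, the regular expressions, the name list, and the surrogate values are all hypothetical stand-ins, and a simple dictionary matcher takes the place of the attention-based models. The sketch only shows how an ensemble can take the union of several detectors' spans (favouring recall) and then substitute fictional surrogates.

```python
import re

# Hypothetical rule-based patterns; the real system's rules are not given in the abstract.
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
DATE_RE = re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b")


def rule_detector(text):
    """Return (start, end, label) spans found by the regular expressions."""
    spans = set()
    for label, pattern in (("PHONE", PHONE_RE), ("DATE", DATE_RE)):
        for match in pattern.finditer(text):
            spans.add((match.start(), match.end(), label))
    return spans


def dictionary_detector(text, names=("Smith", "Jones")):
    """Stand-in for a learned model: flags names from a tiny hypothetical list."""
    spans = set()
    for name in names:
        for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            spans.add((match.start(), match.end(), "NAME"))
    return spans


def ensemble_detect(text, detectors):
    """Union of every detector's spans, so one detector's miss can be covered by another."""
    spans = set()
    for detector in detectors:
        spans |= detector(text)
    return spans


def replace_with_surrogates(text, spans, surrogates):
    """Replace spans right to left so earlier offsets stay valid.

    Assumes the spans do not overlap.
    """
    for start, end, label in sorted(spans, reverse=True):
        text = text[:start] + surrogates.get(label, "[REDACTED]") + text[end:]
    return text


note = "Mr. Smith called 555-123-4567 on 3/14/2020."
found = ensemble_detect(note, [rule_detector, dictionary_detector])
fake = {"NAME": "Doe", "PHONE": "555-000-0000", "DATE": "1/1/2000"}
print(replace_with_surrogates(note, found, fake))
```

Taking the union of detectors trades some precision for recall, which matches the priority in de-identification, where a missed identifier is costlier than an over-redacted token.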

SUBMITTER: Murugadoss K

PROVIDER: S-EPMC8212138 | biostudies-literature

REPOSITORIES: biostudies-literature


Similar Datasets

Project description: Diabetes is a metabolic disorder that affects more than 420 million people worldwide and is caused by a high level of sugar in the blood over a long period. Diabetes can have serious long-term health consequences, such as cardiovascular disease, stroke, chronic kidney disease, foot ulcers, retinopathy, and others. Although common, the disease is hard to detect because it often presents with no symptoms. Especially for type 2 diabetes, which occurs mainly in adults, knowing how long a patient has had diabetes can strongly affect the treatment they receive. This information, although pivotal, may be absent: for some patients the year of diagnosis is known, but the year of disease onset is not. In this context, machine learning applied to electronic health records can be an effective tool for predicting the past duration of diabetes for a patient. In this study, we applied a regression analysis based on several computational intelligence methods to a dataset of electronic health records of 73 patients with type 1 diabetes with 20 variables, and to another dataset of records of 400 patients with type 2 diabetes with 49 variables. Among the algorithms applied, Random Forests outperformed the others and efficiently predicted diabetes duration for both cohorts, with regression performance measured through the coefficient of determination R². Afterwards, we applied the same method for feature ranking and identified the factors in the clinical records most strongly correlated with past diabetes duration: age, insulin intake, and body-mass index. Our findings can have a profound impact on clinical practice: when information about the duration of a patient's diabetes is missing, medical doctors can use our tool and focus on age, insulin intake, and body-mass index to infer it. Regarding limitations, we were unable to find an additional dataset of EHRs of patients with diabetes having the same variables as the two analyzed here, so we could not verify our findings on a validation cohort.
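
For readers unfamiliar with the coefficient of determination used above, the following is a minimal sketch of its standard definition, R² = 1 - SS_res / SS_tot. The function name and the sample durations are hypothetical and are not taken from the study.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    return 1 - ss_res / ss_tot


# Hypothetical diabetes durations in years (observed vs. predicted).
observed = [2, 4, 6, 8]
predicted = [3, 4, 6, 7]
print(round(r_squared(observed, predicted), 3))
```

A value of 1 means perfect prediction, and 0 means the model does no better than always predicting the mean of the observed durations.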

Project description: BACKGROUND: Adverse events (AEs) in health care entail substantial burdens to health care systems, institutions, and patients. Retrospective trigger tools are often manually applied to detect AEs, although automated approaches using electronic health records may offer real-time adverse event detection, allowing timely corrective interventions. OBJECTIVE: The aim of this systematic review was to describe current study methods and challenges regarding the use of automatic trigger tool-based adverse event detection methods in electronic health records. In addition, we aimed to appraise the designs of the applied studies and to synthesize estimates of adverse event prevalence and diagnostic test accuracy of automatic detection methods, using the manual trigger tool as a reference standard. METHODS: PubMed, EMBASE, CINAHL, and the Cochrane Library were queried. We included observational studies applying trigger tools in acute care settings and excluded studies using nonhospital and outpatient settings. Eligible articles were divided into diagnostic test accuracy studies and prevalence studies. We derived the study prevalence and estimates for the positive predictive value. We assessed bias risks and applicability concerns using the Quality Assessment tool for Diagnostic Accuracy Studies-2 (QUADAS-2) for diagnostic test accuracy studies and an in-house developed tool for prevalence studies. RESULTS: A total of 11 studies met all criteria: 2 concerned diagnostic test accuracy and 9 prevalence. We judged several studies to be at high risk of bias for their automated detection method, definition of outcomes, and type of statistical analyses. Across all 11 studies, adverse event prevalence ranged from 0% to 17.9%, with a median of 0.8%. The positive predictive value of all triggers to detect adverse events ranged from 0% to 100% across studies, with a median of 40%. Some triggers had wide-ranging positive predictive values: (1) in 6 studies, hypoglycemia had a positive predictive value ranging from 15.8% to 60%; (2) in 5 studies, naloxone had a positive predictive value ranging from 20% to 91%; (3) in 4 studies, flumazenil had a positive predictive value ranging from 38.9% to 83.3%; and (4) in 4 studies, protamine had a positive predictive value ranging from 0% to 60%. We were unable to determine the adverse event prevalence, positive predictive value, preventability, and severity in 40.4%, 10.5%, 71.1%, and 68.4% of the studies, respectively. These studies did not report the overall number of records analyzed, triggers, or adverse events; or the studies did not conduct the analysis. CONCLUSIONS: We observed broad interstudy variation in reported adverse event prevalence and positive predictive value. The lack of sufficiently described methods led to difficulties regarding interpretation. To improve quality, we see the need for a set of recommendations to endorse optimal use of research designs and adequate reporting of future adverse event detection studies.

Project description: Background: Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming, and prone to error, necessitating automatic methods for large-scale de-identification. Methods: We describe an automated Perl-based de-identification software package that is generally usable on most free-text medical records, e.g., nursing notes, discharge summaries, X-ray reports, etc. The software uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI and an extended PHI set that includes doctors' names and years of dates. To develop the de-identification approach, we assembled a gold standard corpus of re-identified nursing notes with real PHI replaced by realistic surrogate information. This corpus consists of 2,434 nursing notes containing 334,000 words and a total of 1,779 instances of PHI taken from 163 randomly selected patient records. This gold standard corpus was used to refine the algorithm and measure its sensitivity. To test the algorithm on data not used in its development, we constructed a second test corpus of 1,836 nursing notes containing 296,400 words. The algorithm's false negative rate was evaluated using this test corpus. Results: Performance evaluation of the de-identification software on the development corpus yielded an overall recall of 0.967, a precision of 0.749, and a fallout of approximately 0.002. On the test corpus, a total of 90 instances of false negatives were found, or 27 per 100,000 word count, with an estimated recall of 0.943. Only one full date and one age over 89 were missed. No patient names were missed in either corpus. Conclusion: We have developed a pattern-matching de-identification system based on dictionary look-ups, regular expressions, and heuristics. Evaluation based on two different sets of nursing notes collected from a U.S. hospital suggests that, in terms of recall, the software outperforms a single human de-identifier (0.81) and performs at least as well as a consensus of two human de-identifiers (0.94). The system is currently tuned to de-identify PHI in nursing notes and discharge summaries but is sufficiently generalized and can be customized to handle text files of any format. Although the accuracy of the algorithm is high, it is probably insufficient to be used to publicly disseminate medical data. The open-source de-identification software and the gold standard re-identified corpus of medical records have therefore been made available to researchers via the PhysioNet website to encourage improvements in the algorithm.
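
The recall, precision, and fallout figures quoted above follow the standard confusion-matrix definitions, which can be sketched as below. The function name and the counts in the example are hypothetical and chosen only for easy arithmetic; they are not the study's data.

```python
def deid_metrics(tp, fp, fn, tn):
    """Return (recall, precision, fallout) from token-level confusion counts.

    tp: PHI tokens correctly flagged      fp: non-PHI tokens wrongly flagged
    fn: PHI tokens missed                 tn: non-PHI tokens correctly left alone
    """
    recall = tp / (tp + fn) if (tp + fn) else 0.0      # share of PHI that was caught
    precision = tp / (tp + fp) if (tp + fp) else 0.0   # share of flags that were PHI
    fallout = fp / (fp + tn) if (fp + tn) else 0.0     # share of non-PHI wrongly flagged
    return recall, precision, fallout


# Hypothetical counts for illustration only.
recall, precision, fallout = deid_metrics(tp=90, fp=30, fn=10, tn=9970)
print(f"recall={recall:.3f} precision={precision:.3f} fallout={fallout:.4f}")
```

Because non-PHI tokens vastly outnumber PHI tokens in clinical text, fallout can stay very small even when precision is modest, which is consistent with the pattern of the figures reported above.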

Project description: BACKGROUND: Circulating biomarkers can facilitate diagnosis and risk stratification for complex conditions such as heart failure (HF). Newer molecular platforms can accelerate biomarker discovery, but they require significant resources for data and sample acquisition. OBJECTIVES: The purpose of this study was to test a pragmatic biomarker discovery strategy integrating automated clinical biobanking with proteomics. METHODS: Using the electronic health record, the authors identified patients with and without HF, retrieved their discarded plasma samples, and screened these specimens using a DNA aptamer-based proteomic platform (1,129 proteins). Candidate biomarkers were validated in 3 different prospective cohorts. RESULTS: In an automated manner, plasma samples from 1,315 patients (31% with HF) were collected. Proteomic analysis of a 96-patient subset identified 9 candidate biomarkers (p < 4.42 × 10⁻⁵). Two proteins, angiopoietin-2 and thrombospondin-2, were associated with HF in 3 separate validation cohorts. In an emergency department-based registry of 852 dyspneic patients, the 2 biomarkers improved discrimination of acute HF compared with a clinical score (p < 0.0001) or clinical score plus B-type natriuretic peptide (p = 0.02). In a community-based cohort (n = 768), both biomarkers predicted incident HF independent of traditional risk factors and N-terminal pro-B-type natriuretic peptide (hazard ratio per SD increment: 1.35 [95% confidence interval: 1.14 to 1.61; p = 0.0007] for angiopoietin-2, and 1.37 [95% confidence interval: 1.06 to 1.79; p = 0.02] for thrombospondin-2). Among 30 advanced HF patients, concentrations of both biomarkers declined (80% to 84%) following cardiac transplant (p < 0.001 for both). CONCLUSIONS: A novel strategy integrating electronic health records, discarded clinical specimens, and proteomics identified 2 biomarkers that robustly predict HF across diverse clinical settings. This approach could accelerate biomarker discovery for many diseases.

