Dataset Information

Improved de-identification of physician notes through integrative modeling of both public and private medical text.

ABSTRACT:

Background

Physician notes routinely recorded during patient care represent a vast and underutilized resource for human disease studies on a population scale. Their use in research is primarily limited by the need to separate confidential patient information from clinical annotations, a process that is resource-intensive when performed manually. This study seeks to create an automated method for de-identifying physician notes that does not require large amounts of private information: in addition to training a model to recognize Protected Health Information (PHI) within private physician notes, we reverse the problem and train a model to recognize non-PHI words and phrases that appear in public medical texts.

Methods

Public and private medical text sources were analyzed to distinguish common medical words and phrases from Protected Health Information. Patient identifiers are generally nouns and numbers that appear infrequently in medical literature. To quantify this relationship, term frequencies and part-of-speech tags were compared between journal publications and physician notes. Standard medical concepts and phrases were then examined across ten medical dictionaries. Lists and rules were included from the US census database and previously published studies. In total, 28 features were used to train decision tree classifiers.
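As a rough illustration of the frequency comparison described above, the sketch below scores each token in a note by how rare it is in a public corpus. It is not the authors' feature set: the toy sentences, the function names `term_frequencies` and `public_rarity`, and the `floor` value for unseen tokens are all assumptions made for this example.

```python
from collections import Counter
import math


def term_frequencies(documents):
    """Return the relative frequency of each lowercased token across documents."""
    counts = Counter(tok.lower() for doc in documents for tok in doc.split())
    total = sum(counts.values())
    return {tok: n / total for tok, n in counts.items()}


def public_rarity(token, public_tf, floor=1e-6):
    """Negative log frequency of a token in the public corpus.

    Higher values mean the token is rarer in public text. Tokens never
    seen in the public corpus fall back to the assumed `floor` frequency.
    """
    return -math.log(public_tf.get(token.lower(), floor))


# Invented stand-ins for journal text and a physician note.
public_docs = [
    "the patient had an elevated white blood cell count",
    "white blood cell count was elevated in the cohort",
]
note = "Smith had an elevated white blood cell count"

public_tf = term_frequencies(public_docs)
scores = {tok: public_rarity(tok, public_tf) for tok in note.split()}

# The token with the highest score is the least common in public text.
most_suspicious = max(scores, key=scores.get)
```

In this toy case the surname is the only token absent from the public text, so it receives the highest rarity score. A score like this would be one input among many to a classifier, not a decision rule on its own.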

Results

The model successfully recalled 98% of PHI tokens from 220 discharge summaries. Cost-sensitive classification was used to weight recall over precision (98% F10 score, 76% F1 score). More than half of the false negatives were the word "of" appearing in a hospital name. All patient names, phone numbers, and home addresses were at least partially redacted. Medical concepts such as "elevated white blood cell count" were informative for de-identification. The results exceed the previously approved criteria established by four Institutional Review Boards.
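The F10 and F1 scores reported above are both instances of the standard F-beta measure, which weights recall beta times as heavily as precision. The sketch below shows the formula; the precision and recall values are hypothetical and chosen only to illustrate the effect, not taken from the study.

```python
def f_beta(precision, recall, beta):
    """Weighted harmonic mean of precision and recall (F-beta score)."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical values for illustration only.
precision, recall = 0.60, 0.98

f1 = f_beta(precision, recall, beta=1)    # precision and recall weighted equally
f10 = f_beta(precision, recall, beta=10)  # recall weighted far more heavily
```

With a large beta the score stays close to recall, while F1 is pulled down toward the lower precision. That is why a recall-oriented de-identifier can report a high F10 alongside a much lower F1.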

Conclusions

The results indicate that distributional differences between private and public medical text can be used to accurately classify PHI. The data and algorithms reported here are made freely available for evaluation and improvement.

SUBMITTER: McMurry AJ

PROVIDER: S-EPMC3907029 | biostudies-literature | 2013 Oct

REPOSITORIES: biostudies-literature


Publications

Improved de-identification of physician notes through integrative modeling of both public and private medical text.

McMurry, Andrew J; Fitch, Britt; Savova, Guergana; Kohane, Isaac S; Reis, Ben Y

BMC Medical Informatics and Decision Making, 2013 Oct 2


PMID: 24083569

Similar Datasets

Project description:

Background: Text-based patient medical records are a vital resource in medical research. In order to preserve patient confidentiality, however, the U.S. Health Insurance Portability and Accountability Act (HIPAA) requires that protected health information (PHI) be removed from medical records before they can be disseminated. Manual de-identification of large medical record databases is prohibitively expensive, time-consuming and prone to error, necessitating automatic methods for large-scale, automated de-identification.

Methods: We describe an automated Perl-based de-identification software package that is generally usable on most free-text medical records, e.g., nursing notes, discharge summaries, X-ray reports, etc. The software uses lexical look-up tables, regular expressions, and simple heuristics to locate both HIPAA PHI, and an extended PHI set that includes doctors' names and years of dates. To develop the de-identification approach, we assembled a gold standard corpus of re-identified nursing notes with real PHI replaced by realistic surrogate information. This corpus consists of 2,434 nursing notes containing 334,000 words and a total of 1,779 instances of PHI taken from 163 randomly selected patient records. This gold standard corpus was used to refine the algorithm and measure its sensitivity. To test the algorithm on data not used in its development, we constructed a second test corpus of 1,836 nursing notes containing 296,400 words. The algorithm's false negative rate was evaluated using this test corpus.

Results: Performance evaluation of the de-identification software on the development corpus yielded an overall recall of 0.967, precision value of 0.749, and fallout value of approximately 0.002. On the test corpus, a total of 90 instances of false negatives were found, or 27 per 100,000 word count, with an estimated recall of 0.943. Only one full date and one age over 89 were missed. No patient names were missed in either corpus.

Conclusion: We have developed a pattern-matching de-identification system based on dictionary look-ups, regular expressions, and heuristics. Evaluation based on two different sets of nursing notes collected from a U.S. hospital suggests that, in terms of recall, the software out-performs a single human de-identifier (0.81) and performs at least as well as a consensus of two human de-identifiers (0.94). The system is currently tuned to de-identify PHI in nursing notes and discharge summaries but is sufficiently generalized and can be customized to handle text files of any format. Although the accuracy of the algorithm is high, it is probably insufficient to be used to publicly disseminate medical data. The open-source de-identification software and the gold standard re-identified corpus of medical records have therefore been made available to researchers via the PhysioNet website to encourage improvements in the algorithm.
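The combination of look-up tables and regular expressions described in this project can be sketched in a few lines. The example below is written in Python rather than Perl and is not the PhysioNet package: the name list, the two patterns, the placeholder labels, and the sample note are all hypothetical and far simpler than a real rule set would need to be.

```python
import re

# Hypothetical look-up table and patterns, for illustration only.
KNOWN_NAMES = {"smith", "jones"}
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def deidentify(text):
    """Replace pattern matches and listed names with bracketed placeholders."""
    # Pass 1: regular expressions for structured identifiers.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)

    # Pass 2: dictionary look-up for each alphabetic word.
    def mask_name(match):
        word = match.group(0)
        return "[NAME]" if word.lower() in KNOWN_NAMES else word

    return re.sub(r"[A-Za-z]+", mask_name, text)


note = "Pt Smith seen 3/14/2009, call 555-123-4567 with results."
redacted = deidentify(note)
```

A rule-based pass like this only catches names that are in the list and formats that the patterns anticipate, which is one reason such systems are evaluated primarily on recall.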

Project description:

Importance: Limited evidence exists on salary differences between male and female academic physicians, largely owing to difficulty obtaining data on salary and factors influencing salary. Existing studies have been limited by reliance on survey-based approaches to measuring sex differences in earnings, lack of contemporary data, small sample sizes, or limited geographic representation.

Objective: To analyze sex differences in earnings among US academic physicians.

Design, setting, and participants: Freedom of Information laws mandate release of salary information of public university employees in several states. In 12 states with salary information published online, salary data were extracted on 10 241 academic physicians at 24 public medical schools. These data were linked to a unique physician database with detailed information on sex, age, years of experience, faculty rank, specialty, scientific authorship, National Institutes of Health funding, clinical trial participation, and Medicare reimbursements (proxy for clinical revenue). Sex differences in salary were estimated after adjusting for these factors.

Exposures: Physician sex.

Main outcomes and measures: Annual salary.

Results: Among 10 241 physicians, female physicians (n = 3549) had lower mean (SD) unadjusted salaries than male physicians ($206 641 [$88 238] vs $257 957 [$137 202]; absolute difference, $51 315 [95% CI, $46 330-$56 301]). Sex differences persisted after multivariable adjustment ($227 783 [95% CI, $224 117-$231 448] vs $247 661 [95% CI, $245 065-$250 258] with an absolute difference of $19 878 [95% CI, $15 261-$24 495]). Sex differences in salary varied across specialties, institutions, and faculty ranks. For example, adjusted salaries of female full professors ($250 971 [95% CI, $242 307-$259 635]) were comparable to those of male associate professors ($247 212 [95% CI, $241 850-$252 575]). Among specialties, adjusted salaries were highest in orthopedic surgery ($358 093 [95% CI, $344 354-$371 831]), surgical subspecialties ($318 760 [95% CI, $311 030-$326 491]), and general surgery ($302 666 [95% CI, $294 060-$311 272]) and lowest in infectious disease, family medicine, and neurology (mean income, <$200 000). Years of experience, total publications, clinical trial participation, and Medicare payments were positively associated with salary.

Conclusions and relevance: Among physicians with faculty appointments at 24 US public medical schools, significant sex differences in salary exist even after accounting for age, experience, specialty, faculty rank, and measures of research productivity and clinical revenue.

Project description:

Background: Since 1992 ART clinics have been required to report outcome data. Our objective was to assess practitioners' opinions of the impact of public reporting of assisted reproductive technology (ART) outcomes on treatment strategies, medical decision-making, and fellow training.

Methods: Survey study performed in an academic medical center. Members of the Society of Reproductive Endocrinology and Infertility and the Society of Reproductive Surgery were recruited to participate in an online survey in April 2012. Categorical survey responses were expressed as percentages. Written responses were categorized according to common themes regarding effects of reporting on participants' medical management of patients. The study was primarily qualitative and was not powered to make statistical conclusions.

Results: Of 1019 surveys sent, 323 participants (31.7%) responded from around the United States, and 275 provided complete data. Nearly all (273 of 282; 96.8%) participants responded that public reporting sometimes or always affected other providers' practices, and 264 of 281 (93.9%) responded that other practitioners were motivated to deny care to poor-prognosis patients to improve reported success rates. However, only 121 of 282 (42.9%) indicated that public reporting influenced their own medical management. The majority of respondents agreed that public reporting may hinder adoption of single embryo transfer practices (194 of 299; 64.9%) and contribute to the persistent rate of twinning in in vitro fertilization (187 of 279; 67%). A small majority (153 of 279; 54.8%) felt that public reporting did not benefit fellow training, and 58 (61.7%) of the 94 participants who trained fellows believed that having fellows perform embryo transfers reduced pregnancy rates. A small majority (163 of 277; 58.8%) of respondents reported their ART success rates on clinical websites. However, the majority (200 of 275; 72.7%) of respondents compared their success rates with those of other clinics. Finally, most respondents (211 of 277; 76%) believed that most centers that advertised their success rates did so in ways that were misleading to patients.

Conclusions: Public reporting of ART clinical outcomes is intended to drive improvement, promote trust between patients and providers, and inform consumers and payers. However, providers reported that they modified their practices, felt others denied care to poor-prognosis patients, and limited participation of trainees in procedures in response to public reporting of ART outcomes.
