Dataset Information

De-identification of clinical notes via recurrent neural network and conditional random field.

ABSTRACT: De-identification, the identification and removal of identifying information such as protected health information (PHI) from clinical data, is a critical step in enabling data to be shared or published. The 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-scale and RDOC Individualized Domains (N-GRID) clinical natural language processing (NLP) challenge includes a track on de-identifying electronic medical records (EMRs) (track 1). The challenge organizers provide 1000 annotated mental health records for this track, of which 600 are used as a training set and 400 as a test set. We develop a hybrid system for the de-identification task on the training set. First, four individual subsystems are used to identify PHI instances: a subsystem based on bidirectional LSTM (long short-term memory, a variant of recurrent neural network), a subsystem based on bidirectional LSTM with features, a subsystem based on conditional random field (CRF), and a rule-based subsystem. Then, an ensemble learning-based classifier is deployed to combine all PHI instances predicted by the three machine learning-based subsystems. Finally, the results of the ensemble learning-based classifier and the rule-based subsystem are merged. Experiments conducted on the official test set show that our system achieves the highest micro F1-scores of 93.07%, 91.43% and 95.23% under the "token", "strict" and "binary token" criteria respectively, ranking first in the 2016 CEGS N-GRID NLP challenge. In addition, on the dataset of the 2014 i2b2 NLP challenge, our system achieves the highest micro F1-scores of 96.98%, 95.11% and 98.28% under the "token", "strict" and "binary token" criteria respectively, outperforming other state-of-the-art systems. These experiments demonstrate the effectiveness of our proposed method.
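The three evaluation criteria named in the abstract can be illustrated with a small sketch. It assumes the common reading of these criteria: "strict" requires an exact match of span boundaries and PHI type, "token" compares individual tokens together with their PHI type, and "binary token" compares tokens as PHI versus non-PHI regardless of type. The span layout and all function names below are hypothetical and are not taken from the challenge's official evaluation script.

```python
from typing import Set, Tuple

# Hypothetical span layout: (doc_id, first_token, end_token_exclusive, phi_type)
Span = Tuple[str, int, int, str]


def micro_f1(gold: set, pred: set) -> float:
    """Micro F1 over pooled items: 2*TP / (|gold| + |pred|)."""
    denom = len(gold) + len(pred)
    return 2 * len(gold & pred) / denom if denom else 0.0


def strict_items(spans: Set[Span]) -> set:
    """Whole spans: boundaries and PHI type must both match."""
    return set(spans)


def token_items(spans: Set[Span]) -> set:
    """One item per covered token, keeping the PHI type."""
    return {(doc, tok, typ) for doc, start, end, typ in spans
            for tok in range(start, end)}


def binary_token_items(spans: Set[Span]) -> set:
    """One item per covered token, ignoring the PHI type."""
    return {(doc, tok) for doc, start, end, _ in spans
            for tok in range(start, end)}


# Toy data (invented for illustration only).
gold = {("d1", 0, 2, "NAME"), ("d1", 5, 6, "DATE")}
pred = {("d1", 0, 1, "NAME"), ("d1", 5, 6, "AGE")}

for name, view in [("strict", strict_items),
                   ("token", token_items),
                   ("binary token", binary_token_items)]:
    print(name, round(micro_f1(view(gold), view(pred)), 2))
```

On this toy input the strict score is lowest and the binary token score highest, which matches the ordering of the reported scores (strict < token < binary token) on both datasets.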

SUBMITTER: Liu Z

PROVIDER: S-EPMC5705329 | biostudies-literature | 2017 Nov

REPOSITORIES: biostudies-literature

Publications

De-identification of clinical notes via recurrent neural network and conditional random field.

Liu Zengjian, Tang Buzhou, Wang Xiaolong, Chen Qingcai

Journal of Biomedical Informatics, 2017-06-01

PMID: 28579533

Similar Datasets

Project description: Motivation: Human microbes play critical roles in drug development and precision medicine. How to systematically understand the complex interaction mechanism between human microbes and drugs remains a challenge nowadays. Identifying microbe-drug associations can not only provide great insights into understanding the mechanism, but also boost the development of drug discovery and repurposing. Considering the high cost and risk of biological experiments, the computational approach is an alternative choice. However, at present, few computational approaches have been developed to tackle this task. Results: In this work, we leveraged rich biological information to construct a heterogeneous network for drugs and microbes, including a microbe similarity network, a drug similarity network and a microbe-drug interaction network. We then proposed a novel graph convolutional network (GCN)-based framework for predicting human Microbe-Drug Associations, named GCNMDA. In the hidden layer of GCN, we further exploited the Conditional Random Field (CRF), which can ensure that similar nodes (i.e. microbes or drugs) have similar representations. To more accurately aggregate representations of neighborhoods, an attention mechanism was designed in the CRF layer. Moreover, we performed a random walk with restart-based scheme on both drug and microbe similarity networks to learn valuable features for drugs and microbes, respectively. Experimental results on three different datasets showed that our GCNMDA model consistently achieved better performance than seven state-of-the-art methods. Case studies for three microbes including SARS-CoV-2 and two antimicrobial drugs (i.e. Ciprofloxacin and Moxifloxacin) further confirmed the effectiveness of GCNMDA in identifying potential microbe-drug associations. Availability and implementation: Python codes and dataset are available at: https://github.com/longyahui/GCNMDA. Supplementary information: Supplementary data are available at Bioinformatics online.

Project description: Background: In recent years, mobile-based interventions have received more attention as an alternative to on-site obesity management. Despite increased mobile interventions for obesity, there are lost opportunities to achieve better outcomes due to the lack of a predictive model using current existing longitudinal and cross-sectional health data. Noom (Noom Inc) is a mobile app that provides various lifestyle-related logs including food logging, exercise logging, and weight logging. Objective: The aim of this study was to develop a weight change predictive model using an interpretable artificial intelligence algorithm for mobile-based interventions and to explore contributing factors to weight loss. Methods: Lifelog mobile app (Noom) user data of individuals who used the weight loss program for 16 weeks in the United States were used to develop an interpretable recurrent neural network algorithm for weight prediction that considers both time-variant and time-fixed variables. From a total of 93,696 users in the coaching program, we excluded users who did not take part in the 16-week weight loss program or who were not overweight or obese or had not entered weight or meal records for the entire 16-week program. This interpretable model was trained and validated with 5-fold cross-validation (training set: 70%; testing: 30%) using the lifelog data. Mean absolute percentage error between actual weight loss and predicted weight was used to measure model performance. To better understand the behavior factors contributing to weight loss or gain, we calculated contribution coefficients in test sets. Results: A total of 17,867 users' data were included in the analysis. The overall mean absolute percentage error of the model was 3.50%, and the error of the model declined from 3.78% to 3.45% by the end of the program. The time-level attention weighting was shown to be equally distributed at 0.0625 each week, but this gradually decreased (from 0.0626 to 0.0624) as it approached 16 weeks. Factors such as usage pattern, weight input frequency, meal input adherence, exercise, and sharp decreases in weight trajectories had negative contribution coefficients of -0.021, -0.032, -0.015, and -0.066, respectively. For time-fixed variables, being male had a contribution coefficient of -0.091. Conclusions: An interpretable algorithm, with both time-variant and time-fixed data, was used to precisely predict weight loss while preserving model transparency. This week-to-week prediction model is expected to improve weight loss and provide a global explanation of contributing factors, leading to better outcomes.
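The performance measure named in this description, mean absolute percentage error (MAPE), can be sketched as follows. This is a generic illustration of the metric, not the study's code; the function name and the sample values are invented, and the sketch assumes no actual value is zero.

```python
from typing import Sequence


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent.

    Assumes both sequences have the same length and no actual value is zero.
    """
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Invented sample values: two weights in kg and their predictions.
print(round(mape([100.0, 200.0], [90.0, 210.0]), 2))
```

A uniform attention weight over 16 weekly steps is 1/16 = 0.0625, which is the value the description reports as the roughly equal weekly weighting.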

Project description: The modeling of genetic interactions within a cell is crucial for a basic understanding of physiology and for applied areas such as drug design. Interactions in gene regulatory networks (GRNs) include effects of transcription factors, repressors, small metabolites, and microRNA species. In addition, the effects of regulatory interactions are not always simultaneous, but can occur after a finite time delay, or as a combined outcome of simultaneous and time delayed interactions. Powerful biotechnologies have been rapidly and successfully measuring levels of genetic expression to illuminate different states of biological systems. This has led to an ensuing challenge to improve the identification of specific regulatory mechanisms through regulatory network reconstructions. Solutions to this challenge will ultimately help to spur forward efforts based on the usage of regulatory network reconstructions in systems biology applications. We have developed a hierarchical recurrent neural network (HRNN) that identifies time-delayed gene interactions using time-course data. A customized genetic algorithm (GA) was used to optimize hierarchical connectivity of regulatory genes and a target gene. The proposed design provides a non-fully connected network with the flexibility of using recurrent connections inside the network. These features and the non-linearity of the HRNN facilitate the process of identifying temporal patterns of a GRN. Our HRNN method was implemented with the Python language. It was first evaluated on simulated data representing linear and nonlinear time-delayed gene-gene interaction models across a range of network sizes and variances of noise. We then further demonstrated the capability of our method in reconstructing GRNs of the Saccharomyces cerevisiae synthetic network for in vivo benchmarking of reverse-engineering and modeling approaches (IRMA). We compared the performance of our method to TD-ARACNE, HCC-CLINDE, TSNI and ebdbNet across different network sizes and levels of stochastic noise. We found our HRNN method to be superior in terms of accuracy for nonlinear data sets with higher amounts of noise. The proposed method identifies time-delayed gene-gene interactions of GRNs. The topology-based advancement of our HRNN worked as expected by more effectively modeling nonlinear data sets. As a non-fully connected network, an added benefit to HRNN was how it helped to find the few genes which regulated the target gene over different time delays.
