Project description: With an aging patient population and increasing complexity in patient disease trajectories, physicians are often met with complex patient histories from which clinical decisions must be made. With rising rates of adverse events and hospitals facing financial penalties for readmission, there has never been a greater need to support evidence-led medical decision-making using available health care data. In the present work, we studied a cohort of 7,741 patients, of whom 4,080 were diagnosed with cancer, surgically treated at a University Hospital in the years 2004-2012. We developed a methodology that allows disease trajectories of the cancer patients to be estimated from free text in electronic health records (EHRs). Using these disease trajectories, we predict 80% of patient events ahead of time. By controlling for confounders across 8,326 quantified events, we identified 557 events that confer high subsequent risk (risk > 20%), including six events for cancer and seven events for metastasis. We believe that the presented methodology and findings could be used to improve clinical decision support and personalize trajectories, thereby decreasing adverse events and optimizing cancer treatment.
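A minimal sketch of the core risk estimation described above, under assumptions of my own (it omits the confounder control the study performs, and the data layout and thresholds are illustrative): for each ordered event pair (A, B) in per-patient trajectories mined from notes, estimate the risk that a patient with A later records B, and keep pairs above a 20% threshold.

```python
# Hypothetical sketch: risk of a subsequent event B given a prior event A,
# computed from per-patient event sequences (date, event_code).
from collections import defaultdict
from itertools import combinations

def subsequent_event_risks(trajectories, min_exposed=50, risk_threshold=0.20):
    """trajectories: dict of patient_id -> list of (date, event_code) tuples."""
    exposed = defaultdict(set)      # event A -> patients who ever had A
    progressed = defaultdict(set)   # (A, B) -> patients with A later followed by B
    for pid, events in trajectories.items():
        events = sorted(events)     # chronological order
        for _, code in events:
            exposed[code].add(pid)
        for (_, a), (_, b) in combinations(events, 2):   # B occurs on or after A
            if a != b:
                progressed[(a, b)].add(pid)
    return {
        (a, b): len(pids) / len(exposed[a])
        for (a, b), pids in progressed.items()
        if len(exposed[a]) >= min_exposed
        and len(pids) / len(exposed[a]) > risk_threshold
    }
```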
Project description:IntroductionLow vision rehabilitation improves quality-of-life for visually impaired patients, but referral rates fall short of national guidelines. Automatically identifying, from electronic health records (EHR), patients with poor visual prognosis could allow targeted referrals to low vision services. The purpose of this study was to build and evaluate deep learning models that integrate EHR data that is both structured and free-text to predict visual prognosis.MethodsWe identified 5547 patients with low vision (defined as best documented visual acuity (VA) less than 20/40) on ≥ 1 encounter from EHR from 2009 to 2018, with ≥ 1 year of follow-up from the earliest date of low vision, who did not improve to greater than 20/40 over 1 year. Ophthalmology notes on or prior to the index date were extracted. Structured data available from the EHR included demographics, billing and procedure codes, medications, and exam findings including VA, intraocular pressure, corneal thickness, and refraction. To predict whether low vision patients would still have low vision a year later, we developed and compared deep learning models that used structured inputs and free-text progress notes. We compared three different representations of progress notes, including 1) using previously developed ophthalmology domain-specific word embeddings, and representing medical concepts from notes as 2) named entities represented by one-hot vectors and 3) named entities represented as embeddings. Standard performance metrics including area under the receiver operating curve (AUROC) and F1 score were evaluated on a held-out test set.ResultsAmong the 5547 low vision patients in our cohort, 40.7% (N = 2258) never improved to better than 20/40 over one year of follow-up. Our single-modality deep learning model based on structured inputs was able to predict low vision prognosis with AUROC of 80% and F1 score of 70%. Deep learning models utilizing named entity recognition achieved an AUROC of 79% and F1 score of 63%. Deep learning models further augmented with free-text inputs using domain-specific word embeddings, were able to achieve AUROC of 82% and F1 score of 69%, outperforming all single- and multiple-modality models representing text with biomedical concepts extracted through named entity recognition pipelines.DiscussionFree text progress notes within the EHR provide valuable information relevant to predicting patients' visual prognosis. We observed that representing free-text using domain-specific word embeddings led to better performance than representing free-text using extracted named entities. The incorporation of domain-specific embeddings improved the performance over structured models, suggesting that domain-specific text representations may be especially important to the performance of predictive models in highly subspecialized fields such as ophthalmology.
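An illustrative sketch (not the authors' architecture; embedding source, feature counts and layer sizes are assumptions) of the kind of fusion model described: a note representation built from mean-pooled domain-specific word embeddings is concatenated with structured EHR features and fed to a small feed-forward classifier.

```python
# Hypothetical fusion classifier for 1-year low vision prognosis.
import torch
import torch.nn as nn

class FusionPrognosisModel(nn.Module):
    def __init__(self, embed_dim=300, n_structured=64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + n_structured, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, 1),   # logit for "still low vision at 1 year"
        )

    def forward(self, note_embedding, structured):
        # note_embedding: (batch, embed_dim) mean-pooled word vectors from notes
        # structured: (batch, n_structured) demographics, codes, VA, IOP, etc.
        x = torch.cat([note_embedding, structured], dim=1)
        return self.net(x)

model = FusionPrognosisModel()
logits = model(torch.randn(8, 300), torch.randn(8, 64))
probs = torch.sigmoid(logits)   # predicted probability of poor visual prognosis
```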
Project description: Background A vast amount of potentially useful information, such as descriptions of patient symptoms and family and social history, is recorded as free-text notes in electronic health records (EHRs) but is difficult to reliably extract at scale, limiting its utility in research. This study aims to assess whether an "out of the box" implementation of open-source large language models (LLMs), without any fine-tuning, can accurately extract social determinants of health (SDoH) data from free-text clinical notes. Methods We conducted a cross-sectional study using EHR data from the Mass General Brigham (MGB) system, analyzing free-text notes for SDoH information. We selected a random sample of 200 patients and manually labeled nine SDoH aspects. Eight advanced open-source LLMs were evaluated against a baseline pattern-matching model. Two human reviewers provided the manual labels, achieving 93% inter-annotator agreement. LLM performance was assessed using accuracy metrics for overall, mentioned, and non-mentioned SDoH, and macro F1 scores. Results LLMs outperformed the baseline pattern-matching approach, particularly for explicitly mentioned SDoH, achieving up to 40% higher accuracy on mentioned SDoH. openchat_3.5 was the best-performing model, surpassing the baseline in overall accuracy across all nine SDoH aspects. A refined pipeline with prompt engineering reduced hallucinations and improved accuracy. Conclusions Open-source LLMs are effective and scalable tools for extracting SDoH from unstructured EHRs, surpassing traditional pattern-matching methods. Further refinement and domain-specific training could enhance their utility in clinical research and predictive analytics, improving healthcare outcomes and addressing health disparities.
Project description: Advance care planning (ACP) discussions seek to guide future serious illness care. These discussions may be recorded in the electronic health record by documentation in clinical notes, structured forms and directives, and physician orders. Yet most studies of ACP prevalence have examined only structured electronic health record elements and ignored data existing in notes. We sought to investigate the relative comprehensiveness and accuracy of ACP documentation from structured and unstructured electronic health record data sources. We evaluated structured and unstructured ACP documentation present in the electronic health records of 435 patients with cancer drawn from three separate healthcare systems. We extracted structured ACP documentation by manually annotating written documents and forms scanned into the electronic health record. We coded unstructured ACP documentation using rule-based natural language processing software that identified ACP keywords within clinical notes; its output was subsequently reviewed for accuracy. The unstructured approach identified more instances of ACP documentation (238 patients, 54.7%) than the structured ACP approach (187 patients, 42.9%). Additionally, 16.6% of all patients with structured ACP documentation had only documents that were judged as misclassified, incomplete, blank, unavailable, or a duplicate of a previously entered erroneous document. ACP documents scanned into electronic health records represent a limited view of ACP activity. Research and measures of clinical practice with ACP should incorporate information from unstructured data.
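A hypothetical sketch of the rule-based keyword pass described above; the keyword list is illustrative, not the study's actual lexicon, and the manual accuracy review performed in the study is not replaced by this code.

```python
# Hypothetical rule-based flagging of clinical notes containing ACP-related keywords.
import re

ACP_KEYWORDS = [
    r"advance care plan(?:ning)?", r"goals of care", r"advance directive",
    r"living will", r"health ?care proxy", r"\bPOLST\b", r"do not resuscitate", r"\bDNR\b",
]
ACP_PATTERN = re.compile("|".join(ACP_KEYWORDS), flags=re.IGNORECASE)

def notes_with_acp(notes):
    """notes: iterable of (note_id, text); returns ids of notes with an ACP mention."""
    return [note_id for note_id, text in notes if ACP_PATTERN.search(text)]
```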
Project description: Background Use of routinely collected patient data for research and service planning is an explicit policy of the UK National Health Service and UK government. Much clinical information is recorded in free-text letters, reports and notes. These text data are generally lost to research, due to the increased privacy risk compared with structured data. We conducted a citizens' jury which asked members of the public whether their medical free-text data should be shared for research for public benefit, to inform an ethical policy. Methods Eighteen citizens took part over 3 days. Jurors heard a range of expert presentations as well as arguments for and against sharing free text, and then questioned presenters and deliberated together. They answered a questionnaire on whether and how free text should be shared for research, gave reasons for and against sharing and suggestions for alleviating their concerns. Results Jurors were in favour of sharing medical data and agreed this would benefit health research, but were more cautious about sharing free-text than structured data. They preferred processing of free text where a computer extracted information at scale. Their concerns were lack of transparency in uses of data, and privacy risks. They suggested keeping patients informed about uses of their data, and giving clear pathways to opt out of data sharing. Conclusions Informed citizens suggested a transparent culture of research for the public benefit, and continuous improvement of technology to protect patient privacy, to mitigate their concerns regarding privacy risks of using patient text data.
Project description: Purpose The purpose of this study was to develop a model to predict whether glaucoma will progress to the point of requiring surgery within the following year, using data from electronic health records (EHRs), including both structured data and free-text progress notes. Methods A cohort of adult glaucoma patients was identified from the EHR at Stanford University between 2008 and 2020, with data including free-text clinical notes, demographics, diagnosis codes, prior surgeries, and clinical information including intraocular pressure, visual acuity, and central corneal thickness. Words from patients' notes were mapped to ophthalmology domain-specific neural word embeddings. Word embeddings and structured clinical data were combined as inputs to deep learning models to predict whether a patient would undergo glaucoma surgery in the following 12 months using the previous 4-12 months of clinical data. We also evaluated models using only structured data inputs (regression-, tree-, and deep-learning-based models) and models using only text inputs. Results Of the 3,469 glaucoma patients included in our cohort, 26% underwent surgery. The baseline penalized logistic regression model achieved an area under the receiver operating characteristic curve (AUC) of 0.873 and F1 score of 0.750, compared with the best tree-based model (random forest, AUC 0.876; F1 0.746), the deep learning model with structured features (AUC 0.885; F1 0.757), the deep learning model with clinical free-text features (AUC 0.767; F1 0.536), and the deep learning model with both structured clinical and free-text features (AUC 0.899; F1 0.745). Discussion Fusion models combining text and EHR structured data successfully and accurately predicted glaucoma progression to surgery. Future research incorporating imaging data could further optimize this predictive approach and be translated into clinical decision support tools.
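A sketch of the structured-data baseline named above (a penalized logistic regression evaluated with AUC and F1); the feature preprocessing, split and hyperparameters are assumptions, not the study's protocol.

```python
# Hypothetical L2-penalized logistic regression baseline on structured features.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_structured_baseline(X, y):
    """X: numeric matrix of structured features (IOP, VA, CCT, demographics, ...)."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=0
    )
    model = make_pipeline(
        StandardScaler(),
        LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
    )
    model.fit(X_tr, y_tr)
    probs = model.predict_proba(X_te)[:, 1]
    return {
        "auc": roc_auc_score(y_te, probs),
        "f1": f1_score(y_te, (probs >= 0.5).astype(int)),
    }
```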
Project description: Background The Norwegian Trauma Registry (NTR) is designed to monitor and improve the quality and outcome of trauma care delivered by Norwegian trauma hospitals. Patient care is evaluated through specific quality indicators, which are constructed from variables reported to the registry by certified registrars. Having high-quality data recorded in the registry is essential for the validity, trust and use of the data. This study aims to perform a data quality check of a subset of core data elements in the registry by assessing agreement between data in the NTR and corresponding data in electronic patient records (EPRs). Methods We validated 49 of the 118 variables registered in the NTR by comparing them with the corresponding ones in electronic patient records for 180 patients with a trauma diagnosis admitted in 2019 at eight public hospitals. Agreement was quantified by calculating observed agreement, Cohen's Kappa and Gwet's first agreement coefficient (AC1) with 95% confidence intervals (CIs) for 27 nominal variables, and quadratic weighted Cohen's Kappa and Gwet's second agreement coefficient (AC2) for five ordinal variables. For nine continuous, one date and seven time variables, we calculated the intraclass correlation coefficient (ICC). Results Almost perfect agreement (AC1/AC2/ICC > 0.80) was observed for all examined variables. Nominal and ordinal variables showed Gwet's agreement coefficients ranging from 0.85 (95% CI: 0.79-0.91) to 1.00 (95% CI: 1.00-1.00). Continuous and time variables showed high intraclass correlation coefficients (ICC), between 0.88 (95% CI: 0.83-0.91) and 1.00 (95% CI: 1.00-1.00). While missing values in both the NTR and EPRs were in general negligible, we found a substantial amount of missing registrations for the continuous variable "Base excess" in the NTR. For some of the time variables, missing values were high in both the NTR and EPRs. Conclusion All tested variables in the Norwegian Trauma Registry displayed excellent agreement with the corresponding variables in electronic patient records. Variables in the registry that showed missing data need further examination.
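A sketch of two of the nominal-variable agreement statistics named above: Cohen's kappa via scikit-learn and Gwet's AC1 implemented from its standard definition (chance agreement p_e = Σ_k π_k(1-π_k)/(q-1), AC1 = (p_a - p_e)/(1 - p_e)); confidence intervals, weighted variants (AC2) and ICCs are omitted, and the example ratings are invented for illustration.

```python
# Agreement between registry (NTR) and chart-review (EPR) ratings of a nominal variable.
from collections import Counter
from sklearn.metrics import cohen_kappa_score

def gwet_ac1(ratings_a, ratings_b):
    """Gwet's first-order agreement coefficient for two raters, nominal categories."""
    n = len(ratings_a)
    categories = sorted(set(ratings_a) | set(ratings_b))
    q = len(categories)
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n   # observed agreement
    counts = Counter(ratings_a) + Counter(ratings_b)
    pi = {k: counts[k] / (2 * n) for k in categories}             # average category prevalence
    p_e = sum(p * (1 - p) for p in pi.values()) / (q - 1)         # chance agreement
    return (p_a - p_e) / (1 - p_e)

ntr = ["blunt", "blunt", "penetrating", "blunt", "penetrating"]        # toy data
epr = ["blunt", "blunt", "penetrating", "penetrating", "penetrating"]  # toy data
print(gwet_ac1(ntr, epr), cohen_kappa_score(ntr, epr))
```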
Project description: Electronic Health Record (EHR) data can provide novel insights into inpatient trajectories. Blood tests and vital signs from de-identified patients' hospital admission episodes (AE) were represented as multivariate time series (MVTS) to train unsupervised Hidden Markov Models (HMM) and represent each AE day as one of 17 states. All HMM states were clinically interpreted based on their patterns of MVTS variables and relationships with clinical information. Visualization differentiated patients progressing toward stable 'discharge-like' states from those remaining at risk of inpatient mortality (IM). Chi-square tests confirmed these relationships (two states associated with IM; 12 states with ≥1 diagnosis). Logistic Regression and Random Forest (RF) models trained with MVTS data rather than states achieved higher predictive performance for IM, but results were comparable (best RF model AUC-ROC: MVTS data = 0.85; HMM states = 0.79). ML models extracted clinically interpretable signals from hospital data. The potential of ML to develop decision-support tools for EHR systems warrants investigation.
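An illustrative sketch, not the study's exact configuration: fit an unsupervised Gaussian HMM (here via hmmlearn) on per-day multivariate time series stacked across admission episodes, then decode a state label for each admission day; the covariance type, iteration count and data layout are assumptions.

```python
# Hypothetical per-day state assignment with an unsupervised Gaussian HMM.
import numpy as np
from hmmlearn.hmm import GaussianHMM

def fit_admission_hmm(episodes, n_states=17, seed=0):
    """episodes: list of arrays, each (n_days, n_features) for one admission episode."""
    X = np.vstack(episodes)                      # stack all admission days
    lengths = [len(ep) for ep in episodes]       # episode boundaries for the HMM
    model = GaussianHMM(n_components=n_states, covariance_type="diag",
                        n_iter=100, random_state=seed)
    model.fit(X, lengths)
    states = model.predict(X, lengths)           # one state per admission day
    return model, np.split(states, np.cumsum(lengths)[:-1])
```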
Project description: Background Korian is a private group specializing in medical accommodation for elderly and dependent people. A professional data warehouse (DWH) established in 2010 hosts all of the residents' data. Within this information system (IS), clinical narratives (CNs) were used only by medical staff as a tool for linking residents' care. The objective of this study was to show that, through qualitative and quantitative textual analysis of a relatively small and well-defined physiotherapy CN sample, it was possible to build a physiotherapy corpus and, through this process, generate a new body of knowledge by adding relevant information to describe the residents' care and lives. Methods Meaningful words were extracted through Structured Query Language (SQL) with the LIKE function and wildcards to perform pattern matching, followed by text mining and a word cloud using R packages. Another step involved principal component and multiple correspondence analyses, plus clustering, on the same residents' sample as well as on other health data, using a health model measuring the residents' care level needs. Results By combining these techniques, physiotherapy treatments could be characterized by a list of constructed keywords, and the residents' health characteristics were constructed. Feeding defects or health outlier groups could be detected, physiotherapy residents' data were matched with their health data, and differences in health situations were reflected in qualitative and quantitative differences in the physiotherapy narratives. Conclusions This two-stage textual experiment showed that text mining and data mining techniques provide convenient tools to improve residents' health and quality of care by adding new, simple, usable data to the electronic health record (EHR). When used with a normalized physiotherapy problem list, text mining through information extraction (IE), named entity recognition (NER) and data mining (DM) can provide a real advantage in describing health care, adding new medical material and helping to integrate the EHR system into the health staff's work environment.
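A Python analogue of the first stage described above (the study itself used SQL LIKE queries against the DWH and R packages for text mining and the word cloud): pull narratives matching physiotherapy keyword patterns, then count word frequencies as word-cloud input. The database path, table and column names, and patterns are assumptions.

```python
# Hypothetical SQL LIKE pattern matching plus word-frequency counting.
import re
import sqlite3
from collections import Counter

def physio_keyword_counts(db_path, patterns=("%kine%", "%physio%", "%gait%")):
    conn = sqlite3.connect(db_path)
    where = " OR ".join("note_text LIKE ?" for _ in patterns)   # wildcard pattern matching
    rows = conn.execute(
        f"SELECT note_text FROM clinical_narratives WHERE {where}", patterns
    ).fetchall()
    conn.close()
    words = Counter()
    for (text,) in rows:
        # simple tokenizer tolerant of French accented characters
        words.update(re.findall(r"[a-zàâçéèêëîïôûùüÿ]+", text.lower()))
    return words.most_common(50)   # frequency list suitable for a word cloud
```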
Project description: Analyses of search engine and social media feeds have been attempted for infectious disease outbreaks, but have been found to be susceptible to artefactual distortions from health scares or keyword spamming in social media or on the public internet. We describe an approach using real-time aggregation of keywords and phrases of free text from real-time clinician-generated documentation in electronic health records to produce a customisable real-time viral pneumonia signal providing up to 4 days' warning for secondary care capacity planning. This low-cost approach is open source, locally customisable and not dependent on any specific electronic health record system, and can provide an ensemble of signals if deployed at multiple organisational scales.
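A minimal sketch of the kind of signal described: count daily mentions of customisable respiratory keywords in clinician free text and smooth the series; the keyword list, smoothing window and data layout are illustrative choices, not the published implementation.

```python
# Hypothetical daily keyword-aggregation signal from clinician free text.
import pandas as pd

KEYWORDS = r"viral pneumonia|bilateral infiltrates|ground.glass|hypox"

def pneumonia_signal(notes: pd.DataFrame, window_days: int = 4) -> pd.Series:
    """notes: DataFrame with 'timestamp' (datetime64) and 'text' columns."""
    hits = notes["text"].str.lower().str.count(KEYWORDS)        # keyword mentions per note
    daily = hits.groupby(notes["timestamp"].dt.floor("D")).sum()  # aggregate per day
    return (daily.asfreq("D", fill_value=0)                     # fill days with no notes
                 .rolling(window_days, min_periods=1).mean())   # smoothed signal
```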