Clinical Phenotypic Spectrum of 4095 Individuals with Down Syndrome from Text Mining of Electronic Health Records.
ABSTRACT: Human genetic disorders, such as Down syndrome, have a wide variety of clinical phenotypic presentations, and characterizing each nuanced phenotype and subtype can be difficult. In this study, we examined the electronic health records of 4095 individuals with Down syndrome at the Children's Hospital of Philadelphia to create a method to characterize the phenotypic spectrum digitally. We extracted Human Phenotype Ontology (HPO) terms from quality-filtered patient notes using a natural language processing (NLP) approach, MetaMap. We catalogued the most common HPO terms related to Down syndrome patients and compared the terms with those from a baseline population. We characterized the top 100 HPO terms by their frequencies at different ages of clinical visits and highlighted selected terms that have time-dependent distributions. We also discovered phenotypic terms that have not been significantly associated with Down syndrome, such as "Proptosis", "Downslanted palpebral fissures", and "Microtia". In summary, our study demonstrated that the clinical phenotypic spectrum of individuals with Mendelian diseases can be characterized through NLP-based digital phenotyping on population-scale electronic health records (EHRs).
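The comparison of cohort terms against a baseline population can be illustrated with a minimal sketch. It assumes each patient has already been reduced to a list of HPO term labels (the MetaMap extraction step is not shown), and the patient IDs, toy term lists, pseudo-count, and function names are all hypothetical. The abstract does not state which statistic the authors used, so a plain frequency ratio stands in here.

```python
from collections import Counter

def term_frequencies(patient_terms):
    """Fraction of patients whose notes mention each term at least once."""
    counts = Counter()
    for terms in patient_terms.values():
        counts.update(set(terms))  # count each term once per patient
    n = len(patient_terms)
    return {term: c / n for term, c in counts.items()}

def enrichment(cohort, baseline, pseudo=1e-6):
    """Ratio of cohort frequency to baseline frequency for each cohort term.

    The small pseudo-count (an assumption) avoids division by zero for
    terms that never occur in the baseline.
    """
    cohort_freq = term_frequencies(cohort)
    base_freq = term_frequencies(baseline)
    return {t: f / (base_freq.get(t, 0.0) + pseudo) for t, f in cohort_freq.items()}

# Hypothetical toy data: patient ID -> HPO term labels extracted from notes.
cohort = {
    "p1": ["Hypotonia", "Atrioventricular canal defect"],
    "p2": ["Hypotonia", "Hypothyroidism"],
}
baseline = {
    "b1": ["Hypotonia"],
    "b2": ["Fever"],
    "b3": ["Fever"],
    "b4": ["Cough"],
}

# Terms sorted from most to least enriched in the cohort.
ranked = sorted(enrichment(cohort, baseline).items(), key=lambda kv: -kv[1])
```

Counting each term once per patient keeps a heavily documented patient from dominating the frequencies; a real analysis would likely add a significance test on top of the ratio.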
Project description:Korian is a private group specializing in medical accommodations for elderly and dependent people. A professional data warehouse (DWH) established in 2010 hosts all of the residents' data. Inside this information system (IS), clinical narratives (CNs) were used only by medical staff as a tool for linking residents' care. The objective of this study was to show that, through qualitative and quantitative textual analysis of a relatively small and well-defined physiotherapy CN sample, it was possible to build a physiotherapy corpus and, through this process, generate a new body of knowledge by adding relevant information to describe the residents' care and lives. Meaningful words were extracted through Structured Query Language (SQL) with the LIKE operator and wildcards to perform pattern matching, followed by text mining and a word cloud using R® packages. Another step involved principal components and multiple correspondence analyses, plus clustering on the same residents' sample as well as on other health data using a health model measuring the residents' care level needs. By combining these techniques, physiotherapy treatments could be characterized by a list of constructed keywords, and the residents' health characteristics were built. Feeding defects or health outlier groups could be detected, physiotherapy residents' data and their health data were matched, and differences in health situations showed qualitative and quantitative differences in physiotherapy narratives. This textual experiment using a textual process in two stages showed that text mining and data mining techniques provide convenient tools to improve residents' health and quality of care by adding new, simple, useable data to the electronic health record (EHR).
When used with a normalized physiotherapy problem list, text mining through information extraction (IE), named entity recognition (NER) and data mining (DM) can provide a real advantage to describe health care, adding new medical material and helping to integrate the EHR system into the health staff work environment.
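The SQL pattern-matching step can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name `narratives`, its columns, the English sample notes, and the helper `residents_matching` are all hypothetical, and the study's actual warehouse and queries are not described in the abstract.

```python
import sqlite3

# In-memory database standing in for the (much larger) data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE narratives (resident_id INTEGER, note TEXT)")
conn.executemany(
    "INSERT INTO narratives VALUES (?, ?)",
    [
        (1, "Walking rehabilitation with a frame, 20 minutes"),
        (2, "Passive mobilisation of the left shoulder"),
        (3, "Resident refused the session today"),
    ],
)

def residents_matching(pattern):
    """Return resident IDs whose note matches a LIKE pattern.

    '%' matches any run of characters and '_' matches exactly one character.
    """
    rows = conn.execute(
        "SELECT resident_id FROM narratives WHERE note LIKE ? ORDER BY resident_id",
        (pattern,),
    )
    return [row[0] for row in rows]

walking = residents_matching("%walk%")
# The '_' wildcard covers both 'mobilisation' and 'mobilization' spellings.
mobilisation = residents_matching("%mobili_ation%")
```

In SQLite, LIKE is case-insensitive for ASCII letters by default, which is why `%walk%` also matches "Walking"; other database engines may behave differently.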
Project description:Objective: Electronic health records (EHR) are increasingly being recognized as a major source of data reusable for medical research and quality monitoring, although patient identification and assessment of symptoms (characterization) remain challenging, especially in complex diseases such as systemic lupus erythematosus (SLE). Current coding systems are unable to assess information recorded in the physician's free-text notes. This study shows that text mining can be used as a reliable alternative. Methods: In a multidisciplinary research team of data scientists and medical experts, a text mining algorithm on 4607 patient records was developed to assess the diagnosis of 14 different immune-mediated inflammatory diseases and the presence of 18 different symptoms in the EHR. The text mining algorithm included key words in the EHR, while mining the context for exclusion phrases. The accuracy of the text mining algorithm was assessed by manually checking the EHR of 100 random patients suspected of having SLE for diagnoses and symptoms and comparing the outcome with the outcome of the text mining algorithm. Results: After evaluation of 100 patient records, the text mining algorithm had a sensitivity of 96.4% and a specificity of 93.3% in assessing the presence of SLE. The algorithm detected potentially life-threatening symptoms (nephritis, pleuritis) with good sensitivity (80%-82%) and high specificity (97%-97%). Conclusion: We present a text mining algorithm that can accurately identify and characterize patients with SLE using routinely collected data from the EHR. Our study shows that using text mining, data from the EHR can be reused in research and quality control.
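A keyword search that checks the surrounding context for exclusion phrases, together with the sensitivity and specificity calculation used to validate it, might look like the sketch below. The exclusion list, the 30-character window, and the function names are assumptions made for illustration; the abstract does not give the study's actual rules.

```python
import re

# Hypothetical exclusion phrases; the study's real list is not given.
EXCLUSIONS = ("no", "not", "without")

def mentions(text, keyword, window=30):
    """True if `keyword` occurs with no exclusion phrase in the preceding window."""
    lowered = text.lower()
    for match in re.finditer(r"\b" + re.escape(keyword) + r"\b", lowered):
        context = lowered[max(0, match.start() - window):match.start()]
        excluded = any(
            re.search(r"\b" + re.escape(phrase) + r"\b", context)
            for phrase in EXCLUSIONS
        )
        if not excluded:
            return True
    return False

def sensitivity_specificity(predicted, actual):
    """Compare algorithm output with manual chart review (lists of booleans)."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    tn = sum((not p) and (not a) for p, a in pairs)
    fp = sum(p and (not a) for p, a in pairs)
    fn = sum((not p) and a for p, a in pairs)
    return tp / (tp + fn), tn / (tn + fp)

# Toy validation set: algorithm output versus manual review.
predicted = [True, True, False, False]
actual = [True, False, False, True]
sens, spec = sensitivity_specificity(predicted, actual)
```

A fixed character window is a blunt instrument: it can suppress a true mention when an unrelated negation sits nearby, which is one reason a manual validation step such as the one described in the abstract is needed.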
Project description:Analyses of search engine and social media feeds have been attempted for infectious disease outbreaks, but have been found to be susceptible to artefactual distortions from health scares or keyword spamming in social media or the public internet. We describe an approach that aggregates, in real time, keywords and phrases from the free text of clinician-generated documentation in electronic health records to produce a customisable real-time viral pneumonia signal providing up to 4 days' warning for secondary care capacity planning. This low-cost approach is open-source, is locally customisable, is not dependent on any specific electronic health record system and can provide an ensemble of signals if deployed at multiple organisational scales.
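One simple way to turn free-text mentions into a daily signal is to count the notes that contain any phrase of interest and smooth the counts over a short window. The sketch below assumes notes arrive as (date, text) pairs; the phrase list, the 3-day window, and the function names are hypothetical and are not taken from the project described above.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical phrase list; a real deployment would customise this locally.
PHRASES = ("viral pneumonia", "pneumonitis")

def daily_counts(notes):
    """Count, per day, the notes that mention at least one phrase."""
    counts = Counter()
    for day, text in notes:
        lowered = text.lower()
        if any(phrase in lowered for phrase in PHRASES):
            counts[day] += 1
    return counts

def rolling_mean(counts, end_day, window=3):
    """Mean daily count over the `window` days ending on `end_day`."""
    days = [end_day - timedelta(days=i) for i in range(window)]
    return sum(counts.get(d, 0) for d in days) / window

notes = [
    (date(2020, 3, 1), "Likely viral pneumonia, admitted for oxygen."),
    (date(2020, 3, 2), "Chest X-ray consistent with viral pneumonia."),
    (date(2020, 3, 2), "Fractured wrist after a fall."),
    (date(2020, 3, 3), "Pneumonitis on CT; viral pneumonia suspected."),
]
counts = daily_counts(notes)
signal = rolling_mean(counts, date(2020, 3, 3))
```

Each note is counted at most once per day even if it contains several phrases, so a single verbose note cannot inflate the signal.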
Project description:Background: A cancer diagnosis is a source of psychological and emotional stress, which is often sustained for long periods and may lead to depressive disorders. Depression is one of the most common psychological conditions in patients with cancer. According to the Global Cancer Observatory, breast and colorectal cancers are the most prevalent cancers in both sexes and across all age groups in Spain. Objective: This study aimed to compare the prevalence of depression in patients before and after the diagnosis of breast or colorectal cancer, as well as to assess the usefulness of the analysis of free-text clinical notes in 2 languages (Spanish or Catalan) for detecting depression in combination with encoded diagnoses. Methods: We carried out an analysis of the electronic health records from a general hospital by considering the different sources of clinical information related to depression in patients with breast and colorectal cancer. This analysis included ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification) diagnosis codes and unstructured information extracted by mining free-text clinical notes via natural language processing tools based on Systematized Nomenclature of Medicine Clinical Terms that mention symptoms and drugs used for the treatment of depression. Results: We observed that the percentage of patients diagnosed with depressive disorders significantly increased after cancer diagnosis in the 2 types of cancer considered: breast and colorectal cancers. We managed to identify a higher number of patients with depression by mining free-text clinical notes than the group selected exclusively on ICD-9-CM codes, increasing the number of patients diagnosed with depression by 34.8% (441/1269).
In addition, the number of patients with depression who received chemotherapy was higher than those who did not receive this treatment, with significant differences (P<.001). Conclusions: This study provides new clinical evidence of the depression-cancer comorbidity and supports the use of natural language processing for extracting and analyzing free-text clinical notes from electronic health records, contributing to the identification of additional clinical data that complements those provided by coded data to improve the management of these patients.
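Combining coded diagnoses with text-mined mentions amounts to a set union over patient identifiers. The sketch below uses hypothetical patient IDs and a hypothetical helper name, and it also reproduces the reported percentage from the two figures given in the abstract (441 and 1269).

```python
def combine_sources(coded, text_mined):
    """Merge patients found via diagnosis codes with those found via text mining.

    Returns the combined set and the patients found only in free text.
    """
    additional = text_mined - coded
    combined = coded | text_mined
    return combined, additional

# Hypothetical patient identifiers.
coded = {"A01", "A02", "A03"}
text_mined = {"A02", "A03", "A04", "A05"}
combined, additional = combine_sources(coded, text_mined)

# Percentage reported in the abstract, recomputed from its own figures.
reported_increase = round(100 * 441 / 1269, 1)
```

Keeping the text-only patients as a separate set makes it easy to review them manually before they are added to the cohort.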
Project description:Purpose: To evaluate different clinical variants of pseudoexfoliation syndrome and their risk of developing ocular hypertension (OHT) or glaucoma (PXG). Design: Cross-sectional hospital-based study. Setting: All patients seen at glaucoma services of a tertiary eye care center in east India. Methods: An electronic medical records search of the hospital database, including consecutive new and old cases seen from April 2013 to March 2015, was done to retrieve case-sensitive words including pseudoexfoliation, PXF, PEX, PXG and pseudoexfoliative glaucoma from any part of the patient's clinical electronic sheet. All demographic and clinical details, including laterality, the pattern of deposits, need for medicines and disc damage at presentation, were compared in eyes with radial pigmentary, classical or combined forms of PXF phenotypes. Results: Of 110313 PXF patients seen during the period of 2013-2015, a total of 2297 eyes of 1150 PXF patients were identified including 525 unilateral PXF (meaning a total of 1775 PXF eyes with 625 patients having bilateral disease, n = 1250 eyes, other clinically normal eye, n = 522) at presentation. Of 525 unilateral PXF eyes, 105 had OHT and 131 had glaucoma, while bilateral cases had >50% (675 of 1250 eyes) with glaucoma. Glaucoma with significant changes in IOP, with or without disc damage, was seen in 32% of pigmentary and 39% of classical PXF forms, while around 50% of eyes with combined forms of PXF had glaucoma at presentation compared to other forms, p<0.001. Conclusion: Different phenotypic variants of PXF in this Indian cohort were associated with a 30-50% risk of OHT or glaucoma. Adequate care is required while examining the pattern of PXF in each case to prognosticate each patient/eye.
Project description:The combination of improved genomic analysis methods, decreasing genotyping costs, and increasing computing resources has led to an explosion of clinical genomic knowledge in the last decade. Similarly, healthcare systems are increasingly adopting robust electronic health record (EHR) systems that not only can improve health care, but also contain a vast repository of disease and treatment data that could be mined for genomic research. Indeed, institutions are creating EHR-linked DNA biobanks to enable genomic and pharmacogenomic research, using EHR data for phenotypic information. However, EHRs are designed primarily for clinical care, not research, so reuse of clinical EHR data for research purposes can be challenging. Difficulties in use of EHR data include: data availability, missing data, incorrect data, and vast quantities of unstructured narrative text data. Structured information includes billing codes, most laboratory reports, and other variables such as physiologic measurements and demographic information. Significant information, however, remains locked within EHR narrative text documents, including clinical notes and certain categories of test results, such as pathology and radiology reports. For relatively rare observations, combinations of simple free-text searches and billing codes may prove adequate when followed by manual chart review. However, to extract the large cohorts necessary for genome-wide association studies, natural language processing methods to process narrative text data may be needed. Combinations of structured and unstructured textual data can be mined to generate high-validity collections of cases and controls for a given condition. Once high-quality cases and controls are identified, EHR-derived cases can be used for genomic discovery and validation. 
Since EHR data includes a broad sampling of clinically-relevant phenotypic information, it may enable multiple genomic investigations upon a single set of genotyped individuals. This chapter reviews several examples of phenotype extraction and their application to genetic research, demonstrating a viable future for genomic discovery using EHR-linked data.
Project description:Blowfly strike is a devastating and often rapidly fatal disease in rabbits. In Great Britain (GB), Lucilia sericata is the primary causative species. Despite its severity, there has been minimal investigatory work into the disease in rabbits. Here we used text mining to screen electronic health records (EHRs) from a large sentinel network of 389 veterinary practices in GB between March 2014 and April 2017 for confirmed cases of blowfly strike in rabbits. Blowfly strike was identified in 243 of 42,226 rabbit consultations (0.6%), affecting 205 individual rabbits. The anatomical site of recorded blowfly strike lesions was overwhelmingly the perineal area (n = 109, 52.4%). Less commonly, lesions were observed affecting other areas of the body (n = 9, 4.3%) and head (n = 8, 3.8%); in 83 consultations (39.9%), the affected area was not specified. Of the rabbits presenting with blowfly strike, 44.7% were recorded as being euthanized or died. A case control study was used to identify risk factors for blowfly strike in this population. Whilst sex and neuter status in isolation were not significantly associated with blowfly strike, entire female rabbits showed a 3.3 times greater odds of being a case than neutered female rabbits. Rabbits five years of age and over were more than 3.8 times as likely to present for blowfly strike. For every 1 °C rise in environmental temperature between 4.67 °C and 17.68 °C, there was a 33% increase in risk of blowfly strike, with cases peaking in July or August. Overall, blowfly strike cases started earlier and peaked higher in the south of Great Britain. The most northerly latitude studied was at lower risk of blowfly strike than the most southerly (OR = 0.50, p < 0.001). There appeared to be no significant relationship between blowfly strike in rabbits and either the sheep density or rural and urban land coverage types.
The results presented here can be used for targeted health messaging to reduce the impact of this deadly disease for rabbits. We propose that real-time temporal and spatial surveillance of the rabbit disease may also help inform sheep control, where the seasonal profile is very similar, and where routine surveillance data is also not available. Our results highlight the value of sentinel databases based on EHRs for research and surveillance.
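Odds ratios like those reported from the case control study can be computed from a 2x2 table of exposure against case status. The sketch below uses the standard cross-product estimate with a Woolf (log-based) 95% confidence interval; the counts are hypothetical, and the abstract does not say which model the authors actually fitted.

```python
import math

def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Cross-product odds ratio with a Woolf 95% confidence interval."""
    a, b, c, d = exposed_cases, exposed_controls, unexposed_cases, unexposed_controls
    estimate = (a * d) / (b * c)
    # Standard error of log(OR) from the four cell counts.
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    low = math.exp(math.log(estimate) - 1.96 * se)
    high = math.exp(math.log(estimate) + 1.96 * se)
    return estimate, low, high

# Hypothetical counts: exposed vs unexposed animals among cases and controls.
estimate, low, high = odds_ratio(20, 80, 10, 90)
```

The calculation needs every cell to be non-zero; with sparse tables a continuity correction or an exact method would be a safer choice.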
Project description:Intimate partner violence (IPV) is often studied as a problem that predominantly affects younger women. However, studies show that older women are also frequently victims of abuse even though the physical effects of abuse are harder to detect. In this study, we mined the electronic health records (EHR) available through IBM Explorys to identify health correlates of IPV that are specific to older women. Our analyses suggested that diagnostic terms that are co-morbid with IPV in older women are dominated by substance abuse and associated toxicities. When we considered differential co-morbidity, i.e., terms that are significantly more associated with IPV in older women compared to younger women, we identified terms spanning mental health issues, musculoskeletal issues, neoplasms, and disorders of various organ systems including skin, ears, nose and throat. Our findings provide pointers for further investigation in understanding the health effects of IPV among older women, as well as potential markers that can be used for screening IPV.
Project description:With an aging patient population and increasing complexity in patient disease trajectories, physicians are often met with complex patient histories from which clinical decisions must be made. Due to the increasing rate of adverse events and hospitals facing financial penalties for readmission, there has never been a greater need to enforce evidence-led medical decision-making using available health care data. In the present work, we studied a cohort of 7,741 patients, of whom 4,080 were diagnosed with cancer, surgically treated at a University Hospital in the years 2004-2012. We have developed a methodology that allows disease trajectories of the cancer patients to be estimated from free text in electronic health records (EHRs). By using these disease trajectories, we predict 80% of patient events ahead in time. By control of confounders from 8326 quantified events, we identified 557 events that constitute high subsequent risks (risk > 20%), including six events for cancer and seven events for metastasis. We believe that the presented methodology and findings could be used to improve clinical decision support and personalize trajectories, thereby decreasing adverse events and optimizing cancer treatment.