Dataset Information

Comparative analysis, applications, and interpretation of electronic health record-based stroke phenotyping methods.

ABSTRACT:

Background

Accurate identification of acute ischemic stroke (AIS) patient cohorts is essential for a wide range of clinical investigations. Automated phenotyping methods that leverage electronic health records (EHRs) represent a fundamentally new approach to cohort identification, one that avoids the current laborious and ungeneralizable generation of phenotyping algorithms. We systematically compared and evaluated the ability of machine learning algorithms and case-control combinations to phenotype acute ischemic stroke patients using data from an EHR.

Materials and methods

Using structured patient data from the EHR at a tertiary-care hospital system, we built and evaluated machine learning models to identify patients with AIS based on 75 different case-control and classifier combinations. We then estimated the prevalence of AIS patients across the EHR. Finally, we externally validated the ability of the models to detect AIS patients without AIS diagnosis codes using the UK Biobank.

Results

Across all models, we found that the mean AUROC for detecting AIS was 0.963 ± 0.0520 and the average precision score was 0.790 ± 0.196 with minimal feature processing. Classifiers trained with cases with AIS diagnosis codes and controls with no cerebrovascular disease codes had the best average F1 score (0.832 ± 0.0383). In the external validation, we found that the top probabilities from a model-predicted AIS cohort were significantly enriched for AIS patients without AIS diagnosis codes (60- to 150-fold over expected).
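
The "fold over expected" figure can be read as the observed rate of AIS patients among the highest-scoring predictions divided by the rate expected by chance. The sketch below shows that arithmetic only; the function name, its parameters, and the example numbers are illustrative assumptions and are not taken from the study.

```python
def fold_enrichment(confirmed_in_top_k: int, k: int, expected_prevalence: float) -> float:
    """Ratio of the observed case rate in the top-k predictions to the expected rate."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0 < expected_prevalence <= 1:
        raise ValueError("expected_prevalence must be in (0, 1]")
    if not 0 <= confirmed_in_top_k <= k:
        raise ValueError("confirmed_in_top_k must be between 0 and k")
    observed_rate = confirmed_in_top_k / k
    return observed_rate / expected_prevalence


# Hypothetical numbers for illustration only (not from the study):
# 30 confirmed cases among the 100 highest-scoring patients, against an
# expected prevalence of 0.5% in the population being screened.
example = fold_enrichment(confirmed_in_top_k=30, k=100, expected_prevalence=0.005)
print(f"{example:.0f}-fold enrichment")
```

Under this reading, a range such as 60- to 150-fold would correspond to evaluating the ratio at different cutoffs or against different baselines; the abstract does not say which.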

Conclusions

Our findings support machine learning algorithms as a generalizable way to accurately identify AIS patients without using process-intensive manual feature curation. When a set of AIS patients is unavailable, diagnosis codes may be used to train classifier models.
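
As a rough illustration of training labels derived from diagnosis codes, the sketch below assigns each patient to the best-performing case-control definition reported above: cases have an AIS diagnosis code, controls have no cerebrovascular disease code. The ICD-10 prefixes (I63 for cerebral infarction, I60-I69 for cerebrovascular diseases), the function names, and the toy patients are assumptions for illustration; the study's actual code lists are not given in this record.

```python
# Illustrative ICD-10 prefixes (assumed here, not the study's code lists).
AIS_PREFIXES = ("I63",)                                         # cerebral infarction
CEREBROVASCULAR_PREFIXES = tuple(f"I6{d}" for d in range(10))   # I60 through I69


def has_code(codes, prefixes):
    """True if any diagnosis code starts with one of the given prefixes."""
    return any(code.startswith(prefixes) for code in codes)


def assign_label(codes):
    """Return 1 for a case, 0 for a control, or None if the patient is excluded."""
    if has_code(codes, AIS_PREFIXES):
        return 1
    if not has_code(codes, CEREBROVASCULAR_PREFIXES):
        return 0
    # Other cerebrovascular codes without an AIS code: neither case nor control.
    return None


# Toy patients, each mapped to a set of diagnosis codes (hypothetical data).
patients = {
    "p1": {"I63.9", "I10"},   # AIS code present
    "p2": {"E11.9"},          # no cerebrovascular codes
    "p3": {"I61.0"},          # cerebrovascular, but not AIS
}
labels = {pid: assign_label(codes) for pid, codes in patients.items()}
print(labels)
```

Excluding patients such as p3 keeps the control group free of other cerebrovascular disease, which is the point of that control definition.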

SUBMITTER: Thangaraj PM

PROVIDER: S-EPMC7720570 | biostudies-literature | 2020 Dec

REPOSITORIES: biostudies-literature


Publications

Comparative analysis, applications, and interpretation of electronic health record-based stroke phenotyping methods.

Thangaraj PM, Kummer BR, Lorberbaum T, Elkind MSV, Tatonetti NP

BioData Mining, 2020-12-07, issue 1


PMID: 33372632

Similar Datasets

Project description: Introduction: Electronic health records (EHR) are linked together to examine disease history and to undertake research into the causes and outcomes of disease. However, the process of constructing algorithms for phenotyping (e.g., identifying disease characteristics) or health characteristics (e.g., smoker) is very time consuming and resource costly. In addition, results can vary greatly between researchers. Reusing or building on algorithms that others have created is a compelling solution to these problems. However, sharing algorithms is not a common practice and many published studies do not detail the clinical code lists used by the researchers in the disease/characteristic definition. To address these challenges, a number of centres across the world have developed health data portals which contain concept libraries (e.g., algorithms for defining concepts such as disease and characteristics) in order to facilitate disease phenotyping and health studies. Objectives: This study aims to review the literature of existing concept libraries, examine their utilities, identify the current gaps, and suggest future developments. Methods: The five-stage framework of Arksey and O'Malley was used for the literature search. This approach included defining the research questions, identifying relevant studies through literature review, selecting eligible studies, charting and extracting data, and summarising and reporting the findings. Results: This review identified seven publicly accessible Electronic Health data concept libraries which were developed in different countries including UK, USA, and Canada. The concept libraries (n = 7) investigated were either general libraries that hold phenotypes of multiple specialties (n = 4) or specialized libraries that manage only certain specialities such as rare diseases (n = 3).
There were some clear differences between the general libraries such as archiving data from different electronic sources, and using a range of different types of coding systems. However, they share some clear similarities such as enabling users to upload their own code lists, and allowing users to use/download the publicly accessible code. In addition, there were some differences between the specialized libraries such as difference in ability to search, and if it was possible to use different searching queries such as simple or complex searches. Conversely, there were some similarities between the specialized libraries such as enabling users to upload their own concepts into the libraries and to show where they were published, which facilitates assessing the validity of the concepts. All the specialized libraries aimed to encourage the reuse of research methods such as lists of clinical code and/or metadata. Conclusion: The seven libraries identified have been developed independently and appear to replicate similar concepts but in different ways. Collaboration between similar libraries would greatly facilitate the use of these libraries for the user. The process of building code lists takes time and effort. Access to existing code lists increases consistency and accuracy of definitions across studies. Concept library developers should collaborate with each other to raise awareness of their existence and of their various functions, which could increase users' contributions to those libraries and promote their wide-ranging adoption.

Project description: Objective: Electronic health records (EHR) offer medical and pharmacogenomics research unprecedented opportunities to identify and classify patients at risk. EHRs are collections of highly inter-dependent records that include biological, anatomical, physiological, and behavioral observations. They comprise a patient's clinical phenome, where each patient has thousands of date-stamped records distributed across many relational tables. Development of EHR computer-based phenotyping algorithms requires time and medical insight from clinical experts, who most often can only review a small patient subset representative of the total EHR records, to identify phenotype features. In this research we evaluate whether relational machine learning (ML) using inductive logic programming (ILP) can contribute to addressing these issues as a viable approach for EHR-based phenotyping. Methods: Two relational learning ILP approaches and three well-known WEKA (Waikato Environment for Knowledge Analysis) implementations of non-relational approaches (PART, J48, and JRIP) were used to develop models for nine phenotypes. International Classification of Diseases, Ninth Revision (ICD-9) coded EHR data were used to select training cohorts for the development of each phenotypic model. Accuracy, precision, recall, F-Measure, and Area Under the Receiver Operating Characteristic (AUROC) curve statistics were measured for each phenotypic model based on independent manually verified test cohorts. A two-sided binomial distribution test (sign test) compared the five ML approaches across phenotypes for statistical significance. Results: We developed an approach to automatically label training examples using ICD-9 diagnosis codes for the ML approaches being evaluated.
Nine phenotypic models for each ML approach were evaluated, resulting in better overall model performance in AUROC using ILP when compared to PART (p=0.039), J48 (p=0.003) and JRIP (p=0.003). Discussion: ILP has the potential to improve phenotyping by independently delivering clinically expert interpretable rules for phenotype definitions, or intuitive phenotypes to assist experts. Conclusion: Relational learning using ILP offers a viable approach to EHR-driven phenotyping.

Project description: Importance: Suicide is a leading cause of death among young people. Accurate detection of self-injurious thoughts and behaviors (SITB) underpins equity in youth suicide prevention. Objectives: To compare methods of detecting SITB using structured electronic health information and measure algorithmic performance across demographics. Design, setting, and participants: This cross-sectional study used medical records among youths aged 6 to 17 years with at least 1 mental health-related emergency department (ED) visit in 2017 to 2019 to an academic health system in Southern California serving 787 000 unique individuals each year. Analyses were conducted between January and September 2023. Exposures: Multiexpert electronic health record review ascertained the presence of SITB using the Columbia Classification Algorithm of Suicide Assessment. Random forest classifiers with nested cross-validation were developed using (1) International Statistical Classification of Diseases, Tenth Revision, Clinical Modification (ICD-10-CM) codes for nonfatal suicide attempt and self-harm and chief concern and (2) all available structured data, including diagnoses, medications, and laboratory tests. Main outcome and measures: Detection performance was assessed overall and stratified by age group, sex, and race and ethnicity. Results: The sample comprised 2702 unique youths with an MH-related ED visit (1384 youths who identified as female [51.2%]; 131 Asian [4.8%], 266 Black [9.8%], 719 Hispanic [26.6%], 1319 White [48.8%], and 233 other race [8.6%]; median [IQR] age, 14 [12-16] years), including 898 children and 1804 adolescents. Approximately half of visits were related to SITB (1286 visits [47.6%]).
Sensitivity of SITB detection using only codes and chief concern varied by age group and increased until age 15 years (6-9 years: 59.3% [95% CI, 48.5%-69.5%]; 10-12 years: 69.0% [95% CI, 63.8%-73.9%]; 13-15 years: 88.4% [95% CI, 85.1%-91.2%]; 16-17 years: 83.1% [95% CI, 79.1%-86.6%]), while specificity remained constant. The area under the receiver operating characteristic curve (AUROC) was lower among preadolescents (0.841 [95% CI, 0.815-0.867]) and male (0.869 [95% CI, 0.848-0.890]), Black (0.859 [95% CI, 0.813-0.905]), and Hispanic (0.861 [95% CI, 0.831-0.891]) youths compared with adolescents (0.925 [95% CI, 0.912-0.938]), female youths (0.923 [95% CI, 0.909-0.937]), and youths of other races and ethnicities (eg, White: 0.901 [95% CI, 0.884-0.918]). Augmented classification (ie, using all available structured data) outperformed classification with codes and chief concern alone (AUROC, 0.975 [95% CI, 0.968-0.980] vs 0.894 [95% CI, 0.882-0.905]; P < .001). Conclusions and relevance: In this study, diagnostic codes and chief concern underestimated SITB prevalence, particularly among minoritized youths. These results suggest that priority on algorithmic fairness in suicide prevention strategies must extend to accurate detection of youths with suicide-related emergencies.

Project description: To develop and evaluate a novel strategy that automates the retrospective identification of sepsis using electronic health record data. Design: Retrospective cohort study of emergency department and in-hospital patient encounters from 2014 to 2018. Setting: One community and two academic hospitals in Maryland. Patients: All patients 18 years old or older presenting to the emergency department or admitted to any acute inpatient medical or surgical unit including patients discharged from the emergency department. Interventions: None. Measurements and main results: From the electronic health record, 233,252 emergency department and inpatient encounters were identified. Patient data were used to develop and validate electronic health record-based sepsis phenotyping, an adaptation of "the Centers for Disease Control Adult Sepsis Event toolkit" that accounts for comorbid conditions when identifying sepsis patients. The performance of this novel system was then compared with 1) physician case review and 2) three other commonly used strategies using metrics of sensitivity and precision relative to sepsis billing codes, termed "billing code sensitivity" and "billing code predictive value." Physician review of electronic health record-based sepsis phenotyping identified cases confirmed 79% as having sepsis; 88% were confirmed or had a billing code for sepsis; and 99% were confirmed, had a billing code, or received at least 4 days of antibiotics. At comparable billing code sensitivity (0.91; 95% CI, 0.88-0.93), electronic health record-based sepsis phenotyping had a higher billing code predictive value (0.32; 95% CI, 0.30-0.34) than either the Centers for Medicare and Medicaid Services Sepsis Core Measure (SEP-1) definition or the Sepsis-3 consensus definition (0.12; 95% CI, 0.11-0.13; and 0.07; 95% CI, 0.07-0.08, respectively).
When compared with electronic health record-based sepsis phenotyping, Adult Sepsis Event had a lower billing code sensitivity (0.75; 95% CI, 0.72-0.78) and similar billing code predictive value (0.29; 95% CI, 0.26-0.31). Electronic health record-based sepsis phenotyping identified patients with higher in-hospital mortality and nearly one-half as many false-positive cases when compared with SEP-1 and Sepsis-3. Conclusions: By accounting for comorbid conditions, electronic health record-based sepsis phenotyping exhibited better performance when compared with other automated definitions of sepsis.
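
The two metrics named in this description, sensitivity and precision measured against billing codes, can be sketched from two sets of encounter identifiers. The function name and the toy identifiers below are hypothetical and only illustrate the definitions; they do not reproduce the study's data or tooling.

```python
def sensitivity_and_predictive_value(flagged, reference):
    """Compare algorithm-flagged encounters against a reference set.

    flagged:   encounter IDs the phenotyping algorithm labels as positive
    reference: encounter IDs carrying the reference label (here, a billing code)
    Returns (sensitivity, predictive_value); NaN where a denominator is empty.
    """
    flagged, reference = set(flagged), set(reference)
    true_positives = len(flagged & reference)
    sensitivity = true_positives / len(reference) if reference else float("nan")
    predictive_value = true_positives / len(flagged) if flagged else float("nan")
    return sensitivity, predictive_value


# Toy encounter IDs (hypothetical): the algorithm flags four encounters,
# three encounters have a billing code, and two appear in both sets.
sens, ppv = sensitivity_and_predictive_value(flagged={1, 2, 3, 4}, reference={2, 3, 5})
print(f"sensitivity={sens:.2f}, predictive value={ppv:.2f}")
```

A low predictive value against billing codes does not by itself mean the flagged encounters are wrong, since the billing codes are only a proxy reference; the description reports physician review separately for that reason.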

Project description: Introduction: Electronic health record (EHR)-driven phenotyping is a critical first step in generating biomedical knowledge from EHR data. Despite recent progress, current phenotyping approaches are manual, time-consuming, error-prone, and platform-specific. This results in duplication of effort and highly variable results across systems and institutions, and is not scalable or portable. In this work, we investigate how the nascent Clinical Quality Language (CQL) can address these issues and enable high-throughput, cross-platform phenotyping. Methods: We selected a clinically validated heart failure (HF) phenotype definition and translated it into CQL, then developed a CQL execution engine to integrate with the Observational Health Data Sciences and Informatics (OHDSI) platform. We executed the phenotype definition at two large academic medical centers, Northwestern Medicine and Weill Cornell Medicine, and conducted results verification (n = 100) to determine precision and recall. We additionally executed the same phenotype definition against two different data platforms, OHDSI and Fast Healthcare Interoperability Resources (FHIR), using the same underlying dataset and compared the results. Results: CQL is expressive enough to represent the HF phenotype definition, including Boolean and aggregate operators, and temporal relationships between data elements. The language design also enabled the implementation of a custom execution engine with relative ease, and results verification at both sites revealed that precision and recall were both 100%. Cross-platform execution resulted in identical patient cohorts generated by both data platforms. Conclusions: CQL supports the representation of arbitrarily complex phenotype definitions, and our execution engine implementation demonstrated cross-platform execution against two widely used clinical data platforms.
The language thus has the potential to help address current limitations with portability in EHR-driven phenotyping and scale in learning health systems.

