Dataset Information

Evaluating electronic health record data sources and algorithmic approaches to identify hypertensive individuals.

ABSTRACT: OBJECTIVE: Phenotyping algorithms applied to electronic health record (EHR) data enable investigators to identify large cohorts for clinical and genomic research. Algorithm development is often iterative, depends on fallible investigator intuition, and is time- and labor-intensive. We developed and evaluated 4 types of phenotyping algorithms and categories of EHR information to identify hypertensive individuals and controls and provide a portable module for implementation at other sites. MATERIALS AND METHODS: We reviewed the EHRs of 631 individuals followed at Vanderbilt for hypertension status. We developed features and phenotyping algorithms of increasing complexity. Input categories included International Classification of Diseases, Ninth Revision (ICD9) codes, medications, vital signs, narrative-text search results, and Unified Medical Language System (UMLS) concepts extracted using natural language processing (NLP). We developed a module and tested portability by replicating 10 of the best-performing algorithms at the Marshfield Clinic. RESULTS: Random forests using billing codes, medications, vitals, and concepts had the best performance with a median area under the receiver operator characteristic curve (AUC) of 0.976. Normalized sums of all 4 categories also performed well (0.959 AUC). The best non-NLP algorithm combined normalized ICD9 codes, medications, and blood pressure readings with a median AUC of 0.948. Blood pressure cutoffs or ICD9 code counts alone had AUCs of 0.854 and 0.908, respectively. Marshfield Clinic results were similar. CONCLUSION: This work shows that billing codes or blood pressure readings alone yield good hypertension classification performance. However, even simple combinations of input categories improve performance. The most complex algorithms classified hypertension with excellent recall and precision.
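
As a rough illustration of the "normalized sum" idea in the abstract, the Python sketch below scales each input category to [0, 1], adds the scaled values into one score per patient, and estimates AUC by pairwise comparison of cases and controls. The min-max scaling, the function names, and the toy counts are assumptions made for illustration; the abstract does not state how the authors normalized each category.

```python
from typing import List, Sequence


def min_max(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def normalized_sum(categories: Sequence[Sequence[float]]) -> List[float]:
    """Sum min-max scaled categories into one score per patient.

    Each inner sequence holds one value per patient for a single
    category (for example ICD9 code counts or medication counts).
    """
    scaled = [min_max(c) for c in categories]
    return [sum(per_patient) for per_patient in zip(*scaled)]


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data for four patients (not from the study).
icd9_counts = [0, 5, 1, 8]
med_counts = [0, 3, 0, 2]
high_bp_counts = [0, 4, 1, 6]
labels = [0, 1, 0, 1]  # 1 = hypertensive on chart review

scores = normalized_sum([icd9_counts, med_counts, high_bp_counts])
print(auc(scores, labels))
```

The pairwise AUC is quadratic in cohort size, which is acceptable for a reviewed set of a few hundred charts; a rank-based formula would be preferable for larger cohorts.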

SUBMITTER: Teixeira PL

PROVIDER: S-EPMC5201185 | biostudies-literature | 2017 Jan

REPOSITORIES: biostudies-literature

Publications

Evaluating electronic health record data sources and algorithmic approaches to identify hypertensive individuals.

Pedro L Teixeira, Wei-Qi Wei, Robert M Cronin, Huan Mo, Jacob P VanHouten, Robert J Carroll, Eric LaRose, Lisa A Bastarache, S Trent Rosenbloom, Todd L Edwards, Dan M Roden, Thomas A Lasko, Richard A Dart, Anne M Nikolai, Peggy L Peissig, Joshua C Denny

Journal of the American Medical Informatics Association : JAMIA 20160807 1

PMID: 27497800

Similar Datasets

Project description: Objective: To identify observational studies which used data from more than one primary care electronic health record (EHR) database, and summarise key characteristics including: objective and rationale for using multiple data sources; methods used to manage, analyse and (where applicable) combine data; and approaches used to assess and report heterogeneity between data sources. Design: A systematic review of published studies. Data sources: Pubmed and Embase databases were searched using a list of named primary care EHR databases; supplementary hand searches of reference lists of studies were retained after initial screening. Study selection: Observational studies published between January 2000 and May 2018 were selected, which included at least two different primary care EHR databases. Results: 6054 studies were identified from database and hand searches, and 109 were included in the final review, the majority published between 2014 and 2018. Included studies used 38 different primary care EHR data sources. Forty-seven studies (44%) were descriptive or methodological. Of 62 analytical studies, 22 (36%) presented separate results from each database, with no attempt to combine them; 29 (48%) combined individual patient data in a one-stage meta-analysis and 21 (34%) combined estimates from each database using two-stage meta-analysis. Discussion and exploration of heterogeneity was inconsistent across studies. Conclusions: Comparing patterns and trends in different populations, or in different primary care EHR databases from the same populations, is important and a common objective for multi-database studies. When combining results from several databases using meta-analysis, provision of separate results from each database is helpful for interpretation. We found that these were often missing, particularly for studies using one-stage approaches, which also often lacked details of any statistical adjustment for heterogeneity and/or clustering. For two-stage meta-analysis, a clear rationale should be provided for choice of fixed effect and/or random effects or other models.
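
In a two-stage meta-analysis, each database is analysed separately and the per-database estimates are then pooled. A minimal Python sketch of standard fixed-effect inverse-variance pooling follows, as one example of the model choice the review asks authors to justify. The function name and the example numbers are hypothetical and are not drawn from the review.

```python
import math
from typing import Sequence, Tuple


def fixed_effect_pool(estimates: Sequence[float],
                      std_errors: Sequence[float]) -> Tuple[float, float]:
    """Pool per-database estimates with inverse-variance weights.

    Returns (pooled_estimate, pooled_standard_error). Estimates are
    assumed to be on an additive scale, for example log hazard ratios.
    """
    if len(estimates) != len(std_errors) or not estimates:
        raise ValueError("need matching, non-empty inputs")
    weights = [1.0 / (se ** 2) for se in std_errors]
    total = sum(weights)
    pooled = sum(w * e for w, e in zip(weights, estimates)) / total
    return pooled, math.sqrt(1.0 / total)


# Hypothetical stage-one results from two databases (log scale).
pooled, se = fixed_effect_pool([0.10, 0.30], [0.1, 0.2])
print(round(pooled, 3), round(se, 4))
```

A random-effects model would add a between-database variance term to each weight; the fixed-effect form is shown only because it is the simplest case.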

Project description: Objectives: To develop, validate, and implement algorithms to identify diabetic retinopathy (DR) cases and controls from electronic health care records (EHRs). Materials and methods: We developed and validated electronic health record (EHR)-based algorithms to identify DR cases and individuals with type I or II diabetes without DR (controls) in 3 independent EHR systems: Vanderbilt University Medical Center Synthetic Derivative (VUMC), the VA Northeast Ohio Healthcare System (VANEOHS), and Massachusetts General Brigham (MGB). Cases were required to meet 1 of the following 3 criteria: (1) 2 or more dates with any DR ICD-9/10 code documented in the EHR, (2) at least one affirmative health-factor or EPIC code for DR along with an ICD-9/10 code for DR on a different day, or (3) at least one ICD-9/10 code for any DR occurring within 24 hours of an ophthalmology examination. Criteria for controls included affirmative evidence for diabetes as well as an ophthalmology examination. Results: The algorithms, developed and evaluated in VUMC through manual chart review, resulted in a positive predictive value (PPV) of 0.93 for cases and negative predictive value (NPV) of 0.91 for controls. Implementation of algorithms yielded similar metrics in VANEOHS (PPV = 0.94; NPV = 0.86) and lower in MGB (PPV = 0.84; NPV = 0.76). In comparison, the algorithm for DR implemented in Phenome-wide association study (PheWAS) in VUMC yielded similar PPV (0.92) but substantially reduced NPV (0.48). Applying the algorithms to the Million Veteran Program identified over 62 000 DR cases with genetic data including 14 549 African Americans and 6209 Hispanics with DR. Conclusions/discussion: We demonstrate the robustness of the algorithms at 3 separate healthcare centers, with a minimum PPV of 0.84 and substantially better NPV than existing automated methods. We strongly encourage independent validation and incorporation of features unique to each EHR to enhance algorithm performance for DR cases and controls.
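
The three DR case criteria described above can be read as a simple rule over timestamped events. The Python sketch below is one possible reading, not the authors' implementation: the function name, the input shapes, and the interpretation of "within 24 hours" as an absolute time difference are all assumptions.

```python
from datetime import datetime, timedelta
from typing import Sequence


def is_dr_case(icd_times: Sequence[datetime],
               factor_times: Sequence[datetime],
               exam_times: Sequence[datetime]) -> bool:
    """Return True if any of the three case criteria is met.

    icd_times:    timestamps of DR ICD-9/10 codes
    factor_times: timestamps of affirmative health-factor or EPIC DR codes
    exam_times:   timestamps of ophthalmology examinations
    """
    icd_dates = {t.date() for t in icd_times}

    # Criterion 1: DR ICD codes on two or more distinct dates.
    if len(icd_dates) >= 2:
        return True

    # Criterion 2: a health-factor/EPIC code plus an ICD code on a different day.
    factor_dates = {t.date() for t in factor_times}
    if any(f != i for f in factor_dates for i in icd_dates):
        return True

    # Criterion 3: an ICD code within 24 hours of an ophthalmology exam
    # (assumed here to mean before or after the exam).
    window = timedelta(hours=24)
    return any(abs(i - e) <= window for i in icd_times for e in exam_times)


# Hypothetical patient: one DR code, three hours before an eye exam.
code_time = datetime(2020, 1, 1, 9, 0)
print(is_dr_case([code_time], [], [datetime(2020, 1, 1, 12, 0)]))
```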

Project description: Aims: Understanding atypical forms of diabetes (AD) may advance precision medicine, but methods to identify such patients are needed. We propose an electronic health record (EHR)-based algorithmic approach to identify patients who may have AD, specifically those with insulin-sufficient, non-metabolic diabetes, in order to improve feasibility of identifying these patients through detailed chart review. Methods: Patients with likely T2D were selected using a validated machine-learning (ML) algorithm applied to EHR data. "Typical" T2D cases were removed by excluding individuals with obesity, evidence of dyslipidemia, antibody-positive diabetes, or cystic fibrosis. To filter out likely type 1 diabetes (T1D) cases, we applied six additional "branch algorithms," relying on various clinical characteristics, which resulted in six overlapping cohorts. Diabetes type was classified by manual chart review as atypical, not atypical, or indeterminate due to missing information. Results: Of 114,975 biobank participants, the algorithms collectively identified 119 (0.1%) potential AD cases, of which 16 (0.014%) were confirmed after expert review. The branch algorithm that excluded T1D based on outpatient insulin use had the highest percentage yield of AD (13 of 27; 48.2% yield). Together, the 16 AD cases had significantly lower BMI and higher HDL than either unselected T1D or T2D cases identified by ML algorithms (P<0.05). Compared to the ML T1D group, the AD group had a significantly higher T2D polygenic score (P<0.01) and lower hemoglobin A1c (P<0.01). Conclusion: Our EHR-based algorithms followed by manual chart review collectively identified 16 individuals with AD, representing 0.22% of biobank enrollees with T2D. With a maximum yield of 48% cases after manual chart review, our algorithms have the potential to drastically improve efficiency of AD identification. Recognizing patients with AD may inform understanding of the heterogeneity of T2D and facilitate enrollment in studies like the Rare and Atypical Diabetes Network (RADIANT).

Project description: Importance: Accurate, real-time case identification is needed to target interventions to improve quality and outcomes for hospitalized patients with heart failure. Problem lists may be useful for case identification but are often inaccurate or incomplete. Machine-learning approaches may improve accuracy of identification but can be limited by complexity of implementation. Objective: To develop algorithms that use readily available clinical data to identify patients with heart failure while in the hospital. Design, setting, and participants: We performed a retrospective study of hospitalizations at an academic medical center. Hospitalizations for patients 18 years or older who were admitted after January 1, 2013, and discharged before February 28, 2015, were included. From a random 75% sample of hospitalizations, we developed 5 algorithms for heart failure identification using electronic health record data: (1) heart failure on problem list; (2) presence of at least 1 of 3 characteristics: heart failure on problem list, inpatient loop diuretic, or brain natriuretic peptide level of 500 pg/mL or higher; (3) logistic regression of 30 clinically relevant structured data elements; (4) machine-learning approach using unstructured notes; and (5) machine-learning approach using structured and unstructured data. Main outcomes and measures: Heart failure diagnosis based on discharge diagnosis and physician review of sampled medical records. Results: A total of 47 119 hospitalizations were included in this study (mean [SD] age, 60.9 [18.15] years; 23 952 female [50.8%], 5258 black/African American [11.2%], and 3667 Hispanic/Latino [7.8%] patients). Of these hospitalizations, 6549 (13.9%) had a discharge diagnosis of heart failure. Inclusion of heart failure on the problem list (algorithm 1) had a sensitivity of 0.40 and a positive predictive value (PPV) of 0.96 for heart failure identification. Algorithm 2 improved sensitivity to 0.77 at the expense of a PPV of 0.64. Algorithms 3, 4, and 5 had areas under the receiver operating characteristic curves of 0.953, 0.969, and 0.974, respectively. With a PPV of 0.9, these algorithms had associated sensitivities of 0.68, 0.77, and 0.83, respectively. Conclusions and relevance: The problem list is insufficient for real-time identification of hospitalized patients with heart failure. The high predictive accuracy of machine learning using free text demonstrates that support of such analytics in future electronic health record systems can improve cohort identification.
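
Algorithm 2 in the description above is a plain "any of three" rule, and its trade-off is reported as sensitivity and PPV. The Python sketch below shows one way to express both. The record fields, the function names, and the use of the maximum BNP value during the stay are assumptions; the abstract does not say which BNP measurement was used.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Hospitalization:
    hf_on_problem_list: bool
    inpatient_loop_diuretic: bool
    max_bnp_pg_ml: Optional[float]  # None when BNP was never measured


def algorithm_2(h: Hospitalization) -> bool:
    """Flag heart failure if at least 1 of the 3 characteristics is present."""
    high_bnp = h.max_bnp_pg_ml is not None and h.max_bnp_pg_ml >= 500
    return h.hf_on_problem_list or h.inpatient_loop_diuretic or high_bnp


def sensitivity_ppv(predicted: Sequence[bool],
                    truth: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, PPV) for paired predictions and reference labels."""
    tp = sum(p and t for p, t in zip(predicted, truth))
    fn = sum((not p) and t for p, t in zip(predicted, truth))
    fp = sum(p and (not t) for p, t in zip(predicted, truth))
    if tp + fn == 0 or tp + fp == 0:
        raise ValueError("sensitivity or PPV is undefined for these inputs")
    return tp / (tp + fn), tp / (tp + fp)


# Hypothetical stays (not study data).
stays = [
    Hospitalization(True, False, None),
    Hospitalization(False, False, 650.0),
    Hospitalization(False, True, 120.0),
    Hospitalization(False, False, 80.0),
]
flags = [algorithm_2(s) for s in stays]
print(flags)
```

Broadening the rule from the problem list alone to any of three signals catches more true cases but also more false positives, which is the sensitivity-for-PPV exchange the results report.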

Project description: Background: The rarity of pediatric glomerular disease makes it difficult to identify sufficient numbers of participants for clinical trials. This leaves limited data to guide improvements in care for these patients. Methods: The authors developed and tested an electronic health record (EHR) algorithm to identify children with glomerular disease. We used EHR data from 231 patients with glomerular disorders at a single center to develop a computerized algorithm comprising diagnosis, kidney biopsy, and transplant procedure codes. The algorithm was tested using PEDSnet, a national network of eight children's hospitals with data on >6.5 million children. Patients with three or more nephrologist encounters (n=55,560) not meeting the computable phenotype definition of glomerular disease were defined as nonglomerular cases. A reviewer blinded to case status used a standardized form to review random samples of cases (n=800) and nonglomerular cases (n=798). Results: The final algorithm consisted of two or more diagnosis codes from a qualifying list or one diagnosis code and a pretransplant biopsy. Performance characteristics among the population with three or more nephrology encounters were sensitivity, 96% (95% CI, 94% to 97%); specificity, 93% (95% CI, 91% to 94%); positive predictive value (PPV), 89% (95% CI, 86% to 91%); negative predictive value, 97% (95% CI, 96% to 98%); and area under the receiver operating characteristics curve, 94% (95% CI, 93% to 95%). Requiring that the sum of nephrotic syndrome diagnosis codes exceed that of glomerulonephritis codes identified children with nephrotic syndrome or biopsy-based minimal change nephropathy, FSGS, or membranous nephropathy, with 94% sensitivity and 92% PPV. The algorithm identified 6657 children with glomerular disease across PEDSnet, ≥50% of whom were seen within 18 months. Conclusions: The authors developed an EHR-based algorithm and demonstrated that it had excellent classification accuracy across PEDSnet. This tool may enable faster identification of cohorts of pediatric patients with glomerular disease for observational or prospective studies.
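
The final glomerular-disease rule and the nephrotic-syndrome refinement described above are both simple count comparisons. The Python sketch below is a hedged reading of them: the function names and inputs are hypothetical, diagnosis codes are counted as occurrences from the qualifying list, and a biopsy in a patient with no transplant is assumed to count as pretransplant.

```python
from datetime import date
from typing import Optional, Sequence


def meets_glomerular_phenotype(n_qualifying_dx: int,
                               biopsy_dates: Sequence[date],
                               transplant_date: Optional[date] = None) -> bool:
    """Two or more qualifying diagnosis codes, or one code plus a pretransplant biopsy."""
    if n_qualifying_dx >= 2:
        return True
    # Assumption: with no transplant on record, any biopsy is pretransplant.
    pretransplant_biopsy = any(
        transplant_date is None or b < transplant_date for b in biopsy_dates
    )
    return n_qualifying_dx == 1 and pretransplant_biopsy


def is_nephrotic_subgroup(n_nephrotic_dx: int, n_glomerulonephritis_dx: int) -> bool:
    """Nephrotic syndrome code count must exceed the glomerulonephritis code count."""
    return n_nephrotic_dx > n_glomerulonephritis_dx


# Hypothetical patient: one qualifying code, biopsy a year before transplant.
print(meets_glomerular_phenotype(1, [date(2015, 3, 1)], date(2016, 3, 1)))
```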
