Dataset Information

Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study.

ABSTRACT:

Background

Financial codes are often used to extract diagnoses from electronic health records. This approach is prone to false positives. Alternatively, queries are constructed, but these are highly center- and language-specific. A tantalizing alternative is the automatic identification of patients by employing machine learning on format-free text entries.

Objective

The aim of this study was to develop an easily implementable workflow that builds a machine learning algorithm capable of accurately identifying patients with rheumatoid arthritis from format-free text fields in electronic health records.

Methods

Two electronic health record data sets were employed: Leiden (n=3000) and Erlangen (n=4771). Using a portion of the Leiden data (n=2000), we compared 6 different machine learning methods and a naïve word-matching algorithm using 10-fold cross-validation. Performances were compared using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), and F1 score was used as the primary criterion for selecting the best method to build a classifying algorithm. We selected the optimal threshold of positive predictive value for case identification based on the output of the best method in the training data. This validation workflow was subsequently applied to a portion of the Erlangen data (n=4293). For testing, the best performing methods were applied to the remaining data (Leiden n=1000; Erlangen n=478) for an unbiased evaluation.
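
The threshold-selection step described above can be illustrated with a minimal sketch. It assumes that the chosen classifier outputs one score per patient in the training data and that a target positive predictive value is fixed in advance. The function names, the target of 0.95, and the toy scores are hypothetical and are not taken from the study.

```python
def ppv_at_threshold(scores, labels, threshold):
    """Positive predictive value when predicting 'case' for score >= threshold."""
    predicted = [y for s, y in zip(scores, labels) if s >= threshold]
    if not predicted:
        return None  # no positive predictions, so PPV is undefined
    return sum(predicted) / len(predicted)


def lowest_threshold_for_ppv(scores, labels, target_ppv):
    """Return the lowest observed score whose PPV meets the target, else None."""
    for threshold in sorted(set(scores)):
        ppv = ppv_at_threshold(scores, labels, threshold)
        if ppv is not None and ppv >= target_ppv:
            return threshold
    return None


# Toy training output (hypothetical): classifier scores and reference labels (1 = RA).
train_scores = [0.05, 0.10, 0.35, 0.40, 0.60, 0.70, 0.85, 0.95]
train_labels = [0, 0, 0, 1, 0, 1, 1, 1]

threshold = lowest_threshold_for_ppv(train_scores, train_labels, target_ppv=0.95)
```

Taking the lowest threshold that meets the target keeps as many true cases as possible while still satisfying the precision requirement. PPV is not guaranteed to rise monotonically with the threshold, which is why the sketch scans every observed score rather than stopping at a fixed cutoff.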

Results

For the Leiden data set, the word-matching algorithm demonstrated mixed performance (AUROC 0.90; AUPRC 0.33; F1 score 0.55). Four methods significantly outperformed word-matching, with support vector machines performing best (AUROC 0.98; AUPRC 0.88; F1 score 0.83). Applying this support vector machine classifier to the test data yielded similarly high performance (F1 score 0.81; positive predictive value [PPV] 0.94). With this method, we could identify 2873 patients with rheumatoid arthritis in less than 7 seconds out of the complete collection of 23,300 patients in the Leiden electronic health record system. For the Erlangen data set, gradient boosting performed best in the training set (AUROC 0.94; AUPRC 0.85; F1 score 0.82) and, when applied to the test data, again achieved high precision at a lower F1 score (F1 score 0.67; PPV 0.97).
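
Because the F1 score is the harmonic mean of PPV and sensitivity (recall), the test-set figures above imply an approximate sensitivity even though the abstract does not report one. The sketch below solves F1 = 2·PPV·R / (PPV + R) for R. The resulting values are derived from rounded figures and should be read as rough estimates, not as results reported by the study; the function name is illustrative.

```python
def recall_from_f1_and_ppv(f1, ppv):
    """Solve F1 = 2 * PPV * R / (PPV + R) for the recall R."""
    denominator = 2 * ppv - f1
    if denominator <= 0:
        raise ValueError("F1 and PPV are inconsistent")
    return f1 * ppv / denominator


# Test-set values quoted in the Results paragraph above.
leiden_recall = recall_from_f1_and_ppv(f1=0.81, ppv=0.94)
erlangen_recall = recall_from_f1_and_ppv(f1=0.67, ppv=0.97)

print(round(leiden_recall, 2), round(erlangen_recall, 2))  # prints: 0.71 0.51
```

Under these assumptions the implied sensitivity is roughly 0.71 for Leiden and roughly 0.51 for Erlangen, which is consistent with a threshold chosen to favor precision over completeness.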

Conclusions

We demonstrate that machine learning methods can extract the records of patients with rheumatoid arthritis from electronic health record data with high precision, allowing research on very large populations for limited costs. Our approach is language and center independent and could be applied to any type of diagnosis. We have developed our pipeline into a universally applicable and easy-to-implement workflow to equip centers with their own high-performing algorithm. This allows the creation of observational studies of unprecedented size covering different countries for low cost from already available data in electronic health record systems.

SUBMITTER: Maarseveen TD

PROVIDER: S-EPMC7735897 | biostudies-literature | 2020 Nov

REPOSITORIES: biostudies-literature


Publications

Machine Learning Electronic Health Record Identification of Patients with Rheumatoid Arthritis: Algorithm Pipeline Development and Validation Study.

Maarseveen, Tjardo D; Meinderink, Timo; Reinders, Marcel J T; Knitza, Johannes; Huizinga, Tom W J; Kleyer, Arnd; Simon, David; van den Akker, Erik B; Knevel, Rachel

JMIR Medical Informatics, 2020 Nov 30, issue 11


PMID: 33252349

Similar Datasets

Project description:
Background: A major problem in treating acute kidney injury (AKI) is that clinical criteria for recognition are markers of established kidney damage or impaired function; treatment before such damage manifests is desirable. Clinicians could intervene during what may be a crucial stage for preventing permanent kidney injury if patients with incipient AKI and those at high risk of developing AKI could be identified.
Objective: In this study, we evaluate a machine learning algorithm for early detection and prediction of AKI.
Design: We used a machine learning technique, boosted ensembles of decision trees, to train an AKI prediction tool on retrospective data taken from more than 300,000 inpatient encounters.
Setting: Data were collected from inpatient wards at Stanford Medical Center and intensive care unit patients at Beth Israel Deaconess Medical Center.
Patients: Patients older than the age of 18 whose hospital stays lasted between 5 and 1000 hours and who had at least one documented measurement of heart rate, respiratory rate, temperature, serum creatinine (SCr), and Glasgow Coma Scale (GCS).
Measurements: We tested the algorithm's ability to detect AKI at onset and to predict AKI 12, 24, 48, and 72 hours before onset.
Methods: We tested AKI detection and prediction using the National Health Service (NHS) England AKI Algorithm as a gold standard. We additionally tested the algorithm's ability to detect AKI as defined by the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines. We compared the algorithm's 3-fold cross-validation performance to the Sequential Organ Failure Assessment (SOFA) score for AKI identification in terms of area under the receiver operating characteristic (AUROC).
Results: The algorithm demonstrated high AUROC for detecting and predicting NHS-defined AKI at all tested time points. The algorithm achieves AUROC of 0.872 (95% confidence interval [CI], 0.867-0.878) for AKI detection at time of onset. For prediction 12 hours before onset, the algorithm achieves an AUROC of 0.800 (95% CI, 0.792-0.809). For 24-hour predictions, the algorithm achieves AUROC of 0.795 (95% CI, 0.785-0.804). For 48-hour and 72-hour predictions, the algorithm achieves AUROC values of 0.761 (95% CI, 0.753-0.768) and 0.728 (95% CI, 0.719-0.737), respectively.
Limitations: Because of the retrospective nature of this study, we cannot draw any conclusions about the impact the algorithm's predictions will have on patient outcomes in a clinical setting.
Conclusions: The results of these experiments suggest that a machine learning-based AKI prediction tool may offer important prognostic capabilities for determining which patients are likely to suffer AKI, potentially allowing clinicians to intervene before kidney damage manifests.

Project description:
Background: Rheumatoid arthritis (RA) is a chronic inflammatory disease that is primarily diagnosed and managed by rheumatologists; however, it is often primary care providers who first encounter RA-related symptoms. This study developed and validated a case definition for RA using national surveillance data in primary care settings.
Methods: This cross-sectional validation study used structured electronic medical record (EMR) data from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN). Based on the reference set generated by EMR reviews by five experts, three machine learning steps were applied: a 'bag-of-words' approach to feature generation, feature reduction using a feature importance measure coupled with recursive feature elimination and clustering, and classification using tree-based methods (Decision Tree, Random Forest, and Extreme Gradient Boosting). The three tree-based algorithms were compared to identify the procedure that generated the optimal evaluation metrics. Nested cross-validation was used to allow simultaneous evaluation, comparison, and tuning of models.
Results: Of 1.3 million patients from seven Canadian provinces, 5,600 people aged 19+ were randomly selected. The optimal algorithm for selecting RA cases was generated by the XGBoost classification method. Based on feature importance scores for features in the XGBoost output, a human-readable case definition was created, where RA cases are identified when there are at least 2 occurrences of the text "rheumatoid" in any billing, encounter diagnosis, or health condition table of the patient chart. The final case definition had a sensitivity of 81.6% (95% CI, 75.6-86.4), specificity of 98.0% (95% CI, 97.4-98.5), positive predictive value of 76.3% (95% CI, 70.1-81.5), and negative predictive value of 98.6% (95% CI, 98.0-98.6).
Conclusion: A case definition for RA using primary care EMR data was developed based on the XGBoost algorithm. With high validity metrics, this case definition is expected to be a reliable tool for future epidemiological research and surveillance investigating the management of RA in the CPCSSN dataset.
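
The human-readable case definition described in this record can be sketched as a simple counting rule. The sketch assumes that occurrences are pooled across the three tables (the description does not say whether they must fall within a single table), that matching is case-insensitive, and that a chart is available as a dictionary of free-text entries. The table keys and function names are hypothetical and do not reflect the CPCSSN schema.

```python
# Hypothetical table keys standing in for the three chart tables named in the text.
RA_TABLES = ("billing", "encounter_diagnosis", "health_condition")


def count_term(text, term="rheumatoid"):
    """Count case-insensitive occurrences of the term in one text entry."""
    return text.lower().count(term)


def meets_ra_case_definition(chart, min_occurrences=2):
    """chart maps a table name to a list of free-text entries for one patient."""
    total = sum(
        count_term(entry)
        for table in RA_TABLES
        for entry in chart.get(table, [])
    )
    return total >= min_occurrences


example_chart = {
    "billing": ["Rheumatoid arthritis follow-up"],
    "health_condition": ["rheumatoid arthritis, seropositive"],
    "medication": ["methotrexate"],  # ignored: not one of the three tables
}
is_case = meets_ra_case_definition(example_chart)
```

A rule of this kind is easy to audit and to port between sites, which is the stated motivation for deriving it from the feature importance scores instead of deploying the fitted model directly.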

Project description:
Background: To provide quality care, modern health care systems must match and link data about the same patient from multiple sources, a function often served by master patient index (MPI) software. Record linkage in the MPI is typically performed manually by health care providers, guided by automated matching algorithms. These matching algorithms must be configured in advance, such as by setting the weights of patient attributes, usually by someone with knowledge of both the matching algorithm and the patient population being served.
Objective: We aimed to develop and evaluate a machine learning-based software tool, which automatically configures a patient matching algorithm by learning from pairs of patient records previously linked by humans already present in the database.
Methods: We built a free and open-source software tool to optimize record linkage algorithm parameters based on historical record linkages. The tool uses Bayesian optimization to identify the set of configuration parameters that lead to optimal matching performance in a given patient population, by learning from prior record linkages by humans. The tool is written assuming only the existence of a minimal HTTP application programming interface (API), and so is agnostic to the choice of MPI software, record linkage algorithm, and patient population. As a proof of concept, we integrated our tool with SantéMPI, an open-source MPI. We validated the tool using several synthetic patient populations in SantéMPI by comparing the performance of the optimized configuration in held-out data to SantéMPI's default matching configuration using sensitivity and specificity.
Results: The machine learning-optimized configurations correctly detect over 90% of true record linkages as definite matches in all data sets, with 100% specificity and positive predictive value in all data sets, whereas the baseline detects none. In the largest data set examined, the baseline matching configuration detects possible record linkages with a sensitivity of 90.2% (95% CI 88.4%-92.0%) and specificity of 100%. By comparison, the machine learning-optimized matching configuration attains a sensitivity of 100%, with a decreased specificity of 95.9% (95% CI 95.9%-96.0%). We report significant gains in sensitivity in all data sets examined, at the cost of only marginally decreased specificity. The configuration optimization tool, data, and data set generator have been made freely available.
Conclusions: Our machine learning software tool can be used to significantly improve the performance of existing record linkage algorithms, without knowledge of the algorithm being used or specific details of the patient population being served.
