Dataset Information

FIBER: enabling flexible retrieval of electronic health records data for clinical predictive modeling.

ABSTRACT:

Objectives

The development of clinical predictive models hinges upon the availability of comprehensive clinical data. Tapping into such resources requires considerable effort from clinicians, data scientists, and engineers. Specifically, these efforts are focused on data extraction and preprocessing steps required prior to modeling, including complex database queries. A handful of software libraries exist that can reduce this complexity by building upon data standards. However, a gap remains concerning electronic health records (EHRs) stored in star schema clinical data warehouses, an approach often adopted in practice. In this article, we introduce the FlexIBle EHR Retrieval (FIBER) tool: a Python library built on top of a star schema (i2b2) clinical data warehouse that enables flexible generation of modeling-ready cohorts as data frames.

Materials and methods

FIBER was developed on top of a large-scale star schema EHR database which contains data from 8 million patients and over 120 million encounters. To illustrate FIBER's capabilities, we present its application by building a heart surgery patient cohort with subsequent prediction of acute kidney injury (AKI) with various machine learning models.
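To make the star schema idea concrete, the following is a minimal sketch of how a cohort can be pulled from a central fact table joined to a patient dimension. It does not use or reproduce FIBER's API. The table names follow the i2b2 convention (`observation_fact`, `patient_dimension`), but the schema is reduced to a few columns, and the concept codes (`PROC:HEART_SURGERY`, `LAB:CREATININE`), the toy rows, and the helper names `make_toy_warehouse` and `build_cohort` are all hypothetical.

```python
import sqlite3


def make_toy_warehouse():
    """Create an in-memory, heavily simplified star schema with toy rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE patient_dimension (
            patient_num INTEGER PRIMARY KEY,
            sex_cd      TEXT
        );
        CREATE TABLE observation_fact (
            patient_num INTEGER,
            concept_cd  TEXT,
            nval_num    REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO patient_dimension VALUES (?, ?)",
        [(1, "F"), (2, "M"), (3, "F")],
    )
    # Hypothetical concept codes; real warehouses use coded vocabularies.
    conn.executemany(
        "INSERT INTO observation_fact VALUES (?, ?, ?)",
        [
            (1, "PROC:HEART_SURGERY", None),
            (1, "LAB:CREATININE", 0.9),
            (1, "LAB:CREATININE", 1.8),
            (2, "LAB:CREATININE", 1.1),
            (3, "PROC:HEART_SURGERY", None),
        ],
    )
    return conn


def build_cohort(conn, procedure_code, lab_code):
    """Return one row per patient with the procedure, plus their maximum lab value.

    Patients without any matching lab observation are kept, with None as the value.
    """
    query = """
        SELECT p.patient_num,
               p.sex_cd,
               MAX(lab.nval_num) AS max_lab_value
        FROM observation_fact AS proc
        JOIN patient_dimension AS p
          ON p.patient_num = proc.patient_num
        LEFT JOIN observation_fact AS lab
          ON lab.patient_num = proc.patient_num
         AND lab.concept_cd = ?
        WHERE proc.concept_cd = ?
        GROUP BY p.patient_num, p.sex_cd
        ORDER BY p.patient_num
    """
    return conn.execute(query, (lab_code, procedure_code)).fetchall()


if __name__ == "__main__":
    warehouse = make_toy_warehouse()
    for row in build_cohort(warehouse, "PROC:HEART_SURGERY", "LAB:CREATININE"):
        print(row)
```

Even in this reduced form, the query needs a self-join on the fact table and an outer join to keep patients without lab values, which hints at the kind of query complexity a retrieval library would hide behind a data frame interface.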

Results

Using FIBER, we were able to build the heart surgery cohort (n = 12 061), identify the patients that developed AKI (n = 1005), and automatically extract relevant features (n = 774). Finally, we trained machine learning models that achieved area under the curve values of up to 0.77 for this exemplary use case.
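For readers less familiar with the reported metric, the sketch below shows one standard way to compute the area under the ROC curve: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. It also derives the AKI prevalence implied by the cohort sizes above. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; this is not the evaluation code used in the study.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: iterable of 0/1 outcomes; scores: predicted risk per case.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Outcome prevalence implied by the cohort sizes reported in the abstract.
COHORT_SIZE = 12061
AKI_CASES = 1005
aki_prevalence = AKI_CASES / COHORT_SIZE

if __name__ == "__main__":
    toy_labels = [0, 0, 1, 1]
    toy_scores = [0.1, 0.4, 0.35, 0.8]
    print(f"toy AUC: {roc_auc(toy_labels, toy_scores):.2f}")
    print(f"AKI prevalence: {aki_prevalence:.1%}")
```

With roughly 8% of the cohort developing AKI, the outcome is imbalanced, which is one reason a ranking metric such as AUC is more informative here than raw accuracy.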

Conclusion

FIBER is an open-source Python library for extracting information from star schema clinical data warehouses. It reduces time-to-modeling, helping to streamline the clinical modeling process.

SUBMITTER: Datta S

PROVIDER: S-EPMC8327378 | biostudies-literature | 2021 Jul

REPOSITORIES: biostudies-literature


Publications

FIBER: enabling flexible retrieval of electronic health records data for clinical predictive modeling.

Datta S, Sachs JP, Freitas Da Cruz H, Martensen T, Bode P, Morassi Sasso A, Glicksberg BS, Böttinger E

JAMIA Open, 2021 Jul 1, issue 3


PMID: 34350388

Similar Datasets

Project description: Healthcare facilities (HF) may identify catchment areas (CA) by selecting criteria that depend on various factors, including hospital activities, geographical definition, and patient covariates. Traditional analyses are limited by considering only static conditions, whereas some CA determinants involve influences at both temporal and spatial scales. Studying CA in the cancer context means choosing between HF, usually divided into general hospitals and oncological centers (OCs). In the CA context, electronic health records (EHRs) promise to be a valuable source of information, one driving the next generation of patient-driven clinical decision support systems. Among the challenges, digital health requires redefining the role of stochastic modeling to deal with the complexities emerging from data heterogeneity. To model CA with cancer EHR, we chose a computational framework centered on a logistic model, as a reference, and on a multivariate statistical approach. We also provide a battery of tests for CA assessment. Our results indicate that a more refined CA model structure yields superior discrimination power between health facilities. The increased significance was also visualized by comparative evaluations with ad hoc geo-localized maps. Notably, a cancer-specific spatial effect can be seen, especially for breast cancer and through OCs. To mitigate distributional influences in the data, bootstrap analysis was performed, and gains in some cancer-specific and spatially concentrated regions were obtained. Finally, when the temporal dynamics are assessed over a 3-year timeframe, negligible differences appear between the predicted probabilities obtained with standard critical values and those obtained with bootstrapped values. In conclusion, interpreting CA in terms of both spatial and temporal dynamics requires sophisticated models. The one proposed here suggests that bootstrap can improve test accuracy. We recommend that evidence from stochastic modeling be merged with visual analytics, as this combination may be exploited by policy-makers in support of quantitative CA assessment.

Project description: Objective: To optimally leverage the scalability and unique features of electronic health records (EHR) for research that would ultimately improve patient care, we need to accurately identify patients and extract clinically meaningful measures. Using multiple sclerosis (MS) as a proof of principle, we showcased how to leverage routinely collected EHR data to identify patients with a complex neurological disorder and derive an important surrogate measure of disease severity heretofore only available in research settings. Methods: In a cross-sectional observational study, 5,495 MS patients were identified from the EHR systems of two major referral hospitals using an algorithm that includes codified and narrative information extracted using natural language processing. In the subset of patients who receive neurological care at an MS Center where disease measures have been collected, we used routinely collected EHR data to extract two clinically relevant aggregate indicators of MS severity: the multiple sclerosis severity score (MSSS) and brain parenchymal fraction (BPF, a measure of whole brain volume). Results: The EHR algorithm that identifies MS patients has an area under the curve of 0.958, 83% sensitivity, 92% positive predictive value, and 89% negative predictive value when a 95% specificity threshold is used. The correlation between EHR-derived and true MSSS has a mean R² = 0.38 ± 0.05, and that between EHR-derived and true BPF has a mean R² = 0.22 ± 0.08. To illustrate its clinical relevance, derived MSSS captures the expected difference in disease severity between relapsing-remitting and progressive MS patients after adjusting for sex, age of symptom onset, and disease duration (p = 1.56 × 10⁻¹²). Conclusion: Incorporation of sophisticated codified and narrative EHR data accurately identifies MS patients and provides an estimate of a well-accepted indicator of MS severity that is widely used in research settings but is not part of routine medical records. Similar approaches could be applied to other complex neurological disorders.
