Dataset Information

Building an OMOP common data model-compliant annotated corpus for COVID-19 clinical trials.

ABSTRACT: Clinical trials are essential for generating reliable medical evidence, but often suffer from expensive and delayed patient recruitment because the unstructured eligibility criteria description prevents automatic query generation for eligibility screening. In response to the COVID-19 pandemic, many trials have been created but their information is not computable. We included 700 COVID-19 trials available at the point of study and developed a semi-automatic approach to generate an annotated corpus for COVID-19 clinical trial eligibility criteria called COVIC. A hierarchical annotation schema based on the OMOP Common Data Model was developed to accommodate four levels of annotation granularity, i.e., study cohort, eligibility criteria, named entity and standard concept. In COVIC, 39 trials with more than one study cohort were identified and labelled with an identifier for each cohort. 1,943 criteria for non-clinical characteristics such as "informed consent" and "exclusivity of participation" were annotated. 9,767 criteria were represented by 18,161 entities in 8 domains, 7,743 attributes of 7 attribute types and 16,443 relationships of 11 relationship types. 17,171 entities were mapped to standard medical concepts and 1,009 attributes were normalized into computable representations. COVIC can serve as a corpus indexed by semantic tags for COVID-19 trial search and analytics, and a benchmark for machine learning-based criteria extraction.
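The four annotation levels named in the abstract (study cohort, eligibility criteria, named entity, standard concept) can be pictured as a nested data structure. The sketch below is only an illustration under stated assumptions: the class names, field names, the `mapped_fraction` helper, the sample criteria, and the placeholder concept identifier are all hypothetical and are not taken from the COVIC release format.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    """A named entity inside a criterion (third annotation level)."""
    text: str
    domain: str  # illustrative domain label, not the corpus's actual tag set
    # Fourth level: identifier of the mapped standard concept, if any.
    concept_id: Optional[int] = None


@dataclass
class Criterion:
    """One eligibility criterion (second annotation level)."""
    text: str
    is_inclusion: bool
    entities: List[Entity] = field(default_factory=list)


@dataclass
class Cohort:
    """A study cohort within a trial (first annotation level)."""
    cohort_id: str
    criteria: List[Criterion] = field(default_factory=list)


def mapped_fraction(cohort: Cohort) -> float:
    """Share of a cohort's entities that carry a standard concept."""
    entities = [e for c in cohort.criteria for e in c.entities]
    if not entities:
        return 0.0
    return sum(e.concept_id is not None for e in entities) / len(entities)


# Hypothetical sample data; concept_id=1 is a placeholder, not a real concept.
cohort = Cohort(
    cohort_id="cohort-1",
    criteria=[
        Criterion(
            text="Confirmed COVID-19 infection",
            is_inclusion=True,
            entities=[Entity("COVID-19 infection", "Condition", concept_id=1)],
        ),
        Criterion(
            text="Pregnant",
            is_inclusion=False,
            entities=[Entity("Pregnant", "Condition")],
        ),
    ],
)

print(mapped_fraction(cohort))
```

Nesting entities under criteria and criteria under cohorts keeps each level addressable on its own, which is one plausible way to support searching by semantic tag as the abstract describes.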

SUBMITTER: Sun Y

PROVIDER: S-EPMC8079156 | biostudies-literature | 2021 Jun

REPOSITORIES: biostudies-literature


Publications

Building an OMOP common data model-compliant annotated corpus for COVID-19 clinical trials.

Sun Yingcheng, Butler Alex, Stewart Latoya A, Liu Hao, Yuan Chi, Southard Christopher T, Kim Jae Hyun, Weng Chunhua

Journal of Biomedical Informatics, 2021-04-28

PMID: 33887457

Similar Datasets

Project description: Background: Knowledge graphs (KGs) play a key role to enable explainable artificial intelligence (AI) applications in healthcare. Constructing clinical knowledge graphs (CKGs) against heterogeneous electronic health records (EHRs) has been desired by the research and healthcare AI communities. From the standardization perspective, community-based standards such as the Fast Healthcare Interoperability Resources (FHIR) and the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) are increasingly used to represent and standardize EHR data for clinical data analytics, however, the potential of such a standard on building CKG has not been well investigated. Objective: To develop and evaluate methods and tools that expose the OMOP CDM-based clinical data repositories into virtual clinical KGs that are compliant with FHIR Resource Description Framework (RDF) specification. Methods: We developed a system called FHIR-Ontop-OMOP to generate virtual clinical KGs from the OMOP relational databases. We leveraged an OMOP CDM-based Medical Information Mart for Intensive Care (MIMIC-III) data repository to evaluate the FHIR-Ontop-OMOP system in terms of the faithfulness of data transformation and the conformance of the generated CKGs to the FHIR RDF specification. Results: A beta version of the system has been released. A total of more than 100 data element mappings from 11 OMOP CDM clinical data, health system and vocabulary tables were implemented in the system, covering 11 FHIR resources. The generated virtual CKG from MIMIC-III contains 46,520 instances of FHIR Patient, 716,595 instances of Condition, 1,063,525 instances of Procedure, 24,934,751 instances of MedicationStatement, 365,181,104 instances of Observations, and 4,779,672 instances of CodeableConcept. Patient counts identified by five pairs of SQL (over the MIMIC database) and SPARQL (over the virtual CKG) queries were identical, ensuring the faithfulness of the data transformation. Generated CKG in RDF triples for 100 patients were fully conformant with the FHIR RDF specification. Conclusion: The FHIR-Ontop-OMOP system can expose OMOP database as a FHIR-compliant RDF graph. It provides a meaningful use case demonstrating the potentials that can be enabled by the interoperability between FHIR and OMOP CDM. Generated clinical KGs in FHIR RDF provide a semantic foundation to enable explainable AI applications in healthcare.

Project description: Objective: The COVID-19 pandemic has demonstrated the value of real-world data for public health research. International federated analyses are crucial for informing policy makers. Common data models (CDM) are critical for enabling these studies to be performed efficiently. Our objective was to convert the UK Biobank, a study of 500,000 participants with rich genetic and phenotypic data, to the Observational Medical Outcomes Partnership (OMOP) CDM. Materials and methods: We converted UK Biobank data to OMOP CDM v. 5.3. We transformed participant research data on diseases collected at recruitment and electronic health records (EHR) from primary care, hospitalizations, cancer registrations, and mortality from providers in England, Scotland, and Wales. We performed syntactic and semantic validations and compared comorbidities and risk factors between source and transformed data. Results: We identified 502,505 participants (3,086 with COVID-19) and transformed 690 fields (1,373,239,555 rows) to the OMOP CDM using eight different controlled clinical terminologies and bespoke mappings. Specifically, we transformed self-reported non-cancer illnesses 946,053 (83.91% of all source entries), cancers 37,802 (70.81%), medications 1,218,935 (88.25%), and prescriptions 864,788 (86.96%). In EHR, we transformed 13,028,182 (99.95%) hospital diagnoses, 6,465,399 (89.2%) procedures, 337,896,333 primary care diagnoses (CTV3, SNOMED-CT), 139,966,587 (98.74%) prescriptions (dm+d) and 77,127 (99.95%) deaths (ICD-10). We observed good concordance across demographic, risk factor, and comorbidity factors between source and transformed data. Discussion and conclusion: Our study demonstrated that the OMOP CDM can be successfully leveraged to harmonize complex large-scale biobanked studies combining rich multimodal phenotypic data. Our study uncovered several challenges when transforming data from questionnaires to the OMOP CDM which require further research. The transformed UK Biobank resource is a valuable tool that can enable federated research, like COVID-19 studies.

Project description: Objectives: The aim of this work is to demonstrate the use of a standardized health informatics framework to generate reliable and reproducible real-world evidence from Latin America and South Asia towards characterizing coronavirus disease 2019 (COVID-19) in the Global South. Materials and methods: Patient-level COVID-19 records collected in a patient self-reported notification system, hospital in-patient and out-patient records, and community diagnostic labs were harmonized to the Observational Medical Outcomes Partnership common data model and analyzed using a federated network analytics framework. Clinical characteristics of individuals tested for, diagnosed with or tested positive for, hospitalized with, admitted to intensive care unit with, or dying with COVID-19 were estimated. Results: Two COVID-19 databases covering 8.3 million people from Pakistan and 2.6 million people from Bahia, Brazil were analyzed. 109 504 (Pakistan) and 921 (Brazil) medical concepts were harmonized to the Observational Medical Outcomes Partnership common data model. In total, 341 505 (4.1%) people in the Pakistan dataset and 1 312 832 (49.2%) people in the Brazilian dataset were tested for COVID-19 between January 1, 2020 and April 20, 2022, with a median [IQR] age of 36 [25, 76] and 38 [27, 50]; 40.3% and 56.5% were female in Pakistan and Brazil, respectively. 1.2% of individuals in the Pakistan dataset had Afghan ethnicity. In Brazil, 52.3% had mixed ethnicity. In agreement with international findings, COVID-19 outcomes were more severe in men, the elderly, and those with underlying health conditions. Conclusions: COVID-19 data from 2 large countries in the Global South were harmonized and analyzed using a standardized health informatics framework developed by an international community of health informaticians. This proof-of-concept study demonstrates a potential open science framework for global knowledge mobilization and clinical translation for timely response to healthcare needs in pandemics and beyond.
