Automated detection of poor-quality data: case studies in healthcare.
ABSTRACT: The detection and removal of poor-quality data in a training set is crucial to achieving high-performing AI models. In healthcare, data can be inherently poor-quality due to uncertainty or subjectivity, but, as is often the case, data-privacy requirements restrict AI practitioners from accessing the raw training data, so manual visual verification of private patient data is not possible. Here we describe a novel method for the automated identification of poor-quality data, called Untrainable Data Cleansing. This method is shown to have numerous benefits, including protection of private patient data; improvement in AI generalizability; and reduction in the time, cost, and data needed for training, all while offering a truer reporting of AI performance. Additionally, results show that Untrainable Data Cleansing could be useful as a triage tool to identify difficult clinical cases that may warrant in-depth evaluation or additional testing to support a diagnosis.
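The abstract names the method but not its mechanism. As a purely illustrative sketch, one common way to flag "untrainable" samples is to repeatedly cross-validate a model and mark the samples that are consistently misclassified when held out. Everything below is an assumption for illustration (the function name flag_untrainable, the logistic-regression model, the synthetic dataset, and the 90% misclassification threshold), not the authors' published procedure.

```python
# Illustrative sketch only: flag samples that are misclassified in most
# of the cross-validation rounds in which they are held out. Names and
# parameters here are hypothetical, not taken from the paper.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

def flag_untrainable(X, y, n_splits=5, n_repeats=10, threshold=0.9, seed=0):
    """Return a boolean mask of samples misclassified in at least
    `threshold` of the held-out rounds they appeared in."""
    cv = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats,
                                 random_state=seed)
    errors = np.zeros(len(y))  # times each sample was misclassified when held out
    counts = np.zeros(len(y))  # times each sample appeared in a held-out fold
    for train_idx, test_idx in cv.split(X, y):
        model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        errors[test_idx] += (pred != y[test_idx])
        counts[test_idx] += 1
    return (errors / counts) >= threshold

if __name__ == "__main__":
    # Synthetic stand-in for a (private) clinical dataset, with 10% label noise.
    X, y = make_classification(n_samples=500, n_features=20, flip_y=0.1,
                               random_state=0)
    mask = flag_untrainable(X, y)
    print(f"Flagged {mask.sum()} of {len(y)} samples as potentially untrainable.")
```

In such a consistency-based scheme, the flagged subset can either be removed before retraining or, as the abstract suggests for a triage use case, routed to clinicians for closer review.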
SUBMITTER: Dakka MA
PROVIDER: S-EPMC8429593 | biostudies-literature
REPOSITORIES: biostudies-literature