Dataset Information

Toward automated assessment of health Web page quality using the DISCERN instrument.

ABSTRACT:

Background

As the Internet becomes the number one destination for obtaining health-related information, there is an increasing need to identify health Web pages that convey an accurate and current view of medical knowledge. In response, the research community has created multicriteria instruments for reliably assessing online medical information quality. One such instrument is DISCERN, which measures health Web page quality by assessing an array of features. In order to scale up use of the instrument, there is interest in automating the quality evaluation process by building machine learning (ML)-based DISCERN Web page classifiers.

Objective

The paper addresses 2 key issues that must be resolved before automated DISCERN classifiers can be constructed: (1) generation of a robust DISCERN training corpus for classification algorithms, and (2) assessment of the usefulness of the current DISCERN scoring schema as a metric for evaluating the performance of these algorithms.

Methods

Using DISCERN, 272 Web pages discussing treatment options in breast cancer, arthritis, and depression were evaluated and rated by trained coders. First, different consensus models were compared to obtain a robust aggregated rating among the coders, suitable for a DISCERN ML training corpus. Second, a new DISCERN scoring criterion was proposed (features-based score) as an ML performance metric that is more reflective of the score distribution across different DISCERN quality criteria.

Results

First, we found that a probabilistic consensus model applied to the DISCERN instrument was robust against noise (random ratings) and superior to other approaches for building a training corpus. Second, we found that the established DISCERN scoring schema (overall score) is ill-suited to measure ML performance for automated classifiers.

Conclusion

Use of a probabilistic consensus model is advantageous for building a training corpus for the DISCERN instrument, and use of a features-based score is an appropriate ML metric for automated DISCERN classifiers.

Availability

The code for the probabilistic consensus model is available at https://bitbucket.org/A_2/em_dawid/ .
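
The abstract does not spell out the consensus algorithm, but the repository name (em_dawid) suggests an expectation-maximization model in the Dawid-Skene family. Under that assumption, the following is a minimal pure-Python sketch of how such a model could aggregate ratings from several coders. The function name `dawid_skene`, its parameters, and the toy ratings are illustrative and are not taken from the authors' code.

```python
import math
from collections import defaultdict


def dawid_skene(ratings, n_classes, n_iter=50, smoothing=0.01):
    """Estimate a consensus label distribution per item with Dawid-Skene-style EM.

    ratings: iterable of (item_id, coder_id, label), with label in range(n_classes).
    Returns {item_id: [P(true class = k) for k in range(n_classes)]}.
    """
    by_item = defaultdict(list)
    for item, coder, label in ratings:
        by_item[item].append((coder, label))
    coders = {coder for votes in by_item.values() for coder, _ in votes}

    # Initialise posteriors with each item's vote fractions (soft majority vote).
    post = {}
    for item, votes in by_item.items():
        counts = [0.0] * n_classes
        for _, label in votes:
            counts[label] += 1.0
        post[item] = [c / len(votes) for c in counts]

    for _ in range(n_iter):
        # M-step: class priors and one smoothed confusion matrix per coder.
        prior = [sum(p[k] for p in post.values()) / len(post)
                 for k in range(n_classes)]
        conf = {c: [[smoothing] * n_classes for _ in range(n_classes)]
                for c in coders}
        for item, votes in by_item.items():
            for coder, label in votes:
                for k in range(n_classes):
                    conf[coder][k][label] += post[item][k]
        for c in coders:
            for k in range(n_classes):
                row_sum = sum(conf[c][k])
                conf[c][k] = [v / row_sum for v in conf[c][k]]

        # E-step: recompute posteriors in log space for numerical stability.
        for item, votes in by_item.items():
            logp = [math.log(prior[k] + 1e-12) for k in range(n_classes)]
            for coder, label in votes:
                for k in range(n_classes):
                    logp[k] += math.log(conf[coder][k][label])
            top = max(logp)
            weights = [math.exp(v - top) for v in logp]
            total = sum(weights)
            post[item] = [w / total for w in weights]
    return post


# Hypothetical toy data: coders "a" and "b" agree; coder "c" deviates on items 2 and 3.
toy = []
for item, (a, c) in enumerate([(0, 0), (1, 1), (0, 1), (1, 0), (0, 0), (1, 1)]):
    toy += [(item, "a", a), (item, "b", a), (item, "c", c)]
consensus = dawid_skene(toy, n_classes=2)
```

Unlike a plain majority vote, this kind of model learns a per-coder confusion matrix, so a coder whose ratings look random contributes little to the aggregated label. That is one plausible reading of the robustness to noise reported in the Results.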

SUBMITTER: Allam A

PROVIDER: S-EPMC7651953 | biostudies-literature

REPOSITORIES: biostudies-literature

Similar Datasets

Project description: BACKGROUND: Patient-reported outcome (PRO) measures describe natural history, manage disease, and measure the effects of interventions in trials. Patients themselves increasingly use Web-based PRO tools to track their progress, share their data, and even self-experiment. However, existing PROs have limitations such as being: designed for paper (not screens), long and burdensome, negatively framed, under onerous licensing restrictions, either too generic or too specific. OBJECTIVE: This study aimed to develop and validate the core items of a modular, patient-centric, PRO system (Thrive) that could measure health status across a range of chronic conditions with minimal burden. METHODS: Thrive was developed in 4 phases, largely consistent with Food and Drug Administration guidance regarding PRO development. First, preliminary core items (common across multiple conditions: core Thrive items) were developed through literature review, analysis of approximately 20 existing PROs on PatientsLikeMe, and feedback from psychometric and content experts. Second, 2 rounds of cognitive interviews were iteratively conducted with patients (N=14) to obtain feedback on the preliminary items. Third, core Thrive items were administered electronically along with comparator measures, including 20-item Short-Form General Health Survey (SF)-20 and Patient Health Questionnaire (PHQ)-9, to a large sample (N=2002) of adults with chronic diseases through the PatientsLikeMe platform. On the basis of theoretical and empirical rationale, items were revised or removed. Fourth, the revised core Thrive items were administered to another sample of patients (N=704) with generic and condition-specific comparator measures. A psychometric evaluation, which included both modern and classical test theory approaches, was conducted on these items, and several more items were removed. RESULTS: Cognitive interviews helped to remove confusing or redundant items. Empirical testing of subscales revealed good internal consistency (Cronbach alpha=.712-.879), test-retest reliability (absolute intraclass correlations=.749-.912), and convergent validity with legacy PRO scales (eg, Pearson r=.5-.75 between Thrive subscales and PHQ-9 total). The finalized instrument consists of a 19-item core including 5 multi-item subscales: Core symptoms, Abilities, Mobility, Sleep, and Thriving. Results provide evidence of construct (content, convergent) validity, high levels of test-retest and internal consistency reliability, and the ability to detect change over time. The items did not exhibit bias based on gender or age, and the items generally functioned similarly across conditions. These results support the use of Thrive Core items across diverse chronic patient populations. CONCLUSIONS: Thrive appears to be a useful approach for capturing important domains for patients with chronic conditions. This core set serves as a foundation to begin developing modular condition-specific versions in the near future. Cross-walking against traditional PROs from the PatientsLikeMe platform is underway, in addition to clinical validation and comparison with biomarkers. Thrive is licensed under Creative Commons Attribution ShareAlike 4.0.
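
For reference, the internal-consistency statistic quoted above follows the standard Cronbach's alpha formula: alpha = k/(k-1) * (1 - sum of item variances / variance of the total score), where k is the number of items. The sketch below is a minimal illustration only. The function name and the toy responses are hypothetical and are not Thrive data.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of respondents, each a list of k item scores."""
    k = len(scores[0])
    # Sample variance of each item across respondents.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each respondent's total score.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical responses: 4 respondents answering a 3-item subscale.
toy_responses = [[3, 4, 3], [2, 2, 3], [5, 4, 5], [1, 2, 1]]
alpha = cronbach_alpha(toy_responses)
```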

Project description: Introduction: Health information is a prerequisite of informed decision-making. Criteria for development, content and presentation have recently been published in a corresponding guideline. Within a systematic search, 27 relevant checklists were identified, none of them, however, complying with the guideline or providing reasonably operationalised measurement items. Therefore, a checklist with 19 criteria was drafted. The current study aims at developing and validating this measure of quality. Methods and analysis: The validation design consists of five single studies to be conducted at the University of Halle-Wittenberg/Germany and Graz/Austria. (1) Achieving content validity through expert reviews of the first draft, (2) achieving feasibility using 'think aloud' in piloting with untrained users, (3) pretesting the instrument applied to health information materials without use of secondary sources: determining inter-rater reliability and criterion validity, (4) determining construct validity using information on proceedings and methods in the development process provided by the developers and (5) determining divergent validity in comparison with the Ensuring Quality Information for Patients (EQUIP) (expanded) Scale. The substudies will use varying samples of experts, students and developers and will apply the instrument to materials of various domains. Sample sizes will be adjusted to the particular research designs and questions. Analyses will employ qualitative methods, such as content analyses and discourse within the expert panel, and correlation-based methods both for determining inter-rater reliability and validity. Ethics and dissemination: The project is approved by the ethics committee of the Martin Luther University Halle-Wittenberg (approval number: 2019 115). Results will be published, and the instrument made accessible on public health platforms. It is meant to become a certification standard. MAPPinfo can be used as a screening instrument without training or secondary sources. Although developed in the German language, the instrument will be applicable also in other languages. Trial registration number: AsPredicted 22546; date of registration: 24 July 2019. Protocol version: July 2020.

Project description: Background: Clinical practice guidelines (CPGs) are representative methods for promoting the standardization of healthcare and improvement of its quality. Few studies have investigated changes in the quality of CPGs published in a country over time. Our aim was to investigate changes in the quality of CPGs over time in the context of the available infrastructure for CPG development, public interest in healthcare quality, and healthcare providers' responses to this interest. Methods: All CPGs pertaining to evidence-based medicine (EBM) issued between 2000 and 2014 in Japan (n = 373) were evaluated using the Japanese version of the Appraisal of Guidelines for Research and Evaluation (AGREE) I. Additionally, time trends in quality were analyzed. Using a cut-off point based on the publication year of CPG development literature, the evaluated CPGs were classified into those published until 2008 (pre-2008) and those published since 2009 (post-2008). Subsequently, we compared these groups in terms of 1) first-edition CPGs and their second editions, and 2) patients' versions of CPGs. Results: Scores on all six domains of AGREE I improved each year. A comparison of the first and second editions of CPGs (n = 64) showed that scores on all domains improved significantly after revision. Significant improvement was observed in three domains (#2 stakeholder involvement, #3 rigor of development, and #4 clarity of presentation) in the pre-2008 group and in all domains in the post-2008 group. The comparison between the pre- and post-2008 groups in terms of CPGs for patients showed that the score increased in only one domain (#1 scope and purpose). Conclusions: The number of published CPGs has been increasing and the quality of CPGs, as assessed using the AGREE I instrument, has been improving. These changes seem to be influenced by improvements in social infrastructure, such as the publication of CPG development procedures, availability of CPG preparation methodology training, and increase in CPG-related skills.

Project description: Introduction: Electronic health record (EHR)-driven phenotyping is a critical first step in generating biomedical knowledge from EHR data. Despite recent progress, current phenotyping approaches are manual, time-consuming, error-prone, and platform-specific. This results in duplication of effort and highly variable results across systems and institutions, and is not scalable or portable. In this work, we investigate how the nascent Clinical Quality Language (CQL) can address these issues and enable high-throughput, cross-platform phenotyping. Methods: We selected a clinically validated heart failure (HF) phenotype definition and translated it into CQL, then developed a CQL execution engine to integrate with the Observational Health Data Sciences and Informatics (OHDSI) platform. We executed the phenotype definition at two large academic medical centers, Northwestern Medicine and Weill Cornell Medicine, and conducted results verification (n = 100) to determine precision and recall. We additionally executed the same phenotype definition against two different data platforms, OHDSI and Fast Healthcare Interoperability Resources (FHIR), using the same underlying dataset and compared the results. Results: CQL is expressive enough to represent the HF phenotype definition, including Boolean and aggregate operators, and temporal relationships between data elements. The language design also enabled the implementation of a custom execution engine with relative ease, and results verification at both sites revealed that precision and recall were both 100%. Cross-platform execution resulted in identical patient cohorts generated by both data platforms. Conclusions: CQL supports the representation of arbitrarily complex phenotype definitions, and our execution engine implementation demonstrated cross-platform execution against two widely used clinical data platforms. The language thus has the potential to help address current limitations with portability in EHR-driven phenotyping and scale in learning health systems.

Project description: Introduction: The Semantic Web community provides a common Resource Description Framework (RDF) that allows representation of resources such that they can be linked. To maximize the potential of linked data - machine-actionable interlinked resources on the Web - a certain level of quality of RDF resources should be established, particularly in the biomedical domain in which concepts are complex and high-quality biomedical ontologies are in high demand. However, it is unclear which quality metrics for RDF resources exist that can be automated, which is required given the multitude of RDF resources. Therefore, we aim to determine these metrics and demonstrate an automated approach to assess such metrics of RDF resources. Methods: An initial set of metrics is identified through literature, standards, and existing tooling. Of these, metrics are selected that fulfil these criteria: (1) objective; (2) automatable; and (3) foundational. Selected metrics are represented in RDF and semantically aligned to existing standards. These metrics are then implemented in an open-source tool. To demonstrate the tool, eight commonly used RDF resources were assessed, including data models in the healthcare domain (HL7 RIM, HL7 FHIR, CDISC CDASH), ontologies (DCT, SIO, FOAF, ORDO), and a metadata profile (GRDDL). Results: Six objective metrics are identified in 3 categories: Resolvability (1), Parsability (1), and Consistency (4), and represented in RDF. The tool demonstrates that these metrics can be automated, and application in the healthcare domain shows non-resolvable URIs (ranging from 0.3% to 97%) among all eight resources and undefined URIs in HL7 RIM and FHIR. In the tested resources no errors were found for parsability and the other three consistency metrics for correct usage of classes and properties. Conclusion: We extracted six objective and automatable metrics from literature, as the foundational quality requirements of RDF resources to maximize the potential of linked data. Automated tooling to assess resources has been shown to be effective in identifying quality issues that must be avoided. This approach can be expanded to incorporate more automatable metrics so as to reflect additional quality dimensions with the assessment tool implementing more metrics.