ABSTRACT
Background
Poor inter-rater reliability in chest radiograph interpretation has been reported in the context of acute respiratory distress syndrome (ARDS), although not for the Berlin definition of ARDS. We sought to examine the effect of training material on the accuracy and consistency of intensivists' chest radiograph interpretations for ARDS diagnosis.
Methods
We conducted a rater agreement study in which 286 intensivists (residents 41.3%, junior attending physicians 35.3%, and senior attending physicians 23.4%) independently reviewed the same 12 chest radiographs developed by the ARDS Definition Task Force ("the panel") before and after training. The panel's radiographic diagnoses were classified into consistent (n = 4), equivocal (n = 4), and inconsistent (n = 4) categories and used as the reference standard. The 1.5-hour training course, attended by all 286 intensivists, included an introduction to the diagnostic rationale and a subsequent in-depth discussion to reach consensus on all 12 radiographs.
Results
Overall diagnostic accuracy, which was defined as the percentage of chest radiographs that were interpreted correctly, improved but remained poor after training (42.0 ± 14.8% before training vs. 55.3 ± 23.4% after training, p < 0.001). Diagnostic sensitivity and specificity improved after training for all diagnostic categories (p < 0.001), with the exception of specificity for the equivocal category (p = 0.883). Diagnostic accuracy was higher for the consistent category than for the inconsistent and equivocal categories (p < 0.001). Comparisons of pre-training and post-training results revealed that inter-rater agreement was poor and did not improve after training, as assessed by overall agreement (0.450 ± 0.406 vs. 0.461 ± 0.575, p = 0.792), Fleiss's kappa (0.133 ± 0.575 vs. 0.178 ± 0.710, p = 0.405), and intraclass correlation coefficient (ICC; 0.219 vs. 0.276, p = 0.470).
Conclusions
The radiographic diagnostic accuracy and inter-rater agreement were poor when the Berlin radiographic definition was used, and were not significantly improved by the training set of chest radiographs developed by the ARDS Definition Task Force.
Trial registration
The study was registered at ClinicalTrials.gov (registration number NCT01704066) on 6 October 2012.
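As an illustrative aside on the agreement statistic reported above: Fleiss's kappa is computed from a subjects × categories count matrix, where each entry gives how many of the raters assigned that subject to that category. The sketch below is a minimal, self-contained implementation of the standard formula (the function name `fleiss_kappa` is ours); it is not the study's analysis code.

```python
def fleiss_kappa(counts):
    """Fleiss's kappa for m subjects rated by a fixed panel of n raters.

    counts[i][j] = number of raters who assigned subject i to category j.
    Every row must sum to the same number of raters n.
    """
    m = len(counts)           # number of subjects
    n = sum(counts[0])        # raters per subject (assumed constant)
    k = len(counts[0])        # number of categories
    total = m * n
    # Marginal proportion of all ratings falling in each category.
    p = [sum(row[j] for row in counts) / total for j in range(k)]
    # Observed agreement for each subject: fraction of agreeing rater pairs.
    P = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P) / m        # mean observed agreement
    P_e = sum(pj * pj for pj in p)  # agreement expected by chance
    return (P_bar - P_e) / (1 - P_e)
```

A kappa of 1 indicates perfect agreement beyond chance, 0 indicates chance-level agreement; the values reported in the abstract (0.133 before and 0.178 after training) fall in the range conventionally described as slight agreement.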
SUBMITTER: Peng JM
PROVIDER: S-EPMC5251343 | biostudies-literature | 2017 Jan
REPOSITORIES: biostudies-literature
Peng Jin-Min JM Qian Chuan-Yun CY Yu Xiang-You XY Zhao Ming-Yan MY Li Shu-Sheng SS Ma Xiao-Chun XC Kang Yan Y Zhou Fa-Chun FC He Zhen-Yang ZY Qin Tie-He TH Yin Yong-Jie YJ Jiang Li L Hu Zhen-Jie ZJ Sun Ren-Hua RH Lin Jian-Dong JD Li Tong T Wu Da-Wei DW An You-Zhong YZ Ai Yu-Hang YH Zhou Li-Hua LH Cao Xiang-Yuan XY Zhang Xi-Jing XJ Sun Rong-Qing RQ Chen Er-Zhen EZ Du Bin B
Critical care (London, England) 20170120 1