
Dataset Information


A paired kappa to compare binary ratings across two medical tests.


ABSTRACT: Agreement between experts' ratings is an important prerequisite for an effective screening procedure. In clinical settings, large-scale studies are often conducted to compare the agreement of experts' ratings between new and existing medical tests, for example, digital versus film mammography. Challenges arise in these studies where many experts rate the same sample of patients undergoing two medical tests, leading to a complex correlation structure between experts' ratings. Here, we propose a novel paired kappa measure to compare the agreement between the binary ratings of many experts across two medical tests. Existing approaches can accommodate only a small number of experts, rely heavily on Cohen's kappa and Scott's pi measures of agreement, and thus are prone to their drawbacks. The proposed kappa appropriately accounts for correlations between ratings due to patient characteristics, corrects for agreement due to chance, and is robust to disease prevalence and other flaws inherent in the use of Cohen's kappa. It can be easily calculated in the software package R. In contrast to existing approaches, the proposed measure can flexibly incorporate large numbers of experts and patients by utilizing the generalized linear mixed models framework. It is intended to be used in population-based studies, increasing efficiency without increasing modeling complexity. Extensive simulation studies demonstrate low bias and excellent coverage probability of the proposed kappa under a broad range of conditions. Methods are applied to a recent nationwide breast cancer screening study comparing film mammography to digital mammography.
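The abstract notes that the proposed kappa is built on the generalized linear mixed models framework and can be calculated in R. The sketch below is a minimal illustration of that general approach, not the authors' exact paired estimator: it computes a model-based, chance-corrected kappa for a single test, assuming a probit GLMM with a random patient intercept fit via lme4::glmer. The data frame `ratings` and its columns `rating` (binary) and `patient` are hypothetical. The paired comparison described in the paper would fit both tests jointly and contrast the resulting agreement measures.

## Minimal sketch: model-based kappa for one test from a probit GLMM.
## Assumes a data frame `ratings` with a binary `rating` and a `patient` id.
library(lme4)

fit <- glmer(rating ~ 1 + (1 | patient),
             data = ratings, family = binomial(link = "probit"))

mu    <- fixef(fit)[1]                      # fixed intercept
sigma <- sqrt(VarCorr(fit)$patient[1, 1])   # SD of the patient random effect

## Marginal prevalence under the probit link
pi_hat <- pnorm(mu / sqrt(1 + sigma^2))

## Observed agreement: two raters of the same patient are conditionally
## independent given the patient effect u, so integrate
## P(agree | u) = p(u)^2 + (1 - p(u))^2 over the N(0,1) distribution of u
p_obs <- integrate(function(u) {
  p <- pnorm(mu + sigma * u)
  (p^2 + (1 - p)^2) * dnorm(u)
}, lower = -Inf, upper = Inf)$value

## Chance agreement and the chance-corrected kappa
p_chance  <- pi_hat^2 + (1 - pi_hat)^2
kappa_hat <- (p_obs - p_chance) / (1 - p_chance)
kappa_hat

Because both the agreement and chance-agreement probabilities are derived from the fitted model rather than from raw marginal counts, this style of kappa is less sensitive to disease prevalence than Cohen's kappa, which is the robustness property the abstract highlights.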

SUBMITTER: Nelson KP 

PROVIDER: S-EPMC6884009 | biostudies-literature | 2019 Jul

REPOSITORIES: biostudies-literature


Publications

A paired kappa to compare binary ratings across two medical tests.

Nelson KP, Edwards D

Statistics in Medicine, 2019 May 17, Issue 17



Similar Datasets

| S-EPMC7275524 | biostudies-literature
| S-EPMC5991263 | biostudies-literature
| S-EPMC7171595 | biostudies-literature
| S-EPMC8277718 | biostudies-literature
| S-EPMC6232393 | biostudies-literature
2013-05-01 | GSE42977 | GEO
| S-EPMC9293348 | biostudies-literature
2013-05-01 | E-GEOD-42977 | biostudies-arrayexpress
| S-EPMC7233794 | biostudies-literature
| S-EPMC3137217 | biostudies-literature