Dataset Information

Methods of assessing categorical agreement between correlated screening tests in clinical studies.

ABSTRACT: Advances in breast imaging and other screening tests have prompted studies to evaluate and compare the consistency between experts' ratings of existing with new screening tests. In clinical settings, medical experts make subjective assessments of screening test results such as mammograms. Consistency between experts' ratings is evaluated by measures of inter-rater agreement or association. However, conventional measures, such as Cohen's and Fleiss' kappas, are unable to be applied or may perform poorly when studies consist of many experts, unbalanced data, or dependencies between experts' ratings exist. Here we assess the performance of existing approaches including recently developed summary measures for assessing the agreement between experts' binary and ordinal ratings when patients undergo two screening procedures. Methods to assess consistency between repeated measurements by the same experts are also described. We present applications to three large-scale clinical screening studies. Properties of these agreement measures are illustrated via simulation studies. Generally, a model-based approach provides several advantages over alternative methods including the ability to flexibly incorporate various measurement scales (i.e. binary or ordinal), large numbers of experts and patients, sparse data, and robustness to prevalence of underlying disease.

SUBMITTER: Zhou TJ

PROVIDER: S-EPMC8299998 | biostudies-literature | 2021

REPOSITORIES: biostudies-literature

ACCESS DATA

Publications

Methods of assessing categorical agreement between correlated screening tests in clinical studies.

Zhou Thomas J TJ Raza Sughra S Nelson Kerrie P KP

Journal of applied statistics 20200609 10

Advances in breast imaging and other screening tests have prompted studies to evaluate and compare the consistency between experts' ratings of existing with new screening tests. In clinical settings, medical experts make subjective assessments of screening test results such as mammograms. Consistency between experts' ratings is evaluated by measures of inter-rater agreement or association. However, conventional measures, such as Cohen's and Fleiss' kappas, are unable to be applied or may perform ...[more]

PMID: 34305250

Similar Datasets

Project description:BackgroundSeveral criteria have been used to assess agreement between replicate slide readings of malaria parasite density. Such criteria may be based on percent difference, or absolute difference, or a combination. Neither the rationale for choosing between these types of criteria, nor that for choosing the magnitude of difference which defines acceptable agreement, are clear. The current paper seeks a procedure which avoids the disadvantages of these current options and whose parameter values are more clearly justified.Methods and resultsVariation of parasite density within a slide is expected, even when it has been prepared from a homogeneous sample. This places lower limits on sensitivity and observer agreement, quantified by the Poisson distribution. This means that, if a criterion of fixed percent difference criterion is used for satisfactory agreement, the number of discrepant readings is over-estimated at low parasite densities. With a criterion of fixed absolute difference, the same happens at high parasite densities. For an ideal slide, following the Poisson distribution, a criterion based on a constant difference in square root counts would apply for all densities. This can be back-transformed to a difference in absolute counts, which, as expected, gives a wider range of acceptable agreement at higher average densities. In an example dataset from Tanzania, observed differences in square root counts correspond to a 95% limits of agreement of -2,800 and +2,500 parasites/microl at average density of 2,000 parasites/microl, and -6,200 and +5,700 parasites/microl at 10,000 parasites/microl. However, there were more outliers beyond those ranges at higher densities, meaning that actual coverage of these ranges was not a constant 95%, but decreased with density. In a second study, a trial of microscopist training, the corresponding ranges of agreement are wider and asymmetrical: -8,600 to +5,200/microl, and -19,200 to +11,700/microl, respectively. By comparison, the optimal limits of agreement, corresponding to Poisson variation, are +/- 780 and +/- 1,800 parasites/microl, respectively. The focus of this approach on the volume of blood read leads to other conclusions. For example, no matter how large a volume of blood is read, some densities are too low to be reliably detected, which in turn means that disagreements on slide positivity may simply result from within-slide variation, rather than reading errors.ConclusionsThe proposed method defines limits of acceptable agreement in a way which allows for the natural increase in variability with parasite density. This includes defining the levels of between-reader variability, which are consistent with random variation: disagreements within these limits should not trigger additional readings. This approach merits investigation in other settings, in order to determine both the extent of its applicability, and appropriate numerical values for limits of agreement.

Dataset Information

Methods of assessing categorical agreement between correlated screening tests in clinical studies.

Publications

Methods of assessing categorical agreement between correlated screening tests in clinical studies.

Similar Datasets

OmicsDI is part of the ELIXIR infrastructure

Tweets