Dataset Information

Retrospective comparison of approaches to evaluating inter-observer variability in CT tumour measurements in an academic health centre.

ABSTRACT:

Background

A growing number of research studies have reported inter-observer variability in sizes of tumours measured from CT scans. It remains unclear whether the conventional statistical measures correctly evaluate the CT measurement consistency for optimal treatment management and decision-making. We compared and evaluated the existing measures for evaluating inter-observer variability in CT measurement of cancer lesions.

Methods

Thirteen board-certified radiologists repeatedly reviewed 10 CT image sets of lung lesions and hepatic metastases selected through a randomisation process. A total of 130 measurements under RECIST 1.1 (Response Evaluation Criteria in Solid Tumors) guidelines were collected for the demonstration. The intraclass correlation coefficient (ICC), Bland-Altman plotting and outlier counting methods were selected for the comparison. Each selected measure was used to evaluate three cases: observed, increased and decreased inter-observer variability.
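As an illustration of the ICC named in the Methods, the following is a minimal sketch in Python. The abstract does not state which ICC form the authors used, so the choice of ICC(2,1) (two-way random effects, absolute agreement, single reader) is an assumption, and the function name `icc2_1` and the toy readings are hypothetical rather than taken from the study.

```python
from statistics import mean

def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single reader.

    ratings: list of rows, one row per lesion, one value per reader.
    Assumption: this ICC form is illustrative; the study's form is not stated.
    """
    n = len(ratings)        # number of lesions (rows)
    k = len(ratings[0])     # number of readers (columns)
    grand = mean(v for row in ratings for v in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)               # between-lesion mean square
    msc = ss_cols / (k - 1)               # between-reader mean square
    mse = ss_err / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical longest-diameter readings in mm (4 lesions x 3 readers)
toy_ratings = [
    [20.0, 21.0, 19.0],
    [35.0, 36.0, 34.0],
    [12.0, 12.0, 13.0],
    [48.0, 47.0, 52.0],
]
print(round(icc2_1(toy_ratings), 3))
```

Because lesion sizes in such a table typically differ far more between lesions than between readers, the between-lesion mean square dominates the ratio, which is one reason an ICC can stay high even when individual readings disagree noticeably.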

Results

The ICC only weakly distinguished the different levels of inter-observer variability among radiologists (increased: 0.912; observed: 0.962; decreased: 0.990). Outlier counting using Bland-Altman plotting with 2SD limits detected no difference at all: the number of outliers was unchanged regardless of the level of inter-observer variability. Outlier counting based on domain knowledge was more sensitive to the level of inter-observer variability than the conventional measures (increased: 0.756; observed: 0.923; decreased: 1.000). Visualisation of pairwise Bland-Altman bias was also sensitive, with its pattern changing rapidly in response to different levels of inter-observer variability.
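One plausible reading of why the 2SD outlier count does not move is that the limits are computed from the same differences they judge, so stretching or shrinking every disagreement stretches or shrinks the limits by the same factor. The sketch below illustrates this on toy data. Everything in it is an assumption for illustration: the readings, the `scale_spread` helper used to mimic increased and decreased variability, and the fixed 2.5 mm tolerance standing in for a domain-knowledge criterion (the abstract does not state the study's actual criterion or how its three cases were generated).

```python
from itertools import combinations
from statistics import mean, stdev

def pairwise_differences(ratings):
    """Differences between every pair of readers, for every lesion."""
    return [row[a] - row[b]
            for row in ratings
            for a, b in combinations(range(len(row)), 2)]

def count_2sd_outliers(diffs):
    """Count differences outside bias +/- 2 SD (data-driven limits)."""
    bias, sd = mean(diffs), stdev(diffs)
    return sum(1 for d in diffs if abs(d - bias) > 2 * sd)

def fraction_within_tolerance(diffs, tolerance_mm):
    """Share of differences inside a fixed, externally chosen tolerance."""
    return sum(1 for d in diffs if abs(d) <= tolerance_mm) / len(diffs)

def scale_spread(ratings, factor):
    """Hypothetical helper: scale each reading's deviation from its lesion mean."""
    scaled = []
    for row in ratings:
        centre = mean(row)
        scaled.append([centre + factor * (v - centre) for v in row])
    return scaled

# Hypothetical readings in mm (5 lesions x 4 readers), one discordant reading
toy_ratings = [
    [20.0, 21.0, 19.0, 20.0],
    [35.0, 36.0, 34.0, 35.0],
    [12.0, 12.0, 13.0, 11.0],
    [48.0, 47.0, 49.0, 56.0],
    [27.0, 28.0, 26.0, 27.0],
]

TOLERANCE_MM = 2.5  # assumed tolerance, not a value from the study

for label, factor in [("increased", 2.0), ("observed", 1.0), ("decreased", 0.25)]:
    diffs = pairwise_differences(scale_spread(toy_ratings, factor))
    print(label,
          count_2sd_outliers(diffs),
          round(fraction_within_tolerance(diffs, TOLERANCE_MM), 3))
```

In this sketch the 2SD count is identical across the three cases, while the fixed-tolerance fraction rises as the spread shrinks, which mirrors the qualitative pattern reported above without reproducing the study's numbers.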

Conclusions

Conventional measures may yield weak or no detection when evaluating different levels of inter-observer variability among radiologists. We observed that outlier counting based on domain knowledge was sensitive to the inter-observer variability in CT measurement of cancer lesions. Our study demonstrated that, under certain circumstances, the use of standard statistical correlation coefficients may be misleading and result in a false sense of security about the consistency of measurement for optimal treatment management and decision-making.

SUBMITTER: Woo M

PROVIDER: S-EPMC7668356 | biostudies-literature | 2020 Nov

REPOSITORIES: biostudies-literature

Publications

Retrospective comparison of approaches to evaluating inter-observer variability in CT tumour measurements in an academic health centre.

Woo MinJae, Heo Moonseong, Devane A Michael, Lowe Steven C, Gimbel Ronald W

BMJ Open. 2020 Nov 14; issue 11

PMID: 33191265

Similar Datasets

Project description: Objectives: To investigate the interdisciplinary interobserver reproducibility of Hertel-exophthalmometry-like protrusion measurements on multidetector-row computed tomography (MDCT) images of the orbit to facilitate structured evaluation of the orbit and mid-face. Methods: Respective reproducibility of base-length along the interfronto-zygomatic line, right and left ocular protrusion, and deriving interocular difference was measured in this retrospective (04/2009-03/2020) single-centre observational study. MDCT series and slice positions were selected independently, using picture archiving and communication system (PACS) tools on tilt-corrected axial MDCT images (slice thickness 0.6-3.0 mm, window/centre 350/50 HU) in 37 selected adult patients (24 female, age 57 ± 13 years, average ± standard deviation) with clinical indication for Hertel-exophthalmometry, by one radiology attending, two ophthalmology attendings, one critical-care attending, and one ear-nose-throat-surgery resident, respectively. Bland-Altman plots and Wilcoxon matched-pairs signed-rank tests compared interobserver results. Results: Mean and median interobserver and intraobserver (radiology attending) deviations were within 1 mm of respective averages of base-length (98 ± 4 mm), right and left ocular protrusion (21 ± 4 mm) and interocular difference (2 ± 1 mm). Relative interobserver deviations were within 2.0% of average (all patients) for base-length, and 5.0% (>80% of patients) for ocular protrusion. Pairwise interobserver comparison showed no significant differences between interocular differences of protrusion. Conclusions: Respective measurements of base-length, ocular protrusion, and deriving interocular difference show high interdisciplinary interobserver reproducibility in tilt-corrected axial MDCT images of the orbit or mid-face. Advances in knowledge: Hertel-exophthalmometry-like protrusion measurements did not depend on the years of experience or the medical subspecialty of the observer. Measurements are objective, well reproducible and important for multiple medical disciplines and should thus be included in pertinent radiology reports.

Project description: INTRODUCTION: Strong correlation has been demonstrated between tumor dose and response and between healthy liver dose and side effects. Individualized dosimetry is increasingly recommended in the current clinical routine. However, hepatic and tumor segmentations could be complex in some cases. The aim of this study is to assess the reproducibility of the tumoral and non-tumoral liver dosimetry in selective internal radiation therapy (SIRT). MATERIAL AND METHODS: Twenty-three patients with hepatocellular carcinoma (HCC) who underwent SIRT with glass microspheres were retrospectively included in the study. Tumor (TV) and total liver volumes (TLV), and mean absorbed doses in tumoral liver (TD) and non-tumoral liver (THLD) were determined on the 90Y PET/CT studies using Simplicit90Y™ software, by three independent observers. Dosimetry datasets were obtained by a medical physicist helped by a nuclear medicine (NM) physician with 10 years of experience (A), by a NM physician with 4 years of experience (B), and by a resident who first performed 10 dosimetry assessments as training (C). Inter-observer agreement was evaluated using intra-class correlation coefficients (ICC), coefficients of variation (CV), Bland-Altman plots, and reproducibility coefficient (RDC). RESULTS: A strong agreement was observed between all three readers for estimating TLV (ICC 0.98) and THLD (ICC 0.97). Agreement was lower for TV delineation (ICC 0.94) and particularly for TD (ICC 0.73), especially for the highest values. Regarding TD, the CV (%) was 26.5, 26.9, and 20.2 between observers A and B, A and C, and B and C, respectively, and the RDC was 1.5. Regarding THLD, the CV (%) was 8.5, 12.7, and 9.4, and the RDC was 1.3. CONCLUSION: Using a standardized methodology, and regardless of the different experiences of the observers, the estimation of THLD is highly reproducible. Although the reproducibility of the assessment of tumor irradiation is overall quite high, large variations may be observed in a limited number of patients.

Project description: Objectives: Establishing the reproducibility of expert-derived measurements on CTA exams of aortic dissection is clinically important and paramount for ground-truth determination for machine learning. Methods: Four independent observers retrospectively evaluated CTA exams of 72 patients with uncomplicated Stanford type B aortic dissection and assessed the reproducibility of a recently proposed combination of four morphologic risk predictors (maximum aortic diameter, false lumen circumferential angle, false lumen outflow, and intercostal arteries). For the first inter-observer variability assessment, 47 CTA scans from one aortic center were evaluated by expert-observer 1 in an unconstrained clinical assessment without a standardized workflow and compared to a composite of three expert-observers (observers 2-4) using a standardized workflow. A second inter-observer variability assessment on 30 out of the 47 CTA scans compared observers 3 and 4 with a constrained, standardized workflow. A third inter-observer variability assessment was done after specialized training and tested between observers 3 and 4 in an external population of 25 CTA scans. Inter-observer agreement was assessed with intraclass correlation coefficients (ICCs) and Bland-Altman plots. Results: Pre-training ICCs of the four morphologic features ranged from 0.04 (-0.05 to 0.13) to 0.68 (0.49-0.81) between observer 1 and observers 2-4 and from 0.50 (0.32-0.69) to 0.89 (0.78-0.95) between observers 3 and 4. ICCs improved after training ranging from 0.69 (0.52-0.87) to 0.97 (0.94-0.99), and Bland-Altman analysis showed decreased bias and limits of agreement. Conclusions: Manual morphologic feature measurements on CTA images can be optimized resulting in improved inter-observer reliability. This is essential for robust ground-truth determination for machine learning models. Key points:
• Clinical fashion manual measurements of aortic CTA imaging features showed poor inter-observer reproducibility.
• A standardized workflow with standardized training resulted in substantial improvements with excellent inter-observer reproducibility.
• Robust ground truth labels obtained manually with excellent inter-observer reproducibility are key to develop reliable machine learning models.

Project description: Background: PET-based tumor delineation is an error-prone and labor-intensive part of image analysis. Especially for patients with advanced disease showing bulky tumor FDG load, segmentations are challenging. Reducing the amount of user interaction in the segmentation might help to facilitate segmentation tasks, especially when labeling bulky and complex tumors. Therefore, this study reports on segmentation workflows/strategies that may reduce the inter-observer variability for large tumors with complex shapes with different levels of user interaction. Methods: Twenty PET images of bulky tumors were delineated independently by six observers using four strategies: (I) manual, (II) interactive threshold-based, (III) interactive threshold-based segmentation with the additional presentation of the PET-gradient image and (IV) the selection of the most reasonable result out of four established semi-automatic segmentation algorithms (Select-the-best approach). The segmentations were compared using Jaccard coefficients (JC) and percentage volume differences. To obtain a reference standard, a majority vote (MV) segmentation was calculated including all segmentations of experienced observers. Performed and MV segmentations were compared regarding positive predictive value (PPV), sensitivity (SE), and percentage volume differences. Results: The results show that with decreasing user interaction the inter-observer variability decreases. JC values and percentage volume differences of Select-the-best and a workflow including gradient information were significantly better than the measurements of the other segmentation strategies (p-value<0.01). Interactive threshold-based and manual segmentations also result in significantly lower and more variable PPV/SE values when compared with the MV segmentation. Conclusions: FDG PET segmentations of bulky tumors using strategies with lower user interaction showed less inter-observer variability. None of the methods led to good results in all cases, but use of either the gradient or the Select-the-best workflow did outperform the other strategies tested and may be a good candidate for fast and reliable labeling of bulky and heterogeneous tumors.
