Dataset Information

Scoring reading parameters: An inter-rater reliability study using the MNREAD chart.

ABSTRACT:

Purpose

First, to evaluate inter-rater reliability when human raters estimate the reading performance of visually impaired individuals using the MNREAD acuity chart. Second, to evaluate the agreement between computer-based scoring algorithms and compare them with human rating.

Methods

Reading performance was measured for 101 individuals with low vision, using the Portuguese version of the MNREAD test. Seven raters estimated the maximum reading speed (MRS) and critical print size (CPS) of each individual MNREAD curve. MRS and CPS were also calculated automatically for each curve using two different algorithms: the original standard deviation method (SDev) and non-linear mixed-effects (NLME) modeling. Intra-class correlation coefficients (ICC) were used to estimate absolute agreement between raters and/or algorithms.
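As an illustration of the agreement statistic named above, the following is a minimal sketch of a single-rater, absolute-agreement ICC computed from a two-way ANOVA decomposition. The abstract does not state which ICC form the authors used, so the two-way random-effects form here is an assumption, and the function name `icc_absolute_agreement` and the input layout (one row per curve, one column per rater) are hypothetical.

```python
import numpy as np

def icc_absolute_agreement(scores):
    """Single-rater, absolute-agreement ICC (two-way random effects).

    scores: 2-D array-like, one row per subject (curve), one column per rater.
    Assumed form; the abstract does not specify the exact ICC model.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape                      # n subjects, k raters
    grand = x.mean()
    row_means = x.mean(axis=1)          # per-subject means
    col_means = x.mean(axis=0)          # per-rater means

    # Sums of squares for subjects, raters, and residual error.
    ss_rows = k * np.sum((row_means - grand) ** 2)
    ss_cols = n * np.sum((col_means - grand) ** 2)
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols

    msr = ss_rows / (n - 1)             # between-subject mean square
    msc = ss_cols / (k - 1)             # between-rater mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square

    # Rater bias (msc) lowers the coefficient: this is what makes it
    # an "absolute agreement" rather than a "consistency" measure.
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Because the between-rater mean square enters the denominator, a rater who systematically scores higher than the others reduces this coefficient even when the ranking of curves is identical across raters.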

Results

Absolute agreement between raters was 'excellent' for MRS (ICC = 0.97; 95% CI [0.96, 0.98]) and 'moderate' to 'good' for CPS (ICC = 0.77; 95% CI [0.69, 0.83]). For CPS, inter-rater reliability was poorer among less experienced raters (ICC = 0.70; 95% CI [0.57, 0.80]) when compared to experienced ones (ICC = 0.82; 95% CI [0.76, 0.88]). Absolute agreement between the two algorithms was 'excellent' for MRS (ICC = 0.96; 95% CI [0.91, 0.98]). For CPS, the best possible agreement was found for CPS defined as the print size sustaining 80% of MRS (ICC = 0.77; 95% CI [0.68, 0.84]). Absolute agreement between raters and automated methods was 'excellent' for MRS (ICC = 0.96; 95% CI [0.88, 0.98] for SDev; ICC = 0.97; 95% CI [0.95, 0.98] for NLME). For CPS, absolute agreement between raters and SDev ranged from 'poor' to 'good' (ICC = 0.66; 95% CI [0.3, 0.80]), while agreement between raters and NLME was 'good' (ICC = 0.83; 95% CI [0.76, 0.88]).
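The qualitative labels in these results can be read as a mapping from the confidence interval bounds to reliability bands. The sketch below assumes the commonly used cut-offs of 0.50, 0.75, and 0.90; the abstract does not state its thresholds, and the function names are hypothetical. Note that the abstract labels the SDev interval [0.88, 0.98] for MRS as 'excellent', so the authors' rule may not be identical to this one.

```python
def icc_label(value):
    """Map a single ICC value to a reliability band (assumed cut-offs)."""
    if value < 0.50:
        return "poor"
    if value < 0.75:
        return "moderate"
    if value < 0.90:
        return "good"
    return "excellent"

def icc_interval_label(lower, upper):
    """Describe a confidence interval by the bands of its two bounds."""
    low_band, high_band = icc_label(lower), icc_label(upper)
    if low_band == high_band:
        return low_band
    return f"{low_band} to {high_band}"

# Intervals quoted in the abstract above.
print(icc_interval_label(0.96, 0.98))  # inter-rater MRS
print(icc_interval_label(0.69, 0.83))  # inter-rater CPS
print(icc_interval_label(0.3, 0.80))   # raters vs. SDev, CPS
```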

Conclusion

For MRS, inter-rater reliability is excellent, even considering the possibility of noisy and/or incomplete data collected in low-vision individuals. For CPS, inter-rater reliability is lower. This may be problematic, for instance in the context of multisite investigations or follow-up examinations. The NLME method showed better agreement with the raters than the SDev method for both reading parameters. Setting up consensual guidelines to deal with ambiguous curves may help improve reliability. While the exact definition of CPS should be chosen on a case-by-case basis depending on the clinician or researcher's motivations, evidence suggests that estimating CPS as the smallest print size sustaining about 80% of MRS would increase inter-rater reliability.
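The 80%-of-MRS definition of CPS mentioned above can be sketched as follows. This is only an illustration under stated assumptions: MRS is taken as an input rather than estimated by the SDev or NLME method, "sustaining" is interpreted as every larger print size also meeting the threshold, and the function name and arguments are hypothetical.

```python
def cps_at_fraction_of_mrs(print_sizes, speeds, mrs, fraction=0.8):
    """Smallest print size at which reading speed is still >= fraction * MRS.

    print_sizes: print sizes (e.g. in logMAR), any order.
    speeds: reading speeds (e.g. words per minute), same order.
    mrs: maximum reading speed, supplied by the caller (assumption).
    Returns None if even the largest print size is below the threshold.
    """
    threshold = fraction * mrs
    cps = None
    # Walk from the largest print size to the smallest and stop at the
    # first sentence read more slowly than the threshold.
    for size, speed in sorted(zip(print_sizes, speeds), reverse=True):
        if speed < threshold:
            break
        cps = size
    return cps

# Toy curve, not data from the study.
sizes = [1.0, 0.8, 0.6, 0.4, 0.2]
wpm = [100, 102, 98, 85, 40]
print(cps_at_fraction_of_mrs(sizes, wpm, mrs=100))
```

Stopping at the first drop below the threshold keeps an isolated fast sentence at a very small print size from being reported as the CPS, which is one way to handle a noisy curve; other conventions are possible.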

SUBMITTER: Baskaran K

PROVIDER: S-EPMC6555504 | biostudies-literature | 2019

REPOSITORIES: biostudies-literature


Publications

Scoring reading parameters: An inter-rater reliability study using the MNREAD chart.

Baskaran K, Macedo AF, He Y, Hernandez-Moreno L, Queirós T, Mansfield JS, Calabrèse A

PLoS One, 2019-06-07, issue 6


PMID: 31173587

Similar Datasets

Project description:The objective of this study was to determine the inter-rater reliability of current scoring systems used to detect abomasal lesions in veal calves. In addition, macroscopic lesions were compared with corresponding histological lesions. For this, 76 abomasa were retrieved from veal calves in a slaughterhouse in Quebec and scored by four independent raters using current scoring systems. The localisations of the lesions were separated into pyloric, fundic, or torus pyloricus areas. Lesions were classified into three different types, i.e., erosions, ulcers, and scars. To estimate the inter-rater reliability, the coefficient type 1 of Gwet's agreement and Fleiss κ were used for the presence or absence of a lesion, and the intra-class correlation coefficient was used for the number of lesions. All veal calves had at least one abomasal lesion detected. Most lesions were erosions, and most of them were located in the pyloric area. Overall, a poor to very good inter-rater agreement was seen for the pyloric area and the torus pyloricus regarding the presence or absence of a lesion (Fleiss κ: 0.00-0.34; Gwet's AC1: 0.12-0.83), although a higher agreement was observed when combining all lesions in the pyloric area (Fleiss κ: 0.09-0.12; Gwet's AC1: 0.43-0.93). For the fundic area, a poor to very good agreement was also observed (Fleiss κ: 0.17-0.70; Gwet's AC1: 0.90-0.97). Regarding the inter-rater agreement for the number of lesions, a poor to moderate agreement was found (ICC: 0.11-0.73). When using the scoring system developed in the European Welfare Quality Protocol, a poor single random rater agreement (ICC: 0.42; 95% CI: 0.31-0.56) but acceptable average random rater agreement (ICC: 0.75; 95% CI: 0.64-0.83) was determined. Microscopic scar lesions were often mistaken as ulcers macroscopically. These results show that the scoring of abomasal lesions is challenging and highlight the need for a reliable scoring system. 
A fast, simple, and reliable scoring system would allow for large scale studies which investigate possible risk factors and hopefully help to prevent these lesions, which can compromise veal calves' health and welfare.

Project description:BACKGROUND:There is a growing trend in the use of mobile health (mHealth) technologies in traditional Chinese medicine (TCM) and telemedicine, especially during the coronavirus disease (COVID-19) outbreak. Tongue diagnosis is an important component of TCM, but also plays a role in Western medicine, for example in dermatology. However, the procedure of obtaining tongue images has not been standardized and the reliability of tongue diagnosis by smartphone tongue images has yet to be evaluated. OBJECTIVE:The first objective of this study was to develop an operating classification scheme for tongue coating diagnosis. The second and main objective of this study was to determine the intra-rater and inter-rater reliability of tongue coating diagnosis using the operating classification scheme. METHODS:An operating classification scheme for tongue coating was developed using a stepwise approach and a quasi-Delphi method. First, tongue images (n=2023) were analyzed by 2 groups of assessors to develop the operating classification scheme for tongue coating diagnosis. Based on clinicians' (n=17) own interpretations as well as their use of the operating classification scheme, the results of tongue diagnosis on a representative tongue image set (n=24) were compared. After gathering consensus for the operating classification scheme, the clinicians were instructed to use the scheme to assess tongue features of their patients under direct visual inspection. At the same time, the clinicians took tongue images of the patients with smartphones and assessed tongue features observed in the smartphone image using the same classification scheme. The intra-rater agreements of these two assessments were calculated to determine which features of tongue coating were better retained by the image. 
Using the finalized operating classification scheme, clinicians in the study group assessed representative tongue images (n=24) that they had taken, and the intra-rater and inter-rater reliability of their assessments was evaluated. RESULTS:Intra-rater agreement between direct subject inspection and tongue image inspection was good to very good (Cohen κ range 0.69-1.0). Additionally, when comparing the assessment of tongue images on different days, intra-rater reliability was good to very good (κ range 0.7-1.0), except for the color of the tongue body (κ=0.22) and slippery tongue fur (κ=0.1). Inter-rater reliability was moderate for tongue coating (Gwet AC2 range 0.49-0.55), and fair for color and other features of the tongue body (Gwet AC2=0.34). CONCLUSIONS:Taken together, our study has shown that tongue images collected via smartphone contain some reliable features, including tongue coating, that can be used in mHealth analysis. Our findings thus support the use of smartphones in telemedicine for detecting changes in tongue coating.

Project description:Lung ultrasonography (LUS) is a non-invasive imaging method used to diagnose and monitor conditions such as pulmonary edema, pneumonia, and pneumothorax. It is precious where other imaging techniques like CT scan or chest X-rays are of limited access, especially in low- and middle-income countries with reduced resources. Furthermore, LUS reduces radiation exposure and its related blood cancer adverse events, which is particularly relevant in children and young subjects. The score obtained with LUS allows semi-quantification of regional loss of aeration, and it can provide a valuable and reliable assessment of the severity of most respiratory diseases. However, inter-observer reliability of the score has never been systematically assessed. This study aims to assess experienced LUS operators' agreement on a sample of video clips showing predefined findings. Twenty-five anonymized video clips comprehensively depicting the different values of LUS score were shown to renowned LUS experts blinded to patients' clinical data and the study's aims using an online form. Clips were acquired from five different ultrasound machines. Fleiss-Cohen weighted kappa was used to evaluate experts' agreement. Over a period of 3 months, 20 experienced operators completed the assessment. Most worked in the ICU (10), ED (6), HDU (2), cardiology ward (1), or obstetric/gynecology department (1). The proportional LUS score mean was 15.3 (SD 1.6). Inter-rater agreement varied: 6 clips had full agreement, 3 had 19 out of 20 raters agreeing, and 3 had 18 agreeing, while the remaining 13 had 17 or fewer people agreeing on the assigned score. Scores 0 and score 3 were more reproducible than scores 1 and 2. Fleiss' Kappa for overall answers was 0.87 (95% CI 0.815-0.931, p < 0.001). The inter-rater agreement between experienced LUS operators is very high, although not perfect. 
The strong agreement and the small variance enable us to say that a 20% tolerance around a measured value of a LUS score is a reliable estimate of the patient's true LUS score, resulting in reduced variability in score interpretation and greater confidence in its clinical use.

Project description:Background: The common manual measurement technique of spinal sagittal alignment on X-rays is susceptible to rater-dependent variability, which has not been adequately considered in previous publications. This study investigates the effect of those variations in the characterization of patients receiving lumbar spondylodesis. Methods: General alignment parameters on pre- and postoperative X-rays were evaluated by four raters in 43 prospectively sampled patients undergoing monolevel spondylodesis. The Intra-class Correlation Coefficient (ICC) for each rater pair and all raters together was calculated for inter-rater reliability. For the operation-induced change of the sagittal alignment in every patient the Wilcoxon test was applied to compare for each rater separately. Results: The ICCs were "good" (>0.75) to "excellent" (>0.9) for all raters together and for 45 of the 48 single rater pairs (93.75%). All revealed a significant increase of the addressed segmental lordosis and disc height and no significant change for spinopelvic parameters and sagittal vertical axis from pre- to postoperative. The lumbar lordosis showed a significant increase through the operation of +2.5° (p = 0.014) and +3.7° (p = 0.015) in two raters and no difference for the other ones (+2.1°, p = 0.171; -2.2°, p = 0.522). Conclusions: The pre- to postoperative change of lumbar lordosis revealed different significance levels for different raters, although the ICCs were formally good. Accordingly, the evaluation by only one rater would lead to different conclusions. Due to this susceptibility of alignment measurements to rater-dependent variability, the exact evaluation process should be described in every publication and the consistency of significant results be validated through multiple raters. Trial registration: The trial was approved by the local ethics committee and listed at the national clinical trials register (DRKS00004514, date of registration: 08/11/2012).

Project description:Objectives: To investigate the intra- and inter-rater reliability of the total radiomics quality score (RQS) and the reproducibility of individual RQS items' score in a large multireader study. Methods: Nine raters with different backgrounds were randomly assigned to three groups based on their proficiency with RQS utilization: Groups 1 and 2 represented the inter-rater reliability groups with or without prior training in RQS, respectively; group 3 represented the intra-rater reliability group. Thirty-three original research papers on radiomics were evaluated by raters of groups 1 and 2. Of the 33 papers, 17 were evaluated twice with an interval of 1 month by raters of group 3. Intraclass coefficient (ICC) for continuous variables, and Fleiss' and Cohen's kappa (k) statistics for categorical variables were used. Results: The inter-rater reliability was poor to moderate for total RQS (ICC 0.30-0.55, p < 0.001) and very low to good for item's reproducibility (k - 0.12 to 0.75) within groups 1 and 2 for both inexperienced and experienced raters. The intra-rater reliability for total RQS was moderate for the less experienced rater (ICC 0.522, p = 0.009), whereas experienced raters showed excellent intra-rater reliability (ICC 0.91-0.99, p < 0.001) between the first and second read. Intra-rater reliability on RQS items' score reproducibility was higher and most of the items had moderate to good intra-rater reliability (k - 0.40 to 1). Conclusions: Reproducibility of the total RQS and the score of individual RQS items is low. There is a need for a robust and reproducible assessment method to assess the quality of radiomics research. Clinical relevance statement: There is a need for reproducible scoring systems to improve quality of radiomics research and consecutively close the translational gap between research and clinical implementation. Key points:
• Radiomics quality score has been widely used for the evaluation of radiomics studies.
• Although the intra-rater reliability was moderate to excellent, intra- and inter-rater reliability of total score and point-by-point scores were low with radiomics quality score.
• A robust, easy-to-use scoring system is needed for the evaluation of radiomics research.
