Dataset Information

Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument.

ABSTRACT:

Background

Diabetic retinopathy (DR) is a leading cause of vision loss in working-age individuals worldwide. While screening is effective and cost-effective, it remains underutilized, and novel methods are needed to increase detection of DR. This clinical validation study compared diagnostic gradings of retinal fundus photographs provided by volunteers on the Amazon Mechanical Turk (AMT) crowdsourcing marketplace with expert-provided gold-standard grading and explored whether determination of the consensus of crowdsourced classifications could be improved beyond a simple majority vote (MV) using regression methods.

Objective

The aim of our study was to determine whether regression methods could be used to improve the consensus grading of data collected by crowdsourcing.

Methods

A total of 1200 retinal images of individuals with diabetes mellitus from the Messidor public dataset were posted to AMT. Eligible crowdsourcing workers had at least 500 previously approved tasks with an approval rating of 99% across their prior submitted work. A total of 10 workers were recruited to classify each image as normal or abnormal. If half or more workers judged the image to be abnormal, the MV consensus grade was recorded as abnormal. Rasch analysis was then used to calculate worker ability scores in a random 50% training set, which were then used as weights in a regression model in the remaining 50% test set to determine if a more accurate consensus could be devised. Outcomes of interest were the percentage of correctly classified images, sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) for the consensus grade as compared with the expert grading provided with the dataset.
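The majority-vote rule and the accuracy outcomes described above can be illustrated with a minimal Python sketch. The function names and the toy grades are hypothetical and are not taken from the study; only the "half or more" tie rule comes from the Methods text.

```python
from typing import Sequence, Tuple


def majority_vote(votes: Sequence[int]) -> int:
    """Consensus grade for one image: 1 (abnormal) if half or more votes are 1."""
    return int(2 * sum(votes) >= len(votes))


def sensitivity_specificity(
    predicted: Sequence[int], truth: Sequence[int]
) -> Tuple[float, float]:
    """Sensitivity and specificity of predicted grades against reference grades."""
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)
    tn = sum(1 for p, t in pairs if p == 0 and t == 0)
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (hypothetical): 4 images, 10 worker votes each (1 = abnormal).
grades = [
    [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],  # 6 of 10 abnormal -> abnormal
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # 5 of 10 (tie)    -> abnormal
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # 4 of 10          -> normal
    [0] * 10,                        # 0 of 10          -> normal
]
expert = [1, 0, 1, 0]  # hypothetical reference grades

consensus = [majority_vote(v) for v in grades]
sens, spec = sensitivity_specificity(consensus, expert)
accuracy = sum(c == e for c, e in zip(consensus, expert)) / len(expert)
```

Because ties count as abnormal, the rule leans toward flagging images, which tends to favor sensitivity over specificity in a screening setting.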

Results

Using MV grading, the consensus was correct in 75.5% of images (906/1200), with 75.5% sensitivity, 75.5% specificity, and an AUROC of 0.75 (95% CI 0.73-0.78). A logistic regression model using Rasch-weighted individual scores generated an AUROC of 0.91 (95% CI 0.88-0.93) compared with 0.89 (95% CI 0.86-0.92) for a model using unweighted scores (chi-square P<.001). Setting a diagnostic cut-point to optimize sensitivity at 90%, 77.5% (465/600) were graded correctly, with 90.3% sensitivity, 68.5% specificity, and an AUROC of 0.79 (95% CI 0.76-0.83).
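Choosing a cut-point to reach a target sensitivity can be sketched as follows. This is one simple way to do it, not necessarily the procedure the authors used; the function names, scores, and labels are hypothetical, and the scores stand in for predicted probabilities from any fitted model.

```python
from typing import Sequence


def sensitivity_at(scores: Sequence[float], truth: Sequence[int], cut: float) -> float:
    """Share of truly abnormal images whose score is at or above the cut-point."""
    positives = [s for s, t in zip(scores, truth) if t == 1]
    if not positives:
        raise ValueError("need at least one abnormal reference grade")
    return sum(s >= cut for s in positives) / len(positives)


def pick_cutpoint(
    scores: Sequence[float], truth: Sequence[int], target: float = 0.90
) -> float:
    """Highest observed score usable as a cut-point with sensitivity >= target."""
    for cut in sorted(set(scores), reverse=True):
        if sensitivity_at(scores, truth, cut) >= target:
            return cut
    return min(scores)  # only reached if target exceeds 1


# Toy data (hypothetical): model scores and reference grades for 10 images.
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]
truth = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]

cut_90 = pick_cutpoint(scores, truth, target=0.90)
cut_80 = pick_cutpoint(scores, truth, target=0.80)
```

Lowering the cut-point raises sensitivity at the cost of specificity, which matches the direction of the trade-off reported above (90.3% sensitivity with 68.5% specificity).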

Conclusions

Crowdsourced interpretations of retinal images provide rapid and accurate results as compared with a gold-standard grading. Creating a logistic regression model using Rasch analysis to weight crowdsourced classifications by worker ability improves accuracy of aggregated grades as compared with simple majority vote.
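For reference, the standard dichotomous Rasch model has the form below. The notation is the conventional one and does not appear in this record. Treating workers as "persons" with ability and images as "items" with difficulty is an assumption about how the model maps onto this study, and the abstract does not state exactly how the ability estimates entered the regression as weights.

```latex
P(X_{ni} = 1 \mid \theta_n, b_i) = \frac{\exp(\theta_n - b_i)}{1 + \exp(\theta_n - b_i)}
```

Here $X_{ni} = 1$ denotes a correct classification of image $i$ by worker $n$, $\theta_n$ is the worker's ability, and $b_i$ is the image's difficulty, both on the logit scale.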

SUBMITTER: Brady CJ

PROVIDER: S-EPMC5497070 | biostudies-literature | 2017 Jun

REPOSITORIES: biostudies-literature

Publications

Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument.

Christopher John Brady, Lucy Iluka Mudie, Xueyang Wang, Eliseo Guallar, David Steven Friedman

Journal of Medical Internet Research, 2017 Jun 20, issue 6

PMID: 28634154

Similar Datasets

Rasch Analysis for Instrument Development: Why, When, and How? | S-EPMC5132390 | biostudies-literature

Probabilistic Multigraph Modeling for Improving the Quality of Crowdsourced Affective Data. | S-EPMC6771927 | biostudies-literature

Development of a physics-based force field for the scoring and refinement of protein models. | S-EPMC2275715 | biostudies-literature

Validation of the Ocular Pain Assessment Survey Instrument With Rasch Analysis. | S-EPMC11838117 | biostudies-literature

Instrument development, data collection, and characteristics of practices, staff, and measures in the Improving Quality of Care in Diabetes (iQuaD) Study. | S-EPMC3130687 | biostudies-literature

Scoring haemophilic arthropathy on X-rays: improving inter- and intra-observer reliability and agreement using a consensus atlas. | S-EPMC4869743 | biostudies-literature

Increase of Uncertainty in Summed-Score-Based Scoring in Non-Rasch IRT. | S-EPMC12162545 | biostudies-literature

Applying Rasch analysis in refinement and validation of interpersonal skills measure for gifted children. | S-EPMC10501403 | biostudies-literature

Refinement and Validation of the Empowerment Audiology Questionnaire: Rasch Analysis and Traditional Psychometric Evaluation. | S-EPMC11008442 | biostudies-literature

Diagnosing capillary leak in critically ill patients: development of an innovative scoring instrument for non-invasive detection. | S-EPMC8674404 | biostudies-literature