Dataset Information

Assessment of examiner leniency and stringency ('hawk-dove effect') in the MRCP(UK) clinical examination (PACES) using multi-facet Rasch modelling.

ABSTRACT:

Background

A potential problem of clinical examinations is known as the hawk-dove problem, some examiners being more stringent and requiring a higher performance than other examiners who are more lenient. Although the problem has been known qualitatively for at least a century, we know of no previous statistical estimation of the size of the effect in a large-scale, high-stakes examination. Here we use FACETS to carry out a multi-facet Rasch modelling of the paired judgements made by examiners in the clinical examination (PACES) of MRCP(UK), where identical candidates were assessed in identical situations, allowing calculation of examiner stringency.
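For orientation, the many-facet Rasch model is commonly written in the rating-scale form below. This is a sketch of the standard textbook formulation, not an equation taken from this record: the symbols are introduced here for illustration, and the record does not say whether the authors used this form or a partial-credit variant.

```latex
% Standard rating-scale form of the many-facet Rasch model (illustrative notation).
% P_{nijk}: probability that candidate n receives mark k from examiner j on station i
% B_n: ability of candidate n          D_i: difficulty of station i
% C_j: stringency of examiner j        F_k: threshold between marks k-1 and k
\log\!\left(\frac{P_{nijk}}{P_{nij(k-1)}}\right) = B_n - D_i - C_j - F_k
```

All terms are on the same logit scale, so a more stringent examiner (larger C_j) lowers the log-odds of the higher mark for any candidate and station.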

Methods

Data were analysed from the first nine diets of PACES, which were taken between June 2001 and March 2004 by 10,145 candidates. Each candidate was assessed by two examiners on each of seven separate tasks, with the candidates assessed by a total of 1,259 examiners, resulting in a total of 142,030 marks. Examiner demographics were described in terms of age, sex, ethnicity, and total number of candidates examined.
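The total of 142,030 marks follows directly from the design described above. A minimal arithmetic check, using only the figures quoted in this paragraph (the variable names are illustrative):

```python
# Figures quoted in the Methods paragraph.
candidates = 10_145
tasks_per_candidate = 7
examiners_per_task = 2

# Each candidate receives one mark per examiner per task.
marks_per_candidate = tasks_per_candidate * examiners_per_task
total_marks = candidates * marks_per_candidate

print(marks_per_candidate, total_marks)
```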

Results

FACETS suggested that about 87% of main effect variance was due to candidate differences, 1% due to station differences, and 12% due to differences between examiners in leniency-stringency. Multiple regression suggested that greater examiner stringency was associated with greater examiner experience and being from an ethnic minority. Male and female examiners showed no overall difference in stringency. Examination scores were adjusted for examiner stringency and it was shown that for the present pass mark, the outcome for 95.9% of candidates would be unchanged using adjusted marks, whereas 2.6% of candidates would have passed, even though they had failed on the basis of raw marks, and 1.5% of candidates would have failed, despite passing on the basis of raw marks.
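To illustrate why examiner stringency can shift a mark, the sketch below computes mark probabilities under a rating-scale many-facet Rasch model, in which the log-odds of a higher mark fall as examiner stringency rises. It is not the FACETS implementation or the authors' analysis: the function names, the four-category scale, and every numeric value are hypothetical and chosen only for illustration.

```python
import math


def category_probabilities(ability, station_difficulty, examiner_stringency, thresholds):
    """Return P(mark = k) for k = 0..len(thresholds) under a rating-scale
    many-facet Rasch model. All arguments are in logits."""
    location = ability - station_difficulty - examiner_stringency
    # The log-numerator of category k is the running sum of (location - threshold).
    log_numerators = [0.0]
    for tau in thresholds:
        log_numerators.append(log_numerators[-1] + location - tau)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_numerators)
    weights = [math.exp(v - peak) for v in log_numerators]
    total = sum(weights)
    return [w / total for w in weights]


def expected_mark(probabilities):
    """Expected mark, with categories scored 0, 1, 2, ..."""
    return sum(k * p for k, p in enumerate(probabilities))


# Hypothetical values: one candidate, one station, two examiners.
thresholds = [-1.5, 0.0, 1.5]  # assumed four-category scale
dove = category_probabilities(0.5, 0.0, -0.5, thresholds)  # lenient examiner
hawk = category_probabilities(0.5, 0.0, 0.5, thresholds)   # stringent examiner

print(round(expected_mark(dove), 2), round(expected_mark(hawk), 2))
```

For the same candidate and station, the stringent examiner yields a lower expected mark than the lenient one, which is the effect that adjusting marks for stringency is meant to remove.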

Conclusion

Examiners do differ in their leniency or stringency, and the effect can be estimated using Rasch modelling. The reasons for differences are not clear, but there are some demographic correlates, and the effects appear to be reliable across time. Account can be taken of differences, either by adjusting marks or, perhaps more effectively and more justifiably, by pairing high and low stringency examiners, so that raw marks can be used in the determination of pass and fail.

SUBMITTER: McManus IC

PROVIDER: S-EPMC1569374 | biostudies-literature | 2006 Aug

REPOSITORIES: biostudies-literature

Publications

Assessment of examiner leniency and stringency ('hawk-dove effect') in the MRCP(UK) clinical examination (PACES) using multi-facet Rasch modelling.

McManus IC, Thompson M, Mollon J

BMC Medical Education, 2006-08-18

PMID: 16919156

Similar Datasets

Project description:
Background: Failure rates in postgraduate examinations are often high, and many candidates therefore retake examinations several or even many times. Little, however, is known about how candidates perform across those multiple attempts. A key theoretical question to be resolved is whether candidates pass at a resit because they have got better, having acquired more knowledge or skills, or whether they have got lucky, chance helping them to get over the pass mark. In the UK, the issue of resits has become of particular interest since the General Medical Council issued a consultation and is considering limiting the number of attempts candidates may make at examinations.
Methods: Since 1999 the examination for Membership of the Royal Colleges of Physicians of the United Kingdom (MRCP(UK)) has imposed no limit on the number of attempts candidates can make at its Part 1, Part 2 or PACES (Clinical) examination. The present study examined the performance of candidates on the examinations from 2002/2003 to 2010, during which time the examination structure has been stable. Data were available for 70,856 attempts at Part 1 by 39,335 candidates, 37,654 attempts at Part 2 by 23,637 candidates and 40,303 attempts at PACES by 21,270 candidates, with the maximum number of attempts being 26, 21 and 14, respectively. The results were analyzed using multilevel modelling, fitting negative exponential growth curves to individual candidate performance.
Results: The number of candidates taking the assessment falls exponentially at each attempt. Performance improves across attempts, with evidence in the Part 1 examination that candidates are still improving up to the tenth attempt, with a similar improvement up to the fourth attempt in Part 2 and the sixth attempt at PACES. Random effects modelling shows that candidates begin at a starting level, with performance increasing by a smaller amount at each attempt, with evidence of a maximum, asymptotic level for candidates, and candidates showing variation in starting level, rate of improvement and maximum level. Modelling longitudinal performance across the three diets (sittings) shows that the starting level at Part 1 predicts starting level at both Part 2 and PACES, and the rate of improvement at Part 1 also predicts the starting level at Part 2 and PACES.
Conclusion: Candidates continue to show evidence of true improvement in performance up to at least the tenth attempt at MRCP(UK) Part 1, although there are individual differences in the starting level, the rate of improvement and the maximum level that can be achieved. Such findings provide little support for arguments that candidates should only be allowed a fixed number of attempts at an examination. However, unlimited numbers of attempts are also difficult to justify because of the inevitable and ever increasing role that luck must play with increasing numbers of resits, so that the issue of multiple attempts might be better addressed by tackling the difficult question of how a pass mark should increase with each attempt at an exam.

Project description:
Background: The UK General Medical Council has emphasized the lack of evidence on whether graduates from different UK medical schools perform differently in their clinical careers. Here we assess the performance of UK graduates who have taken MRCP(UK) Part 1 and Part 2, which are multiple-choice assessments, and PACES, an assessment using real and simulated patients of clinical examination skills and communication skills, and we explore the reasons for the differences between medical schools.
Method: We perform a retrospective analysis of the performance of 5827 doctors graduating in UK medical schools taking the Part 1, Part 2 or PACES for the first time between 2003/2 and 2005/3, and 22453 candidates taking Part 1 from 1989/1 to 2005/3.
Results: Graduates of UK medical schools performed differently in the MRCP(UK) examination between 2003/2 and 2005/3. Part 1 and 2 performance of Oxford, Cambridge and Newcastle-upon-Tyne graduates was significantly better than average, and the performance of Liverpool, Dundee, Belfast and Aberdeen graduates was significantly worse than average. In the PACES (clinical) examination, Oxford graduates performed significantly above average, and Dundee, Liverpool and London graduates significantly below average. About 60% of medical school variance was explained by differences in pre-admission qualifications, although the remaining variance was still significant, with graduates from Leicester, Oxford, Birmingham, Newcastle-upon-Tyne and London overperforming at Part 1, and graduates from Southampton, Dundee, Aberdeen, Liverpool and Belfast underperforming relative to pre-admission qualifications. The ranking of schools at Part 1 in 2003/2 to 2005/3 correlated 0.723, 0.654, 0.618 and 0.493 with performance in 1999-2001, 1996-1998, 1993-1995 and 1989-1992, respectively.
Conclusion: Candidates from different UK medical schools perform differently in all three parts of the MRCP(UK) examination, with the ordering consistent across the parts of the exam and with the differences in Part 1 performance being consistent from 1989 to 2005. Although pre-admission qualifications explained some of the medical school variance, the remaining differences do not seem to result from career preference or other selection biases, and are presumed to result from unmeasured differences in ability at entry to the medical school or to differences between medical schools in teaching focus, content and approaches. Exploration of causal mechanisms would be enhanced by results from a national medical qualifying examination.

Project description:
Objectives: Sources of bias, such as the examiners, domains and stations, can influence student marks in an objective structured clinical examination (OSCE). This study describes the extent to which the facets modelled in an OSCE can contribute to scoring variance and how they fit into a Many-Facet Rasch Model (MFRM) of OSCE performance. A further objective is to identify the functioning of the rating scale used.
Design: A non-experimental cross-sectional design.
Participants and settings: An MFRM was used to identify sources of error (eg, examiner, domain and station), which may influence the student outcome. A 16-station OSCE was conducted for 329 final year medical students. Domain-based marking was applied, each station using a sample from eight defined domains across the whole OSCE. The domains were defined as follows: communication skills, professionalism, information gathering, information giving, clinical interpretation, procedure, diagnosis and management. The domains in each station were weighted to ensure proper attention to the construct of the individual station. Four facets were assessed: students, examiners, domains and stations.
Results: The results suggest that the OSCE data fit the model, confirming that an MFRM approach was appropriate to use. The variable map allows a comparison with and between the facets of students, examiners, domains and stations and the 5-point score for each domain with each station as they are calibrated to the same scale. Fit statistics showed that the domains map well to the performance of the examiners. No statistically significant difference between examiner sensitivity (3.85 logits) was found. However, the results did suggest examiners were lenient and that some behaved inconsistently. The results also suggest that the functioning of response categories on the 5-point rating scale needs further examination and optimisation.
Conclusions: The results of the study have important implications for examiner monitoring and training activities, to aid assessment improvement.

Project description:
Purpose: Ensuring that examiners in different parallel circuits of objective structured clinical examinations (OSCEs) judge to the same standard is critical to the chain of validity. Recent work suggests examiner-cohort (i.e., the particular group of examiners) could significantly alter outcomes for some candidates. Despite this, examiner-cohort effects are rarely examined since fully nested data (i.e., no crossover between the students judged by different examiner groups) limit comparisons. In this study, the authors aim to replicate and further develop a novel method called Video-based Examiner Score Comparison and Adjustment (VESCA), so it can be used to enhance quality assurance of distributed or national OSCEs.
Method: In 2019, 6 volunteer students were filmed on 12 stations in a summative OSCE. In addition to examining live student performances, examiners from 8 separate examiner-cohorts scored the pool of video performances. Examiners scored videos specific to their station. Video scores linked otherwise fully nested data, enabling comparisons by Many Facet Rasch Modeling. Authors compared and adjusted for examiner-cohort effects. They also compared examiners' scores when videos were embedded (interspersed between live students during the OSCE) or judged later via the Internet.
Results: Having accounted for differences in students' ability, different examiner-cohort scores for the same ability of student ranged from 18.57 of 27 (68.8%) to 20.49 (75.9%), Cohen's d = 1.3. Score adjustment changed the pass/fail classification for up to 16% of students depending on the modeled cut score. Internet and embedded video scoring showed no difference in mean scores or variability. Examiners' accuracy did not deteriorate over the 3-week Internet scoring period.
Conclusions: Examiner-cohorts produced a replicable, significant influence on OSCE scores that was unaccounted for by typical assessment psychometrics. VESCA offers a promising means to enhance validity and fairness in distributed OSCEs or national exams. Internet-based scoring may enhance VESCA's feasibility.
