Dataset Information

The standard error of measurement is a more appropriate measure of quality for postgraduate medical assessments than is reliability: an analysis of MRCP(UK) examinations.

ABSTRACT:

Background

Cronbach's alpha is widely used as the preferred index of reliability for medical postgraduate examinations. A value of 0.8-0.9 is seen by providers and regulators alike as an adequate demonstration of acceptable reliability for any assessment. Of the other statistical parameters, Standard Error of Measurement (SEM) is mainly seen as useful only in determining the accuracy of a pass mark. However, the alpha coefficient depends both on SEM and on the ability range (standard deviation, SD) of candidates taking an exam. This study investigated the extent to which the necessarily narrower ability range of candidates taking the second of the three-part MRCP(UK) diploma examinations biases assessment of reliability and SEM.
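
The dependence of reliability on both SEM and SD can be written out using the standard classical test theory identity. This is a textbook relation rather than a formula quoted from the record; r stands for the reliability coefficient (for which alpha is an estimate) and SD for the standard deviation of observed scores.

```latex
\mathrm{SEM} = \mathrm{SD}\,\sqrt{1 - r}
\qquad\Longleftrightarrow\qquad
r = 1 - \frac{\mathrm{SEM}^{2}}{\mathrm{SD}^{2}}
```

As an illustration with made-up numbers: holding SEM at 5 marks, an observed-score variance of 125 (SD of about 11.2) gives r = 1 - 25/125 = 0.80, whereas an SD of 8 gives r = 1 - 25/64, or about 0.61. The measurement precision is identical in both cases; only the spread of candidates has changed.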

Methods

a) The interrelationships of standard deviation (SD), SEM and reliability were investigated in a Monte Carlo simulation of 10,000 candidates taking a postgraduate examination. b) Reliability and SEM were studied in the MRCP(UK) Part 1 and Part 2 Written Examinations from 2002 to 2008. c) Reliability and SEM were studied in eight Specialty Certificate Examinations introduced in 2008-9.

Results

The Monte Carlo simulation showed, as expected, that restricting the range of an assessment to only those who had already passed it dramatically reduced the reliability but did not affect the SEM of a simulated assessment. The analysis of the MRCP(UK) Part 1 and Part 2 written examinations showed that the Part 2 written examination had a lower reliability than the Part 1 examination but, despite that lower reliability, also had a smaller SEM (indicating a more accurate assessment). The Specialty Certificate Examinations had small numbers of candidates (N) and, as a result, wide variability in their reliabilities, but their SEMs were comparable with those of MRCP(UK) Part 2.
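
The range-restriction effect reported in the simulation can be reproduced with a small sketch. This is not the authors' simulation: the function names, the score scale (true-score SD of 10, SEM of 5, pass mark of 0) and the use of three parallel forms are all assumptions chosen for illustration. Candidates are selected on form A, and reliability and SEM are then estimated from forms B and C, first for everyone and then only for those who passed form A.

```python
import math
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def summarise(form_b, form_c):
    """Estimate (reliability, SEM) from two parallel forms."""
    # The correlation between parallel forms estimates reliability.
    reliability = pearson(form_b, form_c)
    # The difference between parallel forms contains only error,
    # with variance 2 * SEM^2, so SEM = SD(difference) / sqrt(2).
    diffs = [b - c for b, c in zip(form_b, form_c)]
    sem = statistics.stdev(diffs) / math.sqrt(2)
    return reliability, sem


def simulate(n=10_000, sd_true=10.0, sem=5.0, pass_mark=0.0, seed=1):
    """Hypothetical simulation: all parameter values are illustrative."""
    rng = random.Random(seed)
    true = [rng.gauss(0.0, sd_true) for _ in range(n)]
    # Three parallel forms: same true score, independent errors.
    form_a = [t + rng.gauss(0.0, sem) for t in true]
    form_b = [t + rng.gauss(0.0, sem) for t in true]
    form_c = [t + rng.gauss(0.0, sem) for t in true]

    everyone = summarise(form_b, form_c)

    # Restrict the range to candidates who passed form A.
    passed = [i for i in range(n) if form_a[i] >= pass_mark]
    passers = summarise([form_b[i] for i in passed],
                        [form_c[i] for i in passed])
    return everyone, passers


if __name__ == "__main__":
    (rel_all, sem_all), (rel_pass, sem_pass) = simulate()
    print(f"all candidates: reliability={rel_all:.2f}, SEM={sem_all:.2f}")
    print(f"passers only:   reliability={rel_pass:.2f}, SEM={sem_pass:.2f}")
```

Under these assumed parameters the whole-cohort reliability should sit close to its theoretical value of 100/125 = 0.8, the reliability among passers should be noticeably lower, and both SEM estimates should stay close to 5, because selecting on form A narrows the true-score spread without changing the error in forms B and C.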

Conclusions

An emphasis upon judging the quality of assessments in terms of reliability alone can produce a paradoxical and distorted picture, particularly where a narrower range of candidate ability is an inevitable consequence of being able to take a second part examination only after passing the first part examination. Reliability also shows problems when numbers of candidates in examinations are low and sampling error affects the range of candidate ability. SEM is not subject to such problems; it is therefore a better measure of the quality of an assessment and is recommended for routine use.

SUBMITTER: Tighe J

PROVIDER: S-EPMC2893515 | biostudies-literature | 2010 Jun

REPOSITORIES: biostudies-literature

Publications

The standard error of measurement is a more appropriate measure of quality for postgraduate medical assessments than is reliability: an analysis of MRCP(UK) examinations.

Tighe J, McManus IC, Dewhurst NG, Chis L, Mucklow J

BMC Medical Education, 2010-06-02

PMID: 20525220

Similar Datasets

Project description:
Background: Failure rates in postgraduate examinations are often high and many candidates therefore retake examinations several or even many times. Little, however, is known about how candidates perform across those multiple attempts. A key theoretical question to be resolved is whether candidates pass at a resit because they have got better, having acquired more knowledge or skills, or whether they have got lucky, chance helping them to get over the pass mark. In the UK, the issue of resits has become of particular interest since the General Medical Council issued a consultation and is considering limiting the number of attempts candidates may make at examinations.
Methods: Since 1999 the examination for Membership of the Royal Colleges of Physicians of the United Kingdom (MRCP(UK)) has imposed no limit on the number of attempts candidates can make at its Part 1, Part 2 or PACES (Clinical) examination. The present study examined the performance of candidates on the examinations from 2002/2003 to 2010, during which time the examination structure has been stable. Data were available for 70,856 attempts at Part 1 by 39,335 candidates, 37,654 attempts at Part 2 by 23,637 candidates and 40,303 attempts at PACES by 21,270 candidates, with the maximum number of attempts being 26, 21 and 14, respectively. The results were analyzed using multilevel modelling, fitting negative exponential growth curves to individual candidate performance.
Results: The number of candidates taking the assessment falls exponentially at each attempt. Performance improves across attempts, with evidence in the Part 1 examination that candidates are still improving up to the tenth attempt, with a similar improvement up to the fourth attempt in Part 2 and the sixth attempt at PACES. Random effects modelling shows that candidates begin at a starting level, with performance increasing by a smaller amount at each attempt, with evidence of a maximum, asymptotic level for candidates, and candidates showing variation in starting level, rate of improvement and maximum level. Modelling longitudinal performance across the three diets (sittings) shows that the starting level at Part 1 predicts starting level at both Part 2 and PACES, and the rate of improvement at Part 1 also predicts the starting level at Part 2 and PACES.
Conclusion: Candidates continue to show evidence of true improvement in performance up to at least the tenth attempt at MRCP(UK) Part 1, although there are individual differences in the starting level, the rate of improvement and the maximum level that can be achieved. Such findings provide little support for arguments that candidates should only be allowed a fixed number of attempts at an examination. However, unlimited numbers of attempts are also difficult to justify because of the inevitable and ever increasing role that luck must play with increasing numbers of resits, so that the issue of multiple attempts might be better addressed by tackling the difficult question of how a pass mark should increase with each attempt at an exam.

Project description:
Background: This study aimed to identify fit-for-purpose clinical outcome assessments (COAs) to evaluate physical function, as well as social and emotional well-being in clinical trials enrolling a pediatric population with achondroplasia. Qualitative interviews lasting up to 90 min were conducted in the US with children/adolescents with achondroplasia and/or their caregivers. Interviews utilized concept elicitation methodology to explore experiences and priorities for treatment outcomes. Cognitive debriefing methodology explored relevance and understanding of selected COAs.
Results: Interviews (N = 36) were conducted with caregivers of children age 0-2 years (n = 8) and 3-7 years (n = 7) and child/caregiver dyads with children age 8-11 years (n = 15) and 12-17 years (n = 6). Children/caregivers identified pain, short stature, impacts on physical functioning, and impacts on well-being (e.g. negative attention/comments) as key bothersome aspects of achondroplasia. Caregivers considered an increase in height (n = 9/14, 64%) and an improvement in limb proportion (n = 11/14, 71%) as successful treatment outcomes. The Childhood Health Assessment Questionnaire (CHAQ) and Quality of Life in Short Stature Youth (QoLISSY-Brief) were cognitively debriefed. CHAQ items evaluating activities, reaching, and hygiene were most relevant. QoLISSY-Brief items evaluating reaching, height bother, being treated differently, and height preventing doing things others could were most relevant. The CHAQ and QoLISSY-Brief instructions, item wording, response scales/options and recall period were well understood by caregivers and adolescents age 12-17. Some children aged 8-11 had difficulty reading, understanding, or required caregiver input. Feedback informed minor amendments to the CHAQ and the addition of a 7-day recall period to the QoLISSY-Brief. These amendments were subsequently reviewed and confirmed in N = 12 interviews with caregivers of children age 0-11 (n = 9) and adolescents age 12-17 (n = 3).
Conclusions: Achondroplasia impacts physical functioning and emotional/social well-being. An increase in height and improvement in limb proportion are considered to be important treatment outcomes, but children/adolescents and their caregivers expect that a successful treatment should also improve important functional outcomes such as reach. The CHAQ (adapted for achondroplasia) and QoLISSY-Brief are relevant and appropriate measures of physical function and emotional/social well-being for pediatric achondroplasia trials; patient-report is recommended for age 12-17 years and caregiver-report is recommended for age 0-11 years.

Project description:
Background: Neck pain is one of the leading causes of years lived with disability, and approximately half of people with neck pain experience recurrent episodes. Deficits in the sensorimotor system can persist even after pain relief, which may contribute to the chronic course of neck pain in some patients. Evaluation of sensorimotor capacities in patients with neck pain is therefore important. No consensus exists on how sensorimotor capacities of the neck should be assessed in physiotherapy. The aims of this systematic review are: (a) to provide an overview of tests used in physiotherapy for assessment of sensorimotor capacities in patients with neck pain; and (b) to provide information about reliability and measurement error of these tests, to enable physiotherapists to select appropriate tests.
Methods: Medline, CINAHL, Embase and PsycINFO databases were searched for studies reporting data on the reliability and/or measurement error of sensorimotor tests in patients with neck pain. The results for reliability and measurement error were compared against the criteria for good measurement properties. The quality of evidence was assessed according to the modified GRADE method proposed by the COSMIN group.
Results: A total of 206 tests for assessment of sensorimotor capacities of the neck were identified and categorized into 18 groups of tests. The included tests did not cover all aspects of the sensorimotor system; tests for the sensory and motor components were identified, but not for the central integration component. Furthermore, no data were found on reliability or measurement error for some tests that are used in practice, such as movement control tests, which apply to the motor component. Approximately half of the tests showed good reliability, and 12 were rated as having good (+) reliability. However, tests that evaluated complex movements, which are more difficult to standardize, were less reliable. Measurement error could not be evaluated because the minimal clinically important change was not available for all tests.
Conclusion: Overall, the quality of evidence is not yet high enough to enable clear recommendations about which tests to use to assess the sensorimotor capacities of the neck.

Project description:
Background: Repeat power ability (RPA) assessments are a valuable evaluation of an athlete's ability to repeatedly perform high intensity movements. The most reliable and valid loaded jump RPA assessment and method to quantify RPA have yet to be determined. This study aimed to compare the reliability and validity of an RPA assessment performed with loaded squat jumps (SJ) or countermovement jumps (CMJ) using force-time derived mean and peak power output.
Materials and methods: RPA was quantified using calculations of average power output, a fatigue index and a percent decrement score for all repetitions and with the first and last repetitions removed. Validity was established by comparing to a 30 second Bosco repeated jump test (30BJT). Eleven well-trained male field hockey players performed one set of 20 repetitions of both SJs (20SJ) and CMJs (20CMJ) on separate occasions using a 30% one repetition maximum half squat load. These assessments were repeated 7 days apart to establish inter-test reliability. On a separate occasion, each participant performed the 30BJT.
Results: The reliability of average peak power for 20SJ and 20CMJ was acceptable (CV < 5%; ICC > 0.9), while average mean power reliability for 20CMJ (CV < 5%; ICC > 0.9) was better than 20SJ (CV > 5%; ICC > 0.8). Percent decrement of 20CMJ peak power, with the first and final jump removed from the percent decrement calculation (PD%CMJpeak18), was the most reliable measurement of power output decline (CV < 5%; ICC > 0.8). Average mean and peak power for both RPA protocols had moderate to strong correlations with 30BJT average mean and peak power (r = 0.5-0.8; p < 0.05-0.01). No RPA measurements of power decline were significantly related to BJT measurements of power decline.
Conclusions: These findings indicate that PD%CMJpeak18 is the most reliable measure of RPA power decline. The lack of relationship between power decline in the loaded RPA and the 30BJT assessment suggests that each assessment may be measuring a different physical quality. These results provide sport science practitioners with additional methods to assess RPA and provide useful information on the reliability and validity of these outcome measures. Additional research needs to be performed to examine the reliability and validity of the novel RPA assessments in other athletic populations and to determine the sensitivity of these measurements to training and injury.

Project description:
Objective: Large language models (LLMs) such as ChatGPT are being developed for use in research, medical education and clinical decision systems. However, as their usage increases, LLMs face ongoing regulatory concerns. This study aims to analyse ChatGPT's performance on a postgraduate examination to identify areas of strength and weakness, which may provide further insight into their role in healthcare.
Design: We evaluated the performance of ChatGPT 4 (24 May 2023 version) on official MRCP (Membership of the Royal College of Physicians) parts 1 and 2 written examination practice questions. Statistical analysis was performed using Python. Spearman rank correlation assessed the relationship between the probability of correctly answering a question and two variables: question difficulty and question length. Incorrectly answered questions were analysed further using a clinical reasoning framework to assess the errors made.
Setting: Online using ChatGPT web interface.
Primary and secondary outcome measures: Primary outcome was the score (percentage questions correct) in the MRCP postgraduate written examinations. Secondary outcomes were qualitative categorisation of errors using a clinical decision-making framework.
Results: ChatGPT achieved accuracy rates of 86.3% (part 1) and 70.3% (part 2). Weak but significant correlations were found between ChatGPT's accuracy and both just-passing rates in part 2 (r=0.34, p=0.0001) and question length in part 1 (r=-0.19, p=0.008). Eight types of error were identified, with the most frequent being factual errors, context errors and omission errors.
Conclusion: ChatGPT performance greatly exceeded the passing mark for both exams. Multiple choice examinations provide a benchmark for LLM performance which is comparable to human demonstrations of knowledge, while also highlighting the errors LLMs make. Understanding the reasons behind ChatGPT's errors allows us to develop strategies to prevent them in medical devices that incorporate LLM technology.
