Dataset Information

Robust Neural Automated Essay Scoring Using Item Response Theory

ABSTRACT: Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to human grading. Conventional AES methods typically rely on manually tuned features, which are laborious to develop effectively. To obviate the need for feature engineering, many deep neural network (DNN)-based AES models have been proposed and have achieved state-of-the-art accuracy. DNN-AES models require training on a large dataset of graded essays. However, the grades assigned in such datasets are known to be strongly affected by rater bias when each essay is graded by only a few raters drawn from a larger rater set. The performance of DNN models drops rapidly when such biased data are used for model training. In the fields of educational and psychological measurement, item response theory (IRT) models that can estimate essay scores while considering the effects of rater characteristics have recently been proposed. This study therefore proposes a new DNN-AES framework that integrates IRT models to deal with rater bias within training data. To our knowledge, this is a first attempt at addressing rating bias effects in training data, which is a crucial but overlooked problem.

SUBMITTER: Bittencourt I

PROVIDER: S-EPMC7334153 | biostudies-literature | 2020 Jun

REPOSITORIES: biostudies-literature


Similar Datasets

Project description: Allostatic load is commonly operationalized using a sum-score of high-risk biomarkers. However, this method implies that biomarkers contribute equally to allostatic load, as each is given equal weight. Our goal in this methodological paper is to evaluate this, and complementarily, to identify biomarkers that are most informative and least informative for developing an allostatic load index. Item response theory models provide an alternate approach to calculating the allostatic load score, by treating individual biomarkers (i.e., “items”) as indicators of a latent allostatic load construct. Item response theory scores account for the data-driven discriminating power of each biomarker, and an individual’s pattern of biomarker responses. To demonstrate feasibility of this approach, we used data from the 2015–2016 National Health and Nutrition Examination Survey (NHANES; N = 3751), with twelve allostatic load biomarkers representing immune response, metabolic function and cardiovascular health. Item response theory models revealed that body-mass-index and C-reactive protein were the most informative biomarkers for allostatic load. Both higher allostatic load sum-score and allostatic load item response theory score were associated with lower socio-economic status (p = 0.008; p<0.001, respectively). Further, both formulations of allostatic load were positively associated with a nine-item depression screener (p<0.001 for both), but only the item response theory score was also positively associated with the impact of depressive symptoms on daily life (p = 0.045). Item response theory scores may be more finely tuned to tease out effects, compared to sum-scores, and also provide more flexibility when there are missing biomarker measurements. Supplemental R code for our approach is included.
Highlights
• Methodological paper to introduce item response theory for calculating allostatic load.
• Biomarker data from NHANES 2015–2016 representative of United States adults.
• Body-mass-index and C-reactive protein most informative for allostatic load.
• Item response theory captures more variability in allostatic load compared to sum-scores.
• Future work - item response theory can standardize allostatic load across datasets.
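The contrast between an equally weighted sum-score and an item response theory score can be illustrated with a minimal Python sketch. This is not the authors' supplemental R code: it assumes a two-parameter logistic (2PL) model with a standard normal prior and expected a posteriori (EAP) scoring, and the discrimination and difficulty values in the example are hypothetical, chosen only to show that two people with the same sum-score can receive different IRT scores. The function names `sum_score` and `irt_score` are likewise illustrative.

```python
import math

def sum_score(responses):
    """Equally weighted count of high-risk biomarkers (1 = high risk); None marks a missing value."""
    return sum(r for r in responses if r is not None)

def irt_score(responses, discriminations, difficulties, grid_size=81):
    """EAP estimate of the latent trait under an assumed 2PL model with a N(0, 1) prior.

    Missing biomarkers (None) are skipped, so they simply contribute no information.
    """
    grid = [-4 + 8 * i / (grid_size - 1) for i in range(grid_size)]
    numerator = 0.0
    denominator = 0.0
    for theta in grid:
        weight = math.exp(-0.5 * theta * theta)  # unnormalised standard normal prior
        for r, a, b in zip(responses, discriminations, difficulties):
            if r is None:
                continue
            p = 1.0 / (1.0 + math.exp(-a * (theta - b)))  # P(high risk | theta)
            weight *= p if r == 1 else 1.0 - p
        numerator += theta * weight
        denominator += weight
    return numerator / denominator

# Hypothetical item parameters for three biomarkers (not estimates from NHANES).
a_params = [2.0, 0.5, 1.0]
b_params = [0.0, 0.0, 0.0]

person_a = [1, 0, 0]  # high risk only on the most discriminating biomarker
person_b = [0, 1, 0]  # high risk only on the least discriminating biomarker

print(sum_score(person_a), sum_score(person_b))
print(irt_score(person_a, a_params, b_params), irt_score(person_b, a_params, b_params))
```

Under these assumed parameters both people have the same sum-score, but the person flagged on the more discriminating biomarker receives the higher latent score, which is the sense in which IRT scoring weights biomarkers by how informative they are.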

Project description: Background: Subjective well-being refers to the extent to which a person believes or feels that her life is going well. It is considered one of the best available proxies for a broader, more canonical form of well-being. For over 30 years, one important distinction in the conceptualization of subjective well-being has been the contrast between more affective evaluations of biological emotional reactions and more cognitive evaluations of one's life in relation to a psychologically self-imposed ideal. More recently, researchers have suggested the addition of harmony in life, comprising behavioral evaluations of how one is doing in a social context. Since measures used to assess subjective well-being are self-reports, often validated only using Classical Test Theory, our aim was to focus on the psychometric properties of the measures using Item Response Theory. Method: A total of 1000 participants responded to the Positive Affect Negative Affect Schedule. At random, half of the participants responded to the Satisfaction with Life Scale and the other half to the Harmony in Life Scale. First, we evaluated and found sufficient evidence of unidimensionality for each scale. Next, we fitted graded response models to validate the psychometric properties of the subjective well-being scales. Results: All scales showed varied frequency item distribution, high discrimination values (Alphas), and different difficulty parameters (Beta) for each response option. For example, we identified items that respondents found difficult to endorse at the highest/lowest point of the scales (e.g., "Proud" for positive affect; item 5, "If I could live my life over, I would change almost nothing," for life satisfaction; and item 3, "I am in harmony," for harmony in life). In addition, all scales could cover a good portion of the range of subjective well-being (Theta): -2.50 to 2.30 for positive affect, -1.00 to 3.50 for negative affect, -2.40 to 2.50 for life satisfaction, and -2.40 to 2.50 for harmony in life. Importantly, for all scales, reliability was weak for respondents with extreme latent scores of subjective well-being. Conclusion: The affective component, especially low levels of negative affect, was less accurately measured, while both the cognitive and social components were covered to an equal degree. There was less reliability for respondents with extreme latent scores of subjective well-being. Thus, to improve reliability at the level of the scale, at the item level and at the level of the response scale for each item, we point out specific items that need to be modified or added. Moreover, the data presented here can be used as normative data for each of the subjective well-being constructs.
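To make the roles of the discrimination (Alpha), difficulty (Beta) and latent trait (Theta) parameters concrete, the following minimal Python sketch computes response-category probabilities under the standard logistic form of Samejima's graded response model. It is an illustration only: the function name `grm_category_probs` and the parameter values in the example are hypothetical and are not taken from the study.

```python
import math

def grm_category_probs(theta, alpha, betas):
    """Category probabilities for one item under the graded response model.

    theta : latent trait level of the respondent
    alpha : item discrimination
    betas : increasing threshold (difficulty) parameters, one fewer than the
            number of ordered response categories
    """
    # Cumulative probabilities P(response >= category k), padded with 1 and 0.
    cumulative = [1.0]
    cumulative += [1.0 / (1.0 + math.exp(-alpha * (theta - b))) for b in betas]
    cumulative.append(0.0)
    # Each category probability is the difference of adjacent cumulative curves.
    return [cumulative[k] - cumulative[k + 1] for k in range(len(betas) + 1)]

# Hypothetical five-category item (values chosen for illustration only).
alpha = 1.5
betas = [-1.0, 0.0, 1.0, 2.0]

for theta in (-2.0, 0.0, 2.0):
    probs = grm_category_probs(theta, alpha, betas)
    print(theta, [round(p, 3) for p in probs])
```

An item whose highest threshold lies far above the typical range of Theta is one that respondents rarely endorse at the top of the response scale, which is the pattern the abstract describes for items that are difficult to endorse at the highest point.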

