Dataset Information

Error rates of human reviewers during abstract screening in systematic reviews.

ABSTRACT: BACKGROUND: Automated approaches to improve the efficiency of systematic reviews are greatly needed. When testing any of these approaches, the criterion standard of comparison (gold standard) is usually human reviewers. Yet, human reviewers make errors in inclusion and exclusion of references. OBJECTIVES: To determine citation false inclusion and false exclusion rates during abstract screening by pairs of independent reviewers. These rates can help in designing, testing and implementing automated approaches. METHODS: We identified all systematic reviews conducted between 2010 and 2017 by an evidence-based practice center in the United States. Eligible reviews had to follow standard systematic review procedures with dual independent screening of abstracts and full texts, in which citation inclusion by one reviewer prompted automatic inclusion through the next level of screening. Disagreements between reviewers during full text screening were reconciled via consensus or arbitration by a third reviewer. A false inclusion or exclusion was defined as a decision made by a single reviewer that was inconsistent with the final included list of studies. RESULTS: We analyzed a total of 139,467 citations that underwent 329,332 inclusion and exclusion decisions from 86 unique reviewers. The final systematic reviews included 5.48% of the potential references identified through bibliographic database search (95% confidence interval (CI): 2.38% to 8.58%). After abstract screening, the total error rate (false inclusion and false exclusion) was 10.76% (95% CI: 7.43% to 14.09%). CONCLUSIONS: This study suggests important false inclusion and exclusion rates by human reviewers. When deciding the validity of a future automated study selection algorithm, it is important to keep in mind that the gold standard is not perfect and that achieving error rates similar to humans may be adequate and can save resources and time.
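The definition of a false inclusion or exclusion given in the abstract can be illustrated with a minimal Python sketch. The `Decision` record, the function name, and the choice of denominator (all single-reviewer decisions) are assumptions made for illustration; the abstract does not state how the published rates were normalised, so this is not the authors' exact calculation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """One reviewer's abstract-screening decision (hypothetical record layout)."""
    reviewer: str
    citation: str
    include: bool


def screening_error_rates(decisions, final_included):
    """Compare each single-reviewer decision with the final included list.

    Assumption: rates are expressed as a fraction of all decisions.
    """
    total = len(decisions)
    if total == 0:
        raise ValueError("at least one decision is required")
    # False inclusion: reviewer included a citation absent from the final list.
    false_incl = sum(1 for d in decisions
                     if d.include and d.citation not in final_included)
    # False exclusion: reviewer excluded a citation present in the final list.
    false_excl = sum(1 for d in decisions
                     if not d.include and d.citation in final_included)
    return {
        "false_inclusion_rate": false_incl / total,
        "false_exclusion_rate": false_excl / total,
        "total_error_rate": (false_incl + false_excl) / total,
    }
```

For example, passing the decisions of two reviewers on the same citations together with the set of citation identifiers in the final review returns the three rates as fractions between 0 and 1.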

SUBMITTER: Wang Z

PROVIDER: S-EPMC6959565 | biostudies-literature | 2020

REPOSITORIES: biostudies-literature


Publications

Error rates of human reviewers during abstract screening in systematic reviews.

Zhen Wang, Tarek Nayfeh, Jennifer Tetzlaff, Peter O'Blenis, Mohammad Hassan Murad

PLoS ONE, 2020-01-14, issue 1


PMID: 31935267

Similar Datasets

Project description: BACKGROUND: The production of high quality systematic reviews requires rigorous methods that are time-consuming and resource intensive. Citation screening is a key step in the systematic review process. An opportunity to improve the efficiency of systematic review production involves the use of non-expert groups and new technologies for citation screening. We performed a pilot study of citation screening by medical students using four screening methods and compared students' performance to experienced review authors. METHODS: The aims of this pilot randomised controlled trial were to provide preliminary data on the accuracy of title and abstract screening by medical students, and on the effect of screening modality on screening accuracy and efficiency. Medical students were randomly allocated to title and abstract screening using one of the four modalities and required to screen 650 citations from a single systematic review update. The four screening modalities were a reference management software program (EndNote), paper, a web-based systematic review workflow platform (ReGroup) and a mobile screening application (Screen2Go). Screening sensitivity and specificity were analysed in a complete case analysis using a chi-squared test and Kruskal-Wallis rank sum test according to screening modality and compared to a final set of included citations selected by expert review authors. RESULTS: Sensitivity of medical students' screening decisions ranged from 46.7% to 66.7%, with students using the web-based platform performing significantly better than the paper-based group. Specificity ranged from 93.2% to 97.4% with the lowest specificity seen with the web-based platform. There was no significant difference in performance between the other three modalities. CONCLUSIONS: Medical students are a feasible population to engage in citation screening. Future studies should investigate the effect of incentive systems, training and support and analytical methods on screening performance. SYSTEMATIC REVIEW REGISTRATION: Cochrane Database CD001048.

Project description: BACKGROUND: Systematic reviews are vital to the pursuit of evidence-based medicine within healthcare. Screening titles and abstracts (T&Ab) for inclusion in a systematic review is an intensive, and often collaborative, step. The use of appropriate tools is therefore important. In this study, we identified and evaluated the usability of software tools that support T&Ab screening for systematic reviews within healthcare research. METHODS: We identified software tools using three search methods: a web-based search; a search of the online "systematic review toolbox"; and screening of references in existing literature. We included tools that were accessible and available for testing at the time of the study (December 2018), do not require specific computing infrastructure and provide basic screening functionality for systematic reviews. Key properties of each software tool were identified using a feature analysis adapted for this purpose. This analysis included a weighting developed by a group of medical researchers, therefore prioritising the most relevant features. The highest scoring tools from the feature analysis were then included in a user survey, in which we further investigated the suitability of the tools for supporting T&Ab screening amongst systematic reviewers working in medical research. RESULTS: Fifteen tools met our inclusion criteria. They vary significantly in relation to cost, scope and intended user community. Six of the identified tools (Abstrackr, Colandr, Covidence, DRAGON, EPPI-Reviewer and Rayyan) scored higher than 75% in the feature analysis and were included in the user survey. Of these, Covidence and Rayyan were the most popular with the survey respondents. Their usability scored highly across a range of metrics, with all surveyed researchers (n = 6) stating that they would be likely (or very likely) to use these tools in the future. CONCLUSIONS: Based on this study, we would recommend Covidence and Rayyan to systematic reviewers looking for suitable and easy to use tools to support T&Ab screening within healthcare research. These two tools consistently demonstrated good alignment with user requirements. We acknowledge, however, the role of some of the other tools we considered in providing more specialist features that may be of great importance to many researchers.

Project description: BACKGROUND: Machine learning tools can expedite systematic review (SR) processes by semi-automating citation screening. Abstrackr semi-automates citation screening by predicting relevant records. We evaluated its performance for four screening projects. METHODS: We used a convenience sample of screening projects completed at the Alberta Research Centre for Health Evidence, Edmonton, Canada: three SRs and one descriptive analysis for which we had used SR screening methods. The projects were heterogeneous with respect to search yield (median 9328; range 5243 to 47,385 records; interquartile range (IQR) 15,688 records), topic (Antipsychotics, Bronchiolitis, Diabetes, Child Health SRs), and screening complexity. We uploaded the records to Abstrackr and screened until it made predictions about the relevance of the remaining records. Across three trials for each project, we compared the predictions to human reviewer decisions and calculated the sensitivity, specificity, precision, false negative rate, proportion missed, and workload savings. RESULTS: Abstrackr's sensitivity was > 0.75 for all projects and the mean specificity ranged from 0.69 to 0.90 with the exception of Child Health SRs, for which it was 0.19. The precision (proportion of records correctly predicted as relevant) varied by screening task (median 26.6%; range 14.8 to 64.7%; IQR 29.7%). The median false negative rate (proportion of records incorrectly predicted as irrelevant) was 12.6% (range 3.5 to 21.2%; IQR 12.3%). The workload savings were often large (median 67.2%, range 9.5 to 88.4%; IQR 23.9%). The proportion missed (proportion of records predicted as irrelevant that were included in the final report, out of the total number predicted as irrelevant) was 0.1% for all SRs and 6.4% for the descriptive analysis. This equated to 4.2% (range 0 to 12.2%; IQR 7.8%) of the records in the final reports. CONCLUSIONS: Abstrackr's reliability and the workload savings varied by screening task. Workload savings came at the expense of potentially missing relevant records. How this might affect the results and conclusions of SRs needs to be evaluated. Studies evaluating Abstrackr as the second reviewer in a pair would be of interest to determine if concerns for reliability would diminish. Further evaluations of Abstrackr's performance and usability will inform its refinement and practical utility.
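Several of the metrics named in the description above can be sketched from a 2x2 comparison of tool predictions against human reviewer decisions. The function below is an illustrative Python sketch using the conventional textbook definitions; the function name and argument names are hypothetical, and the study's own operational definitions (in particular for false negative rate, proportion missed and workload savings) may differ.

```python
def screening_metrics(tp, fp, tn, fn):
    """Conventional classification metrics from confusion-matrix counts.

    tp: predicted relevant, judged relevant by reviewers
    fp: predicted relevant, judged irrelevant by reviewers
    tn: predicted irrelevant, judged irrelevant by reviewers
    fn: predicted irrelevant, judged relevant by reviewers
    """
    def ratio(numerator, denominator):
        # Return None rather than dividing by zero when a class is empty.
        return numerator / denominator if denominator else None

    return {
        "sensitivity": ratio(tp, tp + fn),          # relevant records found
        "specificity": ratio(tn, tn + fp),          # irrelevant records rejected
        "precision": ratio(tp, tp + fp),            # predicted-relevant that are relevant
        "false_negative_rate": ratio(fn, fn + tp),  # textbook form: 1 - sensitivity
    }
```

Returning `None` for an undefined ratio is a design choice that keeps an empty class visible instead of silently reporting 0.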

Project description: BACKGROUND: Developing a comprehensive, reproducible literature search is the basis for a high-quality systematic review (SR). Librarians and information professionals, as expert searchers, can improve the quality of systematic review searches, methodology, and reporting. Likewise, journal editors and authors often seek to improve the quality of published SRs and other evidence syntheses through peer review. Health sciences librarians contribute to systematic review production but little is known about their involvement in peer reviewing SR manuscripts. METHODS: This survey aimed to assess how frequently librarians are asked to peer review systematic review manuscripts and to determine characteristics associated with those invited to review. The survey was distributed to a purposive sample through three health sciences information professional listservs. RESULTS: There were 291 complete survey responses. Results indicated that 22% (n = 63) of respondents had been asked by journal editors to peer review systematic review or meta-analysis manuscripts. Of the 78% (n = 228) of respondents who had not already been asked, 54% (n = 122) would peer review, and 41% (n = 93) might peer review. Only 4% (n = 9) would not review a manuscript. Respondents had peer reviewed manuscripts for 38 unique journals and believed they were asked because of their professional expertise. Of respondents who had declined to peer review (32%, n = 20), the most common explanation was "not enough time" (60%, n = 12) followed by "lack of expertise" (50%, n = 10). The vast majority of respondents (95%, n = 40) had "rejected or recommended a revision of a manuscript" after peer review. They based their decision on the "search methodology" (57%, n = 36), "search write-up" (46%, n = 29), or "entire article" (54%, n = 34). Those who selected "other" (37%, n = 23) listed a variety of reasons for rejection, including problems or errors in the PRISMA flow diagram; tables of included, excluded, and ongoing studies; data extraction; reporting; and pooling methods. CONCLUSIONS: Despite being experts in conducting literature searches and supporting SR teams through the review process, few librarians have been asked to review SR manuscripts, or even just search strategies; yet many are willing to provide this service. Editors should involve experienced librarians with peer review and we suggest some strategies to consider.

Project description: BACKGROUND: We evaluated the benefits and risks of using the Abstrackr machine learning (ML) tool to semi-automate title-abstract screening and explored whether Abstrackr's predictions varied by review or study-level characteristics. METHODS: For a convenience sample of 16 reviews for which adequate data were available to address our objectives (11 systematic reviews and 5 rapid reviews), we screened a 200-record training set in Abstrackr and downloaded the relevance (relevant or irrelevant) of the remaining records, as predicted by the tool. We retrospectively simulated the liberal-accelerated screening approach. We estimated the time savings and proportion missed compared with dual independent screening. For reviews with pairwise meta-analyses, we evaluated changes to the pooled effects after removing the missed studies. We explored whether the tool's predictions varied by review and study-level characteristics. RESULTS: Using the ML-assisted liberal-accelerated approach, we wrongly excluded 0 to 3 (0 to 14%) records that were included in the final reports, but saved a median (IQR) 26 (9, 42) h of screening time. One missed study was included in eight pairwise meta-analyses in one systematic review. The pooled effect for just one of those meta-analyses changed considerably (from MD (95% CI) -1.53 (-2.92, -0.15) to -1.17 (-2.70, 0.36)). Of 802 records in the final reports, 87% were correctly predicted as relevant. The correctness of the predictions did not differ by review (systematic or rapid, P = 0.37) or intervention type (simple or complex, P = 0.47). The predictions were more often correct in reviews with multiple (89%) vs. single (83%) research questions (P = 0.01), or that included only trials (95%) vs. multiple designs (86%) (P = 0.003). At the study level, trials (91%), mixed methods (100%), and qualitative (93%) studies were more often correctly predicted as relevant compared with observational studies (79%) or reviews (83%) (P = 0.0006). Studies at high or unclear (88%) vs. low risk of bias (80%) (P = 0.039), and those published more recently (mean (SD) 2008 (7) vs. 2006 (10), P = 0.02) were more often correctly predicted as relevant. CONCLUSIONS: Our screening approach saved time and may be suitable in conditions where the limited risk of missing relevant records is acceptable. Several of our findings are paradoxical and require further study to fully understand the tasks to which ML-assisted screening is best suited. The findings should be interpreted in light of the fact that the protocol was prepared for the funder, but not published a priori. Because we used a convenience sample, the findings may be prone to selection bias. The results may not be generalizable to other samples of reviews, ML tools, or screening approaches. The small number of missed studies across reviews with pairwise meta-analyses hindered strong conclusions about the effect of missed studies on the results and conclusions of systematic reviews.
