Dataset Information

Assessing the accuracy of machine-assisted abstract screening with DistillerAI: a user study.

ABSTRACT: BACKGROUND: Web applications that employ natural language processing technologies to support systematic reviewers during abstract screening have become more common. The goal of our project was to conduct a case study to explore a screening approach that temporarily replaces a human screener with a semi-automated screening tool. METHODS: We evaluated the accuracy of the approach using DistillerAI as a semi-automated screening tool. A published comparative effectiveness review served as the reference standard. Five teams of professional systematic reviewers screened the same 2472 abstracts in parallel. Each team trained DistillerAI with 300 randomly selected abstracts that the team screened dually. For all remaining abstracts, DistillerAI replaced one human screener and provided predictions about the relevance of records. A single reviewer also screened all remaining abstracts. A second human screener resolved conflicts between the single reviewer and DistillerAI. We compared the decisions of the machine-assisted approach, single-reviewer screening, and screening with DistillerAI alone against the reference standard. RESULTS: The combined sensitivity of the machine-assisted screening approach across the five screening teams was 78% (95% confidence interval [CI], 66 to 90%), and the combined specificity was 95% (95% CI, 92 to 97%). By comparison, the sensitivity of single-reviewer screening was similar (78%; 95% CI, 66 to 89%); however, the sensitivity of DistillerAI alone was substantially worse (14%; 95% CI, 0 to 31%) than that of the machine-assisted screening approach. Specificities for single-reviewer screening and DistillerAI were 94% (95% CI, 91 to 97%) and 98% (95% CI, 97 to 100%), respectively. Machine-assisted screening and single-reviewer screening had similar areas under the curve (0.87 and 0.86, respectively); by contrast, the area under the curve for DistillerAI alone was just slightly better than chance (0.56).
The interrater agreement between human screeners and DistillerAI, measured with a prevalence-adjusted kappa, was 0.85 (95% CI, 0.84 to 0.86). CONCLUSIONS: The accuracy of DistillerAI is not yet adequate to replace a human screener temporarily during abstract screening for systematic reviews. Rapid reviews, which do not require detecting the totality of the relevant evidence, may find semi-automation tools to have greater utility than traditional systematic reviews.
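
The reported areas under the curve are consistent with treating each screening approach as a single operating point, in which case the area under the ROC curve reduces to the mean of sensitivity and specificity. The abstract does not state how the authors computed the area under the curve, so this is an assumption, and the function names below are illustrative only. The sketch also assumes that "prevalence-adjusted kappa" refers to the prevalence- and bias-adjusted kappa for two raters and two categories, which equals twice the observed agreement minus one.

```python
def single_point_auc(sensitivity, specificity):
    """Area under the ROC polyline through (0, 0), (1 - specificity, sensitivity), (1, 1)."""
    return (sensitivity + specificity) / 2


def pabak(observed_agreement):
    """Prevalence- and bias-adjusted kappa for two raters and two categories (assumed definition)."""
    return 2 * observed_agreement - 1


# Sensitivity and specificity as reported in the abstract.
reported = {
    "machine-assisted": (0.78, 0.95),
    "single reviewer": (0.78, 0.94),
    "DistillerAI alone": (0.14, 0.98),
}

aucs = {name: single_point_auc(se, sp) for name, (se, sp) in reported.items()}
for name, auc in aucs.items():
    print(f"{name}: {auc:.3f}")

# Under the assumed definition, a kappa of 0.85 implies this observed agreement.
implied_agreement = (0.85 + 1) / 2
print(f"implied observed agreement: {implied_agreement:.3f}")
```

Under these assumptions the computed values agree with the reported 0.87, 0.86, and 0.56 to within rounding, which illustrates why high specificity cannot compensate for a sensitivity of 14%.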

SUBMITTER: Gartlehner G

PROVIDER: S-EPMC6857277 | biostudies-literature | 2019 Nov

REPOSITORIES: biostudies-literature


Publications

Assessing the accuracy of machine-assisted abstract screening with DistillerAI: a user study.

Gartlehner G, Wagner G, Lux L, Affengruber L, Dobrescu A, Kaminski-Hartenthaler A, Viswanathan M

Systematic Reviews, 2019 Nov 15, issue 1


PMID: 31727159

Similar Datasets

Project description: BACKGROUND: Machine learning tools can expedite systematic review (SR) processes by semi-automating citation screening. Abstrackr semi-automates citation screening by predicting relevant records. We evaluated its performance for four screening projects. METHODS: We used a convenience sample of screening projects completed at the Alberta Research Centre for Health Evidence, Edmonton, Canada: three SRs and one descriptive analysis for which we had used SR screening methods. The projects were heterogeneous with respect to search yield (median 9328; range 5243 to 47,385 records; interquartile range (IQR) 15,688 records), topic (Antipsychotics, Bronchiolitis, Diabetes, Child Health SRs), and screening complexity. We uploaded the records to Abstrackr and screened until it made predictions about the relevance of the remaining records. Across three trials for each project, we compared the predictions to human reviewer decisions and calculated the sensitivity, specificity, precision, false negative rate, proportion missed, and workload savings. RESULTS: Abstrackr's sensitivity was > 0.75 for all projects and the mean specificity ranged from 0.69 to 0.90 with the exception of Child Health SRs, for which it was 0.19. The precision (proportion of records correctly predicted as relevant) varied by screening task (median 26.6%; range 14.8 to 64.7%; IQR 29.7%). The median false negative rate (proportion of records incorrectly predicted as irrelevant) was 12.6% (range 3.5 to 21.2%; IQR 12.3%). The workload savings were often large (median 67.2%, range 9.5 to 88.4%; IQR 23.9%). The proportion missed (proportion of records predicted as irrelevant that were included in the final report, out of the total number predicted as irrelevant) was 0.1% for all SRs and 6.4% for the descriptive analysis. This equated to 4.2% (range 0 to 12.2%; IQR 7.8%) of the records in the final reports. CONCLUSIONS: Abstrackr's reliability and the workload savings varied by screening task.
Workload savings came at the expense of potentially missing relevant records. How this might affect the results and conclusions of SRs needs to be evaluated. Studies evaluating Abstrackr as the second reviewer in a pair would be of interest to determine if concerns for reliability would diminish. Further evaluations of Abstrackr's performance and usability will inform its refinement and practical utility.
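
The metrics named in this description can all be derived from a two-by-two table of tool predictions against human decisions. The sketch below is illustrative: the counts are hypothetical, the false negative rate is assumed to be the share of truly relevant records predicted as irrelevant, and workload savings is assumed to be the share of all records predicted as irrelevant, since the description does not define it. The proportion missed is omitted because it needs the final-report inclusions as a separate input.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int  # predicted relevant, judged relevant by human reviewers
    fp: int  # predicted relevant, judged irrelevant by human reviewers
    tn: int  # predicted irrelevant, judged irrelevant by human reviewers
    fn: int  # predicted irrelevant, judged relevant by human reviewers


def screening_metrics(c):
    """Return screening accuracy metrics under the assumed definitions."""
    total = c.tp + c.fp + c.tn + c.fn
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "precision": c.tp / (c.tp + c.fp),
        "false_negative_rate": c.fn / (c.tp + c.fn),   # assumed definition
        "workload_savings": (c.tn + c.fn) / total,     # assumed definition
    }


# Hypothetical counts for illustration only; not taken from the study.
example = Counts(tp=90, fp=210, tn=690, fn=10)
metrics = screening_metrics(example)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With counts like these, a tool can show high sensitivity and large workload savings while precision stays low, which mirrors the trade-off the description reports.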

Project description: Background: Systematic reviews often require substantial resources, partially due to the large number of records identified during searching. Although artificial intelligence may not be ready to fully replace human reviewers, it may accelerate and reduce the screening burden. Using DistillerSR (May 2020 release), we evaluated the performance of the prioritization simulation tool to determine the reduction in screening burden and time savings. Methods: Using a true recall @ 95%, response sets from 10 completed systematic reviews were used to evaluate: (i) the reduction of screening burden; (ii) the accuracy of the prioritization algorithm; and (iii) the hours saved when a modified screening approach was implemented. To account for variation in the simulations, and to introduce randomness (through shuffling the references), 10 simulations were run for each review. Means, standard deviations, medians and interquartile ranges (IQR) are presented. Results: Among the 10 systematic reviews, using true recall @ 95% there was a median reduction in screening burden of 47.1% (IQR: 37.5 to 58.0%). A median of 41.2% (IQR: 33.4 to 46.9%) of the excluded records needed to be screened to achieve true recall @ 95%. The median title/abstract screening hours saved using a modified screening approach at a true recall @ 95% was 29.8 h (IQR: 28.1 to 74.7 h). This was increased to a median of 36 h (IQR: 32.2 to 79.7 h) when considering the time saved not retrieving and screening full texts of the remaining 5% of records not yet identified as included at title/abstract. Among the 100 simulations (10 simulations per review), none of these 5% of records were a final included study in the systematic review. The reduction in screening burden to achieve true recall @ 95% compared to @ 100% resulted in a reduced screening burden median of 40.6% (IQR: 38.3 to 54.2%). Conclusions: The prioritization tool in DistillerSR can reduce screening burden.
A modified or stop screening approach once a true recall @ 95% is achieved appears to be a valid method for rapid reviews, and perhaps systematic reviews. This needs to be further evaluated in prospective reviews using the estimated recall.
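
One way to read "reduction in screening burden at a true recall @ 95%" is as the share of records left unscreened once a prioritized screening order has surfaced 95% of the records that were ultimately included. The sketch below illustrates that reading only; it is not DistillerSR's simulation tool, and the ranking and function names are hypothetical.

```python
import math


def records_to_reach_recall(ranked_labels, target_recall=0.95):
    """Number of records screened, in ranked order, before the target recall is reached.

    ranked_labels holds 1 for an included record and 0 for an excluded one.
    """
    total_includes = sum(ranked_labels)
    if total_includes == 0:
        raise ValueError("ranking contains no included records")
    needed = math.ceil(target_recall * total_includes)
    found = 0
    for position, label in enumerate(ranked_labels, start=1):
        found += label
        if found >= needed:
            return position
    return len(ranked_labels)


def burden_reduction(ranked_labels, target_recall=0.95):
    """Share of records that would not need screening at the target recall."""
    screened = records_to_reach_recall(ranked_labels, target_recall)
    return 1 - screened / len(ranked_labels)


# Hypothetical prioritized ranking: 100 records, 20 of them included.
ranking = [1] * 15 + [0] * 5 + [1] * 4 + [0] * 55 + [1] + [0] * 20
screened = records_to_reach_recall(ranking)
print(f"records screened: {screened} of {len(ranking)}")
print(f"burden reduction: {burden_reduction(ranking):.2f}")
```

In a retrospective simulation the total number of included records is known in advance, which is what makes the recall "true"; a prospective review would have to estimate it, as the description notes.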

Project description: (1) Background: The objective of this review was to synthesize available data on the use of machine learning to evaluate its accuracy (as determined by pooled sensitivity and specificity) in detecting keratoconus (KC), and to measure the reporting completeness of machine learning models in KC based on the TRIPOD (transparent reporting of multivariable prediction models for individual prognosis or diagnosis) statement. (2) Methods: Two independent reviewers searched the electronic databases for all potential articles on machine learning and KC published prior to 2021. The TRIPOD 29-item checklist was used to evaluate the adherence to reporting guidelines of the studies, and the adherence rate to each item was computed. We conducted a meta-analysis to determine the pooled sensitivity and specificity of machine learning models for detecting KC. (3) Results: Thirty-five studies were included in this review. Thirty studies evaluated machine learning models for detecting KC eyes from controls and 14 studies evaluated machine learning models for detecting early KC eyes from controls. The pooled sensitivity for detecting KC was 0.970 (95% CI 0.949-0.982), with a pooled specificity of 0.985 (95% CI 0.971-0.993), whereas the pooled sensitivity of detecting early KC was 0.882 (95% CI 0.822-0.923), with a pooled specificity of 0.947 (95% CI 0.914-0.967). Between 3% and 48% of TRIPOD items were adhered to in studies, and the average (median) adherence rate for a single TRIPOD item was 23% across all studies. (4) Conclusions: The application of machine learning models has the potential to make the diagnosis and monitoring of KC more efficient, resulting in reduced vision loss for patients. This review provides current information on the machine learning models that have been developed for detecting KC and early KC.
Presently, machine learning models perform poorly in distinguishing early KC from control eyes, and many of these studies did not follow established reporting standards, resulting in the failure of clinical translation of these models. We present possible approaches for improving future studies of both KC and early KC models so that machine learning can be used more efficiently and widely in the diagnostic process.

Project description: Background: In the next 15 to 20 years, the Chinese population will reach a plateau and start to decline. With the changing family structure and rushed urbanization policies, there will be greater demand for high-quality medical resources at urban centers and home-based elderly care driven by telehealth solutions. This paper describes an exploratory study regarding elderly users' preference for telehealth solutions in the next 5 to 10 years in 4 cities, Shenzhen, Hangzhou, Wuhan, and Yichang. Objective: The goal is to analyze why users choose telehealth solutions over traditional health solutions based on a questionnaire study involving 4 age groups (50-60, 61-70, 71-80, and 80+) in 4 cities (Shenzhen, Hangzhou, Wuhan, and Yichang) in the next 10 to 20 years. The legal retirement age for female workers in China is 50 to 55 years and 60 years for male workers. To simulate reality in terms of elderly care in China, the authors use the Chinese definition of elderly for employees, defined as being 50 to 60 years old rather than 65 years, as defined by the World Health Organization. Methods: The questionnaires were collected from Shenzhen, Hangzhou, Wuhan, and Yichang randomly with 390 valid data samples. The questionnaire consists of 31 questions distributed offline on tablet devices by local investigators. Subsequently, Stata 16.0 and SPSS 24.0 were used to analyze the data. O-logit ordered regression and principal component analysis (PCA) were the main theoretical models used. The study is currently in the exploratory stage and therefore does not seek generalization of the results. Results: Approximately 71.09% (280/390) of the respondents reported having at least 1 type of chronic disease. We started with PCA and categorized all Likert scale variables into 3 factors. The influence of demographic variables on Factors 1, 2, and 3 was verified using analysis of variance (ANOVA) and t tests.
The ordered logit regression results suggest that health-related motivations are positively related to the willingness to use telehealth solutions, and trust in data collected from telehealth solutions is negatively correlated with the willingness to use telehealth solutions. Conclusions: The findings suggest that there is a need to address the gap in community health care and ensure health care continuity between different levels of health care institutions in China by providing telehealth solutions. Meanwhile, telehealth solution providers must focus on improving users' health awareness and lowering health risk for chronic diseases by addressing lifestyle changes such as regular exercise and social activity. The interoperability between the electronic health record system and telehealth solutions remains a hurdle for telehealth solutions to add value in health care. The hurdle is that doctors neither adjust health care plans nor diagnose based on data collected by telehealth solutions.
