Dataset Information

Training sample selection: Impact on screening automation in diagnostic test accuracy reviews.

ABSTRACT: When performing a systematic review, researchers screen the articles retrieved after a broad search strategy one by one, which is time-consuming. Computerised support of this screening process has been applied with varying success. This is partly due to the dependency on large amounts of data to develop models that predict inclusion. In this paper, we present an approach to choose which data to use in model training and compare it with established approaches. We used a dataset of 50 Cochrane diagnostic test accuracy reviews, and each was used as a target review. From the remaining 49 reviews, we selected those that most closely resembled the target review's clinical topic using the cosine similarity metric. Included and excluded studies from these selected reviews were then used to develop our prediction models. The performance of models trained on the selected reviews was compared against models trained on studies from all available reviews. The prediction models performed best with a larger number of reviews in the training set and on target reviews that had a research subject similar to other reviews in the dataset. Our approach using cosine similarity may reduce computational costs for model training and the duration of the screening process.
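The selection step described in the abstract (rank the other reviews by cosine similarity to the target review and keep the closest ones) can be illustrated with a minimal sketch. The abstract does not say how reviews are represented, so the bag-of-words counts over a short topic text, the function names `term_counts`, `cosine_similarity` and `select_training_reviews`, the cut-off `k`, and the toy review texts are all assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def term_counts(text):
    """Lowercased, whitespace-tokenised bag of words (an assumed representation)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def select_training_reviews(target_text, candidates, k):
    """Return the ids of the k candidate reviews most similar to the target."""
    target_vec = term_counts(target_text)
    scored = [
        (cosine_similarity(target_vec, term_counts(text)), review_id)
        for review_id, text in candidates.items()
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [review_id for _, review_id in scored[:k]]


# Toy, made-up review topics (hypothetical, not taken from the dataset).
candidates = {
    "review_a": "ultrasound for diagnosing appendicitis in children",
    "review_b": "rapid antigen tests for diagnosing malaria",
    "review_c": "computed tomography for diagnosing appendicitis in adults",
}
target = "magnetic resonance imaging for diagnosing appendicitis"
selected = select_training_reviews(target, candidates, k=2)
print(selected)
```

In this sketch the included and excluded studies of the selected reviews would then form the training set for the target review; a real pipeline would likely use a weighted representation such as TF-IDF rather than raw counts.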

SUBMITTER: van Altena AJ

PROVIDER: S-EPMC9292892 | biostudies-literature | 2021 Nov

REPOSITORIES: biostudies-literature

Publications

Training sample selection: Impact on screening automation in diagnostic test accuracy reviews.

van Altena AJ, Spijker R, Leeflang MMG, Olabarriaga SD

Research synthesis methods 20210825 6

PMID: 34390193

Similar Datasets

Project description: Background: The large and increasing number of new studies published each year is making literature identification in systematic reviews ever more time-consuming and costly. Technological assistance has been suggested as an alternative to the conventional, manual study identification to mitigate the cost, but previous literature has mainly evaluated methods in terms of recall (search sensitivity) and workload reduction. There is a need to also evaluate whether screening prioritization methods lead to the same results and conclusions as exhaustive manual screening. In this study, we examined the impact of one screening prioritization method based on active learning on sensitivity and specificity estimates in systematic reviews of diagnostic test accuracy. Methods: We simulated the screening process in 48 Cochrane reviews of diagnostic test accuracy and re-ran 400 meta-analyses based on at least 3 studies. We compared screening prioritization (with technological assistance) and screening in randomized order (standard practice without technological assistance). We examined whether the screening could have been stopped before identifying all relevant studies while still producing reliable summary estimates. For all meta-analyses, we also examined the relationship between the number of relevant studies and the reliability of the final estimates. Results: The main meta-analysis in each systematic review could have been performed after screening an average of 30% of the candidate articles (range 0.07 to 100%). No systematic review would have required screening more than 2308 studies, whereas manual screening would have required screening up to 43,363 studies. Despite an average 70% recall, the estimation error would have been 1.3% on average, compared to an average 2% estimation error expected when replicating summary estimate calculations. Conclusion: Screening prioritization coupled with stopping criteria in diagnostic test accuracy reviews can reliably detect when the screening process has identified a sufficient number of studies to perform the main meta-analysis with an accuracy within pre-specified tolerance limits. However, many of the systematic reviews did not identify a sufficient number of studies for the meta-analyses to be accurate within a 2% limit even with exhaustive manual screening, i.e., using current practice.
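The two quantities this description reports for a prioritized screening run, the share of candidate articles screened and the recall reached at the stopping point, can be computed from a ranked list as in the minimal sketch below. The function name `recall_after_screening`, the made-up relevance labels, and the fixed stopping point are assumptions for illustration; the study's own stopping criteria are not reproduced here.

```python
def recall_after_screening(ranked_labels, n_screened):
    """Fraction of all relevant records found within the first n_screened records."""
    total_relevant = sum(ranked_labels)
    if total_relevant == 0:
        return 0.0  # no relevant records exist, so nothing can be recalled
    return sum(ranked_labels[:n_screened]) / total_relevant


# Made-up ranking: 1 = relevant, 0 = irrelevant, highest-priority record first.
ranked_labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
n_screened = 3  # hypothetical stopping point

share_screened = n_screened / len(ranked_labels)
recall = recall_after_screening(ranked_labels, n_screened)
print(f"screened {share_screened:.0%} of records, recall {recall:.0%}")
```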

Project description: Background: Systematic review is an indispensable tool for optimal evidence collection and evaluation in evidence-based medicine. However, the explosive increase in original literature makes critical appraisal and regular updating difficult. Artificial intelligence (AI) algorithms have been applied to automate the literature screening procedure in medical systematic reviews. In these studies, different algorithms were used and results with great variance were reported. It is therefore imperative to systematically review and analyse the automatic methods developed for literature screening and their effectiveness as reported in current studies. Methods: An electronic search will be conducted using the PubMed, Embase, ACM Digital Library, and IEEE Xplore Digital Library databases, as well as literature found through a supplementary search in Google Scholar, on automatic methods for literature screening in systematic reviews. Two reviewers will independently conduct the primary screening of the articles and data extraction, and disagreements will be resolved by discussion with a methodologist. Data will be extracted from eligible studies, including the basic characteristics of the study, information on the training set and validation set, and the function and performance of the AI algorithms, and summarised in a table. The risk of bias and applicability of the eligible studies will be assessed by the two reviewers independently based on Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2). Quantitative analyses, if appropriate, will also be performed. Discussion: Automating the systematic review process is of great help in reducing workload in evidence-based practice. Results from this systematic review will provide an essential summary of the current development of AI algorithms for automatic literature screening in medical evidence synthesis and help to inspire further studies in this field. Systematic review registration: PROSPERO CRD42020170815 (28 April 2020).

Project description: Objectives: To compare the accuracy of trained level 1 diabetic retinopathy (DR) graders (nurses, endocrinologists and one general practitioner), level 2 graders (midlevel ophthalmologists) and level 3 graders (senior ophthalmologists) in Vietnam against a reference standard from the UK and assess the impact of supplementary targeted grader training. Design: Diagnostic test accuracy study. Setting: Secondary care hospitals in Southern Vietnam. Participants: DR training was delivered to Vietnamese graders in February 2018 by National Health Service (NHS) UK graders. Two-field retinal images (412 patient images) were graded by 14 trained graders in Vietnam between August and October 2018 and then regraded retrospectively by an NHS-certified reference standard UK optometrist (phase I). Further DR training based on phase I results was delivered to graders in November 2019. After training, a randomised subset of images from January to October 2020 (115 patient images) was graded by six of the original cohort (phase II). The reference grader regraded all images from phase I and II retrospectively in masked fashion. Primary and secondary outcome measures: Sensitivity was calculated at the two different time points, and χ² was used to test significance. Results: In phase I, the sensitivity for detecting any DR for all grader groups in Vietnam was low (41.8-42.2%) and improved in phase II after additional training was delivered (51.3-87.2%). The greatest improvement was seen among level 1 graders (p<0.001), and the lowest improvement was observed among level 3 graders (p=0.326). There was a statistically significant improvement in sensitivity for detecting referable DR and referable diabetic macular oedema between all grader levels. The post-training values ranged from 40.0 to 61.5% (including ungradable images) and 55.6%-90.0% (excluding ungradable images). Conclusions: This study demonstrates that targeted training interventions can improve accuracy of DR grading. These findings have important implications for improving service delivery in DR screening programmes in low-resource settings.

Project description: Introduction: Systemic lupus erythematosus (SLE) is a chronic autoimmune disease with multiorgan inflammatory involvement and a mortality rate that is 2.6-fold higher than that of individuals of the same age and sex in the general population. Approximately 50% of patients with SLE develop renal impairment (lupus nephritis, LN). Delayed diagnosis of lupus nephritis is associated with a higher risk of progression to end-stage renal disease, the need for replacement therapy, and mortality. The initial clinical manifestations of lupus nephritis are often discrete or absent and are usually detected through complementary tests. Although these tests are widely used in clinical practice, their accuracy is limited. A great scientific effort has been exerted towards searching for new, more sensitive, and specific biomarkers in recent years. Some systematic reviews have individually evaluated new serum and urinary biomarkers tested in patients with lupus nephritis. This overview aimed to summarize systematic reviews on the accuracy of novel serum and urinary biomarkers for diagnosing lupus nephritis in patients with SLE, discussing how our results can guide the clinical management of the disease and the direction of research in this area. Methods: The research question is “What is the accuracy of the new serum and urinary biomarkers studied for the diagnosis of LN in patients with SLE?”. We searched for systematic reviews of observational studies evaluating the diagnostic accuracy of new serum or urinary biomarkers of lupus nephritis. The following databases were included: PubMed, EMBASE, BIREME/LILACS, Scopus, Web of Science, and Cochrane, including gray literature found via Google Scholar and PROQUEST. Two authors assessed the reviews for inclusion, data extraction, and assessment of the risk of bias (ROBIS tool). Results: Ten systematic reviews (SRs) on the diagnostic accuracy of new serum and urinary biomarkers (BMs) in LN were selected. The SRs evaluated 7 distinct BMs: (a) antibodies (anti-Sm, anti-RNP, and anti-C1q), (b) cytokines (TWEAK and MCP-1), (c) a chemokine (IP-10), and (d) an acute phase glycoprotein (NGAL), in a total of 20 review arms (9 that analyzed serum BMs, and 12 that analyzed BMs in urine). The population evaluated in the primary studies was predominantly adults. Two SRs included strictly adults, 5 reviews also included studies in the paediatric population, and 4 did not report the age groups. The results of the evaluation with the ROBIS tool showed that most of the reviews had a low overall risk of bias. Conclusions: There are 10 SRs of evidence relating to the diagnostic accuracy of serum and urinary biomarkers for lupus nephritis. Among the BMs evaluated, anti-C1q, urinary MCP-1, TWEAK, and NGAL stood out, highlighting the need for additional research, especially on LN diagnostic panels, and attempting to address methodological issues within diagnostic accuracy research. This would allow for a better understanding of their usefulness and possibly validate their clinical use in the future. Registration: This project is registered on the International Prospective Registry of Systematic Reviews (PROSPERO) database (CRD42020196693).

Project description: Importance: Systematic reviews of medical imaging diagnostic test accuracy (DTA) studies are affected by between-study heterogeneity due to a range of factors. Failure to appropriately assess the extent and causes of heterogeneity compromises the interpretability of systematic review findings. Objective: To assess how heterogeneity has been examined in medical imaging DTA studies. Evidence review: The PubMed database was searched for systematic reviews of medical imaging DTA studies that performed a meta-analysis. The search was limited to the 40 journals with highest impact factor in the radiology, nuclear medicine, and medical imaging category in the InCites Journal Citation Reports of 2021 to reach a sample size of 200 to 300 included studies. Descriptive analysis was performed to characterize the imaging modality, target condition, type of meta-analysis model used, strategies for evaluating heterogeneity, and sources of heterogeneity identified. Multivariable logistic regression was performed to assess whether any factors were associated with at least 1 source of heterogeneity being identified in the included meta-analyses. Methodological quality evaluation was not performed. Data analysis occurred from October to December 2022. Findings: A total of 242 meta-analyses involving a median (range) of 987 (119-441 510) patients across a diverse range of disease categories and imaging modalities were included. The extent of heterogeneity was adequately described (ie, whether it was absent, low, moderate, or high) in 220 studies (91%) and was most commonly assessed using the I² statistic (185 studies [76%]) and forest plots (181 studies [75%]). Heterogeneity was rated as moderate to high in 191 studies (79%). Of all included meta-analyses, 122 (50%) performed subgroup analysis and 87 (36%) performed meta-regression. Of the 242 studies assessed, 189 (78%) included 10 or more primary studies. Of these 189 studies, 60 (32%) did not perform meta-regression or subgroup analysis. Reasons for being unable to investigate sources of heterogeneity included inadequate reporting of primary study characteristics and a low number of included primary studies. Use of meta-regression was associated with identification of at least 1 source of variability (odds ratio, 1.90; 95% CI, 1.11-3.23; P = .02). Conclusions and relevance: In this systematic review of assessment of heterogeneity in medical imaging DTA meta-analyses, most meta-analyses were impacted by a moderate to high level of heterogeneity, presenting interpretive challenges. These findings suggest that, despite the development and availability of more rigorous statistical models, heterogeneity assessment appeared to be incomplete, inconsistently evaluated, or methodologically questionable in many cases, which lessened the interpretability of the analyses performed. Comprehensive heterogeneity assessment should be addressed at the author level, by improving personal familiarity with appropriate statistical methodology for assessing heterogeneity and involving biostatisticians and epidemiologists in study design, as well as at the editorial level, by mandating adherence to methodologic standards in primary DTA studies and DTA meta-analyses.

Project description: The standard item response theory (IRT) model assumption of a single homogeneous population may be violated in real data. Mixture extensions of IRT models have been proposed to account for latent heterogeneous populations, but these models are not designed to handle multilevel data structures. Ignoring the multilevel structure is problematic as it results in lower-level units aggregated with higher-level units and yields less accurate results, because of dependencies in the data. Multilevel data structures cause such dependencies between levels but can be modeled in a straightforward way in multilevel mixture IRT models. An important step in the use of multilevel mixture IRT models is the fit of the model to the data. This fit is often determined based on relative fit indices. Previous research on mixture IRT models has shown that the performance of these indices and the classification accuracy of these models can be affected by several factors including percentage of class-variant items, number of items, magnitude and size of clusters, and mixing proportions of latent classes. As yet, no studies appear to have been reported examining these issues for multilevel extensions of mixture IRT models. The current study aims to investigate the effects of several features of the data on the accuracy of model selection and parameter recovery. Results are reported on a simulation study designed to examine the following features of the data: percentages of class-variant items (30, 60, and 90%), numbers of latent classes in the data (from 1 to 3 latent classes at level 1, and 1 and 2 latent classes at level 2), numbers of items (10, 30, and 50), numbers of clusters (50 and 100), cluster size (10 and 50), and mixing proportions [equal (0.5 and 0.5) vs. non-equal (0.25 and 0.75)]. Simulation results indicated that multilevel mixture IRT models resulted in less accurate estimates when the number of clusters and the cluster size were small. In addition, mean root mean square error (RMSE) values increased as the percentage of class-variant items increased, and parameters were recovered more accurately under the 30% class-variant item conditions. Mixing proportion type (i.e., equal vs. unequal latent class sizes) and numbers of items (10, 30, and 50), however, did not show any clear pattern. The sample-size-dependent fit indices BIC, CAIC, and SABIC performed poorly for the smaller level-1 sample size. For the remaining conditions, the SABIC index performed better than the other fit indices.
