Dataset Information

Creating efficiencies in the extraction of data from randomized trials: a prospective evaluation of a machine learning and text mining tool.

ABSTRACT:

Background

Machine learning tools that semi-automate data extraction may create efficiencies in systematic review production. We evaluated a machine learning and text mining tool's ability to (a) automatically extract data elements from randomized trials, and (b) save time compared with manual extraction and verification.

Methods

For 75 randomized trials, we manually extracted and verified data for 21 data elements. We uploaded the randomized trials to an online machine learning and text mining tool, and quantified performance by evaluating its ability to identify the reporting of data elements (reported or not reported), and the relevance of the extracted sentences, fragments, and overall solutions. For each randomized trial, we measured the time to complete manual extraction and verification, and to review and amend the data extracted by the tool. We calculated the median (interquartile range [IQR]) time for manual and semi-automated data extraction, and overall time savings.
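
The two summary measures described above, median (IQR) and accuracy for the reported/not reported judgment, can be sketched as follows. This is a minimal illustration, not the authors' analysis code: the function names and the sample values are hypothetical, and the quartile method (`inclusive`) is an assumption because the abstract does not state which one was used.

```python
from statistics import median, quantiles


def median_iqr(values):
    """Return (median, Q1, Q3) for a list of numbers.

    The 'inclusive' quartile method is an assumption; the source
    does not say how the quartiles were computed.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return median(values), q1, q3


def reporting_accuracy(tool_flags, reference_flags):
    """Share of trials where the tool's reported/not-reported flag
    agrees with the manually extracted reference flag."""
    if len(tool_flags) != len(reference_flags):
        raise ValueError("both lists must cover the same trials")
    matches = sum(t == r for t, r in zip(tool_flags, reference_flags))
    return matches / len(reference_flags)


# Hypothetical per-trial extraction times in minutes (not study data).
manual_minutes = [12.0, 15.0, 18.0, 20.0, 25.0]
print(median_iqr(manual_minutes))

# Hypothetical flags for one data element across four trials.
tool = [True, False, True, True]
reference = [True, False, False, True]
print(reporting_accuracy(tool, reference))
```

In the study, accuracy would be computed per data element and then summarized across the 21 elements with the median and IQR.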

Results

The tool identified the reporting (reported or not reported) of data elements with median (IQR) 91% (75% to 99%) accuracy. Among the top five sentences for each data element, at least one sentence was relevant in a median (IQR) 88% (83% to 99%) of cases. Among a median (IQR) 90% (86% to 97%) of relevant sentences, pertinent fragments had been highlighted by the tool; exact matches were unreliable (median (IQR) 52% (33% to 73%)). A median 48% of solutions were fully correct, but performance varied greatly across data elements (IQR 21% to 71%). Using ExaCT to assist the first reviewer resulted in modest time savings compared with manual extraction by a single reviewer (17.9 vs. 21.6 h total extraction time across 75 randomized trials).
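
The size of the reported time savings can be worked out from the two totals given above. The sketch below is only illustrative arithmetic on those published figures; the function name and the per-trial breakdown are assumptions added for illustration and do not appear in the source.

```python
def time_savings(manual_hours, assisted_hours, n_trials):
    """Absolute, relative, and per-trial savings from total extraction times."""
    saved_hours = manual_hours - assisted_hours
    return {
        "saved_hours": saved_hours,
        "relative_reduction": saved_hours / manual_hours,
        "saved_minutes_per_trial": saved_hours * 60 / n_trials,
    }


# Totals reported in the abstract: 21.6 h manual vs. 17.9 h tool-assisted, 75 trials.
result = time_savings(manual_hours=21.6, assisted_hours=17.9, n_trials=75)
print(
    f"{result['saved_hours']:.1f} h saved "
    f"({result['relative_reduction']:.1%}), "
    f"about {result['saved_minutes_per_trial']:.1f} min per trial"
)
# prints "3.7 h saved (17.1%), about 3.0 min per trial"
```

On these figures the saving is roughly 3.7 h in total, or about 17% of the manual extraction time.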

Conclusions

Using ExaCT to assist with data extraction resulted in modest gains in efficiency compared with manual extraction. The tool was reliable for identifying the reporting of most data elements. The tool's ability to identify at least one relevant sentence and highlight pertinent fragments was generally good, but changes to sentence selection and/or highlighting were often required.

Protocol

https://doi.org/10.7939/DVN/RQPJKS.

SUBMITTER: Gates A

PROVIDER: S-EPMC8369614 | biostudies-literature |

REPOSITORIES: biostudies-literature

Similar Datasets

Project description: Early identification and intervention are imperative for suicide prevention. However, at-risk people often neither seek help nor undergo professional assessment. A tool to automatically assess their risk levels in natural settings can increase the opportunity for early intervention. The aim of this study was to explore whether computerized language analysis methods can be utilized to assess one's suicide risk and emotional distress in Chinese social media. A Web-based survey of Chinese social media (ie, Weibo) users was conducted to measure their suicide risk factors including suicide probability, Weibo suicide communication (WSC), depression, anxiety, and stress levels. Participants' Weibo posts published in the public domain were also downloaded with their consent. The Weibo posts were parsed and fitted into Simplified Chinese-Linguistic Inquiry and Word Count (SC-LIWC) categories. The associations between SC-LIWC features and the 5 suicide risk factors were examined by logistic regression. Furthermore, the support vector machine (SVM) model was applied based on the language features to automatically classify whether a Weibo user exhibited any of the 5 risk factors. A total of 974 Weibo users participated in the survey. Those with high suicide probability were marked by a higher usage of pronouns (odds ratio, OR=1.18, P=.001), prepend words (OR=1.49, P=.02), and multifunction words (OR=1.12, P=.04), a lower usage of verbs (OR=0.78, P<.001), and a greater total word count (OR=1.007, P=.008). Second-person plural was positively associated with severe depression (OR=8.36, P=.01) and stress (OR=11, P=.005), whereas work-related words were negatively associated with WSC (OR=0.71, P=.008), severe depression (OR=0.56, P=.005), and anxiety (OR=0.77, P=.02). Inconsistently, third-person plural was found to be negatively associated with WSC (OR=0.02, P=.047) but positively with severe stress (OR=41.3, P=.04). Achievement-related words were positively associated with depression (OR=1.68, P=.003), whereas health- (OR=2.36, P=.004) and death-related (OR=2.60, P=.01) words were positively associated with stress. The machine classifiers did not achieve satisfying performance in the full sample set but could classify high suicide probability (area under the curve, AUC=0.61, P=.04) and severe anxiety (AUC=0.75, P<.001) among those who have exhibited WSC. SC-LIWC is useful to examine language markers of suicide risk and emotional distress in Chinese social media and can identify characteristics different from previous findings in the English literature. Some findings are leading to new hypotheses for future verification. Machine classifiers based on SC-LIWC features are promising but still require further optimization for application in real life.

Project description: Objectives: The study aimed to conduct a bibliometric analysis of publications concerning lumbar spondylolisthesis, and to summarize its research topics and hotspot trends with machine-learning based text mining. Methods: The data were extracted from the Web of Science Core Collection (WoSCC) database and then analyzed in RStudio 1.3.1 and CiteSpace 5.8. Annual publication production and the top-20 productive authors over time were obtained. Additionally, the top-20 productive journals and top-20 influential journals were compared by whether or not they were spine-subspecialty journals. Similarly, the top-20 productive countries/regions and top-20 influential countries/regions were compared by whether or not they were developed countries/regions. The collaborative relationships among countries and institutions were presented. The main topics of lumbar spondylolisthesis were classified by latent Dirichlet allocation (LDA) analysis, and the hotspot trends were indicated by keywords with the strongest citation bursts. Results: Up to 2021, a total of 4,245 articles concerning lumbar spondylolisthesis were included for bibliometric analysis. Spine-subspecialty journals were found to be dominant in the productivity and the impact of the field, and SPINE, EUROPEAN SPINE JOURNAL and JOURNAL OF NEUROSURGERY-SPINE were the top-3 productive and the top-3 influential journals in this field. The USA, Japan and China contributed over half of the publications, but European countries seemed to publish more influential articles. Developed countries/regions tended to produce more articles and more influential articles, and international collaborations mainly occurred among the USA, Europe and eastern Asia. Surgical management was the major topic of publications in this field, followed by radiographic assessment and epidemiology. Surgical management, especially minimally invasive techniques for lumbar spondylolisthesis, was the recent hotspot over the past 5 years. Conclusions: The study summarized the productivity and impact of different entities, which should benefit journal selection and the pursuit of international collaboration for researchers who are interested in the field of lumbar spondylolisthesis. Additionally, the current study may encourage more researchers to join the field and help inform their future research direction.