Dataset Information

Evaluating multimodal AI in medical diagnostics.

ABSTRACT: This study evaluates multimodal AI models' accuracy and responsiveness in answering NEJM Image Challenge questions, juxtaposed with human collective intelligence, underscoring AI's potential and current limitations in clinical diagnostics. Anthropic's Claude 3 family demonstrated the highest accuracy among the evaluated AI models, surpassing the average human accuracy, while collective human decision-making outperformed all AI models. GPT-4 Vision Preview exhibited selectivity, responding more to easier questions with smaller images and longer questions.
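The abstract distinguishes average human accuracy from collective human decision-making. The sketch below illustrates how the two can differ. It assumes the collective answer is the most-voted option per question, which this record does not specify. The function names and vote data are invented for illustration and are not taken from the study.

```python
from collections import Counter


def average_individual_accuracy(votes, correct):
    """Fraction of all individual votes that match the correct option."""
    total = sum(len(question_votes) for question_votes in votes)
    hits = sum(qv.count(answer) for qv, answer in zip(votes, correct))
    return hits / total


def collective_accuracy(votes, correct):
    """Fraction of questions where the most-voted option is correct.

    Assumption: the collective decision is the plurality vote.
    """
    hits = 0
    for question_votes, answer in zip(votes, correct):
        top_option, _ = Counter(question_votes).most_common(1)[0]
        hits += top_option == answer
    return hits / len(votes)


# Toy data, invented for illustration only (not from the study).
votes = [
    list("AAABC"),  # correct answer A: 3 of 5 voters right, plurality is A
    list("BBCDA"),  # correct answer B: 2 of 5 voters right, plurality is B
    list("CCCCD"),  # correct answer D: 1 of 5 voters right, plurality is C
]
correct = ["A", "B", "D"]

print(average_individual_accuracy(votes, correct))
print(collective_accuracy(votes, correct))
```

In this toy case only 6 of 15 individual votes are correct, yet the plurality answer is correct on 2 of 3 questions, so the collective accuracy exceeds the average individual accuracy.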

SUBMITTER: Kaczmarczyk R

PROVIDER: S-EPMC11306783 | biostudies-literature | 2024 Aug

REPOSITORIES: biostudies-literature

Publications

Evaluating multimodal AI in medical diagnostics.

Robert Kaczmarczyk, Theresa Isabelle Wilhelm, Ron Martin, Jonas Roos

NPJ Digital Medicine, 2024-08-07, issue 1

PMID: 39112822

Similar Datasets

Project description:
Background: Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. A tool that can help doctors efficiently process and understand medical papers is needed.
Objective: This study aims to critically assess and compare how accurately and efficiently large language models (LLMs) understand medical research papers, using the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist, which provides a standardized framework for evaluating key elements of observational studies.
Methods: This is a methodological study evaluating how well new generative artificial intelligence tools understand medical papers. A novel benchmark pipeline processed 50 medical research papers from PubMed, comparing the answers of 6 LLMs (GPT-3.5-Turbo, GPT-4-0613, GPT-4-1106, PaLM 2, Claude v1, and Gemini Pro) to the benchmark established by expert medical professors. Fifteen questions, derived from the STROBE checklist, assessed LLMs' understanding of different sections of a research paper.
Results: LLMs exhibited varying performance, with GPT-3.5-Turbo achieving the highest percentage of correct answers (n=3916, 66.9%), followed by GPT-4-1106 (n=3837, 65.6%), PaLM 2 (n=3632, 62.1%), Claude v1 (n=2887, 58.3%), Gemini Pro (n=2878, 49.2%), and GPT-4-0613 (n=2580, 44.1%). The differences between LLMs were statistically significant (P<.001), with older models showing inconsistent performance compared to newer versions. Performance also varied by question across different parts of a scholarly paper, with certain models like PaLM 2 and GPT-3.5 showing remarkable versatility and depth in understanding.
Conclusions: This study is the first to evaluate the performance of different LLMs in understanding medical papers using the retrieval augmented generation method. The findings highlight the potential of LLMs to enhance medical research by improving efficiency and facilitating evidence-based decision-making. Further research is needed to address limitations such as the influence of question formats, potential biases, and the rapid evolution of LLMs.

Project description:
Background: Medical students often struggle to engage with and retain complex pharmacology topics during their preclinical education. Traditional teaching methods can lead to passive learning and poor long-term retention of critical concepts.
Objective: This study aims to enhance the teaching of clinical pharmacology in medical school by using a multimodal generative artificial intelligence (genAI) approach to create compelling, cinematic clinical narratives (CCNs).
Methods: We transformed a standard clinical case into an engaging, interactive multimedia experience called "Shattered Slippers." This CCN used various genAI tools for content creation: GPT-4 for developing the storyline, Leonardo.ai and Stable Diffusion for generating images, Eleven Labs for creating audio narrations, and Suno for composing a theme song. The CCN integrated narrative styles and pop culture references to enhance student engagement. It was applied in teaching first-year medical students about immune system pharmacology. Student responses were assessed through the Situational Interest Survey for Multimedia and examination performance. The target audience comprised first-year medical students (n=40), of whom 18 responded to the Situational Interest Survey for Multimedia.
Results: The study revealed a marked preference for the genAI-enhanced CCNs over traditional teaching methods. Most surveyed students preferred the CCN over traditional clinical cases (14/18), and average scores were high for triggered situational interest (mean 4.58, SD 0.53), maintained interest (mean 4.40, SD 0.53), maintained-feeling interest (mean 4.38, SD 0.51), and maintained-value interest (mean 4.42, SD 0.54). Students achieved an average score of 88% on examination questions related to the CCN material, indicating successful learning and retention. Qualitative feedback highlighted increased engagement, improved recall, and appreciation for the narrative style and pop culture references.
Conclusions: This study demonstrates the potential of using a multimodal genAI-driven approach to create CCNs in medical education. The "Shattered Slippers" case effectively enhanced student engagement and promoted knowledge retention in complex pharmacological topics. This innovative method suggests a novel direction for curriculum development that could improve learning outcomes and student satisfaction in medical education. Future research should explore the long-term retention of knowledge and the applicability of learned material in clinical settings, as well as the potential for broader implementation of this approach across various medical education contexts.

Project description:
Background and objective: Medical image segmentation is a vital aspect of medical image processing, allowing healthcare professionals to conduct precise and comprehensive lesion analyses. Traditional segmentation methods are often labor intensive and influenced by the subjectivity of individual physicians. The advent of artificial intelligence (AI) has transformed this field by reducing the workload of physicians and improving the accuracy and efficiency of disease diagnosis. However, conventional AI techniques are not without challenges. Issues such as inexplicability, uncontrollable decision-making processes, and unpredictability can lead to confusion and uncertainty in clinical decision-making. This review explores the evolution of AI in medical image segmentation, focusing on the development and impact of explainable AI (XAI) and trustworthy AI (TAI).
Methods: This review synthesizes existing literature on traditional segmentation methods, AI-based approaches, and the transition from conventional AI to XAI and TAI. The review highlights the key principles and advancements in XAI that aim to address the shortcomings of conventional AI by enhancing transparency and interpretability. It further examines how TAI builds on XAI to improve the reliability, safety, and accountability of AI systems in medical image segmentation.
Key content and findings: XAI has emerged as a solution to the limitations of conventional AI by providing greater transparency and interpretability, allowing healthcare professionals to better understand and trust AI-driven decisions. However, XAI itself faces challenges, including those related to safety, robustness, and value alignment. TAI has been developed to overcome these challenges, offering a more reliable framework for AI applications in medical image segmentation. By integrating the principles of XAI with enhanced safety and dependability, TAI addresses the critical need for dependable AI systems in clinical settings.
Conclusions: TAI presents a promising future for medical image segmentation, combining the benefits of AI with improved reliability and safety. Thus, TAI is a more viable and dependable option for healthcare applications, and could ultimately lead to better clinical outcomes for patients and advance the field of medical image processing.
