IDDF2024-ABS-0249 Performance of large language models in interpretation and summarization of endoscopy reports: a comparative analysis

Kunping Ju, Qingzhou Kong, Yueyue Li, Yanqing Li

Qilu Hospital of Shandong University, China

Abstract

Background Large language models (LLMs) offer new opportunities for the transformation of healthcare. This study aimed to assess the performance of different LLMs for the interpretation and summarization of endoscopy reports.

Methods This study was conducted in March 2024 and involved three mainstream LLMs (Claude 3 Opus, Gemini Pro, and GPT-4), which were tasked with interpreting and summarizing six fictitious endoscopy reports based on common clinical scenarios. Three gastroenterologists, blinded to the source model, used Likert scales to score the generated content for accuracy, completeness, and comprehensibility (interpretations) or conciseness (impressions). Inter-rater reliability analysis was performed on the scoring dimensions. The readability of the interpretations was also evaluated with the Flesch-Kincaid Reading Ease tool.
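
The abstract does not specify how the Flesch-Kincaid Reading Ease scores were obtained; as a point of reference, the following is a minimal, self-contained Python sketch of the standard formula (206.835 − 1.015 × words/sentences − 84.6 × syllables/words). The syllable counter is a crude heuristic and the sample text is invented for illustration, not taken from the study.

```python
import re

def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words).
    Higher scores indicate text that is easier to read."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)

    def count_syllables(word: str) -> int:
        # Crude heuristic: count vowel groups, discounting a trailing silent 'e'.
        n = len(re.findall(r"[aeiouy]+", word.lower()))
        if word.lower().endswith("e") and n > 1:
            n -= 1
        return max(1, n)

    n_words = max(1, len(words))
    n_syllables = sum(count_syllables(w) for w in words)
    return 206.835 - 1.015 * (n_words / sentences) - 84.6 * (n_syllables / n_words)

# Invented example of an LLM-style plain-language interpretation (not study data):
sample = ("The examination showed mild redness of the stomach lining. "
          "No ulcers or growths were found. A small tissue sample was taken for testing.")
print(f"Reading ease: {flesch_reading_ease(sample):.1f}")
```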

Results The acceptable proportions for the interpretations in terms of accuracy, completeness, and comprehensibility were 100%, 94.4%, and 100%, respectively. For the impressions, the acceptable proportions for accuracy, completeness, and conciseness were 94.4%, 72.2%, and 100%, respectively (IDDF2024-ABS-0249 Figure 1A). Interpretations from Claude 3 Opus (49.59±8.95, p<0.001) and GPT-4 (55.01±5.13, p<0.0001) achieved significantly higher readability scores than the fictitious reports (31.72±4.78), whereas interpretations from Gemini Pro (38.24±3.74, p=0.175) did not differ significantly. For the Likert-scale scores, the overall mean interpretation score for Claude 3 Opus (11.11±0.76, p=0.025) and GPT-4 (11.06±1.11, p=0.027) was significantly higher than that for Gemini Pro (10.11±1.08; IDDF2024-ABS-0249 Figure 1B). Regarding impressions, Claude 3 Opus (10.56±1.98, p=0.011) and GPT-4 (10.94±0.87, p=0.009) also achieved significantly higher overall mean scores than Gemini Pro (9.78±1.11; IDDF2024-ABS-0249 Figure 1C). Kendall's W coefficients of concordance for the accuracy, completeness, and comprehensibility of interpretations and for the accuracy, completeness, and conciseness of impressions were 0.518 (moderate agreement), 0.502 (moderate agreement), 0.278 (fair agreement), 0.667 (good agreement), 0.302 (fair agreement), and 0.531 (moderate agreement), respectively.

Abstract IDDF2024-ABS-0249 Figure 1 (panels A–C)
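
For reference, a minimal sketch of how Kendall's coefficient of concordance W can be computed for a raters × items score matrix is shown below, without the correction for tied ranks; the scores are invented for illustration and are not the study's data.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w(ratings: np.ndarray) -> float:
    """Kendall's W for a (raters x items) score matrix.

    Each rater's scores are converted to ranks across the items;
    W = 12*S / (m^2 * (n^3 - n)), where S is the sum of squared deviations
    of the items' rank sums from their mean. Ties are not corrected here.
    """
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)  # rank within each rater
    rank_sums = ranks.sum(axis=0)                      # rank sum R_i for each item
    s = float(((rank_sums - rank_sums.mean()) ** 2).sum())
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical scores: 3 raters x 6 reports on a Likert-type scale (illustrative only)
scores = np.array([
    [4, 5, 3, 4, 5, 2],
    [4, 4, 3, 5, 5, 2],
    [5, 4, 2, 4, 5, 3],
])
print(f"Kendall's W = {kendalls_w(scores):.3f}")  # 0 = no agreement, 1 = perfect agreement
```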

Conclusions LLMs excel in interpreting and summarizing endoscopy reports, but inter-model differences exist. Further optimization and development of specialized LLMs for endoscopic text processing applications may be warranted.
