Article Text
Abstract
Background Large language models (LLMs) offer new opportunities for transforming healthcare. This study aimed to assess the performance of different LLMs in interpreting and summarizing endoscopy reports.
Methods This study was conducted in March 2024 and involved three mainstream LLMs (Claude 3 Opus, Gemini Pro, and GPT-4), which were tasked with interpreting and summarizing six fictitious endoscopy reports based on common clinical scenarios. Three gastroenterologists used Likert scales to blindly score the content generated by the LLMs, evaluating accuracy, completeness, and comprehensibility for the interpretations, and accuracy, completeness, and conciseness for the summarized impressions. Inter-rater reliability analysis was performed on the scoring dimensions. The interpretations of the reports were also evaluated for readability with the Flesch-Kincaid Reading Ease tool.
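For context, the Flesch-Kincaid Reading Ease score is conventionally computed from word, sentence, and syllable counts, with higher scores indicating more readable text; assuming the tool applies the standard formula without modification, it is

\[ \text{FRE} = 206.835 - 1.015\left(\frac{\text{total words}}{\text{total sentences}}\right) - 84.6\left(\frac{\text{total syllables}}{\text{total words}}\right). \]

On the usual interpretation bands, scores of roughly 30-50 correspond to "difficult" (college-level) text and 50-60 to "fairly difficult" text, which is consistent with the score ranges reported in the Results.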
Results The acceptable proportions for the interpretations in terms of accuracy, completeness, and comprehensibility were 100%, 94.4%, and 100%, respectively. For the impressions, the acceptable proportions for accuracy, completeness, and conciseness were 94.4%, 72.2%, and 100%, respectively (IDDF2024-ABS-0249 Figure 1A). Interpretations from Claude 3 Opus (49.59±8.95, p<0.001) and GPT-4 (55.01±5.13, p<0.0001) achieved significantly higher readability scores than the fictitious reports (31.72±4.78), while interpretations from Gemini Pro (38.24±3.74, p=0.175) showed no significant difference. For scores based on Likert scales, the overall mean interpretation scores for Claude 3 Opus (11.11±0.76, p=0.025) and GPT-4 (11.06±1.11, p=0.027) were significantly higher than that for Gemini Pro (10.11±1.08; IDDF2024-ABS-0249 Figure 1B). Regarding impressions, Claude 3 Opus (10.56±1.98, p=0.011) and GPT-4 (10.94±0.87, p=0.009) also achieved significantly higher overall mean scores than Gemini Pro (9.78±1.11; IDDF2024-ABS-0249 Figure 1C). Kendall's W coefficients of concordance for the accuracy, completeness, and comprehensibility of interpretations and for the accuracy, completeness, and conciseness of impressions were 0.518 (moderate agreement), 0.502 (moderate agreement), 0.278 (fair agreement), 0.667 (good agreement), 0.302 (fair agreement), and 0.531 (moderate agreement), respectively.
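For reference, Kendall's W for m raters scoring n items is conventionally defined (without a tie correction, which the abstract does not specify) as

\[ W = \frac{12\,S}{m^{2}\,(n^{3}-n)}, \]

where S is the sum of squared deviations of the items' rank totals from their mean; W ranges from 0 (no agreement among raters) to 1 (complete agreement).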
Conclusions LLMs excel in interpreting and summarizing endoscopy reports, but inter-model differences exist. Further optimization and development of specialized LLMs for endoscopic text processing applications may be warranted.