Abstract
Objective
This study evaluates the accuracy of ChatGPT-4o, ChatGPT-o1, Gemini, and ERNIE Bot in answering rotator cuff injury questions and responding to patients. Results show Gemini excels in accuracy, while ChatGPT-4o performs better in patient interactions.
Methods
Phase 1: Four LLM chatbots answered physician assessment questions on rotator cuff injuries, and patients and students answered the same questions for comparison. Performance was assessed for accuracy and clarity across 120 multiple-choice and 20 clinical questions. Phase 2: Twenty patients questioned the top two chatbots from Phase 1 (ChatGPT-4o and Gemini), and responses were rated for satisfaction and readability. Three physicians evaluated accuracy, usefulness, safety, and completeness using a 5-point Likert scale. Statistical analyses and plotting used IBM SPSS 29.0.1.0 and Prism 10; the Friedman test compared evaluation and readability scores among chatbots, with Bonferroni-corrected pairwise comparisons, and the Mann-Whitney U test compared ChatGPT-4o with Gemini; statistical significance was set at p < 0.05.
Results
In Phase 1, Gemini achieved the highest average accuracy on the multiple-choice questions. In the second part of Phase 1, Gemini showed the highest proficiency in answering clinical queries related to rotator cuff injury (accuracy: 4.70; completeness: 4.72; readability: 4.70; usefulness: 4.61; safety: 4.70; post hoc Dunnett test, p < 0.05), while ChatGPT-4o produced the responses with the highest reading difficulty score (14.22, post hoc Dunnett test, p < 0.05), suggesting a middle school reading level or above. In Phase 2, 20 rotator cuff injury patients questioned the top two models from Phase 1 (ChatGPT-4o and Gemini). ChatGPT-4o received significantly higher patient satisfaction ratings (4.52 vs. 3.76, p < 0.001) and higher readability ratings (4.35 vs. 4.23). Orthopedic surgeons also rated ChatGPT-4o higher in accuracy, completeness, readability, usefulness, and safety (all p < 0.05), outperforming Gemini in all aspects.
Conclusion
The study found that LLMs, particularly ChatGPT-4o and Gemini, excelled in understanding rotator cuff injury-related knowledge and responding to patients, showing strong potential for further development.
Introduction
Rotator cuff injury remains a significant challenge in musculoskeletal medicine. The rotator cuff comprises four tendons and their associated muscles and plays a crucial role in stabilizing and mobilizing the shoulder joint. Damage to these structures from trauma, overuse, or degenerative changes can cause pain, functional impairment, and a decline in quality of life. The overall prevalence of rotator cuff injury exceeds 40%, ranging from 9.7% in patients under 20 years of age to 62% in patients 80 years of age and older [1]. In the United States, outpatient visits for rotator cuff tears reach approximately 4.5 million annually [2]. Epidemiological studies by Yamaguchi and colleagues report a rotator cuff injury prevalence of up to 54% in people over 50, a figure that rises further in those over 60 [3], and subsequent studies confirm that prevalence continues to increase with age [4]. Hence, timely and correct diagnosis of rotator cuff tears is crucial to managing the condition and achieving better outcomes for the patient.
Diagnosis and classification of rotator cuff lesions rely on clinical examination, imaging studies such as ultrasound (US) and magnetic resonance imaging (MRI), and, in some cases, surgery [5]. With the growth of artificial intelligence (AI) technology, and especially the recent rise of natural language processing (NLP) models, the application of AI in healthcare has gained significant interest [6, 7]. AI technologies have the potential to improve the assessment of rotator cuff injuries by assisting with patient symptom analysis, medical history evaluation, and responses to key clinical queries. ChatGPT, Google Gemini, and ERNIE Bot, which belong to the category of large language models (LLMs), currently exhibit strong language understanding across multiple domains. These models go beyond basic language comprehension: they can interpret symptoms described by patients and physicians, hypothesize possible conditions, and have their output checked against established medical expertise [8, 9]. This study evaluates the efficiency of AI models, including ChatGPT and Gemini, on rotator cuff injury questions, comparing their accuracy and reliability in standardized tests and in clinical practice, and examines whether these models can enhance clinical decision support systems by improving clinical decision-making.
Given the growing role of AI in the medical sector, there is a clear opportunity to apply it to diagnostic processes, particularly to address barriers such as communication gaps between patients and physicians and the heavy reliance on clinicians for every medical decision [10, 11]. This research explores how these technologies can support the early identification of rotator cuff injuries and advance the development of intelligent healthcare.
Methods
Study design
This study comprises two phases: an initial retrospective cross-sectional analysis for testing, followed by a real-world validation study (Fig. 1). Conducted between January 9 and March 5, 2025, at the Department of Joint Orthopedics, Guanghua Integrated Chinese and Western Medicine Hospital in Shanghai, China, the research received approval from the hospital’s Ethics Committee (2025-K-29) and adhered to the principles of the Declaration of Helsinki. All participants provided written informed consent before participation.
Clarifying the connection between the study’s two phases is essential. The first phase served as an initial screening of the four chatbot models, and its results were used exclusively to select candidates for further assessment. Phase 2 was executed independently; however, the selection of chatbots for this phase depended entirely on the results of Phase 1. Apart from the chatbot selection, the outcomes and implementation of Phase 2 were not influenced by specific findings from the first phase.
Phase 1 retrospective cross-sectional analysis
In the first phase of the study, we selected 120 questions related to rotator cuff injuries from the physician assessment question bank of Guanghua Integrated Chinese and Western Medicine Hospital in Shanghai (Table S1). These questions were input into the online interfaces of four large language models (LLMs), and their responses were compared against the standard answers to determine accuracy. Simultaneously, we invited 20 patients with rotator cuff injuries, 20 undergraduate medical students, and 20 postgraduate orthopedic surgery students to answer the same questions. The questions covered various topics, including the definition, etiology, symptoms and signs, imaging examinations, treatment methods, and postoperative rehabilitation and care of rotator cuff injuries. Additionally, we selected 20 questions from the clinical guidelines for rotator cuff injuries recommended by the American Academy of Orthopaedic Surgeons (AAOS) and the China Association of Chinese Medicine (CACM) [12, 13]. These 20 questions were input into the online interfaces of the four LLMs (Tables S2-S5), each repeated three times to evaluate potential response variations among the models.
Next, three senior orthopedic surgeons evaluated each response based on the aforementioned guidelines and their own clinical experience. Specifically, they graded the accuracy of the answers using a five-point Likert scale:
1 = completely incorrect.
2 = mostly incorrect.
3 = equally correct and incorrect.
4 = mostly correct.
5 = completely correct.
To minimize statistical bias, the final score for each response was determined by calculating the median score given by the three experts.
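As a simple illustration of this aggregation step (using hypothetical ratings rather than the study data, which are reported in Tables S2-S5), the final score is simply the middle value of the three surgeons' ratings:

```python
from statistics import median

# Hypothetical 5-point Likert ratings from the three surgeons for one chatbot response.
ratings_by_expert = [4, 5, 4]

# The median of the three ratings serves as the response's final score,
# so a single unusually high or low rating cannot shift the result.
final_score = median(ratings_by_expert)
print(final_score)  # -> 4
```

With only three raters, the median selects the middle rating, which makes it less sensitive to an outlying judgment than the mean.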
To keep the surgeons blinded to the identity of the LLM chatbots, all responses were converted into plain text. Each surgeon was randomly assigned one of the three chatbot-generated answers per question. The four LLMs’ responses were reviewed blindly on the basis of clinical expertise, with answers shuffled before evaluation. The evaluation occurred over four rounds on different dates, with a 48-hour washout period between sessions to reduce carryover effects [14]. To mitigate potential bias in readability assessments by medical professionals, an independent Chinese readability platform was used to objectively measure the reading difficulty of chatbot responses.
Phase 2: Real-world validation study
In the study’s second phase, 20 representative patients were recruited from the orthopedic outpatient clinic. They were randomly assigned to one of two LLM groups: ChatGPT-4o or Google Gemini, as these models performed best in the first phase. Each patient participated in a dedicated education session. Before chatbot interaction, a standardized prompt—“Please assist the orthopedic surgeon in educating the patient about rotator cuff injuries”—was used to establish context. Patients then posed three different questions about rotator cuff injuries to the chatbot. Following the interaction, they rated their satisfaction with the responses and assessed readability. Additionally, two orthopedic surgeons evaluated the chatbot-generated answers based on five key domains (Tables S2-S5).
Study population
Three orthopedic surgeons met the following inclusion criteria:
1) Serve as a senior attending physician in joint surgery with at least 8 years of clinical experience;
2) Native Mandarin speakers;
3) Experience in evaluating the quality of patient education materials and other informational resources.
Twenty patients met the following inclusion criteria:
1) Diagnosed with rotator cuff injury;
2) Aged 18 to 80 years, able to understand and participate in the study;
3) Native Mandarin speakers.
Exclusion criteria:
1) Presence of severe cognitive or language impairments that could affect communication;
2) Patients with severe psychiatric disorders or other conditions that might affect study participation.
Rotator cuff injury question bank
Phase One began with multiple-choice questions from the physician assessment bank at Guanghua Integrated Chinese and Western Medicine Hospital in Shanghai. It also included clinical questions on rotator cuff injuries from AAOS and CACM guidelines. To ensure relevance, a single-blind method was used, where three experts anonymously ranked the questions based on their importance to rotator cuff injury patients. In the end, 120 multiple-choice and 20 clinical questions were selected.
Expert evaluation of LLM responses across five domains
In the second part of Phase One and Phase Two of the study, three experts evaluated the LLM responses based on a five-point Likert scale across five domains: accuracy, completeness, readability, usefulness, and safety.
The rating system was as follows:
1 = Completely incorrect.
2 = More incorrect than correct.
3 = Equally incorrect and correct.
4 = More correct than incorrect.
5 = Completely correct.
Additionally, the median score from the three experts was determined as the final score to reduce statistical bias.
Objective readability analysis
Research indicates that chatbots in English environments can generate accurate responses, but these are often complex and require higher education to understand. To address linguistic differences between Chinese and English, as well as factors like patients’ educational backgrounds and physicians’ experience, we utilized the Chinese Readability Platform (http://120.27.70.114:8000/analysis_a) for readability assessment [15]. This evaluation included reading difficulty scores and recommended reading age, with higher scores indicating lower text comprehensibility.
Statistical analysis
Statistical analysis and plotting were performed using IBM SPSS 29.0.1.0 and Prism 10. For common questions, the Friedman test was used to compare evaluation scores and readability scores among the chatbots. Pairwise comparisons were conducted using the Bonferroni correction to adjust p-values. The Mann-Whitney U test was used to compare the average response scores between ChatGPT-4o and Gemini. A p-value < 0.05 was considered statistically significant.
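For illustration only, the comparisons described above can be sketched with standard non-parametric routines in Python; the snippet below uses SciPy counterparts of the SPSS procedures and randomly generated placeholder scores. The use of the Wilcoxon signed-rank test for the Bonferroni-corrected pairwise comparisons is an assumption, since the specific pairwise procedure is not detailed here.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Placeholder 5-point Likert scores: 20 clinical questions (rows) x 4 chatbots (columns).
scores = rng.integers(3, 6, size=(20, 4))

# Friedman test: do the four chatbots differ on the same set of questions?
chi2, p_friedman = stats.friedmanchisquare(*(scores[:, j] for j in range(scores.shape[1])))
print(f"Friedman chi-square = {chi2:.2f}, p = {p_friedman:.4f}")

# Bonferroni-corrected pairwise comparisons (here: Wilcoxon signed-rank on paired scores).
pairs = list(combinations(range(scores.shape[1]), 2))
alpha = 0.05 / len(pairs)  # Bonferroni-adjusted significance threshold
for i, j in pairs:
    _, p_pair = stats.wilcoxon(scores[:, i], scores[:, j])
    print(f"chatbot {i} vs chatbot {j}: p = {p_pair:.4f}, significant = {p_pair < alpha}")

# Mann-Whitney U test for two independent groups, as in the Phase 2
# ChatGPT-4o versus Gemini comparison (placeholder patient ratings).
ratings_gpt4o = rng.integers(3, 6, size=30)
ratings_gemini = rng.integers(2, 6, size=30)
u, p_u = stats.mannwhitneyu(ratings_gpt4o, ratings_gemini, alternative="two-sided")
print(f"Mann-Whitney U = {u:.1f}, p = {p_u:.4f}")
```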
Results
Phase one, part one: accuracy of chatbot responses on standardized rotator cuff injury questions
In the first part of Phase One, four LLM chatbots, along with patients, undergraduate medical students, and postgraduate orthopedic surgery students, answered a set of physician assessment questions related to rotator cuff injuries. Figure 2 presents the number of correct and incorrect responses for the four LLM chatbots, as well as the average performance of the patient, undergraduate, and postgraduate groups. Among the LLMs, Gemini demonstrated the highest accuracy, correctly answering 80 out of 120 questions. There was no significant difference in the number of correct answers among the four LLMs. In contrast, undergraduate students had the lowest number of correct responses, answering 65 questions correctly (Fig. 2a). When analyzing the accuracy rate across different question categories, Gemini exhibited the highest average accuracy, while all four LLM chatbots achieved an average accuracy rate above 65% (Fig. 2b). For reference, Document S1 includes the full set of 120 multiple-choice questions along with their correct answers.
Responses of the four LLM chatbots, rotator cuff injury patients, undergraduate students, and postgraduate orthopedic surgery students to the 120 test questions in the first part of Phase One. (a) The number of correct and incorrect answers provided by the four LLM chatbots, rotator cuff injury patients, undergraduate students, and postgraduate orthopedic surgery students. (b) The accuracy rates of the four LLM chatbots, rotator cuff injury patients, undergraduate students, and postgraduate orthopedic surgery students in answering the test questions
Phase one, part two: expert evaluation of chatbot performance across five domains
Figure 3 presents the average scores of the LLM chatbots across five key domains in addressing clinical questions related to rotator cuff injuries. Among the four evaluated LLMs, Gemini demonstrated the highest proficiency in responding to queries regarding rotator cuff injuries, achieving the following scores:
Accuracy: 4.70.
Completeness: 4.72.
Readability: 4.70.
Usefulness: 4.61.
Safety: 4.70.
Evaluation of the responses from the four LLM chatbots using a 5-point Likert scale in the second part of Phase One. (a) The assessment framework used to evaluate each response. (b)–(g) The average scores of the four LLM chatbots across five domains: accuracy, completeness, readability, usefulness, and safety in the first phase of the study. Friedman test and post hoc Dunnett test were used to assess the statistical significance of the observed differences. Data are presented as mean ± standard deviation
A post hoc Dunnett test confirmed that Gemini’s performance was significantly superior to that of the other LLMs (p < 0.05) (Fig. 3). A detailed breakdown of scores for each LLM chatbot across the five evaluation domains for individual questions is provided in Tables S2–S5.
Figure 4 displays paragraph-level readability statistics for the LLM chatbot responses. Among the models, ChatGPT-4o generated the most complex responses, with a reading difficulty score of 14.22 (post hoc Dunnett test, p < 0.05), suggesting a reading level of middle school or higher. This indicates that a higher education level is needed for full comprehension. Cumulative readability scores for each chatbot’s responses to individual questions are provided in Tables S2–S5.
Objective readability analysis of the responses generated by the four LLM chatbots in the second part of Phase One. (a) Recommended reading level, indicating the appropriate educational stage for understanding the responses. (b) Reading difficulty score, representing the comprehension difficulty level of the responses generated by the four LLM chatbots. Friedman test and post hoc Dunnett test were used to assess the statistical significance of the observed differences. Data are presented as mean ± standard deviation
Phase two: real-world validation study
In the second phase of the study, 20 representative patients with rotator cuff injuries posed questions to the two best-performing language models from Phase One, ChatGPT-4o and Gemini. Statistical analysis showed a significant difference between ChatGPT-4o and Gemini in patient satisfaction (4.52 vs. 3.76, p < 0.001), and ChatGPT-4o also received higher readability scores (4.35 vs. 4.23) (Fig. 5a). Orthopedic surgeons’ evaluations of ChatGPT-4o and Gemini across the five domains were as follows: accuracy, 4.40 vs. 3.70 (p < 0.01); completeness, 4.30 vs. 3.50 (p < 0.001); readability, 4.45 vs. 3.60 (p < 0.001); usefulness, 4.40 vs. 3.95 (p < 0.05); and safety, 4.65 vs. 4.20 (p < 0.01). ChatGPT-4o outperformed Gemini in every evaluated aspect (Fig. 5b and c). Table S5 provides a comprehensive breakdown of chatbot responses to patient inquiries and scores across the five assessment domains.
Evaluation of ChatGPT-4o and Gemini in addressing rotator cuff injury-related questions in the real-world assessment. (a) Patient satisfaction with the responses from the two LLM chatbots and the readability of responses in the real-world evaluation. (b) and (c) Average scores of ChatGPT-4o and Gemini across five domains: accuracy, completeness, readability, usefulness, and safety in the real-world assessment. A two-tailed t-test was used to assess the statistical significance of the observed differences. Data are presented as mean ± standard deviation
This study evaluated the performance of four large language models (ChatGPT-4o, ChatGPT-o1, Gemini, and ERNIE Bot) in responding to inquiries from patients with rotator cuff injuries through a two-phase benchmarking test. The results showed that while Gemini demonstrated outstanding accuracy in medical knowledge, ChatGPT-4o excelled in real-world patient interactions, particularly in engagement and patient satisfaction. Although all chatbots performed well in standardized tests, with accuracy rates exceeding 65%, ChatGPT-4o received higher ratings from doctors and greater patient satisfaction due to its advantages in readability, linguistic adaptability, and emotional resonance in actual patient consultations.
This research shows that the use of chatbots in the medical field demands not only medical proficiency but also strong linguistic expression, empathy, and emotional awareness. Further studies should build on existing knowledge about maximizing the value of LLM chatbots in actual clinical environments, particularly their capacity to meet the specific needs of patients and healthcare workers.
In conclusion, Gemini showed greater accuracy in medical knowledge; however, ChatGPT-4o’s strengths in clinical interactions position it better for real-world use. Chatbots are expected to play a growing role in diagnostic assistance, patient consultation, and health management.
Discussion
In this research, we conducted a two-step assessment of ChatGPT-4o, ChatGPT-o1, Gemini, and ERNIE Bot’s responses to queries from Chinese patients concerning rotator cuff injury. Our results are consistent with prior findings that LLMs demonstrate impressive accuracy in medical question-answering and provide appropriate information to doctors and patients [16]. Other studies have also highlighted the weaknesses of LLMs in natural emotional and linguistic interactions with patients [17]. This work builds on those conclusions by linking differences in LLM performance across knowledge testing and patient interaction. Based on our findings, we suggest that LLMs could greatly aid in managing rotator cuff injuries. They can serve as reliable sources for patient education, covering disease understanding, treatment options, and rehabilitation.
In the first phase, we selected 120 multiple-choice questions on rotator cuff injuries and tested four LLM chatbots, as well as patients, undergraduate medical students, and postgraduate orthopedic students. The questions covered six areas: definition, etiology, symptoms, imaging, treatment, and rehabilitation. Gemini answered the most questions correctly, while undergraduate students performed the worst. The definition section had the highest accuracy, and all sections except etiology scored over 60% on average.
Gemini had the highest accuracy among LLMs, all of which exceeded 65%. Surprisingly, patients outperformed medical students and even ChatGPT-4o and ChatGPT-o1. This suggests that patients, having firsthand experience, gain deeper knowledge after diagnosis. Overall, all LLMs performed well in rotator cuff pathology, with Gemini scoring the highest.
In Phase One’s second part, we took 20 rotator cuff injury-related questions from clinical guidelines and asked them to four LLM chatbots. Responses were analyzed by three orthopedic surgeons across five parameters: accuracy, completeness, readability, usefulness, and safety. Among the four models, the best performer in answering clinical questions was ChatGPT-4o, whereas ChatGPT-o1 was second best. In general, the assessments obtained from the orthopedic surgeons suggest that, indeed, all four LLM chatbots have the potential to solve basic clinical problems pertaining to rotator cuff injuries comprehensively and accurately.
To evaluate the chatbot responses, we also performed an automated readability assessment. The findings show that ChatGPT-4o had the highest reading difficulty score, meaning its responses were the most difficult to interpret. Gemini achieved the highest accuracy in knowledge-based Q&A, which may be due to its model structure and training corpus. As Google’s latest-generation LLM, Gemini has strong language understanding and generation abilities compared with other LLMs and has access to a wide breadth of medical knowledge [18]. This means that Gemini can capture the context of medical queries more accurately and provide better answers. On the other hand, while Gemini excelled in knowledge-based Q&A, ChatGPT-4o demonstrated superior performance in interactions with patients. This highlights the strengths of ChatGPT-4o in interacting seamlessly with patients, adapting to them, and engaging actively, all while ensuring high readability.
To assess real-world usability, 20 patients evaluated the top two chatbots from Phase One, Gemini and ChatGPT-4o. Both effectively addressed patient concerns, with responses rated as accurate, useful, and safe. ChatGPT-4o had an advantage in response comprehension, likely because it outperforms the competition in natural language processing [19]. ChatGPT-4o also tends to show more advanced semantic understanding and context-based adaptability, along with fluent and natural language generation. It is also more sensitive to informal patient language and gives greater priority to emotional aspects [20]. ChatGPT-4o’s empathetic tone and interactive style enhance patient comprehension and satisfaction. Its patient-centered responses, higher readability, and emotional support make interactions more engaging, explaining why patients preferred it over Gemini.
This study found that standardized tests alone cannot fully evaluate LLMs, as their results may not reflect real clinical performance. These tests are structured and closed-ended, mainly assessing knowledge. However, real patient inquiries are unique, varied, and often emotionally charged, making patient interaction more complex. Even when responses are accurate, excessive medical jargon can reduce patient satisfaction. This highlights the need to assess both the chatbot’s medical accuracy and its ability to communicate effectively with patients. Specifically, while accurate medical information is crucial, responses filled with complex medical terminology or overly technical explanations may confuse patients, potentially reducing adherence to treatment plans and diminishing trust. On the other hand, overly simplified or highly empathetic interactions, although comforting, might omit important medical specifics required for informed decisions. Therefore, effective clinical communication via LLMs must harmonize precision and empathy. Practical strategies include training LLMs to employ patient-centered, easily understandable language, proactively clarifying complex medical terms, and maintaining a compassionate tone that validates patient concerns without sacrificing essential medical accuracy. Future advancements could integrate adaptive learning algorithms that dynamically adjust response detail and emotional tone based on real-time patient interactions or demographic factors, thus optimizing patient satisfaction and enhancing the overall quality of clinical education.
This study has several limitations. First, the relatively small sample size of 20 patients involved in the real-world validation significantly limits the statistical power and generalizability of our findings. Future studies with larger, multi-center cohorts would further validate these initial results and enhance their applicability across diverse populations. Secondly, the study subjects were only Chinese people suffering from rotator cuff injuries. Further studies may broaden the scope to include subjects with varying age, culture, and education to achieve better generalization. Additionally, since all participants were native Chinese speakers, our readability assessment was specifically tailored for the Chinese language. Linguistic and cultural nuances inherent to Mandarin could significantly influence chatbot performance, potentially affecting patients’ perception and understanding of provided information. For instance, cultural expectations regarding medical authority, patient-physician interactions, and communication styles might differ substantially from Western contexts. Thus, it is essential for future research to examine chatbot efficacy across diverse linguistic and cultural populations, to enhance generalizability and ensure these technologies can be effectively adapted globally. Moreover, this study was mainly concerned with the performance of LLMs in knowledge-based Q&As and patient interviews, rather than their function in the physician’s diagnosis and treatment decisions.
We have considered several issues in using LLMs in medicine. First, LLMs do not guarantee accuracy. Their output must be verified and sometimes overridden by physicians to be clinically useful. Second, LLMs struggle with complex cases. While they handle standard questions well, they often fail in individualized cases involving medical history, changing conditions, and patient preferences. In the future, LLMs could adapt their responses based on a patient’s education and language preference through personalized learning. Additionally, their use raises concerns about data privacy and ethics. Future research should focus on ensuring ethical compliance and maximizing patient benefits.
Data availability
All the data are included herein (main text and supplementary section). The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.
References
Jiang X, Zhang H, Wu Q, Chen Y, Jiang T. Comparison of three common shoulder injections for rotator cuff tears: A systematic review and network meta-analysis. J Orthop Surg Res. 2023;18:272. https://doi.org/10.1186/s13018-023-03747-z.
Guo M, Wang W, Li M. Compressed sensing magnetic resonance imaging (CS-MRI) diagnosis of rotator cuff tears. Am J Transl Res. 2024;16:147–54. https://doi.org/10.62347/XRAF6615.
Ito Y, Ishida T, Matsumoto H, Yamaguchi S, Ito H, Suenaga N, et al. Factors associated with subjective shoulder function preoperatively and postoperatively after arthroscopic rotator cuff repair. Jses Int. 2024;8:1207–14. https://doi.org/10.1016/j.jseint.2024.07.008.
Stem cell treatment for regeneration of the rotator cuff: study protocol for a prospective single-center randomized controlled trial (lipo-cuff). Trials [Internet]. [cited 2025 Mar 2]. Available from: https://doi.org/10.1186/s13063-024-08557-0
Kim DM, Seo J-S, Jeon I-H, Cho C, Koh KH. Detection of rotator cuff tears by ultrasound: how many scans do novices need to be competent? Clin Orthop Surg. 2021;13:513–9. https://doi.org/10.4055/cios20259.
Wang X, Sanders HM, Liu Y, Seang K, Tran BX, Atanasov AG, et al. ChatGPT: promise and challenges for deployment in low- and middle-income countries. Lancet Reg Health - West Pac. 2023;41:100905. https://doi.org/10.1016/j.lanwpc.2023.100905.
Zhang C, Wu J, Li X, Wang Z, Lu WW, Wong T-M. Current biological strategies to enhance surgical treatment for rotator cuff repair. Front Bioeng Biotechnol. 2021;9:657584. https://doi.org/10.3389/fbioe.2021.657584.
Artificial intelligence meets medical robotics. Science [Internet]. [cited 2025 Mar 2]. Available from: https://www.science.org/doi/10.1126/science.adj3312
Huang J, Lin R, Bai N, Su Z, Zhu M, Li H, et al. Six-month follow-up after recovery of COVID-19 delta variant survivors via CT-based deep learning. Front Med (Lausanne). 2023;10:1103559. https://doi.org/10.3389/fmed.2023.1103559.
Weber MT, Noll R, Marchl A, Facchinello C, Grünewaldt A, Hügel C, et al. MedBot vs RealDoc: efficacy of large language modeling in physician-patient communication for rare diseases. J Am Med Inform Assoc. 2025;ocaf034. https://doi.org/10.1093/jamia/ocaf034.
Large language model framework for literature-based disease–gene association prediction. Brief Bioinform [Internet]. [cited 2025 Mar 2]. Available from: https://academic.oup.com/bib/article/26/1/bbaf070/8042066?login=false
Weber S, Chahal J. Management of rotator cuff injuries. J Am Acad Orthop Surg. 2020;28:e193. https://doi.org/10.5435/JAAOS-D-19-00463.
Guideline for the integrated traditional Chinese and Western medicine diagnosis and treatment of rotator cuff injury (2023 edition) [肩袖损伤中西医结合诊疗指南(2023年版)] [Internet]. [cited 2025 Mar 2]. Available from: https://kns.cnki.net/nzkhtml/xmlRead/trialRead.html?dbCode=CJFD%26tableName=CJFDTOTAL%26fileName=ZYZG202401001%26fileSourceType=1%26invoice=N9yU7WRVyJ%252fr9tsenFmTaCMBEdsc46QhRXbN968QCZIJa4JP7FyhpblMWNDBBuKu1%252f1PcwdrxdOJZdmtTV8AquVtfQsflmfjRmGS4SryJHNjiHeLeALhk65XVe0R7h4fd7zRuIBBMBBOH29WhV0Zn1dwwkemG2DNw0TKJbZ2Who%253d%26appId=KNS_BASIC_PSMC
Lim ZW, Pushpanathan K, Yew SME, Lai Y, Sun C-H, Lam JSH, et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770. https://doi.org/10.1016/j.ebiom.2023.104770.
Zhu S, Zhang M, Guo D. Automatic prediction of text readability for international Chinese language education. In: Proceedings of the 2024 International Conference on Innovation in Artificial Intelligence (ICIAI ’24). New York, NY, USA: Association for Computing Machinery; 2024 [cited 2025 Mar 2]. pp. 65–71. Available from: https://doi.org/10.1145/3655497.3655525
Sohn E. The reproducibility issues that haunt health-care AI. Nature. 2023;613:402–3. https://doi.org/10.1038/d41586-023-00023-2.
Denecke K, May R, LLMHealthGroup, Romero OR. Potential of large language models in health care: Delphi study. J Med Internet Res. 2024;26:e52399. https://doi.org/10.2196/52399.
Comment on: Benchmarking the performance of large language models in uveitis: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, Google Gemini, and Anthropic Claude3. Eye [Internet]. [cited 2025 Mar 2]. Available from: https://www.nature.com/articles/s41433-025-03736-y
Comparing new tools of artificial intelligence to the authentic intelligence of our global health students. BioData Mining [Internet]. [cited 2025 Mar 2]. Available from: https://doi.org/10.1186/s13040-024-00408-7
Pordzik J, Bahr-Hamm K, Huppertz T, Gouveris H, Seifen C, Blaikie A, et al. Patient support in obstructive sleep apnoea by a large language model – ChatGPT 4o on answering frequently asked questions on first line positive airway pressure and second line hypoglossal nerve stimulation therapy: a pilot study. Nat Sci Sleep. 2024;16:2269–77. https://doi.org/10.2147/NSS.S495654.
Acknowledgements
None.
Funding
Scientific and Technological Innovation Action Plan Medical Innovation Research Project of Shanghai Science and Technology Committee, Shanghai 21Y11911400. Shanghai Changning District Medical Master and Doctoral Innovation Talent Base Project (RCJD2022S04).
Author information
Contributions
WYL was responsible for the conception and the acquisition of the literature for the manuscript. WYL wrote the original draft of the manuscript. TLC and ZQ prepared figures. HY, NZX and ZJC provided the initial assessment score and supervised the implementation of the project. TXY, WWR, MJY and DDF reviewed and edited. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Ethics approval and consent to participate
The study received approval from the Ethics Committee of Guanghua Integrated Chinese and Western Medicine Hospital in Shanghai, China (2025-K-29) and adhered to the principles of the Declaration of Helsinki. All participants provided written informed consent before participation.
Consent for publication
Not Applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.