Introduction

Multidisciplinary treatment (MDT) is a collaborative approach that involves specialists from various fields to manage complex medical conditions. MDT can integrate diverse expertise and improve communication among healthcare professionals, leading to better-coordinated care [1].

MDT first emerged in the field of oncology [2], where it has been shown to enhance clinical team competence, promote patient safety, and shorten the time from diagnosis to treatment, thereby improving overall treatment efficiency [3, 4]. This collaborative approach is now also widely applied in other complex medical scenarios, such as critical care medicine. Studies have suggested that MDT in the intensive care unit (ICU) decreases the length of stay and mortality rates [5], lowers the Sequential Organ Failure Assessment (SOFA) scores in infected patients [6], and optimizes antimicrobial treatments [7].

Effective MDT requires ample resources and time for high-quality communication and decision-making. However, limited resources and tight schedules [8], combined with varying professional cultures and practices [9], often lead to rushed or suboptimal decisions. Variability in MDT meetings, due to a lack of standardized protocols for patient assessment and decision-making, leads to inconsistent care quality and patient outcomes [1]. Frequent communication breakdowns, especially when team members are not co-located or insufficient time is allocated for discussions, result in incomplete or inaccurate information sharing, further impairing decision-making [1, 8].

The advent of artificial intelligence (AI) has brought transformative changes to the field of medicine [10]. Among the most promising AI advancements are large language models (LLMs), which have demonstrated remarkable capabilities in processing and understanding complex medical information [11]. ChatGPT [12], developed by OpenAI, is one of the most widely used general-purpose LLMs. Trained on an extensive corpus of general-domain data, it demonstrates strong performance in open-domain question answering and text generation. Prior studies have shown that it also performs reasonably well in responding to medical questions [13, 14] and assisting with diagnosis and treatment [15] across various disciplines, and it has shown potential in clinical case analysis and disease prediction [16]. However, the extent to which LLMs can replicate or complement the nuanced judgment and collaborative dynamics of human doctors in MDT settings remains insufficiently studied.

This study aimed to assess the capabilities of ChatGPT in supporting clinical MDT decision-making, given its accessibility and widespread adoption in clinical and academic contexts. By comparing the MDT recommendations provided by ChatGPT with those of physicians, this study sought to evaluate the strengths and limitations of this general-purpose model and explore its potential utility in real-world clinical scenarios, thereby offering preliminary insights into the potential role of LLMs in future multidisciplinary medical decision-making processes.

Methods

Setting and population

This retrospective study included all adult patients admitted to the ICU of Shanghai East Hospital who underwent MDT consultations between January 1, 2023, and December 31, 2023. For patients who had multiple MDT consultations, only the earliest MDT record was included. MDTs conducted solely for the assessment of surgical indications were excluded. Comprehensive clinical data—including medical history, treatment details, laboratory results, and imaging reports in textual form—were extracted from the electronic medical record. All data were fully de-identified to ensure patient privacy. Identical clinical summaries were used as input for both ChatGPT and the MDT physicians. For each patient, we gathered MDT consultation records, detailing the participating departments, the individual physicians involved, and the specific opinions provided by each physician. This study received ethical approval (Institutional Review Board approval number: [2024YS-82]). Informed consent was not required.

Model setting

ChatGPT, developed by OpenAI, is a state-of-the-art LLM based on the transformer architecture. It processes and generates human-like text by leveraging extensive pre-training on diverse datasets. This study used the most advanced version available, ChatGPT-4o, in its default configuration without any additional fine-tuning or domain-specific training, in order to evaluate its baseline performance as a general-purpose model. This minimal-input, no-tuning setup was chosen to provide a transparent and generalizable assessment of ChatGPT’s capabilities in real-world clinical contexts. While such a design imposes constraints on performance, it serves as a reference point for future research involving prompt engineering or domain-specific adaptation.

Each patient’s case was presented to the model in a new dialogue session to avoid the influence of any prior information on the model’s output. Because the collected text data were written in Chinese, one physician translated all relevant texts into English. The translations were then reviewed by an expert physician with 10 years of clinical experience. All text provided to ChatGPT was in English.

A standardized prompt was used in all sessions to instruct the model to wait until all necessary information was provided before performing any analysis. Subsequently, we sequentially provided the patient’s medical history and examination results and then requested the model to generate specific MDT recommendations. Notably, the prompt did not specify particular departments, nor did it explicitly ask for diagnoses or treatment suggestions; instead, it uniformly requested “MDT opinions.” The full prompt is provided in Supplementary Material 1. The study flowchart is provided in Supplementary Material 2.
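
For illustration only, the following minimal sketch shows how this per-case, single-session protocol could be reproduced programmatically, assuming access to OpenAI's chat completions API rather than the interface used in the study; the prompt text, function name, and message structure are placeholders and not the study's actual materials (the full prompt appears in Supplementary Material 1).

```python
from openai import OpenAI

client = OpenAI()  # assumes an API key is configured in the environment

# Placeholder instruction; the study's full standardized prompt is in Supplementary Material 1.
STANDARD_PROMPT = (
    "You will receive a patient's medical history and examination results. "
    "Wait until all information has been provided before performing any analysis."
)

def get_mdt_opinion(history: str, exam_results: str) -> str:
    """Run one patient case in a fresh conversation so no prior case influences the output."""
    # The study supplied the information sequentially; here the turns are collapsed
    # into a single request for brevity.
    messages = [
        {"role": "user", "content": STANDARD_PROMPT},
        {"role": "user", "content": f"Medical history:\n{history}"},
        {"role": "user", "content": f"Examination results:\n{exam_results}"},
        {"role": "user", "content": "Please provide MDT opinions for this patient."},
    ]
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```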

Data evaluation

This study compared the MDT recommendations generated by ChatGPT with those recorded from actual clinical MDT discussions, across five evaluation dimensions: comprehensiveness, accuracy, feasibility, safety, and efficiency. Each dimension was assessed using two predefined criteria, resulting in a total of 10 evaluation items (Table 1). These criteria were designed in advance to reflect key attributes of high-quality MDT decision-making.

For each patient case, both the ChatGPT-generated response and the physician-generated MDT record were evaluated independently. The two responses were presented side by side in anonymized form, with their order randomized to reduce potential recognition based on content style. Two clinical evaluators (attending physicians, each with over five years of clinical experience and prior involvement in multidisciplinary clinical decision-making) rated each response separately using a 5-point Likert scale for all 10 items, where 5 indicated “very consistent with the evaluation criterion” and 1 indicated “very inconsistent.” Each response therefore received 10 individual scores, yielding a total score ranging from 10 to 50. These total scores were computed separately for each case and each source, and were used for statistical comparison between the two groups. In this context, the term “consistency” refers to the degree to which each response—whether from ChatGPT or MDT physicians—aligned with the specific content and expectations of each evaluation item.

The evaluations were conducted in a blinded manner: raters were unaware of the source of each response and completed the full set of ratings for one case before proceeding to the next. No explicit comparison between the two responses was requested during the rating process.
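
As a concrete illustration of this scoring scheme, the sketch below aggregates item-level Likert ratings into per-case totals. The column names are hypothetical, and it assumes the two raters' item scores are averaged before summation, which the study does not state explicitly.

```python
import pandas as pd

# Hypothetical long-format ratings: one row per (case, source, rater),
# with ten 5-point Likert items Q1-Q10. The values shown are illustrative.
ratings = pd.DataFrame({
    "case_id": [1, 1, 1, 1],
    "source":  ["ChatGPT", "ChatGPT", "Physicians", "Physicians"],
    "rater":   ["A", "B", "A", "B"],
    **{f"Q{i}": [5, 4, 4, 5] for i in range(1, 11)},
})

item_cols = [f"Q{i}" for i in range(1, 11)]

# Average the two raters' scores per item, then sum the 10 items to obtain
# one total per case and source, ranging from 10 to 50.
per_item = ratings.groupby(["case_id", "source"])[item_cols].mean()
totals = per_item.sum(axis=1).rename("total_score").reset_index()
print(totals)
```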

Table 1 Evaluation criteria and questions for MDT recommendations

Statistical methods

The data were compiled and summarized in Microsoft Excel 2021, version 2403 (Microsoft Corp., Redmond, WA, USA), with calculations of the mean, median, quartiles, and standard deviation. Statistical analyses were subsequently conducted using IBM SPSS Statistics, version 25 (International Business Machines Corp., Armonk, NY, USA). The normality of the data was assessed with the Kolmogorov–Smirnov test. For data not meeting normality assumptions, the chi-square test and the Mann–Whitney U test were employed to compare the two groups. The significance threshold was set at p < 0.05. OriginPro, version 2024 (OriginLab Corp., Northampton, MA, USA) was used for data visualization.
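
To make the analysis pipeline concrete, here is a minimal sketch of the comparisons described above using SciPy rather than SPSS; the score vectors and contingency counts are illustrative placeholders, not the study's data.

```python
import numpy as np
from scipy import stats

# Illustrative per-case total scores (10-50); the study compared 64 cases per group.
physician_scores = np.array([44, 41, 46, 39, 43, 45, 40, 42])
chatgpt_scores   = np.array([40, 38, 45, 36, 42, 41, 37, 39])

# Normality check (the study used the Kolmogorov-Smirnov test).
for name, scores in [("Physicians", physician_scores), ("ChatGPT", chatgpt_scores)]:
    stat, p = stats.kstest(stats.zscore(scores), "norm")
    print(f"{name}: KS statistic = {stat:.3f}, p = {p:.3f}")

# Non-parametric comparison of the two groups' total scores.
u_stat, p_u = stats.mannwhitneyu(physician_scores, chatgpt_scores, alternative="two-sided")
print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_u:.3f}")

# Chi-square test for comparing rating counts (e.g. "consistent" or higher vs. lower);
# the 2x2 table below is illustrative.
table = np.array([[520, 120], [480, 160]])
chi2, p_chi, dof, _ = stats.chi2_contingency(table)
print(f"Chi-square = {chi2:.2f}, dof = {dof}, p = {p_chi:.3f}")
```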

Results

Overview

A total of 64 patient cases were included in this study. The median score for MDT recommendations provided by ChatGPT was 41.0 out of 50, significantly lower than the median score of 43.5 for the recommendations given by the MDT physicians (p = 0.001) (Table 2). Figure 1 illustrates the distribution of total MDT scores for both sources. Across all questions, the physicians’ recommendations received significantly more ratings of “consistent” or higher than ChatGPT’s (529 vs. 493, p = 0.030) and significantly fewer “very inconsistent” ratings (2 vs. 12, p = 0.012).

Table 2 Descriptive statistics of physicians’ and ChatGPT’s assessments
Fig. 1

Distribution of total scores for physicians and ChatGPT. Scores were derived from 64 patient cases. Two independent evaluators assessed the responses provided by MDT physicians and ChatGPT (**p < 0.01)

Performance analysis across 5 aspects

In the comparison across the 5 aspects, ChatGPT and the MDT physicians each demonstrated different strengths. ChatGPT’s responses were rated “very consistent” in comprehensiveness in 86.7% of cases, compared with 65.6% for the physicians; in contrast, the physicians achieved 75.0% “very consistent” ratings in feasibility compared with ChatGPT’s 53.9% (Fig. 2). ChatGPT significantly outperformed the physicians in the comprehensiveness of the recommendations (median [IQR]: 5.0 [4.6–5.0] vs. 4.5 [3.5–5.0], p < 0.001). The physicians’ recommendations showed statistically significant superiority over ChatGPT’s in terms of accuracy (median [IQR]: 4.5 [4.0–5.0] vs. 4.0 [3.0–4.5], p < 0.001), feasibility (median [IQR]: 5.0 [4.5–5.0] vs. 4.5 [3.5–4.5], p < 0.001), and efficiency (median [IQR]: 4.5 [3.9–5.0] vs. 4.3 [3.5–4.5], p = 0.016) (Fig. 3). No significant difference was observed in the safety domain (median [IQR]: 4.0 [3.5–4.5] vs. 4.0 [3.5–4.5], p = 0.613). A detailed comparison is provided in Supplementary Material 3, Table 1.

Fig. 2

Consistency levels of MDT physicians vs. ChatGPT. The stacked bar chart illustrates the distribution of consistency levels between the generated responses and the evaluation criteria across 5 aspects

Fig. 3

Comparison of MDT physicians and ChatGPT performance across 5 aspects (*p < 0.05, ***p < 0.001)

Performance analysis across 10 questions

We evaluated 10 specific questions across the 5 key dimensions of MDT quality: comprehensiveness (Q1–Q2), accuracy (Q3–Q4), feasibility (Q5–Q6), safety (Q7–Q8), and efficiency (Q9–Q10). In these item-level assessments, ChatGPT performed significantly better than the MDT physicians on Q2 (4.59 vs. 3.94, p < 0.001). However, ChatGPT scored lowest on Q3, significantly below the physicians (3.44 vs. 4.53, p < 0.001). The physicians’ responses received the highest average score on Q9 (4.88) and the lowest on Q10 (3.72). Notably, the physicians’ responses on Q3, Q6, and Q9 differed significantly from ChatGPT’s (p < 0.001) (Fig. 4). A detailed comparison is provided in Supplementary Material 3, Table 2.

Fig. 4

Comparison of MDT physicians and ChatGPT performance across 10 questions (***p < 0.001)

Discussion

Major findings

This study explored the efficacy of ChatGPT in generating MDT recommendations for patients in the ICU setting. The findings revealed both strengths and limitations of using ChatGPT in a clinical context compared to traditional physician-led MDTs.

One of the notable strengths of ChatGPT was its superior performance in the comprehensiveness of MDT recommendations. ChatGPT’s ability to incorporate all relevant specialties and provide detailed opinions from each discipline was highlighted by the significantly higher ratings it received in this domain. Importantly, this result was not driven by a prompt specifically requesting comprehensive output—only a general instruction to generate MDT recommendations was provided. The comprehensiveness likely stems from ChatGPT’s inherent response tendencies, such as its structured, generalized approach to answering questions and its capacity to simulate textbook-style medical reasoning. These features allow the model to consistently address multiple specialties without being constrained by the time or cognitive limitations that may affect human clinicians. However, it is important to interpret this comprehensiveness in the context of other dimensions—such as accuracy and feasibility—when evaluating ChatGPT’s overall clinical utility.

However, despite its strengths in comprehensiveness, ChatGPT’s performance in accuracy was significantly lower than that of the MDT physicians. Physicians demonstrated a remarkable ability to infer potential underlying causes based on presenting symptoms and test results. This discrepancy underscores the importance of clinical experience and nuanced understanding of patient-specific factors, which the model lacks. This phenomenon has been noted in previous studies as well [17]. In addition, physicians, with their direct patient interactions, have the advantage of conducting real-time observations and interactions with patients. They can access more extensive imaging data and perform hands-on examinations to gather additional information.

Besides its shortcomings in accuracy, ChatGPT also lagged behind physicians in the feasibility of its recommendations. Physicians, with their extensive clinical experience, demonstrated superior ability in providing feasible recommendations that could be effectively integrated into patient care, navigating potential conflicts among different specialties, and balancing comprehensive care with practical constraints. While ChatGPT’s suggestions were often comprehensive, they lacked the practical applicability and nuanced understanding necessary for seamless clinical implementation. Additionally, the phenomenon of hallucinations in large language models is a critical factor to consider. Some studies have shown that LLMs can provide high-fidelity summaries of the information they receive, but errors can still occur [18], which further compromises the feasibility of ChatGPT’s recommendations. Equally important is the fact that ChatGPT, like many other LLMs, is not trained specifically on medical datasets but on a broad range of data. This general training raises concerns about the reliability of the generated results when applied in the medical field [19].

In terms of efficiency, ChatGPT’s overall score was lower than that of the physicians. This discrepancy may be attributed to ChatGPT’s tendency to include a broad range of specialties in the MDT, encompassing departments that may not be immediately relevant to the patient’s primary issues. However, this study found that, from a resource optimization perspective, ChatGPT scored higher than the physicians, though this difference was not statistically significant. Upon reviewing both the scores for Question 10 (“Do the recommendations optimize resource use?”) and the content of the responses, we found that physicians were more inclined to recommend additional or higher-cost procedures—often aimed at achieving more accurate diagnoses or pursuing advanced treatment options. In contrast, ChatGPT suggested fewer such extensive and expensive tests, potentially indicating a more conservative approach to resource utilization. This finding suggests that, while ChatGPT’s recommendations may lack some of the nuanced applicability seen in physician-led decisions, its approach could contribute to more efficient use of medical resources by minimizing unnecessary diagnostic interventions.

Regarding safety, the scores for ChatGPT and the physicians did not show a statistically significant difference. In our study, safety was assessed independently of diagnostic accuracy, based on two specific criteria: whether the diagnostic and treatment plans were minimally invasive, and whether potential risks and side effects were addressed with preventive measures. Although ChatGPT demonstrated lower accuracy, its recommendations were generally conservative and avoided high-risk or invasive interventions, which may explain the comparable safety scores. It is important to clarify that this evaluation reflects a narrowly defined aspect of safety within the scoring framework and should not be interpreted as suggesting that the overall use of ChatGPT in MDT settings is safe. We have highlighted this distinction to avoid potential overgeneralization.

Considerations and challenges for LLMs in healthcare

Based on their extensive training data and exceptional language generation capabilities, LLMs like ChatGPT are increasingly being explored for applications in the medical field. Studies have demonstrated the potential of LLMs across various disciplines, suggesting promising applications in healthcare. However, both research and practical use have highlighted concerning limitations of LLMs. One significant issue is the variability of their output, which is incompatible with the precision required in medical practice. Additionally, the phenomenon of hallucinations in LLMs can result in the generation of convincing but incorrect information. Moreover, challenges related to data usage and privacy protection, as well as inherent biases in the generated results [20, 21], pose significant obstacles to the seamless integration of LLMs into clinical activities.

The results of this study reflect several of these challenges, particularly the lower accuracy and feasibility observed in ChatGPT’s responses compared to physicians. While the model demonstrated strong comprehensiveness, this alone is not sufficient to determine clinical usefulness. Therefore, any evaluation of LLMs in healthcare must be multidimensional and approached with appropriate caution. In addition to the technical limitations of LLMs, their application in clinical decision-making is further constrained by the inherent complexity of real-world medical contexts. Clinical decisions often require the integration of patient-specific factors beyond what is captured in structured records, including institutional capabilities, socioeconomic conditions, cultural values, and post-discharge support systems. These dimensions, essential to effective multidisciplinary care, remain beyond the scope of current LLMs. Furthermore, the quality of LLM-generated recommendations is highly dependent on the input prompts, raising concerns about the consistency and safety of outputs, particularly when used without expert oversight. These considerations reinforce the view that LLMs should serve as supplementary tools rather than autonomous agents in clinical workflows.

To harness the full potential of LLMs in the medical field, future efforts could focus on specialized training using comprehensive medical datasets to create medical-LLMs, thereby improving accuracy and relevance in clinical applications. Google has developed Med-Gemini, a medical-specific LLM that achieved the best performance on 10 out of 14 healthcare benchmarks [22]. Additionally, integrating LLMs with clinical workflows, ensuring rigorous validation and human oversight, and addressing ethical and privacy considerations are crucial. Enhancing algorithms to reduce variability and hallucinations, and fostering interdisciplinary collaboration will further facilitate the effective and safe use of LLMs in healthcare. These steps will significantly enhance the accuracy, feasibility, and safety of LLM applications in medicine.

Limitations

This study has several limitations. To avoid response bias from multi-turn interactions, we instructed ChatGPT to review each patient’s medical history and provide a single, integrated MDT recommendation, without prompting it for further elaboration or requesting more accurate or complete responses. While this approach simulates how an untrained clinician might interact with such models in real-world settings, it may have led to an underestimation of ChatGPT’s full capabilities. Additionally, our study adopted a retrospective design, in which physician MDT recommendations were made through real-time, face-to-face consultations with patients. Despite our efforts to provide ChatGPT with comprehensive clinical information, it still faced an inherent information gap. A prospective study could help ensure consistent informational access between ChatGPT and physicians. Furthermore, the ten evaluation questions used to assess both ChatGPT and the physicians were self-designed and may not fully capture the breadth of clinical reasoning. Each of the five dimensions was assessed using only two items, which could introduce bias or narrow interpretations. Although raters were blinded to the source of each response, we acknowledge that the distinctive linguistic style of ChatGPT may have allowed them to infer its origin, potentially introducing subtle bias in the evaluation. Future studies should consider more standardized and quantifiable evaluation frameworks to enable more comprehensive and objective comparisons.

Conclusion

LLMs demonstrate broad potential applications across various medical scenarios. In this study, ChatGPT was able to respond to complex MDT cases and analyze current patient issues, providing comprehensive suggestions for MDT participation. It showed potential in assisting physicians in decision-making to some extent. However, it is undeniable that current clinical applications of LLMs still face challenges related to accuracy and feasibility. Further research is needed to assess the capabilities and limitations of LLMs in different clinical settings. Carefully leveraging LLMs and critically evaluating their recommendations will be necessary to improve their reliability for clinical use.