Introduction

Artificial intelligence (AI) is poised to play a pivotal role in reshaping healthcare delivery and advancing patient outcomes. A significant role for AI in healthcare is the provision of medical information, facilitating patient education in a personalized and interactive manner [1]. Large Language Models (LLMs), a subset of AI programs, offer substantial potential for enhancing preanesthetic patient education and communication [2]. LLMs are trained on massive datasets, which renders them capable of generating coherent, realistic, human-like text responses to patient-entered text prompts [3, 4]. Among available LLM interfaces, deep learning tools with natural language processing, such as ChatGPT (OpenAI, San Francisco, CA, USA) and Google Gemini (Google AI, California, USA), have garnered significant interest and have already been evaluated for their ability to answer patient-related questions accurately in fields such as ophthalmology, otolaryngology, neurosurgery, cardiology, and orthopedics, and even in medical licensing examinations [5,6,7,8,9].

Recent research indicates that inadequate patient counselling regarding anesthesia, surgery, and the postoperative course, together with negative beliefs and thoughts, is strongly associated with an increased risk of psychiatric disorders. Patients require comprehensive information about the type of anesthesia, along with personalized details relevant to their condition and concerns, which helps dispel misconceptions, provides greater reassurance, and ensures compliance with preoperative instructions [10,11,12]. A previous study reported that specialists in anesthesia care found ChatGPT-4-generated responses to common preoperative questions comparable to the information available on standard academic websites [5]. However, LLMs can sometimes generate inaccurate or misleading answers, known as “hallucinations,” which can undermine their reliability [4]. It remains unclear whether ChatGPT or Google Gemini can reliably understand and respond to questions about standard anesthesia care in real clinical settings, which typically demand extensive training and expertise from anesthetists. We therefore conducted this study to compare the feasibility and utility of ChatGPT and Google Gemini in preanesthetic counselling and patient education. The primary objective was to compare the quality of the content generated by the two models; secondary objectives included comparisons of readability and sentiment analysis of the responses.

Methodology

This prospective observational study was conducted at a tertiary care hospital from September to November 2024 after approval by the Institutional Research Committee (AIMS/IRC-study 12/2024). As the study did not involve direct patient participation, formal ethics committee approval was not required, and a waiver was obtained (AIMS/IEC-HR/2025/54). The research was carried out in accordance with ethical standards, with strict confidentiality and anonymity maintained. Consent of participating anesthesiologists was implied, and a detailed participant information sheet was provided (Supplementary File 1).

Selection of questions

A confidential, web-based questionnaire was distributed to anesthesiologists at the primary tertiary academic institution to collect commonly encountered preanesthetic questions from patients scheduled for laparoscopic cholecystectomy. A total of 22 anesthesiologists participated, submitting 68 unique questions. These were stratified into the following major domains: preoperative (evaluation, risk stratification, choice of anesthesia, preoperative instructions), intraoperative (management of awareness and pain, and expected complications), and postoperative (resumption of dietary intake, ambulation, complications, and follow-up). An expert panel comprising senior study authors from academic institutions used a structured voting process to shortlist the questions: each expert independently rated the relevance and frequency of each question, the questions were then discussed collectively, and consensus was required for a question to be included in the final set. To reflect real-world patient behaviour, the submitted queries were compared against commonly asked questions on laparoscopic cholecystectomy from patient-focused resources on hospital websites to ensure relevance and coverage [13,14,15]. Questions were reworded into simplified language reflecting typical layperson phrasing and adjusted to meet a Flesch Reading Ease Score (FRES) of ≥ 60, without additional system prompts or constraints, to mirror real-world patient queries in which instructions such as audience level or response length are not typically specified. To ensure topic diversity, at least one question per domain was retained; duplicates and infrequently asked questions lacking consensus were excluded. The final set comprised 13 questions (Table 1) representing the most commonly asked patient concerns prior to laparoscopic cholecystectomy.

Table 1 Thirteen patient questions used for evaluation, stratified by domain

Collection and grading of responses

Responses to the 13 selected questions were generated separately using the free web interface of ChatGPT (OpenAI, San Francisco, CA; version labelled ‘GPT-4’ in the interface at the time of access) via chat.openai.com and Gemini 1.5 (Google AI, California, USA) via gemini.google.com in October 2024, using a new session for each query. The free consumer-facing interfaces do not provide model IDs, build dates, or control over generation parameters (e.g., temperature, top_p, max tokens, safety configuration); therefore, all outputs reflect the default system settings as implemented by the providers at the time of data collection.

All standardized simulated interactions were conducted by a single author (A.R.) using the same device (MacBook Pro, Apple Inc., Cupertino, CA), web browser (Google Chrome, Google Inc., Mountain View, CA), and internet connection. To minimize bias, no “alternate answers” were requested at any point. Before each session, all chat windows were closed, browser history and cookies were cleared, and the computer was restarted to eliminate any potential influence of locally stored data on the chatbot’s performance. This approach was adopted because LLMs generate non-identical outputs upon repeated prompting, which may influence quality and accuracy.

Survey participants included anesthesiologists with a minimum of five years of clinical experience in academic medical institutions across India. All participants were actively involved in preoperative assessment and optimization of surgical patients. The paired responses from ChatGPT and Google Gemini were compiled into a survey form and distributed via email. To minimize the potential for recognition-based bias, the order of responses was randomized for each participant, and all participants were blinded to which model had generated each response.

Participants were instructed to evaluate each response on four parameters: accuracy, comprehensiveness, clarity, and safety, using a 5-point Likert scale (1 = very poor, 5 = excellent). Accuracy was defined as the degree to which the content reflected the latest evidence-based medical consensus. Comprehensiveness referred to the extent of topic coverage, including all necessary and relevant details. Clarity was judged based on how clearly and precisely the information was presented. Safety was defined as the absence of recommendations that could potentially cause harm or mislead patients.

In addition to expert grading, the readability of each response was assessed using established indices via a web-based readability checker (https://www.readable.com). The indices included the Flesch-Kincaid Grade Level (FKGL), Gunning Fog Index (GFI), and the FRES. The FKGL estimates the school grade level required to comprehend a text based on sentence length and syllable count [16]. The FRES evaluates ease of reading, focusing on the number of words per sentence and the number of syllables per word [17]. The GFI estimates the number of years of schooling needed to comprehend a given text, emphasizing complex words and sentence length rather than individual words [18]. We also assessed the Coleman-Liau Index, which determines readability from the average number of letters per word and words per sentence rather than syllables, and the SMOG Index (Simple Measure of Gobbledygook), which predicts the ability to understand a text by counting polysyllabic words in a sample of sentences [19, 20]. In addition, the tone and emotional expression of the generated responses were assessed using established sentiment analysis methods (described below).
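
For reference, the two primary indices are conventionally defined as follows [16, 17], where ASL is the average sentence length (words per sentence) and ASW is the average number of syllables per word:

FRES = 206.835 − (1.015 × ASL) − (84.6 × ASW)
FKGL = (0.39 × ASL) + (11.8 × ASW) − 15.59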

Statistical analysis

As this was an exploratory study with no prior data available to estimate effect sizes for LLM evaluation in preanesthetic counselling, no a priori power calculation was performed. Instead, all available anesthesiologist raters at the study site during the study period were included.

Survey responses were exported from Google Forms and compiled using Microsoft Excel (version 16.76; Microsoft Corporation, Redmond, WA, USA). Statistical analyses were performed using R (version 4.5.0; R Foundation for Statistical Computing, Vienna, Austria) [21]. Descriptive means with 95% confidence intervals were calculated in two ways: (i) by question: computing the mean rating for each question across raters, then averaging across questions within each domain and model; and (ii) by rater: computing the mean rating for each rater across questions, then averaging across raters within each domain and model. This dual approach allowed descriptive evaluation from both question-centered and rater-centered perspectives.
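
As a point of reference, the sketch below illustrates these two aggregation paths in R with dplyr, assuming a hypothetical long-format data frame `ratings` with columns rater, question, model, domain, and score (1–5); the column names are illustrative, and the authors' actual scripts may differ.

```r
library(dplyr)

# (i) By question: mean per question across raters, then average across questions
by_question <- ratings %>%
  group_by(model, domain, question) %>%
  summarise(q_mean = mean(score), .groups = "drop") %>%
  group_by(model, domain) %>%
  summarise(m     = mean(q_mean),
            se    = sd(q_mean) / sqrt(n()),
            lower = m - 1.96 * se,
            upper = m + 1.96 * se,
            .groups = "drop")

# (ii) By rater: mean per rater across questions, then average across raters
by_rater <- ratings %>%
  group_by(model, domain, rater) %>%
  summarise(r_mean = mean(score), .groups = "drop") %>%
  group_by(model, domain) %>%
  summarise(m     = mean(r_mean),
            se    = sd(r_mean) / sqrt(n()),
            lower = m - 1.96 * se,
            upper = m + 1.96 * se,
            .groups = "drop")
```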

To compare ratings between models, we used cumulative link mixed models (ordinal regression) with response score (1–5 Likert) as the outcome, model (ChatGPT vs. Gemini) as a fixed effect, and random intercepts for both rater and question to account for the repeated-measures design. Results are reported as odds ratios (OR) with 95% confidence intervals (CI), where OR > 1 indicates higher odds of a better score for ChatGPT compared to Gemini. P-values were adjusted for multiple testing across the four evaluation domains using the Benjamini–Hochberg procedure.
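
A minimal sketch of this model using the ordinal package is shown below; it continues from the illustrative `ratings` data frame above and is not the authors' exact implementation.

```r
library(ordinal)

ratings$model     <- relevel(factor(ratings$model), ref = "Gemini")       # Gemini as reference level
ratings$score_ord <- factor(ratings$score, levels = 1:5, ordered = TRUE)  # ordinal outcome

fit <- clmm(score_ord ~ model + (1 | rater) + (1 | question),
            data = subset(ratings, domain == "accuracy"), link = "logit")

# Wald odds ratio and 95% CI; OR > 1 = higher odds of a better score for ChatGPT
est  <- summary(fit)$coefficients["modelChatGPT", ]
beta <- est[["Estimate"]]; se <- est[["Std. Error"]]
exp(c(OR = beta, lower = beta - 1.96 * se, upper = beta + 1.96 * se))

# Benjamini-Hochberg adjustment across the four evaluation domains
p_raw <- c(accuracy = 0.0001, comprehensiveness = 0.0001, clarity = 0.78, safety = 0.94)  # placeholder values
p.adjust(p_raw, method = "BH")
```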

To assess robustness, we conducted sensitivity analyses. Mixed-effects ordinal regression models were re-estimated using (i) leave-one-rater-out and (ii) leave-one-question-out approaches, as well as with an alternate probit link function. These analyses tested whether results were influenced by individual raters, individual questions, or link specification.
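
The leave-one-out refits could be coded along the lines of the sketch below, continuing from the previous sketch; again, the authors' actual scripts may differ.

```r
# Leave-one-rater-out: re-fit the primary model excluding each rater in turn
acc <- subset(ratings, domain == "accuracy")

loro_or <- sapply(unique(acc$rater), function(r) {
  refit <- clmm(score_ord ~ model + (1 | rater) + (1 | question),
                data = subset(acc, rater != r), link = "logit")
  exp(coef(refit)["modelChatGPT"])   # OR with rater r excluded
})
range(loro_or)   # should remain within the primary model's 95% CI

# Leave-one-question-out is analogous, looping over unique(acc$question);
# the link-function sensitivity analysis re-fits the same model with link = "probit"
```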

Interrater reliability was assessed using Krippendorff’s α with an ordinal metric, appropriate for multi-rater Likert-scale data. Confidence intervals were obtained by bootstrapping across items. A P value of < 0.05 was considered statistically significant.
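
The following sketch shows one way to compute Krippendorff's α with an ordinal metric and an item-level bootstrap CI using the irr package, again assuming the illustrative `ratings` data frame; it is not the study's original code.

```r
library(irr)

# Build a raters-x-items matrix for one domain (items = 13 questions x 2 models = 26)
acc        <- subset(ratings, domain == "accuracy")
acc$item   <- interaction(acc$question, acc$model)
acc_matrix <- with(acc, tapply(as.numeric(score), list(rater, item), mean))  # rows = raters

kripp.alpha(acc_matrix, method = "ordinal")$value   # observed alpha

# Bootstrap 95% CI by resampling items (columns) with replacement
set.seed(2024)
boot_alpha <- replicate(2000, {
  cols <- sample(ncol(acc_matrix), replace = TRUE)
  kripp.alpha(acc_matrix[, cols], method = "ordinal")$value
})
quantile(boot_alpha, c(0.025, 0.975))
```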

Full-text responses generated by both LLMs were subjected to sentiment analysis. The Bing lexicon was used to classify words as positive or negative, and net sentiment scores were calculated by subtracting negative from positive word counts. The NRC Emotion Lexicon was applied to quantify words associated with ten emotion categories (anger, anticipation, disgust, fear, joy, sadness, surprise, trust, positive, negative). Word frequencies were normalized and visualized using bar and radar plots. For both Bing and NRC analyses, text was tokenized at the word level using the tidytext package; stop words were retained, as neutral words do not affect classification. Sentence-level sentiment scoring was performed using the sentimentr package [22], which tokenizes by sentences and accounts for contextual modifiers such as negation (for instance: “not good”). For lexicon-based analyses, each full LLM response was the unit of analysis, whereas for sentence-level scoring, results were aggregated from individual sentences to the response level.
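
The study's R code is provided in Supplementary File 3; the sketch below is only a simplified illustration of the three analyses using tidytext and sentimentr, assuming a hypothetical data frame `responses` with columns model, question, and text (one row per model-question pair).

```r
library(dplyr)
library(tidytext)
library(sentimentr)

tokens <- responses %>%
  unnest_tokens(word, text)   # word-level tokens; stop words retained

# (1) Bing lexicon: net sentiment = positive minus negative word counts per model
bing_net <- tokens %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(model, sentiment) %>%
  tidyr::pivot_wider(names_from = sentiment, values_from = n, values_fill = 0) %>%
  mutate(net = positive - negative)

# (2) NRC lexicon (requires the textdata package): normalized emotion-category frequencies
nrc_freq <- tokens %>%
  inner_join(get_sentiments("nrc"), by = "word", relationship = "many-to-many") %>%
  count(model, sentiment) %>%
  group_by(model) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

# (3) Sentence-level scoring with negation handling, averaged to the response level
responses$ave_sentiment <- sentiment_by(responses$text)$ave_sentiment
```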

Results

A total of 20 anesthesiologists completed the evaluation survey, assessing the quality of responses generated by ChatGPT and Google Gemini across the four key domains (Fig. 1). In addition, linguistic complexity was evaluated using multiple standardized readability indices. Full transcripts of all model responses are provided in Supplementary File 1.

Fig. 1 Schematic representation of the evaluation workflow for LLM-generated responses

Evaluation of content quality

Descriptive mean scores with 95% CI are presented in Table 2. ChatGPT received higher ratings than Gemini for accuracy and comprehensiveness in both the by-question and by-rater analyses. Clarity and safety ratings were similar, with overlapping confidence intervals in both approaches.

Table 2 Descriptive mean Likert scores (1–5) with 95% confidence intervals for ChatGPT and Gemini across evaluation domains, calculated by question and by rater

Mixed-effects ordinal regression showed that ChatGPT had higher odds of receiving better scores for accuracy (OR 2.32, 95% CI 1.62–3.32, p < 0.001) and comprehensiveness (OR 2.38, 95% CI 1.67–3.37, p < 0.001) compared to Gemini. No significant differences were found for clarity (OR 1.05, 95% CI 0.75–1.47, p = 0.78) or safety (OR 1.01, 95% CI 0.72–1.43, p = 0.94). These results are summarized in Fig. 2, which presents a forest plot of OR with 95% confidence intervals for each evaluation domain.

Fig. 2 Forest plot of mixed-effects ordinal regression comparing ChatGPT vs. Gemini across four evaluation domains. Points represent odds ratios (OR) with 95% confidence intervals on a log scale; the vertical dashed line at OR = 1 indicates no difference. Domains marked with “#” are statistically significant after Benjamini–Hochberg adjustment

Sensitivity analyses confirmed the robustness of the findings. The ORs from the leave-one-rater-out and leave-one-question-out analyses remained within the 95% CIs of the primary models, and results were consistent with a probit link function. Full sensitivity results are presented in Supplementary File 2.

Interrater agreement analysis

Agreement among raters was generally low across all domains. Krippendorff’s α (95% CI) was 0.24 (0.18–0.35) for accuracy, 0.23 (0.19–0.33) for comprehensiveness, 0.37 (0.32–0.46) for clarity, and 0.46 (0.41–0.55) for safety, indicating low to, at best, moderate reliability.

Evaluation of readability and linguistic complexity

To assess how easily patients could understand the content, responses were analyzed using established readability formulas. Table 3 compares the readability indices of ChatGPT and Google Gemini across five standardized metrics. ChatGPT generated more complex text, requiring a significantly higher reading level as measured by the FKGL (P = 0.04). Correspondingly, Google Gemini produced more readable content based on the FRES (P = 0.04), where higher scores indicate easier comprehension. Other indices showed the same trend without reaching statistical significance: ChatGPT had higher values on the GFI and the SMOG Index (the latter approaching significance, P = 0.056), while the Coleman–Liau Index was comparable between models.

Table 3 Comparison of readability metrics between ChatGPT and Google Gemini responses

Despite differences in complexity, there were no significant variations in the length of responses. Word count, character count, and syllable count were comparable between the two models.

Sentiment and emotional tone analysis

Figure 3A and B illustrate the results of the sentiment and emotional tone analyses. Bing lexicon analysis showed that both models used more negative than positive words, with net sentiment scores of − 46 for ChatGPT and − 73 for Gemini. Radar plot analysis using the NRC Emotion Lexicon revealed that Google Gemini responses also had greater emotional diversity, with higher frequencies of words associated with trust, joy, sadness, and disgust, among others, whereas ChatGPT responses contained comparatively fewer emotion-laden words, except for anger. In Fig. 3B, the 0–1 scale indicates the relative intensity of each emotion category, with values closer to 1.0 meaning that emotion was expressed more strongly in the responses. Sentence-level sentiment scoring (sentimentr) yielded mean values close to neutral for both models, though ChatGPT responses were marginally more positive (+ 0.109 vs. + 0.023). A summary of all three sentiment analyses is presented in Table 4. The R code used to implement these analyses is provided as Supplementary File 3.

Fig. 3 Sentiment and emotional tone analysis of LLM responses. A Stacked bar chart based on the Bing sentiment lexicon; B Radar chart based on the NRC emotion lexicon (the 0–1 scale indicates the relative intensity of each emotion, with values closer to 1.0 meaning that emotion appeared more strongly in the model’s responses)

Table 4 Quantitative sentiment comparison of ChatGPT and Google Gemini responses across multiple emotional and polarity metrics

To complement the quantitative sentiment scores, we visualized the most frequently used emotion-related words using word clouds. Low-value or contextually irrelevant terms were excluded using a custom stopword filter to improve interpretability. Supplementary File 4 is a figure depicting the word clouds of the top 100–150 emotion-bearing words in ChatGPT and Google Gemini responses; words are sized proportionally to their frequency, highlighting the most emotionally expressive vocabulary used by each model.

Discussion

In this study, we provide novel evidence comparing the performance of two LLMs, ChatGPT and Google Gemini, in generating appropriate and safe preanesthetic educational content for patients undergoing laparoscopic cholecystectomy. While prior literature has focused on the ability of LLMs to answer medical exam-style questions with factual accuracy [23], our study shifts the focus to how well these models communicate medical information in a way that is accurate, clear, comprehensive, and suitable for patients. Our findings highlight two key patterns. First, ChatGPT outperformed Gemini in content-related domains such as accuracy and comprehensiveness, as rated by experienced anesthesiologists, consistent with other studies [24]. Second, Gemini generated responses with better readability, producing text that was generally easier to understand. This reflects a trade-off between delivering medically detailed content and using language that is more accessible to the average patient.

Although our study demonstrated that ChatGPT excelled in domains of accuracy and comprehensiveness, it is essential to acknowledge the fundamental differences between the two models. The version of ChatGPT used was trained on a large dataset up to a fixed cut-off date, whereas Gemini incorporates real-time web access, potentially allowing for more up-to-date information [25, 26]. This may explain why some studies have reported that ChatGPT struggles with consistently providing guideline-concordant clinical recommendations [27]. However, our study was not designed to assess how current or updated the responses were, but to assess the utility of LLMs for standardized preoperative education. Differences in model architecture and training approaches may account for this observed performance variability [28].

Previous studies evaluating LLMs in patient-facing contexts have generally supported their potential in enhancing health communication. For instance, Segal et al. assessed GPT-4 responses to common preanesthetic questions and found them to be clearer and more helpful than content from academic websites [5]. However, participants in that study were limited to a binary “reasonable/unreasonable” rating, lacking domain-specific evaluation. In contrast, our study used a detailed multi-domain scoring framework, allowing for a more nuanced assessment of LLM output quality.

In an earlier study, despite receiving oral and written instructions, a subset of patients remained non-compliant: 2% did not adhere to fasting guidelines, 7% took medications against medical advice, and 4% planned to drive themselves home following ambulatory surgery [12]. Enhancing patient understanding through accurate, comprehensive, and personalized information about anesthesia may help improve adherence to perioperative instructions. In our study, ChatGPT provided more accurate and comprehensive perioperative information, which might translate into better compliance.

The observed low inter-rater agreement in our study highlights the inherent variability in expert evaluation of LLM-generated content. Several factors may contribute to this limited agreement. First, LLM outputs often include context-dependent language, which may be interpreted differently depending on the rater’s clinical background or expectations. Second, Likert-scale scoring systems, while widely used, may oversimplify judgments and fail to capture the reasoning behind a rater’s choice or ensure objective assessment [29]. Lastly, inherent cognitive biases, such as leniency, central tendency, or halo effects, may further influence ratings [30, 31]. These results underscore the need for structured, objective evaluation frameworks, such as incorporating reference standards, consensus panels, or automated scoring systems to complement subjective assessment in future LLM validation studies.

When it comes to readability, Gemini scored better on several established indices. However, neither model consistently reached the recommended readability level for health communication materials aimed at the general public. This finding is in line with earlier research comparing LLMs with traditional patient information leaflets (PILs), which also found that while AI-generated text may be simplified, it is not always ideal for lay audiences [28].

We also examined the emotional tone of the responses. Sentiment analysis showed that Gemini’s outputs had a greater emotional range, including more negative words, while ChatGPT’s tone was generally more neutral and slightly more positive overall. Previous studies have reported similar trends, suggesting that Gemini tends to produce more serious or affective language [32]. Our results do not clarify whether patients would perceive one model’s tone as more reassuring and the other as more anxiety-inducing. From a clinical viewpoint, these distinctions may influence patient engagement and trust. Although emotionally expressive language may enhance relatability, it carries the risk of sounding subjective or even dramatic; a more neutral tone, on the other hand, may limit patient engagement. This underscores the need for context-sensitive model selection in patient-facing applications. The divergence between word-level (Bing) and sentence-level (sentimentr) analyses highlights an important methodological consideration: medical counselling language often contains terms with inherently negative valence (‘pain’, ‘risk’, ‘complication’) that inflate lexicon-based counts, even when the overall sentence is neutral or reassuring. Context-sensitive approaches, such as sentimentr, may therefore provide a more clinically meaningful estimate of tone.

These tonal differences were also accompanied by qualitative variations in response style. Gemini responses were more conversational and often included additional elements such as illustrative figures, tables, or references, even without being specifically prompted. These differences may reflect the underlying design philosophies of the two models [33]. Although LLMs may generate hallucinations (false information arising from misinterpreted patterns in their training data or internal inconsistencies within the model) [34], none of the responses in our study contained inaccurate or potentially harmful content. Traditional online resources provide static, often jargon-heavy content that lacks personalization, can be overwhelming in volume and complexity, and fails to mimic a conversational tone or address emotional nuances [32]. In contrast, although LLMs remain dependent on training data and prompt structure, they offer conversational responses and provide on-demand information with real-time interactivity [35, 36]. Our findings therefore suggest that LLMs show promise in generating patient-facing medical information; however, they should not be viewed as replacements for direct counselling by healthcare providers. Important nuances, emotional sensitivity, and the ability to tailor information to a patient’s background and concerns are aspects that AI cannot yet replicate reliably. LLMs may thus bridge the gap between impersonal websites and clinician interaction rather than serve as substitutes for personalized counselling.

Limitations

Although our study included a greater number of respondents than prior literature, the sample of anesthesiologists remained modest. The readability metrics applied, while widely used, have not been formally validated in the Indian population, and our evaluation was limited to a single surgical procedure and institution, which may affect generalizability. Patient comprehension and usability were not evaluated and should be addressed in future work. Readability indices were derived using an online tool rather than a reproducible script applied to text stripped of hyperlinks and formatting; future work should recompute readability metrics within a fully reproducible pipeline. Ratings were performed solely by anesthesiologists, which may not capture the layperson’s perspective. The reduction of 68 initial questions to 13 relied on expert judgment and readability filtering without direct patient involvement, which may introduce investigator bias; future studies should include patient panels or structured consensus methods (e.g., cognitive interviews or Delphi processes) to strengthen content validity. Lastly, responses were obtained from static versions of the LLMs at a fixed time point, which may not reflect future model performance as these systems evolve.

Conclusion

ChatGPT generated more accurate and complete perioperative anesthesia information for laparoscopic cholecystectomy, while Google Gemini produced responses that were comparatively easier to read and more emotionally expressive, reflecting a trade-off between clinical detail and ease of understanding. No unsafe or misleading information was identified in either model’s output. Within the limitations of this exploratory study (a single surgical procedure, a single-country rater pool, and no patient comprehension testing), these findings suggest that LLMs may serve as supportive tools for patient education when used under expert supervision, but they cannot substitute for personalized counselling.