Abstract
Background
Large Language Models (LLMs) such as ChatGPT and Google Gemini are increasingly explored for their potential in patient education, particularly in the perioperative setting. As text-based tools trained on extensive datasets, they can generate detailed responses to common medical queries. However, their utility in delivering reliable, understandable, and emotionally appropriate preanesthetic information remains unclear.
Methods
We conducted a prospective observational study comparing ChatGPT and Google Gemini in generating educational content for patients undergoing laparoscopic cholecystectomy. From 68 patient questions submitted by anesthesiologists, 13 high-relevance items were selected. Responses from both models were independently rated by 20 anesthesiologists using a 5-point Likert scale across four domains: accuracy, comprehensiveness, clarity, and safety. Mixed-effects ordinal regression with random intercepts for rater and question estimated odds ratios (OR) and 95% confidence intervals (CI) for ChatGPT versus Gemini. Readability was assessed using standard linguistic indices, and sentiment analysis was performed. Inter-rater reliability was evaluated using Krippendorff’s α.
Results
ChatGPT had significantly higher odds of receiving better scores for accuracy (OR 2.32, 95% CI 1.62–3.32, p < 0.001) and comprehensiveness (OR 2.38, 95% CI 1.67–3.37, p < 0.001), with no differences for clarity (OR 1.05, 95% CI 0.75–1.47) or safety (OR 1.01, 95% CI 0.72–1.43). Gemini generated text with greater readability, demonstrated by a lower Flesch-Kincaid Grade level (p = 0.04) and higher Flesch Reading Ease score (p = 0.04). Sentiment analysis showed Gemini responses contained a wider emotional range, while ChatGPT responses were more neutral overall. Inter-rater reliability was low across domains (Krippendorff’s α 0.23–0.46).
Conclusion
ChatGPT produced more accurate and comprehensive perioperative anesthesia information, whereas Gemini offered greater readability and emotional expressiveness. Both models may serve as adjuncts in preanesthetic patient education but are not substitutes for clinician counselling. Larger, multi-center studies incorporating direct patient testing are warranted to validate these findings.
Introduction
Artificial intelligence (AI) is poised to play a pivotal role in reshaping healthcare delivery and advancing patient outcomes. A significant role for AI in healthcare is the provision of medical information, facilitating patient education in a personalized and interactive manner [1]. Large Language Models (LLMs), a subset of AI programs, offer substantial potential for enhancing preanesthetic patient education and communication [2]. LLMs are trained using massive datasets, which render them capable of generating coherent and realistic human-like text responses based on patient-entered text prompts [3, 4]. Among the available LLM interfaces, deep learning tools with natural language processing like ChatGPT (OpenAI, San Francisco, CA, USA) and Google Gemini (Google AI, California, United States) have garnered significant interest and have already been evaluated for their overall capability to answer patient-related questions accurately in fields like ophthalmology, otolaryngology, neurosurgery, cardiology, orthopedics, and even in Medical Licensing Examinations [5,6,7,8,9].
Recent research indicates that inadequate patient counselling about anesthesia, surgery, and the postoperative course, together with negative beliefs and thoughts, is strongly associated with an increased risk of psychiatric disorders. Patients require comprehensive information about the type of anesthesia, along with personalized details relevant to their condition and concerns, which helps dispel misconceptions, provides reassurance, and supports compliance with preoperative instructions [10,11,12]. A previous study reported that specialists in anesthesia care found ChatGPT-4-generated responses to common preoperative questions comparable to the information available on standard academic websites [5]. However, LLMs can sometimes generate inaccurate or misleading answers, known as “hallucinations,” which can undermine their reliability [4]. It remains unclear whether ChatGPT or Google Gemini can reliably understand and respond to questions about standard anesthesia care in real clinical settings, a task that typically demands extensive training and expertise from anesthetists. We therefore conducted this study to compare the feasibility and utility of ChatGPT and Google Gemini in preanesthetic counselling and patient education. The primary objective was to compare the quality of the content generated by the two models; secondary objectives included comparisons of readability and sentiment of the responses.
Methodology
This prospective observational study was conducted at a tertiary care hospital from September to November 2024 after approval by the Institutional Research Committee (AIMS/IRC-study 12/2024). As the study did not involve direct participation of patients, formal ethics committee approval was not required, and a waiver was obtained (AIMS/IEC-HR/2025/54). The research was carried out in accordance with ethical standards, with strict confidentiality and anonymity maintained. Consent by participating anesthesiologists was implied, and a detailed participant information sheet was provided (Supplementary File 1).
Selection of questions
A confidential, web-based questionnaire was distributed to anesthesiologists at the primary tertiary academic institution to collect commonly encountered preanesthetic questions from patients scheduled for laparoscopic cholecystectomy. A total of 22 anesthesiologists participated, submitting 68 unique questions. These were stratified into the following major domains: preoperative (evaluation, risk stratification, choice of anesthesia, preoperative instructions), intraoperative (management of awareness and pain, and expected complications), and postoperative (resumption of dietary intake, ambulation, complications, and follow-up). The expert panel, comprising senior study authors from academic institutions, employed a structured voting process to evaluate the questions and shortlisted 13 questions. Each expert independently rated the relevance and frequency of each question. Questions were then discussed collectively, and a consensus was required for a question to be included in the final set. To simulate real-world patient behaviour, submitted queries were reviewed in relation to commonly asked questions on laparoscopic cholecystectomy from patient-focused resources on hospital websites to ensure relevance and coverage [13,14,15]. Questions were reworded into simplified language reflecting typical layperson phrasing and adjusted to meet a Flesch Reading Ease Score (FRES) of ≥ 60, without additional system prompts or constraints, to simulate real-world patient queries where instructions such as audience level or response length are not typically specified. To ensure topic diversity, at least one question per domain was retained. Duplicates and infrequently asked questions lacking consensus were excluded. The final set comprised 13 questions (Table 1) representing the most commonly asked patient concerns prior to laparoscopic cholecystectomy.
Collection and grading of responses
Responses to the 13 selected questions were generated separately using the free web interface of ChatGPT (OpenAI, San Francisco, CA; version labelled ‘GPT-4’ in the interface at the time of access) via chat.openai.com and Gemini 1.5 (Google AI, California, USA) via gemini.google.com in October 2024, using a new session for each query. The free consumer-facing interfaces do not provide model IDs, build dates, or control over generation parameters (e.g., temperature, top_p, max tokens, safety configuration); therefore, all outputs reflect the default system settings as implemented by the providers at the time of data collection.
All standardized simulated interactions were conducted by a single author (A.R.) using the same device (MacBook Pro, Apple Inc., Cupertino, CA), web browser (Google Chrome, Google Inc., Mountain View, CA), and internet connection. To minimize bias, no “alternate answers” were requested at any point. Before each session, all chat windows were closed, browser history and cookies were cleared, and the computer was restarted to eliminate any potential influence of locally stored data on the chatbot’s performance. This approach was adopted because LLMs generate non-identical outputs upon repeated prompting, which may influence quality and accuracy.
Survey participants included anesthesiologists with a minimum of five years of clinical experience in academic medical institutions across India. All participants were actively involved in preoperative assessment and optimization of surgical patients. The paired responses from ChatGPT and Google Gemini were compiled into a survey form and distributed via email. To minimize the potential for recognition-based bias, the order of responses was randomized for each participant, and all participants were blinded to which model had generated each response.
Participants were instructed to evaluate each response on four parameters: accuracy, comprehensiveness, clarity, and safety, using a 5-point Likert scale (1 = very poor, 5 = excellent). Accuracy was defined as the degree to which the content reflected the latest evidence-based medical consensus. Comprehensiveness referred to the extent of topic coverage, including all necessary and relevant details. Clarity was judged based on how clearly and precisely the information was presented. Safety was defined as the absence of recommendations that could potentially cause harm or mislead patients.
In addition to expert grading, the readability of each response was assessed with established indices using a web-based readability checker (https://www.readable.com). The indices included the Flesch-Kincaid Grade Level (FKGL), the Gunning Fog Index (GFI), and the FRES. The FKGL estimates the school grade level required to comprehend a text based on sentence length and syllable count [16]. The FRES evaluates the ease of reading, focusing on the number of words per sentence and the number of syllables per word [17]. The GFI estimates the number of years of schooling needed to comprehend a given text, emphasizing complex words and sentence length rather than individual words [18]. We also assessed the Coleman-Liau Index, which determines readability from the average number of letters per word and words per sentence rather than syllables, and the SMOG Index (Simple Measure of Gobbledygook), which predicts the ability to understand a text by counting polysyllabic words in a sample of sentences [19, 20]. In addition, the tone and emotional expression of the generated responses were assessed using lexicon-based sentiment analysis, as described under statistical analysis.
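To make the construction of the two headline indices concrete, the short R sketch below computes FRES and FKGL directly from their published formulas on a toy sentence; the crude vowel-group syllable counter and the example text are simplifications introduced here for illustration and do not reproduce the algorithm of the web-based tool used in the study.

```r
# Minimal sketch: FRES and FKGL from their published formulas.
# The vowel-group syllable counter is a rough approximation, not the
# counter used by the readable.com tool.
count_syllables <- function(word) {
  hits <- gregexpr("[aeiouy]+", tolower(word))[[1]]
  if (hits[1] == -1) 1 else length(hits)
}

readability <- function(text) {
  sentences <- trimws(unlist(strsplit(text, "[.!?]+")))
  sentences <- sentences[nzchar(sentences)]
  words <- unlist(strsplit(text, "[^A-Za-z']+"))
  words <- words[nzchar(words)]
  syllables <- sum(vapply(words, count_syllables, numeric(1)))
  W <- length(words); S <- length(sentences)
  c(FRES = 206.835 - 1.015 * (W / S) - 84.6 * (syllables / W),
    FKGL = 0.39 * (W / S) + 11.8 * (syllables / W) - 15.59)
}

readability("You will be asleep during the whole operation. Most patients go home the next day.")
```

On real model outputs, hyperlinks and markdown formatting would need to be stripped before scoring, which is one reason a fully scripted pipeline is recommended in the limitations.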
Statistical analysis
As this was an exploratory study with no prior data available to estimate effect sizes for LLM evaluation in preanesthetic counselling, no a priori power calculation was performed. Instead, all available anesthesiologist raters at the study site during the study period were included.
Survey responses were exported from Google Forms and compiled using Microsoft Excel (version 16.76; Microsoft Corp., Redmond, WA, USA). Statistical analyses were performed using R (version 4.5.0; R Foundation for Statistical Computing, Vienna, Austria) [21]. Descriptive means with 95% confidence intervals were calculated in two ways: (i) by question, computing the mean rating for each question across raters and then averaging across questions within each domain and model; and (ii) by rater, computing the mean rating for each rater across questions and then averaging across raters within each domain and model. This dual approach allowed descriptive evaluation from both question-centered and rater-centered perspectives.
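A minimal sketch of the two aggregation schemes is shown below, using a simulated rating table with hypothetical column names (rater, question, model, score) rather than the study data; a single domain is shown for brevity.

```r
library(dplyr)

set.seed(1)
# simulated ratings for one domain: 20 raters x 13 questions x 2 models
ratings <- expand.grid(rater = factor(1:20), question = factor(1:13),
                       model = c("ChatGPT", "Gemini"))
ratings$score <- sample(1:5, nrow(ratings), replace = TRUE)

# (i) by question: mean over raters within each question, then over questions
by_question <- ratings %>%
  group_by(model, question) %>%
  summarise(q_mean = mean(score), .groups = "drop") %>%
  group_by(model) %>%
  summarise(mean = mean(q_mean),
            lo = mean(q_mean) - 1.96 * sd(q_mean) / sqrt(n()),
            hi = mean(q_mean) + 1.96 * sd(q_mean) / sqrt(n()),
            .groups = "drop")

# (ii) by rater: mean over questions within each rater, then over raters
by_rater <- ratings %>%
  group_by(model, rater) %>%
  summarise(r_mean = mean(score), .groups = "drop") %>%
  group_by(model) %>%
  summarise(mean = mean(r_mean),
            lo = mean(r_mean) - 1.96 * sd(r_mean) / sqrt(n()),
            hi = mean(r_mean) + 1.96 * sd(r_mean) / sqrt(n()),
            .groups = "drop")
```

Averaging by question weights each question equally, whereas averaging by rater weights each rater equally; reporting both prevents either unit from dominating the descriptive summary.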
To compare ratings between models, we used cumulative link mixed models (ordinal regression) with response score (1–5 Likert) as the outcome, model (ChatGPT vs. Gemini) as a fixed effect, and random intercepts for both rater and question to account for the repeated-measures design. Results are reported as odds ratios (OR) with 95% confidence intervals (CI), where OR > 1 indicates higher odds of a better score for ChatGPT compared to Gemini. P-values were adjusted for multiple testing across the four evaluation domains using the Benjamini–Hochberg procedure.
To assess robustness, we conducted sensitivity analyses. Mixed-effects ordinal regression models were re-estimated using (i) leave-one-rater-out and (ii) leave-one-question-out approaches, as well as with an alternate probit link function. These analyses tested whether results were influenced by individual raters, individual questions, or link specification.
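A hedged sketch of how the primary model, the Benjamini–Hochberg adjustment, and the leave-one-rater-out re-fits could be implemented is shown below, using the ordinal package on simulated ratings; the package choice, data layout, and simulated effect sizes are assumptions for illustration and do not reproduce the authors' code.

```r
library(ordinal)

set.seed(1)
sim <- expand.grid(rater = factor(1:20), question = factor(1:13),
                   model = c("Gemini", "ChatGPT"))
sim$model <- factor(sim$model, levels = c("Gemini", "ChatGPT"))  # Gemini = reference
# simulate 1-5 Likert scores with a mild advantage for ChatGPT
p_gpt <- c(.02, .08, .20, .40, .30); p_gem <- c(.05, .15, .30, .35, .15)
sim$score <- ordered(ifelse(sim$model == "ChatGPT",
                            sample(1:5, nrow(sim), TRUE, p_gpt),
                            sample(1:5, nrow(sim), TRUE, p_gem)))

# Primary model: cumulative logit link, random intercepts for rater and question
fit <- clmm(score ~ model + (1 | rater) + (1 | question), data = sim)
cf  <- summary(fit)$coefficients["modelChatGPT", ]
est <- unname(cf["Estimate"]); se <- unname(cf["Std. Error"])
exp(c(OR = est, lower95 = est - 1.96 * se, upper95 = est + 1.96 * se))

# Benjamini-Hochberg adjustment across the four domains (p-values illustrative)
p.adjust(c(accuracy = 1e-4, comprehensiveness = 1e-4, clarity = 0.78, safety = 0.94),
         method = "BH")

# Leave-one-rater-out sensitivity: refit excluding each rater in turn
loro_or <- sapply(levels(sim$rater), function(r) {
  f <- clmm(score ~ model + (1 | rater) + (1 | question),
            data = droplevels(subset(sim, rater != r)))
  exp(summary(f)$coefficients["modelChatGPT", "Estimate"])
})
range(loro_or)

# Alternate link specification
fit_probit <- clmm(score ~ model + (1 | rater) + (1 | question),
                   data = sim, link = "probit")
```

The leave-one-question-out analysis follows the same pattern with question in place of rater.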
Interrater reliability was assessed using Krippendorff’s α with an ordinal metric, appropriate for multi-rater Likert-scale data. Confidence intervals were obtained by bootstrapping across items. A P value of < 0.05 was considered statistically significant.
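The snippet below sketches this reliability step: ordinal Krippendorff's α on a toy rater-by-item matrix, with a bootstrap confidence interval obtained by resampling items. The irr package and the simulated matrix are assumptions rather than the authors' exact implementation.

```r
library(irr)

set.seed(1)
# toy matrix: 20 raters (rows) x 26 items (13 questions x 2 models), scores 1-5
ratings_mat <- matrix(sample(1:5, 20 * 26, replace = TRUE,
                             prob = c(.05, .10, .20, .35, .30)),
                      nrow = 20)

# kripp.alpha() expects raters in rows and items in columns
alpha_obs <- kripp.alpha(ratings_mat, method = "ordinal")$value

# bootstrap across items: resample columns with replacement and recompute alpha
boot_alpha <- replicate(2000, {
  idx <- sample(ncol(ratings_mat), replace = TRUE)
  kripp.alpha(ratings_mat[, idx], method = "ordinal")$value
})
c(alpha = alpha_obs, quantile(boot_alpha, c(0.025, 0.975)))
```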
Full-text responses generated by both LLMs were subjected to sentiment analysis. The Bing lexicon was used to classify words as positive or negative, and net sentiment scores were calculated by subtracting negative from positive word counts. The NRC Emotion Lexicon was applied to quantify words associated with ten emotion categories (anger, anticipation, disgust, fear, joy, sadness, surprise, trust, positive, negative). Word frequencies were normalized and visualized using bar and radar plots. For both Bing and NRC analyses, text was tokenized at the word level using the tidytext package; stop words were retained, as neutral words do not affect classification. Sentence-level sentiment scoring was performed using the sentimentr package [22], which tokenizes by sentences and accounts for contextual modifiers such as negation (for instance: “not good”). For lexicon-based analyses, each full LLM response was the unit of analysis, whereas for sentence-level scoring, results were aggregated from individual sentences to the response level.
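A condensed sketch of these three layers applied to a single toy sentence is shown below; the study's full R code is provided as Supplementary File 3, so the example text and exact calls here are illustrative only (and get_sentiments("nrc") requires a one-time lexicon download via the textdata package).

```r
library(dplyr)
library(tidytext)
library(sentimentr)

# toy response; the study analysed the full transcripts
response <- tibble(
  model = "ChatGPT",
  text  = paste("Some pain and nausea are common after surgery, but they are",
                "usually mild and settle quickly with safe, effective medication.")
)

words <- response %>% unnest_tokens(word, text)

# Bing lexicon: net sentiment = positive minus negative word counts
bing_net <- words %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  count(sentiment) %>%
  summarise(net = sum(n[sentiment == "positive"]) - sum(n[sentiment == "negative"]))

# NRC lexicon: proportion of matched words in each emotion category
nrc_profile <- words %>%
  inner_join(get_sentiments("nrc"), by = "word") %>%
  count(sentiment) %>%
  mutate(prop = n / sum(n))

# sentimentr: sentence-level polarity with negation handling, aggregated per response
sentence_scores <- sentiment_by(get_sentences(response$text))

bing_net; nrc_profile; sentence_scores
```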
Results
A total of 20 anesthesiologists completed the evaluation survey, assessing the quality of responses generated by ChatGPT and Google Gemini across the four key domains (Fig. 1). In addition, linguistic complexity was evaluated using multiple standardized readability indices. Full transcripts of all model responses are provided in supplementary file 1.
Evaluation of content quality
Descriptive mean scores with 95% CI are presented in Table 2. ChatGPT received higher ratings than Gemini in accuracy (both by-question and by-rater) and comprehensiveness (both by-question and by-rater). Clarity and safety ratings were similar, with overlapping confidence intervals in both approaches.
Mixed-effects ordinal regression showed that ChatGPT had higher odds of receiving better scores for accuracy (OR 2.32, 95% CI 1.62–3.32, p < 0.001) and comprehensiveness (OR 2.38, 95% CI 1.67–3.37, p < 0.001) compared to Gemini. No significant differences were found for clarity (OR 1.05, 95% CI 0.75–1.47, p = 0.78) or safety (OR 1.01, 95% CI 0.72–1.43, p = 0.94). These results are summarized in Fig. 2, which presents a forest plot of OR with 95% confidence intervals for each evaluation domain.
Forest plot of mixed-effects ordinal regression comparing ChatGPT vs. Gemini across four evaluation domains. Points represent odds ratios (OR) with 95% confidence intervals on a log scale; vertical dashed line at OR = 1 indicates no difference. Domains marked with “#” are statistically significant after Benjamini–Hochberg adjustment
Sensitivity analyses confirmed the robustness of the findings. The OR from leave-one-rater-out and leave-one-question-out analyses remained within the 95% CIs of the primary models, and results were consistent using a probit link function. Full sensitivity results are presented in Supplementary File 2.
Interrater agreement analysis
Agreement among raters was generally low across all domains. Krippendorff’s α (95% CI) was 0.24 (0.18–0.35) for accuracy, 0.23 (0.19–0.33) for comprehensiveness, 0.37 (0.32–0.46) for clarity, and 0.46 (0.41–0.55) for safety, indicating only slight to fair reliability.
Evaluation of readability and linguistic complexity
To assess how easily patients could understand the content, responses were analyzed using established readability formulas. Table 3 compares the readability indices of ChatGPT and Google Gemini across five standardised metrics. ChatGPT generated more complex text, requiring a significantly higher reading level as measured by the FKGL (P = 0.04). Correspondingly, Google Gemini produced more readable content based on the FRES (P = 0.04), where higher scores indicate easier comprehension. Other indices corroborated this trend, though without reaching statistical significance. ChatGPT had higher values on the GFI and SMOG Index, while the Coleman–Liau Index was comparable between models. The SMOG Index approached significance (P = 0.056).
Despite differences in complexity, there were no significant variations in the length of responses. Word count, character count, and syllable count were comparable between the two models.
Sentiment and emotional tone analysis
Figure 3A and B illustrate the results of sentiment and emotional tone analysis. Bing lexicon analysis showed that both models used more negative than positive words, with net sentiment scores of − 46 for ChatGPT and − 73 for Gemini. Radar plot analysis using the NRC Emotion Lexicon revealed that Google Gemini responses had greater emotional diversity, with higher frequencies of words associated with trust, joy, sadness, and disgust, among others, while ChatGPT responses contained comparatively fewer emotion-laden words, except for anger. In Fig. 3B, the 0–1 scale indicates the relative intensity of each emotion category, with values closer to 1.0 meaning that the emotion was expressed more strongly in the responses. Sentence-level sentiment scoring (sentimentr) yielded mean values close to neutral for both models, though ChatGPT responses were marginally more positive (+ 0.109 vs. + 0.023). A summary of all three sentiment analyses is presented in Table 4. The R code used to implement these analyses is provided as Supplementary File 3.
Sentiment and emotional tone analysis of LLM responses A Stacked bar chart based on the Bing sentiment lexicon; B Radar chart based on the NRC emotion lexicon (The 0–1 scale indicates the relative intensity of each emotion, with values closer to 1.0 meaning that emotion appeared more strongly in the model’s responses)
To complement the quantitative sentiment scores, we visualized the most frequently used emotion-related words as word clouds. Low-value or contextually irrelevant terms were excluded using a custom stopword filter to improve interpretability. Supplementary File 4 contains word clouds of the top 100–150 emotion-bearing words in the ChatGPT and Google Gemini responses; words are sized proportionally to their frequency, highlighting the most emotionally expressive vocabulary used by each model.
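As a rough illustration of how such a word cloud can be generated, the sketch below uses the Bing lexicon on a toy response together with a hypothetical stopword list; the lexicon and the actual filter used for Supplementary File 4 are not reproduced here.

```r
library(dplyr)
library(tidytext)
library(wordcloud)

# toy response text and a hypothetical custom stopword filter
response <- tibble(text = paste("Mild pain or nausea is common after surgery, but it usually",
                                "settles quickly with safe and effective medication, so most",
                                "patients recover comfortably."))
custom_stopwords <- c("patients", "surgery")

sentiment_words <- response %>%
  unnest_tokens(word, text) %>%
  inner_join(get_sentiments("bing"), by = "word") %>%
  filter(!word %in% custom_stopwords) %>%
  count(word, sort = TRUE)

# words sized by frequency, capped at 150 terms as in the supplementary figure
with(sentiment_words, wordcloud(word, n, min.freq = 1, max.words = 150, scale = c(3, 0.5)))
```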
Discussion
In this study, we provide novel evidence comparing the performance of two LLMs, ChatGPT and Google Gemini, in generating appropriate and safe preanesthetic educational content for patients undergoing laparoscopic cholecystectomy. While prior literature has focused on the ability of LLMs to answer medical exam-style questions with factual accuracy [23], our study shifts the focus to how well these models communicate medical information in a way that is accurate, clear, comprehensive, and suitable for patients. Our findings highlight two key patterns. First, ChatGPT outperformed Gemini in the content-related domains of accuracy and comprehensiveness, as rated by experienced anesthesiologists, consistent with other studies [24]. Second, Gemini generated responses with better readability, producing text that was generally easier to understand. This reflects a trade-off between delivering medically detailed content and using language that is more accessible to the average patient.
Although our study demonstrated that ChatGPT excelled in domains of accuracy and comprehensiveness, it is essential to acknowledge the fundamental differences between the two models. The version of ChatGPT used was trained on a large dataset up to a fixed cut-off date, whereas Gemini incorporates real-time web access, potentially allowing for more up-to-date information [25, 26]. This may explain why some studies have reported that ChatGPT struggles with consistently providing guideline-concordant clinical recommendations [27]. However, our study was not designed to assess how current or updated the responses were, but to assess the utility of LLMs for standardized preoperative education. Differences in model architecture and training approaches may account for this observed performance variability [28].
Previous studies evaluating LLMs in patient-facing contexts have generally supported their potential in enhancing health communication. For instance, Segal et al. assessed GPT-4 responses to common preanesthetic questions and found them to be clearer and more helpful than content from academic websites [5]. However, participants in that study were limited to a binary “reasonable/unreasonable” rating, lacking domain-specific evaluation. In contrast, our study used a detailed multi-domain scoring framework, allowing for a more nuanced assessment of LLM output quality.
In a previous study of day-case surgical patients, a subset remained non-compliant despite receiving oral and written instructions: 2% did not adhere to fasting guidelines, 7% took medications against medical advice, and 4% planned to drive themselves home following ambulatory surgery [12]. Enhancing patient understanding through accurate, comprehensive, personalized information about anesthesia may help improve adherence to perioperative instructions. In our study, ChatGPT provided more accurate and comprehensive perioperative information, which might translate into better compliance.
The observed low inter-rater agreement in our study highlights the inherent variability in expert evaluation of LLM-generated content. Several factors may contribute to this limited agreement. First, LLM outputs often include context-dependent language, which may be interpreted differently depending on the rater’s clinical background or expectations. Second, Likert-scale scoring systems, while widely used, may oversimplify judgments and fail to capture the reasoning behind a rater’s choice or ensure objective assessment [29]. Lastly, inherent cognitive biases, such as leniency, central tendency, or halo effects, may further influence ratings [30, 31]. These results underscore the need for structured, objective evaluation frameworks, such as incorporating reference standards, consensus panels, or automated scoring systems to complement subjective assessment in future LLM validation studies.
When it comes to readability, Gemini scored better on several established indices. However, neither model consistently reached the recommended readability level for health communication materials aimed at the general public. This finding is in line with earlier research comparing LLMs with traditional patient information leaflets (PILs), which also found that while AI-generated text may be simplified, it is not always ideal for lay audiences [28].
We also examined the emotional tone of the responses. Sentiment analysis showed that Gemini’s outputs had a greater emotional range, including more negative words, while ChatGPT’s tone was generally more neutral and slightly more positive overall. Previous studies have reported similar trends, suggesting that Gemini tends to produce more serious or affective language [32]. Our results do not clarify whether patients would perceive one model’s tone as more reassuring and the other as more anxiety-inducing. From a clinical viewpoint, these distinctions may influence patient engagement and trust. Although emotionally expressive language may enhance relatability, it carries the risk of sounding subjective or even dramatic; a more neutral tone, on the other hand, may feel less engaging to patients. This underscores the need for context-sensitive model selection in patient-facing applications. The divergence between word-level (Bing) and sentence-level (sentimentr) analyses highlights an important methodological consideration: medical counselling language often contains terms with inherently negative valence (‘pain’, ‘risk’, ‘complication’) that inflate lexicon-based counts, even when the overall sentence is neutral or reassuring. Context-sensitive approaches, such as sentimentr, may therefore provide a more clinically meaningful estimate of tone.
These tonal differences were also accompanied by qualitative variations in response style. Gemini responses were more conversational and often included additional elements such as illustrative figures, tables, or references, even without being specifically prompted. These differences may reflect the underlying design philosophies of the two models [33]. Although LLMs may generate hallucinations, that is, false information arising from misinterpreted patterns in their training data or internal inconsistencies within the model [34], none of the responses in our study contained inaccurate or potentially harmful content. Traditional online resources provide static, often jargon-heavy content that lacks personalization, can be overwhelming in volume and complexity, and fails to mimic a conversational tone or address emotional nuances [32]. In contrast, although LLMs remain dependent on their training data and prompt structure, they offer conversational responses and provide on-demand information with real-time interactivity [35, 36]. Our findings therefore suggest that LLMs show promise in generating patient-facing medical information; however, they should not be viewed as replacements for direct counselling by healthcare providers. Important nuances, emotional sensitivity, and the ability to tailor information to a patient’s background and concerns are aspects that AI cannot yet replicate reliably. LLMs may thus bridge the gap between impersonal websites and clinician interaction rather than serve as substitutes for personalised counselling.
Limitations
Although our study included a greater number of respondents than prior literature, the sample of anesthesiologists remained modest. The readability metrics applied, while widely used, have not been formally validated in the Indian population, and our evaluation was limited to a single surgical procedure and institution, which may affect generalizability. Patient comprehension and usability were not evaluated and should be addressed in future work. Readability indices were derived through an online tool rather than a reproducible script that strips hyperlinks and formatting before scoring; future work should recompute readability metrics in a fully reproducible pipeline. Ratings were performed solely by anesthesiologists, which may not capture the layperson’s perspective. The reduction of 68 initial questions to 13 relied on expert judgment and readability filtering without direct patient involvement, which may introduce investigator bias; future studies should include patient panels or structured consensus methods (e.g., cognitive interviews or Delphi processes) to strengthen content validity. Lastly, responses were obtained from static versions of the LLMs at a fixed time point, which may not reflect future model performance as these systems evolve.
Conclusion
ChatGPT generated more accurate and complete perioperative anesthesia information for laparoscopic cholecystectomy, while Google Gemini produced responses that were comparatively easier to read and more emotionally expressive, reflecting a trade-off between clinical detail and ease of understanding. No unsafe or misleading information was identified in either model’s output. Within the limitations of this exploratory study (a single surgical procedure, a single-country rater pool, and no patient comprehension testing), these findings suggest that LLMs may serve as supportive tools for patient education when used under expert supervision, but they cannot substitute for personalized counselling.
Data availability
The datasets supporting the conclusions of this article are included within the article and its additional files.
References
Dave M, Patel N. Artificial intelligence in healthcare and education. Br Dent J. 2023;234:761–4.
Reddy A, Patel S, Barik AK, Gowda P. Role of chat-generative pre-trained transformer (ChatGPT) in anaesthesia: merits and pitfalls. Indian J Anaesth. 2023;67(10):942–4.
Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29:1930–40.
Safranek CW, Sidamon-Eristoff AE, Gilson A, Chartash D. The role of large language models in medical education: applications and implications. JMIR Med Educ. 2023;9:e50945.
Segal S, Saha AK, Khanna AK. Appropriateness of answers to common preanesthesia patient questions composed by the large language model GPT-4 compared to human authors. Anesthesiology. 2024;140(2):333–5.
Subramanian T, Araghi K, Amen TB, Kaidi A, Sosa B, Shahi P, et al. Chat generative pretraining transformer answers Patient-focused questions in cervical spine surgery. Clin Spine Surg. 2024;37(6):E278–81.
Lorenzi A, Pugliese G, Maniaci A, Lechien JR, Allevi F, Boscolo-Rizzo P, et al. Reliability of large language models for advanced head and neck malignancies management: a comparison between ChatGPT 4 and Gemini advanced. Eur Arch Otorhinolaryngol. 2024;281(9):5001–6.
Carlà MM, Gambini G, Baldascino A, Giannuzzi F, Boselli F, Crincoli E, et al. Exploring AI-chatbots’ capability to suggest surgical planning in ophthalmology: ChatGPT versus Google Gemini analysis of retinal detachment cases. Br J Ophthalmol. 2024;bjo-2023-325143.
Krittanawong C, Rodriguez M, Kaplin S, Tang WHW. Assessing the potential of ChatGPT for patient education in the cardiology clinic. Prog Cardiovasc Dis. 2023;81:109–10.
Kapoor I, Singh DJ, Prabhakar H, Mahajan C, Chaturvedi A, Pandey S. Role of preoperative anesthesia counseling in the neurosurgical patients: a randomized controlled open-label study. World Neurosurg. 2024;182:1–5.
Lee A, Chui PT, Gin T. Educating patients about anesthesia: a systematic review of randomized controlled trials of media-based interventions. Anesth Analg. 2003;96(5):1424–31.
Laffey JG, Boylan JF. Patient compliance with pre-operative day case instructions. Anaesthesia. 2001;56(9):910.
Max Healthcare. Cholecystectomy (Gallbladder Surgery). Delhi: Max Healthcare; [date unknown]. Available from: https://www.maxhealthcare.in/procedures/gall-bladder-surgery. Accessed 8 Oct 2025.
Asian Heart Institute. Gallbladder Removal Surgery Cost in Mumbai, India. Mumbai: Asian Heart Institute; 2024. Available from: https://asianheartinstitute.org/blog/gallbladder-removal-surgery-cost-in-mumbai-india/. Accessed 29 Oct 2025.
Nanavati Max Super Speciality Hospital. Gallbladder Laparoscopic Surgery. Mumbai: Nanavati Max Super Speciality Hospital; [date unknown]. Available from: https://www.nanavatimaxhospital.org/procedures/gallbladder-laparoscopic-surgery. Accessed 8 Oct 2025.
Flesch RA. New readability yardstick. J Appl Psychol. 1948;32(3):221–33.
Kincaid JP, Fishburne RP Jr, Rogers RL, Chissom BS. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel. Research Branch Report 8-75. Chief of Naval Technical Training, Naval Air Station Memphis; 1975. https://stars.library.ucf.edu/istlibrary/56. Accessed 8 Oct 2025.
Li C, Wang X, Qian L. Exploring syntactic complexity and text readability in an ELT textbook series for Chinese English majors. Sage Open. 2025;15(1).
Coleman M, Liau TL. A computer readability formula designed for machine scoring. J Appl Psychol. 1975;60(2):283–4.
McLaughlin GH. SMOG grading: a new readability formula. J Read. 1969;12(8):639–46.
R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. 2024. Available from: https://www.r-project.org/
Rinker TW. sentimentr: Calculate Text Polarity Sentiment. Rinker; 2023. https://github.com/trinker/sentimentr. Accessed 8 Oct 2025.
Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLoS Digit Health. 2023;2(2):e0000198.
Cheong RCT, Unadkat S, Mcneillis V, Williamson A, Joseph J, Randhawa P, et al. Artificial intelligence chatbots as sources of patient education material for obstructive sleep apnoea: ChatGPT versus Google bard. Eur Arch Otorhinolaryngol. 2024;281(2):985–93.
OpenAI. GPT-4 System Card. OpenAI; 2023. Available from: https://openai.com/research/gpt-4. Accessed 4 Jun 2025.
Google. Gemini 2.0: Level up your apps with real-time multimodal interactions. Google Developers Blog; 2024. Available from: https://developers.googleblog.com/en/gemini-2-0-level-up-your-apps-with-real-time-multimodal-interactions/. Accessed 4 Jun 2025.
Chen S, Kann BH, Foote MB, Aerts HJWL, Savova GK, Mak RH, et al. Use of artificial intelligence chatbots for cancer treatment information. JAMA Oncol. 2023;9(10):1459–62.
Gondode P, Duggal S, Garg N, Lohakare P, Jakhar J, Bharti S, et al. Comparative analysis of accuracy, readability, sentiment, and actionability: artificial intelligence chatbots (ChatGPT and Google Gemini) versus traditional patient information leaflets for local anesthesia in eye surgery. Br Ir Orthopt J. 2024;20(1):183–92.
Joshi A, Kale S, Chandel S, Pal DK. Likert scale: explored and explained. Br J Appl Sci Technol. 2015;7(4):396–403.
Jamieson S. Likert scales: how to (ab)use them. Med Educ. 2004;38(12):1217–8.
Bishop PA, Herron RL. Use and misuse of the likert item responses and other ordinal measures. Int J Exerc Sci. 2015;8(3):297–302.
Gondode PG, Singh R, Mehta S, Singh S, Kumar S, Nayak SS. Artificial intelligence chatbots versus traditional medical resources for patient education on labour epidurals: an evaluation of accuracy, emotional tone, and readability. Int J Obstet Anesth. 2025;61:104302.
Zitter L, Raffo D. Gemini vs. ChatGPT: What’s the difference? TechTarget; 28 Feb 2025. Available from: https://www.techtarget.com/searchenterpriseai/tip/Gemini-vs-ChatGPT-Whats-the-difference. Accessed 4 Jun 2025.
Alkaissi H, McFarlane SI. Artificial hallucinations in chatgpt: implications in scientific writing. Cureus. 2023;15(2):e35179.
Creatix. Is ChatGPT better than Gemini? Medium; 9 Sep 2023. Available from: https://medium.com/@creatix/is-chatgpt-better-than-w-c2f961ca8577. Accessed 4 Jun 2025.
Pagoto S, Nebeker C. How scientists can take the lead in establishing ethical practices for social media research. J Am Med Inform Assoc. 2019;26(4):311–3.
Acknowledgements
Nil.
Funding
Support was provided solely from institutional and/or departmental sources. No funds, grants, or other support were received.
Author information
Contributions
PS, AR, BS, and ARay were involved in study conception and design. AR, PS, BS, RKG, RG, RC, and ARay were involved in the acquisition, analysis, and interpretation of data. ARay, PS, RG, and RC were involved in draft manuscript preparation. All authors were involved in the critical revision of the manuscript for important intellectual content. The final version of the manuscript has been read and approved by all the authors.
Ethics declarations
Ethics approval and consent to participate
The study conformed to the standards of the Declaration of Helsinki and was approved by the Institute Research Committee (AIMS/IRC-study 12/2024).
Consent to participate: As the study did not involve direct participation of patients, formal ethics committee approval was not required. The research was conducted in full compliance with ethical standards, maintaining strict confidentiality and anonymity of all data. Consent by participating anesthesiologists was implied, and a detailed participant information sheet was provided.
Consent for publication
Informed consent has been obtained from all the anesthesiologists participating in this study, including consent for publication.
Competing interests
The authors declare no competing interests and have no relevant financial or non-financial interests to disclose.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Sharma, P., Sidhu, B., Reddy, A. et al. Artificial intelligence in anesthesia: comparison of the utility of ChatGPT v/s google gemini large language models in pre-anesthetic education: content, readability and sentiment analysis. BMC Anesthesiol 25, 574 (2025). https://doi.org/10.1186/s12871-025-03451-x