Abstract
Background
Current prediction models are suboptimal for determining mortality risk in patients with acute pancreatitis (AP); this might be improved by using a machine learning (ML) model. In this study, we aimed to construct an explainable ML model to calculate the risk of mortality in patients with AP admitted to the intensive care unit (ICU) and to compare it with existing scoring systems.
Methods
A gradient-boosting ML (XGBoost) model was developed and externally validated based on two public databases: the Medical Information Mart for Intensive Care (MIMIC, training cohort) and the eICU Collaborative Research Database (eICU-CRD, validation cohort). We compared the performance of the XGBoost model with validated clinical risk scoring systems (APACHE IV, SOFA, and the Bedside Index for Severity in Acute Pancreatitis [BISAP]) by area under the receiver operating characteristic curve (AUC) analysis. The SHAP (SHapley Additive exPlanations) method was applied to explain the predicted outcomes.
Results
The XGBoost model performed better than the clinical scoring systems in correctly predicting the mortality risk of AP patients, achieving an AUC of 0.89 (95% CI: 0.84–0.94). When the sensitivity was set at 100% for death prediction, the model had a specificity of 38%, much higher than the APACHE IV, SOFA, and BISAP scores, which had specificities of 1%, 16%, and 1%, respectively.
Conclusions
This model might improve the identification of very low-risk patients who can be safely managed in a general ward. By making the model explainable, physicians can better understand the reasoning behind its predictions.
Introduction
Acute pancreatitis (AP) is a common gastrointestinal cause of acute hospitalization, with an annual incidence of 34.8/100,000 person-years globally [1]. Clinical outcomes in AP vary broadly, from a self-limiting mild disease course (with rare mortality) in the majority of patients to a severe course complicated by multiple organ failure in approximately 20% of patients (with a mortality of 36–50%) [2,3,4]. Risk stratification is important for triaging patients to the appropriate level of care and for deciding the aggressiveness of intervention. For this reason, approximately 20 scoring systems have been developed and evaluated since 1974 for AP severity assessment and mortality prediction [5], including the commonly used APACHE II [6], Ranson score [7], and Bedside Index for Severity in Acute Pancreatitis (BISAP) [8].
One major limitation of the scoring systems is that they are frequently cumbersome to calculate in clinical practice (such as the APACHE II score) [5], and physicians continue to rely mainly on comorbidities, vital signs, and certain laboratory data to determine severity for triaging and managing patients. This traditional strategy is experience-dependent and might lead to a significant number of AP patients being unnecessarily admitted to the intensive care unit (ICU) for observation.
All current scoring systems use standard statistical methods to identify predictors, and most allocate fixed weights according to the original dataset that was used for model construction. Machine learning (ML) is a strategy that applies computational algorithms to learn from data, with performance improving with experience (i.e., more training data) for executing a specific task. However, several factors have limited studies of ML models in AP, such as small sample sizes, absence of external validation, or absence of head-to-head comparisons with existing risk tools [9,10,11].
Currently, the available evidence on risk assessment scores for patients with AP does not provide clear guidance on how to identify very low-risk patients who might be safely monitored in a general ward rather than in the ICU, thereby enhancing the cost-effectiveness of care. The aim of this study was to develop and validate an ML model to identify very low-risk patients admitted to the ICU for AP, and to perform a head-to-head comparison of its performance against existing clinical risk scores. Since one of the limitations of applying ML models in clinical practice is their "black-box" nature, we used explainable ML methods to provide a transparent interpretation of how each prediction was made.
Methods
Data source and outcome
Data were collected from two sizeable critical care databases, MIMIC (MIMIC-III and MIMIC-IV, training set) and eICU-CRD (validation set), in accordance with the ethical standards of the institutional review board of the Massachusetts Institute of Technology (no. 0403000206) and with the 1964 Helsinki Declaration and its later amendments. As the databases are de-identified, informed consent from patients was not required. MIMIC-IV is the latest version of the MIMIC database and contains comprehensive, high-quality data on 524,520 ICU admissions of 257,366 patients at the Beth Israel Deaconess Medical Center between 2008 and 2019. To maximize the prediction power of the ML model, we also searched for eligible AP patients in the earlier MIMIC-III, covering the period (2001–2008) not included in MIMIC-IV, for model construction. The eICU-CRD comprises 200,859 ICU admissions for 139,367 unique patients admitted between 2014 and 2015 at 208 hospitals throughout the US. The outcome for this study was all-cause in-hospital mortality of patients admitted to the ICU. One author (Dr. MJ) obtained access to both databases and was responsible for data extraction and analysis. The study was reported in accordance with the recommendations of the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) statement [12].
Selection of participants
The inclusion criterion for the study was adult patients (≥ 18 years old) admitted to the ICU with AP. The diagnosis of AP (ICD-9: 5770) requires two of the following three features [4]: (1) abdominal pain (acute onset of persistent epigastric pain, often radiating to the back); (2) serum lipase activity at least three times the upper limit of normal; and (3) characteristic findings on contrast-enhanced computed tomography (CECT) or transabdominal ultrasonography. Patients who died within the first 24 h of their stay were excluded. Initial assessment was performed when patients presented to the ICU.
Data collection
We collected data from the electronic health records, including demographic characteristics, vital signs, laboratory results, treatments, and the Glasgow Coma Scale (GCS), from the first 24 h in the ICU, since: (1) the first 24 h of ICU admission are often considered a critical window for assessing the severity of AP and predicting patient outcomes; (2) many widely used clinical scoring systems, such as the APACHE II, SOFA, and BISAP scores, also rely on data collected within the first 24 h of ICU admission, which allows a more direct comparison between our ML model and these established scoring systems; and (3) the first 24 h of ICU admission are when the most comprehensive data are typically available, ensuring that our model can be applied in real-world clinical settings.
The features were then analyzed for statistical and clinical significance with respect to mortality, and a subset was selected as predictors for the ML model. The mean values in the first 24 h in the ICU were used for the vital signs: respiratory rate (RR), oxygen saturation (SpO2), heart rate (HR), systolic/diastolic blood pressure, mean arterial pressure (MAP), and temperature. For laboratory data, the maximum value in the first 24 h was selected for serum creatinine, blood urea nitrogen (BUN), lipase, total bilirubin, lactate, prothrombin time (PT), international normalized ratio (INR), alanine aminotransferase (ALT), aspartate aminotransferase (AST), alkaline phosphatase (ALP), white blood cells (WBC), potassium, sodium, hematocrit, and glucose. The minimum value in the first 24 h was chosen for serum bicarbonate, total calcium, albumin, platelets, and hemoglobin (Hb). Treatment information covered the use of vasopressors, invasive mechanical ventilation, and renal replacement therapy (RRT) in the first 24 h. Comorbidities were recorded as binary variables for the presence of hepatic cirrhosis, metastatic cancer, diabetes mellitus, chronic obstructive pulmonary disease, congestive heart failure, and renal failure. Because missing data could create bias, variables with > 20% missing values were excluded from further analysis. To reduce the impact of missing data on model construction, we used the KNNImputer (KNN) method to impute variables with less than 20% missing data. We selected KNNImputer for its ability to preserve data structure, handle non-linear relationships, and maintain computational efficiency compared with other imputation techniques (e.g., mean imputation or multiple imputation by chained equations).
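To make the imputation step concrete, the following is a minimal sketch of how it could be implemented with scikit-learn's KNNImputer; the DataFrame `features` and the helper name are illustrative assumptions, not the study's actual code.

```python
# Minimal sketch of the missing-data handling described above (assumed,
# not the authors' code): drop variables with > 20% missingness, then
# impute the remaining gaps with KNNImputer.
import pandas as pd
from sklearn.impute import KNNImputer

def impute_features(features: pd.DataFrame, max_missing: float = 0.20) -> pd.DataFrame:
    # Keep only variables whose fraction of missing values is within the threshold
    keep = features.columns[features.isna().mean() <= max_missing]
    # Fill remaining gaps from the k nearest patients in feature space
    imputer = KNNImputer(n_neighbors=5)
    imputed = imputer.fit_transform(features[keep])
    return pd.DataFrame(imputed, columns=keep, index=features.index)
```

Note that KNNImputer operates on numeric matrices, which suits the 24-h vital-sign and laboratory aggregates and the binary comorbidity flags described above.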
Study design
The MIMIC dataset was used to train a boosted ensemble model (XGBoost, using the open-source XGBoost library) and to conduct internal validation using tenfold cross-validation: the MIMIC dataset was divided into 10 folds, and each fold in turn served as the internal validation set, with the remaining nine folds (90%) as the training set. Cross-validation is considered a robust strategy for assessing an ML model prior to external validation and helps maximize its potential prediction performance [13, 14]. External validation was performed on the eICU-CRD cohort.
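As an illustration of this scheme, the sketch below runs stratified tenfold cross-validation of an XGBoost classifier on synthetic stand-in data; the hyperparameter values are placeholders, not those reported in Supplementary Table 2.

```python
# Sketch of the internal-validation scheme (synthetic stand-in data;
# hyperparameters are placeholders, not those of the published model).
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from xgboost import XGBClassifier

# ~14% positive rate mimics the training-cohort mortality
X, y = make_classification(n_samples=1782, n_features=30,
                           weights=[0.86], random_state=0)

model = XGBClassifier(n_estimators=200, max_depth=4,
                      learning_rate=0.1, eval_metric="logloss")
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
print(f"10-fold AUC: {aucs.mean():.3f} +/- {aucs.std():.3f}")
```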
For AP patients, using risk scores in clinical practice to identify very low-risk patients who might be safely monitored in a general ward, or even managed as outpatients, is rarely discussed. Achieving high sensitivity for death prediction has an obvious advantage: false negatives need to be very rare so that patients who will deteriorate or die receive the necessary intensive care. Therefore, to evaluate the clinical utility of the ML model, we chose the low-risk cutoff value for the clinical risk tools and the ML model by setting the sensitivity at 100% (or nearly 100% if none reached 100%) for death prediction in the external validation cohort, and compared the specificity of the different models [15, 16]. A higher specificity indicates that the tool would identify a greater proportion of patients presenting with AP who could be safely monitored in a step-down unit.
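A simple way to implement this cutoff rule is sketched below: the threshold is placed at the lowest predicted risk among patients who died, which guarantees 100% sensitivity, and the specificity is then read off among survivors. Here `y_true` and `y_prob` are hypothetical stand-ins for the validation outcomes and predicted risks.

```python
# Sketch of the cutoff-selection rule described above (names assumed).
import numpy as np

def specificity_at_full_sensitivity(y_true: np.ndarray, y_prob: np.ndarray):
    # The highest cutoff with 100% sensitivity sits at the smallest
    # predicted risk among patients who died
    threshold = y_prob[y_true == 1].min()
    low_risk = y_prob < threshold          # predicted "very low risk"
    specificity = low_risk[y_true == 0].mean()
    return threshold, specificity

# Toy data: deaths at 0.35 and 0.50; one survivor above the cutoff
y_true = np.array([0, 0, 1, 0, 1, 0, 0, 0])
y_prob = np.array([0.02, 0.40, 0.35, 0.04, 0.50, 0.08, 0.30, 0.01])
thr, spec = specificity_at_full_sensitivity(y_true, y_prob)
print(f"cutoff {thr:.2f} -> specificity {spec:.0%} at 100% sensitivity")
```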
Prior to choosing and optimizing the XGBoost model, we performed rigorous exploratory analyses of logistic regression, random forest, and k-nearest neighbor (kNN) models. Separate models were developed with hyperparameter tuning to optimize their respective performance, and all models underwent external validation. The preliminary findings suggested that the XGBoost model performed best on the eICU-CRD validation cohort. XGBoost is a scalable gradient tree boosting algorithm that allows for fast computation [17]. Another advantage is that this algorithm handles missing data effectively, whereas for the other ML algorithms we had to perform suitable data imputations, and the imputation strategies might affect the comparisons between models. Thus, XGBoost was preferred over the other ML models for this task.
Model interpretation
To better interpret the predictions of the XGBoost model, we employed the SHapley Additive exPlanations (SHAP) method [18], which sums the attributions of all variables to approximate each patient's predicted outcome.
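A minimal sketch of the SHAP workflow is shown below, reusing the `model`, `X`, and `y` names from the cross-validation sketch above; the fit and the feature index passed to the dependence plot are illustrative.

```python
# Sketch of the SHAP computation for a fitted tree model (names assumed).
import shap

model.fit(X, y)                              # fit on the full training data
explainer = shap.TreeExplainer(model)        # exact attributions for tree ensembles
shap_values = explainer.shap_values(X)       # one value per patient per feature

shap.summary_plot(shap_values, X, max_display=25)  # global ranking, as in Fig. 2
shap.dependence_plot(0, shap_values, X)            # one feature's effect, as in Supplementary Fig. 2
```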
Statistical analysis
Continuous covariates are presented as median (interquartile range [IQR]) or mean (standard deviation [SD]) based on the normality of the data distribution, and were analyzed using either the Student's t-test or the Mann-Whitney U-test as appropriate. Categorical variables are reported as number and percentage and were analyzed by the Chi-squared test or Fisher's exact test. We used a stepwise logistic regression model to select variables that were predictive of mortality. Both forward selection and backward elimination were applied, testing at each step for parameters to be included or excluded. The selection criterion was the Akaike information criterion (AIC).
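For illustration, a simplified forward-selection pass by AIC could look like the sketch below (backward elimination mirrors it in reverse); the data here are synthetic, and this is not the authors' actual selection code.

```python
# Sketch of greedy forward selection of logistic-regression predictors
# by AIC (synthetic data; illustrative only).
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select_by_aic(X, y):
    selected = []
    best_aic = sm.Logit(y, np.ones(len(y))).fit(disp=0).aic  # intercept-only baseline
    improved = True
    while improved:
        improved = False
        for col in X.columns.difference(selected):
            design = sm.add_constant(X[selected + [col]])
            aic = sm.Logit(y, design).fit(disp=0).aic
            if aic < best_aic:               # accept only AIC-lowering additions
                best_aic, best_col = aic, col
                improved = True
        if improved:
            selected.append(best_col)
    return selected

rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 6)), columns=[f"x{i}" for i in range(6)])
y = (X["x0"] + 0.5 * X["x1"] + rng.normal(size=300) > 0).astype(int)
print(forward_select_by_aic(X, y))           # expect x0 and x1 to be selected
```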
For the ML models, the XGBoost version 0.8 and scikit-learn version 2.1 packages were employed to develop models and tune hyperparameters in Python version 3.6 [17, 19]. Our primary metric was the area under the receiver operating characteristic curve (AUC), comparing the ML models with the SOFA, APACHE IV, and BISAP risk scores. The nonparametric DeLong test was used to compare AUCs [20]. The model's performance in assigning patients to risk categories (e.g., low-risk vs. high-risk) was evaluated via sensitivity and specificity. Calibration of the model was assessed using the calibration curve, the Hosmer-Lemeshow test, and the Brier score.
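As an illustration of these calibration metrics, the sketch below computes the Brier score, a calibration curve, and a hand-rolled Hosmer-Lemeshow statistic on synthetic predictions (scikit-learn has no Hosmer-Lemeshow routine, so the decile-of-risk test is written out explicitly; all names are assumptions).

```python
# Sketch of the calibration assessment on synthetic predictions.
import numpy as np
from scipy.stats import chi2
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

def hosmer_lemeshow(y_true, y_prob, n_bins=10):
    """Hosmer-Lemeshow p-value over decile-of-risk groups."""
    edges = np.quantile(y_prob, np.linspace(0, 1, n_bins + 1))
    groups = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    stat = 0.0
    for g in range(n_bins):
        mask = groups == g
        n = mask.sum()
        if n == 0:
            continue
        obs, exp = y_true[mask].sum(), y_prob[mask].sum()
        stat += (obs - exp) ** 2 / (exp * (1 - exp / n))
    return 1 - chi2.cdf(stat, df=n_bins - 2)

rng = np.random.default_rng(0)
p_val = rng.uniform(0.01, 0.6, 500)          # predicted risks (synthetic)
y_val = rng.binomial(1, p_val)               # outcomes drawn consistent with the risks

print(f"Brier score: {brier_score_loss(y_val, p_val):.3f}")   # lower is better
frac_pos, mean_pred = calibration_curve(y_val, p_val, n_bins=10)
print(f"Hosmer-Lemeshow p = {hosmer_lemeshow(y_val, p_val):.3f}")
```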
Results
Baseline characteristics
As shown in Fig. 1, 1782 patients with AP in MIMIC (MIMIC-III: 942 and MIMIC-IV: 840) were included in the training cohort. A total of 250 patients died in hospital (14.0%, 95% CI: 12.4–15.7%). A cohort of 507 AP patients in eICU-CRD was included as the external validation set.
The continuous and categorical variables (Table 1) were analyzed for association with mortality in both the MIMIC and eICU-CRD cohorts. Of the 1782 patients included in the training set, the median age was 67.1 years (IQR, 55.3–80.0) in the non-survivors, significantly older than that of the survivors (61.0 years [IQR, 48.0–77.2]) (P < 0.001). The largest proportion of patients (46.1%) were admitted to the ICU from the emergency room, whereas 35.3% were transferred from the general ward; the remaining admissions were transfers from other hospitals.
The median age in the eICU-CRD group was 61.0 years (IQR, 48.0–77.2) for non-survivors compared with 52.0 years (IQR, 41.0–63.0) for survivors (P = 0.012). Comparisons of the training and external validation cohorts are shown in Supplementary Table 1. The mortality rate was 6.7% (95% CI: 4.7–9.2%) in the eICU-CRD group.
Construction of the XGBoost model
For the training cohort, the variables with the greatest missingness were albumin (N = 537), lactate (N = 1012), and lipase (N = 1059); these were excluded from further analysis. The features included in the ML model development are summarized in Table 2.
The hyperparameters used to train our XGBoost model are listed in Supplementary Table 2. With these hyperparameters, the training process is presented in Supplementary Fig. 1. As shown, the model performed best at the 24th training round, achieving the maximum AUC and minimum log-loss in the eICU-CRD validation cohort. To identify the variables that influenced the model the most, we depicted the SHAP summary plot (Fig. 2), presenting the top 25 variables of the prediction model. This plot depicts the associations between variables' values and SHAP values in the training dataset. According to the XGBoost model, the higher the SHAP value of a variable, the more likely the patient is to die in hospital. The SHAP dependence plot (Supplementary Fig. 2) can also be applied to understand how a specific feature affects the prediction outcome; it visualizes how a variable's attributed importance changes as its value varies. SHAP values for a feature exceeding zero imply an increased risk of mortality.
SHAP summary plot of the top 25 features of the XGBoost model. The higher the SHAP value of a variable, the higher the probability of mortality. A dot is plotted for each patient's attribution value for each feature, so each patient contributes one dot per variable. Dots are colored according to the values of the variables for the respective patient and accumulate vertically to depict density. Red represents higher variable values, and blue represents lower variable values
Performance of the machine learning model
The ML model was tested on the eICU-CRD cohort. For predicting in-hospital death, the XGBoost model performed better than the other ML models, achieving an AUC of 0.89 (95% CI: 0.84–0.94) versus a random forest AUC of 0.80 (95% CI: 0.73–0.87) and a kNN AUC of 0.63 (95% CI: 0.53–0.72) (P = 0.01 and P < 0.001, respectively; Table 3a). Compared with the APACHE IV and SOFA scores, the AUC of XGBoost was higher, but the difference was not significant. The BISAP achieved an AUC of only 0.67 (95% CI: 0.58–0.76) for predicting mortality in this cohort. Figure 3 displays the ROC curves for these models in the external validation cohort. The calibration curve (Supplementary Fig. 3) and Hosmer-Lemeshow test (P = 0.342) for the eICU-CRD validation cohort demonstrated high accuracy of the ML model. In addition, the model achieved a Brier score of 0.08, further indicating strong calibration (lower values denote better agreement between predicted and observed outcomes).
Identifying very low-risk patients
To identify very low-risk patients, we set the sensitivity of the ML models and clinical risk scores at 100% for predicting death in the validation cohort and then compared the models in terms of specificity. The XGBoost model performed better than the APACHE IV, SOFA, and BISAP risk scores in correctly identifying patients at very low risk of death (true negatives), achieving a specificity of 38% at a sensitivity of 100%, much higher than the APACHE IV, SOFA, and BISAP scores, which had specificities of 1%, 16%, and 1%, respectively, at a sensitivity of 100% (Table 3b).
Discussion
In patients with AP, an XGBoost model derived from a large cohort predicted mortality better than the currently used clinical risk scores. The ML model relies on capturing important feature interactions to generate predictions, which could improve the ability to identify very low-risk AP patients who can be safely monitored on a general medical ward. Importantly, the model increases the number of detectable very low-risk AP patients more than two-fold compared with the best-performing clinical risk scale currently available (the SOFA score).
AP is a potentially deadly disease, yet clinical outcomes vary broadly among cases. Thus, it is important to predict patient outcomes to help clinicians choose the appropriate level of care (i.e., transfer to a general ward or continued monitoring in an ICU), determine the aggressiveness of treatment, and counsel patients on prognosis [5]. Although several guidelines recommend the application of certain scoring systems in clinical decision-making [21,22,23], it remains uncertain how the results should guide therapy; in particular, no studies to date have addressed the cost-effectiveness of these scoring systems. Currently, triaging AP patients still relies heavily on the doctor's clinical assessment, which has raised concern about unnecessary ICU admissions and cost utilization. Our model's performance at identifying low-risk patients is promising; it could be integrated into clinical workflows (e.g., embedded in electronic health record systems to provide real-time risk assessments) so that a patient with AP who is predicted to be at very low risk might be treated in a general ward rather than in the ICU.
The ICU provides an optimal environment for supportive care, with the ability to monitor and manage severe volume depletion and organ insufficiency. However, because intensive care is a limited resource, it is valuable to screen for very low-risk patients who could be safely managed in a general ward, so that truly severe cases receive the appropriate intensive care. In this study, we compared our ML model's prediction performance with the APACHE IV, SOFA, and BISAP scores and found that the model was able to identify very low-risk cases (setting sensitivity at 100%) with 38% specificity, compared with 1% specificity for the APACHE IV score, 16% for the SOFA score, and 1% for the BISAP score. This means that the model could screen out 38% of patients who are at very low risk of mortality and may not need to be admitted to the ICU for monitoring.
Explainability
A key factor in determining whether ML model predictions will be used by clinicians for clinical decision-making is whether they can understand how the prediction was made. The higher the interpretability of a model, the easier it is for clinicians to comprehend it and make an appropriate clinical decision accordingly. In highly interpretable models, such as sparse linear models, the coefficient of each feature indicates how much weight has been assigned to that factor; in black-box models like XGBoost, by contrast, it is not apparent which variables contribute to the decision for a specific case. To understand why the model came to a certain prediction, we used the SHAP method [24, 25] to explain the individual predictions of three patients from the hold-out eICU-CRD cohort (Fig. 4).
SHAP explanation force plot of the ML model for 3 patients from the eICU-CRD validation cohort. ALT, alanine aminotransferase; BUN, blood urea nitrogen; DBP, diastolic blood pressure; GCS, Glasgow Coma Scale; HCT, hematocrit; HR, heart rate; INR, international normalized ratio; K+, potassium; MAP, mean arterial pressure; Na+, sodium; PT, prothrombin time; RR, respiratory rate; SBP, systolic blood pressure; SCr, serum creatinine; SpO2, oxygen saturation; WBC, white blood cell
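For reference, an individual explanation like those in Fig. 4 can be produced with SHAP's force plot; the sketch below reuses the `explainer` from the earlier sketch and assumes a hypothetical validation feature matrix `X_val`.

```python
# Sketch of a single-patient explanation, as in Fig. 4 (names assumed).
import shap

shap_values_val = explainer.shap_values(X_val)
shap.force_plot(
    explainer.expected_value,   # baseline: the average model output (log-odds)
    shap_values_val[0, :],      # attributions for one patient
    X_val[0, :],                # that patient's feature values
    matplotlib=True,            # render as a static figure
)
```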
ML explainability results for three patients
Patient 1.
This is a 64-year-old female patient who was admitted to the ICU for AP. She developed septic shock and ARDS while in the ICU and was intubated during this hospitalization. Her predicted probability of death was 9%, compared with the baseline of 7.8%. The factors identified by the ML model as predicting higher mortality for this patient included decreased systolic blood pressure (95 mmHg), a low GCS score (8), borderline MAP (70 mmHg), sodium (145 mmol/L), and HR (91 beats/min). The model's predicted outcome for this case was death, and the actual outcome was also death (true positive). Notably, the model predicted the mortality risk to be 9% based on information from the first 24 h after admission. The patient's length of stay was 118 h (approximately 5 days).
From a clinician's point of view, ICU observation is justified for this patient, considering the low systolic blood pressure and GCS score around the time of admission, in addition to the development of septic shock later in the ICU. Supplementary Fig. 4 presents a visualization of the entire ICU stay for this patient. The patient's GCS score dropped to 3 at 28 h after admission (at which point she was intubated), and the MAP fluctuated broadly throughout the ICU stay.
Patient 2.
This is a 23-year-old male patient with no comorbidities who presented with AP. He neither underwent intubation or RRT nor developed an infection during this hospitalization. The predicted probability of death by the ML model was 1%, compared with the baseline of 7.8%. The model detected an RR of 26 breaths/min, glucose of 196 mg/dL, ALT of 32 U/L, and potassium of 5.5 mmol/L as risk factors for mortality, whereas his young age of 23 years, normal serum creatinine of 0.53 mg/dL, BUN of 7 mg/dL, and systolic blood pressure of 130 mmHg decreased the predicted risk of mortality. The model's predicted outcome was survival, in line with the final outcome of survival (true negative).
Because we aimed to use the ML model to screen for very low-risk patients, this is the group of interest for this study. This patient was observed in the ICU for 57 h. However, judging from the clinical course of the entire ICU stay (Supplementary Fig. 5), this patient may not have needed an ICU level of care and might have been safely dispositioned to a general unit.
Patient 3.
This is a 40-year-old diabetic woman with cirrhosis and a history of liver transplantation, currently taking immunosuppressive medications, who presented with AP and upper gastrointestinal bleeding. The predicted probability of death by the ML model was 11%, compared with the baseline of 7.8%. The mortality risk factors detected by the model included a diastolic blood pressure of 51 mmHg, systolic blood pressure of 100 mmHg, glucose of 221 mg/dL, ALP of 162 U/L, the need for RRT and vasopressors, and the comorbidity of liver cirrhosis. The patient's relatively young age of 40 years, total bilirubin of 0.7 mg/dL, and platelets of 148 K/µL partly offset the mortality risk predicted by the model. The model correctly predicted the outcome as death; she died ten days after admission to the ICU (true positive). The visualization of her entire ICU stay is shown in Supplementary Fig. 6; her HR and blood pressure varied broadly during the ICU hospitalization.
The explainability of the ML model can help clinicians understand why the model reached its conclusion. From a clinician's point of view, considering the blood pressure in the presence of vasopressor use, the high glucose, the need for RRT, and the cirrhosis, this patient should be monitored in the ICU.
In clinical practice, each patient would receive a prediction from the ML algorithm, as in Fig. 4; clinicians can then see which features led to the conclusion and evaluate its reasonableness based on their clinical experience.
Strengths
First, unlike prior reports [9, 11] on ML models in AP, we trained the model on a large cohort and performed external validation to assess its generalizability in an independent patient group. Second, we used the SHAP method to show how the prediction was made by the model for specific instances; showing the reasoning behind the predictions makes the model more acceptable to clinicians in clinical practice. Third, the XGBoost model outperformed existing clinical scoring systems in predicting mortality risk in AP patients, particularly in identifying low-risk patients who may not require ICU-level care. The model's strong calibration and consistent performance across patient subgroups (data not shown) further support its potential for clinical application.
Limitations
First, despite the improvement in performance, the specificity of 38% at the high-sensitivity cutoff is less than optimal. This means that many patients who will not deteriorate or die are not identified as low risk and are still unnecessarily monitored in the ICU. Including additional variables or using other algorithms (e.g., ensemble methods) might further optimize the model. Nevertheless, the improvement in specificity for the XGBoost model compared with currently used clinical scores would potentially enhance the cost-effectiveness of ICU care for patients with AP. The trade-off between sensitivity and specificity should be considered flexibly based on the availability of ICU resources. Second, the etiology of AP was not available in the database. In addition, the data were retrospectively collected, and several features had some degree of missingness. This may affect the stability and precision of the prediction, leading to false positives or negatives that might affect patient care. Further improvement of the model by adding prospective high-quality data would be necessary for its application in the clinical setting. Third, although MIMIC and eICU-CRD are independent cohorts, whether the ML model would perform well in other populations (e.g., outside the US) needs to be validated in the future. Fourth, including additional factors (e.g., chronic kidney disease or obesity) or more granular comorbidity classifications could improve the model's predictive accuracy; however, limited by the database, we included only common comorbidities in the model construction. Finally, expanding the application of our XGBoost model to emergency departments and initial assessments might improve AP care across the healthcare continuum. By providing actionable risk stratification at the earliest stages, the model could enhance triage accuracy, reduce unnecessary ICU utilization, and ultimately save lives. However, data availability in other clinical settings may impair the model's generalizability.
Conclusion
In summary, our machine learning model proved superior to existing prediction scores for mortality prediction in AP. The use of most previous ML models has been limited in clinical practice, mainly due to their lack of explainability in the clinical setting. Our study applied an XGBoost model with the SHAP method in a critically ill AP cohort to predict mortality and identified the variables that enabled the model to make a reasonable prediction, thus making it more transparent and reliable. In the future, prospective data are warranted to refine the model so that it can be integrated into clinical decision support systems.
Data availability
Data are fully available at https://mimic.mit.edu and https://eicu-crd.mit.edu/. The datasets generated and analyzed during the current study are available from the corresponding author on reasonable request. Data are provided within the manuscript or supplementary information files.
Abbreviations
- AP: Acute pancreatitis
- ALT: Alanine aminotransferase
- AST: Aspartate aminotransferase
- ALP: Alkaline phosphatase
- BUN: Blood urea nitrogen
- BISAP: Bedside Index for Severity in Acute Pancreatitis
- eICU-CRD: eICU Collaborative Research Database
- GCS: Glasgow Coma Scale
- HR: Heart rate
- INR: International normalized ratio
- IQR: Interquartile range
- ICU: Intensive care unit
- ML: Machine learning
- MAP: Mean arterial pressure
- MIMIC: Medical Information Mart for Intensive Care
- PT: Prothrombin time
- RR: Respiratory rate
- RRT: Renal replacement therapy
- SpO2: Oxygen saturation
- SCr: Serum creatinine
- SHAP: SHapley Additive exPlanations
- WBC: White blood cells
References
Li C, Jiang M, Pan C, Li J, Xu L. The global, regional, and national burden of acute pancreatitis in 204 countries and territories, 1990–2019. BMC Gastroenterol. 2021;21(1):332.
Boxhoorn L, Voermans RP, Bouwense SA, Bruno MJ, Verdonk RC, Boermeester MA, et al. Acute pancreatitis. Lancet. 2020;396(10252):726–34.
Buter A, Imrie CW, Carter CR, Evans S, McKay CJ. Dynamic nature of early organ dysfunction determines outcome in acute pancreatitis. Brit J Surg. 2002;89(3):298–302.
Banks PA, Bollen TL, Dervenis C, Gooszen HG, Johnson CD, Sarr MG, et al. Classification of acute pancreatitis-2012: revision of the Atlanta classification and definitions by international consensus. Gut. 2013;62(1):102–11.
Di MY, Liu H, Yang ZY, Bonis PA, Tang JL, Lau J. Prediction models of mortality in acute pancreatitis in adults: A systematic review. Ann Intern Med. 2016;165(7):482–90.
Wilson C, Heath DI, Imrie CW. Prediction of outcome in acute pancreatitis: a comparative study of APACHE II, clinical assessment and multiple factor scoring systems. Brit J Surg. 1990;77(11):1260–4.
Ranson JH, Pasternack BS. Statistical methods for quantifying the severity of clinical acute pancreatitis. J Surg Res. 1977;22(2):79–91.
Wu BU, Johannes RS, Sun X, Tabak Y, Conwell DL, Banks PA. The early prediction of mortality in acute pancreatitis: a large population-based study. Gut. 2008;57(12):1698–703.
Xu F, Chen X, Li C, Liu J, Qiu Q, He M, et al. Prediction of multiple organ failure complicated by moderately severe or severe acute pancreatitis based on machine learning: a multicenter cohort study. Mediat Inflamm. 2021;2021:5525118.
Qiu Q, Nian Y, Guo Y, Tang L, Lu N, Wen L, et al. Development and validation of three machine-learning models for predicting multiple organ failure in moderately severe and severe acute pancreatitis. BMC Gastroenterol. 2019;19(1):118.
Pearce CB, Gunn SR, Ahmed A, Johnson CD. Machine learning can improve prediction of severity in acute pancreatitis using admission values of APACHE II score and C-Reactive protein. Pancreatology. 2006;6(1–2):123–31.
Collins GS, Reitsma JB, Altman DG, Moons KG. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD statement. BMJ. 2015;350:g7594.
Motwani M, Dey D, Berman DS, Germano G, Achenbach S, Al-Mallah MH, et al. Machine learning for prediction of all-cause mortality in patients with suspected coronary artery disease: a 5-year multicentre prospective registry analysis. Eur Heart J. 2017;38(7):500–7.
Rotondano G, Cipolletta L, Grossi E, Koch M, Intraligi M, Buscema M, et al. Artificial neural networks accurately predict mortality in patients with nonvariceal upper GI bleeding. Gastrointest Endosc. 2011;73(2):218–26.
Shung DL, Au B, Taylor RA, Tay JK, Laursen SB, Stanley AJ, et al. Validation of a machine learning model that outperforms clinical risk scoring systems for upper Gastrointestinal bleeding. Gastroenterology. 2020;158(1):160–7.
Deshmukh F, Merchant SS. Explainable machine learning model for predicting GI bleed mortality in the intensive care unit. Am J Gastroenterol. 2020;115(10):1657–68.
Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
Lundberg SM, Lee SI. A unified approach to interpreting model predictions. In: Advances in Neural Information Processing Systems. 2017.
Pedregosa F, Varoquaux G, Granfort A, et al. Scikit-learn: machine learning in Python. JMLR. 2011;12:2825–30.
DeLong ER, DeLong DM, Clarke-Pearson DL. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics. 1988;44(3):837–45.
Forsmark CE, Baillie J. AGA Institute technical review on acute pancreatitis. Gastroenterology. 2007;132(5):2022–44.
Besselink M, Santvoort HV, Freeman M, et al. IAP/APA evidence-based guidelines for the management of acute pancreatitis. Pancreatology. 2013;13(4 Suppl 2):e1–15.
Leppäniemi A, Tolonen M, Tarasconi A, Segovia-Lohse H, Gamberini E, Kirkpatrick AW, et al. 2019 WSES guidelines for the management of severe acute pancreatitis. World J Emerg Surg. 2019;14:27.
Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Proc Adv Neural Inf Process Syst. 2017;4768–77.
Strumbelj E, Kononenko I. Explaining prediction models and individual predictions with feature contributions. Knowl Inf Syst. 2014;41:647–65.
Acknowledgements
Not applicable.
Funding
This work was supported by the National Natural Science Foundation of China (82000479), the Natural Science Foundation of Zhejiang Province (LMS25H180003), and the Fundamental Research Funds for the Central Universities (2022ZFJH003).
Author information
Authors and Affiliations
Contributions
M.J. and C.L.L. conceived the study idea and design and drafted the manuscript. M.J. obtained funding and performed the acquisition of data. C.L.L. and X.P.W. performed the analysis and interpretation of data. M.J. and X.P.W. performed critical revision of the manuscript for important intellectual content and the statistical analysis. M.J. and C.L.L. provided administrative, technical, or material support.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Since the study was an analysis of a third-party, anonymized, publicly available database with pre-existing institutional review board (IRB) approval, IRB approval from The First Affiliated Hospital of Zhejiang University was waived.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Jiang, M., Wu, Xp., Lin, Xc. et al. Explainable machine learning model for predicting acute pancreatitis mortality in the intensive care unit. BMC Gastroenterol 25, 131 (2025). https://doi.org/10.1186/s12876-025-03723-3