Introduction

Lumbar spinal stenosis (LSS) is one of the most common degenerative diseases of the spine, often leading to chronic pain and gait disturbances that significantly impair quality of life [1,2,3]. When symptoms progress, surgical decompression with or without spinal fusion becomes necessary to alleviate neurological deficits and restore function [1,2,3,4]. LSS is particularly prevalent in the elderly population, with its incidence increasing with age due to progressive degenerative changes in the spine [1, 5]. Elderly patients often present with multiple comorbidities and reduced physiological reserves, posing a significant risk of perioperative complications, including surgical site infections, cardiopulmonary events, and prolonged recovery times [5,6,7,8,9,10]. Therefore, accurate risk stratification in this population is crucial for optimizing surgical decision-making and improving postoperative outcomes.

Various scoring systems have been utilized to comprehensively assess the health status of elderly patients and predict postoperative complications and transfusion risk. Commonly used tools include the American Society of Anesthesiologists Physical Classification System (ASA classification), the American College of Surgeons National Surgical Quality Improvement Program (ACS-NSQIP) surgical risk calculator, and the Charlson Comorbidity Index (CCI), all of which provide structured risk assessments based on preoperative patient characteristics [11,12,13]. Particularly for patients undergoing surgical intervention of LSS, comprehensive geriatric assessment (CGA) has been proposed to evaluate the risk of early postoperative complications [5]. Unlike traditional methods, CGA aims to integrate multiple dimensions of health, including frailty, activities of daily living (ADL), mini-mental state examination (MMSE), nutritional status, and medication burden [9, 10, 14,15,16].

However, despite their clinical effectiveness, the aforementioned methods rely on a limited number of predefined variables and simple statistical models, which may fail to adequately account for the complexity of health conditions in elderly patients. Contrastingly, machine learning (ML) offers a data-driven approach that can integrate numerous clinical factors and capture intricate, nonlinear relationships among them [17, 18]. Among various techniques, decision tree-based algorithms, including random forests and gradient boosting machines have gained attention due to their robustness and interpretability [19,20,21,22].

Recently, a growing body of research has demonstrated the effectiveness and potential of ML in the context of geriatric surgery. For instance, in gastrointestinal cancer surgeries, ML models have outperformed traditional bedside scoring systems in predicting postoperative morbidity and mortality [23]. Similarly, several studies have reported the superior performance of ML algorithms in predicting delirium, a common postoperative complication among elderly patients [24, 25]. Furthermore, there have been efforts to apply ML to forecast cardiovascular events in non-cardiac surgeries, as well as to predict the onset of acute kidney injury [26, 27]. Collectively, these findings highlight the substantial potential of ML as a powerful tool for personalized risk assessment and decision-making in elderly patients, who often present with diverse and complex clinical profiles. In the context of spinal surgery, ML have been applied to various prediction tasks, including estimating prolonged length of hospital stay, predicting the likelihood of blood transfusion after adult spinal deformity surgery, and forecasting changes in patient-reported outcomes such as the SRS-22R questionnaire [28,29,30,31]. However, most of the preliminary studies were not focused on the geriatric population, and we aimed to specify the cohort as the elderly are more vulnerable to adverse events after surgery.

Using a prospectively collected cohort of elderly patients undergoing surgery for LSS, we develop and evaluate ML models predicting early postoperative complications and the need for blood transfusion during the hospitalization period. By leveraging ML techniques, we seek to improve the accuracy of perioperative risk assessment compared to traditional methods, particularly the ACS-NSQIP scoring system. Furthermore, Shapley additive explanations (SHAP) analysis is performed to identify key risk factors contributing to early postoperative outcomes and, if possible, provide clinically actionable insights [32].

Materials and methods

Patient cohort

Patients aged 65 years or older who underwent any type of elective surgery for LSS between June 2015 and December 2018 were prospectively enrolled. Those treated for spinal deformities or stenosis at non-lumbar levels, as well as patients requiring emergency operations, were excluded to ensure a more homogeneous study population. The study was approved by the Institutional Review Board (IRB) of our organization, and informed consent was obtained from all participants.

Data acquisition

A wide range of patient characteristics were collected one day before surgery. First, basic demographic profiles, including age, sex, height, weight, and body mass index (BMI), were recorded. Additionally, various scoring systems for evaluating the medical and functional status of geriatric patients were employed. Activities of daily living (ADL), instrumental activities of daily living (IADL), and economic dependency were assessed, with nonzero values indicating functional dependency [14]. Short-form geriatric depression scale (GDS, range: 0–15) and mini-mental state examination (MMSE, range: 0–30) were used to assess the patient’s psychological state, with a GDS score of 5 or greater indicating depression and MMSE score of 23 or below suggesting cognitive impairment [15, 33]. Mini nutritional assessment (MNA, range: 0–30) was performed to assess the patient’s general nutritional status, with a higher score indicating more adequate nutrition [16]. As comorbidity status is crucial in evaluating elderly patients, CCI and the number of medications were also recorded [13]. Frailty measures, including the modified frailty index (mFI), timed-up-and-go test (TUGT) time in seconds, fall history within 6 months before surgery, and fracture due to fall within 6 months before surgery, were assessed [9].

Preoperative features such as ASA classification and laboratory test results were collected one day prior to surgery [12]. The laboratory tests included leukocyte count, hemoglobin, blood glucose, blood urea nitrogen (BUN), serum creatine, estimated glomerular filtration rate (eGFR), total cholesterol, serum protein, serum albumin, aspartate aminotransferase (AST), and alanine aminotransferase (ALT). Furthermore, features relevant to surgical plans, including spine fusion and the number of involved vertebra levels, were also considered as inputs. In total, 48 variables were collected from each patient. Comprehensive interviews and complete retrieval of test results were conducted for every individual, taking approximately one hour per patient, resulting in no missing data. Min-max scaling was applied to all of the values into a range between 0 and 1.

In this study, several outcome measures were predicted. First, postoperative complications occurring within 30 days were evaluated during the hospitalization period and at outpatient visits 1 month after surgery. These complications were categorized into general and surgical complications, with events classified as Clavien-Dindo grade 2 (complication requiring pharmacological treatment with drugs other than analgesics, antiemetics, antipyretics, diuretics, and electrolytes) or higher considered positive for postoperative complications [34]. Additionally, the need for transfusion during hospitalization was recorded as a binary variable, capturing any instance of red blood cell (RBC), fresh frozen plasma (FFP), or platelet transfusion. As a result, three outcome variables were studied in this research: (1) general or surgical postoperative complications within 30 days (2) general postoperative complications within 30 days (3) the need for transfusion within the hospitalization period.

Model development

The patient cohort was randomly split into train-validation and test sets in a roughly 4:1 ratio, to ensure sufficient data for model training while preserving an independent set for final evaluation. Although the dataset was imbalanced, stratification and oversampling were not performed, as the same dataset was intended to be used for multiple tasks. Instead, to address the class imbalance during model training, a class weight ranging from 3 to 10 was assigned to the positive class, depending on the specific task and outcome distribution. To assess potential differences in input and output variables between the train-validation and test sets, statistical analyses were performed. Continuous variables were compared using independent t-tests, while categorical variables were analyzed using chi-square tests. P-values below 0.05 were considered statistically significant.

In this study, two types of predictive models were developed. The Complex model utilized all available input features collected, whereas the Compact model was designed to use a simplified set of 28 clinically relevant features, selected to eliminate redundancy in the input space and promote applicability for future studies. The predictive performance of both models was compared against the probability of “serious complication” as estimated by the ACS-NSQIP score, a widely recognized benchmark for predicting complications in non-cardiac surgery [11].

For each of the three outcome variables, five different machine learning algorithms were trained: logistic regression (LogReg), support vector machine (SVM) with radial basis function kernels, random forest classifier (RF), extreme gradient boosting (XGBoost) classifier, and light gradient boost machine (LightGBM) [20,21,22]. To maximize the utilization of available data, 5-fold cross-validation was applied to the train-validation set. Models were trained to minimize binary cross-entropy loss, and hyperparameter tuning was performed using Optuna (version 4.2.1), with 200 optimization trials conducted within a sufficiently broad hyperparameter search space, as detailed in Table 1 [35]. The combination of an extensive search space and a large number of trials was designed to ensure a thorough exploration of the parameter landscape and a high likelihood of converging to a near-optimal solution. The outputs from models trained in each fold were combined using soft voting to generate the final prediction probabilities.

Table 1 Hyperparameter search space

Evaluation of model performance

The area under the receiver-operating curve (AUROC) and the area under the precision-recall curve (AUPRC) on the test was used as the primary performance measure in this study, as these metrics are independent of the decision threshold. Additionally, several auxiliary metrics, including accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value, and F1-score, were evaluated. They were calculated at Youden’s J point, a decision threshold that maximizes the sum of sensitivity and specificity [36].

First, the non-inferiority of the Compact model compared to the Complex model was statistically evaluated. Similarly, the performance of the Compact model was also compared against the ACS-NSQIP score, which served as a reference benchmark. In addition, performance differences among various machine learning algorithms within the Compact model framework were assessed. Since the dataset was relatively small, 95% confidence intervals for performance metrics, as well as their comparisons, were estimated using bootstrap resampling.

To assess the interpretability of the models, SHAP analysis was conducted on the Compact model. It evaluates the influence of each input feature on the final prediction, and a greater mean absolute SHAP value indicates a greater contribution to the model’s decision-making process [32]. For each trained model, the top 10 most influential features were identified, and their contributions were qualitatively examined using beeswarm plots generated from the test set.

Results

Cohort characteristics

A total of 278 patients enrolled in the study, and 17 were excluded due to the cancellation of surgery or change of treatment plans, resulting in 261 eligible for the study. The mean age of the cohort was 72.3 ± 4.8 years, and 108 (41.3%) patients were male. In total, 27 (10.3%) patients experienced either general or surgical postoperative complications of Clavien-Dindo grade 2 or higher within 30 days, with 20 (7.7%) general and 7 (2.7%) surgical. General complications consisted of 5 (1.9%) delirium, 5 (1.9%) cardiovascular, 4 (1.5%) urinary, 3 (1.1%) respiratory, 2 (0.8%) gastrointestinal, and 1 (0.4%) sacral fracture complications. Among surgical complications, neurologic symptoms were the most common, with 4 (1.5%) cases, followed by infection, hematoma, and pneumoperitoneum, each with 1 (0.4%) case. Transfusion was performed in 79 (30.2%) patients within the hospitalization period.

After the dataset was split, the train-validation and test sets contained 208 (79.7%) and 53 (20.3%) patients, respectively. Among the collected variables, only age (72.0 ± 4.8 vs. 73.5 ± 4.8 years; P = 0.049) and TUGT time (21.8 ± 44.0 vs. 13.7 ± 5.6 s; P = 0.02) showed significant differences. No evidence of a significant difference was found in general or surgical complication rates (10.6% vs. 9.4%; P = 1.00), general complication rates (7.7% vs. 7.5%; P = 1.00), and transfusion rates during hospitalization (31.7% vs. 24.5%; P = 0.39). A detailed comparison of all features between the train-validation and test sets is provided in Tables 2 and 3. The distribution of the output variable across the five cross-validation folds was examined using Fisher’s exact test, and no statistically significant differences were observed (P = 0.99, P = 0.72, and P = 0.98, respectively).

Table 2 Comparison of patient characteristics of the train-validation and test sets
Table 3 Comparison of perioperative features between train-validation and test sets

Model performance

The performance of the best models trained with each algorithm on each task is summarized in Tables 4, 5 and 6, and the receiver-operating characteristic (ROC) curve of the Compact model are plotted in Fig. 1.

Table 4 Performance on predicting general/surgical complications on the test set
Fig. 1
figure 1

Receiver-operating characteristic curves on the test set (Compact model). Receiver-operating characteristic curves for predicting (a) early general and surgical complications, (b) early general complications, and (c) transfusion within hospitalization period. LogReg, logistic regression; SVM, support vector machine; AUC, area under the receiver-operating characteristic curve

Table 5 Performance on predicting general complications on the test set
Table 6 Performance on predicting transfusion within hospitalization period on the test set

Compared to the Complex model, the Compact model did not demonstrate statistically significant inferiority in terms of AUROC or AUPRC for any of the tested machine learning algorithms. Furthermore, when compared to the ACS-NSQIP scoring system, which achieved a test AUROC of 0.38 (95% confidence interval [CI], 0.13–0.73) and 0.22 (95% CI, 0.10–0.36) on the first two tasks, the Compact model showed significantly greater AUROC values nearing or surpassing 0.90 for all algorithms except for LogReg (P = 0.10 and P = 0.61, respectively for each task). Although the Compact model generally yielded higher AUPRC values as well, statistical significance was observed only in the case of the SVM trained on the first task (P = 0.044).

Overall, decision tree-based algorithms, including RF, XGBoost, and LightGBM, consistently demonstrated greater performance than LogReg, while the performance of SVMs fluctuated depending on the task. For LogReg, only XGBoost for the second task—predicting general postoperative complications within 30 days—showed statistically significantly higher performance, with AUROC (0.60 vs. 0.95; P = 0.03) and AUPRC (0.13 vs. 0.51; P = 0.04). In the other tasks, no statistically significant improvements over LogReg were observed. The superiority of tree-based models over SVM was shown in the task of predicting transfusion. In terms of AUROC, all tree-based models demonstrated statistically significantly higher performance: RF (0.76 vs. 0.93; P = 0.03), XGBoost (0.76 vs. 0.96; P = 0.005), and LightGBM (0.76 vs. 0.92; P = 0.049). For AUPRC, a statistically significant improvement was observed only with XGBoost (0.45 vs. 0.80; P = 0.04).

Explainability analysis

Table 7 lists the top 10 most influential features of the Compact model for each ML algorithm identified by SHAP analysis. The results indicate that the key contributing factors were generally consistent across different algorithms, particularly among decision tree-based models. For early postoperative complications, important predictors included frailty measures such as fall history within 6 months and TUGT time, as well as preoperative laboratory test results and fusion operation. In the case of transfusion prediction, hemoglobin level, nutrition, and involved spinal levels were identified as major contributing factors.

Table 7 List of top 10 contributing features of the Compact model for each algorithm and task

Figures 2, 3 and 4 present the beeswarm plots and mean absolute SHAP values obtained from each task, comparing LogReg and XGBoost. As demonstrated in Figs. 2a, 3a, and 4a, LogReg consistently exhibited poor group separation in the beeswarm plots, with SHAP values concentrated in only a few variables. Contrastincly, XGBoost demonstrated clear group separation across all tasks, with SHAP values more evenly distributed across multiple variables, as illustrated in Figs. 2b, 3b, and 4b. This suggests that XGBoost exhibits greater interpretability than LogReg. SHAP analysis results for SVM, random forest, and LightGBM are provided in the additional figures dedicated for demonstrating SHAP results, and decision tree-based models like random forest and LightGBM have shown similar patterns to XGBoost (see Additional files 1, 2, and 3).

Fig. 2
figure 2

SHAP results on predicting early general and surgical postoperative complications (Compact model). Shapley additive explanation analysis results for (a) the logistic regression model and (b) XGBoost classifier on predicting early general and surgical postoperative complications

Fig. 3
figure 3

SHAP results on predicting early general postoperative complications (Compact model). Shapley additive explanation analysis results for (a) the logistic regression model and (b) XGBoost classifier on predicting early general postoperative complications

Fig. 4
figure 4

SHAP results on predicting transfusion during hospital stay (Compact model). Shapley additive explanation analysis results for (a) the logistic regression model and (b) XGBoost classifier on predicting transfusion within hospital stay

Discussion

In clinical research, statistical analysis methods have gradually shifted from traditional approaches, such as LogReg, to ML techniques due to their capability to capture nonlinear complex relationships in high-dimensional data [17, 18]. This shift is particularly relevant in the management of elderly patients, where multiple interrelated factors must be considered to achieve optimal outcomes [2, 5, 37, 38]. Accordingly, we applied ML to analyze our prospectively collected cohort, aiming to predict early postoperative complications and transfusion requirements in elderly patients undergoing surgery for LSS.

Comparison between models

As demonstrated earlier, the ML models consistently outperformed the reference method, ACS-NSQIP, in both AUROC and AUPRC. In particular, for the first two tasks involving the prediction of early postoperative complications, ACS-NSQIP yielded AUROC values below 0.5, indicating substantially limited predictive capability. It should be noted that AUROC and AUPRC were adopted as the primary evaluation metrics because they are independent of the decision threshold. Although auxiliary metrics, such as accuracy, sensitivity, and specificity, were measured at Youden-J point, careful selection of the decision threshold is warranted, with due consideration of the potential consequences associated with false predictions.

No evidence of statistically significant differences in AUROC and AUPRC was observed between the Complex and the Compact model. In other words, the exclusion of time- and labor-intensive variables such as ADL, IADL, MMSE, and CCI did not compromise predictive performance. These findings suggest that the Compact model may offer a more practical and scalable approach for future research and clinical implementation. Nevertheless, much validation and clinical studies must be preceded before real application.

When comparing the performance across ML algorithms, decision tree–based models such as XGBoost and LightGBM generally outperformed others, particularly in the task of predicting transfusion, where they showed superior performance compared to LogReg and SVM. Of particular note are the SHAP analysis results, which illustrate model interpretability: while LogReg tended to rely heavily on a single variable, tree-based models distributed importance more evenly across multiple variables, reflecting a more holistic consideration of diverse clinical features. Considering both predictive performance and interpretability, models such as XGBoost and LightGBM appear to be more suitable for clinical application.

Clinical implications

This section discusses the clinical implications and potentially actionable insights derived from the SHAP analysis of the Compact model based on RF, XGBoost, and LightGBM algorithms. By examining the contribution of individual features to the model’s predictions, we aim to identify clinically relevant patterns and risk factors that may inform preoperative decision-making and patient-specific care strategies.

First, frailty-related indicators, such as fall history within six months, and TUGT time, were one of the major contributing factors for early postoperative complications. This finding aligns with previous studies and highlights the importance of assessing frailty and muscle waste in elderly patients before lumbar spine surgery [9, 10, 39, 40]. Dependency in daily living and poor nutritional status were also associated with a greater risk of early postoperative complications and transfusion, highlighting the need for perioperative fitness for better recovery [5, 41]. Therefore, in patients with multiple risk factors, it would be prudent to actively consider non-surgical treatment options first and reserve surgical intervention only for cases with absolute indications, such as the occurrence of neurological complications.

Interestingly, age and comorbidity burden were found to be less influential than expected, with only diabetes showing notable contributions to the outcome variables [42, 43]. Instead, preoperative cholesterol, kidney-related markers, and nutritional status played more significant roles, suggesting the importance of effective management of comorbidities in the geriatric population [41, 44]. Specifically, cholesterol levels may have emerged as a contributing factor probably due to their link with cardiovascular health [45]. This finding emphasizes the importance of lipid management, even in patients without a history of cardiovascular disease, as a potentially actionable strategy to mitigate surgical risks in elderly patients [46,47,48].

Regarding transfusion risk, preoperative hemoglobin level was identified as a primary determinant, which is consistent with contemporary clinical practices and literature [29, 49]. Furthermore, patients who underwent procedures involving multiple vertebral levels, had a higher likelihood of requiring transfusion during hospitalization. Poor nutritional status and prolonged TUGT time was also associated with increased risk, underscoring the importance of preoperative nutritional management [34, 50, 51]. Thus, sufficient correction of nutrition should be considered prior to elective LSS in elderly patients to reduce the likelihood of transfusion events.

Limitations and future work

This section outlines the limitations of the study and directions for future work. First, although statistical analyses have demonstrated that the developed ML models outperformed the ACS-NSQIP score in terms of AUROC, the generalizability of the findings is limited by the relatively small sample size (n = 261). The single-center design without external validation and the class imbalance in the dataset may have introduced considerable bias. Furthermore, the study included only patients who underwent elective surgery, raising concerns about potential selection bias. It is also possible that clinical decisions made by healthcare providers based on ASA classification or surgical planning may have led to an underestimation of the actual risk, further limiting the objectivity of the outcome labels.

The limited dataset size was largely due to the prospective enrollment of elderly patients and the extensive time and resources required to collect a wide range of questionnaire-based features. In particular, constructing the input for the Complex model necessitated a minimum of one hour of thorough, face-to-face interviews per patient to obtain data such as ADL, IADL, MNA, MMSE, and CCI. To address the associated burden, we developed the Compact model using only routinely available variables such as blood test results and clinically essential features. Notably, statistical analyses revealed no significant inferiority in predictive performance compared to the Complex model.

Therefore, this study not only demonstrates the value of ML in the proposed tasks as a proof-of-concept but also lays the groundwork for scalable and cost-effective research in larger cohorts. Although immediate clinical application may be limited, we believe that, with the development of multi-institutional, multiethnic datasets and sufficient validation through clinical trials, this approach holds promise for enabling individualized care for the geriatric population. Further integration into electronic health records or regulatory approval should also be followed.

Another downside of this study lies in the restricted scope of output variables. Due to the relatively small dataset, it was not feasible to develop models capable of predicting organ-specific complications. With a larger cohort in future research, such granularity may become attainable, allowing the model not only to assess the overall risk of complications but also to identify specific systems or organs that require heightened clinical attention. Additionally, in the geriatric population, long-term outcomes such as morbidity and mortality are of particular clinical relevance. The inability to evaluate these outcomes within the current study further limits its scope. Future investigations should therefore aim to expand not only the sample size but also the follow-up period to comprehensively assess both short- and long-term risks.

Conclusion

ML offers a data-driven approach that can integrate numerous factors and capture intricate, nonlinear relationships among them. In this study, we have developed ML algorithms that predict early postoperative complications and transfusion within the hospitalization period in elderly patients undergoing elective surgery for LSS and observed that they can provide higher predictive performance than well-established scoring systems such as ACS-NSQIP. Although the scope of the study is limited by the small cohort collected from a single institution, it may serve as a foundation for future larger-scale reasearch and hopefully clinical application.