Background

The emergency department (ED) serves as the frontline of medical care in most hospital facilities, and its importance is undeniable. Despite this crucial role, demand for emergency services is difficult to predict, as patient arrivals are largely random. Because medical staffing and resources are finite, sudden influxes of patients can overwhelm the ED, making it difficult to meet medical needs promptly. This scenario often leads to ED crowding, where the demand for emergency services exceeds the ED’s capacity to provide timely care [1]. ED overcrowding has a significant negative impact on overall healthcare quality, contributing to negative patient experiences [2], reduced care quality [3], and unpleasant staff experiences [4]. Despite the adoption of various ED crowding interventions, including technology-based, physical, and flow-modification methods [5], crowding remains a considerable challenge for most EDs worldwide.

Artificial intelligence (AI) has been applied across various fields, and its advances have been felt especially in healthcare, where predictive applications have yielded promising results. One such application is the use of machine learning to predict ED dispositions [6,7,8,9,10], which has shown significant achievements. The logical next step is to systematically and objectively consolidate those findings to provide a reference for both medical practice and academia. However, existing reviews on the application of machine learning to predicting ED dispositions have certain shortcomings. Most analyze the performance of prediction models through systematic review alone, lacking the more objective quantitative synthesis that meta-analysis provides. Additionally, there is a lack of comprehensive analysis across different ED dispositions (e.g., admission, critical care, and mortality). Dispositions in this study are limited to the outcomes of patients’ ED visits, specifically hospital admission, critical care, and mortality; diagnoses or presenting conditions (e.g., sepsis) are outside the scope of this analysis.

Therefore, this study aims to systematically evaluate the diagnostic performance of AI models in predicting key ED dispositions (hospital admission, critical care, and mortality) through a comprehensive meta-analysis. Specifically, the study seeks to: (1) quantify the overall diagnostic accuracy of AI models in predicting ED dispositions, enabling a clearer understanding of their general capabilities; (2) identify and analyze covariates (e.g., data characteristics, model types, and study settings) that contribute to heterogeneity in AI performance across studies; and (3) provide actionable insights and practical recommendations for stakeholders, including clinicians, researchers, and administrators, on how AI applications can be better utilized to improve ED workflows and decision-making. The research questions are: (1) What is the performance of AI applications in predicting admission, critical care, and mortality? and (2) What covariates can account for the heterogeneity between studies? Given the varied application settings of each predictive model, this meta-analysis is intended to present an overall view of AI applications rather than tailored recommendations for individual clinical situations. By synthesizing results from diverse contexts, it aims to highlight general trends, identify strengths and limitations, and outline key areas for future research. While this study offers insights into the overall status of AI applications in predicting ED dispositions, specific applications in particular settings will still require further development to meet unique contextual needs.

The primary contributions of this study are as follows: (1) Providing an objective, quantitative evaluation of AI performance in predicting ED dispositions, helping readers understand the general capabilities, advantages, and limitations of current AI applications in this context. (2) Identifying covariates from different data and technical perspectives that may influence AI performance to offer strategies to enhance the performance of AI predictive models. (3) Compiling and synthesizing existing knowledge on AI predictions in ED dispositions, creating a practical resource to guide clinicians, administrators, and researchers considering AI adoption to improve ED workflows.

Related works

To provide a comprehensive evaluation of the application of AI in predicting ED dispositions, this study adopts both a macro and a micro perspective. The macro perspective synthesizes findings from existing review studies, offering a high-level understanding of trends, methodologies, and limitations in the field. Complementing this, the micro perspective examines original studies to provide a granular view of AI model performance, including key metrics such as sensitivity, specificity, and area under the receiver operating characteristic curve. By integrating both perspectives, this study aims to bridge the gap between broad trends and specific evidence, offering a holistic understanding of AI’s role in predicting ED dispositions.

Macro perspective - Review studies

Previous literature has systematically reviewed and/or meta-analyzed the dispositions of ED patients, contributing to a better understanding of the topic; these studies also indicate areas for further research and improvement. For example, Shung et al. [11] conducted a systematic review of machine learning for predicting outcomes of acute gastrointestinal bleeding patients, finding an area under the curve of approximately 0.84 for predicting mortality, interventions, or re-bleeding. However, the study lacked a meta-analysis, limiting comprehensive evaluation, especially regarding ED disposition. Guo et al. [12] reviewed machine-learning applications in predicting heart failure diagnoses, readmissions, and mortality, affirming its effectiveness; yet the review summarized results narratively, without comprehensive statistical analysis.

Kareemi et al. [13] reviewed machine learning’s diagnostic and prognostic applications in ED patients, showing superior performance but lacking meta-analysis for a more objective assessment. Naemi et al. [14] reviewed studies that used vital signs to predict in-hospital mortality among ED patients, noting shortcomings in reporting; despite proposing future directions, the review lacked a meta-analysis to objectively quantify predictive capabilities. Buttia et al. [15] focused on machine-learning predictions of COVID-19 outcomes, highlighting limitations in model generalizability, but did not deeply assess machine-learning performance.

Chen et al. [16] reviewed studies predicting ICU transfers among ED patients, demonstrating promising performance but lacking both a broader perspective and meta-analytic techniques. Issaiy et al. [17] reviewed machine-learning predictions for acute appendicitis, emphasizing high accuracy but lacking meta-analysis for comprehensive assessment. Larburu et al. [18] systematically reviewed studies predicting ED patient hospitalizations, noting logistic regression’s common usage but lacking meta-analytical synthesis. Olender et al. [19] reviewed studies predicting mortality among older adults using machine learning, conducting meta-analyses to quantify predictive abilities. While providing valuable insights into mortality prediction, the review’s exclusive focus on mortality and its lack of specificity regarding in-hospital mortality limit its comprehensiveness. Zhang et al. [20] systematically reviewed and meta-analyzed studies predicting sepsis patients’ mortality using machine learning, demonstrating superior predictive performance compared to existing scoring systems. Despite its comprehensive analysis, the review is limited by its sole focus on the combination of sepsis and mortality prediction.

Based on the existing review literature, there are notable areas for improvement in synthesizing studies that apply machine learning to predict ED dispositions. First, comprehensive review studies on ED dispositions are lacking: the ten reviews discussed above each focus on specific aspects, such as admission, mortality, or critical care, rather than providing a holistic view. Second, the meta-analytical approach is under-utilized, with only a minority of these reviews employing it. Meta-analysis has the potential to provide a more objective evaluation of machine-learning predictive models, benefiting practitioners and academics alike. Detailed information on existing reviews is shown in Table 1.

Table 1 Emergency department disposition-related review studies

Micro perspective - Original studies

From a micro perspective, in studies predicting admission disposition, most utilized private datasets, while only a few studies [21,22,23,24] developed prediction models using public datasets, such as the National Hospital and Ambulatory Medical Care Survey (NHAMCS) ED data and the Medical Information Mart for Intensive Care IV (MIMIC-IV) ED database. Most studies relied on structured features, with some [25,26,27,28,29] combining structured and unstructured features (e.g., free text), while others [30,31,32] used only unstructured features. Regarding model validation, the majority employed internal validation, with only a small number of studies [33] using external validation. Studies that applied cross-validation (e.g., K-fold cross-validation) outnumbered those that did not. Additionally, studies using traditional machine learning methods for ED disposition prediction slightly outnumbered those employing deep learning techniques [21,22,23,24, 26, 30,31,32,33]. A large portion of studies [21, 28, 29, 34,35,36,37,38,39,40,41,42,43,44,45,46,47,48] adopted ensemble methods for building predictive models.

In studies predicting critical care disposition, research utilizing public datasets [9, 21, 23, 24] remained limited, with NHAMCS and MIMIC-IV being the most commonly used public datasets. These studies predominantly relied on structured features, with only a few [49,50,51,52,53,54,55,56,57] combining structured and unstructured features, or solely using unstructured features [58]. Similarly, external validation was infrequently employed [51, 52, 59], with most studies relying on internal validation. The number of studies applying cross-validation was greater than the number of studies that did not. Most predictive models were built using traditional machine learning methods, while the use of ensemble methods in this context was less common [54, 57, 60].

For studies predicting mortality, research utilizing public datasets [6, 61] was significantly less prevalent compared to those using private datasets. These studies primarily relied on structured features, with fewer studies incorporating unstructured features, including free text and imaging data [52, 58, 62, 63]. Model construction predominantly relied on internal validation, with fewer studies adopting external validation [52]. Studies employing cross-validation still outnumbered those that did not. Traditional machine learning approaches were more commonly used than deep learning methods [52, 58, 61, 62, 64,65,66,67]. However, for mortality prediction, studies utilizing ensemble methods outnumbered those that did not.

From the above analysis, it is evident that studies predicting admission, critical care, and mortality dispositions differ significantly in terms of sample sources, feature structuredness, and algorithm choices, leading to varied performance outcomes. Therefore, conducting a meta-analysis to summarize the overall performance of these predictive models is essential. Furthermore, the differences in sample sources and methodologies may contribute to between-study heterogeneity, making it equally important to identify potential factors causing this heterogeneity.

Methods

This study adheres to the reporting guidelines outlined in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses [68, 69] (see Additional file 1 and Additional file 2). Additionally, the research protocol for this study was approved by E-Da Hospital (EMRP-109-158).

Search strategy and selection process

This study utilized a combination of keywords to search seven electronic databases (Scopus, SpringerLink, ScienceDirect, PubMed, Wiley, Sage, and Google Scholar) for literature published through December 31, 2023. The primary focus was on three types of ED disposition: admission, critical care, or mortality. ‘Admission’ refers to patients who seek treatment at the ED and are then transferred to a general ward; ‘critical care’ involves patients with critical conditions requiring ICU transfer, with or without intubation or mechanical ventilation; and ‘mortality’ refers to patients who expire before leaving the ED. Because these three dispositions can be expressed with diverse keywords, the study employed the keyword combination ‘emergency department’ AND (‘machine learning’ OR ‘deep learning’ OR ‘artificial intelligence’), followed by manual filtering by the researchers.

The inclusion criteria were: (1) studies focusing on ED dispositions, (2) studies reported in English, and (3) studies utilizing machine learning or deep learning methods. Exclusion criteria were: (1) studies not employing machine learning or deep learning, (2) studies lacking sufficient information on outcome measures, and (3) studies not related to the prediction of ED dispositions. Following these criteria, 12,214 potential articles were identified. After removal of 156 duplicate records and screening of titles and abstracts, 241 full-text articles remained. These were independently reviewed by two researchers, resulting in the exclusion of 153 articles that did not meet the inclusion criteria. Ultimately, 88 articles [6,7,8,9,10, 21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67, 70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105] were selected for the subsequent meta-analysis. The literature screening process is illustrated in Fig. 1. The studies included in this review are listed in Additional file 3 and Additional file 4.

Fig. 1

Article selection process

Data extraction

For the included articles, this study extracted the following information: author(s), publication year, sample size, type of ED disposition (admission, critical care, or mortality), data source (private or public dataset), data structure of features (structured, unstructured, or combined), type of unstructured feature (free text or image), age group of samples (adult, mixed, youth, elder, or unclear), type of AI technique adopted (machine learning or deep learning), whether cross-validation was used, and whether ensemble learning was adopted. Additionally, this study captured the numbers of true/false positives and true/false negatives; if these were not directly provided, they were derived from other data reported in the articles. Because a single article may develop multiple ED disposition models, each model was treated as a distinct unit for inclusion.
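To make the conversion concrete, the following is a minimal sketch in R, assuming a study reports sensitivity, specificity, the number of positive cases, and the total sample size; the function name and example values are hypothetical, and the actual conversions depended on whatever each article provided.

```r
# Minimal sketch (R): back-calculating a 2x2 confusion table from
# commonly reported metrics. Hypothetical helper, not the study's
# actual extraction script.
confusion_from_metrics <- function(sens, spec, n_pos, n_total) {
  n_neg <- n_total - n_pos
  tp <- round(sens * n_pos)  # true positives implied by sensitivity
  fn <- n_pos - tp           # false negatives are the remainder
  tn <- round(spec * n_neg)  # true negatives implied by specificity
  fp <- n_neg - tn           # false positives are the remainder
  c(TP = tp, FP = fp, FN = fn, TN = tn)
}

# Example: sensitivity 0.85, specificity 0.90, 200 positives out of 1,000
confusion_from_metrics(0.85, 0.90, 200, 1000)
#> TP  FP  FN  TN
#> 170  80  30 720
```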

Methodological analysis

This study assessed the risk of bias and the applicability of the evidence using the Prediction model Risk Of Bias ASsessment Tool (PROBAST) [106, 107]. The tool focuses on four domains: participants, predictors, outcomes, and analysis.

Statistical analysis

This study followed recommendations from prior diagnostic test accuracy literature [108] to calculate the following measures of test accuracy: sensitivity, specificity, area under the receiver operating characteristic curve (AUROC), diagnostic odds ratio (DOR), positive likelihood ratio (+LR), and negative likelihood ratio (-LR). Forest plots were used to depict variability among the included studies, along with the hierarchical summary receiver operating characteristic (HSROC) curve with 95% confidence intervals (CI) and 95% prediction intervals. To identify potential factors influencing heterogeneity, meta-regression analysis was conducted, incorporating type of ED disposition, data source, type of feature, type of unstructured feature, type of sample, type of AI technique employed, whether cross-validation was undertaken, and whether ensemble learning was adopted. All analyses were conducted using R Statistical Software v4.3.2 [109] with the lme4 v1.1-35.1 [110] and mada v0.5.11 [111] packages. MetaDTA was used to create the HSROC [112, 113].
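To make this workflow concrete, the sketch below shows how per-study accuracy measures and a bivariate summary model can be obtained with the mada package; the toy data frame is a hypothetical placeholder, not this study’s extracted dataset.

```r
# Minimal sketch (R) of the diagnostic test accuracy workflow described
# above, using the mada package. `dta` is a hypothetical toy data frame
# with one row per model: true/false positive and negative counts.
library(mada)

dta <- data.frame(
  TP = c(170, 240,  95),
  FP = c( 80, 120,  40),
  FN = c( 30,  60,  15),
  TN = c(720, 980, 350)
)

# Per-study sensitivity, specificity, DOR, and likelihood ratios
madad(dta)

# Bivariate (Reitsma) model: pooled sensitivity and false-positive
# rate, from which the summary ROC curve is derived
fit <- reitsma(dta)
summary(fit)

# Summary ROC curve with the pooled point and the individual studies
plot(fit, sroclwd = 2)
points(fpr(dta), sens(dta), pch = 2)
```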

Results

General study characteristics

As illustrated in Table 2, the 88 included articles contained a total of 117 models (see Additional files 3 and 4), with 39, 45, and 33 models predicting admission, critical care, and mortality, respectively. The majority of models sourced their data from private sources (88.89%). Features used to construct predictive models were predominantly structured (78.63%); among models employing unstructured features, approximately 68% used free text and 32% used image data. Notably, free-text data were processed with natural language processing (NLP) techniques to extract meaningful features for model development. Most models (94.02%) adopted only internal validation (the same dataset) rather than external validation (a completely independent dataset). Most models were based on adult samples (70.94%), and the majority employed machine-learning techniques (71.79%) rather than deep-learning techniques. Approximately 63% of models utilized cross-validation during training, while about 38% employed ensemble learning.

Table 2 Characteristics of included studies

This study further categorizes machine learning and deep learning approaches. From Table 3, it is evident that Random forest (RF) (19.66%) and eXtreme gradient boosting (XGB) (18.80%) are the most commonly used algorithms, followed by Gradient boosting machine (GBM) (11.97%) and LightGBM (5.13%). In the realm of deep learning, the Deep neural network (DNN) (18.80%) has a higher usage rate than the Convolutional neural network (CNN) (6.84%) and Recurrent neural network (RNN) (2.56%).

Table 3 Type of artificial intelligence techniques adopted

Quality assessment

In terms of risk of bias, among the 87 included articles, approximately 70.11% were classified as having a high risk of bias regarding the predictors domain. For the other three domains—participants, outcomes, and analysis—98.85%, 100%, and 97.70% of articles, respectively, were assessed as having a low risk of bias. Overall, 96.55% of articles were assessed as having a low risk of bias, while 3.45% were classified as high risk.

Regarding the risk of applicability, nearly all articles were assessed as having a low risk across the three domains. Specifically, all articles demonstrated low risk concerning participant applicability, while 98.85% showed low risk for predictors and outcomes. Overall, 100% of the articles were assessed as having a low risk of applicability. This assessment of risk of bias and applicability based on the PROBAST tool is summarized in Fig. 2.

Fig. 2

Quality assessment by PROBAST

Diagnostic accuracy

Among the three major types of ED disposition prediction models, those forecasting mortality achieved the highest area under the receiver operating characteristic curve (AUROC), followed by models predicting critical care, with admission prediction models exhibiting the lowest performance (see Table 4). The reported sensitivity, specificity, and AUROC statistics are pooled summary measures derived from the component studies included in this meta-analysis. The pooled summary AUROC values for predicting admission, critical care, and mortality were 0.866 (95% CI 0.836–0.929), 0.928 (95% CI 0.893–0.951), and 0.932 (95% CI 0.894–0.956), respectively. Sensitivity was lowest for admission prediction models at 0.81 (95% CI 0.74–0.86) and highest for critical care models at 0.86 (95% CI 0.79–0.91), with mortality models in between at 0.85 (95% CI 0.80–0.89). Specificity was lowest for admission models at 0.87 (95% CI 0.81–0.91), followed by critical care models at 0.89 (95% CI 0.84–0.93) and mortality models at 0.94 (95% CI 0.90–0.96). These statistics are primarily based on models utilizing internal validation, as only 5.98% of the included models performed external validation.

Analysis of the DOR revealed that models predicting mortality exhibited the highest discriminatory performance [114], while models predicting admission had the lowest DOR. With respect to the likelihood ratios, models predicting mortality were best at identifying true mortality cases, as reflected by their highest +LR, whereas models predicting critical care had a lower -LR, indicating a better ability to rule out non-critical care patients [115]. Forest plots of sensitivity and specificity for models predicting admission, critical care, and mortality are illustrated in Figs. 3, 4 and 5.
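For reference, these measures follow directly from pooled sensitivity and specificity (equivalently, from the pooled 2 × 2 counts); a higher +LR indicates better rule-in ability, while a lower -LR indicates better rule-out ability:

$$\mathrm{+LR} = \frac{\text{sensitivity}}{1 - \text{specificity}}, \qquad \mathrm{-LR} = \frac{1 - \text{sensitivity}}{\text{specificity}}, \qquad \mathrm{DOR} = \frac{\mathrm{+LR}}{\mathrm{-LR}} = \frac{TP \times TN}{FP \times FN}$$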

Table 4 Performance of predicting ED dispositions by artificial intelligence
Fig. 3

Sensitivity and specificity of models predicting admission (n = 39)

Fig. 4

Sensitivity and specificity of models predicting critical care (n = 45)

Fig. 5

Sensitivity and specificity of models predicting mortality (n = 33)

Plausible covariates to explain between-study heterogeneity

Overall, machine learning models for predicting ED disposition demonstrate a sensitivity of approximately 0.84 (95% CI 0.80–0.87) and a specificity of around 0.90 (95% CI 0.87–0.92) (see Table 5), indicating a stronger ability to correctly identify negative cases. When ED disposition is differentiated into admission, critical care, and mortality, with admission as the reference category (as depicted in Table 5), the sensitivity of admission prediction models (0.81, 95% CI 0.74–0.86) is slightly lower than that of critical care (0.86, 95% CI 0.79–0.91) and mortality prediction models (0.85, 95% CI 0.80–0.89), although these differences are not statistically significant. Similarly, the specificity of admission prediction models (0.87, 95% CI 0.81–0.91) is slightly lower than that of critical care (0.89, 95% CI 0.84–0.93) and mortality prediction models (0.94, 95% CI 0.90–0.96), with the difference between admission and mortality models reaching statistical significance (0.87 vs. 0.94, p = 0.027).
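As an illustration of how such covariate comparisons can be fit, the sketch below runs a meta-regression on the bivariate model with mada; all counts and the `disposition` factor are hypothetical placeholders, with admission as the reference level as in Table 5.

```r
# Minimal sketch (R): bivariate meta-regression testing whether
# disposition type explains between-study heterogeneity. All counts
# are hypothetical placeholders.
library(mada)

dta <- data.frame(
  TP = c(170, 240,  95,  60, 130,  45,  88, 150,  72),
  FP = c( 80, 120,  40,  20,  55,  12,  35,  60,  18),
  FN = c( 30,  60,  15,  10,  25,   8,  16,  28,  11),
  TN = c(720, 980, 350, 400, 610, 300, 520, 840, 390),
  disposition = factor(rep(c("admission", "critical_care", "mortality"),
                           each = 3))
)
dta$disposition <- relevel(dta$disposition, ref = "admission")

# The covariate shifts logit-sensitivity (tsens) and the logit
# false-positive rate (tfpr); coefficient tests indicate whether the
# disposition categories differ significantly
fit_cov <- reitsma(dta, formula = cbind(tsens, tfpr) ~ disposition)
summary(fit_cov)
```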

Table 5 Summary estimates for sensitivity and specificity

Plausible covariates for admission predictive models

This study compares the predictive abilities of different ED dispositions (admission, critical care, and mortality) to assess whether or not they are influenced by variables such as data characteristics, sample properties, or machine-learning methods.

Firstly, regarding admission prediction models, Table 6 shows that models using public datasets exhibit higher sensitivity (0.94) and specificity (0.90) than those using private datasets (sensitivity = 0.80, specificity = 0.86), although the differences are not statistically significant. In terms of data structure, both the sensitivity (0.84) and specificity (0.88) of structured-data models surpass those of unstructured-data models (sensitivity = 0.72, specificity = 0.80) and of models combining structured and unstructured data (sensitivity = 0.74, specificity = 0.82); however, the differences among the three data types are not statistically significant. Notably, only eight models use free text as a feature for predicting admission and none use image data, so the type of unstructured feature was not included in this analysis. Among the eight models using free-text data, the pooled sensitivity was 0.73 (95% CI 0.65–0.80) and the pooled specificity was 0.81 (95% CI 0.73–0.88), indicating moderate diagnostic accuracy.

Regarding sample properties, models using mixed samples (all age groups) demonstrate lower sensitivity than models using adult, youth, or elder samples, yet the specificity of models using mixed samples is higher than that of the other three sample types. Sensitivity does not differ significantly among the four sample types, but the specificity of models using mixed samples is significantly higher than that of models using elderly samples (0.92 vs. 0.75, p = 0.027).

In terms of machine-learning methods, models utilizing deep learning exhibit higher sensitivity (0.86) than those built with traditional machine-learning methods (0.80), but the difference is not statistically significant (p = 0.541). The specificity of traditional machine-learning models (0.87) is slightly higher than that of deep-learning models (0.86), again not statistically significant (p = 0.868). Comparing individual algorithms, models utilizing CNN show both higher sensitivity and specificity than those using RF, XGB, DNN, and RNN; of these, the sensitivity differences relative to XGB (p = 0.022) and DNN (p = 0.002) are statistically significant. Models built using ensemble learning demonstrate lower sensitivity (0.78) and specificity (0.86) than models not using ensemble learning (sensitivity = 0.86, specificity = 0.88), but the differences are not statistically significant (p = 0.301 and 0.713). Additionally, models not using cross-validation exhibit lower sensitivity (0.74) and higher specificity (0.88) than models using cross-validation (sensitivity = 0.84, specificity = 0.86), but neither difference reaches statistical significance (p = 0.170 and 0.541).

Table 6 Summary estimates for sensitivity and specificity of admission studies

Plausible covariates for critical care predictive models

In critical care prediction models, the sensitivity (0.87) and specificity (0.90) of models using private datasets are higher than those using public datasets (sensitivity = 0.76, specificity = 0.73), but the differences do not reach statistical significance (p = 0.316 and 0.103). As only one model in the critical care category solely uses unstructured data, comparisons are made only between models using structured and combined data types. From Table 7, the sensitivity (0.86) and specificity (0.90) of models using structured data are equal to or higher than those of models using combined data (sensitivity = 0.86, specificity = 0.87), but none of the differences are statistically significant (p = 0.865 and 0.626). Among these models, some combine image data and some combine free text; comparing the two formats reveals that models incorporating image data have higher sensitivity (0.87) than those incorporating free text (0.83) but lower specificity (0.86 vs. 0.87), with neither difference reaching statistical significance (p = 0.530 and 0.861).

Regarding sample properties, since there is only one data point each for the unclear and elder categories among the models predicting critical care, these were not included in the analysis; only adult, youth, and mixed samples were compared. From Table 7, the sensitivity (0.90) of models using mixed samples is higher than that of models using adult (0.86) and youth (0.85) samples, but the differences are not statistically significant (p = 0.664 and 0.762). Furthermore, the specificity (0.84) of models using mixed samples is lower than that of models using adult samples (0.90) but higher than that of models using youth samples (0.72), yet none of the differences are statistically significant (p = 0.616 and 0.204).

Regarding machine-learning methods, the sensitivity (0.88) and specificity (0.91) of models using traditional machine learning are higher than those using deep learning (sensitivity = 0.78, specificity = 0.83), but the differences are not statistically significant (p = 0.205 and 0.171). Comparing specific algorithms, the CNN achieves a sensitivity and specificity of 0.84. Compared to RF models (sensitivity = 0.91, specificity = 0.96), CNN models perform slightly lower on both metrics, though the differences are not statistically significant. CNN models outperform LightGBM and LR in sensitivity (LightGBM: 0.69, LR: 0.78) but are on par with LR in specificity (0.84) and below LightGBM (0.98), again without statistically significant differences. Compared with DNN, CNN models achieve higher sensitivity (DNN: 0.75) and similar specificity (DNN: 0.83), with no statistically significant differences. Overall, CNN models show balanced sensitivity and specificity relative to other algorithms. Concerning ensemble learning, the sensitivity (0.91) and specificity (0.91) of models using ensemble learning are higher than those of models not using it (sensitivity = 0.69, specificity = 0.81), but only the sensitivity difference reaches statistical significance (p = 0.032). Lastly, the sensitivity (0.86) and specificity (0.87) of models using cross-validation are slightly lower than those of models not using it (sensitivity = 0.87, specificity = 0.92); neither difference reaches statistical significance (p = 0.952 and 0.288, respectively).

Table 7 Summary estimates for sensitivity and specificity of critical care studies

Plausible covariates for mortality predictive models

In models predicting mortality, both sensitivity (0.90) and specificity (0.95) are higher for models using public datasets than for those using private datasets (sensitivity = 0.85, specificity = 0.94), but none of the differences are statistically significant (see Table 8). Regarding data structure, since no models solely use unstructured data, comparisons were made only between models using structured and combined data. From Table 8, both sensitivity (0.86) and specificity (0.95) of models using structured data are higher than those using combined data (sensitivity = 0.82, specificity = 0.85), but none of the differences are statistically significant (p = 0.591 and 0.189). Although several models using combined data incorporate image data, only one model utilizes free text, achieving a sensitivity of 0.92 (95% CI 0.89–0.94) and a specificity of 0.81 (95% CI 0.80–0.81); a comparison between the two types of unstructured data was therefore not conducted.

Regarding sample properties, as no models were classified as ‘unclear,’ comparisons were made among models with mixed, adult, youth, and elder samples. The results show that the sensitivity (0.84) of models using mixed samples is lower than that of models using the other three sample types (0.85 for adult, 0.99 for youth, and 0.88 for elder), but none of the differences are statistically significant. Conversely, the specificity (0.96) of models using mixed samples is higher than that of the other three sample types (0.94 for adult, 0.89 for youth, and 0.90 for elder), yet none of the differences are statistically significant.

Regarding machine-learning methods, both sensitivity (0.86) and specificity (0.95) of models using traditional machine learning are higher than those using deep learning (sensitivity = 0.83, specificity = 0.91), but neither difference reaches statistical significance (p = 0.709 and 0.442). Comparing individual algorithms, models using the RF and LR algorithms have both higher sensitivity and specificity than those using CNN, though these differences are not statistically significant. Models employing LightGBM and DNN have higher sensitivity but lower specificity than CNN, with none of these differences reaching statistical significance. Additionally, models using XGB exhibit lower sensitivity than CNN but higher specificity, also without statistical significance. Models using ensemble methods have higher sensitivity (0.88) and specificity (0.95) than those not using ensemble methods (sensitivity = 0.79, specificity = 0.93), but neither difference reaches statistical significance (p = 0.095 and 0.620).

Regarding the use of cross-validation in prediction models, sensitivity is the same for models with and without cross-validation (0.85). However, models using cross-validation have significantly higher specificity compared to those not using it (0.96 vs. 0.87, p = 0.032). The summary sensitivity and specificity performance of mortality prediction models are presented in Table 8.

Table 8 Summary estimates for sensitivity and specificity of mortality studies

Summarization of plausible covariates for three predictive models

Summarizing the performance of the three disposition prediction models (see Table 9), several patterns emerge. First, regarding data sources, models utilizing public data perform better in predicting admission and mortality than those using private data, while the opposite holds for critical care. Second, concerning data structure, models using structured data outperform those combining structured and unstructured data in predicting admission, critical care, and mortality; among admission-prediction models, those relying solely on unstructured data exhibit the poorest performance. As for sample properties, no distinct pattern emerges favoring any particular sample type.

In terms of machine-learning methods, except for admission-prediction models, where deep learning yields better sensitivity, both sensitivity and specificity in the other two prediction tasks favor traditional machine learning. Comparing individual algorithms, CNN demonstrates superior sensitivity and specificity for admission prediction. For critical care prediction, XGB shows the highest sensitivity while LightGBM excels in specificity; for mortality prediction, RF and LightGBM yield the best sensitivity, with RF also showing the highest specificity. Regarding ensemble learning, admission-prediction models using ensemble learning show lower sensitivity and specificity than those not using it, whereas for critical care- and mortality-prediction models, ensemble learning improves both measures. Additionally, the use of cross-validation does not consistently guarantee better performance: admission-prediction models without cross-validation exhibit higher sensitivity and specificity, while results for critical care- and mortality-prediction models are mixed.

Table 9 Summarization of the performance of predictive models for three emergency department dispositions

Finally, this study employed HSROC curves to evaluate model performance; the curves for admission, critical care, and mortality are depicted in Figs. 6, 7 and 8, respectively. These figures show that the mortality-prediction model exhibits higher precision than the admission- and critical care-prediction models, as indicated by its smaller 95% prediction-interval region and 95% confidence region.

Fig. 6

Summary receiver operating-characteristic curve for models predicting admission

Fig. 7

Summary receiver operating-characteristic curve for models predicting critical care

Fig. 8

Summary receiver operating-characteristic curve for models predicting mortality

Discussion

Based on the meta-analysis of 117 models extracted from 87 articles included in this study, the overall sensitivity and specificity for predicting ED disposition patterns were determined to be 0.84 and 0.90, respectively. These results indicate that the utilization of machine learning in predicting the discharge disposition of ED patients shows acceptable predictive capabilities. This capability allows for the early acquisition of patient disposition information, which can greatly aid in the effective allocation of medical personnel and resources within modern healthcare institutions.

Type of ED dispositions predicted

Upon further examination, among the 117 predictive models, 39 focus on admission, 45 on critical care, and 33 on mortality. The meta-analysis reveals that mortality-prediction models exhibit the highest AUROC, followed by critical care-prediction models, with admission-prediction models demonstrating the lowest performance. In terms of sensitivity, critical care-prediction models rank highest, followed by mortality-prediction models, with admission-prediction models lowest. Similarly, for specificity, mortality-prediction models rank highest, followed by critical care-prediction models, with admission-prediction models again lowest.

Notably, when admission prediction serves as the reference category, sensitivity and specificity among the three types of prediction models generally do not differ significantly, except for the specificity of mortality-prediction models, which is significantly higher than that of admission-prediction models. Additionally, the specificity of these prediction models tends to exceed their sensitivity, suggesting a stronger ability to correctly identify true negatives while potentially missing some true positives. Future research may require an iterative refinement process to enhance the sensitivity of ED disposition models.

Clarification of research purpose and scope

While this meta-analysis provides a quantitative overview of AI performance in predicting ED patient dispositions, it is important to recognize the heterogeneity among the predictive models included. These models vary in terms of the disposition types predicted, data used, machine learning methods, and patient conditions, which limits the generalizability of our meta-analysis results to specific clinical situations. The primary goal of this meta-analysis is to offer insights into general trends, strengths, and challenges in AI applications for ED disposition prediction, rather than to provide tailored recommendations for individual contexts. Additional research and development tailored to the unique demands of each clinical setting may therefore be necessary.

Public or private data source

In terms of the data utilized, the majority are proprietary rather than publicly available. The analysis shows that models using publicly available data for predicting admission and mortality achieve higher sensitivity and specificity than models using private data, whereas for critical care, public-data models perform worse than private-data models. In the realm of machine learning for skin image recognition, Tschandl et al. [116] argue for making skin image data publicly available, suggesting that doing so would enhance recognition performance. Since the number of models in this study built on public datasets is limited (n = 13), further research is needed to determine whether this argument extends to non-image data.

Data structure of features

Based on the findings of this study, predictive models using solely structured data consistently demonstrate higher sensitivity and specificity across all three prediction categories: admission, critical care, and mortality. Notably, the lowest sensitivity and specificity are observed in admission models built solely on unstructured features. This may stem from the need for feature extraction when handling unstructured data: differences in extraction methods can introduce variability in prediction performance, ultimately yielding less effective models than those using structured features alone.

Although integrating structured and unstructured features theoretically provides richer information for model development, this assertion is not confirmed by the current meta-analysis. Specifically regarding unstructured data, image data for critical care prediction yields higher sensitivity than free text, while the opposite holds for specificity; however, neither type of unstructured data significantly influences the predictive outcomes of ED disposition.

Sample type

Regarding sample selection, no sample type consistently dominates across the three ED dispositions. Models employing mixed samples exhibit higher specificity than other sample types in predicting admission and mortality, and the highest sensitivity in predicting critical care, whereas models using adult samples show the highest specificity for critical care and models using youth samples show the highest sensitivity for mortality. Overall, there is no discernible pattern in prediction performance based on the sample used, suggesting that the choice of sample may not substantially impact model performance.

Machine learning vs. deep learning

Among the included models, machine learning remains predominant. Generally, across the three ED dispositions, models built using traditional machine-learning methods demonstrate higher sensitivity and specificity than those employing deep learning, except for admission-prediction models, where deep learning yields higher sensitivity. This study infers that because the included data consist primarily of structured data rather than images, task complexity may be lower, making traditional machine-learning methods adequate; conversely, deep-learning methods cannot effectively leverage their image-processing strengths in this context. Further analysis of individual algorithms reveals that convolutional neural networks perform best for admission prediction; for critical care prediction, eXtreme gradient boosting and LightGBM show superior performance; and for mortality prediction, random forest and LightGBM demonstrate the highest performance.

Ensemble-learning technique

Using ensemble learning is generally believed to improve the predictive capability of models [117]. However, according to the results of this study, the situation is not entirely straightforward. For predicting admission, models not utilizing ensemble learning performed better, while for predicting critical care and mortality, models employing ensemble learning outperformed in both sensitivity and specificity. The discrepancy may be attributable to the fact that the ensemble-learning techniques employed across studies were not identical, suggesting that the selection of appropriate methods and parameter configurations is crucial when utilizing ensemble learning.

Cross-validation technique

Regarding the use of cross-validation, the results of this study show a mixed picture. For predicting admission, models without cross-validation demonstrated superior sensitivity and specificity. For predicting mortality, models using cross-validation showed significantly higher specificity, whereas for critical care, models without cross-validation performed slightly better on both measures. Thus, no clear pattern indicates that adopting cross-validation consistently enhances model performance across prediction categories. One inference is that among the 51 models employing cross-validation, 26 did not undergo hyper-parameter tuning to find optimal settings, potentially limiting the performance gains. Finally, our review highlights that most models relied on internal rather than external validation, raising concerns about potential overfitting.

Risk of bias assessment

Our study used PROBAST to assess the risk of bias across four domains: participants, predictors, outcomes, and analysis. Overall, the results show that most studies have a low risk of bias, indicating that these studies are well-designed with adequate sample sizes and appropriate handling of missing data. This finding differs from previous PROBAST-based assessments [118, 119], which often identified a high risk of bias in most prediction models. This discrepancy may be due to the fact that PROBAST was not specifically designed for AI applications, and some signaling questions may not fully apply in this context. This limitation could be addressed once the PROBAST + AI tool is officially released.

Future research directions

The analysis results of this study suggest that the specificity of the included models for predicting admission, critical care, and mortality is higher than their sensitivity. This implies that the models excel in correctly identifying non-cases of ED disposition, but may struggle to identify all relevant cases. While the sensitivity of these prediction models exceeds 80%, there is room for improvement in their predictive capabilities. Recommendations for enhancement could be explored in the following areas.

Define standard features for predicting ED disposition at various stages

Throughout the ED visitation process, a wide array of data is generated, encompassing physiological signs, injury records, diagnoses, laboratory findings, radiographic images, and more. In the models analyzed in this meta-analysis, the utilization of data varies considerably across stages, making it challenging to identify overarching patterns for comparison. Future research could construct predictive models in stages based on the data generated during patients’ ED visits and evaluate the performance of the models at each stage. Additionally, recommendations for features and timeframes applicable to each stage of the ED visit could be proposed to facilitate further model development. Early prediction of patient disposition in the ED holds significant potential for optimizing emergency medical resource management, service capabilities, and overall allocation.

Build a public dataset for predicting ED disposition

Once suggested features are identified, relevant data can be collected to construct predictive models. Another logical step is to establish a public dataset through collaborative efforts among EDs worldwide. Such a shared dataset would support hospitals in developing their own models for predicting ED patient disposition. Furthermore, with access to shared datasets, different models become comparable, potentially enhancing predictive performance. The results of this meta-analysis also suggest that models using public datasets outperform those using private datasets in predicting admission and mortality.

Structure the nature of features for predicting ED disposition

In theory, unstructured data may contain additional crucial information, suggesting that utilizing it could improve predictive performance. However, the results of this study do not support this argument: models built using structured data outperformed those using both structured and unstructured data in predicting admission, critical care, and mortality, and models based solely on unstructured data performed the least satisfactorily in predicting admission. One possible explanation is that the unstructured data were completed using templates, resulting in uniform content and reducing the informational value of the text. Future research should therefore prioritize structured data as the primary features, supplementing them with unstructured data as additional features, as illustrated in the sketch below.
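As a hedged illustration, the sketch below appends TF-IDF features derived from free text to a structured feature matrix using the tm package; `ed_data`, its columns, and the `triage_note` field are hypothetical placeholders.

```r
# Minimal sketch (R): structured features as the primary inputs, with
# TF-IDF features from free text appended as supplementary columns.
# All data are hypothetical placeholders.
library(tm)

ed_data <- data.frame(
  age         = c(34, 71, 58),
  heart_rate  = c(88, 112, 95),
  triage_note = c("chest pain radiating to left arm",
                  "fall at home confused on warfarin",
                  "fever and productive cough for three days")
)

# Extract TF-IDF term weights from the unstructured notes
corpus <- Corpus(VectorSource(ed_data$triage_note))
dtm <- DocumentTermMatrix(corpus,
                          control = list(weighting = weightTfIdf,
                                         stopwords = TRUE))

# Structured columns first; text-derived features appended after
X <- cbind(ed_data[, c("age", "heart_rate")], as.matrix(dtm))
str(X)
```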

Sample dataset for predicting ED disposition

Because younger ED patients, such as infants, may not yet be physiologically mature, their responses to illness can differ significantly from those of adults, particularly the elderly. It is therefore suggested that future studies distinguish between age groups when constructing predictive models for ED patient disposition; this approach would better serve the clinical needs of EDs. Among the studies included in this meta-analysis, certain models were specifically tailored for the elderly [25, 33, 45, 47, 48, 65] or for adolescents/infants [44, 82].

Utilize tailored artificial intelligence techniques for predicting ED disposition

Based on the results of this meta-analysis, the predictive performance of deep-learning models is generally lower than that of machine-learning models, contrary to the common belief that deep learning outperforms traditional approaches. Subsequent research should investigate possible reasons for this discrepancy and enhance the predictive capabilities of deep-learning models. Additionally, ensemble learning demonstrated superior performance in predicting critical care and mortality compared to models that did not utilize it; future research may consider employing different types of ensemble learning to identify more effective model architectures. Moreover, while cross-validation theoretically aids model predictive ability, it is recommended that future studies pair cross-validation with hyper-parameter tuning to enhance performance, as sketched below. Lastly, future studies are strongly encouraged to adopt external validation to minimize the risk of overfitting.
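As one way to operationalize the cross-validation and tuning recommendation, the sketch below pairs 5-fold cross-validation with a small hyper-parameter grid using the caret package; the data frame, outcome column, and grid values are hypothetical placeholders.

```r
# Minimal sketch (R): k-fold cross-validation combined with
# hyper-parameter tuning via caret, as recommended above. The toy
# data set is a hypothetical placeholder.
library(caret)
set.seed(42)

ed_data <- data.frame(
  disposition = factor(sample(c("admit", "discharge"), 200, replace = TRUE)),
  age         = rnorm(200, 55, 18),
  heart_rate  = rnorm(200, 90, 15),
  triage      = sample(1:5, 200, replace = TRUE)
)

# 5-fold cross-validation, optimizing cross-validated AUROC
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE,
                     summaryFunction = twoClassSummary)

# Random forest tuned over mtry (predictors sampled at each split)
fit <- train(disposition ~ ., data = ed_data,
             method = "rf", metric = "ROC",
             trControl = ctrl,
             tuneGrid = data.frame(mtry = 1:3))

fit$bestTune  # hyper-parameter value selected by cross-validation
```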

Limitations

This review has several limitations that warrant acknowledgment. Firstly, caution is needed when interpreting the pooled sensitivity and specificity due to between-study heterogeneity. Secondly, 71 articles were excluded due to insufficient quantitative information; future research on ED disposition using machine learning should report sufficient metric information (e.g., confusion-matrix counts) so that study characteristics can be fully profiled and synthesized.

Conclusions

The main aim of this study is to meta-analyze the performance of artificial intelligence techniques used in predicting ED dispositions. Due to the lack of objective assessments in existing review literature on this topic, a comprehensive understanding of how artificial intelligence performs in predicting ED disposition is limited. This limitation may hinder the effective utilization of this technology, which could be crucial for optimizing emergency medical resources and for addressing vital issues such as ED overcrowding.

The primary findings of this study indicate that machine-learning techniques applied to predicting ED dispositions, including admission, critical care, and mortality, achieve AUROC values ranging from 0.87 to 0.93, with mortality-prediction models performing best. Across the three dispositions, pooled sensitivity and specificity range from 0.81 to 0.94. However, specificity is higher than sensitivity for each of the three ED dispositions, suggesting room for improvement in predicting positive cases. Feasible approaches to address this include:

1) To establish standardized feature sets for predicting ED dispositions;

2) To create shared datasets for training predictive models, accessible to both emergency medical practitioners and researchers;

3) To integrate structured and unstructured datasets; and

4) To leverage machine-learning techniques such as cross-validation with hyper-parameter tuning and ensemble learning to enhance performance.