Introduction

Ischemic stroke is a leading global cause of morbidity and mortality, affecting one in four individuals over their lifetime [1]. Carotid atherosclerotic stenosis, which accounts for approximately 20% of stroke cases, is a critical risk factor [2]. As of 2020, the global prevalence of carotid plaque among individuals aged 30 to 79 was 20% [3]. The main mechanisms by which carotid atherosclerotic stenosis causes stroke are intracranial artery embolism from rupture or erosion of unstable carotid plaques and reduced cerebral blood flow due to moderate-to-severe carotid stenosis [4].

Currently, surgical interventions such as carotid endarterectomy (CEA) or carotid artery stenting (CAS) are recommended for patients with severe stenosis or ischemic symptoms [5, 6]. However, recent studies have shown that plaque morphology also plays a crucial role in the risk stratification of carotid plaques [7]. There is no consensus on the best methods for identifying plaques at the highest risk of rupture, particularly in patients with moderate stenosis or in asymptomatic patients, especially given the improved prognosis with medical therapy alone [8]. The main challenge is to identify the unstable plaques most likely to lead to stroke, so that intervention is offered only when the potential benefits outweigh the risks and unnecessary procedures are avoided [9].

Non-invasive carotid imaging plays a vital role in characterizing plaque features and predicting future events, which aids in better risk stratification and management [10]. Nevertheless, traditional manual analysis of imaging can be subjective, relying heavily on the operator’s skills and often failing to extract all pertinent information.

Artificial intelligence (AI) offers a promising approach to improve plaque detection and risk stratification by assisting radiologists in more accurately identifying unstable carotid plaques on imaging. By analyzing imaging data such as ultrasound (US), AI algorithms can identify subtle plaque features and provide consistent, objective evaluations that help distinguish stable from unstable plaques [11], while also reducing workload and enhancing reproducibility in clinical practice [12]. AI-based approaches applied to US, CT, and MRI can support robust plaque tissue characterization and classification, thereby offering potential clinical value in unstable plaque identification and stroke prevention [13]. A recent meta-analysis by Saba et al. evaluated radiomics-based machine learning on CT and MRI and reported satisfactory performance in predicting culprit plaques (AUC 0.85), which improved further when combined with clinical features [14]. However, this review was limited to radiomics approaches on CT and MRI, without considering US or other AI-assisted methods.

To provide a more comprehensive overview, we conducted a systematic review and meta-analysis of all AI-assisted approaches—including radiomics and deep learning—across multiple imaging modalities (US, CT, and MRI) to assess their diagnostic performance in distinguishing unstable from stable carotid plaques (Fig. 1).

Fig. 1 Schematic overview of the study design and main results

Methods

Protocol registration and study design

The study protocol was registered with the PROSPERO International Prospective Register of Systematic Reviews (CRD42023414260), and we followed the PRISMA guidelines for systematic reviews and meta-analyses [15]. No ethical approval or informed consent was required for this review.

Search strategy and eligibility criteria

We systematically searched Medline, Embase, IEEE, PubMed, Web of Science, and the Cochrane Library for AI-based carotid plaque risk stratification studies in medical imaging up to June 6, 2023. We included only English-language articles. Supplementary Note 1 provides details on our database-specific search strategies.

Eligible studies were those that reported AI technologies for carotid plaque risk stratification using medical radiology images and provided diagnostic outcomes such as sensitivity, specificity, accuracy, AUC, or detailed contingency-table information. We excluded duplicate publications, reviews, editorials, studies unrelated to the target task, studies with unreliable reference standards, studies lacking full text, studies written in languages other than English, and studies not based on image classification. Two reviewers (Y.F and L.X) independently screened titles and abstracts according to these criteria, and relevant articles were retrieved and reviewed in full text. Any disagreements were resolved through discussion with a third author (Z.L) to reach consensus.

Data extraction

Two reviewers (Y.F and L.X) independently extracted study characteristics and diagnostic performance using a standardized data extraction sheet. Disagreements were resolved through discussion, or a third investigator (Z.L) was consulted if needed.

Diagnostic accuracy data, including true-positive (TP), false-positive (FP), true-negative (TN), and false-negative (FN), were directly extracted into contingency tables to calculate sensitivity and specificity. Studies without sufficient data to construct a contingency table were excluded. In cases where a study provided multiple contingency tables for the same or different AI algorithms, we considered them to be independent of each other. Supplementary Table 1 provides a summary of the contingency tables extracted from the included studies.
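
For readers less familiar with these quantities, the short sketch below shows how a 2 × 2 contingency table maps onto the sensitivity and specificity that enter the pooling step; the counts shown are hypothetical examples, not data extracted from any included study.

```python
# Illustrative only: how a 2x2 contingency table yields per-study accuracy metrics.
# The counts below are hypothetical and do not come from any included study.

def accuracy_from_table(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Compute sensitivity, specificity, and accuracy from a contingency table."""
    sensitivity = tp / (tp + fn)                # true-positive rate
    specificity = tn / (tn + fp)                # true-negative rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"sensitivity": sensitivity, "specificity": specificity, "accuracy": accuracy}

print(accuracy_from_table(tp=45, fp=8, fn=5, tn=42))
# {'sensitivity': 0.9, 'specificity': 0.84, 'accuracy': 0.87}
```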

Study quality assessment

Two independent reviewers (Y.F and L.X) assessed the quality of all selected studies using the QUADAS-AI criteria [47]. The details can be found in Supplementary Table 2. QUADAS-AI includes four domains (patient selection, index test, reference standard, flow and timing) for assessing the risk of bias and three domains (patient selection, index test, reference standard) for applicability concerns. This tool serves as an AI-specific extension to QUADAS-2 [48] and QUADAS-C [49], offering researchers a specific framework for evaluating bias risk and applicability in reviews focusing on AI-based diagnostic test accuracy. Any conflicts were resolved through discussion with a third collaborator (Z.L).

Meta-analysis

We performed the diagnostic test accuracy meta-analysis using the midas module in Stata (Version 15.1). This package implements the bivariate random-effects model [50] to jointly pool sensitivity and specificity and generate summary receiver operating characteristic (SROC) curves. Forest plots were generated using the same package to summarize sensitivity and specificity. Each study contributed its original 2 × 2 contingency table (true positives, false positives, false negatives, true negatives) directly to the model; no single summary effect size (e.g., diagnostic odds ratio) was pre-computed. The AUC and its 95% confidence interval were derived from model parameters using the delta method as implemented in midas. The SROC figures display the pooled curve together with 95% confidence and prediction regions.
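
As a rough illustration of the pooling step, the sketch below applies a DerSimonian–Laird random-effects pool to logit-transformed sensitivities and specificities separately. This is a deliberate simplification: the midas implementation fits the full bivariate model, which jointly models both margins and their correlation, so this code is not the analysis used here, and the counts are hypothetical.

```python
"""Simplified illustration of random-effects pooling of sensitivity/specificity.
The bivariate model used by midas jointly models both margins and their
correlation; here each margin is pooled separately for brevity (assumption)."""
import math

def pool_logit(events, totals):
    """DerSimonian-Laird random-effects pool of logit-transformed proportions."""
    # 0.5 continuity correction avoids division by zero for extreme tables
    y = [math.log((e + 0.5) / (t - e + 0.5)) for e, t in zip(events, totals)]
    v = [1.0 / (e + 0.5) + 1.0 / (t - e + 0.5) for e, t in zip(events, totals)]
    w = [1.0 / vi for vi in v]
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))       # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(y) - 1)) / c)                         # between-study variance
    w_re = [1.0 / (vi + tau2) for vi in v]
    pooled = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return 1.0 / (1.0 + math.exp(-pooled))                          # back-transform to a proportion

# Hypothetical per-study counts: true positives / diseased, true negatives / non-diseased
tp, diseased = [45, 60, 30], [50, 70, 34]
tn, healthy  = [42, 55, 28], [50, 65, 35]
print("pooled sensitivity:", round(pool_logit(tp, diseased), 3))
print("pooled specificity:", round(pool_logit(tn, healthy), 3))
```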

Heterogeneity was quantified using the I² statistic for sensitivity and specificity separately [51]. Approximate thresholds of 25%, 50%, and 75% were interpreted as low, moderate, and high heterogeneity, respectively. Subgroup and meta-regression analyses were conducted to explore potential sources of heterogeneity.
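
For reference, I² can be derived from Cochran's Q as I² = max(0, (Q - df)/Q) × 100%; the minimal sketch below shows this calculation with hypothetical inputs.

```python
# Illustrative I^2 calculation from Cochran's Q (inputs are hypothetical).
def i_squared(effects, variances):
    """Return I^2 (%) from study effect estimates and their variances."""
    w = [1.0 / v for v in variances]
    pooled = sum(wi * e for wi, e in zip(w, effects)) / sum(w)
    q = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, effects))
    df = len(effects) - 1
    return 100.0 * max(0.0, (q - df) / q) if q > 0 else 0.0

print(round(i_squared([2.1, 1.4, 2.8, 0.9], [0.04, 0.06, 0.05, 0.07]), 1))
```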

We conducted subgroup analyses as follows: 1) sample size (≤ 300 or > 300), 2) AI algorithm (machine learning (ML) or deep learning (DL)), 3) imaging modality (ultrasound (US), computed tomography (CT), or magnetic resonance imaging (MRI)), 4) year of publication (before or after 2017), 5) segmentation method (automatic or manual), and 6) type of image features used (traditional plaque evaluation features or radiomics features). The thresholds of 300 for sample size and 2017 for publication year were chosen to create balanced subgroups for comparison. Meta-regression was conducted using midas with these covariates (algorithm, sample size, publication year, segmentation method, and feature type).
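
Conceptually, each subgroup analysis re-runs the same pooling within strata defined by a covariate. The schematic sketch below uses invented study records and, for brevity, a simple fixed-effect pool of logit sensitivity rather than the bivariate model actually fitted in midas.

```python
# Schematic subgroup pooling: group hypothetical study records by a covariate
# (here, imaging modality) and pool each stratum separately. A fixed-effect
# inverse-variance pool is used for brevity; the real analysis used midas and
# required at least three studies per pooled estimate.
import math
from collections import defaultdict

def pool(events, totals):
    y = [math.log((e + 0.5) / (t - e + 0.5)) for e, t in zip(events, totals)]
    w = [1.0 / (1.0 / (e + 0.5) + 1.0 / (t - e + 0.5)) for e, t in zip(events, totals)]
    m = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return 1.0 / (1.0 + math.exp(-m))

studies = [
    {"modality": "US",  "tp": 45, "diseased": 50},
    {"modality": "US",  "tp": 60, "diseased": 70},
    {"modality": "MRI", "tp": 30, "diseased": 34},
    {"modality": "MRI", "tp": 25, "diseased": 30},
]

by_modality = defaultdict(list)
for s in studies:
    by_modality[s["modality"]].append(s)

for modality, members in by_modality.items():
    sens = pool([s["tp"] for s in members], [s["diseased"] for s in members])
    print(modality, "pooled sensitivity:", round(sens, 3))
```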

We evaluated the methodological quality of included studies using QUADAS-AI in RevMan (Version 5.4). Funnel plots were generated using midas to evaluate publication bias. Meta-analysis was performed only when at least three original studies were available. Statistical significance was defined as a two-tailed p < 0.05.
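
Publication bias in diagnostic accuracy meta-analyses is commonly examined with Deeks' funnel plot asymmetry test; assuming that approach, the sketch below shows the quantities each study contributes to the plot. The counts are hypothetical, and the regression-based asymmetry test itself is left to the meta-analysis software.

```python
# Sketch of the points on a Deeks' funnel plot for diagnostic accuracy studies:
# ln(diagnostic odds ratio) against 1/sqrt(effective sample size).
# Counts are hypothetical; the slope test and p-value come from the software.
import math

def deeks_point(tp, fp, fn, tn):
    ln_dor = math.log(((tp + 0.5) * (tn + 0.5)) / ((fp + 0.5) * (fn + 0.5)))
    n_dis, n_non = tp + fn, fp + tn
    ess = 4 * n_dis * n_non / (n_dis + n_non)        # effective sample size
    return ln_dor, 1.0 / math.sqrt(ess)

studies = [(45, 8, 5, 42), (60, 12, 10, 55), (30, 6, 4, 28)]
for s in studies:
    print(deeks_point(*s))
```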

Results

Study selection and characteristics of eligible studies

From an initial search yielding 1573 records, 757 duplicates were removed, and 749 studies were excluded based on title and abstract screening. This process resulted in 67 studies for full-text review. Ultimately, 31 articles were included in the systematic review, with 14 providing sufficient data for meta-analysis (Fig. 2).

Fig. 2 PRISMA flowchart of study selection

Twenty studies utilized retrospective patient data, ten studies employed prospective data, and one study did not specify data origin. None of the studies detailed the process of excluding low-quality images. Only one study employed an out-of-sample dataset for external validation. Imaging modalities were categorized as US (n = 23), MRI (n = 5), and CT (n = 3). The distribution of studies on AI algorithms is as follows: DL (8 studies), ML (21 studies), and both (2 studies). Table 1 provides detailed characteristics of these included studies.

Table 1 Characteristics of studies included in this review

Pooled performance of AI algorithms

In the primary analysis (one contingency table per study, selecting the table with the highest accuracy), the pooled sensitivity was 91% (95% CI 86–95%) and the pooled specificity was 84% (95% CI 78–89%), with an AUC of 0.94 (95% CI 0.91–0.95) (Fig. 3a). An auxiliary analysis including all reported contingency tables (67 tables from the same 14 studies) yielded a pooled sensitivity of 89% (95% CI 86–90%) and specificity of 84% (95% CI 82–86%), with an AUC of 0.93 (95% CI 0.90–0.95) (Figs. 3b and 4). However, only one of the 14 included studies performed external validation, which raises concerns about the generalizability of the pooled estimates.

Fig. 3 SROC curves of the studies included in the meta-analysis (14 studies). a SROC curve when selecting the contingency table reporting the highest accuracy per study (14 studies with 14 tables). b SROC curve including all reported contingency tables (14 studies with 67 tables). Abbreviations: SROC = summary receiver operating characteristic; SENS = summary sensitivity; SPEC = summary specificity

Fig. 4 Forest plot of studies included in the meta-analysis (14 studies)

Quality assessment

The quality of the included studies was evaluated using QUADAS-AI (Fig. 5). Detailed assessments for each item, categorized by risk-of-bias domain and applicability concern, are provided in Supplementary Fig. 1.

Fig. 5 QUADAS-AI summary plot

In the patient-selection domain of risk of bias, 8 studies were deemed at high or unclear risk because of unreported or unclear patient eligibility criteria, inappropriate exclusions, or an unclear description of the data source. In the index-test domain, one study was considered at high risk because the specific index test was selected only after the results had been seen. Additionally, 8 studies had an unclear risk of bias for the reference standard, as they did not describe it clearly enough for judgment. A subset of studies (n = 6) had a high or unclear risk of bias for flow and timing, as they did not report the interval between the reference standard and the index test, or the interval was considered unreasonable.

In the applicability-concern domain, 3 studies had unclear applicability in patient selection and 4 in the reference standard, owing to insufficient description of the patient eligibility criteria or the reference standard. One study raised high applicability concerns in the index-test domain because its imaging method is not commonly used in clinical practice. Another study raised high applicability concerns in the reference-standard domain because its reference standard (B-mode US alone to define unstable carotid plaque) did not conform to current criteria for detecting unstable carotid plaques [6, 10, 52, 53].

Subgroup meta-analyses

Considering their developmental stage and inherent differences, we categorized the algorithms into ML and DL for subgroup analysis. The pooled sensitivity was 89% (95% CI: 86–91%) for ML and 88% (95% CI: 79–94%) for DL, and the pooled specificity was 84% (95% CI: 81–86%) for ML and 85% (95% CI: 75–92%) for DL (Supplementary Figs. 2–4).

The 11 US studies had a pooled sensitivity of 89% (86–90%), a pooled specificity of 84% (82–86%), and an AUC of 0.93 (0.90–0.95). The single CTA study had a sensitivity of 82% (73–88%), a specificity of 85% (76–90%), and an AUC of 0.90 (0.87–0.92). The two MRI studies had a pooled sensitivity of 88% (82–92%), a pooled specificity of 88% (65–97%), and an AUC of 0.91 (0.88–0.93) (Supplementary Figs. 5–8).

Other characteristics of the included studies are summarized in Supplementary Tables 3 and 4. Among them, 8 studies had sample sizes ≤ 300 and 6 studies had sample sizes > 300. The pooled sensitivity was 87% (83–91%) for sample sizes ≤ 300 and 89% (86–91%) for sample sizes > 300. The corresponding specificity was 84% (78–89%) and 84% (81–86%), and the AUC was 0.92 (0.90–0.94) and 0.93 (0.90–0.95), respectively (Supplementary Figs. 9–11).

Eight studies were published before 2017 and six after 2017. The pooled sensitivity was 83% (80–85%) for studies published before 2017 and 92% (89–94%) for those published after 2017. The corresponding specificity was 81% (78–83%) and 86% (82–89%), and the AUC was 0.89 (0.86–0.91) and 0.95 (0.93–0.97), respectively (Supplementary Figs. 12–14).

Two studies used automatic segmentation and 8 used manual segmentation. The pooled sensitivity was 82% (78–86%) for automatic segmentation and 89% (86–92%) for manual segmentation; the corresponding specificity was 77% (73–81%) and 83% (78–87%), and the AUC was 0.87 (0.83–0.89) and 0.93 (0.90–0.95), respectively (Supplementary Figs. 15–17).

Eleven studies used radiomics features and 3 used traditional features. The pooled sensitivity was 89% (86–91%) for radiomics features and 89% (79–94%) for traditional features; the corresponding specificity was 85% (83–87%) and 74% (58–85%), and the AUC was 0.93 (0.90–0.95) and 0.90 (0.87–0.92), respectively (Supplementary Figs. 18–20).

Heterogeneity analysis

The meta-analysis of 14 studies, based on a random-effects model, indicates that AI algorithms contribute significantly to the risk stratification of carotid plaques in medical imaging. However, substantial heterogeneity exists among the included studies, with an I² of 92.42% for sensitivity and 94.06% for specificity (p < 0.01) (Fig. 3). Detailed results of the subgroup and meta-regression analyses exploring potential sources of between-study heterogeneity are presented in Supplementary Table 5 and Supplementary Figs. 2–20, which revealed statistically significant differences. Funnel plot asymmetry suggests publication bias among the studies (p < 0.01) (Supplementary Fig. 21).

Discussion

Accurate and timely recognition of unstable carotid plaques using imaging is crucial for guiding appropriate medical management and improving patient prognosis. With the rapid expansion of artificial intelligence (AI) in medical imaging, radiomics and other AI models have been increasingly investigated for the identification of unstable plaques [13, 14]. In this systematic review and meta-analysis, we evaluated the performance of AI systems in differentiating stable from unstable carotid plaques across three imaging modalities: US, CT, and MRI (Fig. 1). Our study adhered strictly to diagnostic review guidelines and involved a comprehensive literature search in both medical and engineering/technology databases to ensure the study’s rigor. This study provided details on the performance of different algorithms, imaging modalities, sample sizes, publication years, segmentation methods and imaging features. We identified potential sources of heterogeneity among studies through subgroup and meta-regression analyses. Importantly, we rigorously assessed the quality and risk of bias in the studies using an adapted QUADAS-AI assessment tool, which is a strength of this systematic review and will guide future related studies more effectively.

Our meta-analysis demonstrated that AI algorithms achieve excellent diagnostic performance in distinguishing unstable from stable carotid plaques, with a pooled AUC of 0.94 (Fig. 3). This finding is consistent with previous evidence. For example, Saba et al. reported pooled AUCs of 0.85 for radiomics models and 0.89 when radiomics features were combined with clinical data in a radiomics-focused meta-analysis of CT and MRI [14]. Similarly, a recent meta-analysis of AI-assisted CTA also found high performance (AUC = 0.96) for stenosis and calcium detection [54]. Taken together, these results highlight the potential clinical value of AI in stroke prevention, particularly for the earlier detection of unstable plaques in patients with moderate stenosis or asymptomatic disease, where management decisions remain uncertain [13].

Compared with radiologist interpretation, AI shows potential advantages. Pakizer et al. reported high accuracy for MRI (90%) and CT (86%) but lower accuracy for US (80%) when using conventional imaging features [55]. Although direct comparison is limited by different metrics, our results suggest AI improves performance, particularly in US, where AI-assisted models achieved sensitivity of 0.89 and specificity of 0.84 (Supplementary Fig. 8). Evidence also indicates that AI may outperform radiologists: one study showed deep learning models with higher AUCs than experienced readers [34], and another in intracranial atherosclerosis found radiomics superior to radiologist evaluation [56]. These findings support AI as a complement to conventional assessment, though more head-to-head studies are needed.

Among AI-assisted diagnostic models, most are built on either radiomics features or conventional imaging features. Traditional features assessed by radiologists include intraplaque hemorrhage, lipid-rich necrotic core, disrupted surface, minimum luminal area, degree of stenosis, and enhancement ratio [37, 40]. However, their diagnostic value is limited by the resolving capacity of the human eye and the subjective nature of image interpretation. Radiomics, by contrast, extracts and quantifies a large number of high-dimensional features that are imperceptible to human observers [11]. Several studies directly compared traditional and radiomics features [27, 37, 40] and found that radiomics-based models outperformed traditional models. The subgroup analysis in this meta-analysis likewise showed an advantage for radiomics features-based models (Supplementary Fig. 18).
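
As a simple illustration of what extracting quantitative features beyond visual assessment can look like, the sketch below computes a few first-order descriptors from a synthetic plaque region of interest; dedicated radiomics toolkits (e.g., PyRadiomics) additionally extract shape and texture features, typically hundreds per lesion.

```python
# Minimal first-order radiomics illustration on a synthetic image patch and ROI mask.
# The data are synthetic; real pipelines extract many shape/intensity/texture features.
import numpy as np

rng = np.random.default_rng(0)
image = rng.normal(100, 20, size=(64, 64))   # synthetic grayscale patch
mask = np.zeros(image.shape, dtype=bool)
mask[20:44, 20:44] = True                    # hypothetical plaque ROI

voxels = image[mask]
counts, _ = np.histogram(voxels, bins=32)
p = counts[counts > 0] / counts.sum()

features = {
    "mean": float(voxels.mean()),
    "variance": float(voxels.var()),
    "skewness": float(((voxels - voxels.mean()) ** 3).mean() / voxels.std() ** 3),
    "entropy": float(-(p * np.log2(p)).sum()),
}
print(features)
```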

In studies published after 2017, the performance of AI models surpassed that of studies published before 2017 (Supplementary Fig. 12). Possible explanations include improved data quality and availability, continuous advances in AI algorithms and architectures, greater computational power and specialized hardware, and publication bias; indeed, we must acknowledge that AI models with good performance are more likely to be published.

Advancements in ML have enabled the analysis of large volumes of medical imaging data, but ML still relies on manual feature extraction, which hinders full automation. DL, a newer class of ML, overcomes some of these limitations by employing multiple neural network layers and greater computational power, although it is prone to overfitting and requires larger datasets [57]. In our sub-analysis of different algorithms, no significant difference was observed, likely because the small dataset sizes limited the advantages of DL. However, as only three DL studies were included in our analysis, the diagnostic performance of DL models remains to be further validated. For instance, a recent study proposed a DL-based framework for carotid plaque detection and demonstrated that its diagnostic accuracy exceeded that of 4 out of 6 experienced radiologists [58].

In this meta-analysis, the imaging modalities included US (11 studies), CTA (1 study), and MRI (2 studies). The diagnostic performance of AI was similar across modalities, suggesting that it can effectively assist clinical decision-making whether based on US, CTA, or MRI. Some studies have demonstrated highly accurate risk stratification of carotid plaques using US alone, highlighting its potential for broad clinical application. However, US is a two-dimensional imaging modality, whereas CTA and MRI provide three-dimensional imaging. Given that atherosclerotic plaques are three-dimensional structures, future development of 3D US combined with AI could help overcome this limitation and enhance the detection of unstable plaques. Additionally, US has other limitations, such as lower resolution and less detailed tissue characterization than MRI and CT, which may affect its diagnostic accuracy [34]. In summary, exploiting 3D information from CTA and MRI, or combining multiple imaging techniques, may strengthen the capabilities of AI.

As for image segmentation methods, only two studies used automatic segmentation [21, 22]. Models based on automatic segmentation did not perform as well as models based on manual segmentation (Supplementary Fig. 15). However, automatic segmentation has potential advantages in consistency and efficiency: it can provide a consistent approach to ROI delineation across large datasets, whereas manual segmentation may suffer from inter- and intra-observer variability and is labor-intensive and time-consuming, especially for high-resolution images or when precise delineation is required. A recent study applied AI for automated segmentation of carotid plaques on US images and demonstrated that the model could differentiate unstable from stable plaques with good diagnostic performance in the validation cohort, achieving an AUC of 82.7% (95% CI: 71.6–93.8%) [59]. Because the performance of automatic segmentation models depends heavily on the quality and representativeness of the training data, the performance gap may narrow as automatic segmentation algorithms improve.

While AI algorithms offer significant potential in medical imaging, their adoption in clinical practice requires careful consideration of methodological limitations. One major challenge is explainability—the ability to communicate AI decision-making in terms understandable to humans [60, 61]. Explanations supporting a model’s output are crucial in medicine, as clinicians need more than a binary prediction to support diagnostic decisions [62]. However, none of the included studies provided explicit explanations of their AI models. Current research on explainable AI (XAI) in medical imaging has proposed techniques such as saliency and heat maps, Grad-CAM, attention mechanisms, and feature importance analysis, which can help visualize the regions or features driving AI predictions [63]. These methods can enhance transparency and user trust, but have not yet been widely applied in carotid plaque imaging. Future research should therefore prioritize integrating explainability into AI models for plaque assessment. An important direction is to connect imaging features identified by algorithms to specific pathological characteristics (e.g., inflammatory infiltration, lipid-rich necrotic core, angiogenesis), which would increase interpretability, provide mechanistic insight, and offer more actionable information for guiding treatment decisions. Such efforts are essential for improving clinical acceptance of AI-based tools.
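
To make the explainability discussion concrete, the sketch below outlines Grad-CAM implemented with PyTorch hooks on a placeholder classifier; the backbone, two-class head, and random input are assumptions for illustration and are not components of any included study.

```python
# Minimal Grad-CAM sketch (PyTorch). The backbone, two-class head, and random input
# are placeholders for illustration; this is not code from any included study.
import torch
import torch.nn.functional as F
from torchvision import models

model = models.resnet18(weights=None)                     # placeholder backbone
model.fc = torch.nn.Linear(model.fc.in_features, 2)       # stable vs. unstable (assumed)
model.eval()

activations, gradients = {}, {}
target_layer = model.layer4                               # last convolutional block

target_layer.register_forward_hook(lambda m, i, o: activations.update(value=o.detach()))
target_layer.register_full_backward_hook(lambda m, gi, go: gradients.update(value=go[0].detach()))

x = torch.randn(1, 3, 224, 224)                           # placeholder image patch
logits = model(x)
cls = logits.argmax(dim=1).item()
logits[0, cls].backward()                                 # gradient of the predicted class score

weights = gradients["value"].mean(dim=(2, 3), keepdim=True)        # channel importance
cam = F.relu((weights * activations["value"]).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)           # heat map in [0, 1]
print(cam.shape)                                          # overlay on the input image for review
```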

Furthermore, most studies were conducted in single centers with limited data availability. Only three of the included studies performed external validation, that is, testing the model's performance on out-of-sample datasets from other institutions, and only one of these was included in the meta-analysis, which precluded a subgroup analysis. This highlights the need for rigorous and reliable evaluation of AI performance using multi-center datasets. Most of the included studies divided a single institution's dataset into training, internal validation, and test sets to judge performance. To assess how a model performs in different populations, it is preferable to obtain a new dataset from a different institution for external validation. The lack of external validation may lead to overestimation of results and compromise the generalizability of the models.

Finally, there is a scarcity of prospective studies conducted in real clinical environments. Most of the included studies relied on retrospective data from hospital medical records. Prospective studies provide more robust evidence, and we expect more prospective AI research to emerge in the future. Future research should also prioritize validation on larger multi-center datasets, as well as real-time AI applications integrated into clinical workflows. These steps are essential for ensuring that AI tools are robust, generalizable, and clinically useful for guiding patient management.

This study has several limitations. Beyond the previously mentioned publication bias, extracting multiple configurations from a single study can introduce bias; our subgroup analyses are therefore exploratory and hypothesis-generating, and additional studies are needed for definitive comparative conclusions. Our exclusion of studies lacking contingency tables may also introduce bias; these studies offer valuable data but could not be incorporated into the meta-analysis. Language bias is possible because we considered only English-language studies, omitting results published in other languages. Furthermore, a small proportion of the included studies were judged to be at high risk of bias according to QUADAS-AI. This limitation likely contributes to the observed high heterogeneity and lowers the certainty of our pooled estimates. Therefore, our findings should be interpreted with caution.

In conclusion, our meta-analysis demonstrated that AI algorithms achieved high diagnostic performance in distinguishing unstable from stable carotid plaques, with pooled sensitivity of 91%, specificity of 84%, and an AUC of 0.94. These results highlight the promise of AI-assisted imaging for risk stratification and for identifying patients who may benefit from intervention beyond medical therapy. However, the evidence is limited by the relatively low methodological quality of included studies, the lack of external validation in most datasets, and the very high heterogeneity (I2 > 90%). Taken together, while AI shows substantial potential in carotid plaque imaging, these limitations temper the certainty and generalizability of our findings and underscore the need for more prospective, externally validated studies.