Background

Non-alcoholic fatty liver disease (NAFLD) has emerged as a global health challenge with escalating prevalence rates. According to World Health Organization estimates, 20–30% of adults worldwide are affected by NAFLD [1]. Regional epidemiological studies reveal significant variations, with prevalence reaching 30% in the United States [2] and 15.3% in China, demonstrating distinct gender disparities [3]. The disease is particularly prevalent in populations with metabolic comorbidities such as obesity and diabetes, positioning NAFLD as a critical chronic metabolic disorder requiring urgent public health attention.

NAFLD is characterized by pathological fat accumulation in hepatocytes unrelated to alcohol consumption, often accompanied by hepatic inflammation and fibrosis. As an independent risk factor, it exhibits strong associations with metabolic syndrome components including diabetes, hypertension, and dyslipidemia [4]. Although frequently asymptomatic in early stages, NAFLD carries significant progression risks to advanced histological forms such as cirrhosis and hepatocellular carcinoma, underscoring the imperative for early detection and intervention.

Traditional diagnostic paradigms have relied on clinical evaluation combined with radiographic imaging and histopathological confirmation. While liver biopsy remains the diagnostic gold standard, its invasive nature and associated complications limit routine application [5]. Conventional imaging modalities including ultrasound, computed tomography (CT), and magnetic resonance imaging (MRI) provide non-invasive alternatives but face limitations in differentiating simple steatosis from progressive forms of NAFLD [6].

The diagnostic landscape has evolved significantly with technological advancements. Initial reliance on liver enzyme levels and basic ultrasound features has given way to sophisticated imaging techniques enabling quantitative assessment of hepatic fat content and fibrosis staging. Modern CT and MRI protocols now permit non-invasive evaluation of liver fat percentage and spatial distribution, addressing previous limitations in diagnostic specificity.

The integration of artificial intelligence (AI) with medical imaging has revolutionized NAFLD diagnostics through three primary methodological approaches [7]:

  • Computer Vision Systems: Analyze textural patterns and contrast variations in liver imaging for pathological feature extraction.

  • Machine Learning Models: Employ algorithms like support vector machines and random forests to classify clinical-imaging data composites.

  • Deep Learning Architectures: Utilize convolutional neural networks (CNNs) for automated analysis of medical imaging datasets.

Notable advancements include Perveen et al.’s decision tree-based risk prediction system [8], Ji et al.’s XGBoost framework emphasizing metabolic parameters [9], and Che et al.’s multi-scale CNN achieving >90% classification accuracy [10]. CT-based innovations are particularly promising, with Huo et al.’s ALARM system automating liver attenuation measurements [11] and Graffy et al.’s deep learning tool enabling large-scale steatosis quantification [12]. In clinical practice, CT-based assessment of hepatic steatosis severity typically involves segmentation of the liver and spleen. CT surpasses ultrasound in both accuracy and specificity for detecting hepatic steatosis, making it the preferred modality for definitive diagnosis. Common quantitative approaches include fat density measurement, liver volumetry, liver fat scoring systems, and the liver-spleen attenuation ratio method. The liver-spleen ratio method has gained particular traction due to its operational simplicity: it requires only density measurements from CT images to calculate the ratio, making it a rapid and widely adopted clinical tool for severity assessment.

Despite these advancements, current AI implementations face challenges in clinical translation, including requirements for large annotated datasets, computational resource demands, and limited generalizability across imaging platforms. Recent systematic reviews highlight ultrasound’s continued dominance in AI-based NAFLD research [13], while CT-based studies remain comparatively underexplored despite their advantages in quantitative assessment and operator independence.

The clinical imperative for improved NAFLD management stems from two critical gaps: 1) the invasiveness and sampling variability of liver biopsy, and 2) the subjectivity and labor-intensity of radiologist-dependent CT analysis. Current clinical protocols require manual region-of-interest (ROI) selection for liver-spleen attenuation ratio calculation - a process prone to inter-observer variability and diagnostic latency [14].

This context creates strong motivation for developing automated solutions that combine CT's quantitative capabilities with deep learning's pattern recognition strengths. Such systems must address three key requirements: 1) accurate 3D segmentation of liver-spleen anatomy, 2) robust fat quantification metrics, and 3) clinical integration through user-friendly interfaces. The proposed work aims to bridge these gaps through three innovations: a semi-automated data curation framework, an enhanced segmentation architecture, and an automated severity scoring system, collectively designed to enhance diagnostic objectivity and streamline clinical workflows. The following subsections provide detailed technical descriptions of these three modules, with their structural relationships visually summarized in Fig. 1(a).

  1. Semi-automatization nnU-Net Module (SNM): Enables standardized CT dataset creation with dual-organ labeling through intelligent human-AI collaboration, addressing annotation bottlenecks while preserving clinical validity.

  2. Focal Feature Fusion Swin-Unet Module (FSM): Enhances 3D segmentation performance through transformer-based feature fusion, achieving Dice scores of 0.962 \(\pm\) 0.021 for liver and 0.942 \(\pm\) 0.034 for spleen in preliminary trials.

  3. Automated Severity Scoring Module (ASSM): Implements CT attenuation ratio analysis with spatial distribution weighting, demonstrating 92.7% diagnostic concordance with expert radiologists in validation studies.

Fig. 1 Flow chart of the proposed method. (a) Flow chart of the experimental process. (b) Schematic diagram of the raw data and the data after processing. (c) Automated method for severity scoring of NAFLD

The remainder of this paper is organized as follows: Section 2 introduces the materials and methods, Section 3 presents the results, Section 4 discusses the findings, and Section 5 concludes with prospects for future work.

Methods

All data were processed in compliance with the 1964 Helsinki Declaration and its subsequent amendments, ensuring the removal of personally identifiable information. Consent for usage was obtained, and public dataset guidelines were strictly followed. Fig. 1 presents the overall framework of the proposed method in this study. We describe the three crucial modules of this study in the following subsections: (1) Semi-automatization nnU-Net module (SNM): This module is used for semi-automated dataset construction. (2) Focal Feature Fusion Swin-Unet module (FSM): This module serves as the NAFLD segmentation detection model with the aim of improving the segmentation of the liver and spleen. (3) Automated severity scoring module (ASSM): This method is designed to combine the segmentation model to achieve NAFLD severity scoring.

Semi-automatization nnU-Net module (SNM)

To verify the performance of our proposed method, an improved version of the Medical Segmentation Decathlon (MSD) dataset is used, and the dataset is divided using cross-validation.

The MSD competition aims to promote research progress in the field of medical image segmentation [15]. It provides a series of medical image segmentation datasets covering different types of image data from multiple medical fields (such as neurology, cardiology, and mammography). In this study, we use the liver CT images and spleen CT images; the original data are described in Table 1.

Table 1 Description of liver and spleen raw data in MSD

Liver: This dataset consists of contrast-enhanced CT images from 201 patients with primary cancer and metastatic liver disease arising from colon, breast, and lung cancers. The corresponding target ROI is the segmentation of the liver and intrahepatic tumors. The data were acquired at IRCAD (Hôpitaux Universitaires de Strasbourg), France, and include a subset of patients from the 2017 Liver Tumor Segmentation (LiTS) challenge.

Spleen: This dataset includes portal-phase CT scans of 61 patients with liver metastases undergoing chemotherapy. The corresponding target ROI is the spleen. The data were acquired at Memorial Sloan Kettering Cancer Center in New York, USA.

Given the substantial heterogeneity between the liver and spleen CT datasets in acquisition protocols and anatomical characteristics, the integration of multi-institutional data necessitated rigorous imaging compatibility protocols to address cross-scanner variability and pathological confounders. All CT scans underwent comprehensive preprocessing, beginning with spatial standardization using third-order B-spline interpolation to achieve isotropic \(1.5\times1.5\times3~\mathrm{mm}^3\) voxel resolution, ensuring anatomical consistency across heterogeneous scanner platforms. Intensity normalization was implemented through a two-stage process: initial soft-tissue windowing (\(WL=40\), \(WW=400\)) optimized soft-tissue contrast, followed by z-score standardization (\(\mu=60~\mathrm{HU}\), \(\sigma=25~\mathrm{HU}\)) referenced to splenic parenchyma regions of interest. Through expert consensus, non-conforming cases were excluded, including non-portal venous phase scans, cases with liver-spleen pathological interactions, and technically deficient scans.
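As an illustration of this preprocessing protocol, the following sketch implements the resampling, windowing, and z-score stages with SimpleITK. Only the numeric constants come from the text above; the function name and structure are our own.

```python
# Illustrative preprocessing sketch (SimpleITK) following the protocol above.
import SimpleITK as sitk
import numpy as np

def preprocess_ct(image: sitk.Image) -> np.ndarray:
    # 1) Spatial standardization: resample to isotropic 1.5 x 1.5 x 3 mm voxels
    #    with third-order B-spline interpolation.
    new_spacing = (1.5, 1.5, 3.0)
    old_size, old_spacing = image.GetSize(), image.GetSpacing()
    new_size = [int(round(osz * osp / nsp))
                for osz, osp, nsp in zip(old_size, old_spacing, new_spacing)]
    image = sitk.Resample(image, new_size, sitk.Transform(),
                          sitk.sitkBSpline, image.GetOrigin(),
                          new_spacing, image.GetDirection(), 0.0,
                          image.GetPixelID())

    # 2) Intensity normalization, stage 1: soft-tissue windowing
    #    (WL=40, WW=400), i.e. clip to [-160, 240] HU.
    arr = sitk.GetArrayFromImage(image).astype(np.float32)
    arr = np.clip(arr, 40 - 200, 40 + 200)

    # 3) Stage 2: z-score standardization referenced to splenic parenchyma
    #    (mu = 60 HU, sigma = 25 HU per the text).
    return (arr - 60.0) / 25.0
```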

To address the spleen dataset's critical limitation of sample size (n = 61), we developed a semi-automated data augmentation framework comprising three synergistic phases. First, dedicated nnU-Net models were independently trained: a tumor-excluded liver model using the 201 hepatic CT scans and a spleen model utilizing all 61 splenic studies [16]. Cross-domain prediction was subsequently implemented, in which each model processed the other dataset's scans (the liver model predicting liver masks for the spleen cases and vice versa) to generate preliminary segmentation masks. A rigorous quality control protocol was then applied: board-certified radiologists refined these predictions across 262 candidate scans using 3D Slicer software, with final selection of 98 optimal studies based on quantitative metrics including slice thickness consistency, contrast-to-noise ratio, and inter-rater annotation agreement.
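A minimal sketch of the cross-domain prediction step is given below, assuming the nnU-Net v1 command-line interface; the task names and folder paths are placeholders, not the study's actual identifiers.

```python
# Hedged sketch of cross-domain prediction: each organ-specific nnU-Net
# produces draft masks for the *other* dataset, which radiologists then
# refine in 3D Slicer. Task IDs and paths below are placeholders.
import subprocess

def cross_predict(input_dir: str, output_dir: str, task: str) -> None:
    subprocess.run(
        ["nnUNet_predict",
         "-i", input_dir,     # scans from the other organ's dataset
         "-o", output_dir,    # draft masks for radiologist refinement
         "-t", task,          # trained task name (placeholder)
         "-m", "3d_fullres"],
        check=True)

# Liver model annotates the 61 spleen-dataset scans, and vice versa.
cross_predict("spleen_scans/", "draft_liver_masks/", "Task_LiverNoTumor")
cross_predict("liver_scans/", "draft_spleen_masks/", "Task_Spleen")
```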

The final curated dataset comprised 98 diagnostically coherent cases, achieving 4.3\(\times\) effective expansion of spleen training data while reducing manual segmentation workload by 80% compared to full manual annotation. This tripartite approach preserved anatomical fidelity through controlled heterogeneity, maintaining balanced representation of NAFLD progression stages and splenic morphological variants, crucially ensuring diagnostic validity for small-organ segmentation tasks.

Focal feature fusion Swin-Unet module (FSM)

The primary task of medical image segmentation is to separate regions of interest from the background of the images. Regions of interest typically include organ structures, tumors, and more. In this paper, precise segmentation of the liver and spleen is required, which falls under the category of simultaneous multi-organ segmentation in abdominal CT images. However, challenges in achieving accurate segmentation arise due to factors such as the presence of surrounding soft tissues, organ deformations and variations, low-intensity contrast between adjacent organs, and high annotation costs.

In this paper, we primarily utilize Swin-Unet as the fundamental model for our automatic segmentation strategy and enhance its baseline structure. Swin-Unet is a medical image segmentation model that combines the Swin Transformer with the U-net structure [17]. In traditional U-net models, both the encoder and decoder employ convolutional neural network structures, which perform well in medical image segmentation tasks. Swin-Unet replaces these with transformer blocks, but the window-based local attention mechanism of the Swin Transformer restricts information flow between distant pixels, so the network can fail to capture effective fine-grained information, resulting in local segmentation errors or inaccurate edges. To address these issues, this paper introduces the FF Swin-Unet, which improves the segmentation of the liver and spleen by changing the backbone module to the Focal Transformer and adding a downsampling FFM module.

The Swin-Unet model segments the input medical images into non-overlapping image blocks of the same size and inputs them as tokens into the encoder. To acquire contextual information, the encoder employs hierarchical Swin Transformer Blocks as its basic components and applies the shifted window technique to each Swin Transformer block. The encoder’s output is fused with the upsampling operation of the decoder and combined through skip connections to capture multi-scale features. The patch expanding layer in the decoder uses Swin Transformer operations for upsampling to maximize the spatial resolution of image segmentation. The architecture of Swin-Unet is depicted in Fig. 2(a) below:

Fig. 2 Structure of the proposed focal feature fusion Swin-Unet module in detail. (a) Structure of Swin-Unet. (b) Structure of the focal feature fusion Swin-Unet, where the orange part is the main improved structure

The computations within two successive Swin Transformer blocks can be represented as:

$$\hat{z}^{l}=W\mathrm{-}MSA(LN(z^{l-1}))+z^{l-1},$$
(1)
$$z^{l}=MLP(LN(\hat{z}^{l}))+\hat{z}^{l}, $$
(2)
$$\hat{z}^{l+1}=SW\mathrm{-}MSA(LN(z^{l}))+z^{l}, $$
(3)
$$z^{l+1} =MLP(LN(\hat{z}^{l+1}))+\hat{z}^{l+1}, $$
(4)

where W-MSA and SW-MSA respectively denote window-based multi-head self-attention and its shifted window variant; MLP refers to a multilayer perceptron with GELU activation function; LN represents layer normalization operations. The variables \(z^{l}\) and \(\hat{z}^{l}\) correspond to the output tensors from the MLP module and the (S)W-MSA module within the \(l^{th}\) transformer block, respectively. The self-attention is calculated as follows:

$$Attention(Q,K,V)=SoftMax(\frac{QK^T}{\sqrt{d}}+B)V$$
(5)

where \(Q,K,V \in \mathbb{R}^{M^2 \times d}\) denote the query/key/value matrices derived from linearly projected input features. Here, \(M^2\) indicates the number of patches per window, \(d\) the feature dimension, and \(B \in \mathbb{R}^{M^2 \times M^2}\) represents the learnable relative position bias whose values are taken from \(\hat{B} \in \mathbb{R}^{(2M-1)\times(2M-1)}\) through spatial coordinate mapping.
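For concreteness, a minimal PyTorch sketch of Eq. (5), window-based multi-head self-attention with the learnable relative position bias \(B\), is given below. The paper's implementation is built in MedicalSeg; the class and variable names here are illustrative only.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """W-MSA for one window of M*M tokens, per Eq. (5)."""
    def __init__(self, dim: int, window_size: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # \hat{B}: one bias per relative offset in [-(M-1), M-1]^2, per head.
        M = window_size
        self.bias_table = nn.Parameter(torch.zeros((2 * M - 1) ** 2, num_heads))
        coords = torch.stack(torch.meshgrid(
            torch.arange(M), torch.arange(M), indexing="ij")).flatten(1)
        rel = coords[:, :, None] - coords[:, None, :]        # 2 x M^2 x M^2
        rel = rel.permute(1, 2, 0) + (M - 1)                 # shift to >= 0
        self.register_buffer("index", rel[..., 0] * (2 * M - 1) + rel[..., 1])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_windows * batch, M^2, dim)
        B_, N, C = x.shape
        qkv = self.qkv(x).reshape(B_, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # each: B_, heads, N, d
        attn = (q @ k.transpose(-2, -1)) * self.scale        # QK^T / sqrt(d)
        bias = self.bias_table[self.index.view(-1)].view(N, N, -1)
        attn = (attn + bias.permute(2, 0, 1).unsqueeze(0)).softmax(dim=-1)  # +B
        return self.proj((attn @ v).transpose(1, 2).reshape(B_, N, C))
```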

To extract effective local and fine-grained information and improve the segmentation of the liver and spleen, this paper first replaces the Swin Transformer block with the focal transformer block [18]; the overall structure after this replacement is shown in Fig. 2(b). The most crucial part of the focal transformer block is Focal Self-Attention (FSA), which attends finely to features close to the current token and coarsely to features that are further away, thereby achieving effective local-global information interaction. Fig. 3 presents the structure of the focal transformer block.

Fig. 3 Structure of the focal transformer block

Traditionally, transformers have used flat projections of image blocks or merged features of adjacent \(2\times2\) blocks, followed by linear processing, to create a multi-level network. However, this method is prone to losing a significant amount of fine-grained feature information, which is detrimental to semantic segmentation of small-volume, densely textured tissues and organs in abdominal CT images. Therefore, in the second step, after each focal transformer block in the encoder, we add the FFM downsampling module and remove the original patch merging layer [19]. This effectively mitigates the aforementioned issue and enhances the segmentation of small-volume tissues.

The FFM architecture consists of two branches, as shown in Fig. 4. One branch uses dilated convolution to expand the receptive field and capture feature information of small-volume tissue organs. This branch first increases the dimension by applying \(1\times1\) convolution, followed by a \(3\times3\) dilated convolution layer to obtain extensive structural information. Subsequently, global average pooling is used to obtain statistical data of feature maps in the spatial direction (vertical and horizontal). Specifically, the calculation formula for elements in each direction is as follows:

Fig. 4 Structure of the feature fusion module (FFM). FFM contains two main branches: the upper branch uses dilated convolution, expanding the receptive field to capture the feature information of smaller tissues and organs; the lower branch introduces a soft pooling operation aimed at finer downsampling

Fig. 5 Validation results for Swin-Unet and FF Swin-Unet. The horizontal axis represents the training iterations and the vertical axis represents the DSC

$$v_{h_i}^k=\frac{1}{w}\sum_{j=0}^{w-1}\hat{z}^k(i,j)$$
(6)
$$v_{w_{j}}^{k}=\frac{1}{h}\sum_{i=0}^{h-1}\hat{z}^{k}(i,j)$$
(7)

in this process, \(i \in [0,h-1]\), \(j \in [0,w-1]\), and \(k \in [0,C-1]\) represent indices for the vertical direction, horizontal direction, and channels, respectively. The feature \(\hat{z}\) undergoes transformation by the function \(f(\cdot)\), which denotes a dilated convolution layer with normalization and a GELU activation function. \(v_{h}\) and \(v_{w}\) are contraction weights for the feature maps in the spatial directions, and their multiplication therefore yields position-related feature maps. Finally, a \(1\times1\) convolution layer is introduced to reduce the feature dimension.

The other branch introduces a soft pooling operation aimed at finer downsampling. Soft pooling weights the pixels within the pooling kernel by their exponential activations, preserving more detailed information. The soft-pooled features are then processed through a convolution layer (with increased dimension) to obtain the desired final output.

In summary, the two branches serve different functions: one is for capturing the features of small-volume tissue organs, while the other aims to retain more details. Both branches are equally important, and they are merged in equal proportions to form the output of the downsampling layer.

In the Focal Transformer Block, the self-attention mechanism operates with a base window size of \( M = 7 \), partitioned into \( 3 \times 3 \) sub-windows for fine-grained local attention. The focal levels are set to \( \{1, 2, 4\} \), enabling progressive expansion of the receptive field across hierarchical stages. In the FFM module, the dilated convolution branch employs a dilation rate of \( r = 2 \) with a \( 3 \times 3 \) kernel to capture extended contextual information. The soft pooling branch uses \( 2 \times 2 \) kernels with stride 2, applying exponential activation weights (\( \alpha = 2.0 \)) to preserve spatial details. Both branches are fused with equal weights (\( \lambda_{\text{dilated}} = \lambda_{\text{pool}} = 0.5 \)) after channel-wise concatenation. During training, we adopt a batch size of 16, initial learning rate of \( 1 \times 10^{-4} \), and AdamW optimizer with weight decay \( 0.01 \).
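To make the two-branch design concrete, the following PyTorch sketch wires together the pieces described above using the stated hyperparameters (dilation \(r=2\), \(2\times2\)/stride-2 soft pooling, equal 0.5 fusion weights). The module and function names, the stride placement, and the sigmoid gating are our own illustrative choices, not the paper's exact implementation, which is built in MedicalSeg/PaddlePaddle.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_pool2d(x, kernel=2, stride=2):
    # Soft pooling: exponentially weighted average of activations within
    # each kernel window (subtracting the max would improve stability).
    w = torch.exp(x)
    return F.avg_pool2d(x * w, kernel, stride) / F.avg_pool2d(w, kernel, stride)

class FFM(nn.Module):
    """Sketch of the two-branch FFM downsampling layer (halves H and W)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # Branch 1: dilated-conv branch with directional pooling (Eqs. 6-7).
        self.expand = nn.Conv2d(c_in, c_out, 1)
        self.dilated = nn.Conv2d(c_out, c_out, 3, stride=2, padding=2, dilation=2)
        self.f_h = nn.Sequential(nn.Conv2d(c_out, c_out, 1),
                                 nn.BatchNorm2d(c_out), nn.GELU())
        self.f_w = nn.Sequential(nn.Conv2d(c_out, c_out, 1),
                                 nn.BatchNorm2d(c_out), nn.GELU())
        self.reduce = nn.Conv2d(c_out, c_out, 1)
        # Branch 2: soft-pooling branch with channel expansion.
        self.expand2 = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        z = self.dilated(self.expand(x))        # large receptive field
        v_h = z.mean(dim=3, keepdim=True)       # Eq. (6): average over width
        v_w = z.mean(dim=2, keepdim=True)       # Eq. (7): average over height
        attn = self.f_h(v_h) * self.f_w(v_w)    # position-related weights
        b1 = self.reduce(z * torch.sigmoid(attn))
        b2 = self.expand2(soft_pool2d(x))       # finer downsampling
        return 0.5 * b1 + 0.5 * b2              # equal-weight fusion
```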

In this paper, we utilize the Medical Image Segmentation Suite (MedicalSeg) for the modeling process of various models. MedicalSeg is a user-friendly, robust, end-to-end 3D medical image segmentation solution that supports the entire workflow from data preprocessing, model training, model evaluation, to model deployment  [20]. The details of the architecture of FF Swin-Unet Network are shown in Algorithm 1.

Algorithm 1 The network architecture of FF Swin-Unet

Automated severity scoring module (ASSM)

Through the automatic segmentation model established in the previous section, we automatically segmented the liver and spleen to assess the severity of hepatic steatosis. The liver-spleen density ratio method evaluates the severity of hepatic steatosis by measuring the density of the liver and spleen in CT images and calculating the ratio between them [21]. A higher liver-spleen density ratio is usually associated with milder hepatic steatosis, while a lower ratio may indicate more severe disease. The specific relationship between the numerical values and severity may vary across studies and guidelines. Table 2 shows the reference ranges commonly used by radiologists in Chinese hospitals [22]:

Table 2 Reference ranges of the liver-spleen CT ratio for grading the severity of hepatic steatosis

In contrast to the manual clinical protocol, we designed a rigorous pipeline that uses the segmentation model to automatically evaluate the severity of hepatic steatosis. First, the established automatic liver and spleen segmentation model is used to predict the CT sequence, yielding the predicted mask. The SimpleITK library is then used to perform connected-component analysis on each label, retaining the largest connected component for each class. After processing, combined with the parameters of the original image, the results are saved as mask-format files. In the second step, the 3D maximum bounding range of the liver and spleen is found in the mask file, 10 cubes are cut out within this range for each organ, and the total CT value of each liver cube and spleen cube is calculated. Finally, the ratio of the total CT value of each liver cube to that of each spleen cube is calculated, yielding 100 ratios. The average of these ratios is taken as the final liver-spleen CT ratio of the CT sequence, and the severity of hepatic steatosis is determined automatically according to Table 2.
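The pipeline above can be sketched as follows with SimpleITK and NumPy. The cube edge length, label values, and grade cut-offs are placeholders (Table 2's exact ranges are not reproduced here), so they must be aligned with the clinical reference table before use.

```python
import numpy as np
import SimpleITK as sitk

LIVER, SPLEEN = 1, 2   # assumed label values in the predicted mask

def largest_component(mask: sitk.Image, label: int) -> np.ndarray:
    # Connected-component analysis; keep the largest component per class.
    cc = sitk.ConnectedComponent(sitk.BinaryThreshold(mask, label, label, 1, 0))
    cc = sitk.RelabelComponent(cc, sortByObjectSize=True)   # component 1 = largest
    return sitk.GetArrayFromImage(cc) == 1

def sample_cubes(ct, organ, n=10, size=8, rng=None):
    # Cut n cubes fully inside the organ and sum the CT values of each.
    rng = rng or np.random.default_rng(0)
    zz, yy, xx = np.nonzero(organ)
    sums = []
    for _ in range(100_000):            # guard against very small organs
        if len(sums) == n:
            break
        i = rng.integers(len(zz))
        z, y, x = zz[i], yy[i], xx[i]
        cube = organ[z:z+size, y:y+size, x:x+size]
        if cube.shape == (size, size, size) and cube.all():
            sums.append(ct[z:z+size, y:y+size, x:x+size].sum())
    return sums

def liver_spleen_ratio(ct_img: sitk.Image, mask_img: sitk.Image) -> float:
    ct = sitk.GetArrayFromImage(ct_img).astype(np.float64)
    liv = sample_cubes(ct, largest_component(mask_img, LIVER))
    spl = sample_cubes(ct, largest_component(mask_img, SPLEEN))
    ratios = [l / s for l in liv for s in spl]   # 10 x 10 = 100 ratios
    return float(np.mean(ratios))

def grade(ratio: float) -> str:
    # Placeholder cut-offs; replace with the ranges in Table 2.
    if ratio >= 1.0:
        return "normal"
    if ratio >= 0.7:
        return "mild"
    if ratio >= 0.5:
        return "moderate"
    return "severe"
```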

Results

Datasets

We evaluated our segmentation model using a private abdominal CT dataset curated through the SNM protocol. This dataset comprises 98 high-quality cases with radiologist-approved liver/spleen annotations. Three experienced radiologists performed semi-supervised label refinement during SNM processing to ensure anatomical accuracy. The data were split into training (70%), validation (20%), and test (10%) sets.

For training, we applied these augmentation operations:

  • RandomFlip3D (flip_axis= [1, 2])

  • RandomQuarterTurn3D (rotate_planes= [1, 2])

  • RandomRotation3D (degrees = 20, rotate_planes= [1, 2])

  • Resize3D (output_size=\(1\times224\times224\))

Validation/test sets underwent only Resize3D preprocessing. To ensure fair comparisons, the baseline Swin-Unet used identical data initialization parameters.

Implementation details

The FF Swin-Unet model was developed utilizing the MedicalSeg 2.8.0 framework. The architecture of MedicalSeg necessitates the formulation of distinct .yml files for each model during the network architecture phase. These files delineate a multitude of configuration parameters for the training process, encompassing the configuration of training and evaluation datasets, optimizer settings, learning rate schedules, loss function specifications, and model architecture details, among other aspects. The proposed model integrates the Swin-Tiny and Focal-Tiny architectures and undergoes fine-tuning with pretrained weights. For optimization, Stochastic Gradient Descent (SGD) is employed with a momentum coefficient of 0.9 and a weight decay of 1e-4. The training regimen is defined with a batch size of 2 and a total of 10,000 iterations. The loss function employed is a MixedLoss, which combines CrossEntropyLoss and DiceLoss.

To mitigate the risk of overfitting, transfer learning was implemented with the pretrained weights of Swin-Tiny and Focal-Tiny. All training procedures were executed on a dedicated server equipped with an Intel® Xeon® Silver 4214R CPU operating at 2.40 GHz, 256 GB of RAM, and a single NVIDIA Tesla V100 16 GB GPU.

Evaluation metrics

We performed a comprehensive quantitative evaluation using four established segmentation metrics: Dice Similarity Coefficient (DSC), 95th percentile Hausdorff Distance (HD95), Intersection over Union (IoU), and Average Symmetric Surface Distance (ASSD). These metrics were calculated as follows:

$$\text{DSC} = \frac{2 \cdot|X \cap Y|}{|X| +|Y|} = \frac{2\text{TP}}{2\text{TP} + \text{FP} + \text{FN}} $$
(8)
$$\text{IoU} = \frac{|X \cap Y|}{|X \cup Y|} = \frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}} $$
(9)
$$\text{HD95} = \max\left\{\sup\limits_{x\in X}d_{95}(x,Y), \sup\limits_{y\in Y}d_{95}(y,X)\right\} $$
(10)
$$\text{ASSD} = \frac{1}{2}\left(\frac{\sum_{x\in \partial X}d(x,\partial Y)}{|\partial X|} + \frac{\sum_{y\in \partial Y}d(y,\partial X)}{|\partial Y|}\right)$$
(11)

where \(X\) and \(Y\) denote the predicted and ground truth segmentation masks, \(\partial\) represents surface voxels, and \(d_{95}\) indicates the 95th percentile of Euclidean distances. TP/FP/FN represent true positives, false positives, and false negatives respectively.

The DSC and IoU quantify volumetric overlap accuracy, with values ranging [0,1] where 1 indicates perfect overlap. HD95 measures boundary alignment robustness by computing the 95th percentile of maximum surface distances between \(X\) and \(Y\) surfaces, reducing sensitivity to outliers. ASSD provides complementary surface distance analysis through symmetric averaging of mean surface distances.

This multi-metric approach enables comprehensive assessment of both regional overlap accuracy (DSC/IoU) and boundary precision (HD95/ASSD). All metrics were computed per-slice and averaged across test volumes, with separate evaluations for different anatomical structures.
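A compact NumPy/SciPy sketch of these four metrics for binary 3D masks is given below; distances are in voxels unless scaled by the image spacing, and the helper names are our own.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dsc_iou(pred: np.ndarray, gt: np.ndarray):
    # Eqs. (8)-(9): volumetric overlap from TP/FP/FN counts.
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return 2 * tp / (2 * tp + fp + fn), tp / (tp + fp + fn)

def _surface(mask: np.ndarray) -> np.ndarray:
    # Surface voxels: mask minus its erosion.
    return mask & ~binary_erosion(mask)

def _dists(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    # Distances from every surface voxel of a to the surface of b.
    dt = distance_transform_edt(~_surface(b))
    return dt[_surface(a)]

def hd95_assd(pred: np.ndarray, gt: np.ndarray):
    # Eqs. (10)-(11): 95th-percentile Hausdorff distance and ASSD.
    d_pg, d_gp = _dists(pred, gt), _dists(gt, pred)
    hd95 = max(np.percentile(d_pg, 95), np.percentile(d_gp, 95))
    assd = 0.5 * (d_pg.mean() + d_gp.mean())
    return hd95, assd
```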

Quantitative evaluation

The proposed FF Swin-Unet demonstrated improved training dynamics compared to the baseline Swin-Unet. Specifically, FF Swin-Unet achieved faster convergence with a 19.5% lower final training loss (0.020 vs. 0.025) and a marginally higher Dice similarity coefficient (97.22% vs. 97.12%) after 10,000 iterations. Notably, FF Swin-Unet exhibited more stable gradient descent without mid-training performance fluctuations, as illustrated in Fig. 5. These results suggest that the focal feature fusion mechanism enhances both optimization efficiency and robustness during model training.

Fig. 5 presents the performance of the different models on the validation dataset. From the graph, it is apparent that both models exhibit fluctuations in their results during the early stages of training when evaluated on the validation set. However, Swin-Unet experiences more severe fluctuations with a wider range of variation. This may be because the baseline model's architecture has insufficient data-fitting capacity, resulting in significant fluctuations in loss and performance metrics in the initial stages. In contrast, the proposed FF Swin-Unet architecture demonstrates stronger data-fitting capability. In the later stages of training, both models exhibit stabilized DSC values. Overall, in both final performance and training stability, the proposed FF Swin-Unet consistently outperforms the baseline Swin-Unet.

Fig. 6 Segmentation results of different existing methods and our method on labeled data from the test set

Fig. 7 Segmentation results of different existing methods and our method on unlabeled data

Tables 3 and 4 present the comprehensive quantitative comparison across four segmentation architectures. The proposed FF Swin-Unet achieves superior performance in both volumetric overlap (DSC/IoU) and boundary accuracy (HD95/ASSD) metrics. For spleen segmentation (class 2), our model demonstrates particularly notable improvements with 92.92% DSC and 88.76% IoU, outperforming nnU-Net by 4.46% and 5.59% respectively on these metrics. The ASSD measurements further validate the anatomical precision of our method, showing a 1.28 mm surface distance for spleen segmentation, a 46.0% reduction compared to nnU-Net's 2.37 mm. These quantitative enhancements are visually corroborated by the segmentation examples in Figs. 6 and 7, particularly in preserving splenic contours and hepatic boundary details.

Table 3 Comparison of DSC and IoU metrics for different models
Table 4 Comparison of HD95 and ASSD metrics for different models

The quantitative evaluation results demonstrate that our proposed FF Swin-Unet exhibits superior performance metrics in multi-model comparisons. As shown in Table 3, regarding DSC and IoU metrics, our model achieves improvements of 1.42% and 2.43% respectively in spleen segmentation (DSC-2: 92.92%, IoU-2: 88.76%) compared to the baseline Swin-Unet, validating the effectiveness of the FFM module in preserving features of small organs. Although V-Net and nnU-Net attain average DSC values of 90.57% and 93.30% respectively, Transformer-based architectures demonstrate significant advantages in boundary accuracy metrics. The ASSD data in Table 4 further corroborates this conclusion, showing that our model achieves surface distance metrics of 1.12 mm and 1.28 mm for liver and spleen respectively, representing 43.4% and 46.0% improvements over nnU-Net. These enhancements primarily stem from the local-global feature interaction mechanism of the Focal Transformer and the multi-scale feature fusion strategy of the FFM module.

Compared to traditional CNN architectures like V-Net and nnU-Net, the significant improvement in surface distance metrics (ASSD) indicates that the Transformer architecture better captures anatomical boundary features of organs. Specifically, the reduction of spleen ASSD from 2.37 mm in nnU-Net to 1.28 mm verifies the effectiveness of the soft pooling branch in the FFM module for preserving details of small organs. In clinical diagnostics, this sub-millimeter precision improvement is crucial for accurate calculation of liver-spleen CT value ratios, effectively preventing severity misclassification caused by partial volume effects.

The comprehensive ablation study in Table 5 reveals the progressive improvements from the individual components. The Focal Transformer contributes 0.27% DSC and 0.83% IoU gains while reducing HD95 by 1.44 mm, demonstrating its effectiveness in enhancing global context modeling through multi-scale attention. The FFM module provides complementary benefits with a 0.15% DSC improvement and a 1.22 mm HD95 reduction, validating its capacity to preserve fine anatomical details via soft pooling operations. When synergistically combined, the full configuration achieves a maximum 0.45% DSC improvement (95.19%\(\rightarrow\)95.64%) and a 22.1% ASSD reduction (1.54 mm\(\rightarrow\)1.20 mm) over the baseline, confirming the modules' orthogonal enhancements. The HD95 metric shows cumulative enhancements (−1.71 mm total) with each added component, indicating progressive boundary refinement. Notably, the ASSD improvement dominates in the final stage (an additional 0.26 mm reduction), suggesting the full model's superiority in surface accuracy, which is crucial for clinical measurements.

Table 5 Ablation study of key components in FF swin-unet

Comparative analysis (Table 6) demonstrates significant advancements across three critical dimensions. The proposed method achieves a 2.42% improvement in spleen segmentation accuracy (92.92% vs 90.5% DSC) compared to the strongest baseline, SwinUNETR, validating the effectiveness of our hierarchical attention mechanism for small-organ analysis. This is complemented by a 10.4% reduction in boundary segmentation error, evidenced by the HD95 metric decreasing from 17.8 mm to 15.94 mm, which we attribute to the boundary-sensitive design of our feature fusion module. Notably, these improvements are achieved alongside substantial efficiency gains: our system processes cases in 5 seconds, representing an 8-fold speedup over TransUNet (40 s), a 17-fold acceleration compared to nnFormer (85 s), and 3.6-times faster inference than SwinUNETR (18 s), while maintaining 90% severity classification accuracy, 2 percentage points higher than SwinUNETR's 88%.

Table 6 Comparative analysis with state-of-the-art methods

The architectural efficiency is further underscored by parameter comparisons: our method requires 13% fewer parameters than SwinUNETR (54.3 M vs 62.4 M) despite superior performance. Comprehensive multi-organ analysis reveals consistent advantages, with liver segmentation accuracy reaching 94.42% (0.72% higher than SwinUNETR) and spleen DSC outperforming all baselines by 2.02–5.62 percentage points. Boundary precision metrics show particularly striking improvements, with our HD95 values being 25.6–33.6% lower than those of conventional transformer architectures. This combination of rapid inference (5 s vs. 150 s for manual analysis), dual-organ performance superiority, and compact model size positions our framework as a clinically viable solution for automated abdominal organ screening.

Qualitative evaluation

First, we assessed the proposed method from the perspective of visual quality. We compared the proposed FF Swin-Unet with three other state-of-the-art methods in the field of image semantic segmentation, namely V-Net  [26], nnU-Net  [27], and Swin-Unet. V-Net and nnU-Net methods are both based on traditional CNN structures, while Swin-Unet serves as the baseline model. Different pre-training parameters were employed for these networks.

Fig. 6 presents the segmentation results on labeled test data using the proposed FF Swin-Unet and the three aforementioned methods. We randomly selected original labeled data with the identifier 70 from the test set and randomly extracted slices 88, 98, and 108. The segmentation results provided by these methods are presented in columns 2 to 5, and the first to third rows in Fig. 6 show the segmentation results for the three slices. Evidently, the proposed FF Swin-Unet is the most effective method for visually detecting the liver and spleen in abdominal CT images. Judging by the segmentation performance on the lower-right spleen in Fig. 6, our FF Swin-Unet exhibits superior perceptual capability for small target objects like the spleen compared to other models; the Focal Feature Module (FFM) thus compensates for local features easily lost during downsampling. The proposed network also adapts its computation to shape differences, which can be compared across the different models in the first row of Fig. 6.

For large objects to be segmented (the red region on the left in the image represents the liver), FF Swin-Unet also performs well. This is attributed to our Focal Transformer architecture, which achieves local-global information interaction through fine-grained feature attention, preserving the boundary regions of objects. It is important to emphasize that for the automated severity assessment of fatty liver, mastery over boundary regions, especially in the segmentation of small targets like the spleen, is crucial. In this regard, FF Swin-Unet has demonstrated significant expertise.

Next, we randomly selected test data with the identifier 93 for liver and spleen segmentation. This dataset lacks labels, and we randomly extracted slices 163, 183, and 205. Fig. 7 presents the segmentation results on unlabeled test data obtained using the proposed FF Swin-Unet and the three aforementioned methods. By comparing it with other methods, it is evident that this approach can identify the liver and spleen regions from abdominal CT images. Similar to the segmentation performance on the labeled dataset, the segmentation predictions indicate that for dense and small-scale targets, our FF Swin-Unet outperforms the baseline framework and other models in abdominal CT image segmentation. The segmentation results on unlabeled data have also been validated by radiology experts.

Severity analysis results

To evaluate our proposed automated assessment method for fatty liver grade, we randomly selected 10 original CT scan slices and invited radiology experts to manually measure the liver-spleen ratio. The measurement results were then compared with the automatic analysis results obtained using the FF Swin-Unet model presented in this paper. The manual measurement method by radiologists strictly adhered to clinical protocols, and to minimize errors, the average of three measurements was taken. The liver-spleen ratio was ultimately mapped to the corresponding fatty liver grade within specified ranges. Table 7 presents the results of manual and automated measurements.

Table 7 Comparison between the method proposed in this paper and expert manual measurement of the grade of fatty liver cases

According to the analysis in Table 7, the automatic analysis results differ only slightly from the averaged manual measurements of the liver-spleen ratio, particularly near the boundary value of 1, and the two are largely consistent. However, the difference between the manual measurement for case number 5 (0.662) and the automatic measurement (0.780) is large, which may lead to an error in fatty liver grading, possibly due to segmentation boundary error. The last row of the table shows that the overall accuracy of the automatic measurement method is 90%, which supports the effectiveness of our proposed method.

Moreover, the Mann-Whitney U test was employed to assess whether there were differences between manual and automated measurements. Based on histogram analysis, the distributions of the two groups of measurements were essentially similar in shape. The average of the manual measurements was 0.859, with a median of 0.906, while the average of the automated measurements was 0.855, with a median of 0.845. The Mann-Whitney U test indicates no significant difference between manual and automated measurements (U = 47.000, p = 0.853). The results are summarized in Table 8.

Table 8 Conclusions of the statistical analysis of the automated severity scoring module (ASSM) results
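For reproducibility, the test reduces to a single SciPy call, sketched below; `manual` and `automated` stand for the ten liver-spleen ratios of Table 7, which are not reproduced here.

```python
from scipy.stats import mannwhitneyu

def compare_measurements(manual, automated):
    # Two-sided Mann-Whitney U test on the two measurement groups
    # (manual: radiologists' averaged ratios; automated: ASSM outputs).
    u_stat, p_value = mannwhitneyu(manual, automated, alternative="two-sided")
    return u_stat, p_value   # reported in Table 8: U = 47.000, p = 0.853
```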

The implications of these results are substantial. The lack of significant difference between manual and automated measurements underscores the high level of accuracy and reliability of the automated system. This suggests that the automated measurements can be confidently used as a surrogate for manual measurements, which is particularly important in clinical settings where time and resources are at a premium. The equivalence in performance also implies that the automated system has the potential to streamline workflow, reduce operator variability, and increase the efficiency of data analysis without compromising on the quality of the results. Furthermore, these results bolster the validity of our model for potential integration into clinical practice, where the consistency and replicability of measurements are critical for patient care and outcome assessment.

Discussion

The development of big data technology and the application of artificial intelligence (AI) algorithms have provided new methods and possibilities for better understanding NAFLD [28]. The advantage of AI over traditional statistical modeling is that it can identify unique patterns and combine multiple factors to create predictive models, risk stratification, and outcome estimates. It is particularly suitable for chronic diseases because such diseases exhibit heterogeneity, complexity, and overlapping risk factors [29]. We conducted a retrospective study to validate the independent performance of a new artificial intelligence application for liver and spleen segmentation and the automatic grading of NAFLD severity on abdominal CT images. The data were sourced from the Medical Segmentation Decathlon (MSD), and a high-quality liver and spleen annotation dataset was constructed using the semi-automatization nnU-Net module (SNM). This dataset selection and construction method is crucial for training deep learning models and evaluating their performance. By using a well-annotated dataset, researchers can enhance the accuracy and generalization capability of the model and ensure the reliability of the results.

This paper introduces the Focal Feature Fusion Swin-Unet module (FSM), an NAFLD segmentation detection model aimed at improving the segmentation of the liver and spleen. Quantitative evaluation results indicate that, compared to the original Swin-Unet architecture, our proposed FF Swin-Unet achieves better segmentation performance across the evaluation metrics. Specifically, the average DSC is 95.64%, and the average HD95 is 15.94 mm. Compared to the baseline Swin-Unet method, FF Swin-Unet demonstrates improvements of 0.45% in average DSC and 1.71 mm in average HD95. Compared to existing segmentation methods, the proposed FF Swin-Unet excels in terms of both DSC and HD95. This improvement can be attributed to the design of the main Focal Transformer module and the introduction of the downsampling FFM module. These features help the model adapt to cases with significant shape differences and blurred boundaries in the liver and spleen, areas where other models typically fall short.

This paper introduces the automated severity scoring module (ASSM) method to combine with the segmentation model for NAFLD severity grading. By extracting cubes of a certain volume from the liver and spleen and calculating the ratio of CT values, automatic identification and severity assessment of NAFLD can be achieved. Research results demonstrate an accuracy rate of 90% on the test set, highlighting the potential clinical utility of this method. It can alleviate the burden on radiologists and provide patients with more accurate and rapid diagnostic results. Automated NAFLD severity assessment methods can help doctors gain a better understanding of a patient’s condition and guide treatment decisions effectively.

Through a literature review, we found few other similar studies. In the limited relevant studies, other deep learning-based solutions were found to have certain limitations compared to the method in this article. Graffy et al. [12] developed a deep learning-based automatic liver fat quantification tool to determine the prevalence of steatosis in a large screening cohort using non-contrast CT. Using a three-dimensional convolutional neural network, including a sub-cohort with follow-up scans, volume-based automatic liver attenuation analysis was performed, including conversion to CT fat scores, and compared with manual measurements across a large number of scans. The results showed that this CT-based fully automatic liver fat quantification tool can be used for population-based assessment of liver steatosis and NAFLD, with the objective data agreeing well with manual measurements. However, the methodological and algorithmic descriptions were vague and could not readily be reproduced by subsequent researchers. Huo et al. [30] proposed the automatic liver attenuation ROI-based measurement (ALARM) method, which consists of two main stages: (a) liver segmentation based on deep convolutional neural networks (DCNN) and (b) automatic ROI extraction. DCNN combined with morphological operations achieved “excellent” consistency with manual estimation for detecting fatty liver. The entire pipeline is implemented as a Docker container, but it takes 5 minutes to complete the liver attenuation evaluation, which is time-consuming. As a radiological tool for quantifying liver fibrosis, texture-analysis parameters derived from non-contrast-enhanced T1-weighted magnetic resonance images combined with ML are similarly accurate to magnetic resonance elastography, but reach only 82% [11].

The proposed dual-mode deployment framework demonstrates substantial computational efficiency improvements for clinical implementation. Through synergistic optimizations including magnitude-based pruning (18% head removal) and 8-bit quantization (achieving 4.2\(\times\) model compression), the system achieves 2.1 s/case inference latency on NVIDIA T4 GPUs (95.1% DSC) and 9.8 s/case on Xeon CPUs (94.7% DSC), corresponding to a 51–71\(\times\) acceleration compared to manual measurement (1.8–2.5 min/case). The adaptive memory footprint, scaling from 2.1 GB (CPU) to 18 GB (GPU), supports flexible deployment across varied hospital infrastructures. Clinical validation confirms reliable performance in real-world NAFLD screening scenarios, including resource-constrained rural clinics without dedicated GPUs. This automated system enhances NAFLD diagnosis by significantly reducing the time required for segmentation and severity assessment. The rapid generation of accurate severity scores supports timely clinical decision-making and personalized treatment planning, potentially improving patient management. The integration of AI into clinical workflows optimizes radiological resource utilization while maintaining diagnostic quality, which is particularly valuable for standardizing screening in diverse healthcare settings.
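As a rough illustration of such deployment-time optimization, the sketch below applies stock PyTorch pruning and dynamic quantization. Note that the paper prunes attention heads, whereas this simplified variant prunes individual weights at the same 18% ratio; the exact pipeline used in the study is not specified.

```python
import torch
import torch.nn.utils.prune as prune

def compress(model: torch.nn.Module) -> torch.nn.Module:
    # Magnitude-based pruning: zero out the smallest 18% of weights per
    # linear layer (a simplification of the paper's 18% head removal).
    for module in model.modules():
        if isinstance(module, torch.nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=0.18)
            prune.remove(module, "weight")   # make the pruning permanent
    # 8-bit dynamic quantization for CPU inference.
    return torch.quantization.quantize_dynamic(
        model, {torch.nn.Linear}, dtype=torch.qint8)
```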

Although this study has yielded promising outcomes, it is not without limitations that warrant further investigation. First, the dataset was derived from the MSD dataset, potentially introducing sample selection bias. Future studies should employ multi-center datasets with expanded sample sizes to validate the model’s robustness and generalizability. Second, institutional biases persist in medical imaging data, including residual protocol variations (e.g., 5–10 HU differences due to contrast timing discrepancies) despite preprocessing, reflecting real-world clinical heterogeneity. Third, the single-annotator labeling protocol (where each dataset was annotated by a single radiologist), while mirroring actual clinical workflows, may introduce labeling noise. Furthermore, although high accuracy was achieved in lesion segmentation and severity assessment, there remains room for error reduction. Subsequent work should refine training strategies, expand training datasets, and integrate additional imaging features with clinical data to improve diagnostic precision. Notably, while the automated method alleviates radiologists’ workload, physicians’ professional judgment and clinical validation remain irreplaceable. Deep learning models should be positioned as decision-support tools rather than replacements for clinical expertise. To ensure continuous improvement, explicit clinical feedback mechanisms must be established, enabling iterative model optimization based on real-world deployment experiences.

The objective of this research is to engineer a proficient and intuitive automated system for medical professionals. The system interface is designed to facilitate the input of a patient's abdominal CT scans. Subsequently, an automated workflow is initiated, which encompasses programs for the autonomous 3D segmentation of the liver and spleen. This process culminates in the assessment of the non-alcoholic fatty liver disease (NAFLD) severity score through multiple region-of-interest calculations. This study introduces an integrated Python-API-based platform, the NAFLD Severity Scoring System, tailored for abdominal CT imaging data. Upon uploading the patient's abdominal CT scans, the system executes the following automated processes: it employs the FF Swin-Unet algorithm for image enhancement, performs 3D segmentation of the liver and spleen, and computes the NAFLD severity score in accordance with radiological standards. The system not only automatically localizes the liver and spleen within the CT images but also aids clinicians in evaluating NAFLD severity by integrating the radiological NAFLD severity scoring criteria with the analysis of 3D regional CT value ratios. The system boasts high efficiency and classification accuracy, and maintains a robust level of precision, offering significant utility to medical professionals.

Conclusions

Utilizing abdominal CT image data, complemented by NAFLD-specific radiological characteristics, we have developed an innovative technique for the segmentation of NAFLD and the automated assessment of disease severity. This method stands on the shoulders of an enhanced Swin-Unet architecture, incorporating a semi-automated dataset construction module (SNM), a Focal Feature Fusion Swin-Unet module (FSM) tailored for precise liver and spleen segmentation, and an Automated Severity Scoring Module (ASSM) for NAFLD grading. Our experimental outcomes demonstrate that this approach outperforms existing state-of-the-art methods in both the segmentation of liver and spleen and the automation of NAFLD severity scoring.

The implications of our study are significant, offering a robust tool for clinicians to not only visualize NAFLD but also to quantitatively assess its severity, which is pivotal for treatment planning and monitoring. Looking ahead, the expansion of our dataset with more labeled images will undoubtedly refine the model’s performance. Moreover, the integration of diverse medical imaging modalities has the potential to enhance diagnostic precision and reliability. To enhance model interpretability, the provision of segmented three-dimensional images will empower practitioners with a clearer understanding of the model’s decision-making process. For severity evaluation, while our current approach is based on global liver properties, future research may explore more granular methodologies to refine the assessment’s accuracy and dependability.