1 Department of Computer Science, Illinois Institute of Technology, USA
2 School of Computer Science, Peking University, China
3 Department of Cell and Developmental Biology, University of Michigan, USA
4 Department of Computer Science, University of Illinois Chicago, USA

Self-Prompt SAM: Medical Image Segmentation via Automatic Prompt SAM Adaptation

Bin Xie 1, Hao Tang 2, Dawen Cai 3, Yan Yan 4, Gady Agam 1
Abstract

The Segment Anything Model (SAM) has demonstrated impressive zero-shot performance and brought a range of unexplored capabilities to natural image segmentation tasks. However, its performance on medical image segmentation, an important branch of image segmentation, remains uncertain because of the significant differences between natural and medical images. Moreover, SAM requires extra prompts, such as points or boxes that specify the target regions, and these are difficult to provide for medical images. In this paper, we propose a novel self-prompt SAM adaptation framework for medical image segmentation, named Self-Prompt-SAM. We design a multi-scale prompt generator that is combined with the image encoder in SAM to generate auxiliary masks. We then derive bounding boxes from the auxiliary masks as box prompts and use the distance transform to select the most central points as point prompts. In addition, we design a 3D depth-fused adapter (DFusedAdapter) and inject it into each transformer in the image encoder and mask decoder, enabling pre-trained 2D SAM models to extract 3D information and adapt to 3D medical images. Extensive experiments demonstrate that our method achieves state-of-the-art performance and outperforms nnUNet by 2.3% on AMOS2022 [19], 1.6% on ACDC [2], and 0.5% on Synapse [21].

Keywords: Medical Image Segmentation · Automatic Prompt · SAM

1 Introduction

The purpose of medical image segmentation is to delineate specific anatomical structures in medical images, including organs, lesions, and tissues, which can aid many clinical applications. Deep learning methods [27, 1, 6, 35, 16, 30] have made remarkable progress in medical image segmentation in the past few years. However, existing deep learning models are often tailored to specific tasks; they carry a strong inductive bias, which limits their capacity.

The rise of foundation models [3, 25] trained on large and diverse datasets has revolutionized artificial intelligence. Benefiting from their remarkable zero-shot and few-shot generalization abilities, approaches that adapt a pre-trained model to a wide range of downstream tasks [13, 26, 15, 33] have achieved remarkable progress, in contrast to the traditional practice of training task-specific models from scratch. Recently, SAM [20], pre-trained on over 1 billion masks from 11 million natural images, was proposed as a visual foundation model for prompt-driven image segmentation and has gained huge attention due to its impressive zero-shot performance. Given its strong capabilities in natural image segmentation, can SAM maintain strong performance when applied to medical image segmentation, despite the significant differences between natural and medical images?

Figure 1: (a) Predictions for different combinations of points, boxes, and masks generated from the ground truth. (b) Experiments with different methods for selecting points. (c) The intensity distribution and prediction of the output of SAM. (d) Discontinuity along the depth dimension.

Applying SAM directly is infeasible. We identify four intrinsic issues that arise when SAM is applied directly to medical image segmentation. i) SAM needs extra prompts to segment specific regions, and it is unrealistic to expect all users to have the medical knowledge needed to place points in, or draw boxes around, specific regions. ii) SAM cannot predict semantic information for its binary masks: it predicts one binary mask for each prompt without any semantic label, whereas medical images usually contain multiple labels, each carrying semantic information. iii) It is unclear which combination of prompts achieves the best performance when proper prompts are provided. SAM requires mandatory prompts, which can be points, boxes, or both, and accepts masks as an optional prompt. iv) Applying SAM to medical image segmentation without any modification or fine-tuning does not always yield good performance, even when proper prompts are provided. Many works [7, 14, 36, 24] have demonstrated that SAM is imperfect or even fails in situations such as weak boundaries, low contrast, and irregular shapes, which is consistent with other investigations [17, 18]. Figure 1(a) illustrates the results obtained with different prompts. Even when the prompts are generated from the ground truth, the results are poor, especially with point prompts.

To remove the need for extra prompts, we propose a multi-scale prompt generator (MSPGenerator) combined with the image encoder, which uses a Vision Transformer (ViT) [8] pre-trained with a masked autoencoder [12] as its backbone. The MSPGenerator produces auxiliary multi-class masks, from which we derive bounding boxes as box prompts and points as point prompts in place of manual prompts. To attach semantic information to the predicted binary masks, we encode the auxiliary multi-class masks as one-hot binary masks, one per class. The index of each binary mask along the channel dimension then represents its semantic label.
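The one-hot encoding described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation; the function name `one_hot_masks` and the toy label map are hypothetical.

```python
def one_hot_masks(label_map, num_classes):
    """Split an H x W integer label map into `num_classes` binary masks."""
    return [
        [[1 if value == c else 0 for value in row] for row in label_map]
        for c in range(num_classes)
    ]


# Toy 3 x 4 auxiliary mask: 0 = background, 1 and 2 = two hypothetical organs.
aux_mask = [
    [0, 1, 1, 0],
    [0, 1, 2, 2],
    [0, 0, 2, 2],
]
binary_masks = one_hot_masks(aux_mask, num_classes=3)

# The channel index doubles as the semantic label of each binary mask.
for label, mask in enumerate(binary_masks):
    print(label, mask)
```

Because every pixel carries exactly one label, the binary masks partition the image, so a mask predicted from the prompt of channel `c` can be assigned class `c` without a separate classifier.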

After acquiring the prompts, we investigate the optimal prompting strategy for medical image segmentation. As shown in Figure 1(a), the most robust strategy is a bounding box combined with a well-chosen point, namely the point farthest from the boundary of each object, i.e., the point that is as central as possible. Therefore, we adopt the Euclidean distance transform to compute the distance of each pixel from the boundaries and select the candidate points based on the auxiliary masks.

The final part is to explore how to adapt the original SAM from 2D natural image segmentation to 3D medical image segmentation. Directly applying SAM to medical image segmentation does not always yield good performance, even when proper prompts are provided, so fine-tuning SAM for medical image segmentation is the main direction. However, fully fine-tuning a model as large as SAM consumes huge computational resources. This problem can be addressed by parameter-efficient fine-tuning [13, 15, 33], which lightly modifies a large model and freezes the rest of its structure, for example by inserting trainable lightweight adapters [13], whose feasibility on SAM has been shown in [31, 23, 29, 22, 9]. How to modify the structure appropriately thus becomes the most important question. To make full use of the capabilities of SAM, our criteria are to keep all of its structures, freeze all of its weights, and only add blocks for adaptation. In this way, we retain the zero-shot capabilities of SAM while adapting it to medical image segmentation. To this end, we design several additional blocks.

Figure 2: The overview architecture of the proposed Self-Prompt-SAM.

In summary, our contributions are as follows: (i) We propose a novel self-prompt SAM (Self-Prompt-SAM) framework for medical image segmentation. To the best of our knowledge, Self-Prompt-SAM is the first SAM-based image segmentation framework that requires no user-provided prompts; (ii) We propose a novel multi-scale hierarchical prompt generator (MSPGenerator) that uses multiple levels of feature maps from the image encoder to generate auxiliary masks for the prompts. Through extensive experiments, we found that the best prompting strategy combines bounding boxes, points (with candidate points generated by the Euclidean distance transform), and masks; (iii) We design a depth-fused adapter (DFusedAdapter) that enables pre-trained 2D SAM models to extract 3D information; (iv) We conduct extensive experiments on three challenging datasets, AMOS2022 [19], ACDC [2], and Synapse [21]. The results demonstrate that Self-Prompt-SAM achieves state-of-the-art performance.

2 Methodology

Figure 2 illustrates our Self-Prompt-SAM, including a modified image encoder, a designed MSPGenerator, the prompt encoder, and a modified mask decoder.

2.1 Proposed DFusedAdapter

Since most medical images have an extra depth dimension compared to 2D natural images, applying SAM directly to a sequence of 2D slices inevitably loses 3D spatial information and causes spatial discontinuity along the depth dimension, as shown in Figure 1(d): when SAM is used without any handling of depth information, there is a discontinuity between the 3rd and 4th frames. To address this, we design the DFusedAdapter shown in Figure 2(b). It explores depth information by adding, in the middle of the original adapter, an inverted-bottleneck structure with a skip connection that consists of two FC layers operating along the depth dimension with an activation layer between them. Inspired by AIM [33], we introduce an adapter after the multi-head self-attention (MSA) and in parallel to the MLP layer in each transformer block. Thus, our model can learn the extra depth information.
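To make the data flow concrete, the following NumPy sketch shows one possible reading of this description: a bottleneck adapter (down-projection, activation, up-projection, residual) with a depth-wise inverted bottleneck placed in its middle. The class name, the GELU activation, the reduction ratio, the depth expansion factor, and the zero initialization of the up-projection are all assumptions for illustration; the text above does not specify them.

```python
import numpy as np


def gelu(x):
    # tanh approximation of GELU (the choice of activation is an assumption)
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class DFusedAdapterSketch:
    """Hypothetical depth-fused adapter acting on tokens of shape (D, N, C)."""

    def __init__(self, dim, depth, ratio=0.25, depth_expand=2, seed=0):
        rng = np.random.default_rng(seed)
        hidden = max(1, int(dim * ratio))
        self.w_down = rng.normal(0.0, 0.02, (dim, hidden))   # channel down-projection
        self.w_d1 = rng.normal(0.0, 0.02, (depth, depth * depth_expand))  # depth FC 1 (expand)
        self.w_d2 = rng.normal(0.0, 0.02, (depth * depth_expand, depth))  # depth FC 2 (reduce)
        self.w_up = np.zeros((hidden, dim))  # zero init: the adapter starts as the identity

    def __call__(self, x):
        h = gelu(x @ self.w_down)               # (D, N, hidden)
        hd = np.moveaxis(h, 0, -1)              # (N, hidden, D): depth becomes the last axis
        hd = gelu(hd @ self.w_d1) @ self.w_d2   # two FC layers along depth, activation between
        h = h + np.moveaxis(hd, -1, 0)          # skip connection around the depth branch
        return x + h @ self.w_up                # up-projection plus the adapter residual


tokens = np.random.default_rng(1).normal(size=(4, 6, 8))  # D=4 slices, N=6 tokens, C=8
adapter = DFusedAdapterSketch(dim=8, depth=4)
print(adapter(tokens).shape)
```

The depth branch is the only place where tokens from different slices interact, which is what lets a frozen 2D backbone see information from neighbouring slices.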

2.2 Modified Image Encoder

The modified image encoder is illustrated in Figure 2(a). First, the input images, which may come from varied modalities, are adapted and reshaped to 3 channels by our MAdapter. The adapted images are then fed to the image encoder, which consists of a patch embedding block, a positional embedding block to which we add a learnable depth positional embedding to learn extra depth information, and a series of transformer blocks. In each transformer block, we insert our DFusedAdapter after the MSA and in parallel to the MLP layer to adapt SAM to medical image segmentation and learn extra depth information. Meanwhile, we extract several feature maps for the MSPGenerator.

Proposed MAdapter. SAM works on natural images with 3 RGB channels, while medical images come in varied modalities. To map varied modalities to RGB channels, we design an inverted-bottleneck architecture built from a sequence of convolutional layers, named MAdapter, and place it at the very beginning of the image encoder, where it learns the adaptation during fine-tuning.

2.3 Proposed MSPGenerator

Figure 2(c) illustrates our proposed MSPGenerator, which removes the need for externally provided prompts. We extract 5 feature maps from different levels of the image encoder and feed them all to the MSPGenerator to generate auxiliary multi-class masks. The MSPGenerator is a hierarchical structure built from convolutional and transposed convolutional layers. Starting from the deepest feature map, the features are progressively upsampled by a factor of 2 and concatenated with the shallower feature maps. Finally, we obtain auxiliary masks of the same size as the final segmentation. To alleviate vanishing gradients and speed up convergence, we apply deep supervision in the MSPGenerator by adding supervision losses at different levels. We then use the auxiliary masks to generate a point, a bounding box, and a mask for each class.
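As an illustration of the last step, a box prompt can be read off a per-class binary mask as the tightest axis-aligned rectangle around its foreground. The sketch below is plain Python with a hypothetical helper name; the (x_min, y_min, x_max, y_max) corner order is an assumption, and the all-zero box for an empty mask follows the convention stated in Appendix A.

```python
def mask_to_box(mask):
    """Tightest box (x_min, y_min, x_max, y_max) around the foreground of a binary mask.

    An empty mask yields an all-zero box, following the zero-prompt
    convention that Appendix A describes for classes without foreground.
    """
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, value in enumerate(row) if value]
    if not ys:
        return (0, 0, 0, 0)
    return (min(xs), min(ys), max(xs), max(ys))


organ = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
empty = [[0, 0, 0, 0] for _ in range(3)]

print(mask_to_box(organ))  # (1, 1, 2, 2)
print(mask_to_box(empty))  # (0, 0, 0, 0)
```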

2.4 Proposed Prompt Way

Once the auxiliary masks can produce points and bounding boxes, we must decide which combination of points, boxes, and masks segments medical images best. We therefore conduct experiments that use the ground truth to generate points, bounding boxes, and masks for each class as candidate prompts, and look for the best combination, as shown in Figure 1(a). The experiments demonstrate that point prompts, with or without masks, fail: SAM segments almost the entire chest as the liver class (the purple region). The reason for the failure is that the liver region in the raw images has weak boundaries and resembles other regions. When bounding boxes are involved, each class is located in the correct region, although errors remain. The best performance is achieved by the combination of points, bounding boxes, and masks, so we choose this combination as our model’s prompt. However, the selection of points can make an enormous difference in performance. The criterion is that the selected point should be as representative of the object as possible and lie inside the mask, which means that it should be as central as possible in the mask. In other words, the point farthest from the boundaries should be selected as the point prompt. Figure 1(b) shows the results of different points with or without bounding boxes. Randomly selecting points inside the masks leads to enormous differences in performance. Selecting the central point of the bounding box is not the best choice either, and this point does not always lie on the mask, as with the Myo class (green region) in Figure 1(d). When we use the Euclidean distance transform to compute the distance from the boundaries for each pixel and take the candidate point farthest from the boundaries as the point prompt, the performance is the best, as shown at the bottom right of Figure 1(b).
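The point-selection rule can be illustrated on a toy L-shaped mask whose bounding-box centre falls outside the object. The sketch below uses a brute-force Euclidean distance transform in plain Python as a stand-in for a library routine such as `scipy.ndimage.distance_transform_edt`; the function names, the (x, y) point order, and the (0, 0) fallback for an empty mask are assumptions.

```python
import math


def distance_to_background(mask):
    """Brute-force Euclidean distance from each foreground pixel to the nearest
    background pixel. Pixels outside the image are not treated as background."""
    h, w = len(mask), len(mask[0])
    background = [(y, x) for y in range(h) for x in range(w) if not mask[y][x]]
    dist = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x] and background:
                dist[y][x] = min(math.hypot(y - by, x - bx) for by, bx in background)
    return dist


def most_central_point(mask):
    """Return (x, y) of the foreground pixel farthest from the boundary.
    Falls back to (0, 0) when no pixel has a positive distance."""
    best_point, best_dist = (0, 0), 0.0
    for y, row in enumerate(distance_to_background(mask)):
        for x, d in enumerate(row):
            if d > best_dist:
                best_point, best_dist = (x, y), d
    return best_point


# L-shaped object: the bounding box spans x, y in [1, 6], so its centre (3, 3)
# lies on background, while the distance transform picks a pixel inside the object.
l_shape = [
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
print(most_central_point(l_shape))  # (2, 5)
```

For full-size volumes a linear-time distance transform from an image-processing library would be the practical choice; the quadratic loop here is only meant to make the rule explicit.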

2.5 Modified Mask Decoder

Figure 2(d) illustrates our modified mask decoder. Since the mask decoder uses an image positional embedding, we add a learnable depth positional embedding to it to learn extra depth information. The mask decoder consists of two transformers, each of which consists of two cross-attention blocks, into which we also insert the DFusedAdapter for adaptation and for learning extra depth information. The decoder then outputs a binary segmentation mask for each class. To handle multi-class segmentation properly, we add our proposed M(ulti)C(lasses)-Adapter, an inverted-bottleneck structure containing several convolutional layers followed by a softmax function, to adapt binary segmentation to multi-class segmentation.

Proposed MC-Adapter. The original SAM segments all possible objects as binary masks and does not assign a class to each object, so some pixels may be assigned to more than one class or to no class at all. We instead expect SAM to assign each pixel to exactly one class given raw images as input, as in traditional medical image segmentation. After obtaining the binary masks for all classes, we observe that the binary mask for each class has a different distribution from the other outputs, since each output is generated from a specific prompt and trained with a sigmoid function. Consequently, directly applying a softmax function across all outputs gives very poor results, as shown in Figure 1(c), where we show two outputs of SAM, for the background class and the liver class. When each output is considered individually, the area in the red box is not assigned to either class, since the intensities of all its pixels are smaller than 0, in the range [-2, -1) for the background class and [-1, 0) for the liver class. However, when a softmax function is applied across all outputs, the area in the red box is assigned to the liver class, because the output intensity for the liver class is the largest even though it is below 0. Therefore, to reconcile these differences and assign each pixel to exactly one class, we design an inverted-bottleneck architecture consisting of two convolutional layers that adapts binary segmentation to multi-class segmentation, named MC-Adapter.
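The mismatch between per-mask sigmoid decisions and a joint softmax can be reproduced with two numbers. In the sketch below, the logits -1.5 and -0.5 are illustrative values picked from the ranges [-2, -1) and [-1, 0) quoted above; they are not measurements from Figure 1(c).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative logits for one pixel inside the red box (assumed values).
logits = {"background": -1.5, "liver": -0.5}

# Each binary output on its own: a pixel is foreground only if sigmoid(z) > 0.5, i.e. z > 0.
per_mask = {name: sigmoid(z) > 0.5 for name, z in logits.items()}

# Joint softmax over the same logits: the largest logit wins, even though it is negative.
probs = dict(zip(logits, softmax(list(logits.values()))))
winner = max(probs, key=probs.get)

print(per_mask)  # {'background': False, 'liver': False}
print(winner)    # liver
```

Neither binary output claims the pixel, yet the softmax assigns it to the liver with probability e / (1 + e), roughly 0.73, which is the failure mode that motivates a learned adapter between the binary outputs and the final multi-class prediction.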

3 Experiments

Method                  Spleen  R.Kd    L.Kd    GB      Eso.    Liver   Stom.   Aorta   IVC     Panc.   RAG     LAG     Duo.    Blad.   Pros.   Average
TransBTS [28]           0.885   0.931   0.916   0.817   0.744   0.969   0.837   0.914   0.855   0.724   0.630   0.566   0.704   0.741   0.650   0.792
UNETR [11]              0.926   0.936   0.918   0.785   0.702   0.969   0.788   0.893   0.828   0.732   0.717   0.554   0.658   0.683   0.722   0.762
nnFormer [35]           0.935   0.904   0.887   0.836   0.712   0.964   0.798   0.901   0.821   0.734   0.665   0.587   0.641   0.744   0.714   0.790
SwinUNETR [10]          0.959   0.960   0.949   0.894   0.827   0.979   0.899   0.944   0.899   0.828   0.791   0.745   0.817   0.875   0.841   0.880
nnUNet [16]             0.965   0.959   0.951   0.889   0.820   0.980   0.890   0.948   0.901   0.821   0.785   0.739   0.806   0.869   0.839   0.878
Self-Prompt-SAM (Ours)  0.961   0.969   0.965   0.856   0.871   0.978   0.938   0.959   0.918   0.882   0.790   0.809   0.847   0.916   0.849   0.901
Table 1: Comparison of results on the AMOS22 test set from the leaderboard.
Method                  DSC ↑          Aorta ↑        Gallbladder ↑  Kidney (L) ↑   Kidney (R) ↑   Liver ↑        Pancreas ↑     Spleen ↑       Stomach ↑
TransUNet [6]           77.48          87.23          63.16          81.87          77.02          94.08          55.86          85.08          75.62
SwinUNet [5]            79.13          85.47          66.53          83.28          79.61          94.29          56.58          90.66          76.60
UNETR [11]              79.56          89.99          60.56          85.66          84.80          94.46          59.25          87.81          73.99
nnUNet [16]             86.21          92.39          71.71          86.07          91.46          95.84          82.92          90.31          79.01
nnFormer [35]           86.57          92.40          70.17          86.57          86.25          96.84          83.35          90.51          86.83
SAMed [34]              81.88          87.77          69.11          80.45          79.95          94.80          72.17          88.72          82.06
SAMed_s [34]            77.78          83.62          57.11          79.63          78.92          93.98          65.66          85.81          77.49
SAM3D [4]               79.56          89.57          49.81          86.31          85.64          95.42          69.32          84.29          76.11
Self-Prompt-SAM (Ours)  86.74          91.99          69.95          85.65          85.40          97.39          79.18          94.38          89.94
Table 2: Quantitative results on the Synapse dataset (DSC in %).

Datasets and Evaluation Metrics. We use three publicly available datasets in our experiments: AMOS22 Abdominal CT Organ Segmentation [19], Synapse multi-organ segmentation (Synapse) [21], and the Automatic Cardiac Diagnosis Challenge (ACDC) [2]. (i) The AMOS22 dataset consists of 200 cases of abdominal CT scans with 16 anatomies manually annotated for abdominal multi-organ segmentation, and we evaluate our model on 200 test images using the AMOS22 leaderboard. (ii) The Synapse dataset consists of 30 cases of abdominal CT scans. Following the split strategy of [6], we use a random split of 18 training cases and 12 validation cases. We evaluate model performance by the average Dice score (DSC) on 8 abdominal organs. (iii) The ACDC dataset consists of 100 patients with labels for the right ventricle (RV), myocardium (MYO), and left ventricle (LV). We use a random split of 70 training cases, 10 validation cases, and 20 testing cases. We evaluate performance by the average DSC.
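For reference, the per-class Dice score is DSC = 2|P ∩ G| / (|P| + |G|), where P and G are the predicted and ground-truth pixel sets of a class, and the reported figure is the mean over classes. The sketch below computes it on toy 2D label maps in plain Python; the helper names are hypothetical, and returning 1.0 when a class is absent from both maps is an assumed convention that evaluation toolkits handle differently.

```python
def dice(pred, target, label):
    """Dice score of one class between two integer label maps of equal shape."""
    intersection = pred_count = target_count = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            pred_count += p == label
            target_count += t == label
            intersection += p == label and t == label
    denominator = pred_count + target_count
    # Assumed convention: a class absent from both maps counts as a perfect match.
    return 1.0 if denominator == 0 else 2.0 * intersection / denominator


def mean_dice(pred, target, labels):
    return sum(dice(pred, target, label) for label in labels) / len(labels)


prediction = [[0, 1, 1], [0, 2, 2]]
reference = [[0, 1, 0], [0, 2, 2]]

print(dice(prediction, reference, 1))            # 2 * 1 / (2 + 1)
print(mean_dice(prediction, reference, [1, 2]))  # mean over the foreground classes
```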

3.1 Comparison with State-of-the-Art Methods

Results on AMOS22. We compare Self-Prompt-SAM with methods that are widely used and well recognized in the community, including a convolution-based method (nnUNet [16]) and transformer-based methods (UNETR [11], SwinUNETR [10], and nnFormer [35]). For a fair comparison, all results are based on 5-fold cross-validation without any ensembles. Table 1 shows that Self-Prompt-SAM outperforms all existing methods on most organs, achieving a new SOTA performance in DSC. Specifically, it surpasses nnUNet and SwinUNETR by 2.3% and 2.1% in DSC, respectively, confirming the efficacy of our method.

Results on Synapse. We compare Self-Prompt-SAM with several leading SAM-based methods (i.e., SAMed [34] and SAM3D [4]), a convolution-based method (i.e., nnUNet [16]), and a transformer-based method (i.e., nnFormer [35]). Table 2 shows that Self-Prompt-SAM achieves the highest average DSC among all compared methods, a new SOTA performance. Specifically, according to Table 2, our model surpasses SAMed, nnUNet, and nnFormer by 4.9%, 0.5%, and 0.2% in DSC, respectively. Meanwhile, our model predicts the large organs ‘Liver’, ‘Spleen’, and ‘Stomach’ well, because our proposed DFusedAdapter can learn more 3D spatial information and adapt 2D SAM to medical image segmentation. Fig. 3 shows that Self-Prompt-SAM predicts the ‘Liver’, ‘Spleen’, and ‘Stomach’ labels more accurately, demonstrating the effectiveness of our method.

Figure 3: Qualitative comparison on the Synapse and ACDC datasets.
Method                  Average ↑   RV ↑        Myo ↑       LV ↑
R50-U-Net [27]          87.55       87.10       80.63       94.92
VIT-CUP [8]             81.45       81.46       70.71       92.18
R50-VIT-CUP [8]         87.57       86.07       81.88       94.75
UNETR [11]              88.61       85.29       86.52       94.02
TransUNet [6]           89.71       88.86       84.54       95.73
SwinUNet [5]            90.00       88.55       85.62       95.83
LeViT-UNet-384s [32]    90.32       89.55       87.64       93.76
nnUNet [16]             91.61       90.24       89.24       95.36
nnFormer [35]           92.06       90.94       89.58       95.65
SAM3D [4]               90.41       89.44       87.12       94.67
Self-Prompt-SAM (Ours)  93.26       92.20       91.22       96.36
Table 3: Quantitative evaluation on ACDC (Dice score in %).

Results on ACDC. In Table 3, we compare Self-Prompt-SAM with a leading SAM-based method (i.e., SAM3D [4]), a convolution-based method (i.e., nnUNet [16]), and a transformer-based method (i.e., nnFormer [35]). The results show that Self-Prompt-SAM outperforms various state-of-the-art approaches, surpassing SAM3D, nnUNet, and nnFormer by 2.8%, 1.6%, and 1.2% in DSC, respectively. Fig. 3 shows that Self-Prompt-SAM predicts all labels more accurately. The results demonstrate the effectiveness of our method, since our proposed modules properly address the drawbacks of SAM when adapting it to medical segmentation.

3.2 Ablation Study

Baseline Models. We consider 9 baselines of the proposed Self-Prompt-SAM, S1 to S9, as shown in Table 4. All baselines keep the whole structure of SAM and only add blocks. (i) S1 adopts a series of stacked CNNs as the prompt generator combined with the image encoder. (ii) S2 uses our proposed MSPGenerator combined with the image encoder to generate prompts. (iii) S3 builds on S2 and adds the vanilla adapter [13] to each transformer block in the image encoder and the mask decoder. (iv) S4 builds on S2 and adds a modified adapter in which the inverted-bottleneck depth MLPs with a skip connection are inserted before the adapter. (v) S5 builds on S2 and adds a modified adapter in which the inverted-bottleneck depth MLPs with a skip connection are inserted after the adapter. (vi) S6 builds on S2 and adds our DFusedAdapter to each transformer block in the image encoder and mask decoder. (vii) S7 builds on S6 and adds the depth positional embedding (DPosEmbed) to the image encoder and the mask decoder. (viii) S8 builds on S7 and adds an MAdapter before the image encoder. (ix) S9 is our full model in Fig. 2, which builds on S8 and adds an MC-Adapter to adapt binary segmentation to multi-class segmentation.

Method                                                            DSC ↑
S1  SAM + stacked CNNs prompt generator                           79.57
S2  SAM + MSPGenerator                                            82.20
S3  SAM + MSPGenerator + vAdapter                                 90.08
S4  SAM + MSPGenerator + vAdapter w/ Depth MLPs before vAdapter   91.45
S5  SAM + MSPGenerator + vAdapter w/ Depth MLPs after vAdapter    91.52
S6  SAM + MSPGenerator + DFusedAdapter                            91.73
S7  SAM + MSPGenerator + DFusedAdapter + DPosEmbed                91.88
S8  SAM + MSPGenerator + DFusedAdapter + DPosEmbed + MAdapter     92.20
S9  Our Full Model (S8 + MC-Adapter)                              93.26
Table 4: Ablation studies on ACDC (DSC in %). vAdapter denotes the vanilla adapter. DPosEmbed denotes the depth positional embedding.

Ablation analysis. The ablation results on ACDC are shown in Table 4. With the MSPGenerator, the DSC of S2 improves by 2.6% over S1, which uses stacked convolution layers, confirming the effectiveness of the proposed MSPGenerator. After vanilla adapters are inserted into each transformer block, the performance of S3 improves by about 8% over S2, which has no adapters, demonstrating that adapters are a feasible way to fine-tune SAM for medical image segmentation. The performance of S4 and S5 is very close, regardless of whether the depth MLPs with a skip connection are inserted before or after the vanilla adapter, and the DFusedAdapter (S6) outperforms both. Moreover, S6 improves by 1.7% over S3, which confirms the effectiveness of the DFusedAdapter. When depth positional embeddings are added to the image encoder and mask decoder, the performance of S7 improves by more than 0.1% over S6, demonstrating the effectiveness of the DPosEmbed. When the MAdapter is placed before the image encoder, the average DSC of S8 improves by 0.3% over S7, which confirms the benefit of the MAdapter. Compared to S8, S9 (our full model, Self-Prompt-SAM) brings a further improvement of about 1%. Together, these results demonstrate the effectiveness of Self-Prompt-SAM.
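The increments discussed above can be recomputed directly from the DSC column of Table 4. The short script below only copies those values and takes differences; the variable names are illustrative.

```python
# DSC (%) on ACDC, copied from Table 4.
dsc = {
    "S1": 79.57, "S2": 82.20, "S3": 90.08, "S4": 91.45, "S5": 91.52,
    "S6": 91.73, "S7": 91.88, "S8": 92.20, "S9": 93.26,
}

# (variant, reference) pairs compared in the ablation analysis.
comparisons = [("S2", "S1"), ("S3", "S2"), ("S6", "S3"),
               ("S7", "S6"), ("S8", "S7"), ("S9", "S8")]

gains = {f"{a} vs {b}": round(dsc[a] - dsc[b], 2) for a, b in comparisons}
for name, gain in gains.items():
    print(f"{name}: +{gain:.2f}")
```

The six differences are 2.63, 7.88, 1.65, 0.15, 0.32, and 1.06 percentage points, so the adapters (S2 to S3) account for most of the total gain from S1 to S9.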

4 Conclusion

We introduce Self-Prompt-SAM, a framework for adapting pre-trained SAM models from 2D natural images to 3D medical images without manual prompts. Our method employs a multi-scale prompt generator (MSPGenerator) to generate auxiliary masks autonomously. From these masks, we derive bounding boxes as box prompts and select point prompts using the distance transform. We integrate a 3D depth-fused adapter (DFusedAdapter) into the image encoder and mask decoder to enable pre-trained 2D SAM models to process 3D medical images. Extensive experiments show that our method achieves state-of-the-art performance, surpassing nnUNet by 2.3% on AMOS22, 1.6% on ACDC, and 0.5% on Synapse.

References

  • [1] Akkus, Z., Galimzianova, A., Hoogi, A., Rubin, D.L., Erickson, B.J.: Deep learning for brain mri segmentation: state of the art and future directions. Journal of digital imaging 30(4), 449–459 (2017)
  • [2] Bernard, O., Lalande, A., Zotti, C., Cervenansky, F., Yang, X., Heng, P.A., Cetin, I., Lekadir, K., Camara, O., Ballester, M.A.G., et al.: Deep learning techniques for automatic mri cardiac multi-structures segmentation and diagnosis: Is the problem solved? IEEE TMI (2018)
  • [3] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. NeurIPS (2020)
  • [4] Bui, N.T., Hoang, D.H., Tran, M.T., Le, N.: Sam3d: Segment anything model in volumetric medical images. arXiv preprint arXiv:2309.03493 (2023)
  • [5] Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin-unet: Unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537 (2021)
  • [6] Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: Transunet: Transformers make strong encoders for medical image segmentation. arXiv preprint arXiv:2102.04306 (2021)
  • [7] Deng, R., Cui, C., Liu, Q., Yao, T., Remedios, L.W., Bao, S., Landman, B.A., Wheless, L.E., Coburn, L.A., Wilson, K.T., et al.: Segment anything model (sam) for digital pathology: Assess zero-shot segmentation on whole slide imaging. arXiv preprint arXiv:2304.04155 (2023)
  • [8] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
  • [9] Gong, S., Zhong, Y., Ma, W., Li, J., Wang, Z., Zhang, J., Heng, P.A., Dou, Q.: 3dsam-adapter: Holistic adaptation of sam from 2d to 3d for promptable medical image segmentation. arXiv preprint arXiv:2306.13465 (2023)
  • [10] Hatamizadeh, A., Nath, V., Tang, Y., Yang, D., Roth, H.R., Xu, D.: Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In: International MICCAI Brainlesion Workshop (2021)
  • [11] Hatamizadeh, A., Tang, Y., Nath, V., Yang, D., Myronenko, A., Landman, B., Roth, H.R., Xu, D.: Unetr: Transformers for 3d medical image segmentation. In: WACV (2022)
  • [12] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)
  • [13] Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for nlp. In: International Conference on Machine Learning. pp. 2790–2799. PMLR (2019)
  • [14] Hu, C., Li, X.: When sam meets medical images: An investigation of segment anything model (sam) on multi-phase liver tumor segmentation. arXiv preprint arXiv:2304.08506 (2023)
  • [15] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685 (2021)
  • [16] Isensee, F., Jäger, P.F., Kohl, S.A., Petersen, J., Maier-Hein, K.H.: Automated design of deep learning methods for biomedical image segmentation. arXiv preprint arXiv:1904.08128 (2019)
  • [17] Ji, G.P., Fan, D.P., Xu, P., Cheng, M.M., Zhou, B., Van Gool, L.: Sam struggles in concealed scenes – empirical study on "segment anything". arXiv preprint arXiv:2304.06022 (2023)
  • [18] Ji, W., Li, J., Bi, Q., Li, W., Cheng, L.: Segment anything is not always perfect: An investigation of sam on different real-world applications. arXiv preprint arXiv:2304.05750 (2023)
  • [19] Ji, Y., Bai, H., Ge, C., Yang, J., Zhu, Y., Zhang, R., Li, Z., Zhanng, L., Ma, W., Wan, X., et al.: Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation. NeurIPS (2022)
  • [20] Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. arXiv preprint arXiv:2304.02643 (2023)
  • [21] Landman, B., Xu, Z., Iglesias, J.E., Styner, M., Langerak, T., Klein, A.: Miccai multi-atlas labeling beyond the cranial vault – workshop and challenge. In: Proc. MICCAI: Multi-Atlas Labeling Beyond Cranial Vault-Workshop Challenge (2015)
  • [22] Li, C., Khanduri, P., Qiang, Y., Sultan, R.I., Chetty, I., Zhu, D.: Auto-prompting sam for mobile friendly 3d medical image segmentation. arXiv preprint arXiv:2308.14936 (2023)
  • [23] Ma, J., Wang, B.: Segment anything in medical images. arXiv preprint arXiv:2304.12306 (2023)
  • [24] Mohapatra, S., Gosai, A., Schlaug, G.: Sam vs bet: A comparative study for brain extraction and segmentation of magnetic resonance images using deep learning. arXiv preprint arXiv:2304.04738 (2023)
  • [25] OpenAI: GPT-4 technical report (2023)
  • [26] Pan, J., Lin, Z., Zhu, X., Shao, J., Li, H.: St-adapter: Parameter-efficient image-to-video transfer learning. NeurIPS (2022)
  • [27] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention (2015)
  • [28] Wang, W., Chen, C., Ding, M., Yu, H., Zha, S., Li, J.: Transbts: Multimodal brain tumor segmentation using transformer. In: MICCAI (2021)
  • [29] Wu, J., Fu, R., Fang, H., Liu, Y., Wang, Z., Xu, Y., Jin, Y., Arbel, T.: Medical sam adapter: Adapting segment anything model for medical image segmentation. arXiv preprint arXiv:2304.12620 (2023)
  • [30] Xie, B., Tang, H., Cai, D., Yan, Y.: Ms-umlp: Medical image segmentation via multi-scale u-shape mlp-mixer. In: Proceedings of the Asian Conference on Computer Vision. pp. 1793–1808 (2024)
  • [31] Xie, B., Tang, H., Duan, B., Cai, D., Yan, Y.: Masksam: Towards auto-prompt sam with mask classification for medical image segmentation. arXiv preprint arXiv:2403.14103 (2024)
  • [32] Xu, G., Wu, X., Zhang, X., He, X.: Levit-unet: Make faster encoders with transformer for medical image segmentation. arXiv preprint arXiv:2107.08623 (2021)
  • [33] Yang, T., Zhu, Y., Xie, Y., Zhang, A., Chen, C., Li, M.: Aim: Adapting image models for efficient video action recognition. arXiv preprint arXiv:2302.03024 (2023)
  • [34] Zhang, K., Liu, D.: Customized segment anything model for medical image segmentation. arXiv preprint arXiv:2304.13785 (2023)
  • [35] Zhou, H.Y., Guo, J., Zhang, Y., Yu, L., Wang, L., Yu, Y.: nnformer: Interleaved transformer for volumetric segmentation. arXiv preprint arXiv:2109.03201 (2021)
  • [36] Zhou, T., Zhang, Y., Zhou, Y., Wu, Y., Gong, C.: Can sam segment polyps? arXiv preprint arXiv:2304.07583 (2023)

Appendix 0.A Rethinking SAM.

SAM is the first prompt-driven foundation model for natural image segmentation. It is trained on the large-scale SA-1B dataset of 1B masks and 11M images, which gives the model strong zero-shot generalization. SAM consists of three main components: the image encoder, which employs a Vision Transformer backbone to extract image features; the prompt encoder, which embeds various types of prompts, including points, boxes, and text; and the lightweight mask decoder, which generates masks from the image embedding, prompt embedding, image positional embedding, and output token.

Adapting SAM to medical segmentation tasks raises three issues. i) SAM needs users to provide appropriate prompts, such as points or boxes, to locate the target regions. We cannot expect users to have a medical background; therefore, we design the MSPGenerator to produce prompts automatically. We introduce it in the following section. ii) SAM does not provide any semantic information for its segmentation results, since it only generates a binary mask for each prompt. We utilize the MSPGenerator to generate auxiliary multi-class masks, which are encoded as one-hot binary masks whose number of channels equals the number of classes. Each channel-wise binary mask is therefore responsible for one semantic label, identified by its channel index in the one-hot masks. Sometimes a channel-wise binary mask has no foreground. In this case, we set the values of both the box prompt and the point prompt to zero. iii) The performance of SAM in medical image segmentation usually does not meet the strict requirements of clinical medicine. Therefore, we design the DFusedAdapter, insert learnable positional embeddings into the image encoder and mask decoder, and design other blocks with specialized functions to improve performance during fine-tuning.
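The prompt construction in item ii) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names `one_hot` and `box_and_point`, the `(x_min, y_min, x_max, y_max)` box layout, the `(x, y)` point layout, and the brute-force Euclidean distance transform (with the mask padded by a one-pixel background border) are all assumptions made for illustration. The sketch works on a single 2D slice and is only suitable for small masks, since it compares every foreground pixel with every background pixel.

```python
import numpy as np

def one_hot(label_map, num_classes):
    """Encode an (H, W) integer label map as (num_classes, H, W) binary masks."""
    return (np.arange(num_classes)[:, None, None] == label_map[None]).astype(np.uint8)

def box_and_point(binary_mask):
    """Return (box, point) for one channel-wise binary mask.

    box   = (x_min, y_min, x_max, y_max)  # assumed layout
    point = (x, y) of the foreground pixel farthest from any background pixel
    Both are all zeros when the mask has no foreground.
    """
    ys, xs = np.nonzero(binary_mask)
    if ys.size == 0:
        # Empty channel: zero box prompt and zero point prompt.
        return (0, 0, 0, 0), (0, 0)
    box = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))

    # Brute-force distance transform (illustrative only). Padding with a
    # background border guarantees at least one background pixel exists.
    padded = np.pad(binary_mask, 1)
    fg = np.argwhere(padded > 0)
    bg = np.argwhere(padded == 0)
    sq_dist = ((fg[:, None, :] - bg[None, :, :]) ** 2).sum(axis=-1)
    nearest_bg = sq_dist.min(axis=1)          # squared distance to background
    cy, cx = fg[nearest_bg.argmax()] - 1      # undo the padding offset
    return box, (int(cx), int(cy))

# Toy auxiliary mask: class 1 occupies a 3x3 square, class 2 is absent.
label_map = np.zeros((7, 7), dtype=np.int64)
label_map[2:5, 1:4] = 1
masks = one_hot(label_map, num_classes=3)
prompts = [box_and_point(m) for m in masks]
```

Because each prompt is derived from one channel of the one-hot masks, the channel index carries the semantic label of the mask that SAM predicts for that prompt.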

Appendix 0.B Theoretical comparison with SAM-based models.

Our Self-Prompt-SAM differs from existing SAM-based models in three main ways: i) it generates prompts automatically, ii) it provides a semantic label for each mask, and iii) it retains all parameters of the original SAM to preserve its zero-shot capabilities. Existing SAM-based models fall into several categories. The first category does not modify SAM, such as MedSAM and Polyp-SAM. These models need manual prompts, such as points or boxes, and cannot classify masks into semantic labels. The second category applies parameter-efficient transfer learning, such as Adapters, to SAM. A popular model, Med-SA, uses the ground truth (GT) to generate prompts during inference, which has no practical clinical value. This category also includes the non-automatic models 3DSAM-Adapter and MA-SAM. These models do not remove the requirement for extra prompts. The third category cannot assign semantic labels to binary masks, such as DeSAM, Med-SA, and MA-SAM. Since SAM only predicts binary masks, these models do not address the lack of semantic labels. The fourth category abandons components of SAM, such as the mask decoder, to handle the inability to classify semantic labels, such as 3DSAM-Adapter. This approach inevitably destroys the consistency and zero-shot capabilities of SAM. These models only use the pre-trained ViT encoder, which is not itself a contribution of SAM.

Appendix 0.C More details for DFusedAdapter

The DFusedAdapter can be expressed as

$$\text{DFusedAdapter}(\textbf{X})=\textbf{X}+\big(\sigma(\textbf{X}\cdot\textbf{W}_{dn})+\sigma(\sigma(\textbf{X}\cdot\textbf{W}_{dn})\cdot\textbf{W}_{Dup})\cdot\textbf{W}_{Ddn}\big)\cdot\textbf{W}_{up}, \tag{1}$$

where $\sigma$ denotes the activation function, $\textbf{W}_{dn}\in\mathbb{R}^{C\times\frac{C}{4}}$ and $\textbf{W}_{up}\in\mathbb{R}^{\frac{C}{4}\times C}$ denote the linear down- and up-projection layers operating on the channel dimension, respectively, and $\textbf{W}_{Dup}\in\mathbb{R}^{D\times 4D}$ and $\textbf{W}_{Ddn}\in\mathbb{R}^{4D\times D}$ denote the linear up- and down-projection layers operating on the depth dimension, respectively. In this way, our model can learn the extra depth information.
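To make the shapes in Eq. (1) concrete, the following NumPy sketch evaluates the expression term by term. Several choices here are assumptions rather than details from the paper: the input layout `(D, N, C)` (depth slices, tokens per slice, channels), the use of ReLU as a stand-in for the unspecified activation $\sigma$, the omission of bias terms, and the function name `dfused_adapter`.

```python
import numpy as np

def relu(x):
    # Placeholder activation; the text only says "the activation function".
    return np.maximum(x, 0.0)

def dfused_adapter(x, w_dn, w_up, w_dup, w_ddn, act=relu):
    """Evaluate Eq. (1) for x of assumed shape (D, N, C).

    w_dn:  (C, C/4)   channel down-projection
    w_up:  (C/4, C)   channel up-projection
    w_dup: (D, 4D)    depth up-projection
    w_ddn: (4D, D)    depth down-projection
    """
    h = act(x @ w_dn)                         # (D, N, C/4)
    h_depth = np.moveaxis(h, 0, -1)           # (N, C/4, D): depth is now last
    h_depth = act(h_depth @ w_dup) @ w_ddn    # mix across slices, back to D
    h_depth = np.moveaxis(h_depth, -1, 0)     # (D, N, C/4)
    return x + (h + h_depth) @ w_up           # residual connection

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
D, N, C = 4, 5, 8
x = rng.standard_normal((D, N, C))
out = dfused_adapter(
    x,
    w_dn=rng.standard_normal((C, C // 4)),
    w_up=rng.standard_normal((C // 4, C)),
    w_dup=rng.standard_normal((D, 4 * D)),
    w_ddn=rng.standard_normal((4 * D, D)),
)
```

The depth branch is the only place where tokens from different slices interact, which is what lets a 2D backbone see information along the third dimension. With $\textbf{W}_{up}$ set to zero, the adapter reduces to the identity mapping.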

Appendix 0.D Implementation Details

We utilize data augmentations such as rotation, scaling, Gaussian noise, Gaussian blur, brightness and contrast adjustment, simulation of low resolution, gamma augmentation, and mirroring. We set the initial learning rate to 0.01 and employ the “poly” decay strategy given in Eq. (2).

$$lr(e)=init\_lr\times\left(1-\frac{e}{\rm MAX\_EPOCH}\right)^{0.9}, \tag{2}$$

where $e$ denotes the current epoch and MAX_EPOCH denotes the maximum number of epochs, which we set to 1000; each epoch includes 250 iterations. We utilize SGD as our optimizer and set the momentum to 0.99. The weight decay is set to 3e-5. We use the sum of the cross-entropy loss and the Dice loss as the loss function. We utilize instance normalization as our normalization layer. Since we need relatively good auxiliary masks to fine-tune the mask decoder, only the deep supervision loss of the MSPGenerator is used for training in the first two hundred epochs. After two hundred epochs, our model combines the deep supervision loss of the MSPGenerator with the loss of the MC-Adapter at the end of the mask decoder. All experiments are conducted using two NVIDIA RTX A6000 GPUs with 40GB memory.
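As a quick illustration of the schedule, the sketch below evaluates Eq. (2) with the stated hyperparameters and selects the active loss terms per epoch. The function names, the loss labels, and the zero-based epoch convention (epochs 0 to 199 count as "the first two hundred epochs") are assumptions for illustration only.

```python
def poly_lr(epoch, init_lr=0.01, max_epoch=1000, power=0.9):
    """Poly decay of Eq. (2); valid for 0 <= epoch <= max_epoch."""
    return init_lr * (1 - epoch / max_epoch) ** power

def active_losses(epoch, warmup_epochs=200):
    """Loss terms used at a given epoch (zero-based epoch index assumed)."""
    if epoch < warmup_epochs:
        # Only the deep supervision loss of the MSPGenerator is trained.
        return ("mspgenerator_deep_supervision",)
    # Afterwards the mask decoder loss is added.
    return ("mspgenerator_deep_supervision", "mask_decoder")

# Learning rate at a few checkpoints of the 1000-epoch schedule.
schedule = {e: poly_lr(e) for e in (0, 200, 500, 999)}
```

The learning rate starts at 0.01, decreases monotonically, and reaches zero at `max_epoch`.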

Deep Supervision. Our network is trained with deep supervision, with auxiliary losses added in the decoder. For each deep supervision output, we downsample the ground-truth segmentation mask to the matching resolution for the loss computation. The final training objective is the weighted sum of the losses at all resolutions:

$$\mathcal{L}=w_{1}\cdot\mathcal{L}_{1}+w_{2}\cdot\mathcal{L}_{2}+w_{3}\cdot\mathcal{L}_{3}+\cdots+w_{n}\cdot\mathcal{L}_{n} \tag{3}$$

where the weights halve with each decrease in resolution (i.e., $w_{2}=\frac{1}{2}\cdot w_{1}$; $w_{3}=\frac{1}{4}\cdot w_{1}$, etc.), and all weights are normalized to sum to 1. Meanwhile, the resolution at which $\mathcal{L}_{1}$ is computed is twice that of $\mathcal{L}_{2}$ and four times that of $\mathcal{L}_{3}$.
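The weighting rule can be checked with a few lines of plain Python. This is a minimal sketch under the stated rule (halving weights, normalized to sum to 1); the function names are hypothetical and the per-resolution losses are passed in as plain numbers, ordered from the highest resolution to the lowest.

```python
def deep_supervision_weights(n):
    """Weights w_1..w_n that halve per level and are normalized to sum to 1."""
    raw = [0.5 ** i for i in range(n)]   # 1, 1/2, 1/4, ...
    total = sum(raw)
    return [w / total for w in raw]

def deep_supervision_loss(losses):
    """Weighted sum of per-resolution losses, highest resolution first."""
    weights = deep_supervision_weights(len(losses))
    return sum(w * loss for w, loss in zip(weights, losses))

# For three outputs the normalized weights are 4/7, 2/7, and 1/7.
weights = deep_supervision_weights(3)
```

Normalizing the weights keeps the magnitude of the total loss comparable regardless of how many deep supervision outputs are used.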