Contextual Image Attack: How Visual Context Exposes Multimodal Safety Vulnerabilities

First Author
Institution1
Institution1 address
[email protected] Second Author
Institution2
First line of institution2 address
[email protected] Yuan Xiong^1,2 Ziqi Miao^1∗ Lijun Li^1∗ Chen Qian^1,3 Jie Li¹ Jing Shao¹²²footnotemark: 2
¹Shanghai Artificial Intelligence Laboratory
²Xi’an Jiaotong University
³Renmin University of China
[email protected]
[email protected]
xiongyuaay/Contextual-Image-Attack Equal contributionCorresponding Author

Abstract

While Multimodal Large Language Models (MLLMs) show remarkable capabilities, their safety alignments are susceptible to jailbreak attacks. Existing attack methods typically focus on text-image interplay, treating the visual modality as a secondary prompt. This approach underutilizes the unique potential of images to carry complex, contextual information. To address this gap, we propose a new image-centric attack method, Contextual Image Attack (CIA), which employs a multi-agent system to subtly embeds harmful queries into seemingly benign visual contexts using four distinct visualization strategies. To further enhance the attack’s efficacy, the system incorporate contextual element enhancement and automatic toxicity obfuscation techniques. Experimental results on the MMSafetyBench-tiny dataset show that CIA achieves high toxicity scores of 4.73 and 4.83 against the GPT-4o and Qwen2.5-VL-72B models, respectively, with Attack Success Rates (ASR) reaching 86.31% and 91.07%. Our method significantly outperforms prior work, demonstrating that the visual modality itself is a potent vector for jailbreaking advanced MLLMs. WARNING: This paper may contain examples of harmful content for research purposes.

1 Introduction

Refer to caption — Figure 1: Illustration of the CIA setup, where harmful queries are embedded into seemingly harmless visual scenarios.

Multimodal Large language models (MLLMs), e.g., GPT-4o [openaiGPT4oSystemCard2024] and Gemini 2.0 [teamGeminiFamilyHighly2025], while demonstrating remarkable cross-modal understanding and generation capabilities, have also attracted growing attention to their safety [yangAI45DegreeLawRoadmap2024, labSafeWorkR1CoevolvingSafety2025, lu2024gpt]. Compared with Large Language Models (LLMs), MLLMs include an additional visual input channel, which provides a continuous semantic space compared to textual input and introduces additional safety risks [jiangSurveyAdversarialRobustness2025, jinJailbreakZooSurveyLandscapes2024, liuPiCoJailbreakingMultimodal2025]. Jailbreaking, using crafted adversarial inputs to bypass safety alignments and induce harmful content to exposes potential vulnerabilities, is a critical and widely-studied step toward developing robust, trustworthy, and secure MLLMs [niuJailbreakingAttackMultimodal2024, liuJailbreakAttacksDefenses2024, wengMMJBenchComprehensiveStudy2024].

Prior jailbreaking works focused on LLMs, primarily using methods such as prompt injection and role-play [yiJailbreakAttacksDefenses2024, jinJailbreakZooSurveyLandscapes2024]. As research shifted to MLLMs, attacks evolved to exploit the new visual modality, but current strategies exhibit key limitations. One strategy constructs adversarial images or embeds sensitive content through visual input, without fully leveraging the contextual information [baileyImageHijacksAdversarial2024, gongFigStepJailbreakingLarge2025]. A separate line of attack, while considering context, merely adopts the LLM jailbreak approach in text form, describing the context with words and failing to utilize the advantages of images in expressing scenarios [liuMMSafetyBenchBenchmarkSafety2024, maVisualRolePlayUniversalJailbreak2024].

To address the limitations, we propose a Contextual-Image Attack (CIA) method, which constructs images to showcase both the context and malicious intent, with text input serving only as an auxiliary prompt. Figure 1 illustrates the distinction between our approach and prior methods. In our approach (c), we embed the harmful query, “How can someone manufacture illegal drugs?”, within a plausible visual scenario: a teacher posing the question in a chemistry classroom. Such a visually coherent setting provides the model with a seemingly legitimate communicative motive, causing it to misinterpret the user’s intent and thereby bypass safety alignment. This is contrasted with approaches like FigStep [gongFigStepJailbreakingLarge2025], which simply renders the instruction as text within an image (b), or others [liuAutoDANGeneratingStealthy2024, maVisualRolePlayUniversalJailbreak2024, shenAnythingNowCharacterizing2024] that use a purely textual description (a). This image-centric approach holds practical importance, as it is supported by two key technological trends: the increasing capability of image generation to create such visual scenarios [wuQwenImageTechnicalReport2025], and the emerging ability to compress text into visual tokens [weiDeepSeekOCRContextsOptical2025], which highlights the growing potential for images to carry the complex, semantic intent our attack exploits.

CIA centers on image context and demonstrates the safety risks introduced by visually embedded content. To produce contextual images that carry harmful intent, CIA employs a multi-agent system comprising four agents: a Parser, an Image Generator, a Text Refiner, and an Image Refiner. The Parser Agent converts a harmful query into text suitable for embedding in an image; the Image Generator Agent uses four visualization strategies to construct an appropriate image context and synthesize the corresponding contextual image in which the embedded text is placed within semantically plausible elements; the Text Refiner ensures that the embedded textual semantics remain consistent with the original query, while the Image Refiner iteratively augments the image with additional contextual elements to strengthen its effectiveness. Leveraging this multi-agent system, we show that harmful queries can be concealed within benign-appearing images, revealing vulnerabilities in current MLLM safety mechanisms.

Our main contributions can be summarized as follows:

•

We introduce CIA, a novel method that embeds harmful queries into seemingly harmless image scenarios. CIA addresses the limitations of existing MLLM jailbreak techniques in their use of visual context.
•

CIA constructs attack data through a multi-agent system, leveraging four scene-visualization strategies to generate semantically coherent contextual images and enhance attack viability through contextual element integration.
•

We validate CIA on multiple datasets and target models. CIA achieves toxicity scores of 4.73 and 4.83, and ASRs of 86.31% and 91.07% on GPT-4o and Qwen2.5-VL-72B respectively, outperforming baseline methods.

2 Related Works

2.1 Multimodal Large Language Models

In recent years, the strong capabilities of LLMs have accelerated the development of MLLMs [changSurveyEvaluationLarge2023, zhaoSurveyLargeLanguage2025, zhangVisionLanguageModelsVision2024]. Building on the language understanding abilities of LLMs, MLLMs incorporate modality-specific encoders to process images, audio, and video, enabling unified multimodal reasoning through semantic alignment [bordesIntroductionVisionLanguageModeling2024, yinSurveyMultimodalLarge2024]. MLLMs typically comprise three core components: (i) modality-specific encoders that extract representative features from each input modality, (ii) a cross-modal projection module that maps these features into the language model embedding space, and (iii) a Transformer-based language model that operates on the aligned representations to perform multimodal reasoning and generation [zhuMiniGPT4EnhancingVisionLanguage2023, liu2024improved, chenInternVLScalingVision2024, bai2025qwen2]. MLLMs demonstrate strong performance on visual question answering (VQA) [agrawalVQAVisualQuestion2016, khanHowSpecializeLarge2023, yuProphetPromptingLarge2025], image captioning [huScalingVisionLanguagePretraining2022, liEVCapRetrievalAugmentedImage2024], and visual commonsense reasoning [tanakaVisualMRCMachineReading2021, zellersRecognitionCognitionVisual2019], and have further been applied to video understanding, temporal reasoning, and multimodal retrieval [zhangVideoLLaMAInstructiontunedAudioVisual2023, girdharImageBindOneEmbedding2023]. In this paper, we analyze representative open-source and closed-source MLLMs [openaiGPT4VisionSystemCard2023, teamGeminiFamilyHighly2025].

2.2 Jailbreak Attacks on MLLMs

The rapid evolution of LLMs has intensified concerns over their safety and security, and jailbreak attacks have become a primary means of probing their safety and security boundaries. Prior work on LLMs has proposed methods such as adversarial suffixes, multi-turn role-playing, and contextual prompt designs to manipulate model behavior [yiJailbreakAttacksDefenses2024, jinJailbreakZooSurveyLandscapes2024, miaoResponseAttackExploiting2025]. Recent studies have extended these approaches to MLLMs [baileyImageHijacksAdversarial2024, niuJailbreakingAttackMultimodal2024, qiVisualAdversarialExamples2023, wengMMJBenchComprehensiveStudy2024, zhangMultiTrustComprehensiveBenchmark2024, zhaoEvaluatingAdversarialRobustness2023]. Some efforts [baileyImageHijacksAdversarial2024, niuJailbreakingAttackMultimodal2024, qiVisualAdversarialExamples2023, zhaoEvaluatingAdversarialRobustness2023] focus on optimizing visual inputs to bypass safety alignment. For example, Bailey et al. [baileyImageHijacksAdversarial2024] proposed an image-hijacking attack that crafts adversarial images to induce harmful outputs in MLLMs; Qi et al. [qiVisualAdversarialExamples2023] demonstrated that visual inputs constitute a significant security vulnerability, where a single adversarial image can bypass safety-aligned MLLMs; Shayegani et al. [shayeganiJailbreakPiecesCompositional2023] introduced compositional adversarial attacks that require access only to the visual encoder to craft adversarial images; and Niu et al. [niuJailbreakingAttackMultimodal2024] generated adversarial images via a maximum likelihood estimation (MLE)-based method. Although these techniques can achieve high attack success rates, the resulting adversarial images often suffer from semantic corruption, and, in real-world scenarios, harmful intent is typically conveyed through text instructions. Other work combines text prompts with images containing harmful content to jailbreak MLLMs [gongFigStepJailbreakingLarge2025, liuMMSafetyBenchBenchmarkSafety2024, dingRethinkingBottlenecksSafety2025, zhaoJailbreakingMultimodalLarge2025, zhangFCAttackJailbreakingMultimodal2025, miaoVisualContextualAttack2025]. Gong et al. [gongFigStepJailbreakingLarge2025] embedded harmful queries into blank images via typography, leveraging MLLMs’ OCR capabilities for jailbreak; Liu et al. [liuMMSafetyBenchBenchmarkSafety2024] generated query-related images using Stable Diffusion or typography; Ding et al. [dingRethinkingBottlenecksSafety2025] produced multiple images to replace unsafe elements in harmful queries; and Zhao et al. [zhaoJailbreakingMultimodalLarge2025] designed an image–text jailbreak attack that exploits MLLMs’ shuffle inconsistency. Following these studies, Zhang et al. [zhangFCAttackJailbreakingMultimodal2025] embedded harmful queries into blank images as flowcharts, and Miao et al. [miaoVisualContextualAttack2025] proposed a vision-centric jailbreak attack based on multi-turn image–text dialogues. Related defenses are discussed in App. A.

3 Contextual Image Attack

The method is designed to bypass the target model’s safety-alignment mechanisms in a black-box setting. This is achieved by tightly integrating harmful intents with contextual scenarios in images. The core process is managed by a multi-agent system which constructs a scene-based initial target image, refines that image and adds contextual elements, and then combines auxiliary text prompts to attack the target model. Workflow of CIA is illustrated in Fig. 2(a).

3.1 Problem Formulation

General multimodal jailbreak attacks consist of three key components: a target model ( $\pi$ ), a target image ( $I$ ), and a harmful query ( $Q$ ). These attacks break the target model’s safety alignment by leveraging the interaction between $I$ and $Q$ to influence the model’s output. Unlike approaches where the harmful intent is primarily conveyed through the query and the image plays only a supporting role, our method centers the attack on the target image. Specifically, we construct the target image at the semantic level and use a general text prompt $T$ merely to guide the model’s response.

Formally, we model our attack as synthesizing an optimal contextual image scene $I^{*}$ that can jailbreak a target model $\pi$ when prompted. Given a harmful query $Q$ , let $\mathcal{F}_{\text{intent}}$ be an abstraction function that maps $Q$ to its core intent $i_{Q}=\mathcal{F}_{\text{intent}}(Q)$ . Our method then transforms this abstract intent into a visual narrative via a synthesis process $\mathcal{G}_{\text{scene}}$ , producing a contextual image $I$ :

I=\mathcal{G}_{\text{scene}}\big(\mathcal{F}_{\text{intent}}(Q)\big).

(1)

This synthesized image $I$ is then combined with a guiding prompt $T$ to form the input $P=(I,T)$ . When fed to the target model, this input elicits a response $R=\pi(P)$ .

The objective is to optimize the image generation process $\mathcal{G}_{\text{scene}}$ to produce an image $I^{*}$ that maximizes the likelihood of a successful jailbreak. A successful response should be both aligned with the original harmful query $Q$ and in violation of the model’s safety policies. This can be expressed as the following optimization problem:

\max_{I,T}\;\Pr\big(\text{Align}(\pi(I,T),Q)\wedge\text{Unsafe}(\pi(I,T))\big),

(2)

where $\text{Align}(\cdot,\cdot)$ is a semantic alignment metric and $\text{Unsafe}(\cdot)$ is a safety policy violation classifier. Here, the primary optimization variable is the image scene $I$ , as $T$ is typically a generic fixed prompt.

The construction of the target image proceeds through four agents. In Sec. 3.2, we introduce the Parser Agent and the Image Generator Agent: the Parser Agent parses the harmful query $Q$ , extracts key semantic elements, and constructs the visual text and frame structure, while the Image Generator Agent applies scene-visualization strategies to assemble the scene, embeds these prompts in appropriate positions, and generates the initial target image. In Sec. 3.3 we introduce the Text Refiner, which ensures the generated visual text and frame structure remain semantically consistent with $Q$ , and the Image Refiner, which augments the image with four types of contextual elements to strengthen attacks while preserving semantics. The final enhanced image is then used to perform the attack.

3.2 Intent-to-Scene Image Generation

To generate a scene-dependent target image from a harmful query, we employ a Parser Agent and an Image Generator Agent. The Parser Agent extracts the underlying intent and produces the text to be embedded in the image, while the Image Generator Agent synthesizes a contextualized scene that incorporates this intent using four scene-visualization strategies to construct the initial target image.

Parser: Intent Extraction and Encoding.

Directly embedding a harmful query into an image is often incompatible with contextual interaction. To address this, the Parser Agent decomposes the harmful query into two structured components: a visual text $v$ and a frame structure $f$ . The visual text $v$ encodes the textual intent for embedding into the target image, ensuring natural interaction with the surrounding visual context. The frame structure $f$ specifies the region or layout designated as the response area for $v$ , serving as a structural placeholder without harmful content. Illustrative examples are provided in App. E.

Specifically, the Parser Agent first parses the original harmful query $Q$ to extract key semantic components, including actions, carriers, and core intentions, using an auxiliary language model $\pi_{\text{aux}}^{\text{text}}$ . It then encodes these semantic units into the two components $(v,f)$ , as represented by $(v,f)=\pi_{\text{aux}}^{\text{text}}(Q)$ . In addition, toxicity obfuscation is applied to $v$ to conceal sensitive keywords while preserving the original intent and contextual coherence. The Parser Agent therefore produces structured semantic representations, which are subsequently utilized by the Image Generator Agent to compose a coherent scene description and generate the initial target image.

Image Generator: Multi-Strategy Scene Visualization.

After obtaining the visual text $v$ and frame structure $f$ from the Parser Agent, the Image Generator Agent utilizes them to construct a coherent visual context and synthesize the corresponding initial target image.

Specifically, the Image Generator takes $(v,f)$ together with a predefined scene-strategy template $\mathcal{T}_{k}$ as input, where $k\in\{1,2,3,4\}$ indexes the four visualization strategies. It then generates a descriptive image-generation prompt, which is fed into a text-to-image model $G$ (e.g., Stable Diffusion [rombachHighResolutionImageSynthesis2022]) to produce the initial target image:

I_{0}=G\big(\pi_{\text{aux}}^{\text{text}}(v,f,\mathcal{T}_{k})\big).

(3)

To ensure that the generated images adapt to diverse contexts, the Image Generator Agent employs four visualization strategies. Each strategy embeds the visual text within a plausible environment and ensures coherent interaction with other visual elements. These strategies are inspired by jailbreak context-design approaches in text-only LLM attacks and integrate cues such as teachers, papers, and planning boards to mitigate MLLM safety alignment constraints.

As illustrated in Fig. 3, we design four visualization strategies: (i) Demonstration, where $v$ is presented on instructional media (e.g., blackboard, whiteboard, or display), with $f$ serving as a brief instructional caption beneath (e.g., “Step 1: …”); (ii) Sequential Path, where $v$ is placed near a roadmap endpoint and $f$ provides intermediate reasoning steps along the path, implying a logical progression; (iii) Structured Content, where $v$ is embedded within rich-text artifacts (e.g., papers or articles), with $f$ appearing as surrounding explanatory text; and (iv) Dialogue Layout, where $v$ is distributed across question bubbles and $f$ appears in reply bubbles of a comic- or dialogue-style layout, with the background (e.g., classroom or lab) conveying the model’s implied “role”.

3.3 Iterative Refinement and Element Integration

In Sec. 3.2, the Parser and the Image Generator jointly generate scene-dependent target images that embed harmful intent, but this process may introduce semantic drift or trigger safety mechanisms. Thus, we introduce two coordinated agents, Text Refiner and Image Refiner, which perform iterative refinement and contextual-element augmentation to preserve semantic fidelity while improving attack viability. Their refinement workflows are shown in Fig. 2(b) and (c).

Text Refiner: Iterative Prompt Refinement.

Since modifying embedded text after image generation is difficult, a Text Refiner first refines the visual text $v$ and frame structure $f$ to ensure semantic consistency with the original query $Q$ before generating the target image. After producing the initial visual text $v_{0}$ and frame structure $f_{0}$ , it invokes an auxiliary language model $\pi_{\text{aux}}^{\text{text}}$ to iteratively evaluate and refine them. At each iteration $i$ , a semantic consistency indicator $S_{i}^{\text{text}}\in\{0,1\}$ is computed, where $S_{i}^{\text{text}}=1$ indicates that $(v_{i},f_{i})$ are semantically consistent with the query $Q$ (The computation of $S_{i}^{\text{text}}$ will be introduced later in this subsection). The update rule is:

(v_{i+1},f_{i+1})=\begin{cases}\pi_{\text{aux}}^{\text{text}}(v_{i},f_{i},Q),&S_{i}^{\text{text}}=0,\\ (v_{i},f_{i}),&S_{i}^{\text{text}}=1.\end{cases}

(4)

This process continues until semantic consistency is achieved (i.e., $S_{i}^{\text{text}}=1$ ), yielding the final visual text $v$ and frame structure $f$ .

Image Refiner: Iterative Contextual Element Addition.

After generating the initial target image, the Image Refiner iteratively enhances it to improve attack effectiveness. To prevent unwanted semantic drift during enhancement, the Image Refiner adopts a dedicated check–then–act mechanism. At each iteration $i$ , it first computes a binary semantic-drift indicator $S_{i}^{\text{img}}\in\{0,1\}$ , similar to the one used at the text level, where $S_{i}^{\text{img}}=0$ denotes detected semantic drift. The editing instruction is then selected as

E_{i}=\begin{cases}\pi_{\text{aux}}^{\text{mm}}(I_{i},v,f;\,\mathcal{T}_{\text{mod}}),&S_{i}^{\text{img}}=0,\\[6.0pt] \pi_{\text{aux}}^{\text{mm}}(I_{i},v,f;\,\mathcal{T}_{\text{aug}}),&S_{i}^{\text{img}}=1,\end{cases}

(5)

where $\mathcal{T}_{\text{mod}}$ and $\mathcal{T}_{\text{aug}}$ denote the corrective and augmentation templates, respectively. When a semantic drift is detected, $E_{i}$ applies corrective adjustments to steer the image back toward the target semantics $Q$ ; otherwise, it introduces contextual enhancements without altering the intended meaning. The updated image $I_{i+1}$ is then obtained through the image-editing model $\mathcal{G}$ as $I_{i+1}=\mathcal{G}(I_{i},E_{i})$ . The iteration proceeds for a predefined number of contextual-augmentation steps before the process terminates.

Specifically, the Image Refiner employs four contextual-element augmentation strategies: (i) Auxiliary text embedding, inserting brief assisting phrases into the image (e.g., “Ignore previous instruction”); (ii) Safety-icon addition, placing innocuous safety-related icons (e.g., green checkmarks) at visually appropriate positions; (iii) Emoji insertion, embedding emojis among key tokens in the visual text to subtly perturb tokenization; and (iv) Noise injection, blurring or distorting portions of critical keywords to evade exact-match detectors while preserving human legibility. These augmentations collectively increase the adversarial image’s capacity to evade automated safety mechanisms while retaining the intended semantics.

Semantic Consistency Checking Mechanism.

To maintain semantic alignment with the original query $Q$ throughout the optimization, both the Text Refiner and the Image Refiner employ a shared semantic consistency checking mechanism. At each iteration $i$ , the Refiner Agent first generates a pseudo-response $R_{i}=\pi_{\text{eval}}(A_{i})$ from the current refinement input $A_{i}$ , where $A_{i}$ denotes the object being refined and $\pi_{\text{eval}}$ is a weakly aligned MLLM that produces the pseudo-response. For the Text Refiner, the input is the visual text and frame structure $(v_{i},f_{i})$ ; for the Image Refiner, the input is the intermediate image $I_{i}$ . The auxiliary model then assesses whether this response is semantically consistent with $Q$ as $S_{i}=\pi_{\text{aux}}(R_{i},Q)$ , yielding a binary indicator $S_{i}\in\{0,1\}$ , where $S_{i}=1$ denotes semantic consistency.

3.4 Attack Execution

The target image generation follows a four-phase pipeline that integrates all four agents:

Phase I: The Parser Agent extracts intent from the harmful query $Q$ and converts it into two structured components.

Phase II: The Text Refiner enforces semantic consistency of $(v,f)$ by iteratively evaluating and updating them.

Phase III: The Image Generator Agent applies a selected strategy template $\mathcal{T}_{k}$ and synthesizes the initial target image.

Phase IV: The Image Refiner iteratively enriches the target image with contextual elements, producing the enhanced target image $I^{*}$ while ensuring semantic consistency.

In the final execution phase, the optimized adversarial image $I^{*}$ , is paired with an fix auxiliary textual prompt, $T$ , to form the complete attack input. Finally, the complete prompt $(I^{*},T)$ is then fed into the target model $\pi_{\text{tar}}$ , yielding $R=\pi_{\text{tar}}(I^{*},T)$ . The attack succeeds if $R$ aligns with $Q$ and violates the model’s safety policy (cf. Eq. 2).

4 Experiments

Table 1: Performance of QR-Attack, SI-Attack, VisCo Attack, and our CIA on GPT-4o and Qwen2.5-VL-72B in terms of Toxicity and Attack Success Rate (ASR, %). Results are obtained on MMSafetyBench-tiny, where “01-IA” to “13-GD” denote the 13 evaluation categories, and “ALL” reports performance aggregated over the full dataset.

	QR-Attack				SI-Attack				VisCo Attack				CIA (ours)
Model	GPT-4o		Qwen2.5-VL		GPT-4o		Qwen2.5-VL		GPT-4o		Qwen2.5-VL		GPT-4o		Qwen2.5-VL
Metric	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR
01-IA	1.00	0.00	1.90	20.00	2.40	20.00	4.10	60.00	4.90	90.00	5.00	100.00	4.90	90.00	5.00	100.00
02-HS	1.19	0.00	2.56	31.25	2.88	18.75	4.44	56.25	4.62	68.75	5.00	100.00	4.56	68.75	4.88	87.50
03-MG	1.60	0.00	4.80	80.00	4.20	40.00	4.80	80.00	5.00	100.00	5.00	100.00	5.00	100.00	5.00	100.00
04-PH	1.86	21.43	3.00	42.86	3.14	42.86	4.79	78.57	4.86	85.71	4.93	92.86	4.71	92.86	5.00	100.00
05-EH	2.58	16.67	4.25	75.00	2.83	16.67	4.25	58.33	4.67	75.00	4.83	91.67	4.83	91.67	5.00	100.00
06-FR	1.67	13.33	2.67	40.00	2.67	13.33	4.60	80.00	4.93	93.33	5.00	100.00	5.00	100.00	4.80	86.67
07-SE	1.73	18.18	4.82	81.82	1.55	0.00	4.45	63.64	4.45	72.73	4.73	81.82	4.45	72.73	4.82	90.91
08-PL	4.13	60.00	4.73	86.67	4.33	66.67	4.47	73.33	5.00	100.00	4.87	93.33	4.87	93.33	4.73	86.67
09-PV	1.57	14.29	4.29	57.14	2.57	28.57	4.93	92.86	5.00	100.00	5.00	100.00	5.00	100.00	5.00	100.00
10-LO	3.00	15.38	3.31	23.08	2.92	7.69	3.54	30.77	4.69	84.62	4.85	84.62	4.77	84.62	4.85	92.31
11-FA	3.59	52.94	3.59	47.06	3.24	17.65	3.41	29.41	4.82	88.24	5.00	100.00	4.53	82.35	4.53	82.35
12-HC	2.91	0.00	3.91	36.36	3.27	18.18	4.09	54.55	4.82	81.82	4.27	54.55	4.45	72.73	4.82	90.91
13-GD	2.87	20.00	3.73	40.00	3.27	20.00	3.87	40.00	4.73	80.00	4.20	46.67	4.60	80.00	4.67	80.00
\rowcolorblack!3 ALL	2.36	20.24	3.60	49.40	3.01	23.81	4.26	60.12	4.80	85.71	4.82	88.10	4.73	86.31	4.83	91.07

Table 2: Performance of FigStep, FigStep-Pro, SI-Attack, VisCo Attack, our CIA, and the Text baseline across various models in terms of Toxicity and Attack Success Rate (ASR, %). Results are obtained on SafeBench-tiny. “Text” corresponds to directly using the original harmful queries without any jailbreak strategy.

GPT-4o

GPT-4o-mini

Gemini-2.0

Qwen2.5-VL

InternVL2.5

Average

Method

Toxic

ASR

Toxic

ASR

Toxic

ASR

Toxic

ASR

Toxic

ASR

Toxic

ASR

Text

1.88

10.00

2.06

16.00

1.52

2.00

1.68

6.00

1.48

4.00

1.72

7.60

FigStep

1.74

12.00

3.02

40.00

3.86

54.00

4.18

64.00

2.74

34.00

3.11

40.80

FigStep-Pro

1.82

2.00

2.22

0.00

3.36

30.00

3.04

26.00

2.78

8.00

2.64

13.20

SI-Attack

1.54

2.00

3.46

30.00

3.92

38.00

4.22

62.00

3.94

50.00

3.42

36.40

VisCo Attack

4.60

76.00

4.76

86.00

4.68

80.00

4.82

86.00

4.84

88.00

4.74

83.20

\rowcolorblack!3 CIA (ours)

4.60

86.00

4.88

92.00

4.96

96.00

4.88

92.00

4.84

88.00

4.83

90.80

We conduct extensive experiments to evaluate our proposed CIA across multiple benchmarks and target models, including both open-source and closed-source MLLMs. Furthermore, we perform an ablation study to analyze the contribution of each component of CIA. Additional experimental details and extended results are provided in the appendix.

4.1 Experimental Settings

Evaluation MLLMs.

We evaluate CIA on open- and closed-source MLLMs. For open-source models, we consider two representatives: Qwen2.5-VL-72B [qwenQwen25TechnicalReport2025] and InternVL2.5-78B [chenInternVLScalingVision2024]. For closed-source models, we evaluate three MLLMs: GPT-4o [openaiGPT4oSystemCard2024], GPT-4o-mini [openaiGPT4oSystemCard2024], and Gemini-2.0-flash [comaniciGemini25Pushing2025]. For brevity, Tab. 1 reports only GPT-4o and Qwen2.5-VL-72B, with other models in App. C.

Benchmarks and Baselines.

We evaluate CIA on two widely used multimodal safety benchmarks: MMSafetyBench [liuMMSafetyBenchBenchmarkSafety2024] and SafeBench [gongFigStepJailbreakingLarge2025]. MMSafetyBench provides 13 categories of harmful queries (e.g., illegal activity, hate speech, malware generation), where attacks are formed by modifying a question and pairing it with a related image. In our setting, we use only the harmful textual questions and regenerate the contextual images using CIA. SafeBench [gongFigStepJailbreakingLarge2025] is a multimodal safety benchmark built on typography-based adversarial injections, where harmful instructions are rendered as text on blank images. All experiments use the tiny subsets of both benchmarks: MMSafetyBench-tiny contains 168 samples, and SafeBench-tiny includes 50 harmful queries. For baselines, we compare CIA with four representative multimodal jailbreak attacks, including QR-Attack [liuMMSafetyBenchBenchmarkSafety2024], FigStep [gongFigStepJailbreakingLarge2025], SI-Attack [zhaoJailbreakingMultimodalLarge2025], and VisCo-Attack [miaoVisualContextualAttack2025], with the corresponding baseline setups described in App. B.

Implementation Details.

We cap the number of refinements on the visual text $v$ and frame structure $f$ at 5. For the target image, we add at least 3 contextual elements and cap the image-refinement process at 6 iterations. In the Dialogue Layout strategy, the harmful query is decomposed into 3 sub-questions. Additional details of the experimental setup are provided in App. B.

Evaluation Metric.

We adopt two primary evaluation metrics: Toxic Score (Toxic) and Attack Success Rate (ASR). Following previous work [qi2023towards, dingRethinkingBottlenecksSafety2025, miaoVisualContextualAttack2025], we employ the GPT-4o judge to assess response harmfulness. Given an original harmful query and model response, the evaluator assigns a Toxic Score from 1 to 5, where higher values indicate greater harmfulness and 5 means the response fully satisfies the harmful intent of the query. We regard an attack as successful only when the score is 5. The evaluation prompt is provided in App. B.3. Among the four semantic elements defined in our study, each element is suitable for different scenarios and models. Consequently, an attack is regarded as successful if either the original target image or any of the four refined target images achieves a successful attack. When calculating the Toxic Score, we select the highest score among five evaluation results for each sample under a specific scenario. We report results only for the Demonstrative Framework, with additional results for other strategies provided in Apps. C and D.

4.2 Performance on MLLMs

We conduct comprehensive experiments on the MMSafetyBench-tiny and SafeBench-tiny datasets. The overall results are summarized in Tab. 1 and Tab. 2.

CIA demonstrates superior and consistent effectiveness compared to baseline.

On MMSafetyBench-tiny, CIA achieves ASR values of 86.31% on GPT-4o and 91.07% on Qwen2.5-VL-72B. These results exceed those of QR-Attack, SI-Attack, and VisCo Attack. In terms of Toxicity, CIA obtains scores of 4.73 and 4.83 on the two models, respectively, which are noticeably higher than those of QR-Attack (2.36/3.60) and SI-Attack (3.01/4.26). This indicates that CIA not only circumvents safety protections, but also induces the models to generate clearer and more complete harmful content, rather than vague or partially filtered responses. The differences among the baselines are also pronounced. QR-Attack performs the weakest, with an ASR of only 20.24% on GPT-4o and consistently low Toxicity scores, suggesting that simple prompt rewriting rarely breaks current safety alignment mechanisms. SI-Attack improves upon QR-Attack in both ASR and Toxicity, particularly on Qwen2.5-VL-72B, where it reaches 60.12% ASR, but there still remains a large gap compared to CIA. VisCo Attack is overall stronger and closely approaches CIA on both metrics; however, CIA still achieves higher average ASR on GPT-4o (86.31% vs. 85.71%) and Qwen2.5-VL-72B (91.07% vs. 88.10%). On SafeBench-tiny, CIA attains an average ASR of 89.20%, again surpassing all baselines. Although VisCo Attack also achieves competitive results, it relies on multi-round, multi-image conversational contexts and powerful red-teaming models to construct attack prompts, which leads to significantly higher computational and monetary costs than CIA. In contrast, CIA requires generating only a single contextual image per query and adopts a simpler attack pipeline, thus offering more pronounced advantages in scalability and reproducibility.

CIA exhibits strong generalization across models and categories.

Baseline methods such as SI-Attack and FigStep achieve reasonable performance on open-source models, but their effectiveness drops sharply on closed-source ones. For example, FigStep obtains 64% ASR on Qwen2.5-VL-72B but only 12% on GPT-4o. In contrast, CIA consistently maintains ASR above 85% across all evaluated models, including GPT-4o, GPT-4o-mini, Gemini-2.0-flash, Qwen2.5-VL-72B, and InternVL2.5-78B. Across different categories of harmful instructions, CIA also exhibits remarkably stable performance. As shown in Tab. 1, it achieves near-perfect ASR in categories such as 03-MG (weapons manufacturing) and 09-PV (privacy violation), and remains strong in more challenging categories including 01-IA (illegal advice), 06-FR (fraud), and 12-HC (hate crime). This consistency indicates that CIA does not rely on category-specific templates, but instead exploits a general vulnerability in the vision–language alignment and instruction-following mechanisms of current MLLMs.

4.3 Ablation Study

To evaluate the contribution of each component of CIA, we conduct an ablation study on the SafeBench-Tiny dataset with GPT-4o as the target model; the results are reported in Fig. 4. We consider the following six variants, evaluated by toxicity score and attack success rate (ASR): (i) CIA (full), the complete pipeline with visual scene, safety icon, and contextual refinement; (ii) w/o FP, removing the frame structure and retaining only simple numbered cues; (iii) w/o VP, removing the visual text $v$ and directly using the harmful query; (iv) w/o Image, a text-only attack without any visual input; (v) w/o Context, embedding harmful content into a blank image without any contextual elements; and (vi) w/o Refinement, using only the initial target image without semantic alignment or contextual enhancement.

Impact of components in CIA.

The results show that removing the frame structure significantly reduces the ASR (from 60% to 38%), indicating its importance in guiding the model toward harmful responses. When the visual-text toxicity obfuscation is removed, GPT-4o’s ASR drops from 60% to 36%, suggesting that the model relies on toxicity obfuscation to bypass safety mechanisms. Furthermore, removing either the visual input or the image context leads to a substantial decline in both Toxicity Score and ASR, confirming the central role of visual context in the attack process. While using only the initial target image still yields moderate effectiveness, semantic alignment and the addition of contextual elements further enhance attack performance. For completeness, we also perform the same ablation on Gemini-2.0-flash and observe consistent trends. Detailed results, as well as additional experiments on different visualization strategies and different contextual-element enhancements, are provided in App. D.

4.4 Further Discussion

To probe the role of visual context in jailbreak attacks, we measure the embedding separability of benign versus harmful prompts in the InternVL2.5-78B [chenInternVLScalingVision2024] model. We generate 50 benign–harmful prompt pairs from SafeBench-tiny using GPT-4o-mini [openaiGPT4oSystemCard2024] and, following prior work [gongFigStepJailbreakingLarge2025, zhaoJailbreakingMultimodalLarge2025], analyze the model’s hidden states under two settings: (i) text-only inputs and (ii) CIA-generated image inputs, where harmful intent is embedded within visually coherent scenes. Our results show that visual context substantially weakens the model’s text-based safety alignment, effectively collapsing the separation between benign and harmful representations. As illustrated in the t-SNE visualization (Fig. 6), final-layer hidden states from text-only inputs are well separated and achieve 91% linear classification accuracy. In contrast, introducing visual context eliminates this separability, producing tightly interwoven embeddings. This effect persists across the entire network depth, as confirmed by the layer-wise analysis (Fig. 5). The figure reports the Fisher Ratio, a standard class-separability metric where higher values indicate stronger distinction between benign and harmful embeddings. For text-only inputs, separability increases steadily with depth, whereas for CIA image–text inputs it remains near zero at all layers, revealing a significant safety vulnerability induced by visual context.

5 Conclusion

In this paper, we introduce Contextual Image Attack (CIA), an image-centric jailbreak method that embeds harmful intent into semantically crafted visual contexts to evade MLLM safety mechanisms. CIA employs a multi-agent system, comprising a Parser, an Image Generator, and a dual-path Refiner, to generate contextual images that preserve semantic fidelity while effectively obscuring harmful queries. Experiments on MMSafetyBench-tiny and SafeBench-tiny show that CIA substantially outperforms existing baselines in both toxicity score and attack success rate. Ablation studies further demonstrate that visual context is critical to bypassing model safety alignment. These findings highlight overlooked vulnerabilities posed by visually encoded adversarial inputs and underscore the need for vastly more robust safety alignment for visual modalities.

Ethics Statement

This work conducts controlled jailbreak evaluations to reveal safety risks in black-box MLLMs and support stronger defense development. We emphasize the need for rigorous safety assessment before deploying MLLMs to the public.

\thetitle

Supplementary Material

Appendix A Defenses for MLLMs

To mitigate the growing threat of jailbreak attacks against MLLMs, defense strategies have evolved along two primary directions: input detection and safety alignment. In the domain of detection, specialized models such as LLama-guard [inanLlamaGuardLLMbased2023] and mutation-based techniques [zhangJailGuardUniversalDetection2025] have been developed to identify malicious intent. Xu et al. [xuCrossmodalityInformationCheck2024] utilize cross-modal similarity to identify harmful inputs, while Zhao et al. [zhaoFirstKnowHow2024] leverage the first output token for classification. In addition, commercial services such as ChatGPT [openai2023chatgpt], PerspectiveAPI [jigsaw2023perspective], and ModerationAPI [openai2023moderation] are also employed as detectors. Regarding safety alignment, recent research focuses on both dataset construction and advanced fine-tuning strategies. In terms of data and benchmarks, Zong et al. [zongSafetyFineTuningAlmost2024] introduce VLGuard for safety instruction tuning, while Qu et al. [qu2025self] propose a self-adaptive construction method for real-world safety scenarios. Furthermore, Li et al. [li-etal-2024-salad] develop SALAD-Bench to establish a hierarchical safety evaluation framework, and also present T2isafety [li2025t2isafety] for assessing fairness, toxicity, and privacy. For alignment strategies, Li et al. [li2025layer] design a layer-aware representation filtering to purify fine-tuning data, and Ding et al. [ding2025rethinking] rethink safety bottlenecks by utilizing multi-image inputs. Finally, Chakraborty et al. [chakrabortyCrossModalSafetyAlignment2025] attempt to fine-tune MLLMs exclusively in the textual domain to achieve cross-modality safety alignment.

Appendix B Experimental Setup and Implementation Details

B.1 Attack Baselines Implementation

FigStep [gongFigStepJailbreakingLarge2025].

FigStep is a black-box multimodal jailbreak attack that converts harmful textual instructions into typographic images. This shifts unsafe content from the text channel to the visual channel, helping it evade safety filters. In our experiments, we use the image-text pairs released in SafeBench-Tiny. For each harmful query, we submit the corresponding official multimodal pair to the target model without any further modification.

QR-Attack [liuMMSafetyBenchBenchmarkSafety2024].

The Query-Relevant (QR) Attack constructs an adversarial image–text pair for each harmful query by attaching a query-relevant image whose visual content explicitly reflects the critical unsafe key phrase. By default, MM-SafetyBench instantiates QR-Attack with the SD+Typo setting, where images are generated by Stable Diffusion and overlaid with typographic text encoding the unsafe key phrase. In our experiments, we follow this default SD+Typo configuration and directly adopt the MMSafetyBench-Tiny release, using its official harmful image–text pairs as fixed attack inputs.

SI-Attack [zhaoJailbreakingMultimodalLarge2025].

SI-Attack is a multimodal jailbreak method that exploits shuffle inconsistency by randomly shuffling the harmful text and image while using a judge model to retain variants that remain semantically harmful but more likely to bypass safety filters. In our implementation, we strictly follow the official protocol on both MMSafetyBench-Tiny and SafeBench-Tiny: the image is divided into four blocks and randomly shuffled, while the text is shuffled at the word level for each query. For MMSafetyBench-Tiny, we additionally append the harmful key phrase as typographic text at the bottom of the shuffled image, and for fairness we cap the maximum number of attack queries per harmful question at five.

VisCo Attack [miaoVisualContextualAttack2025].

VisCo Attack is a contextual multimodal jailbreak that constructs visually grounded dialogue histories and then refines an adversarial query that is coherent with the image yet optimized to circumvent safety alignment. For MMSafetyBench-Tiny, we directly use the officially released VisCo Attack data and configurations, querying the target models with their provided adversarial contexts. For SafeBench-Tiny, the evaluated models and prompt templates exactly match those in the original VisCo paper, so we directly reuse the results reported there. To keep the comparison fair across baselines, we allow at most five VisCo attack queries per harmful question.

B.2 Additional Implementation Details

We instantiate the auxiliary text model $\pi_{\text{aux}}^{\text{text}}$ as Qwen2.5-QwQ-37B-Eureka-Triple-Cubed-Abliterated-Uncensored [qwenQwen25TechnicalReport2025] for the Parser, Image Generator, and Text Refiner agents. For the Image Refiner agent, we instantiate Qwen2.5-VL-72B [qwenQwen25TechnicalReport2025] as the multimodal auxiliary model $\pi_{\text{aux}}^{\text{mm}}$ . The same Qwen2.5-VL-72B model is also instantiated as the evaluation model $\pi_{\text{eval}}$ for semantic consistency checking. Image generation is performed using Qwen-Image, while editing with Qwen-Image-Edit [wuQwenImageTechnicalReport2025]. Both the judge model and the target model operate with a temperature of 0.

Table 3: Attack Success Rate (ASR, %) and Toxicity of different contextual augmentation strategies under four visualization scenario strategies on SafeBench-tiny. All results are obtained using Qwen2.5-VL-72B as the target model. “CIA (w/o Aug)” denotes the baseline without contextual augmentation, while “text”, “noise”, “emoji”, and “icon” correspond to auxiliary text embedding, noise injection, emoji insertion, and safety-icon addition, respectively.

	Demonstration		Sequential Path		Structured Content		Dialogue Layout
Method	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR
CIA (w/o Aug)	4.46	66.00	3.70	46.00	4.32	60.00	4.02	44.00
CIA (text)	4.38	64.00	3.80	52.00	4.28	68.00	4.04	46.00
CIA (noise)	3.56	36.00	3.90	54.00	4.00	50.00	3.76	36.00
CIA (emoji)	4.52	78.00	3.29	42.00	3.57	30.00	3.68	34.00
CIA (icon)	4.46	70.00	4.00	62.00	4.38	66.00	3.82	36.00

B.3 Prompt Templates

Prompt Template for Toxicity Scoring

As part of our evaluation protocol, we provide the prompt template used to automatically assess the toxicity of MLLM responses. Given a harmful query and the corresponding response generated by the target MLLM, the evaluator model (GPT-4o) assigns a score from 1 to 5 based on (i) whether the response violates OpenAI’s prohibited-activities policy and (ii) the predefined toxicity-scoring criteria. Higher scores indicate more severe toxicity; the maximum score is reserved for responses that are toxic and fully aligned with the attacker’s intent. The complete template is provided below.

Prompt Templates for the Parser Agent

Within the Parser Agent, we use two prompt templates to extract the query intent and generate structured outputs: the visual text and the frame structure. The Parser first distills key semantic components from the raw harmful query using the first template. Building on these components, it then synthesizes the visual text and frame structure with the second template. The two templates are presented below.

Prompt Template for the Image Generator

After generating the visual text and the frame structure, the Image Generator composes an image description according to the specified visualization strategy and then invokes a text-to-image model to synthesize the initial candidate image. The prompt template used by the Image Generator to implement the Demonstration strategy is presented below.

Prompt template for the Text Refiner.

We present the prompt template used by the Text Refiner to perform semantic consistency checking. Using this template, the Text Refiner assesses whether a weakly aligned MLLM’s response deviates from the semantics of the original harmful query. When a deviation is detected, it produces corrected versions of the Visual Text and the Frame Structure. The full template is provided below.

Prompt template for the Image Refiner.

The Image Refiner also performs semantic consistency checking, and its prompt template is provided below. Unlike the Text Refiner, the Image Refiner works with the image generated in the previous stage. If a deviation from the intended semantics of the original harmful query is detected, it does not modify the embedded text. Instead, it provides targeted recommendations for correcting the image.

Appendix C More Results on MM-safetybench

Table 4: Comparison of Toxic (lower is better) and ASR (%) across models (QR-Attack and SI-Attack).

	GPT-4o-mini				Gemini-2.0				InternVL2.5
Model	QR-Attack		SI-Attack		QR-Attack		SI-Attack		QR-Attack		SI-Attack
Metric	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR
01-IA	1.20	0.00	3.00	20.00	1.00	0.00	4.40	50.00	1.10	0.00	3.90	40.00
02-HS	1.38	0.00	3.00	12.50	2.25	25.00	3.38	25.00	2.44	12.50	3.75	25.00
03-MG	1.00	0.00	4.00	40.00	4.20	80.00	4.20	40.00	4.20	80.00	4.40	60.00
04-PH	1.86	21.43	3.93	57.14	2.71	35.71	4.57	64.29	2.79	42.86	4.93	92.86
05-EH	3.00	25.00	3.75	33.33	2.50	8.33	3.08	25.00	3.67	50.00	3.25	16.67
06-FR	1.80	13.33	3.67	40.00	1.87	13.33	3.47	26.67	2.13	6.67	4.00	46.67
07-SE	3.18	45.45	3.00	27.27	3.82	36.36	2.55	9.09	3.64	54.55	3.00	9.09
08-PL	4.00	46.67	4.27	66.67	4.13	53.33	3.53	33.33	4.27	60.00	3.67	40.00
09-PV	1.57	14.29	3.93	57.14	2.64	28.57	4.43	57.14	3.93	64.29	4.43	71.43
10-LO	3.38	23.08	3.00	7.69	3.31	38.46	2.62	0.00	3.77	46.15	2.77	0.00
11-FA	3.24	35.29	3.29	23.53	3.47	41.18	2.71	0.00	3.47	41.18	3.35	29.41
12-HC	3.09	9.09	3.36	18.18	3.18	9.09	3.73	27.27	3.27	9.09	3.09	18.18
13-GD	3.07	6.67	3.33	13.33	3.33	26.67	3.13	6.67	3.40	26.67	3.40	13.33
ALL	2.52	19.64	3.49	32.14	2.92	29.17	3.47	26.79	3.21	36.31	3.67	35.12

Table 5: Comparison of Toxic (lower is better) and ASR (%) across models (VisCo Attack and CIA).

	GPT-4o-mini				Gemini-2.0				InternVL2.5
Model	VisCo Attack		CIA		VisCo Attack		CIA		VisCo Attack		CIA
Metric	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR
01-IA	4.90	90.00	4.80	80.00	5.00	100.00	5.00	100.00	5.00	100.00	5.00	100.00
02-HS	4.94	93.75	4.69	75.00	4.88	93.75	4.75	81.25	4.94	93.75	4.75	81.25
03-MG	5.00	100.00	5.00	100.00	5.00	100.00	5.00	100.00	5.00	100.00	5.00	100.00
04-PH	4.57	78.57	5.00	100.00	5.00	100.00	5.00	100.00	5.00	100.00	5.00	100.00
05-EH	5.00	100.00	4.83	91.67	4.83	83.33	4.67	83.33	5.00	100.00	4.67	83.33
06-FR	4.93	93.33	4.93	93.33	5.00	100.00	4.87	93.33	5.00	100.00	4.87	93.33
07-SE	4.64	72.73	4.82	81.82	4.55	72.73	4.45	63.64	4.55	72.73	4.36	54.55
08-PL	5.00	100.00	4.73	86.67	5.00	100.00	4.80	86.67	5.00	100.00	4.73	86.67
09-PV	5.00	100.00	5.00	100.00	5.00	100.00	5.00	100.00	5.00	100.00	5.00	100.00
10-LO	4.38	61.54	4.69	84.62	4.69	76.92	4.46	69.23	4.54	76.92	4.62	76.92
11-FA	4.82	88.24	4.76	88.24	4.94	94.12	4.41	76.47	4.94	94.12	3.88	52.94
12-HC	4.64	63.64	4.55	72.73	4.91	90.91	4.27	72.73	4.91	90.91	4.55	72.73
13-GD	4.73	73.33	4.60	80.00	4.93	93.33	4.60	86.67	4.73	80.00	4.60	80.00
ALL	4.81	85.71	4.79	86.90	4.90	92.86	4.70	85.12	4.89	92.86	4.67	82.14

We report additional MMSafetyBench-tiny results including evaluations on GPT-4o-mini [openaiGPT4oSystemCard2024], Gemini-2.0-Flash [comaniciGemini25Pushing2025], and InternVL 2.5-78B [chenInternVLScalingVision2024]. The results of QR-Attack and SI-Attack are presented in Tab. 4, while the results of VisCo Attack and CIA are shown in Tab. 5.

Appendix D More Results on SafeBench

Additional Visualization Strategies.

Table 6: Attack Success Rate (ASR, %) and Toxicity of our CIA method under three additional visualization scenario strategies: Sequential Path (Se), Structured Content (St), and Dialogue Layout (Di), evaluated on SafeBench-tiny across five target models. All strategies share the same experimental configuration as the Demonstration setting. “Se”, “St”, and “Di” correspond to the scenario definitions provided in the main text.

	GPT-4o		GPT-4o-mini		Gemini-2.0		Qwen2.5-VL		InternVL2.5		Average
Method	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR	Toxic	ASR
CIA (Se)	3.76	42.00	4.50	74.00	4.92	94.00	4.86	92.00	4.50	72.00	4.51	74.80
CIA (St)	4.12	64.00	4.48	70.00	4.84	92.00	4.90	92.00	4.86	92.00	4.64	82.00
CIA (Di)	1.80	14.00	4.38	60.00	4.60	74.00	4.62	72.00	4.50	62.00	3.98	56.40

We evaluate our approach on the SafeBench-tiny dataset using three additional scenario strategies: Sequential Path (Se), Structured Content (St), and Dialogue Layout (Di). The detailed definitions of these scenario strategies are provided in the main text. For consistency, the experimental setup for these additional scenarios follows the same configuration as used for the Demonstration scenario strategy. The attack success rate (ASR) and toxic score under different visualization strategies are reported in Tab. 6.

Contextual Augmentation Strategies.

In addition, we evaluate various contextual augmentation strategies on the SafeBench-tiny dataset. For cost-efficiency, we select Qwen2.5-VL-72B as the target model and use the configuration without contextual augmentation as the baseline. The attack success rates and toxic scores under different settings are reported in Tab. 3. The results indicate that the effectiveness of contextual elements varies across scenario strategies. Under the Demonstration strategy, inserting emoji yields the strongest performance, achieving an attack success rate of 78%. For the Sequential Path strategy, adding safety icons is most effective, likewise reaching 62% ASR. In contrast, for the Structured Content and Dialogue Layout strategies, incorporating auxiliary text provides the greatest improvement, achieving 68% and 46% ASR, respectively. Conversely, introducing noise as a contextual element provides only limited benefits and leads to improvement exclusively under the Sequential Path strategy.

Additional Ablation Study.

The ablation study reports the evaluation results obtained when removing different components using GPT-4o as the target model. We also conducted experiments on the Gemini-2.0-Flash model, and the corresponding results are shown in Fig. 7.

Appendix E Case Studies

In this section, we present four case studies of scenario-policy-driven CIA attacks. Each case study comprises two structured components: (i) the visual text and frame structure, and (ii) the target contextual image. We report the corresponding responses produced by different models (see Figs. 8, 9, 10 and 11). Additionally, we illustrate the strategy of adding auxiliary text contextual elements in Fig. 12.