
Multimodal jailbreak detection

Engineers often conflate two related but technically distinct attack classes: image prompt injection and image jailbreaking. Both exploit the visual modality of a multimodal LLM, and both are invisible to text-only content moderation. But they work through different mechanisms, serve different attacker goals, and require different detection signals. Understanding the distinction, and deploying a scanner that covers both, is the baseline for any production multimodal AI deployment. This page explains the taxonomy, gives concrete examples of each class, and describes how Glyphward's scanner detects both at the image input level before either attack reaches the model.

TL;DR

Image prompt injection uses visible or near-visible adversarial text to redirect the LLM's instructions. Image jailbreaking uses adversarial pixel perturbations or carefully constructed visual content to bypass the LLM's safety training. Glyphward's scanner detects both — typographic instruction patterns and adversarial pixel signatures — in a single /v1/scan call. Free tier — 10 scans/day, no card required.

Image prompt injection vs image jailbreak — the taxonomy

Image prompt injection

Prompt injection uses the image as a delivery channel for instruction text that the vision model reads and executes as if it were a legitimate system or user instruction. The attacker's goal is to redirect the model's behaviour from its intended task to an attacker-specified task. The injection payload is usually human-readable text embedded in the image — rendered at low contrast, small font size, in blank regions of a document, or overlaid on a background that matches the text colour.

Examples: FigStep (typographic text instructions in images), indirect prompt injection (PI) via web page screenshots, document injection in PDF-embedded images, whiteboard PI in screenshot-reading agents. The attacker does not need to understand the model's internal representations. They only need to know that the model can read text in images, which is a published capability of every major vision LLM.

Detection approach: OCR-based extraction of text from image pixels, followed by instruction-pattern classifiers (imperative verb + target action patterns that match known PI payload structures). Glyphward's scanner uses a CLIP image encoder to score text-in-image salience, Tesseract OCR with a small YOLO text-region head to extract candidate instruction strings, and a fine-tuned classifier trained on the Glyphward corpus of known-malicious PI payloads.
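The instruction-pattern step can be illustrated with a minimal sketch that runs over text already extracted by OCR. It is not Glyphward's classifier, which the text above describes as a fine-tuned model. The pattern list and the function name `instruction_pattern_hits` are hypothetical, chosen only to show the "imperative verb + target action" idea.

```python
import re

# Hypothetical patterns for illustration only. A production classifier would be
# a trained model, not a handful of regular expressions.
OVERRIDE_PATTERNS = [
    r"\b(ignore|disregard|forget|override)\b.{0,40}"
    r"\b(previous|prior|above|earlier|system)\b.{0,40}"
    r"\b(instructions?|prompts?|rules?|messages?)\b",
    r"\byou are now\b",
    r"\b(reveal|print|repeat|output|send|forward)\b.{0,40}"
    r"\b(system prompt|api key|password|credentials?|secrets?)\b",
]
_COMPILED = [re.compile(p, re.IGNORECASE | re.DOTALL) for p in OVERRIDE_PATTERNS]


def instruction_pattern_hits(ocr_text: str) -> list[str]:
    """Return substrings of OCR text that look like injected instructions."""
    # Collapse whitespace: OCR output often breaks one sentence across lines.
    text = " ".join(ocr_text.split())
    hits = []
    for pattern in _COMPILED:
        match = pattern.search(text)
        if match:
            hits.append(match.group(0))
    return hits
```

A rule list like this is cheap to run but easy to evade with rephrasing, which is why a learned classifier trained on real payloads is the more robust choice for this stage.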

Image jailbreaking

Jailbreaking uses adversarial pixel perturbations — typically imperceptible to human viewers — to alter the model's internal activations in a way that makes the model ignore its safety training and comply with harmful requests. The attacker does not need to include readable text in the image; instead, they craft a pixel pattern (often via iterative gradient-based optimisation against a surrogate model) that, when processed by the target model's vision encoder, produces activations that suppress the model's RLHF-learned refusal behaviour.

Universal adversarial perturbations (UAP). A UAP is a single additive noise pattern that, when overlaid on any image, causes the target model to comply with a harmful text prompt submitted alongside the image. The perturbation is optimised to transfer across diverse image content, so it works regardless of the legitimate image it is overlaid on. Published UAP attacks targeting GPT-4V, Claude Vision, and Gemini Vision have demonstrated 50–80% attack success rates in white-box settings (against open-weight surrogate models, since the weights of these hosted models are not public) and 20–40% in black-box transfer settings.

Visual adversarial suffixes. Analogous to text adversarial suffixes (GCG attack), visual adversarial suffixes are small image patches appended to the corner of a submitted image that induce jailbreak behaviour. The patch is optimised to maximally suppress safety behaviour while minimising perceptibility. Unlike UAPs, visual adversarial suffixes are typically paired with a specific harmful text request and optimised jointly.

Steganographic jailbreaks. Some jailbreak techniques use steganography — hiding data in the least-significant bits of image pixels — to encode harmful instructions or model activation patterns that bypass safety training. These are harder to craft but also harder to detect, as the image appears completely normal to any visual inspection.

Detection approach: Pixel-domain statistical analysis (perturbation smoothness, high-frequency coefficient variance, LSB entropy), CLIP embedding outlier detection (comparing the image's CLIP embedding to a reference distribution of natural images, flagging embeddings that fall far outside the natural-image manifold in specific directions associated with adversarial perturbations), and known-pattern matching against a curated corpus of published UAP and adversarial suffix patterns. Glyphward's waveform anomaly classifier applies similar analysis to audio inputs for completeness.

Concrete attack examples

FigStep (prompt injection class)

FigStep is a published typographic attack (Gong et al., 2023) in which harmful instructions are rendered as text in an image — typically styled with a "step-by-step" list format that bypasses text-level content filters. The model reads the image text as if it were a user instruction and follows the harmful steps. Detection: OCR extracts the instruction text; the classifier identifies the imperative + harmful-action pattern. See FigStep detection for details.

Universal Adversarial Perturbations (jailbreak class)

A UAP aimed at GPT-4V, crafted white-box against a surrogate vision encoder, adds pixel noise that is visually indistinguishable from JPEG compression artefacts but reliably induces the model to comply with a harmful text prompt when that prompt is submitted alongside any image containing the perturbation. Detection: Glyphward's pixel-domain analysis computes the high-frequency coefficient variance of the submitted image and compares it to the expected variance distribution for natural images at the same JPEG quality level. Adversarial perturbations typically produce anomalous variance signatures that this quality-matched comparison can detect.
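The variance comparison can be sketched in a few lines. This is a simplified stand-in, not Glyphward's analyser: it uses the variance of a 4-neighbour Laplacian residual as a proxy for high-frequency energy instead of DCT coefficients, and the reference mean and standard deviation passed to `variance_z_score` are assumed to come from clean images at a matching JPEG quality. Both function names are hypothetical.

```python
from statistics import pvariance


def high_frequency_variance(gray: list[list[int]]) -> float:
    """Variance of a 4-neighbour Laplacian residual over interior pixels.

    `gray` is a grayscale image as rows of intensities (at least 3x3).
    Smooth regions give residuals near zero; pixel-level noise gives large ones.
    """
    height, width = len(gray), len(gray[0])
    residuals = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            neighbours = (
                gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1] + gray[y][x + 1]
            )
            residuals.append(4 * gray[y][x] - neighbours)
    return float(pvariance(residuals))


def variance_z_score(value: float, ref_mean: float, ref_std: float) -> float:
    """How many reference standard deviations `value` lies from the clean mean."""
    return (value - ref_mean) / ref_std
```

A large positive z-score means the image carries more high-frequency energy than clean images of comparable quality, which is the kind of anomaly the paragraph above describes. Heavily compressed or noisy camera images can trigger the same signal, so it is best treated as one input among several.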

AgentTypo (prompt injection class)

AgentTypo substitutes look-alike Unicode characters (homoglyphs) or typo-style character swaps for characters in a legitimate instruction, causing the text safety filter to see benign text while the vision model (reading the rendered glyphs) sees the harmful instruction. Detection: OCR on the rendered image, followed by glyph normalisation; the classifier operates on the normalised text.
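Glyph normalisation can be sketched with the standard library alone. The confusables table below is a tiny hypothetical sample (a full implementation would draw on the Unicode confusables data), and `normalise_glyphs` is an illustrative name, not a Glyphward function.

```python
import unicodedata

# Tiny illustrative look-alike table (Cyrillic and Greek letters mapped to Latin).
# An assumption for this sketch: real coverage needs the full Unicode confusables list.
CONFUSABLES = str.maketrans({
    "\u0430": "a",  # Cyrillic small a
    "\u0435": "e",  # Cyrillic small ie
    "\u043e": "o",  # Cyrillic small o
    "\u0440": "p",  # Cyrillic small er
    "\u0441": "c",  # Cyrillic small es
    "\u0456": "i",  # Cyrillic small Byelorussian-Ukrainian i
    "\u0455": "s",  # Cyrillic small dze
    "\u03bf": "o",  # Greek small omicron
})


def normalise_glyphs(text: str) -> str:
    """Fold compatibility forms and known look-alikes to plain lowercase Latin."""
    # NFKC maps fullwidth letters, ligatures and similar forms to base characters.
    folded = unicodedata.normalize("NFKC", text)
    return folded.translate(CONFUSABLES).casefold()
```

Running the instruction classifier on the normalised string means a payload written with substituted characters is matched the same way as its plain-Latin equivalent.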

Bad Pixels / Visual Adversarial Suffix (jailbreak class)

This attack appends a crafted 50×50 pixel patch to the bottom-right corner of any submitted image. The patch is optimised via projected gradient descent, targeting the Claude 3 Opus vision encoder through a surrogate model, to maximally suppress refusal behaviour for a specific category of harmful requests. It is visually similar to a watermark or a digital artefact. Detection: Glyphward's CLIP embedding outlier detector flags images whose embeddings fall outside the natural-image manifold in the jailbreak-associated subspace, even when the image content appears entirely normal.
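Embedding outlier detection can be illustrated with a deliberately simple distance measure. The sketch below assumes image embeddings are already available as lists of floats (for example from a CLIP encoder) and scores a new embedding by its diagonal Mahalanobis distance from a reference set of clean images. It does not model the jailbreak-associated subspace mentioned above, and the names `fit_reference` and `outlier_distance` are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def fit_reference(embeddings: list[list[float]]) -> tuple[list[float], list[float]]:
    """Per-dimension mean and standard deviation of clean reference embeddings."""
    columns = list(zip(*embeddings))
    means = [mean(column) for column in columns]
    # Guard against zero spread so the later division is always defined.
    stds = [pstdev(column) or 1e-9 for column in columns]
    return means, stds


def outlier_distance(
    embedding: list[float], means: list[float], stds: list[float]
) -> float:
    """Root-mean-square z-score of an embedding against the reference statistics."""
    squared = sum(((v - m) / s) ** 2 for v, m, s in zip(embedding, means, stds))
    return sqrt(squared / len(embedding))
```

An embedding close to the reference distribution scores near or below 1, while one far outside it scores much higher. Where to place the cut-off is an empirical choice that depends on the reference set.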

Why text-only scanners miss both attack classes

Text-only prompt injection scanners (Lakera Guard, LLM Guard text API, Azure Prompt Shields text endpoint) operate on the text content of the prompt: the system message and the user message string. Neither attack class puts a reliably detectable malicious payload in the text prompt:

Image PI attacks embed the instruction in the image pixels. The text prompt submitted alongside the image is typically benign (e.g., "Please analyse this document").
Image jailbreak attacks pair a text request with an adversarially perturbed image. The text alone is not a reliable signal, because it is often a borderline request that only becomes harmful in combination with the jailbreak image.

Both attacks exploit the same blind spot: the scanner checks the text, not the image. Glyphward's scanner checks the image. See Why text-only scanners miss image prompt injection for the full architectural analysis.

Integration — single scan covers both threat classes

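import base64
import os

import httpx

# Assumes the API key is provided in the GLYPHWARD_API_KEY environment variable.
GLYPHWARD_API_KEY = os.environ["GLYPHWARD_API_KEY"]


async def scan_for_multimodal_threats(image_bytes: bytes) -> dict:
    # The context manager closes the client's connection pool when the call finishes.
    async with httpx.AsyncClient(timeout=10.0) as client:
        resp = await client.post(
            "https://glyphward.com/v1/scan",
            headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
            json={"image": base64.b64encode(image_bytes).decode()},
        )
    resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
    result = resp.json()
    # result["score"]:   combined PI + jailbreak risk score (0–100)
    # result["signals"]: list of triggered signal names (Pro/Team tier)
    # result["scan_id"]: audit reference
    return result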

The combined score reflects both injection-signal and jailbreak-signal detectors. A high score on an image that contains no human-readable text is a strong indicator of a jailbreak-class attack (adversarial perturbation or visual suffix). A high score on an image containing instruction-like text is a strong indicator of a PI-class attack. On the Pro and Team tiers, the signals field names which detectors fired, allowing downstream logic to distinguish the attack class if needed.
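Downstream routing on the scan result might look like the sketch below. The signal names in `INJECTION_SIGNALS` and `JAILBREAK_SIGNALS` are invented for illustration, since the actual values returned in the signals field are not listed here, and treating a score at or above the threshold as a block is also an assumption.

```python
# Hypothetical signal names: check the real values returned by the API.
INJECTION_SIGNALS = {"ocr_instruction_pattern", "typographic_text"}
JAILBREAK_SIGNALS = {"pixel_hf_variance", "lsb_entropy", "clip_outlier"}


def classify_scan(result: dict, threshold: int = 70) -> str:
    """Map a scan result to an allow/block decision with an attack-class label."""
    if result["score"] < threshold:
        return "allow"
    # The signals field may be absent on tiers without signal breakdown.
    signals = set(result.get("signals") or [])
    injection = bool(signals & INJECTION_SIGNALS)
    jailbreak = bool(signals & JAILBREAK_SIGNALS)
    if injection and jailbreak:
        return "block:mixed"
    if injection:
        return "block:prompt_injection"
    if jailbreak:
        return "block:jailbreak"
    return "block:unclassified"
```

Keeping the class label separate from the block decision lets logging and alerting distinguish the two attack types without changing what reaches the model.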


Related questions

Does Glyphward detect novel jailbreak techniques not in the training corpus?

The CLIP-based embedding outlier detector does not rely on exact matching to known attacks. It flags images whose embeddings fall outside the natural-image manifold in directions associated with adversarial perturbations generally, including novel variants that were not in the training corpus. The OCR + instruction-classifier stack is more pattern-dependent and may miss novel PI formulations. For maximum coverage on high-risk applications, block at a score threshold of 60 rather than 70, and subscribe to Glyphward's Pro webhook alerts for new attack vectors added to the corpus.

What is the false-positive rate on legitimate photos?

Benchmark false-positive rates on clean natural-image datasets (ImageNet validation, MS-COCO val) are below 2% at threshold 70 and below 5% at threshold 60. The primary false-positive source is high-frequency artefacts in heavily compressed JPEG images (quality below 50) that occasionally trigger the pixel-domain jailbreak signal. If your application processes low-quality compressed images frequently, use threshold 65–70 and enable Pro-tier signal breakdown to distinguish compression artefacts from genuine adversarial signals.
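Before settling on a threshold, it can help to measure the false-positive rate on your own clean images. The helper below is a minimal sketch: `clean_scores` is assumed to be a list of scores returned for images you know to be benign, and an image is assumed to count as flagged when its score is at or above the threshold.

```python
def false_positive_rate(clean_scores: list[float], threshold: float) -> float:
    """Fraction of known-benign images that would be flagged at `threshold`."""
    flagged = sum(1 for score in clean_scores if score >= threshold)
    return flagged / len(clean_scores)


# Example with made-up scores for ten benign images.
sample_scores = [5, 12, 30, 61, 64, 72, 8, 15, 22, 40]
rate_at_70 = false_positive_rate(sample_scores, 70)  # 1 of 10 flagged
rate_at_60 = false_positive_rate(sample_scores, 60)  # 3 of 10 flagged
```

Comparing the two rates on a representative sample shows what lowering the threshold costs for your image mix, which matters most when many inputs are heavily compressed.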

How do multimodal jailbreaks differ from text adversarial suffixes (GCG)?

GCG (Greedy Coordinate Gradient, Zou et al., 2023) generates adversarial text suffixes — strings of tokens that, appended to a harmful request, cause the model to comply. These are text-domain attacks and detectable by text-layer scanners. Visual adversarial suffixes are the image-domain analogue: pixel patches appended to an image that achieve the same jailbreak effect through the vision encoder. They are not detectable by text-layer scanners because the harmful content is in the pixels, not the tokens. Glyphward detects the pixel-domain variant; text-layer scanners should handle the text-domain GCG variant.

Is steganography-based injection detectable?

LSB (least-significant-bit) steganography — hiding data in the low bits of image colour channels — is detectable through LSB entropy analysis: natural images have characteristic LSB entropy distributions; images with hidden data have anomalous distributions. Glyphward's pixel-domain analyser includes LSB entropy as one of its jailbreak signal inputs. Detection rate on naive LSB steganography is high; detection rate on adaptive steganography (using statistical matching to disguise the entropy) is lower. High-security applications should treat any anomalous LSB entropy as a flag worth investigating, even if the overall score is below threshold.
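LSB entropy itself is straightforward to compute. The sketch below works on a flat list of 8-bit channel values and measures Shannon entropy of the lowest bit, both for the whole list and per block. It illustrates the measurement only. How Glyphward turns these values into a signal is not described here, and the function names are hypothetical.

```python
from math import log2


def lsb_entropy(pixels: list[int]) -> float:
    """Shannon entropy, in bits, of the least-significant bit across `pixels`."""
    total = len(pixels)
    ones = sum(p & 1 for p in pixels)
    entropy = 0.0
    for count in (ones, total - ones):
        if count:  # skip empty classes to avoid log2(0)
            prob = count / total
            entropy -= prob * log2(prob)
    return entropy


def block_lsb_entropies(pixels: list[int], block_size: int) -> list[float]:
    """LSB entropy of each consecutive block, to expose local anomalies."""
    return [
        lsb_entropy(pixels[start:start + block_size])
        for start in range(0, len(pixels), block_size)
    ]
```

Flat or synthetic regions tend to have low LSB entropy, so a region whose low bits look uniformly random where its neighbours do not is one plausible indicator of an embedded payload. Per-block values make that kind of local comparison possible.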