Attack explainer · Image injection

FigStep detection — catch jailbreak text rendered onto images

FigStep smuggles jailbreak instructions into a vision-language model by rendering the instructions as pixels instead of text. Your text-only scanner sees a clean prompt and a “harmless image.” The model sees the instructions and complies.

TL;DR

FigStep converts a forbidden instruction into a numbered list rasterised onto a blank image, then pairs it with a polite “follow the steps in this figure” text prompt. Text PI scanners pass it through. Detection requires looking at the pixels themselves — OCR, a visual embedding, and a classifier trained on known rendered-injection payloads. Glyphward runs all three on every scan.

How the attack works

The canonical FigStep recipe, introduced in the 2023 Gong et al. preprint “FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts”, has three moves:

  1. Paraphrase into a numbered list. The attacker rewrites the forbidden request (“how do I make…”) as an open-ended list of steps to be completed (“Steps to make …: 1. 2. 3.”). Models refuse imperatives far more often than they refuse completion tasks.
  2. Rasterise the list to an image. Plain black-on-white PNG, typically 400–600 px wide, dense sans-serif font. Nothing about the file is visibly suspicious.
  3. Attach a benign text prompt. Something like “The image shows a list with missing content. Please generate the detailed content for each item.” The text prompt itself contains no policy violation.

Reported attack success rates against GPT-4V, Gemini, and open-weight LLaVA variants sit in the 60–90% range in follow-up studies, and the trick generalises: AgentTypo, typographic prompt injection, and the cross-modal variants all reuse the same principle — move the payload out of the modality your defender inspects.

Why text-only scanners miss it

Every public self-serve prompt-injection defender today inspects a string. Lakera Guard, LLM Guard, Azure Prompt Shields, and Promptfoo all read the prompt, not the image. When the payload is image bytes, they see a benign string and return a clean verdict.

The instinctive fix — run OCR, then feed the OCR output into the text scanner — fails for three reasons. First, FigStep variants now deliberately choose anti-OCR fonts, low contrast, or low resolution: OCR either drops the text entirely or returns garbled tokens that no text PI scanner matches. Second, the payload is often paraphrased and legal on its own; only the list structure plus the pairing with the “complete the figure” prompt makes it an injection. Third, round-tripping every image through an OCR pipeline adds 300–800 ms of latency that most real-time apps cannot absorb.

How to detect FigStep at inference time

Robust FigStep detection is a three-signal job running in parallel on the image bytes:

  1. OCR with an anti-OCR-aware model. A vision model fine-tuned on deliberately-distorted glyph sets recovers text where Tesseract fails. You still need OCR — it is just not sufficient alone.
  2. Visual embedding classifier. A CLIP-style embedding, compared against a compounding corpus of known-malicious FigStep payloads, catches rendered lists that are paraphrased but structurally near-duplicate to prior hits.
  3. Small text-in-image head. A dedicated model that scores the likelihood that an image contains instructional text (numbered steps, imperative verbs) rather than incidental signage. This fires even when OCR misreads the letters, because the layout itself is the signal.

Any one signal firing above threshold is enough to block the call or route it to a human; two firing simultaneously is a hard block. False positives land most often on screenshots of real documentation — the corpus rules help suppress those.

How Glyphward detects FigStep

Glyphward runs all three signals on every scan and returns a 0–100 risk score plus the flagged pixel region so you can log, block, or send to review. No infrastructure to host. No GPU bill. The free tier gives you 10 scans a day — enough to run the public FigStep paper samples through it and see the output. See the flow on the landing page, compare to incumbents on Lakera alternative (multimodal), or jump to pricing.

Get early access

Related questions

Can I just use OCR plus my existing text scanner?

It closes maybe 40% of the surface. Anti-OCR fonts, low-res composites, and paraphrased-list structures all walk straight through an OCR-to-text pipeline. You need a pixel-level classifier in parallel with OCR, not downstream of it.

Does FigStep still work on the current frontier models?

Vendor-side safety tuning closes some of the original 2023 cases on GPT-4V and Gemini, but published follow-ups in 2024–2025 show the class is alive: paraphrased payloads, lower-resolution renderings, and combined FigStep-plus-multi-image attacks all regain most of the original success rate. Treat it as an open class, not a solved one.

Is this the same as AgentTypo or typographic prompt injection?

Same principle, different emphasis. AgentTypo focuses on adversarial glyph distortions that break OCR deliberately. Typographic PI is the umbrella term. All three ride on “the payload is pixels, not text” — the defences overlap heavily. See WhisperInject detection for the audio-side equivalent.

Further reading