Blog · Architecture · 2026-04-25
Why every text-only prompt-injection scanner misses a 30-pixel PNG
A 900-byte image with eight rendered words on it routes around every text-only prompt-injection defender on the market. That is not a tuning failure or a missing rule — it is the intended scope of those products, and the gap will not close by improving them. This post is the architectural argument, written for engineers and AppSec leads deciding whether their current defence is enough.
TL;DR
Text-only PI scanners ship a single contract: string in, score out. A vision-language model accepts the image directly and reads instructions off pixels that an OCR pre-stage cannot recover, by design. Bolting OCR in front of the text scanner narrows the gap but does not close it — the attack class is specifically constructed to be legible to the VLM and illegible to a character classifier. The only defence with the right shape is one that inspects the bytes the model reads — pixels — and returns a region-and-score before the model is invoked. See typographic prompt-injection scanner for the category page, or jump to the fix.
1. The contract that breaks
Every public-API prompt-injection scanner in production today — Lakera Guard, LLM Guard, Azure Prompt Shields, Promptfoo's PI evals, and the indie copies that have followed since 2023 — exposes the same interface shape. You hand the scanner a string. The scanner returns a verdict and a score. The interface is stable, fast, and easy to reason about. It was the right interface in 2023, when the prompt-injection problem was overwhelmingly a chat-UX bug: a user pastes ignore previous instructions into a textarea, the model swallows it, the scanner catches it. String in, score out covered the whole attack surface that mattered.
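The shared contract can be sketched in a few lines. This is a hedged stand-in, not any vendor's actual API; the function and field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class TextVerdict:
    # "String in, score out": the whole contract in two fields.
    score: float    # 0.0 (benign) .. 1.0 (injection)
    flagged: bool

def scan_text(prompt: str, threshold: float = 0.8) -> TextVerdict:
    """Stand-in for any text-only PI scanner. A real scanner runs a
    trained classifier; this stub pattern-matches the classic payload."""
    score = 1.0 if "ignore previous instructions" in prompt.lower() else 0.0
    return TextVerdict(score=score, flagged=score >= threshold)

# The gap in one call: when the payload travels inside an image, the
# scanner is only ever handed the innocuous carrier string.
verdict = scan_text("please read the figure and follow the steps")
```

The signature is the point: nothing in it can accept image bytes, so nothing downstream of it can score them.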
The contract started to break the moment models began accepting non-text inputs as a normal part of the conversation. Vision-language models read images directly. Speech models read waveforms directly. Document models read scanned PDFs directly. None of those reach a string until after the model has already inspected them. Any scanner whose contract is string in is structurally upstream of an input it cannot see. That is not a deployment mistake — it is the inevitable consequence of the interface shape. A defender that does not see the bytes the model reads cannot defend against payloads carried in those bytes.
The narrow framing — "we have a text PI problem and a separate, optional, future image PI problem" — is the framing that lets text-only scanners ship without an asterisk. The honest framing — "the input boundary now has multiple modalities and the scanner needs to cover all of them" — is the one that makes the gap obvious. This post argues for the second framing.
2. What the model sees vs what your scanner sees
Imagine a 200×120 PNG with eight words rendered on it: a single short instruction, drawn in a slightly distorted font, pasted into a chat with a polite carrier prompt ("please read the figure and follow the steps"). The asset is under 1 KB on disk. A modern VLM — GPT-4V, Claude Vision, Gemini, Llama-vision — reads the eight words cleanly. Its optical-character pathway is trained on enough rendered-text variation to recover the instruction even at low resolution and through mild distortion. The instruction lands in the model's context as effectively as if you had typed it.
Your text PI scanner, sitting on the chat-API gateway, sees the carrier prompt. It does not see the image. The carrier is a single innocuous sentence. The scanner returns safe. The model does what the picture said. This is the FigStep recipe at its simplest, and a version of it has been published, reproduced, and extended in the academic literature since 2023 (arXiv:2311.05608). For the page that catalogues the technique and the detection signal we use against it, see FigStep detection.
The gap between what the model sees and what your scanner sees is the entire problem, expressed in one sentence. The defender's job is to close that gap. There are exactly two architectural ways to do it: bring the scanner up to the model's input modality (inspect pixels), or bring the input down to the scanner's modality (recover pixels into text via OCR). The next section is about why the second one — the obvious one — has a structural ceiling below the attack.
3. Why "just OCR it" hits a ceiling
The natural reflex when an engineer first hears the FigStep argument is the same one most teams have already tried: OK — we'll OCR every uploaded image, then run our existing text PI scanner on the OCR output. This buys you something. It catches the unsophisticated payloads. The first round of FigStep-style demos, where the attacker just typed words onto a white background in 24-point Arial, falls straight to a Tesseract pass. If you do nothing else this week, do that.
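The reflex pipeline reduces to two stages glued together. In the toy sketch below, both functions are stand-ins rather than real OCR or scanner calls, and the `adversarial` flag models a perturbed render:

```python
def toy_ocr(rendered_text: str, adversarial: bool) -> str:
    # Stand-in for a Tesseract-style pass: clean renders are recovered,
    # adversarially perturbed glyphs come back as nothing usable.
    return "" if adversarial else rendered_text

def toy_text_scan(text: str) -> float:
    # Stand-in for the existing text PI scanner (string in, score out).
    return 1.0 if "ignore previous instructions" in text.lower() else 0.0

def scan_image_via_ocr(rendered_text: str, adversarial: bool) -> float:
    # The whole defence reduces to: recover a string, score the string.
    # Anything the OCR fails to recover is invisible to the scanner.
    return toy_text_scan(toy_ocr(rendered_text, adversarial))

# Clean 24-point-Arial payload: caught.
assert scan_image_via_ocr("ignore previous instructions", False) == 1.0
# Same payload, perturbed glyphs: the OCR recovers nothing to score.
assert scan_image_via_ocr("ignore previous instructions", True) == 0.0
```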
It is not, however, a defence. The attack class is specifically structured to walk past a character classifier while staying legible to a VLM. Three families of technique do this, and each one has been demonstrated in published work since 2024:

- Per-glyph pixel perturbation: small, targeted noise added to each rendered character that flips the OCR's per-character classifier to garbage while leaving the VLM's broader visual prior able to recover the word.
- Anti-OCR fonts: glyph shapes designed to be visually unambiguous to a human or VLM but to land in the failure regions of common OCR character models.
- Unicode confusables: rendered glyphs that look like ASCII letters but encode to Cyrillic, Greek, or fullwidth variants that a downstream string match never fires on.

The umbrella reference is the AgentTypo detector page, which walks through each technique and the detection signal we use.
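The confusable family is the easiest to demonstrate in code. NFKC normalisation folds the fullwidth variants, but cross-script lookalikes survive it and need an explicit mapping; the table below is a tiny illustrative subset, not a complete confusables list:

```python
import unicodedata

# Tiny illustrative subset of a confusables table: glyphs that render
# like ASCII letters but never match an ASCII string rule.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u03bf": "o",  # Greek omicron
}

def normalise(text: str) -> str:
    # NFKC folds fullwidth compatibility forms (e.g. U+FF49 -> "i");
    # the map catches cross-script lookalikes NFKC deliberately keeps.
    folded = unicodedata.normalize("NFKC", text)
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)

# Renders as "ignore previous instructions" but encodes Cyrillic о.
payload = "ign\u043ere previ\u043eus instructi\u043ens"
assert "ignore" not in payload          # naive string match never fires
assert "ignore" in normalise(payload)   # post-normalisation it does
```

This is the cheap half of the problem; the perturbation and anti-OCR-font families have no equivalent string-level fix, which is why they force the scanner into pixel space.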
The structural reason these attacks succeed against an OCR-then-text-scan pipeline is the asymmetry between the two readers. The VLM is trained on enormous, diverse visual data and develops a robust prior over what reads as a letter. A character-level OCR is trained on a far narrower distribution and is brittle outside it. Closing the gap requires retraining the OCR on adversarial samples — at which point you are no longer running a text-only scanner with an OCR pre-stage; you are running an image-aware classifier with extra steps and worse latency. The right move is to build the image-aware classifier on purpose, with the right contract.
The numbers in the literature back this up. FigStep, AgentTypo, and successor papers have consistently reported attack success rates in the 60–90% range against the strongest public VLMs at time of writing, with OCR-recovery rates on the same payloads collapsing to near zero. The "VLM reads it; the OCR cannot" asymmetry is not a quirk of one paper — it is the central design property of the attack class.
4. The contract that holds
A scanner that has a chance of catching this attack class must accept the same input the model accepts. For an image surface, that means image bytes in, score and region out. The score lets you gate; the region lets a reviewer audit. A boolean is not enough — when an attack does fire, the difference between "the whole asset is poisoned" and "there is a 40×40 patch in the corner" is exactly the difference between blocking a legitimate user and blocking the attack alone. This is the contract Glyphward's typographic-PI scanner ships with, and the contract every credible image-PI defender has converged on independently.
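In code, the return shape of that contract is small: a score plus zero or more regions. The names below are illustrative, not Glyphward's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Region:
    # Pixel-space bounding box of the suspect patch, for reviewer audit.
    x: int
    y: int
    width: int
    height: int

@dataclass
class ImageVerdict:
    # "Image bytes in, score and region out": the score gates,
    # the regions let a human audit what fired.
    score: float                       # 0.0 (benign) .. 1.0 (injection)
    regions: list[Region] = field(default_factory=list)

def scan_image(image_bytes: bytes) -> ImageVerdict:
    """Stub standing in for the pixel-level scanner; a real one runs
    the signal ensemble over the decoded image."""
    # A 40x40 patch in the corner, not "the whole asset is poisoned".
    return ImageVerdict(score=0.97, regions=[Region(160, 0, 40, 40)])
```

The region list is what separates "block this user's upload" from "block this user": with it, a reviewer can crop the flagged patch and decide in seconds whether the verdict was right.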
Inside the contract, the implementation that has worked best in our corpus is a small ensemble of cheap signals run in parallel rather than one heavy classifier:

- A text-in-image likelihood head over the visual embedding catches the basic FigStep family.
- A nearest-neighbour lookup against a curated payload corpus catches re-uses of known attacks at near-zero cost.
- An OCR pass with Unicode confusable normalisation cheaply catches the long tail of unsophisticated payloads.
- A perturbation-signature head catches the adversarial-glyph and anti-OCR-font families that a clean OCR would miss.

None of those signals alone covers the attack space. Together they get to a defensible posture for an SMB-tier product. The architecture is documented in more detail on the screenshot-agent scanner page, which walks through the same four-signal ensemble for the screenshot-reading-agent ICP.
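A minimal sketch of the ensemble's combination step, assuming each signal maps image bytes to a score in [0, 1]. The signal functions are stubs, and max-combination is one reasonable rule rather than necessarily the one Glyphward ships:

```python
# Stubs standing in for the four cheap signals; real versions run
# over the decoded pixels and the visual embedding.
def text_in_image_head(img: bytes) -> float:        # basic FigStep family
    return 0.2
def payload_nearest_neighbour(img: bytes) -> float:  # known-attack re-use
    return 0.1
def ocr_with_confusable_norm(img: bytes) -> float:   # unsophisticated tail
    return 0.0
def perturbation_signature(img: bytes) -> float:     # anti-OCR families
    return 0.9

SIGNALS = [text_in_image_head, payload_nearest_neighbour,
           ocr_with_confusable_norm, perturbation_signature]

def ensemble_score(img: bytes) -> float:
    # Max-combination: any one signal firing is enough to gate, which
    # matches the "none alone covers the space" design. The signals are
    # independent, so they can run concurrently in a real pipeline.
    return max(sig(img) for sig in SIGNALS)
```

Max-combination trades a higher false-positive floor for the guarantee that no single blind spot silences the ensemble; a weighted sum is the usual alternative when the signals' error rates are well calibrated.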
Two operational notes that matter as much as the model architecture. First: the scanner has to run before the VLM call, not after. Once the instruction reaches the model, the model has already read it; a downstream output filter is a cleanup operation, not a defence. Second: thresholds must be source-aware. Images your own backend rendered (charts, status cards, generated UI) clear at a permissive threshold; images uploaded by users or fetched from third-party URLs clear at a strict one. The scanner returns a raw score; the integration decides the policy per source. This is the highest-leverage configuration knob in the stack. The 2026 multimodal threat model covers the full defender's playbook end-to-end.
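Source-aware gating is a few lines at the integration layer. The threshold values below are placeholders to be tuned per deployment, not recommended defaults:

```python
# Placeholder thresholds: trusted self-rendered assets clear at a
# permissive bar, untrusted uploads and fetches at a strict one.
THRESHOLDS = {
    "self_rendered": 0.95,   # charts, status cards, generated UI
    "user_upload":   0.50,
    "third_party":   0.35,   # fetched from external URLs
}

def gate(score: float, source: str) -> bool:
    """Return True to block. The scanner returns the raw score;
    this policy layer decides per source."""
    return score >= THRESHOLDS[source]

# Same raw score, different outcome depending on provenance.
assert gate(0.6, "user_upload") is True
assert gate(0.6, "self_rendered") is False
```

Keeping the policy outside the scanner is deliberate: thresholds change per tenant and per surface, while the scoring model changes on a release cadence.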
5. What this means for buying decisions
If you already pay for a text PI scanner, do not rip it out. The text contract is still load-bearing — your gateway sees text on every request, and the existing scanner is the right tool for that surface. The honest Glyphward vs Lakera Guard, vs Azure Prompt Shields, and vs LLM Guard pages each lay out the run-both pattern explicitly. Image-PI defence is additive to text-PI defence, not a replacement for it.
What you should not do is pretend the gap is closed because you have some PI scanner in production. The relevant question for a CTO or AppSec lead is the one most teams skip in the first audit: does our defence cover every modality our model accepts? If the answer is no, the gap is the size of every untrusted image and audio clip you ingest, and the published attack literature has been telling you what walks through that gap since 2023. The Sept–Nov 2025 Check Point acquisition of Lakera, which is shifting that vendor up-market, has compounded the squeeze on the SMB self-serve tier. Indie buyers now have fewer options for a text scanner and zero options for a multimodal one priced under $100/mo. We built Glyphward into that gap on purpose. The Lakera alternative (multimodal) page is the long-form version of the buyer-facing argument.
If you want to see the contract in action — image bytes in, score and region out — the embed widget preview mounts three live versions of the scanner on a single page; copy the snippet and you have the contract running on your own domain in under a minute. The free tier covers 10 scans/day with no card, which is enough to wire the integration and test it against a corpus of public FigStep samples before you commit.
FAQ
Can I just bolt OCR in front of my existing text PI scanner?
You can, and it catches a useful fraction of unsophisticated payloads. But the attack class is specifically designed to keep recognisable-to-the-VLM text away from a character classifier. Adversarial glyphs, anti-OCR fonts, Unicode confusables, and per-pixel perturbation all compress OCR recall to near zero while leaving the VLM's reading intact. OCR-then-text-scan is a useful first speed bump, not a defence.
Are large vision-language models robust enough that this stops mattering?
No. The published numbers from FigStep, AgentTypo, and follow-on adversarial-typography work consistently report attack success rates in the 60–90% range against the strongest public VLMs at the time of publication. Each VLM generation has narrowed the gap on some attacks while widening it on others — it is a moving target, not a converging one. Defence-in-depth at the input boundary stays load-bearing.
What is the smallest payload that can carry an instruction?
Attack literature has demonstrated working payloads at around 30×30 pixels of rendered glyph, well under 1 KB on disk. The model's optical reading does not need a high-resolution image to recover legible characters. Your scanner needs to be cheap enough at that resolution to be worth running on every untrusted image, not only on suspect ones.
Does this affect us if we never display images to a model — only embed them?
Embedding-only pipelines reduce the attack surface but do not eliminate it. Visual embeddings carry the instruction signal, and downstream similarity-search or classifier heads can act on it. Embedding-time scanning is the right answer for these architectures and is exactly the pattern Glyphward's pixel-level signals are designed for.
How does this compare to the audio side of the problem?
The structure is identical. Audio prompt injection (WhisperInject and successors) puts the payload in waveform features that survive the speech model but are mangled or dropped by transcription. The text-only-scanner gap on audio is the exact analogue of the gap on images. The defender's playbook composes across modalities cleanly — see audio prompt-injection detection.
Further reading
- FigStep detection — the canonical first attack in this class, with the detection signal we use against it.
- AgentTypo detector — the adversarial-glyph evolution of FigStep and the four-signal ensemble that catches it.
- Typographic prompt-injection scanner — category overview covering FigStep, AgentTypo, Unicode confusables, and screenshot-as-payload.
- Indirect prompt injection via images — how indirect-PI history (Greshake 2023 onwards) lands on image inputs.
- Audio prompt-injection detection — the audio analogue of the same architectural argument.
- The multimodal PI threat model (2026) — the full defender's playbook this post is the foundation for.
- Glyphward vs Lakera Guard — the honest comparison, including the run-both pattern.
- Multimodal LLM security API — the category-level overview of what the right contract looks like.