Blog · Threat model · 2026-04-25
The multimodal prompt-injection threat model for AI product teams (2026)
Every public-API prompt-injection defender ships with the same blind spot: it inspects text and ignores the two modalities where real-world payloads now hide. If your product accepts images or audio from anyone other than you, this is your threat model — what the attacks look like, why your current stack misses them, and a defender's playbook you can run this week.
TL;DR
Prompt injection is no longer a text-only problem. Payloads now travel as rendered instructions in images (FigStep, AgentTypo) and as out-of-band carriers or adversarial waveforms in audio (WhisperInject). Text-only scanners are blind to them by design. The fix is a scanner that inspects pixels and waveforms directly, gated in front of the vision or speech model — not after it — with source-aware thresholds and a region-and-score output an incident reviewer can audit.
1. A short primer on prompt injection
Prompt injection is the class of attacks where an adversary smuggles instructions into a model's context through data the model was supposed to treat as inert. The model cannot reliably tell the difference between "instruction from my operator" and "instruction embedded in the user's document, image, or audio", so it does what the embedded instruction says — exfiltrate, bypass a rule, call a tool, produce restricted output. The seminal paper is Greshake et al. 2023, Not what you've signed up for (arXiv:2302.12173), which formalised indirect prompt injection: the payload does not come from the chat user — it rides in on a third-party artifact the model ingests. That generalisation is why the problem stopped being a chat-UX quirk and became a systems-security problem. For a short reference page see indirect prompt injection via images.
2. What changes in a multimodal stack
Once your model accepts anything other than text — an image, a screenshot, a voice clip, a PDF render, a waveform recorded in a call — the payload surface widens in two directions at once. The first is obvious: attackers can now hide instructions inside pixels or audio, and your existing text prompt-injection scanner sees none of it. The second is more subtle: the natural defensive reflex ("OCR the image, transcribe the audio, then run my existing text scanner") has a ceiling below the attack. Modern VLMs and speech models read through distortions that shred naive OCR and STT. The attacker's budget for staying legible to the model but illegible to your recogniser is real and measurable, and every credible adversarial-typography or adversarial-audio paper since 2023 has demonstrated it.
The central thesis of this threat model: detection has to live at the byte layer (pixels and waveforms), before the model sees the input, not after recovery into text. Recovery-then-scan is a speed bump, not a barrier. For the long-form defence argument see the typographic prompt-injection scanner overview.
3. Attack families every product team should know
3.1 FigStep — instructions rendered onto an image
FigStep (arXiv:2311.05608) was the first widely-cited jailbreak to demonstrate that rendering a prohibited instruction as readable text onto an image, paired with a polite text prompt ("please follow the steps in the figure"), bypasses safety alignment in GPT-4V, Gemini, and LLaVA variants. The VLM's optical-character reading is strong enough to recover the instruction; its safety policy, trained largely on text pairs, does not consistently reject the rendered-as-image variant. Detection signal: typographic layout inconsistent with a natural photo or a legitimate chart, paired with instruction-shaped language. See the full write-up at FigStep detection.
3.2 AgentTypo — adversarial glyphs that beat OCR
AgentTypo and the adversarial-typography line that followed in 2024–2025 specifically target the OCR-first defender pattern. Techniques include per-glyph pixel perturbation (the VLM still reads the character; Tesseract returns gibberish), Unicode confusable substitution before rendering (the rendered glyph looks like "a" but encodes to a Cyrillic or Greek lookalike), and anti-OCR fonts designed to be visually unambiguous to a human or VLM but pathological to a character classifier. An OCR→text-scanner pipeline has a structural ceiling below these attacks — the text scanner never sees recoverable text. See AgentTypo detector.
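The confusable-substitution signal can be illustrated with a short mixed-script check: a single token that draws letters from two scripts renders innocuously to a human but is a strong anomaly before any OCR runs. This is a sketch of the idea, not Glyphward's implementation; production normalisation uses the full Unicode confusable tables (UTS #39):

```python
import unicodedata

def scripts_in_word(word: str) -> set[str]:
    """Approximate each alphabetic character's script via the first
    word of its Unicode name, e.g. "LATIN SMALL LETTER A"."""
    scripts = set()
    for ch in word:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0])
    return scripts

def has_mixed_script(text: str) -> bool:
    """Flag any whitespace-delimited token that mixes scripts."""
    return any(len(scripts_in_word(w)) > 1 for w in text.split())

# "p\u0430ypal" hides a Cyrillic "a" among Latin letters.
assert has_mixed_script("p\u0430ypal")
assert not has_mixed_script("paypal")
```

A real pipeline would run this after OCR and also on any text layer extracted from the file itself, since the substitution happens before rendering.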
3.3 WhisperInject — instructions embedded in audio
WhisperInject and subsequent audio-PI work (see WhisperInject detection) move the payload into the waveform. Variants include out-of-band carriers (inaudible or near-inaudible frequency bands that the STT front-end still picks up), silence steganography (instructions embedded in what sounds like a pause), adversarial waveform perturbation (small additive noise that flips the transcription to an attacker-controlled string), and multi-speaker overlay (a low-gain second voice issuing instructions under the primary speaker). Whisper-family models are the best-known victim, but the problem is structural to any lossy STT. For the broader category page see audio prompt-injection detection.
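To get a feel for what a waveform-anomaly signal looks like, here is a deliberately toy proxy: estimate a clip's dominant frequency from its zero-crossing rate and flag energy above the speech band. Real detectors do spectral analysis over frames; the 16 kHz sample rate and 4 kHz speech ceiling are assumptions for the sketch, not tuned values:

```python
import math

def zero_crossing_rate(samples: list[float]) -> float:
    """Fraction of adjacent sample pairs that change sign."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / max(len(samples) - 1, 1)

def dominant_freq_estimate(samples: list[float], sample_rate: int) -> float:
    """For a roughly tonal signal, ZCR * rate / 2 approximates the
    dominant frequency (two crossings per cycle)."""
    return zero_crossing_rate(samples) * sample_rate / 2

def looks_out_of_band(samples, sample_rate=16_000, speech_ceiling_hz=4_000):
    return dominant_freq_estimate(samples, sample_rate) > speech_ceiling_hz

rate = 16_000
speech = [math.sin(2 * math.pi * 300 * t / rate) for t in range(rate)]
carrier = [math.sin(2 * math.pi * 7_000 * t / rate) for t in range(rate)]
assert not looks_out_of_band(speech)   # 300 Hz tone: within the speech band
assert looks_out_of_band(carrier)      # 7 kHz tone: flagged
```

Note that this proxy runs on raw samples, before any STT front-end gets a chance to discard the carrier.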
3.4 Indirect image PI — the class that hits agents hardest
The category that has produced most real-world incident volume since 2024 is indirect image PI: the image the model ingests did not come from the chat user — it came from a webpage an agent fetched, a Slack thread rendered into a screenshot, a profile photo pulled from an external URL, a Jira ticket attachment. Every one of those is a legitimate, expected input. Every one of those is attacker-controllable if the attacker can get content onto the source surface. For architecture teams, this is the risk to take seriously — because blocking it means redesigning feature paths, not just tuning a detector.
3.5 Typographic composites and screenshot-as-payload
The remaining families are composites: a legitimate chart with one injected row, a product screenshot with a small overlay, a PDF page where one paragraph was replaced. Detection needs to return a bounding region, not just a boolean — so the reviewer can tell whether the whole asset is poisoned or only a corner. The umbrella reference is the typographic prompt-injection scanner overview.
4. Why text-only scanners miss all of this
Every widely-deployed public-API prompt-injection defender — Lakera Guard, LLM Guard, Azure Prompt Shields, Promptfoo, and the indie copies that followed — was built for text in, boolean or score out. That design was correct in 2023. It is now incomplete. The blind spot is not a bug; it is the scope the product shipped. Bolting an OCR or STT stage in front of the existing scanner recovers some traffic but has the structural ceiling described in section 2. If your defence-in-depth consists only of text PI scanning, and your product accepts images or audio, your defence is shorter than you think. For an honest comparison see Glyphward vs Lakera Guard, vs Azure Prompt Shields, and vs LLM Guard — each page credits what the incumbent does well and is explicit about where the multimodal gap sits.
5. The defender's playbook
Five concrete steps, in the order you should take them. None of them require a rewrite — they are things you can add in front of your existing model calls.
- Inventory untrusted input surfaces. List every code path where an image or audio clip arrives from outside your own pipeline: user uploads, fetched URLs, webhook payloads, screenshots taken by agents, third-party APIs, render-then-ingest jobs. Each one is a candidate for a scanner call. You cannot defend surfaces you have not named.
- Put a byte-level scanner in front of the model. For each surface, call a multimodal PI scanner before the VLM or STT model. Gate on the returned 0–100 risk score. If you are integrating Glyphward, the endpoint is `POST /v1/scan` with the image or audio bytes; the response carries a score and a flagged region.
- Apply source-aware thresholds. An image your own backend rendered (a chart, a status card) should clear at a permissive threshold. An image uploaded by a user or fetched from a third-party URL should clear at a stricter threshold. The scanner returns a raw score — your integration decides the thresholds per source. This is the highest-leverage knob in the stack.
- Ensemble with pixel-level signals for images. A defensible image pipeline combines OCR with Unicode confusable normalisation, an instruction-layout classifier over visual embeddings, nearest-neighbour against a curated attack corpus, and an adversarial-perturbation detector. No single head handles the whole attack space — the four-signal ensemble is the minimum for an SMB-tier product posture.
- For audio, run two signals in parallel, not series. A waveform-anomaly classifier flags perturbation and out-of-band energy that never reaches your STT; a transcript-side text PI filter catches cleanly-transcribed payloads. Running them in series lets each one discard signal the other needed. Running them in parallel and merging scores is cheap and dominant.
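The parallel-not-series point in the last step fits in a few lines. The toy detector heads and the 0–100 max-merge below are illustrative; the only load-bearing idea is that both heads see unfiltered input:

```python
def scan_audio(waveform_head, transcript_head, clip, transcript) -> int:
    """Run both detectors on the same clip and merge scores.
    Neither head sees a view filtered by the other."""
    wave_score = waveform_head(clip)          # 0-100, from raw samples
    text_score = transcript_head(transcript)  # 0-100, from STT output
    # Max-merge: either signal alone is enough to raise the clip's risk.
    return max(wave_score, text_score)

# Toy heads standing in for real classifiers.
def waveform_head(clip):
    return 80 if clip["hf_energy"] > 0.5 else 5   # out-of-band energy

def transcript_head(text):
    return 90 if "ignore previous" in text.lower() else 5

clean = scan_audio(waveform_head, transcript_head,
                   {"hf_energy": 0.1}, "hello team")
noisy = scan_audio(waveform_head, transcript_head,
                   {"hf_energy": 0.9}, "hello team")
```

Run in series instead (STT first, text filter only), and the high-frequency carrier in the noisy clip never reaches any detector.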
Whichever thresholds you choose, log the score, the flagged region or audio window, and a reference to the raw bytes. A bounding-box audit trail is the difference between a scanner you can defend in an incident review and one you cannot.
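The scan-then-gate shape above can be sketched as one function. The host, request format, and response fields (`score`, `region`) are assumptions about the `/v1/scan` endpoint, not a documented contract; adapt them to whatever your scanner actually returns:

```python
import json
import urllib.request

# Stricter thresholds for sources an attacker can reach.
# Names and numbers are illustrative, not recommendations.
THRESHOLDS = {"internal_render": 90, "user_upload": 60, "fetched_url": 40}

def gate(score: int, source: str) -> bool:
    """Allow the input only if the 0-100 risk score clears the
    threshold for where the bytes came from."""
    return score < THRESHOLDS[source]

def scan_and_gate(data: bytes, source: str, api_key: str) -> bool:
    """POST raw bytes to the scanner, gate on the score, and emit
    one audit log line carrying score and flagged region."""
    req = urllib.request.Request(
        "https://api.glyphward.example/v1/scan",  # hypothetical host
        data=data,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/octet-stream",
        },
    )
    with urllib.request.urlopen(req) as resp:
        result = json.load(resp)  # assumed: {"score": ..., "region": ...}
    allowed = gate(result["score"], source)
    print(json.dumps({"source": source, "score": result["score"],
                      "region": result.get("region"), "allowed": allowed}))
    return allowed
```

Keeping `gate` as a pure function makes the per-source policy trivially unit-testable without hitting the network.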
6. What to integrate this week
Three concrete actions that fit inside one sprint, in priority order.
- Wire a scanner call into your riskiest single surface. For most teams that is user-uploaded images on the chat or avatar path. One endpoint, one threshold, one log line per call. An hour of work to prove the shape. Glyphward's free tier covers 10 scans/day — enough to build the integration and run it against a corpus of known FigStep samples.
- Add a second scanner call on one agent surface. If you ship an agent that takes screenshots or fetches third-party URLs, that surface is the higher-severity attack vector. Same shape as step 1 but on the fetch-then-render path, with a stricter threshold because the source is attacker-controllable.
- Publish your threshold policy internally. One page in your runbook: source → threshold → action (block, log, allow with flag). Teams that skip this step get into arguments about false-positive tuning during incidents. Teams that ship it argue during planning, which is cheaper.
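The runbook page can literally be a small table in code. Source names, numbers, and action labels below are placeholders for whatever your surfaces and tolerances turn out to be:

```python
# source -> (block_at, log_at): score >= block_at blocks the input,
# score >= log_at allows it but flags for review, lower passes silently.
POLICY = {
    "internal_render": (95, 80),
    "user_upload":     (60, 30),
    "fetched_url":     (40, 15),
}

def action_for(source: str, score: int) -> str:
    block_at, log_at = POLICY[source]
    if score >= block_at:
        return "block"
    if score >= log_at:
        return "allow_with_flag"
    return "allow"
```

A score of 50, for example, blocks on the fetched-URL path, flags on the upload path, and passes on the internally rendered path, which is exactly the source-aware behaviour the playbook argues for.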
If you want a self-serve drop-in for the above, get early access — free tier, 10 scans/day, no card required.
FAQ
Is this a replacement for my existing text PI scanner?
No — it is the other half. Keep your text PI scanner for text inputs and OCR/transcript output. Add a pixel-and-waveform scanner for the modalities the text scanner cannot see. Defence-in-depth across layers, not replacement.
How much latency does an inline scanner add?
For a single image at Glyphward's current pipeline, p95 is under 200ms, which is under most VLM call latencies by a comfortable margin. Audio clips scale with clip length; 30-second clips are the typical upper bound we tune against.
Why not just fine-tune the VLM or speech model to refuse?
Worth doing, but it is a policy-layer defence — it treats the output. The attacks in this threat model mostly compromise the input boundary: an attacker-controlled artifact reaches the model and the model does what the artifact asks. Input-layer scanning is cheaper, faster to iterate, and independent of the model version you happen to be on this quarter.
Does the scanner see my users' images or audio?
Glyphward processes bytes in memory and returns a score and region. We do not train on customer inputs by default, and retention is documented on the privacy page. For regulated workloads, the comparison pages explain the pattern of running both: Glyphward on the image and audio legs, self-hosted LLM Guard on the text leg.
What if I only accept one modality — say, just images?
Then you only need the image pipeline. The threat model composes — add the audio pipeline the week you add a voice feature, not before. Don't pay for capability you don't use.
Further reading
- FigStep detection — the best-known image-modality jailbreak.
- AgentTypo detector — adversarial-glyph techniques and how to defeat them.
- WhisperInject detection — audio-modality prompt injection in detail.
- Indirect prompt injection via images — the history from Greshake 2023 to 2026.
- Typographic PI scanner overview — the four-signal ensemble.
- Audio PI detection — the category page for the audio side.
- Glyphward vs Lakera Guard — honest comparison with the best-known text-first incumbent.
- Free-tier API — what 10 scans/day gets you.