
Indirect prompt injection via images

Indirect prompt injection is the class where the payload reaches the model from a third party — a fetched webpage, an uploaded document, a screenshot someone else sent. The image variant is where most of the real-world incident volume has landed since 2024, because images are the payload-carrying modality your text scanner cannot inspect.

TL;DR

Greshake et al. (2023) formalised indirect prompt injection: the attacker does not talk to the model — the victim hands the model attacker-controlled data and the model acts on its instructions. The image variant took that template and moved the payload into rendered pixels, escaping every text-only defender. Detection lives in the image bytes, before the model ingests them.

A short history of the class

  1. 2023 — Greshake et al., “Not what you’ve signed up for” (arXiv:2302.12173). The paper that named and categorised indirect prompt injection. The examples were text (poisoned webpages, email threads, document contents); the structural argument generalised to any modality the model ingests from untrusted sources.
  2. 2023 — FigStep (arXiv:2311.05608). First widely-cited image-modality jailbreak. Instructions rendered onto an image, paired with a polite text prompt. Demonstrated against GPT-4V, Gemini, and LLaVA variants.
  3. 2024 — AgentTypo and the adversarial-typography line. Anti-OCR distortions designed to defeat the OCR-first defender pattern specifically. Public writeups and follow-up papers through 2024–2025.
  4. 2024–2025 — Screenshot-reading agents take the hit. As IDE agents, browser agents, and avatar-pipeline tools matured, screenshots from third parties became a live attack surface: a Slack thread, a Jira ticket, an email render — all legitimate inputs to an agentic workflow, all attacker-controllable.
  5. 2025–2026 — Industry catches up at the high end. Lakera, acquired by Check Point, pushed upmarket; Azure Prompt Shields added some image handling for Azure-native tenants; LLM Guard stayed text-only by design. The self-serve SMB tier remained uncovered — which is the gap Glyphward exists to fill.

Why the image variant is worse than the text variant

The text variant of indirect PI is bad. The image variant is worse for three reasons: (1) your existing text PI scanner does not see it, so your current defence-in-depth is shorter than you think; (2) the payload is legible to the VLM at distortion budgets that obliterate OCR, so the naive fix (OCR → text scanner) has a ceiling below the attack; (3) images are the most common attacker-controllable modality in agentic workflows — screenshots, embedded web images, uploaded profile photos, rendered PDFs — so the attack surface is wide and growing.

If you run an agent that takes screenshots, accepts image uploads, or renders third-party pages into the model’s context, indirect image PI is a live risk on the current stack. Not a future problem.

How to detect indirect image PI at inference time

The defence is the same as for any typographic prompt injection — inspect the pixels, not the recovered text — with one additional twist: source-aware thresholds. Images that arrived from attacker-controllable surfaces (third-party URLs, user uploads, screenshots of external webpages) should be scanned with a stricter threshold than images produced in your own trusted pipeline (e.g., a chart your own backend rendered). Glyphward returns a raw 0–100 score; your integration decides which thresholds to apply per source.
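Source-aware thresholds reduce to a small policy table keyed on where the image came from. A minimal sketch — the source labels and cutoff values below are illustrative assumptions, not Glyphward defaults:

```python
# Map each image source to a score cutoff; stricter cutoffs for
# attacker-controllable surfaces. Values are illustrative -- tune
# them against your own traffic, not these numbers.
THRESHOLDS = {
    "third_party_url": 30,   # fetched from the open web: strictest
    "user_upload": 40,       # uploaded by an end user
    "screenshot": 40,        # screenshot of an external page
    "trusted_pipeline": 80,  # rendered by your own backend: most permissive
}

def should_block(score: int, source: str) -> bool:
    """Return True if a 0-100 risk score meets the cutoff for its source.

    Unknown sources fall back to the strictest cutoff, so a new ingest
    path fails closed rather than open.
    """
    cutoff = THRESHOLDS.get(source, min(THRESHOLDS.values()))
    return score >= cutoff
```

The fail-closed fallback matters in practice: agents grow new image-ingest paths faster than policy tables get updated.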

The four-signal ensemble covered in the typographic PI scanner overview is the core: OCR with confusable normalisation, an instruction-layout classifier, visual-embedding nearest-neighbour over a shared corpus, and an adversarial-perturbation detector. For indirect-PI specifically, the bounding region the scanner returns lets you log evidence and show auditors what fired on which piece of which image — a requirement that is easier to meet with a score-and-region output than with a boolean.
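The score-and-region output maps naturally onto an audit record. A sketch of what such a log line could look like — the field names and schema here are assumptions for illustration, not Glyphward's actual response shape:

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ScanEvidence:
    """One scan result, serialisable as a JSON audit log line."""
    image_id: str                # your identifier for the scanned image
    source: str                  # where the image arrived from
    score: int                   # raw 0-100 risk score
    region: tuple                # flagged bounding box: (x, y, width, height)
    signals: list = field(default_factory=list)  # which ensemble signals fired

def audit_line(ev: ScanEvidence) -> str:
    """Serialise one scan result as a stable, sorted JSON log line."""
    return json.dumps(asdict(ev), sort_keys=True)

# Example: a screenshot where the OCR and layout signals both fired.
line = audit_line(ScanEvidence(
    image_id="img-123", source="screenshot", score=87,
    region=(40, 220, 512, 96),
    signals=["ocr_confusable", "instruction_layout"],
))
```

A record like this answers the auditor's question directly: what fired, on which region, of which image, from which source.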

How Glyphward fits

Glyphward’s API accepts an image and returns risk score + flagged region. Wire it into the image-ingest path of your agent — before the VLM call — and apply source-aware thresholds. Free tier at 10 scans/day, Pro at $29/mo for 100k scans, Team at $99/mo for 1M. See how the integration looks, compare it to market incumbents on the Lakera alternative (multimodal) page, or join the waitlist.
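The control flow is scan, threshold, then (and only then) the VLM call. A sketch of that wiring — `scan_image` below is a stub standing in for the scanner call; the real client, endpoint, and response shape are assumptions, not Glyphward's documented API:

```python
def scan_image(image_bytes: bytes) -> dict:
    """Stub for the scanner call.

    A real implementation would POST the bytes to the scanning API and
    return its parsed JSON response; here we return a fixed benign score.
    """
    return {"score": 12, "region": None}

def ingest(image_bytes: bytes, source: str, vlm_call, threshold_for) -> str:
    """Scan an image before it reaches the VLM; reject it if it scores
    at or above the cutoff for its source.

    `vlm_call` is your existing model invocation; `threshold_for` maps a
    source label to a 0-100 cutoff (the source-aware policy).
    """
    result = scan_image(image_bytes)
    if result["score"] >= threshold_for(source):
        raise ValueError(
            f"image from {source} rejected: score {result['score']}"
        )
    return vlm_call(image_bytes)
```

The key property is ordering: the scan sits on the ingest path, so nothing the scanner rejects is ever part of the model's context.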

Get early access

Related questions

Is “indirect prompt injection” only a problem for agents?

Chat apps get it too, whenever a user pastes in an image they did not author. Agents amplify the problem because agents fetch images themselves — webpage screenshots, document renders, profile photos from external URLs — so the untrusted-input surface is wider and the user has less visibility into what the model saw.

Can I just block images from untrusted sources?

If your product allows it, blocking is the strongest mitigation. Most products cannot — images from users and third parties are the feature. In that case you need a scanner in the path. Block-what-you-can and scan-the-rest is the sustainable policy.

How is this different from jailbreaking?

Jailbreaking targets the model’s policy refusals; indirect PI targets the trust boundary around the model’s input. The techniques overlap (an indirect PI payload often contains a jailbreak instruction) but the defence layer is different: jailbreak detection lives near the policy model; indirect PI detection lives at the input boundary. Glyphward is an input-boundary scanner.

Further reading