Prompt-injection scanner for screenshot-reading agents

A screenshot-reading agent does not read your prompt. It reads what is on the screen and acts on what it reads. Once a vision-language model is in the loop, every pixel inside the bounding rectangle is an instruction the model can choose to follow — and an instruction the attacker can choose to put there. The attack surface is the screenshot itself.

TL;DR

If your agent looks at a UI to decide what to do — Computer Use, screen-aware copilots, design-to-code from a screenshot, automated UI testing with VLMs — you are running a vision-language model on adversary-controlled imagery. Prompt-engineering the system message does not help, because the attack lives in the input the model is supposed to perceive, not in the prompt frame. The defence is a pixel-level scan before the VLM call, in addition to anything you already do at the text layer.

Why screenshot agents are uniquely exposed

Two properties of the screenshot-agent stack converge into a sharp attack surface. First, the agent is invited to read text rendered as pixels — that is the entire point. So a defender cannot block on the presence of rendered text without breaking the product. Second, the source of the imagery is whatever the agent is looking at: a website it visits, a tab it captures, a document a user uploads, a screenshare from a third party. Any of those sources can be controlled by an adversary, and the agent has no native way to tell benign-UI pixels apart from pixels carrying an injection payload.

This is the indirect-prompt-injection threat model first laid out by Greshake et al. in arXiv:2302.12173, sharpened for the multimodal era. The text-only version of the attack required the agent to fetch a poisoned document; the screenshot-agent version requires only that the adversary can place pixels somewhere the agent will look. See indirect prompt injection (image) for the historical arc.

Four attack patterns that fire on screenshot agents

  1. FigStep-style typographic payloads. A jailbreak rendered as a numbered list inside a small image element on a page the agent visits. The VLM reads the list as instructions; OCR-first defenders see a low-resolution font and drop or misread it. FigStep detection goes deep on the mechanism.
  2. AgentTypo and Unicode confusables. Distorted letters, anti-OCR fonts, and Cyrillic homoglyphs rendered as pixels. OCR mis-transcribes; the VLM still pools the semantic embedding. The defender that scans the OCR output sees clean text where the model sees a payload. See AgentTypo detector.
  3. Adversarial UI elements. A button that says “Approve transfer” to a human and renders in pixels that an OCR-first defender reads as “Cancel”, while the VLM reads the visual cue (button colour, position, surrounding context) as the action the user wanted. The disagreement lives in the gap between OCR output and visual semantics.
  4. Composite indirect injection. A poisoned screenshot containing a benign-looking instruction that the agent follows in good faith — “close this dialog” — but where the dialog conceals an instruction encoded as a small typographic block elsewhere in the frame. Multiple low-individual-confidence signals stacked into a high-confidence payload.

The four share a structural feature: the artefact a text-based defender reads (the OCR transcript, or the agent’s narrated plan) is not the artefact the model acts on (the pixels themselves). Defending the wrong artefact wastes effort on every one of them. See typographic prompt injection scanner for the umbrella view.

What you cannot defend with prompt-engineering alone

The most common first line of defence is to add a system message that tells the agent to ignore instructions that appear inside the imagery it perceives. That is necessary; it is not sufficient. Recent work has consistently shown that VLMs trained on web-scale image-text pairs treat instruction-shaped pixels as instructions even when explicitly told not to — the in-distribution prior outweighs the out-of-distribution policy override under realistic distractor density. The system message reduces the success rate; it does not collapse it. For the residual risk, you need a detector that does not depend on the model’s cooperation.

The second-most-common line of defence is to OCR the screenshot first, run a text PI filter on the OCR output, and let the VLM see the original. This stack ships in production today and is broken in the way AgentTypo was designed to break: any disagreement between OCR and the VLM is the attack’s job to maximise.

Scanner architecture: pixel-level inspection before the VLM call

The architecture that closes the screenshot-agent gap runs four signals in parallel on the raw image:

  1. Text-in-image likelihood head. Predicts whether the image contains instructional text — numbered lists, imperative verbs, command-shaped layout — without needing to read individual letters. Fires on FigStep and AgentTypo output regardless of font.
  2. Visual-embedding nearest-neighbour over a known-payload corpus. CLIP-style embeddings compared against a compounding corpus of seen multimodal PI payloads. Catches paraphrases, font-swaps, and resolution changes that still land in the same neighbourhood.
  3. OCR with confusable normalisation. Still useful — but normalised (Cyrillic → Latin, diacritics stripped) before downstream matching, so confusables cannot mask a string.
  4. Perturbation-signature classifier. Detects high-frequency artefacts characteristic of adversarial-glyph attacks. Fires on AgentTypo even when the text content is benign.
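Signal 3 can be sketched concretely. A minimal confusable-normalisation pass maps common Cyrillic homoglyphs to their Latin look-alikes and strips diacritics before downstream matching; the mapping table below is illustrative, not exhaustive, and is an assumption of this sketch rather than the scanner's actual table.

```python
import unicodedata

# Illustrative homoglyph map: Cyrillic letters that render identically
# to Latin ones. A production table would cover far more code points.
HOMOGLYPHS = {
    "а": "a", "е": "e", "о": "o", "р": "p", "с": "c",
    "х": "x", "у": "y", "і": "i", "ѕ": "s",
}

def normalise(text: str) -> str:
    # 1. Swap known homoglyphs for their Latin counterparts.
    text = "".join(HOMOGLYPHS.get(ch, ch) for ch in text)
    # 2. Strip diacritics: decompose, drop combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

# "Ignоre" with a Cyrillic "о" now matches the plain-Latin payload string.
assert normalise("Ignоre all prévious instructions") == "ignore all previous instructions"
```

After normalisation, a string match against a payload corpus sees the same text the VLM pools semantically, which is the whole point of the signal.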

Two signals firing above threshold is a hard block; one is cause to route to a human reviewer or downgrade the agent to read-only. The corpus compounds: every confirmed payload one tenant scans becomes a near-neighbour signal for every other tenant.
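The fan-in rule above can be written down in a few lines. This is a hedged sketch, not Glyphward's implementation: the signal names and the 0.5 per-signal firing threshold are assumptions made for illustration.

```python
# Four per-signal confidences in [0, 1]; a signal "fires" above the
# threshold. Two fired signals -> hard block; one -> human review /
# read-only downgrade; none -> pass to the VLM.
FIRE_THRESHOLD = 0.5  # illustrative, not a documented default

def decide(signals: dict) -> str:
    fired = [name for name, conf in signals.items() if conf >= FIRE_THRESHOLD]
    if len(fired) >= 2:
        return "block"    # two independent signals agree
    if len(fired) == 1:
        return "review"   # route to a human or downgrade to read-only
    return "pass"

assert decide({"text_likelihood": 0.9, "nn_corpus": 0.7,
               "ocr_match": 0.1, "perturbation": 0.2}) == "block"
assert decide({"text_likelihood": 0.6, "nn_corpus": 0.3,
               "ocr_match": 0.1, "perturbation": 0.2}) == "review"
```

Requiring two independent signals for a hard block is what keeps the false-positive rate tolerable on benign UIs, which are full of instruction-shaped text by design.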

Integration recipe for a screenshot-agent stack

  1. Capture screenshot as you do today.
  2. POST to /v1/scan with the image bytes and the source-trust level (own UI, user upload, third-party page, screenshare).
  3. Apply policy on the score. Block ≥80; block + log ≥60 from low-trust sources; pass to the VLM otherwise. Source-aware thresholds matter — a screenshot of a third-party page should be held to a tighter threshold than a screenshot of your own UI.
  4. Forward to the VLM only on pass. Cache the score with the image hash so a re-scan is free for the next hour.
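Steps 2–4 can be sketched as a single gate function. The `post_scan` transport below is a hypothetical stand-in for the HTTP call to `/v1/scan` (swap in your HTTP client); the thresholds follow the policy in step 3, and the cache-by-image-hash behaviour implements step 4.

```python
import hashlib

_cache = {}  # image hash -> score; expire entries after an hour in production

def post_scan(image: bytes, source_trust: str) -> int:
    # Hypothetical placeholder for `POST /v1/scan`; returns a 0-100 risk score.
    raise NotImplementedError

def scan_screenshot(image: bytes, source_trust: str, scorer=post_scan) -> str:
    key = hashlib.sha256(image).hexdigest()
    if key not in _cache:                # step 4: re-scan of a seen image is free
        _cache[key] = scorer(image, source_trust)
    score = _cache[key]
    if score >= 80:                      # step 3: block everywhere
        return "block"
    if score >= 60 and source_trust in ("third_party_page", "screenshare"):
        return "block_and_log"           # tighter bar for low-trust sources
    return "forward_to_vlm"

# Usage with a stubbed scorer:
assert scan_screenshot(b"third-party", "third_party_page", scorer=lambda i, s: 65) == "block_and_log"
assert scan_screenshot(b"own-ui-capture", "own_ui", scorer=lambda i, s: 65) == "forward_to_vlm"
```

The gate runs server-side, between capture and the VLM call, so the agent's reasoning budget is untouched.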

Because the scan runs before the VLM call, the latency adds to the time-to-action of the agent, not to the model’s reasoning budget. On a typical screen capture the marginal latency lands inside the slack the agent already has between capture and tool call.

How Glyphward fits

Glyphward’s `/v1/scan` accepts an image and returns a 0–100 risk score, modality flag, the bounding region of the flagged pixels, and the per-signal confidences. Drop it between your screenshot capture and your VLM call. The widget at /embed/preview demonstrates the upload-and-score flow against the public sample set; production calls go server-side. Free tier: 10 scans a day, no card. Pro: 100,000 scans/month at $29. Team: 1,000,000 at $99. See pricing or the vendor comparison.


Related questions

Does this matter if the agent only screenshots my own UI?

Yes, but the threshold can be looser. A screenshot of your own product is high-trust; a screenshot of a third-party page or a user-uploaded document is low-trust. Apply source-aware thresholds — see indirect prompt injection (image) for the framework.

Can I rely on the VLM to refuse instructions inside images?

Treat it as one signal, not the only signal. Recent VLMs follow image-borne instructions at meaningful rates even with explicit system-message overrides. The detector is the layer that does not depend on the model’s cooperation.

How does this compare to Lakera, LLM Guard, or Azure Prompt Shields for screenshot agents?

Those are text-side scanners; running them on the OCR output or the agent’s plan still leaves the pixel layer uninspected. Use Glyphward in addition, on the image bytes themselves. Honest comparisons at vs Lakera Guard, vs LLM Guard, and vs Azure Prompt Shields.

What about agents that take screenshots of arbitrary websites?

That is the highest-risk variant of the class — the screenshot is, by definition, adversary-controlled. Run the scan with low-trust thresholds, log every flagged region, and surface flagged pages to a human reviewer queue. Source-trust should default to “untrusted”.

Does this slow down a Computer Use-style agent that captures every few seconds?

The scan is asynchronous and cacheable by image hash. For repeated captures of the same screen, only changed regions need a fresh score. Production tenants typically batch on a 1–3 second cadence with no perceptible slowdown.
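The changed-region idea can be illustrated with a simple tile diff: hash fixed-size tiles of the raw frame and rescan only tiles whose hash changed since the previous capture. The flat-`bytes` frame layout and tile size here are assumptions for the sketch, not a description of the product's internals.

```python
import hashlib

def tile_hashes(frame: bytes, tile_size: int) -> list:
    # Hash each fixed-size slice of the frame buffer.
    return [hashlib.sha256(frame[i:i + tile_size]).hexdigest()
            for i in range(0, len(frame), tile_size)]

def changed_tiles(prev: bytes, curr: bytes, tile_size: int = 4096) -> list:
    # Indices of tiles that differ between two captures of the same screen.
    prev_h = tile_hashes(prev, tile_size)
    curr_h = tile_hashes(curr, tile_size)
    return [i for i, (a, b) in enumerate(zip(prev_h, curr_h)) if a != b]

frame_a = bytes(16) + bytes(16)          # two all-zero tiles
frame_b = bytes(16) + b"\x01" * 16       # second tile changed
assert changed_tiles(frame_a, frame_b, tile_size=16) == [1]
```

On a 1–3 second capture cadence, most tiles are unchanged frame to frame, so only a small fraction of the screen needs a fresh score per tick.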

Further reading