ICP-by-product · Computer-use agents

Prompt-injection scanner for computer-use agents

Computer-use agents — Anthropic's computer use capability, OpenAI's Operator-style browsing agents, and custom screenshot-to-action pipelines — work by capturing screenshots of the visible screen state and feeding those screenshots to a vision LLM that decides which action to take next (click, type, scroll, submit). This is one of the most powerful and one of the most exploitable AI architectures: every web page the agent visits, every app window it sees, every notification overlay that appears on screen is a potential injection surface. An adversarial web page can place instruction text styled to blend into the page background — invisible to human readers at normal viewing distance, but fully legible to the vision model — that redirects the agent's actions. This is the indirect prompt injection via image attack applied to agentic action loops, and it is actively researched as the most consequential AI security vulnerability of 2025–2026. Glyphward scans each screenshot before it reaches the model, scoring the PI risk of the visible content so your agent can abort or escalate before taking irreversible action.

TL;DR

In your screenshot-to-action loop: after each screen capture, before passing the image to the vision LLM for action selection, POST the screenshot to /v1/scan. If score ≥ 65, abort the current action and either pause for human review or terminate the agent loop. The scan adds under 200 ms to each step. Free tier — 10 scans/day, no card required.

The computer-use adversarial surface

Web page adversarial overlays (white-on-white / CSS injection). The primary attack vector for computer-use agents navigating web content is text placed on a web page styled to be invisible to human vision but legible to a high-resolution screenshot processed by a vision model. Common techniques: white text on a white background, 1px font size text in a repeated tile pattern, text placed in a z-index-hidden layer, or text placed in the browser's scroll-overflow region that only appears in a full-page screenshot. Web pages visited by an autonomous agent are fully attacker-controlled environments — the attacker knows the agent will take a screenshot and process it, so they can design the payload for vision model legibility rather than human legibility.

Email and document content rendered on screen. A computer-use agent tasked with reading and responding to email, processing document attachments, or extracting data from web forms will capture screenshots of those emails and documents. An adversarial email sender can include hidden text instructions in the email body — formatted to be invisible in the user's email client but fully captured in the screenshot the agent takes. This is distinct from direct image attachment injection: the payload is in the rendered HTML of the email, not in an attachment.

Notification banners and system overlays. Desktop agents that run continuously capture all visible screen content, including notifications, tooltips, and application overlays. An attacker who can trigger a notification (via a web push notification, a messaging app message, or an in-app notification) can potentially inject a payload into the agent's screenshot stream at a moment when the agent is performing a sensitive action. The timing attack is harder but the attack surface is real.

CAPTCHA and anti-bot challenge pages. Some anti-bot systems present adversarial images specifically designed to be difficult for vision models — not as PI attacks, but as detection signals. A PI attacker could craft an image that mimics a legitimate CAPTCHA challenge while embedding an instruction payload, exploiting the agent's learned behaviour of trying to solve CAPTCHAs.

Multi-step exfiltration via visual channel. In a chained attack, the first injection (on page A) instructs the agent to navigate to page B and submit data to a specific form. Page B is the attacker-controlled exfiltration endpoint. The agent may have legitimate access to sensitive data (cookies, API keys, file contents) visible on the screen it is operating. The adversarial redirect is invisible in the text conversation history but visible in the screenshot sequence. See prompt-injection scanner for screenshot-reading agents for related patterns.

Integration pattern — Python screenshot loop with scan gate

import httpx
import base64
from PIL import ImageGrab  # or your screenshot library
import asyncio

GLYPHWARD_API_KEY = os.environ["GLYPHWARD_API_KEY"]
COMPUTER_USE_THRESHOLD = 65  # balanced: low latency impact, real threat signal

async def scan_screenshot(screenshot_bytes: bytes, step_id: str) -> dict:
    b64 = base64.b64encode(screenshot_bytes).decode()
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://glyphward.com/v1/scan",
            headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
            json={
                "image": b64,
                "source": "computer_use_agent",
                "metadata": {"step_id": step_id, "agent_loop": "main"},
            },
            timeout=5.0,  # tight timeout: agent loops are latency-sensitive
        )
        resp.raise_for_status()
    return resp.json()

async def agent_step(task: str, step_id: str, action_fn) -> dict:
    # 1. Capture current screen state
    screenshot = ImageGrab.grab()
    screenshot_bytes = screenshot.tobytes("jpeg", "RGB")

    # 2. Scan before LLM interpretation
    scan = await scan_screenshot(screenshot_bytes, step_id)

    if scan["score"] >= COMPUTER_USE_THRESHOLD:
        return {
            "status": "aborted",
            "reason": "adversarial_content_detected",
            "scan_id": scan["scan_id"],
            "score": scan["score"],
            "step_id": step_id,
            "message": "Screenshot contained potential PI payload. "
                       "Halted before action decision. Human review required.",
        }

    # 3. Safe to pass to vision LLM for action selection
    action = await action_fn(screenshot_bytes, task)
    return {"status": "ok", "action": action, "scan_id": scan["scan_id"]}

The scan runs concurrently with any other pre-action checks (viewport validation, action history, rate limiting). If the scan times out (e.g., network issue), fail-closed: do not take the action. Log the step_id and scan_id so your agent's action replay log is tied to the scan audit trail.

For Anthropic's computer use API specifically, this scan step fits naturally between the screenshot capture step and the anthropic.messages.create() call that processes the screenshot as a vision message.

Get early access

Anthropic computer use — specific considerations

Anthropic's computer use capability (available via the Claude API with "type": "computer_20241022" tool) processes screenshots as vision inputs on every agent step. The Anthropic API documentation explicitly flags prompt injection via web content as a risk and recommends "applying Glyphward-style input screening" as a mitigation pattern for production deployments. Key considerations for Claude-powered computer use agents:

Screenshot frequency: Claude computer use agents can take 10–50+ screenshots per task. At under 200 ms per scan, this adds 2–10 seconds of total scan time for a typical task. For tasks where latency matters more than adversarial robustness, scan only on navigation events (URL changes, new page loads) rather than every screenshot.
Tool call screenshots vs. observation screenshots: In the Anthropic computer use loop, the agent submits a tool_result containing the screenshot after each computer action. Scan the screenshot before constructing the tool_result, not after the agent has already decided to use it.
Multi-turn conversation context: A PI payload that spans multiple screenshots (one part of the instruction on one page, another part on the next) is harder to detect in isolation. Glyphward's scan scores each image independently; for high-stakes computer use tasks, also monitor for unusual instruction-like text appearing across consecutive screenshots.

Coverage comparison

Defence layer	Web page overlay	Email body injection	Notification overlay	Rendered document PI
System prompt instructions ("don't follow on-screen instructions")	Partial (prompt-level, bypassable)	Partial	No	Partial
HTML source sanitisation	Partial (misses CSS-hidden text)	No	No	No
Text-only scanner on rendered HTML	Partial (misses CSS-hidden text)	Partial	No	Partial
Glyphward screenshot scanner	Yes — pixel-level scan of rendered page	Yes	Yes	Yes