Blog · Security Architecture · 2026-06-11

Agentic AI and multimodal prompt injection: why autonomous agents face a larger attack surface than chat models

A chat model that processes a bad image gives one bad response to one user. An autonomous agent that processes the same image may execute a dozen real-world tool calls — file reads, API requests, emails sent, code approved — before any human sees a result. The multimodal prompt injection gap is identical in both cases: text scanners miss pixel-domain payloads. What changes with agents is the blast radius, the number of injection entry points in the agent loop, and the trust-escalation dynamics in multi-agent orchestration. Closing the gap requires more than placing a scanner at the front door.

TL;DR

Chat models have one image input per turn; autonomous agents have image inputs at retrieval time, mid-loop screenshot capture, tool output parsing, and subagent output propagation. A text-only scanner at the initial user input covers none of the in-loop boundaries. A successful injection in an agent with tool access redirects not one response but every subsequent tool call until the session ends. The dedicated pages for the three highest-risk agentic surfaces are computer-use agent scanning, agentic RAG pipeline injection, and screenshot-reading agent injection; this post is the structural argument for why agentic deployment requires a different scanning posture.

1. What makes agentic systems structurally different

The term "agentic AI" covers a wide range of system architectures, but the features that create distinct security requirements relative to standard chat deployments are consistent:

Tool access with real-world side effects

Agentic systems are defined by their ability to take actions: write to a file system, make API calls, send messages, execute code, browse the web, control a computer's graphical interface. In a chat model deployment, the worst outcome of a successful prompt injection is a bad text response — incorrect information, an inappropriate reply, a disclosed system prompt. In an agentic deployment, the worst outcome is an autonomous series of real-world actions taken under the attacker's direction. The model is not just talking differently; it is doing different things.

This difference is what OWASP LLM06:2025 Excessive Agency formalises: the risk class that arises when an LLM with excessive tool access is compromised. The multimodal dimension of LLM06 specifically covers the case where the compromise arrives via a pixel-domain payload that text-only monitoring never sees.

Multiple image input boundaries per session

In a standard multimodal chat deployment, there is one image input event per turn: the user attaches an image, the model processes it, the model responds. Scanning at that boundary covers the full image attack surface for that turn. Agentic systems introduce additional image input events throughout the session lifecycle: documents fetched during RAG retrieval, screenshots captured mid-loop by computer-use agents, tool outputs that return image data, and subagent outputs propagated through multi-agent orchestration.

None of these input events is visible to a scanner positioned at the user-facing front door. A text-only scanner positioned there covers even less: it cannot read the images that do arrive at the front door, and it never sees the in-loop injection points at all. The coverage argument for multimodal scanning in agent systems is therefore even stronger than in chat deployments, because there are more attack surfaces, not fewer.

Persistent state that carries contamination

Many agentic systems maintain persistent state between turns: a scratchpad, a memory store, a conversation history buffer. A successful injection that redirects the agent's behaviour in one turn can write to this persistent state — a self-reinforcing mechanism where the attacker's instruction survives beyond the turn where it was injected and continues to influence subsequent turns. A text-only monitor on the conversation history cannot identify this contamination if the original payload arrived in the pixel layer of an image that was never scanned.

2. Three attack chains that do not exist in non-agentic deployments

Attack chain 1: Computer-use agent via adversarial screen content

A computer-use agent — such as a system built on Anthropic's computer use API, GPT-4o with function calling over screenshot loops, or an open-source computer-use framework — operates by taking screenshots of the current desktop or browser state and deciding what action to take next. The agent's perception is visual: it reads the screen as a series of images and interprets them with a VLM.

An attacker who can control what appears on the screen the agent is browsing controls the agent's visual input. A web page that displays adversarial text — formatted in a font that appears innocuous to a human viewer but reads as a FigStep-class instruction to the VLM — can redirect the agent's tool calls. The agent follows the instruction it reads from the pixel layer, using the tools it has access to (browser navigation, file downloads, clipboard paste, form submission), and the text-only monitor on the agent's conversation history sees only the benign action log: "navigated to page", "clicked element", "submitted form".

The specific vulnerability profile of computer-use agent deployments — CSS-invisible overlays, navigation-event injection, terminal output hijacking — is detailed at computer-use agent prompt injection scanning. The same pixel-domain attack mechanics used in FigStep apply directly to this surface; the scanner must run on the raw screenshot bytes before the VLM call that interprets the screen state.

Attack chain 2: Persistent corpus poisoning in agentic RAG

A RAG-augmented agent that retrieves context from a document corpus faces a distinct attack geometry from a chat model that receives an image from a live user. In the chat model case, the attacker must deliver their payload to a specific user's session in real time. In the RAG agent case, a single successfully ingested adversarial document delivers its payload to every agent query that retrieves it — potentially across many sessions, many users, and an unbounded future time window.

The attack works at two layers simultaneously. The adversarial image in the document carries a FigStep-class pixel instruction that the VLM executes when it reads the retrieved context. At the same time, the document's CLIP-space embedding is crafted to rank highly on retrieval for the target query terms, so the payload document surfaces in every relevant retrieval, not just occasionally. This is OWASP LLM08:2025 Vector and Embedding Weaknesses applied to image payloads: the attacker shapes the image's embedding so that retrieval delivers it exactly when the attacker wants it.

A scanner positioned at the document ingestion pipeline catches this attack before the document enters the corpus. A scanner positioned only at the user input boundary will never see a retrieved document: the image arrives in the agent's context from the retrieval system, not from the user. The agentic RAG pipeline injection page details the specific scan placement, the LangChain and LlamaIndex integration points, and the per-document scan_id provenance record that satisfies SOC 2 CC6.6 evidence requirements for third-party data ingestion controls.

Attack chain 3: Trust escalation in multi-agent orchestration

Many production agentic systems use a multi-agent architecture: an orchestrator agent coordinates the overall task and delegates subtasks to specialised worker agents. The orchestrator receives worker outputs as internal context and treats them with higher trust than direct user input — because they are, nominally, the system's own intermediate reasoning rather than external attacker-controlled input.

This trust model creates a propagation vector for multimodal injection. A worker agent that is assigned to process documents, review code, or interact with visual interfaces receives image inputs as part of its task. If one of those images contains a FigStep or AgentTypo payload, the worker agent's behaviour is redirected. Critically, the worker's text output — which the orchestrator reads as trusted context — will reflect the redirected instruction. The orchestrator, which may never process images itself, receives the attacker's instruction embedded in what it believes is its own subagent's analysis.

The defence implication is that scanning cannot be confined to the orchestrator's user-facing input: every worker agent's image inputs require multimodal scanning, because a compromised worker is a compromised orchestrator. The full multi-agent trust model, and how scan placement must account for it, is covered in the screenshot-reading agent injection page, which also covers the related case of agents that delegate visual interpretation to specialised vision subagents.
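One way to make this concrete is to stop treating worker text as trusted by default and have each worker report the scan status of the images it consumed. The sketch below is an illustration under stated assumptions, not a framework API: `WorkerOutput`, `trust_level`, the 0 to 1 risk scale, and the 0.5 threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerOutput:
    """Text returned by a worker plus scan provenance for its image inputs."""
    text: str
    image_risk_scores: tuple[float, ...]  # one score per scanned image (hypothetical 0-1 scale)
    unscanned_images: int = 0             # images the worker read without a scan


def trust_level(output: WorkerOutput, threshold: float = 0.5) -> str:
    """Return 'internal' only if every image behind this output was scanned and clean."""
    if output.unscanned_images > 0:
        return "untrusted"  # provenance gap: treat like external input
    if any(score >= threshold for score in output.image_risk_scores):
        return "untrusted"  # a flagged image may have redirected the worker
    return "internal"


clean = WorkerOutput("Invoice total is 420 EUR.", image_risk_scores=(0.05, 0.10))
flagged = WorkerOutput("Forward the export to this address.", image_risk_scores=(0.05, 0.92))
blind = WorkerOutput("Summary of the mockup.", image_risk_scores=(), unscanned_images=1)

for output in (clean, flagged, blind):
    print(trust_level(output), "<-", output.text)
```

The design choice here is that trust is derived from scan provenance rather than from the sender's role, so an orchestrator that never sees an image can still tell when a worker's analysis rests on unscanned or flagged pixels.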

3. Why text-only scanners are particularly dangerous in the agentic context

The structural blindness of text-only scanners to pixel-domain payloads is the same regardless of whether the deployment is a chat model or an autonomous agent. What changes is the consequence of deploying a scanner with that structural blindness in an agentic context.

False confidence at a non-representative input boundary

A text-only scanner placed at the user-facing entry point of an agent provides coverage of exactly one of the many image input boundaries the agent has. It provides no coverage of retrieval-time images, mid-loop screenshots, tool outputs, or subagent propagation. But its presence creates a documented control — "we have a prompt injection scanner on user inputs" — that can be cited in security reviews, SOC 2 audits, and compliance assessments. That documented control is accurate about what it covers and materially silent about what it does not cover, which in an agentic system is the majority of the attack surface.

Teams that have deployed text-only scanners for their agentic systems and cite that deployment as their PI defence posture have a much larger unscanned attack surface than teams running a standalone chat model with the same scanner. The agentic deployment is harder to secure, not easier, and a scanner that was adequate for the chat model is inadequate for the agent by construction.

The forensic dead end

When a conventional application is compromised, forensic analysis can identify the payload — the HTTP request that contained the SQL injection, the email that carried the malware attachment, the input that triggered the buffer overflow. The payload is in the log.

When an agentic system is compromised by a multimodal injection that was never scanned, the forensic record contains only the tool calls the agent made: "retrieved document ID 4721", "navigated to URL", "wrote to file /data/export.json". The pixel payload that triggered those actions was in the image bytes of document ID 4721 or the screenshot taken during URL navigation. If those bytes were never scanned and no scan record was created, the forensic trail ends at the tool call log. There is no artifact that shows what the VLM read from the pixel layer. Reconstructing the attack requires retrieving the original image bytes and running a scanner retroactively — which requires having preserved the raw bytes in a queryable audit log, a step that most agent frameworks do not implement by default.

Per-request scan records (scan_id, risk_score, flagged_region, image_hash) are the forensic artifact that makes this reconstruction possible without having to retrieve and re-run every image the agent ever processed. This is the kind of audit evidence that EU AI Act Article 15 cybersecurity controls and the OWASP LLM Top 10 call for in AI system input validation, and it only exists if a scanner runs on the raw bytes at the time of processing.

The OWASP LLM06 intersection

OWASP LLM06:2025 Excessive Agency defines the risk class that arises when an LLM with broad tool access is redirected by an injected instruction. Most LLM06 discussions focus on the text injection vector — a malicious system prompt or user message that instructs the agent to misuse its tools. The multimodal dimension of LLM06 is the case where the redirection arrives via a pixel-domain payload, which means the standard text-monitoring controls that LLM06 mitigations typically recommend are insufficient: you cannot monitor for the injected instruction if it was never in the text layer.

The combined LLM01 (prompt injection) plus LLM06 (excessive agency) risk — where the injection vector is multimodal and the consequence is unrestricted tool execution — is the highest-consequence risk class for agentic AI deployments. It is also the risk class for which text-only security tooling provides the least coverage, because the entry point (image or audio) and the consequence amplifier (agent tool access) both lie outside the scope of string-based classifiers. The multimodal LLM06 excessive agency page covers the specific scan integration patterns for LangGraph, CrewAI, and AutoGen-based agentic frameworks.

4. Defence architecture for agentic multimodal deployments

Securing an agentic system against multimodal prompt injection requires extending the three-layer defence stack described in the FigStep/AgentTypo/WhisperInject post to account for the multiple image input boundaries unique to agentic loops.

Scan at every image boundary in the agent loop, not just the entry point

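The minimum scanning requirement for an agentic system is one multimodal scan at every point where image bytes enter the agent's context: the initial user input, documents fetched at retrieval time, screenshots captured mid-loop, tool outputs that return image data, and subagent outputs passed to an orchestrator.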
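A minimal sketch of that requirement, assuming a single gate function that every boundary calls before image bytes reach the VLM. Everything named here is hypothetical: `make_scan_gate`, `ScanResult`, `ImageBlocked`, and the boundary labels are illustrative, and `toy_scan` is a stand-in that matches a marker string rather than analysing pixels.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ScanResult:
    scan_id: str
    boundary: str
    risk_score: float
    image_hash: str


class ImageBlocked(Exception):
    """Raised when image bytes must not be passed on to the VLM."""


def make_scan_gate(scan: Callable[[bytes], float], threshold: float):
    """Build a gate that every image boundary calls before the VLM sees the bytes."""
    results: list[ScanResult] = []

    def gate(boundary: str, image_bytes: bytes) -> ScanResult:
        result = ScanResult(
            scan_id=f"scan-{len(results) + 1}",
            boundary=boundary,
            risk_score=scan(image_bytes),
            image_hash=hashlib.sha256(image_bytes).hexdigest(),
        )
        results.append(result)  # clean and flagged scans are both recorded
        if result.risk_score >= threshold:
            raise ImageBlocked(f"{boundary}: risk score {result.risk_score:.2f}")
        return result

    return gate, results


def toy_scan(image_bytes: bytes) -> float:
    """Stand-in scanner (assumption): flags a marker string, not real pixel analysis."""
    return 0.9 if b"INJECT" in image_bytes else 0.1


gate, results = make_scan_gate(toy_scan, threshold=0.5)
gate("user_input", b"holiday photo")
gate("retrieval", b"page 3 diagram")
try:
    gate("mid_loop_screenshot", b"banner INJECT overlay")
except ImageBlocked as exc:
    print("blocked:", exc)
print([r.boundary for r in results])
```

Routing every boundary through one gate keeps the coverage question auditable: a boundary that never appears in `results` is a boundary that is not being scanned.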

Human-in-the-loop escalation on high-risk scores

For agentic systems with irreversible tool access (file writes, external API calls, messages sent), a scan score above a high-risk threshold should trigger a human-in-the-loop checkpoint before the next tool call executes. In LangGraph-based agents, this is a conditional interrupt: if the scan gate node returns a score above the configured threshold, the graph pauses at the next node and surfaces the flagged image and score to an operator review queue before resuming. This is the agentic analogue of the fail-closed pattern for chat models; in the agent context, fail-closed means "pause and await human confirmation" rather than "return an error to the user", because the agent has ongoing state that should not be abandoned mid-task without operator visibility.
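The pause-and-review decision can be sketched without any particular framework. The example below is plain Python and deliberately does not use LangGraph's API; `Decision`, `ReviewQueue`, `check_before_tool_call`, and the 0.8 threshold are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    PROCEED = "proceed"
    PAUSE_FOR_REVIEW = "pause_for_review"


@dataclass
class ReviewQueue:
    """In-memory stand-in for an operator review queue."""
    pending: list[dict] = field(default_factory=list)

    def submit(self, scan_id: str, risk_score: float, next_tool: str) -> None:
        self.pending.append(
            {"scan_id": scan_id, "risk_score": risk_score, "next_tool": next_tool}
        )


def check_before_tool_call(
    scan_id: str,
    risk_score: float,
    next_tool: str,
    queue: ReviewQueue,
    threshold: float = 0.8,
) -> Decision:
    """Pause the agent before its next tool call if the last scan was high risk."""
    if risk_score > threshold:
        queue.submit(scan_id, risk_score, next_tool)
        return Decision.PAUSE_FOR_REVIEW  # agent state is kept, not discarded
    return Decision.PROCEED


queue = ReviewQueue()
print(check_before_tool_call("scan-7", 0.12, "read_file", queue))
print(check_before_tool_call("scan-8", 0.93, "send_email", queue))
print(queue.pending)
```

In a graph-based agent, the `PAUSE_FOR_REVIEW` branch is where the framework's own interrupt or checkpoint mechanism would be invoked; the sketch only shows the decision and the hand-off to the review queue.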

Per-action audit records linking scan_id to tool call

The forensic gap described in section 3 is closed by linking each scan record to the subsequent tool call it gated. A per-action audit record contains: scan_id (from the multimodal scanner), image_hash (SHA-256 of the raw bytes scanned), risk_score at scan time, the tool call that followed (action name, parameters), and a timestamp. This record can be written to an append-only audit log independently of whether the scan flagged anything — clean scans create records too, which is what enables retroactive forensic reconstruction if a novel attack variant is later discovered in the payload corpus and you need to determine whether any prior session was exposed.
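As a rough sketch of such a record, assuming a JSON Lines file as the audit log: the field names follow the list above, while `build_record`, `append_record`, `find_exposures`, and the sample tool names are hypothetical. Opening a file in append mode is only a convention, so a real deployment would need storage-level controls to make the log genuinely append-only.

```python
import hashlib
import json
import os
import tempfile
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class AuditRecord:
    scan_id: str
    image_hash: str   # SHA-256 of the raw bytes that were scanned
    risk_score: float
    tool_name: str    # the tool call this scan gated
    tool_params: dict
    timestamp: str


def build_record(scan_id, image_bytes, risk_score, tool_name, tool_params) -> AuditRecord:
    return AuditRecord(
        scan_id=scan_id,
        image_hash=hashlib.sha256(image_bytes).hexdigest(),
        risk_score=risk_score,
        tool_name=tool_name,
        tool_params=tool_params,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )


def append_record(path: str, record: AuditRecord) -> None:
    # Written for every scan, clean or flagged; mode "a" never rewrites earlier lines.
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def find_exposures(path: str, image_hash: str) -> list[dict]:
    """Return every logged action that was gated by a scan of the given image."""
    with open(path, encoding="utf-8") as log:
        rows = [json.loads(line) for line in log if line.strip()]
    return [row for row in rows if row["image_hash"] == image_hash]


log_path = os.path.join(tempfile.mkdtemp(), "audit.jsonl")
append_record(log_path, build_record(
    "scan-1", b"doc-4721-image", 0.04, "retrieve_document", {"doc_id": 4721}))
append_record(log_path, build_record(
    "scan-2", b"screenshot-bytes", 0.12, "write_file", {"path": "/data/export.json"}))

# Later: a new attack variant is identified in the image behind document 4721.
suspect = hashlib.sha256(b"doc-4721-image").hexdigest()
for row in find_exposures(log_path, suspect):
    print(row["scan_id"], row["tool_name"], row["timestamp"])
```

Because the record is keyed by image hash, the retroactive question "was any prior session exposed to this image?" becomes a log lookup rather than a rescan of every stored image.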

This audit architecture is also the evidence format that satisfies NIST AI RMF MAP 5.2 adversarial-input logging requirements, SOC 2 CC7.2 anomaly monitoring controls for AI systems, and EU AI Act Article 15 robustness-and-security record-keeping — all of which require documented evidence that inputs to high-risk AI systems were validated, not just an assertion that a scanner was deployed.

The scan placement decision: real-time vs. pre-ingestion

For retrieval-time images, there is a choice between scanning at ingestion (before the document enters the corpus) and scanning at retrieval (when the document is fetched for a specific agent query). Both have advantages. Ingestion-time scanning catches poisoned documents before they can be retrieved by any session; retrieval-time scanning catches documents that were clean when ingested but whose payload was only recognised after the payload corpus was updated. The real-time vs. batch scanning decision guide covers this trade-off in detail; for high-assurance agent deployments, the standard recommendation is both — ingestion gate to prevent corpus poisoning, retrieval gate to catch anything the ingestion scan missed with an older payload signature set.
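The two gates can be sketched together, assuming the scanner exposes a version number for its payload signature set. `GatedCorpus`, `StoredDoc`, `SIGNATURES`, and `toy_scan` are hypothetical; the toy scanner matches byte markers so the example stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StoredDoc:
    image_bytes: bytes
    scanned_with: int  # signature-set version used for the last clean scan


class GatedCorpus:
    def __init__(self, scan: Callable[[bytes, int], float], signature_version: int,
                 threshold: float = 0.5) -> None:
        self.scan = scan
        self.signature_version = signature_version
        self.threshold = threshold
        self.docs: dict[str, StoredDoc] = {}
        self.quarantined: set[str] = set()

    def ingest(self, doc_id: str, image_bytes: bytes) -> bool:
        """Ingestion gate: flagged documents never enter the corpus."""
        if self.scan(image_bytes, self.signature_version) >= self.threshold:
            self.quarantined.add(doc_id)
            return False
        self.docs[doc_id] = StoredDoc(image_bytes, self.signature_version)
        return True

    def retrieve(self, doc_id: str) -> Optional[bytes]:
        """Retrieval gate: rescan if signatures changed since the last clean scan."""
        doc = self.docs.get(doc_id)
        if doc is None:
            return None
        if doc.scanned_with < self.signature_version:
            if self.scan(doc.image_bytes, self.signature_version) >= self.threshold:
                del self.docs[doc_id]
                self.quarantined.add(doc_id)
                return None
            doc.scanned_with = self.signature_version
        return doc.image_bytes


# Toy signature sets (assumption): version 2 adds a payload unknown to version 1.
SIGNATURES = {1: [b"PAYLOAD_A"], 2: [b"PAYLOAD_A", b"PAYLOAD_B"]}


def toy_scan(image_bytes: bytes, version: int) -> float:
    return 1.0 if any(sig in image_bytes for sig in SIGNATURES[version]) else 0.0


corpus = GatedCorpus(toy_scan, signature_version=1)
print(corpus.ingest("doc-a", b"...PAYLOAD_A..."))  # rejected at ingestion
print(corpus.ingest("doc-b", b"...PAYLOAD_B..."))  # passes the older signature set
corpus.signature_version = 2                       # signature set is updated later
print(corpus.retrieve("doc-b"))                    # caught by the retrieval gate
print(sorted(corpus.quarantined))
```

Tracking `scanned_with` per document means the retrieval gate only pays for a rescan when the signature set has actually changed since the document's last clean scan.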

5. Practical starting point for a production agentic deployment

If you are shipping an agentic system that accepts image inputs — document processing agents, computer-use automation, screenshot-reading assistants, voice agents with image tools, multi-agent pipelines with visual subagents — the minimum viable multimodal security posture involves three concrete changes to your current architecture:

  1. Audit your image input boundaries. Map every point in your agent loop where image bytes enter the agent's context. This is not just the user-facing entry point; include retrieval steps, screenshot capture points, tool outputs, and any inter-agent message passing. The audit is usually a two-hour exercise for a well-understood agent graph. What it typically reveals is three to five additional image input boundaries that no current scanning covers.
  2. Place a multimodal scan gate at each boundary. At each identified boundary, add a pre-VLM-call scan step using a scanner that operates on raw image bytes. The scan should return a risk score and a flagged region. Set a threshold appropriate to the action the agent takes if the scan passes: lower thresholds for tool calls with irreversible consequences, higher thresholds for read-only operations. For teams evaluating this, Glyphward's free tier (10 scans/day, no card required; see Glyphward pricing) covers evaluation of all identified boundaries, including screenshot loops.
  3. Implement per-scan audit records. Write scan_id, image_hash, risk_score, and the subsequent tool call to an append-only log for every scan, regardless of score. This is a one-time implementation cost; it is the difference between a security posture you can evidence in an audit and one you can only assert.
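
The threshold guidance in step 2 can be expressed as a small lookup. This is a sketch under assumptions: the tool names, the two consequence classes, and the numeric thresholds are hypothetical, and the risk score is assumed to run from 0 (clean) to 1 (high risk).

```python
# Hypothetical thresholds: a lower value means the gate blocks sooner.
THRESHOLDS = {"irreversible": 0.3, "read_only": 0.7}

# Hypothetical mapping from tool name to consequence class.
TOOL_CLASS = {
    "send_email": "irreversible",
    "write_file": "irreversible",
    "call_external_api": "irreversible",
    "read_file": "read_only",
    "search_documents": "read_only",
}


def scan_passes(risk_score: float, tool_name: str) -> bool:
    """Return True if the tool call may proceed after a scan with this score."""
    # Tools missing from the mapping fall back to the strictest class.
    tool_class = TOOL_CLASS.get(tool_name, "irreversible")
    return risk_score < THRESHOLDS[tool_class]


for tool in ("read_file", "send_email", "unknown_tool"):
    print(tool, scan_passes(0.5, tool))
```

Defaulting unmapped tools to the strictest class keeps a newly added tool from silently inheriting the permissive threshold.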

For teams building on specific frameworks, the integration patterns for LangGraph (LangChain), AutoGen, and CrewAI are covered in the agentic RAG pipeline injection page; for Amazon Bedrock Agents specifically, see Bedrock Agents scanning; for coding assistant agents that process design mockups and screenshot context, see AI coding assistant context injection. The vision-language model security page covers the underlying VLM architecture argument for why the pixel layer is structurally the only position from which these attacks can be detected, independent of the agent framework layered above it.

FAQ

Does an autonomous agent need a different scanner than a chat model?

Yes, in two ways. First, a chat model receives a fixed set of inputs per turn; an agent receives images at multiple points in its loop — retrieved documents, mid-execution screenshots, tool outputs, subagent responses. A scanner placed only at the initial input boundary misses every image that enters the agent loop after turn one. Second, a successful injection in a chat model produces one bad response. In an autonomous agent, it can redirect tool calls and trigger a cascade of real-world actions before any human review step. The blast radius is structurally different; the scanning posture must account for it by covering every image input boundary in the agent loop.

Is scanning only at the initial user input sufficient for an agentic system?

No. Agentic systems introduce image inputs at multiple points that do not exist in chat model deployments: document retrieval during RAG lookups, screenshots taken by computer-use agents during execution, tool outputs returning image data, and subagent outputs in multi-agent orchestration. A scanner placed at the user-facing entry point has no visibility into any of these in-loop image input events. Each is a distinct attack surface; each requires its own scan gate.

What is the blast radius of a successful injection in an autonomous agent?

In a chat model, a successful injection produces one bad response to one user. In an autonomous agent with tool access, it can redirect every tool call the agent makes until the injected instruction is overridden or the session ends — file exfiltration, messages sent, code approved, data modified. The action log shows what the agent did, not why, because the pixel payload was never persisted. This is both the security risk and the forensic problem; per-scan audit records are the mitigation for both.

How do multi-agent systems extend the multimodal injection attack surface?

A worker agent that processes an injected image and is redirected will embed the attacker's instruction in its text output. The orchestrator receives that output as trusted internal context and acts on it without ever having seen the original image. This trust-escalation vector means even an orchestrator that processes no images directly can be compromised by injections entering via visually capable subagents. Scanning must cover every subagent's image inputs, not just the orchestrator's user-facing boundary.

Which OWASP LLM Top 10 entries map to agentic multimodal injection?

Three entries apply. LLM01:2025 Prompt Injection covers the injection itself: an attacker-controlled image in the agent's input redirects its behaviour. LLM06:2025 Excessive Agency covers the consequence amplification specific to agents: the injected instruction is executed with tool access, producing real-world irreversible actions. LLM08:2025 Vector and Embedding Weaknesses covers the RAG corpus poisoning vector: an attacker crafts an adversarial image's CLIP embedding to rank highly in retrieval, delivering the payload to the agent on every relevant query.

Further reading