
OWASP LLM01:2025 prompt injection — closing the multimodal sub-category

When AppSec teams audit an LLM application against the OWASP Top 10 for LLM Applications 2025, LLM01 (Prompt Injection) is the first control they have to evidence. The 2025 revision of LLM01 broadens the category beyond the original 2023 framing: it now recognises that prompt injection arrives through images, through audio, through embedded resources, and through any tool result the model treats as an instruction. The public self-serve defenders (Lakera Guard, LLM Guard, Azure Prompt Shields, Promptfoo) all address the text portion of LLM01. None of them, as shipped, covers the multimodal portion. That gap is what an LLM01:2025 audit will fail you on if your application accepts image upload, voice input, or any non-text content from a user or a tool. Here is what the multimodal piece of the control actually requires, and the inference-time scanner pattern that closes it.

TL;DR

LLM01:2025 (see the canonical risk page on the OWASP GenAI Security Project) treats prompt injection as a single category with multiple delivery channels — direct text, indirect text via retrieved or fetched content, and explicitly multimodal inputs (images, audio, mixed media). A control that satisfies LLM01 must inspect every channel the model actually consumes, at the place the model consumes it. Text-side filters are necessary but not sufficient: by design, they do not see pixels or waveforms. Glyphward sits as the inference-time scanner on the multimodal half of the control — bytes in, 0–100 score and flagged region out — and runs alongside the text-side guard you already have.

What LLM01:2025 actually says about multimodal

The 2025 revision of LLM01 extends the 2023 description in two places that matter for product teams. First, the threat description names multimodal inputs explicitly: any input the model accepts as part of its context — text strings, encoded images, encoded audio, embedded resources, file references, or content blocks delivered by tools — is a candidate carrier for an injected instruction. Second, the mitigation guidance names input inspection at the modality level. It is not enough to filter the text portion of a request and let the rest through; controls have to apply to whatever the model actually receives.

That is a small wording change with a large compliance consequence. Under LLM01:2023, an audit that demonstrated a text PI filter on user-typed prompts plus a basic system-prompt isolation pattern usually cleared the control. Under LLM01:2025, the same evidence does not clear the control if the application also accepts image upload (almost every chatbot in 2026), voice input (every voice agent), screenshots from a coding agent, or a retrieved PDF in a RAG pipeline. Each of those is a delivery channel the 2025 description names.

The OWASP GenAI Security Project — the 2025 successor to the original LLM Top 10 working group — maintains the LLM01 risk page as the canonical reference. Auditors and security reviewers will read that page first; their evidence questions will track its structure. If your application has a multimodal surface, expect to be asked which control inspects each modality and at what point in the pipeline.

Why the multimodal sub-category was added in 2025

Three things changed between the 2023 list and the 2025 list:

1. Vision-language and audio-capable models went mainstream. GPT-4o, Claude 3 and 4, Gemini 1.5+, and the open-weight Qwen-VL line all accept image inputs, and several of them audio inputs, as ordinary chat content. By 2024, accepting non-text inputs in production was the rule, not the exception.

2. The attack literature caught up. FigStep, AgentTypo, and the typographic-PI class established that an instruction rendered as pixels can survive OCR-based defences (see FigStep detection and AgentTypo detector). WhisperInject established the audio analogue (see WhisperInject detection). Indirect prompt injection via images, ported from the Greshake et al. 2023 framing into 2024 multimodal pipelines, became a routine red-team finding (see indirect prompt injection in images).

3. Existing public defenders did not extend. The text-PI scanners that cleared LLM01:2023 stayed text-only. The Sept–Nov 2025 acquisition of Lakera by Check Point pushed the leading text scanner upmarket rather than into modalities (see what Check Point buying Lakera means for self-serve AI-security buyers). The result was a public-control gap exactly where the 2025 list said one should not be.

The 2025 revision is the OWASP working group's response to that gap: name the multimodal channel explicitly so that an audit cannot quietly accept text-only evidence for a multimodal application.

The three multimodal delivery channels you have to cover

For a self-assessment against LLM01:2025, three multimodal channels recur in production AI applications. Each is a distinct evidence question.

1. Direct image upload. The user uploads a photo, a screenshot, a chart, or a meme. The image carries an instruction rendered onto its pixels: a FigStep-style anti-OCR overlay, an AgentTypo-style adversarial-glyph block, an attribute spoof, or a confusable visual prompt. Examples in scope: avatar SaaS (selfie-to-portrait), chatbots with image upload, support agents that accept screenshots of error states, content moderation pipelines, multimodal customer service. Per-product threat models are in avatar SaaS, chatbots with image upload, and screenshot-reading agents.

2. Direct audio input. The user speaks. The audio carries one of three things: a spoken jailbreak that the STT pipeline transcribes faithfully, an inter-word carrier that the transcript drops, or a WhisperInject-class out-of-band payload that the audio model decodes while the transcript-only filter sees nothing. Examples in scope: voice agents (telephony, in-app voice modes), audio-first chatbots, dictation assistants, transcript-then-act pipelines. The per-product threat model is in voice agents; byte-level coverage is in audio prompt-injection detection.

3. Indirect / tool-delivered multimodal content. The user does not upload anything; the model still sees image or audio bytes, because they came back from a retrieval, a tool call, an MCP server, or a fetched URL. Examples: a retrieved PDF in a RAG pipeline contains an embedded image with a FigStep payload (RAG pipelines); an MCP server returns a chart with an instruction overlay (MCP servers); a LangChain agent's tool call returns an image attachment with an injected instruction (LangChain agents). The 2025 LLM01 framing treats this as the same control: the bytes reach the model, so the bytes have to be inspected.

An LLM01:2025 self-assessment should produce one inspection-point answer per channel that applies to the application. "We do not accept user image upload" closes channel 1. "Our voice path uses STT and we filter the transcript" partially closes channel 2 — but only partially, because the transcript-side filter does not see WhisperInject-class carriers, and the auditor will probably ask. "We do not call multimodal-capable tools" closes channel 3. Anything else needs an active control.
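One way to keep that self-assessment honest is to record it as data, one entry per channel, and let a small check list the channels that still need an active control. The sketch below is illustrative only: the `Channel` names, the `ChannelAnswer` fields, and the `open_gaps` helper are hypothetical and are not part of any OWASP artefact or scanner API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Channel(Enum):
    DIRECT_IMAGE = "direct image upload"
    DIRECT_AUDIO = "direct audio input"
    TOOL_DELIVERED = "indirect / tool-delivered multimodal content"


@dataclass
class ChannelAnswer:
    channel: Channel
    accepted: bool                            # does the application expose this channel?
    byte_level_control: Optional[str] = None  # control that reads the raw bytes, if any
    note: str = ""


def open_gaps(answers: List[ChannelAnswer]) -> List[Channel]:
    """Return channels that are unanswered, or exposed without a byte-level control."""
    answered = {a.channel for a in answers}
    missing = [c for c in Channel if c not in answered]  # no answer counts as a gap
    exposed = [a.channel for a in answers if a.accepted and not a.byte_level_control]
    return missing + exposed


# Example answers mirroring the three statements above.
answers = [
    ChannelAnswer(Channel.DIRECT_IMAGE, accepted=False, note="no user image upload"),
    ChannelAnswer(Channel.DIRECT_AUDIO, accepted=True, note="STT transcript filter only"),
    ChannelAnswer(Channel.TOOL_DELIVERED, accepted=False, note="no multimodal-capable tools"),
]
gaps = open_gaps(answers)
```

In this example the audio channel is the only gap, because a transcript filter is not a control that reads the waveform.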

Why text-side controls do not satisfy the multimodal sub-category

The argument that "we have a text PI scanner, so LLM01 is covered" fails on two grounds. First, by interface: a text PI scanner accepts strings. It does not accept PNG bytes or PCM-16 audio. Its API has nothing to score on a multimodal channel. Bolting an OCR adapter or an STT adapter in front of it converts the input to text, but the conversion is the very thing the attack defeats — see why every text-only scanner misses a 30-pixel PNG for the architectural form of that argument and building a prompt-injection scanner for voice agents for its audio analogue.

Second, by audit shape: LLM01:2025 evidence questions ask which control inspects each channel. An auditor will not accept "we feed the OCR output of every uploaded image into our text scanner" if the threat model includes adversarial-glyph attacks the OCR drops. They will ask whether the control reads the bytes, and if not, what the residual risk is. The honest answer for a text-side scanner with an OCR adapter is "high residual risk on the FigStep / AgentTypo class, mitigated only by post-hoc model behaviour monitoring," which is not a passing answer for a control whose purpose is pre-execution input inspection.

The same shape applies for audio. A text-side control that reads the STT transcript is not satisfying LLM01:2025 for audio input — it is mitigating a strict subset of the channel. The auditor's question is what reads the waveform, and "nothing" does not clear the control.

Coverage matrix against LLM01:2025 multimodal

For a buyer evaluating self-serve options against the multimodal sub-category specifically, the public-defender landscape sorts cleanly.

Tool	Text channel	Image channel	Audio channel	LLM01 multimodal evidence
Lakera Guard	Yes	No (as of public coverage)	No	Partial — text only
LLM Guard (OSS)	Yes	No (text-only by design)	No	Partial — text only
Azure Prompt Shields	Yes (Azure-gated)	Image moderation, not PI	No	Partial — text + content moderation
Promptfoo	Test harness, eval-time	Test harness, eval-time	Test harness, eval-time	Not an inference-time control
Glyphward	Run-both with text scanner	Yes — bytes in, score and region	Yes — bytes in, score and region	Multimodal-channel control

The "run-both" framing matters because LLM01:2025 does not require replacing the text scanner. It requires that every channel the model consumes is inspected. A typical run-both setup uses Lakera Guard or LLM Guard for text and Glyphward for image and audio; neither vendor competes for the other's channel. Side-by-side detail is in Glyphward vs Lakera Guard, vs LLM Guard, vs Azure Prompt Shields, and vs Promptfoo; a self-serve pricing comparison is at multimodal PI scanner pricing comparison.
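A run-both setup can be sketched as a dispatcher that routes each content block to the scanner registered for its modality and refuses to continue when a modality has no scanner. Everything here is an assumption made for illustration: the `(modality, payload)` block shape, the `inspect_all` helper, and the stand-in scanners, which do not call any real vendor API.

```python
from typing import Callable, Dict, List, Tuple

# A content block is (modality, payload); payload is str for text, bytes otherwise.
Block = Tuple[str, object]
Scanner = Callable[[object], int]  # returns a 0-100 score


def inspect_all(blocks: List[Block], scanners: Dict[str, Scanner]) -> List[Tuple[str, int]]:
    """Score every block with the scanner registered for its modality.

    Raises if a modality has no scanner, so an uninspected channel fails
    loudly instead of passing through to the model.
    """
    results = []
    for modality, payload in blocks:
        scanner = scanners.get(modality)
        if scanner is None:
            raise LookupError(f"no scanner registered for modality {modality!r}")
        results.append((modality, scanner(payload)))
    return results


# Stand-in scanners for illustration; real ones would call the text guard
# and the multimodal scanner respectively.
def fake_text_scanner(payload: object) -> int:
    return 90 if "ignore previous instructions" in str(payload).lower() else 5


def fake_image_scanner(payload: object) -> int:
    return 10


scanners = {"text": fake_text_scanner, "image": fake_image_scanner}
request = [("text", "Summarise this chart"), ("image", b"\x89PNG...")]
scores = inspect_all(request, scanners)
```

Failing on an unregistered modality is a deliberate choice in this sketch: it turns "every channel the model consumes is inspected" into something the code enforces.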

Architecture for closing the multimodal half of LLM01

The shape of a control that clears LLM01:2025 multimodal is, deliberately, not novel. It is the same shape as a text PI scanner, applied to bytes:

1. Mount on input. Place the scanner on the boundary the model actually consumes from. For a chatbot with image upload, that is the upload handler before the vision API call. For a voice agent, that is the audio buffer before STT or before the audio-aware model. For a RAG pipeline, that is the loader middleware (pre-ingestion or retrieval-time). For an MCP host, that is the tool-result handler.

2. Score, do not block silently. Return a 0–100 score and the modality-tagged reason, not just a binary verdict. LLM01:2025 evidence is easier to defend with a continuous score and a tunable threshold than with an opaque "blocked / allowed" boolean. The threshold becomes a documented engineering parameter the auditor can read.

3. Source-aware thresholds. Trust user-uploaded content less than first-party content, and trust third-party retrieved content least of all. The same scan call with three different threshold bands documents three risk tiers cleanly.

4. Run-both with text. Keep the text scanner in front of the model. Add the multimodal scanner alongside it. The text channel is still a real channel, and the 2025 list still covers it as the original LLM01 sub-category.

5. Log every score for evidence. A SOC 2 / ISO 27001 / FedRAMP-aligned LLM01 evidence trail wants per-request scoring data. Glyphward's API returns a request ID and a score; logging the pair against the application's request ID is the audit-friendly default.
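The threshold and logging steps can be sketched in a few lines. The threshold values, the source names, and the log field names below are placeholders chosen for illustration; they are not recommended settings and not a documented log schema.

```python
import json
import logging
from dataclasses import dataclass

logger = logging.getLogger("llm01.evidence")

# Hypothetical threshold bands: a lower threshold means less trust in the source.
THRESHOLDS = {
    "first_party": 80,
    "user_upload": 60,
    "third_party": 40,
}


@dataclass
class Decision:
    allowed: bool
    score: int
    threshold: int
    source: str


def decide(score: int, source: str) -> Decision:
    """Compare a 0-100 scanner score against the band for this source."""
    if not 0 <= score <= 100:
        raise ValueError("score must be in the range 0-100")
    threshold = THRESHOLDS[source]  # unknown source raises KeyError: fail closed
    return Decision(allowed=score < threshold, score=score,
                    threshold=threshold, source=source)


def log_evidence(app_request_id: str, scan_request_id: str,
                 modality: str, decision: Decision) -> str:
    """Emit one JSON line pairing the application request ID with the scan request ID."""
    record = json.dumps({
        "app_request_id": app_request_id,
        "scan_request_id": scan_request_id,
        "modality": modality,
        "source": decision.source,
        "score": decision.score,
        "threshold": decision.threshold,
        "allowed": decision.allowed,
    }, sort_keys=True)
    logger.info(record)
    return record


same_score = 55
user = decide(same_score, "user_upload")   # 55 is below 60, so allowed
third = decide(same_score, "third_party")  # 55 is not below 40, so blocked
line = log_evidence("app-123", "scan-456", "image", third)
```

Recording the threshold next to the score in each log line means a later threshold change does not make older decisions unexplainable.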

The byte-level scanning architecture this implements — CLIP embedding plus typographic head plus Tesseract OCR plus a curated payload corpus on the image side, and a waveform anomaly classifier plus a Whisper-small transcript filter on the audio side — is described in the multimodal prompt-injection threat model for AI product teams (2026) blog post. The five-step playbook there maps directly onto LLM01:2025 evidence questions.

How Glyphward fits

Glyphward is the inference-time multimodal scanner — bytes in, score and region out — that slots into step 1 of the architecture above. The HTTP contract is one POST per attachment: image bytes (or URL) or audio bytes; the response is a 0–100 score, the flagged region (bounding box for image, time window for audio), and a modality-tagged reason. The same contract is exposed through the multimodal LLM security API page; pricing is flat-rate self-serve at $29/mo Pro and $99/mo Team, with a free tier sized for prototyping (free-tier API). Audit-friendly defaults: the Pro and Team tiers ship with logging and per-request IDs that satisfy a typical LLM01 evidence trail without further engineering.
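As a rough illustration of that contract, the sketch below builds one POST per attachment and validates a response before the score is trusted. The endpoint URL, the bearer-token header, and the JSON field names (`request_id`, `score`, `region`, `reason`) are assumptions made for this example and are not taken from the Glyphward API reference; the sample response body is made up.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Any, Dict, Optional

# Placeholder endpoint; consult the real API reference for the actual URL.
SCAN_URL = "https://scanner.example.invalid/v1/scan"


@dataclass
class ScanResult:
    request_id: str
    score: int                        # 0-100
    region: Optional[Dict[str, Any]]  # bounding box (image) or time window (audio)
    reason: str                       # modality-tagged reason


def build_scan_request(data: bytes, content_type: str, api_key: str) -> urllib.request.Request:
    """Build one POST per attachment; the caller decides when to send it."""
    return urllib.request.Request(
        SCAN_URL,
        data=data,
        method="POST",
        headers={"Content-Type": content_type, "Authorization": f"Bearer {api_key}"},
    )


def parse_scan_response(body: bytes) -> ScanResult:
    """Validate the assumed response shape before trusting the score."""
    payload = json.loads(body)
    score = int(payload["score"])
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return ScanResult(
        request_id=str(payload["request_id"]),
        score=score,
        region=payload.get("region"),
        reason=str(payload.get("reason", "")),
    )


# Offline example with a made-up response body (not real scanner output).
req = build_scan_request(b"\x89PNG\r\n\x1a\n", "image/png", api_key="test-key")
sample = (b'{"request_id": "scan-456", "score": 87, '
          b'"region": {"x": 12, "y": 40, "w": 300, "h": 30}, '
          b'"reason": "image:typographic"}')
result = parse_scan_response(sample)
```

Sending the request would be a call such as `urllib.request.urlopen(req, timeout=5)`, with the response body passed to `parse_scan_response`; the example stays offline so it runs without network access.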

The integration is provider-agnostic. Whether the application calls Anthropic, OpenAI, Google Gemini, AWS Bedrock, or a local model, the scanner reads bytes — not the chat-completion API the bytes are about to flow into. That is what makes it a clean LLM01 control rather than a vendor coupling.
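One way to picture the provider-agnostic shape is a gate that takes the scanner and the model client as plain callables, so the same pre-execution check sits in front of any provider. The `guarded_call` function, the `AttachmentBlocked` exception, the default threshold, and both stand-in callables are hypothetical names for this sketch.

```python
from typing import Callable, Iterable, List


class AttachmentBlocked(Exception):
    """Raised when an attachment scores at or above the threshold."""


def guarded_call(
    prompt: str,
    attachments: Iterable[bytes],
    scan: Callable[[bytes], int],
    call_model: Callable[[str, List[bytes]], str],
    threshold: int = 60,
) -> str:
    """Scan every attachment, then hand off to whichever provider client was passed in."""
    cleared: List[bytes] = []
    for index, blob in enumerate(attachments):
        score = scan(blob)
        if score >= threshold:
            raise AttachmentBlocked(
                f"attachment {index} scored {score} (threshold {threshold})"
            )
        cleared.append(blob)
    # The model is only called once every attachment has been cleared.
    return call_model(prompt, cleared)


# Stand-ins for illustration only; neither talks to a real service.
def fake_scan(blob: bytes) -> int:
    return 95 if b"INJECT" in blob else 3


def fake_provider(prompt: str, blobs: List[bytes]) -> str:
    return f"model saw {len(blobs)} attachment(s)"


reply = guarded_call("Describe this image", [b"clean-bytes"], fake_scan, fake_provider)
```

Swapping providers means passing a different `call_model`; the scanning step and its threshold stay the same.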
