
MITRE ATLAS — multimodal prompt injection (AML.T0051) and LLM jailbreak (AML.T0054) for AI red teams

MITRE ATLAS (the Adversarial Threat Landscape for AI Systems) is the ATT&CK-style technique catalog AI red teams use to scope engagements and AI security teams use to map findings to stable identifiers. ATLAS catalogs LLM Prompt Injection as AML.T0051, with two sub-techniques (Direct, AML.T0051.000; Indirect, AML.T0051.001), and catalogs LLM Jailbreak as AML.T0054. Each technique has a one- or two-sentence description in the catalog, and none of those descriptions restricts the technique to a modality. That silence is the operative point: an instruction rendered in pixels or in a waveform that diverts the model belongs to the same T0051 / T0054 class as an instruction encoded in characters. It simply arrives through a channel that a text-only red-team scope and a text-only runtime guard do not see. What follows is the technique-by-technique mapping, a coverage matrix scoring five public tools against the ATLAS evidence question on multimodal channels, and the runtime control architecture that closes T0051 / T0054 on image and audio inputs.

TL;DR

ATLAS gives you four technique IDs to argue against on a multimodal LLM application. AML.T0051 is the parent (LLM Prompt Injection). AML.T0051.000 is the direct-channel sub-technique. AML.T0051.001 is the indirect-channel sub-technique, the one that lands when the model ingests a tool result, a retrieved document, an image embedded in a PDF, or an audio file from a third-party source. AML.T0054 is the LLM Jailbreak technique, defined in the catalog as "a carefully crafted LLM Prompt Injection designed to place LLM in a state in which it will freely respond to any user input." All four are modality-agnostic in the catalog. A red-team engagement scoped only to text inputs documents coverage on a strict subset of the technique, and a runtime guard wired only to text inputs repeats the same mistake. Multimodal prompt injection (FigStep-class typographic image payloads, AgentTypo-class adversarial glyphs, WhisperInject-class audio carriers, indirect carriers in retrieved images and audio) is T0051 / T0054 delivered through pixels and waveforms. Glyphward is the inference-time multimodal scanner (bytes in, score and region out) that closes the gap, running alongside whatever text-side guard already covers the text channel and whichever red-team probe set already exercises the text path.

What MITRE ATLAS actually catalogs about prompt injection and jailbreak

ATLAS is structured the same way ATT&CK is: a matrix of tactics across the top, techniques and sub-techniques in cells beneath, and per-technique pages with descriptions, mitigations, and procedure examples. The four prompt-injection-shaped entries are catalogued under multiple tactics, so the same technique appears in more than one matrix column: it can give the adversary a foothold (Initial Access), maintain that foothold (Persistence), elevate privilege inside the application sandbox (Privilege Escalation), or evade safety controls (Defense Evasion). The verbatim catalog descriptions are the load-bearing artefact.

"An adversary may craft malicious prompts as inputs to an LLM that cause the LLM to act in unintended ways. These 'prompt injections' are often designed to cause the model to ignore aspects of its original instructions and follow the adversary's instructions instead." — ATLAS technique AML.T0051 (LLM Prompt Injection)

"An adversary may inject prompts directly as a user of the LLM. This type of injection may be used by the adversary to gain a foothold in the system or to misuse the LLM itself." — ATLAS sub-technique AML.T0051.000 (LLM Prompt Injection: Direct)

"An adversary may inject prompts indirectly via separate data channel ingested by the LLM such as include text or multimedia pulled from databases or websites." — ATLAS sub-technique AML.T0051.001 (LLM Prompt Injection: Indirect)

"An adversary may use a carefully crafted LLM Prompt Injection designed to place LLM in a state in which it will freely respond to any user input, bypassing any controls, restrictions, or guardrails placed on the LLM." — ATLAS technique AML.T0054 (LLM Jailbreak)

The Indirect sub-technique description is the one most often misread on a first pass. The phrase "include text or multimedia pulled from databases or websites" is the catalog's own admission that the indirect-channel payload need not be text. Pixels in an image returned by a tool call, audio bytes in a transcript-server response, screenshots passed to a screen-reading agent: every one of those falls inside the verbatim description's scope. The catalog's wording is modality-agnostic, and the operationally useful read is that any byte stream the model consumes is in scope, not just the text decoded from those bytes.

AML.T0054 is the second important catalog read. The technique definition explicitly grounds jailbreak as a specialisation of prompt injection: "a carefully crafted LLM Prompt Injection designed to place LLM in a state…" That phrasing cuts both ways. An engagement that scopes prompt injection (T0051) without scoping jailbreak (T0054) under-counts, because it never tests the state the injection is trying to reach. One that scopes jailbreak without scoping prompt injection over-counts, because a T0054 finding implies a T0051 input link the engagement never scoped. The two are not parallel siblings in the catalog; T0054 is a downstream goal of T0051 inputs. A multimodal payload that walks past the text scanner and lands as a jailbreak is one technique chain, T0051.001 then T0054, and a defender-side control that does not see the input bytes interrupts neither link.
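The chain argument can be made concrete with a small data model. A minimal sketch, assuming a finding is recorded as an ordered list of (technique ID, modality) links and a control is described only by the modalities whose bytes it inspects; `Link` and `visible_links` are hypothetical names, not part of ATLAS or any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One link in an ATLAS technique chain (hypothetical record shape)."""
    technique: str  # ATLAS ID, e.g. "AML.T0051.001"
    modality: str   # channel the payload bytes travel on: "text", "image", "audio"


def visible_links(chain, inspected_modalities):
    """Return the links whose input bytes a control actually inspects.

    A control can only interrupt a link it can see, so an empty result
    means the control does not interrupt the chain at any point.
    """
    return [link for link in chain if link.modality in inspected_modalities]


# Indirect image carrier that lands as a jailbreak: T0051.001 then T0054.
chain = [Link("AML.T0051.001", "image"), Link("AML.T0054", "image")]

text_only_guard = {"text"}
multimodal_guard = {"text", "image", "audio"}

print(visible_links(chain, text_only_guard))        # []
print(len(visible_links(chain, multimodal_guard)))  # 2
```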

How the four ATLAS technique IDs land on multimodal channels

The catalog text is silent on modality. Translating that silence into a working red-team scope and a working runtime control means walking each technique ID through the production system's channels and asking which ones the existing scope and control actually exercise. The mapping below is the operational read for an AI startup or scale-up red-teaming a multimodal LLM application — not a literal reproduction of the catalog page.

AML.T0051.000 (Direct) on the image channel. The user uploads a PNG containing a FigStep-style typographic instruction block, or an AgentTypo-class adversarial-glyph payload that is engineered to defeat upstream OCR. The model receives the bytes; the text-only scanner upstream sees nothing because no text exists as characters before the model parses the image. The technique chain is T0051.000 — a direct-channel adversary acting as a user. The text-only guard's coverage of T0051.000 on the image channel is structurally zero. Long-form treatment in FigStep detection, AgentTypo detector, and why every text-only scanner misses a 30-pixel PNG.
AML.T0051.000 (Direct) on the audio channel. The user speaks a jailbreak payload into a voice agent, or uploads a WAV file containing a WhisperInject-style carrier. STT decoding is lossy, and the carrier is engineered around that loss: it either survives decoding in a form the model still acts on or drops out of the transcript altogether, while still reaching any audio-aware model that consumes the waveform directly. This is direct-channel T0051 with audio bytes, and a text scanner run against the transcript never sees the original payload. See WhisperInject detection, audio prompt-injection detection, and building a PI scanner for voice agents.
AML.T0051.001 (Indirect) on the image channel. The user uploads a PDF; the loader extracts text cleanly; an image embedded in the PDF carries a typographic payload. Or the agent retrieves a third-party document containing image attachments. Or a tool call returns image bytes. The technique chain is T0051.001 — indirect channel, retrieved or third-party content — exactly within the catalog's "text or multimedia pulled from databases or websites" scope. Indirect carriers are the highest-leverage multimodal adversary surface because the user did not author the payload, so user-ID-based throttles do not apply. See indirect PI in images, RAG pipelines, and MCP servers.
AML.T0051.001 (Indirect) on the audio channel. A voice-RAG corpus carries WhisperInject payloads in the source audio; a transcript MCP server returns AudioContent that hosts a carrier; a deepfaked phone-call recording lands in the agent's tool-result channel. T0051.001 with audio. Same indirect-channel risk as image, different modality. The control surface is the same — inspect the bytes before they reach the audio-aware model.
AML.T0054 (Jailbreak) on the image and audio channels. A multimodal payload whose intent is to "place LLM in a state in which it will freely respond" (the catalog's own framing of T0054) can be a FigStep image whose contents are a jailbreak prompt rather than a tool-misuse prompt, a WhisperInject audio carrier of a refusal-bypass instruction, or an indirect-channel image-with-jailbreak retrieved by a RAG agent. T0054 is downstream of the T0051 inputs above; the catalog's definition makes that explicit. A red-team scope that probes T0054 only on the text channel reports a strict-subset finding.
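The direct/indirect split in the rows above depends on where the bytes came from, not on their modality. A minimal sketch of that routing rule, assuming a hypothetical set of source labels for the application's input surfaces; the labels and the `atlas_subtechnique` helper are illustrative, not an ATLAS artefact.

```python
# Hypothetical source labels for the input surfaces described above.
DIRECT_SOURCES = {"user_upload", "voice_input", "chat_text"}
INDIRECT_SOURCES = {
    "embedded_in_document",  # e.g. an image inside a user-uploaded PDF
    "rag_retrieval",
    "tool_result",
    "mcp_server",
    "screenshot_capture",
}
MODALITIES = {"text", "image", "audio"}


def atlas_subtechnique(source: str, modality: str) -> str:
    """Map an input surface to the ATLAS prompt-injection sub-technique it exposes.

    Modality is validated but does not change the ID: the catalog
    descriptions are silent on modality, so an image or audio input maps
    to the same sub-technique as text arriving from the same source.
    """
    if modality not in MODALITIES:
        raise ValueError(f"unknown modality: {modality}")
    if source in DIRECT_SOURCES:
        return "AML.T0051.000"
    if source in INDIRECT_SOURCES:
        return "AML.T0051.001"
    raise ValueError(f"unclassified source: {source}")


print(atlas_subtechnique("user_upload", "image"))  # AML.T0051.000
print(atlas_subtechnique("mcp_server", "audio"))   # AML.T0051.001
```

Raising on an unclassified source, rather than defaulting to one sub-technique, keeps a new input surface from silently landing in the wrong row of the finding catalog.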

The shape that recurs in every row above is the same: the technique applies on any byte stream the model parses, and the textual representation of that stream (OCR output, STT transcript) is a derivative — a derivative that adversarial-multimodal payloads are engineered to corrupt. A red-team engagement that scopes T0051 / T0054 on multimodal applications has to cover image and audio bytes directly, not just OCR / STT outputs. A runtime control on the same techniques has to inspect bytes too, on the same architectural argument.

Why a text-only red-team scope is incomplete for AML.T0051 / AML.T0054 on multimodal systems

ATLAS-aligned red teams typically deliver two artefacts: a scope document that enumerates the technique IDs in play, and a finding catalog that maps each finding back to those IDs. Both artefacts overstate coverage when modality is omitted from scope, and the argument takes two clean shapes.

The first is the scope-completeness argument. If the engagement contract says "AML.T0051 (LLM Prompt Injection), all sub-techniques" but the test corpus is text-only, the contract is being satisfied on a strict subset of the technique class. The catalog's verbatim description for T0051.001 names "multimedia pulled from databases or websites" inside the technique. A finding catalog that documents text-channel coverage of T0051.001 without exercising image or audio carriers is documenting a scope gap rather than a clean ATLAS coverage claim. The same shape applies to T0054: the catalog grounds it as "a carefully crafted LLM Prompt Injection," which inherits the modality silence. Reviewers reading the report against the catalog notice the gap, and remediation budget gets held against it.
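The scope-completeness argument reduces to a set difference over the technique-times-modality grid. A minimal sketch, assuming the contract's technique list and the production modality list are available as plain data; the modality list here is a hypothetical example, and a real one is enumerated from the system's input surfaces.

```python
from itertools import product

# Techniques named in the engagement contract (from the scope document).
CONTRACT_TECHNIQUES = ["AML.T0051.000", "AML.T0051.001", "AML.T0054"]

# Modalities the production system actually parses (hypothetical example).
PRODUCTION_MODALITIES = ["text", "image", "audio"]


def scope_gaps(tested_cells):
    """Return (technique, modality) cells in contract scope that no probe exercised."""
    required = set(product(CONTRACT_TECHNIQUES, PRODUCTION_MODALITIES))
    return sorted(required - set(tested_cells))


# A text-only corpus exercises one modality column of the grid.
text_only_corpus = [(t, "text") for t in CONTRACT_TECHNIQUES]
gaps = scope_gaps(text_only_corpus)
print(len(gaps))  # 6
print(gaps[0])    # ('AML.T0051.000', 'audio')
```

Listing the untested cells explicitly, rather than reporting a coverage percentage, gives the report the "document the gap" artefact a reviewer reads against the catalog.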

The second is the runtime-evidence argument. ATLAS findings frequently land on a recommendation to add or strengthen a runtime control: an inspection point that catches the technique on production traffic, not just in pre-deployment evaluation. Lakera Guard, LLM Guard, and Azure Prompt Shields are the three commonly recommended public defences for the text channel. None of them is documented to inspect image bytes for adversarial-input content (Azure's image moderation is a different control class, covered below). A recommendation that documents a runtime control for T0051 without naming a control on the multimodal channels of a multimodal application is documenting a remediation gap. Glyphward fills the recommendation slot for the image and audio channels: byte-level inspection at the inference boundary, with a per-request score, a modality-tagged reason, and a request ID that stays stable across the audit window.

Together, the two arguments read the way an ATT&CK-aligned report on a traditional system reads when a technique is "scoped only on Windows endpoints" while the production estate also runs Linux. The technique applies on every channel the production system consumes from; the report needs to cover every one of those channels or document the gap.

Adjacent compliance vocabularies converge on the same conclusion. OWASP LLM01:2025 explicitly addresses multimodal prompt injection. EU AI Act Article 15(5) names "adversarial examples or model evasion" without modality scope. NIST AI 600-1 names prompt injection inside the Information Security risk and is silent on modality. ATLAS is the red-team / threat-intel vocabulary; the other three are audit / compliance vocabularies. All four point to the same control architecture for the multimodal piece, each in its own terms. An organisation subject to all four (for example, a US-headquartered AI startup with EU customers running ATLAS-aligned red teams against an OWASP-mapped review) wants one evidence stream to satisfy all four frameworks at once, which is what the runtime control architecture below is designed to produce.

Coverage matrix against AML.T0051 / AML.T0054 on multimodal channels

The five public tools that recur in audit-prep comparisons recur in a red-team coverage matrix too, with a different evidence question: does the tool exercise or detect T0051 / T0054 on the modality named in the column? Eval-time and runtime tools draw the matrix differently, and the table makes that explicit.

Tool	T0051.000 / .001 — text	T0051.000 / .001 — image	T0051.000 / .001 — audio	T0054 multimodal	ATLAS multimodal evidence
Lakera Guard	Yes (runtime, text inputs)	No (text-only as of public documentation)	No	Text channel only	Partial: T0051 / T0054 documented on text channel; multimodal channels uncovered
LLM Guard (OSS)	Yes (runtime, text-only by design)	No	No	Text channel only	Partial: Python library, text-channel only
Azure Prompt Shields	Yes (Azure-gated)	Image moderation, not adversarial-input detection	No	Text channel only	Partial: image moderation is a different control class from T0051 detection; documenting it as T0051.000 image coverage is an evidence error
Promptfoo	Eval-time test harness with ATLAS-mapped probe templates	Eval-time test harness (multimodal probes if you author them)	Eval-time test harness	Eval-time test harness	Useful for pre-deployment testing (NIST AI RMF Measure-style) against T0051 / T0054; does not run on production traffic
Glyphward	Not a text scanner; run alongside one	Yes: bytes in, score and region out, request-ID-keyed audit trail	Yes: bytes in, score and time-window region out	Yes: covers the multimodal-input link in the T0051 → T0054 chain	Multimodal-channel runtime control for T0051 / T0054 with per-request evidence stream

The Azure Prompt Shields row deserves a careful read. Azure AI Content Safety, the service Prompt Shields belongs to, offers two distinct controls: image moderation, which classifies harm categories (hate, sexual, violence, self-harm), and prompt-injection / jailbreak detection, which runs on the text channel. Documenting the image-moderation control as T0051.000 image coverage is an evidence error. Image moderation classifies what the image depicts; AML.T0051 cares about whether the image carries an instruction vector for the model. An adversarial-glyph block can pass moderation as benign and still inject the model. This is the easiest evidence error for a red-team report to make and the easiest one for a careful reviewer to find. Long-form treatment in Azure Prompt Shields alternative (non-Azure) and vs Azure Prompt Shields.

The Promptfoo row is in the matrix because Promptfoo ships ATLAS-mapped probe templates and is a strong pre-deployment, Measure-flavoured tool. What Promptfoo is not, by construction, is a runtime control on a production request: it runs in CI, not on the inference path. The pragmatic production setup is Promptfoo at CI time, exercising the application's image and audio test corpus against Glyphward (the scanner under test) and the model behind it, with Glyphward at runtime gating each production request. See Promptfoo + multimodal scanning and vs Promptfoo for the YAML provider config that wires this up.

Architecture for an ATLAS-aligned multimodal red team plus runtime control

The architecture below has two layers — the red-team probe layer that exercises T0051 / T0054 on every modality the production system consumes from, and the runtime-control layer that gates production traffic against the same techniques. Each layer answers a different ATLAS evidence question; both layers feed the same finding catalog.

Red-team probe layer (pre-deployment)

Scope explicit on technique ID and modality. The engagement contract names AML.T0051 (parent), AML.T0051.000 (Direct), AML.T0051.001 (Indirect), and AML.T0054 (LLM Jailbreak) explicitly, and lists the modalities in scope (text, image, audio, indirect channels via tool results / RAG retrievals / MCP servers). The list of modalities is enumerated from the production system's actual input surfaces — what the model parses, not what the user typed.
Probe corpus per technique × modality. Build (or reuse) a probe corpus per cell of the technique-times-modality grid. For T0051.000 image, that means FigStep-class typographic payloads at the resolutions the production system actually accepts, AgentTypo-class adversarial glyphs, and image carriers that defeat the upstream OCR. For T0051.001 image, image-bearing PDFs, scanned originals, retrieved-document fixtures. For audio, WhisperInject-class carriers, silence steganography, multi-speaker overlays. For T0054 multimodal, payloads whose semantic intent is refusal bypass rather than tool misuse. Promptfoo's red-team mode is a useful starting point; multimodal probes generally require authoring on top of the base templates.
Run probes through the production inference path, not a stripped variant. The most common evidence error in a multimodal red-team report is running probes through a "text equivalent" of the application — the OCR output of the image fed to the text scanner, rather than the image bytes through the actual production handler. The probe's value is whether it lands on the production parser; running it on a derivative invalidates the finding shape.
Findings indexed to T0051 / T0054 with modality column. Each finding in the report is keyed by ATLAS technique ID, with a modality column. A jailbreak that lands via a user-uploaded image carrier is recorded as AML.T0051.000 → AML.T0054, modality=image. The structure makes the gap visible to any reviewer who reads against the catalog.
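The probe-layer steps above can be sketched as a small harness: run each probe's raw bytes through the production path, refuse to record a finding if the handler saw a derivative, and emit a row keyed by technique chain with a modality column. A minimal sketch, assuming a hypothetical `production_handler` hook that reports the bytes the model-facing code received and whether the payload landed; the `Probe` and `Finding` shapes are illustrative, not a Promptfoo or ATLAS format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    probe_id: str
    chain: tuple    # ATLAS IDs in order, e.g. ("AML.T0051.000", "AML.T0054")
    modality: str   # "text", "image", or "audio"
    payload: bytes  # the raw probe file, never its OCR or STT derivative


@dataclass(frozen=True)
class Finding:
    probe_id: str
    chain: str      # rendered chain, e.g. "AML.T0051.000 -> AML.T0054"
    modality: str   # the modality column reviewers read against the catalog
    landed: bool
    sha256: str     # digest of the exact bytes exercised, for the audit trail


def run_probe(probe: Probe, production_handler) -> Finding:
    """Send raw probe bytes through the production path and index the result.

    `production_handler(payload) -> (bytes_seen, landed)` is a hypothetical
    test hook: it returns the bytes the model-facing code actually received
    and whether the payload changed model behaviour.
    """
    bytes_seen, landed = production_handler(probe.payload)
    # Reject the "text equivalent" evidence error: if the handler saw
    # different bytes (e.g. OCR text), the finding would describe a derivative.
    if bytes_seen != probe.payload:
        raise RuntimeError(f"{probe.probe_id}: production path received derivative input")
    return Finding(
        probe_id=probe.probe_id,
        chain=" -> ".join(probe.chain),
        modality=probe.modality,
        landed=landed,
        sha256=hashlib.sha256(probe.payload).hexdigest(),
    )


def passthrough_handler(payload: bytes):
    """Hypothetical stand-in; a real harness calls the production upload handler."""
    return payload, True


probe = Probe("img-001", ("AML.T0051.000", "AML.T0054"), "image", b"\x89PNG\r\n")
finding = run_probe(probe, passthrough_handler)
print(finding.chain, finding.modality)  # AML.T0051.000 -> AML.T0054 image
```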

Runtime-control layer (production)

Mount on input bytes. Place the scanner on the inference boundary the model actually consumes from: the upload handler before the vision API for image, the audio buffer before STT or before the audio-aware model for audio, the loader middleware for RAG, the tool-result handler for MCP, and the screenshot capture for screen-reading agents. The mount point is what makes the runtime control answer the same ATLAS evidence question the red-team probe layer answered.
Score, modality-tag, and stable-ID every request. Return a 0–100 risk score, the flagged region (bounding box for image, time window for audio), and the modality-tagged reason. Persist a request ID. The score is the threshold-tunable engineering parameter; the request ID is the foreign key the SOC's existing incident-response stack joins on.
Differential trust by source. Indirect-channel inputs (T0051.001) get a tighter threshold than direct-channel inputs (T0051.000). User-uploaded content from a paying tenant gets a different threshold from anonymous-tier upload. Third-party tool-result content gets the tightest threshold of all. A single scanner endpoint with three different threshold bands documents three risk tiers cleanly, and the source-to-threshold policy is the auditable artefact a reviewer reads.
Quarantine and feedback to next probe corpus. A flagged input is quarantined to a request-ID-indexed queue rather than silently dropped. The security team triages, validates as true positive or false positive, and feeds true positives into the next probe-corpus rebuild. The feedback loop is the difference between a static red-team artefact and a living one. ATLAS's own framing — a "globally accessible, living knowledge base of adversary tactics and techniques" — is the same shape applied to the organisation's own technique catalog.
Evidence stream readable by ATLAS, OWASP, EU AI Act, and NIST audits in parallel. The same per-request scoring data, retained against stable request IDs and modality-tagged reasons, satisfies the runtime-control evidence question that ATLAS reviewers, OWASP-mapped audit reviewers, EU AI Act Article 15(5) conformity assessors, and NIST AI RMF Measure / Manage reviewers all ask. Building the evidence stream once and reading it against four vocabularies is the operational shape that makes a single multimodal scanner pay back across the red-team and the audit layers simultaneously.
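The score, differential-trust, and quarantine steps above can be sketched as one small gate. A minimal sketch, assuming the scanner has already returned a score, region, reason, and request ID; the source labels, threshold values, and in-memory quarantine dict are hypothetical placeholders (a production queue would be durable and indexed by request ID). An input is quarantined when its score meets or exceeds its source band's threshold, so a lower number is a tighter band.

```python
import uuid
from dataclasses import dataclass, field

# HYPOTHETICAL policy: quarantine when score >= threshold for the source band.
# A lower threshold is a tighter band. Values are placeholders to tune.
THRESHOLDS = {
    "tenant_upload": 80,     # direct channel, paying tenant
    "anonymous_upload": 60,  # direct channel, anonymous tier
    "third_party": 40,       # tool results, RAG retrievals, MCP servers
}
ATLAS_ID = {
    "tenant_upload": "AML.T0051.000",
    "anonymous_upload": "AML.T0051.000",
    "third_party": "AML.T0051.001",
}


@dataclass
class ScanResult:
    score: int     # 0-100 risk score from the scanner
    modality: str  # "image" or "audio"
    region: tuple  # (x, y, w, h) bounding box for image, (start_s, end_s) for audio
    reason: str    # modality-tagged reason string
    request_id: str = field(default_factory=lambda: str(uuid.uuid4()))


@dataclass
class Gate:
    # request_id -> (source, result); stands in for a durable, indexed queue
    quarantine: dict = field(default_factory=dict)

    def admit(self, source: str, result: ScanResult) -> bool:
        """Return True to forward the input to the model, False if quarantined."""
        if not 0 <= result.score <= 100:
            raise ValueError(f"score out of range: {result.score}")
        if result.score < THRESHOLDS[source]:
            return True
        # Flagged inputs are kept for triage rather than silently dropped.
        self.quarantine[result.request_id] = (source, result)
        return False

    def triage(self, request_id: str, true_positive: bool, corpus: list) -> None:
        """Resolve a quarantined item; confirmed true positives feed the next probe corpus."""
        source, result = self.quarantine.pop(request_id)
        if true_positive:
            corpus.append({
                "request_id": request_id,
                "atlas_id": ATLAS_ID[source],
                "modality": result.modality,
                "reason": result.reason,
            })


gate, corpus = Gate(), []
r = ScanResult(55, "image", (10, 20, 300, 40), "image: typographic instruction block")
print(gate.admit("tenant_upload", r))  # True: 55 is below the tenant band (80)
print(gate.admit("third_party", r))    # False: 55 meets the third-party band (40)
gate.triage(r.request_id, True, corpus)
print(corpus[0]["atlas_id"])           # AML.T0051.001
```

Keeping the policy as a plain source-to-threshold table, separate from the scoring code, is what lets it serve as the auditable artefact a reviewer reads.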

The byte-level scanning architecture this implements — CLIP image embedding plus a typographic head plus Tesseract OCR plus a curated payload corpus on the image side, and a waveform anomaly classifier plus a Whisper-small transcript filter on the audio side — is described end-to-end in the multimodal prompt-injection threat model for AI product teams (2026) blog post. The five-step playbook there maps onto both the red-team probe layer and the runtime-control layer of this architecture; the layer split above is the ATLAS-flavoured framing of the same engineering plan.

How Glyphward fits

Glyphward is the inference-time multimodal scanner (bytes in, score and region out) that slots into the runtime-control layer above. The HTTP contract is one POST per attachment carrying image bytes (or a URL) or audio bytes; the response is a 0–100 score, the flagged region (bounding box for image, time window for audio), the modality-tagged reason, and a stable request ID. The same contract is documented on the multimodal LLM security API page. Pricing is flat-rate self-serve at $29/mo Pro and $99/mo Team, with a free tier sized for prototyping and red-team probe development. Audit-friendly defaults: per-request retention with stable IDs, which is the granularity an ATLAS-aligned report's runtime-control recommendation expects.
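A client for the one-POST-per-attachment contract can be sketched with the standard library. This is a minimal sketch under loud assumptions: the endpoint URL, the bearer-token header, and the JSON field names (`score`, `region`, `reason`, `request_id`) are hypothetical placeholders, not Glyphward's documented API, and the URL-input variant is omitted.

```python
import json
import urllib.request

# HYPOTHETICAL endpoint, header, and field names: placeholders for illustration,
# not Glyphward's documented API. Check the API reference for the real contract.
SCAN_URL = "https://api.glyphward.example/v1/scan"


def build_scan_request(payload: bytes, content_type: str, api_key: str) -> urllib.request.Request:
    """One POST per attachment: raw image or audio bytes in the request body."""
    return urllib.request.Request(
        SCAN_URL,
        data=payload,
        method="POST",
        headers={"Content-Type": content_type, "Authorization": f"Bearer {api_key}"},
    )


def parse_scan_response(body: bytes) -> dict:
    """Keep only the four fields the audit trail needs (field names assumed)."""
    doc = json.loads(body)
    return {k: doc[k] for k in ("score", "region", "reason", "request_id")}


def scan(payload: bytes, content_type: str, api_key: str, timeout: float = 10.0) -> dict:
    """Send one attachment and return score, region, reason, and request ID."""
    req = build_scan_request(payload, content_type, api_key)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return parse_scan_response(resp.read())
```

Separating request construction and response parsing from the network call keeps both testable offline, and `request_id` is the field to persist as the join key for incident response.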

The integration is provider-agnostic. Whether the AI application calls Anthropic, OpenAI, Google Gemini, AWS Bedrock, or a self-hosted multimodal model, Glyphward reads the bytes — not the chat-completion API the bytes are about to flow into. That makes Glyphward a clean runtime-control answer to AML.T0051 / AML.T0054 on the multimodal channels rather than a vendor coupling. It is also what makes it stack cleanly with whatever text-side guard already covers the text channel.

For the red-team probe layer, Glyphward's scanner is exercisable from the same Promptfoo CI suite that already runs ATLAS-mapped text-channel probes. The application's image and audio test corpus runs through the scanner in CI; findings index by technique ID with a modality column; remediation work feeds back into the scanner's corpus on the next release. See Promptfoo + multimodal scanning for the YAML provider configuration and vs Promptfoo for the eval-time-versus-inference-time category distinction in detail.
