
NIST AI 100-2e2025 — multimodal prompt injection and jailbreak in the AML Taxonomy

NIST AI 100-2e2025 (Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations, finalised 24 March 2025) is the canonical NIST research report AppSec leads cite when they want stable terminology for the field of adversarial AI. It is the document the NIST glossary at csrc.nist.gov/glossary sources its prompt-injection, direct-prompt-injection, indirect-prompt-injection, and jailbreak entries from. Each of those four terms has a verbatim glossary definition. None of the four definitions names a modality. That silence is the operative point: an instruction rendered in pixels or in a waveform that diverts the model is the same taxonomy class as an instruction encoded in characters, just delivered through a channel a text-only control does not see. Here is the term-by-term mapping, the coverage matrix for five public tools against the AML Taxonomy evidence question on multimodal channels, and the runtime control architecture that closes that question on image and audio inputs.

TL;DR

NIST publishes two documents that name prompt injection. NIST AI 600-1 (the Generative AI Profile of the AI RMF) is the policy / risk-management framing — Govern, Map, Measure, Manage. NIST AI 100-2e2025 (the AML Taxonomy) is the research / terminology framing — stable definitions, a hierarchy of attack classes, an attack lifecycle. The two cite each other; together they are how a US-headquartered AI startup talks about adversarial inputs in a way a NIST reviewer recognises. The AML Taxonomy gives four glossary entries directly relevant to a multimodal LLM application: prompt injection (the parent class), direct prompt injection (the user-channel sub-type), indirect prompt injection (the resource-control sub-type), and jailbreak (a separate direct-prompting attack class targeting refusal behaviour). All four definitions are quoted verbatim below; all four are mode-agnostic. Multimodal prompt injection (FigStep-class typographic image payloads, AgentTypo-class adversarial glyphs, WhisperInject-class audio carriers, indirect carriers in retrieved images and audio) is the same four taxonomy classes delivered through pixels and waveforms. Glyphward sits as the inference-time multimodal scanner — bytes in, score and region out — that closes the gap, alongside whatever text-side control already covers the text channel.

What NIST AI 100-2e2025 actually is

The full title is Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations. The publishing identifier is NIST AI 100-2e2025; the final version was published on 24 March 2025; the canonical PDF is hosted at nvlpubs.nist.gov/nistpubs/ai/NIST.AI.100-2e2025.pdf with DOI 10.6028/NIST.AI.100-2e2025. It is the second edition of the AML Taxonomy NIST first published in 2023 (NIST AI 100-2e2023); the 2025 edition expands GenAI coverage substantially, including the indirect-prompt-injection sub-category, jailbreak and misuse classes, and updated mitigations.

The document's role in the NIST AI portfolio is the terminology role, by NIST's own framing. AI RMF 1.0 is the framework for managing AI risks. AI 600-1 (the GenAI Profile, July 2024) specialises that framework for generative AI systems and names prompt injection inside the Information Security risk. AI 100-2e2025 supplies the underlying definitions for the attack classes those framework documents reference — and is, importantly, the document the NIST CSRC glossary itself cites as the source for the canonical NIST definitions of prompt injection, direct prompt injection, indirect prompt injection, and jailbreak. An AppSec lead writing a security architecture document who needs the canonical NIST-sourced wording for those four terms cites AI 100-2e2025; an audit reviewer asking "what does NIST mean by prompt injection?" lands on AI 100-2e2025 via the glossary.

The document covers the full field of AML, not only GenAI. The taxonomy partitions attacks across two large families, Predictive AI (PredAI) and Generative AI (GenAI), with attack classes per family. PredAI gets evasion, poisoning, and privacy attacks; GenAI gets evasion, poisoning, privacy, and misuse, with prompt injection appearing as a class under the GenAI side of the taxonomy. The lifecycle dimension (training-stage attacks vs. inference-stage attacks) cross-cuts the family dimension, and prompt injection sits at the inference stage. Jailbreak is treated as a separate direct-prompting attack class with its own glossary entry; it is adjacent to prompt injection in the taxonomy rather than a sub-type of it. The four glossary entries below are the load-bearing artefact for an AppSec team or red team that wants to anchor terminology against the document.

The four verbatim AML Taxonomy definitions, quoted

These four glossary definitions are the canonical NIST wording. Each is published at csrc.nist.gov/glossary with its source line citing NIST AI 100-2e2025. Quoted verbatim:

"An attack which exploits the concatenation of untrusted input with a prompt constructed by a higher-trust party such as the application designer." — prompt injection (NIST AI 100-2e2025, via the NIST CSRC glossary)

"A direct prompting attack in which the attacker exploits prompt injection." — direct prompt injection (NIST AI 100-2e2025, via the NIST CSRC glossary)

"A type of prompt injection executed through resource control rather than through user-provided input as in a direct prompt injection." — indirect prompt injection (NIST AI 100-2e2025, via the NIST CSRC glossary)

"A direct prompting attack intended to circumvent restrictions placed on model outputs, such as circumventing refusal behaviour to enable misuse." — jailbreak (NIST AI 100-2e2025, via the NIST CSRC glossary)

Three observations make the four definitions operationally useful. The first is that the parent prompt injection definition is structural rather than payload-based — it describes a concatenation pattern between untrusted input and a higher-trust prompt, with no constraint on what form the untrusted input takes. The second is that the indirect-prompt-injection sub-type is defined by channel, not modality — the differentiator is "resource control rather than user-provided input," which captures any byte stream the model ingests via a third-party resource (a retrieved document, a tool result, a database row, an MCP tool response). The third is that jailbreak is taxonomically separate from prompt injection in this document — the jailbreak entry calls itself "a direct prompting attack" and names a different intent (circumventing restrictions on outputs to enable misuse). A multimodal application can be subject to all four of these classes simultaneously, and a control architecture that addresses only one of them documents only that one.

What the four definitions don't say — and why that's load-bearing for multimodal

Read the four definitions back-to-back: the words "image", "audio", "multimodal", "pixel", "waveform", and "vision" do not appear in any of them. That silence is the same shape as the modality silence on every other current authoritative reference — OWASP LLM01:2025 recognises a multimodal sub-category but the technique definition is mode-agnostic; EU AI Act Article 15(5) names "adversarial examples or model evasion" without modality scope; NIST AI 600-1 names prompt injection inside Information Security and is silent on modality; MITRE ATLAS AML.T0051 describes the technique as mode-agnostic and notes "multimedia" in the indirect sub-technique. AI 100-2e2025 is the same shape: the operative concept is the channel and the concatenation, not the bit pattern of the bytes.

The operationally useful read of that silence is the same across all five vocabularies. An instruction rendered as 30 pixels of typography that the vision encoder parses, an instruction encoded in the high-frequency band of a WAV that the audio encoder parses, a typographic glyph block embedded in a PDF page that the document loader extracts as an image, a deepfaked phone-call recording fed to a voice agent — each of these is a concatenation of untrusted input with a higher-trust prompt under the AML Taxonomy's parent definition. Whether it is direct or indirect depends on the channel, not the modality. Whether it is also a jailbreak depends on whether the intent is refusal-bypass.
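The channel-and-intent read above can be sketched as a small classifier. This is an illustrative reading of the glossary wording, not anything published by NIST: the `Channel` enum, the `UntrustedInput` record, and the `taxonomy_classes` function are hypothetical names, and treating jailbreak as purely intent-driven is a simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Channel(Enum):
    USER_INPUT = "user-provided input"   # direct channel
    RESOURCE = "resource control"        # indirect channel (RAG, tools, files)


@dataclass(frozen=True)
class UntrustedInput:
    modality: str          # "text", "image", "audio": recorded, never used to classify
    channel: Channel
    refusal_bypass: bool   # True when the intent is to circumvent output restrictions


def taxonomy_classes(item: UntrustedInput) -> list[str]:
    """Map an untrusted input to the glossary classes it can fall under."""
    # Parent class is structural, so it applies whatever the modality is.
    classes = ["prompt injection"]
    if item.channel is Channel.USER_INPUT:
        classes.append("direct prompt injection")
    else:
        classes.append("indirect prompt injection")
    if item.refusal_bypass:
        # Simplification: jailbreak is keyed on intent alone here.
        classes.append("jailbreak")
    return classes


image_upload = UntrustedInput("image", Channel.USER_INPUT, refusal_bypass=False)
typed_prompt = UntrustedInput("text", Channel.USER_INPUT, refusal_bypass=False)
print(taxonomy_classes(image_upload) == taxonomy_classes(typed_prompt))
```

The `modality` field is carried for evidence purposes only; the function never branches on it, which mirrors the modality silence of the four definitions.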

The reverse read — "AI 100-2e2025 doesn't literally say image PI, so multimodal PI is out of scope" — is the read a careful AppSec lead and a careful auditor will both reject. The taxonomy's terminology is structural; multimodal payloads satisfy the structural definition; the burden is on a system owner to argue why a multimodal payload that meets the structural definition is not in the taxonomy class, not on the taxonomy to enumerate every possible carrier. Engineering literature outside NIST is also already settled on this read: FigStep (arXiv:2311.05608), AgentTypo, WhisperInject (arXiv:2405.20653), and the Greshake et al. indirect-injection paper (arXiv:2302.12173) are all routinely cited in 2025–2026 security architecture documents as instances of the same prompt-injection / jailbreak class.

How the four taxonomy classes land on multimodal channels

Translating the AML Taxonomy's mode-agnostic terminology into a working control architecture means walking each of the four glossary classes through the production system's actual input channels and asking which ones the existing control covers. The mapping below is the operational read for an AI startup or scale-up running a multimodal LLM application.

Prompt injection (parent) on image bytes. The user uploads a PNG containing a FigStep-style typographic instruction block, or an AgentTypo-class adversarial-glyph payload that is engineered to defeat upstream OCR. The model receives the bytes; the text-only scanner upstream sees nothing because no text exists as characters before the model parses the image. The structural definition is satisfied: the image bytes are untrusted input, the system prompt is the higher-trust party's prompt, the concatenation happens at the model's multimodal context window. Long-form treatment in FigStep detection, AgentTypo detector, and why every text-only scanner misses a 30-pixel PNG.
Prompt injection (parent) on audio bytes. The user speaks a payload into a voice agent, or uploads a WAV file with a WhisperInject-style audio carrier. STT decoding is lossy: the carrier is engineered either to survive decoding or to stay out of the transcript altogether, while still reaching the audio-aware model. Same parent-class concatenation; same model surface; the text scanner against the transcript does not see the original payload because the original payload is bytes, not characters. See WhisperInject detection, audio prompt-injection detection, and building a PI scanner for voice agents.
Direct prompt injection on image and audio channels. The two cases above, when the adversary acts as a user of the application — that is, when the untrusted input arrives directly through the user-facing interface. The glossary definition of direct prompt injection is "a direct prompting attack in which the attacker exploits prompt injection," so the modality of the input is not a constraint of the definition; what makes it direct is the channel (user-provided input). User-uploaded images and user-spoken audio both satisfy this. The text-only Lakera Guard / LLM Guard / Azure Prompt Shields control layer is structurally blind to the bit pattern of these inputs.
Indirect prompt injection on image and audio channels. The user uploads a PDF; the loader extracts text cleanly; an image embedded in the PDF carries a typographic payload. Or the agent retrieves a third-party document containing image attachments. Or a voice-RAG corpus contains audio with a WhisperInject carrier. Or an MCP tool result contains image bytes from a community server. The glossary definition pivots on "resource control rather than user-provided input," which is exactly the channel shape of every one of these cases. Indirect carriers are the highest-leverage adversary surface because the user did not author the payload, so user-ID-based throttles do not apply. See indirect PI in images, RAG pipelines, and MCP servers.
Jailbreak on image and audio channels. A multimodal payload whose intent is to "circumvent restrictions placed on model outputs" — the glossary's wording — can be a FigStep image whose contents are a refusal-bypass instruction rather than a tool-misuse instruction. Or a WhisperInject audio carrier of a refusal-bypass payload. The taxonomy treats jailbreak as a separate class from prompt injection, so a control architecture that scopes only "prompt injection" can miss jailbreak findings on the same input channels and vice versa. The pragmatic engineering posture is that the input-side detection problem is the same shape (untrusted bytes that carry an instruction-vector for the model), even though the output-side intent and the taxonomy class differ.

The shape that recurs across all five rows is the same: the parent definition's concatenation pattern is satisfied by any byte stream the model parses, and the textual representation of that stream (OCR output, STT transcript) is a derivative — a derivative that adversarial-multimodal payloads are engineered to corrupt. A control architecture keyed to AML Taxonomy classes on multimodal applications has to inspect bytes directly, not just OCR / STT outputs.
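A toy sketch makes the blind spot concrete. The request shape (a list of typed parts) and both function names are assumptions made for illustration, and the phrase match is a deliberately naive stand-in for a real text-channel guard; the image bytes are a placeholder, not a real PNG.

```python
def text_only_scan(parts: list[dict]) -> list[str]:
    """Toy text-channel control: flags text parts containing a known phrase."""
    findings = []
    for i, part in enumerate(parts):
        if part["type"] == "text" and "ignore previous instructions" in part["data"].lower():
            findings.append(f"part {i}: instruction phrase in text")
    return findings


def uninspected_parts(parts: list[dict]) -> list[int]:
    """Indices of parts the text-only control never looked at."""
    return [i for i, part in enumerate(parts) if part["type"] != "text"]


request = [
    {"type": "text", "data": "Please summarise the attached picture."},
    # Placeholder bytes standing in for a PNG whose payload exists only as pixels.
    {"type": "image", "data": b"\x89PNG...payload rendered as pixels..."},
]

print(text_only_scan(request))     # [] because the text part is benign
print(uninspected_parts(request))  # [1] because the image part was never inspected
```

The empty findings list is what a "clean" verdict from a text-only control looks like on this request; the second list is the part of the input that verdict says nothing about.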

Why a text-only AML control documents only a subset of the taxonomy classes

An AppSec lead presenting a security architecture document keyed to the AML Taxonomy typically delivers two artefacts: a class-coverage matrix that names each taxonomy class against the application's threat surface, and a control-evidence stream that demonstrates, per request, that the named control was exercised. Both artefacts compress badly when the application is multimodal and the control layer is text-only.

The first failure mode is the class-coverage matrix. If the matrix names "prompt injection (parent), direct, indirect, jailbreak" as in-scope classes for the application and the control inventory is text-only, the matrix is making a coverage claim on a strict subset of each named class. The parent definition is structural and doesn't constrain the input modality; a text-only control covers the text-channel instances of the structural definition and is silent on the image-channel and audio-channel instances. A reviewer reading the matrix against the four glossary definitions notices the gap immediately, because nothing in the four definitions allows a modality-scoped reading.
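One way to surface this gap before a reviewer does is to compute the matrix cells no control touches. The sketch below assumes a simple representation in which each control declares the (class, modality) cells it inspects; the names `CLASSES`, `MODALITIES`, and `coverage_gaps` are hypothetical.

```python
from itertools import product

CLASSES = [
    "prompt injection",
    "direct prompt injection",
    "indirect prompt injection",
    "jailbreak",
]
MODALITIES = ["text", "image", "audio"]


def coverage_gaps(in_scope_classes, app_modalities, controls):
    """Return the (class, modality) cells that no declared control inspects.

    controls maps a control name to the set of (class, modality) cells it covers.
    """
    covered = set().union(*controls.values())
    return [
        cell
        for cell in product(in_scope_classes, app_modalities)
        if cell not in covered
    ]


# A text-only guard covers every class, but only on the text channel.
text_guard = {(cls, "text") for cls in CLASSES}
gaps = coverage_gaps(CLASSES, MODALITIES, {"text guard": text_guard})
for cls, modality in gaps:
    print(f"uncovered: {cls} on {modality}")
```

With a text-only inventory and a three-modality application, every named class shows up twice in the gap list, once for image and once for audio.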

The second failure mode is the control-evidence stream. AML Taxonomy reviews, and the audit programmes that cite them, converge on the same evidence question: per request, did the named control inspect the input that satisfies the taxonomy class? For a multimodal application that means per-request inspection on the image channel and per-request inspection on the audio channel, with retention long enough to support the audit window. A text-only inspection point does not produce that evidence stream for the multimodal channels. Pointing at upload-time content moderation as if it were the answer is the second-most-common evidence error, after the modality-scoping read above: Azure's image moderation classifies what an image depicts, whereas the AML Taxonomy classes care about whether the image carries an instruction-vector for the model, which is a different question (see the Azure Prompt Shields comparison).
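The evidence question can be checked mechanically. The following is a minimal sketch under assumed data shapes: a map from request ID to the modalities of its attachments, and a scan log of dictionaries with `request_id` and `modality` keys. None of these names come from a real product or from NIST.

```python
def evidence_gaps(requests: dict, scan_log: list) -> list:
    """List (request_id, modality) pairs that have no scan record.

    requests: {request_id: [modality of each attachment]}
    scan_log: [{"request_id": ..., "modality": ...}, ...]
    """
    inspected = {(rec["request_id"], rec["modality"]) for rec in scan_log}
    return [
        (request_id, modality)
        for request_id, modalities in requests.items()
        for modality in modalities
        if (request_id, modality) not in inspected
    ]


def retention_ok(retention_days: int, audit_window_days: int) -> bool:
    """Evidence only helps if it is kept at least as long as the audit window."""
    return retention_days >= audit_window_days


requests = {"req-1": ["image"], "req-2": ["image", "audio"]}
scan_log = [
    {"request_id": "req-1", "modality": "image"},
    {"request_id": "req-2", "modality": "image"},
]
print(evidence_gaps(requests, scan_log))  # [('req-2', 'audio')]
print(retention_ok(retention_days=90, audit_window_days=365))  # False
```

An empty gap list plus a passing retention check is the shape of answer the evidence question is looking for; anything else identifies the exact request and channel that went uninspected.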

The shape of the gap is the same as the gap visible in every other vocabulary in the multimodal-PI cluster. OWASP LLM01:2025 names a multimodal sub-category and asks for a control. EU AI Act Article 15(5) binds high-risk providers to prevent / detect / respond / resolve / control adversarial-input attacks. NIST AI 600-1 names prompt injection in Information Security. MITRE ATLAS catalogues the technique IDs. AI 100-2e2025 supplies the underlying terminology that every one of those documents leans on. An organisation subject to all five wants the same evidence stream readable against all five — which is the runtime control architecture below.

Coverage matrix against the four AML Taxonomy classes on multimodal channels

The same five public tools that recur across the multimodal-PI cluster recur in an AML-Taxonomy coverage matrix, with the evidence question keyed to the four glossary classes. The question per cell: does the tool inspect bytes on the modality named in the column for the taxonomy class named in the row, in a way that produces a per-request evidence stream?

Tool	Prompt injection — image	Prompt injection — audio	Indirect PI — image / audio	Jailbreak — image / audio	AML Taxonomy multimodal evidence
Lakera Guard	No (text-only as of public coverage)	No	No	Text channel only	Partial — covers text-channel instances of the taxonomy classes; multimodal channels uncovered
LLM Guard (OSS)	No (text-only by design)	No	No	Text channel only	Partial — Python library, text-channel only
Azure Prompt Shields	Image moderation, not adversarial-input detection	No	No	Text channel only	Partial — image moderation is a different attack class from prompt injection in the AML Taxonomy; documenting it as image-PI coverage is an evidence error
Promptfoo	Eval-time test harness (multimodal probes if you author them)	Eval-time test harness	Eval-time test harness	Eval-time test harness	Useful for pre-deployment evaluation against the taxonomy classes; does not run on production traffic
Glyphward	Yes — bytes in, score and region out, request-ID-keyed audit trail	Yes — bytes in, score and time-window region out	Yes — same scanner endpoint with source-trust threshold band	Yes — input-side detection covers the input link of any PI-then-jailbreak chain	Multimodal-channel runtime control with per-request evidence stream readable against all four taxonomy classes

The Azure Prompt Shields row deserves the same careful read it gets in the other compliance-mapped pages. The product offers two distinct controls: image moderation (which classifies content categories — NSFW, violence, hate) and prompt-injection / jailbreak detection (text-channel). Documenting the image-moderation control as image-PI coverage under the AML Taxonomy is an evidence error: image moderation is the answer to a different attack class entirely. An adversarial-glyph block can be benign for moderation and still inject the model. Long-form treatment in Azure Prompt Shields alternative (non-Azure) and vs Azure Prompt Shields.

The Promptfoo row is in the matrix for the same reason it is in the other compliance-mapped pages: Promptfoo ships strong evaluation harnesses against prompt injection and jailbreak, with multimodal probe support if you author the corpus, and it is the right tool for the pre-deployment evaluation step in an AML Taxonomy review. What Promptfoo is not, by construction, is a runtime control on a production request: it runs in CI, not on the inference path. The pragmatic production setup is Promptfoo at CI time exercising the application's image and audio test corpus against Glyphward (the scanner under test) and against the model behind it, and Glyphward at runtime gating the production request. See Promptfoo + multimodal scanning and vs Promptfoo for the YAML provider config that wires this up.

Architecture for a multimodal control mapped to the AML Taxonomy

The architecture below produces an evidence stream readable against the four glossary classes. Each step maps to a question an AML Taxonomy review asks; each answer is byte-level, request-ID-keyed, and modality-tagged, so the same retained data satisfies parallel reviews against OWASP, EU AI Act, NIST AI 600-1, and MITRE ATLAS without a rebuild.

Inspect bytes at the inference boundary. Mount the scanner where the model actually consumes bytes: the upload handler before the vision API for image, the audio buffer before STT or before the audio-aware model for audio, the loader middleware for RAG, the tool-result handler for MCP, the screenshot-capture path for screen-reading agents. The mount point is what makes the runtime control answer the same evidence question the AML Taxonomy classes are framed against (a structural concatenation between untrusted bytes and a higher-trust prompt) rather than a derivative question about the OCR or STT output of those bytes.
Score, modality-tag, and stable-ID every request. Return a 0–100 risk score, the flagged region (bounding box for image, time window for audio), and a modality-tagged reason. Persist a request ID. The score is the threshold-tunable engineering parameter; the reason is the AML-Taxonomy-class hint (e.g. "typographic instruction block in image" maps to direct PI on image; "image bytes from third-party tool result" maps to indirect PI on image; "refusal-bypass typographic instruction" maps to jailbreak on image); the request ID is the foreign key the SOC's existing incident-response stack joins on.
Source-trust threshold bands. Indirect-channel inputs (the AML Taxonomy's "resource control" sub-type) get a tighter threshold than direct-channel inputs (the user-provided-input sub-type). User-uploaded content from a paying tenant gets a different threshold from anonymous-tier upload. Third-party tool-result content gets the tightest threshold of all. A single scanner endpoint with three different threshold bands documents three risk tiers cleanly, and the source-to-threshold policy is the auditable artefact a reviewer reads. The sub-type distinction in the taxonomy is what justifies the policy structurally — the AML Taxonomy explicitly differentiates direct from indirect by channel, and the threshold policy mirrors that.
Quarantine and corpus feedback. A flagged input is quarantined to a request-ID-indexed queue rather than silently dropped. The security team triages, validates as true positive or false positive, and feeds true positives into the next probe-corpus rebuild. The feedback loop is the difference between a static AML control and a living one. The taxonomy's framing of attacks as a hierarchy with new sub-types added across editions (the 2025 edition expanded coverage substantially over the 2023 edition) is the same shape applied to the organisation's own corpus — new attack patterns enter the corpus as they are observed in production traffic.
Evidence stream readable against all five vocabularies in parallel. The same per-request scoring data, retained against stable request IDs and modality-tagged reasons, satisfies the runtime-control evidence question that AML Taxonomy reviewers, OWASP LLM01:2025-mapped audit reviewers, EU AI Act Article 15(5) conformity assessors, NIST AI RMF Measure / Manage reviewers, and MITRE ATLAS-aligned red-team report readers all ask. Building the evidence stream once and reading it against five vocabularies is the operational shape that makes a single multimodal scanner pay back across every standards-mapping document on the desk simultaneously, rather than producing parallel evidence streams that diverge over time.
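The scoring, threshold-band, and quarantine steps above can be sketched together. Everything here is illustrative: the `ScanResult` fields follow the prose (score, region, reason, request ID), but the field names, the source labels, and the threshold numbers are assumptions, not recommended values.

```python
from collections import deque
from dataclasses import dataclass

# Illustrative bands only. A lower threshold is a tighter gate.
THRESHOLDS = {
    "paying_tenant_upload": 70,     # direct channel, higher-trust tenant
    "anonymous_upload": 50,         # direct channel, lower-trust tenant
    "third_party_tool_result": 30,  # indirect channel, tightest band
}


@dataclass(frozen=True)
class ScanResult:
    request_id: str  # stable ID the incident-response stack joins on
    modality: str    # "image" or "audio"
    score: int       # 0-100 risk score
    region: tuple    # bounding box for image, time window for audio
    reason: str      # modality-tagged hint toward a taxonomy class


quarantine = deque()  # request-ID-indexed triage queue (in-memory stand-in)


def gate(result: ScanResult, source: str) -> str:
    """Apply the source-trust band; quarantine rather than silently drop."""
    if result.score >= THRESHOLDS[source]:
        quarantine.append((result.request_id, source, result))
        return "quarantine"
    return "allow"


result = ScanResult("req-42", "image", 55, (10, 10, 200, 40),
                    "typographic instruction block in image")
for source in THRESHOLDS:
    print(source, gate(result, source))
```

The same score of 55 is allowed for a paying tenant's upload and quarantined on the other two bands, so the source-to-threshold table, not the scanner, is where the risk-tier policy lives and where a reviewer would read it.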

The byte-level scanning architecture this implements — CLIP image embedding plus a typographic head plus Tesseract OCR plus a curated payload corpus on the image side, and a waveform anomaly classifier plus a Whisper-small transcript filter on the audio side — is described end-to-end in the multimodal prompt-injection threat model for AI product teams (2026) blog post. The five-step playbook there maps onto the runtime-control architecture above; the AML-Taxonomy-flavoured framing on this page is the same engineering plan keyed to the four glossary classes.

How Glyphward fits

Glyphward is the inference-time multimodal scanner (bytes in, score and region out) that slots into the runtime-control layer above. The HTTP contract is one POST per attachment: image bytes (or URL) or audio bytes; the response is a 0–100 score, the flagged region (bounding box for image, time window for audio), the modality-tagged reason, and a stable request ID. The same contract is described on the multimodal LLM security API page; pricing is flat-rate self-serve at $29/mo Pro and $99/mo Team, with a free tier sized for prototyping and pre-deployment evaluation (see the free-tier API page). Audit-friendly defaults: per-request retention with stable IDs is the granularity an AML-Taxonomy-keyed control-evidence stream expects.
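A client for a one-POST-per-attachment contract of this kind might look like the sketch below. The endpoint URL, the bearer-token header, and the JSON field names (`request_id`, `score`, `region`, `reason`) are all placeholders invented for illustration; the real API surface may differ, so treat this as the shape of the exchange rather than its specification. The code builds the request and parses a response body without sending anything over the network.

```python
import json
import urllib.request
from dataclasses import dataclass

# Hypothetical endpoint, for illustration only.
SCAN_URL = "https://scanner.example.invalid/v1/scan"


def build_scan_request(data: bytes, content_type: str, api_key: str) -> urllib.request.Request:
    """Build one POST for one attachment (image or audio bytes)."""
    return urllib.request.Request(
        SCAN_URL,
        data=data,
        method="POST",
        headers={
            "Content-Type": content_type,
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
    )


@dataclass(frozen=True)
class ScanResponse:
    request_id: str
    score: int
    region: dict
    reason: str


def parse_scan_response(body: bytes) -> ScanResponse:
    """Parse the assumed JSON response and validate the 0-100 score range."""
    payload = json.loads(body)
    score = int(payload["score"])
    if not 0 <= score <= 100:
        raise ValueError(f"score out of range: {score}")
    return ScanResponse(
        request_id=str(payload["request_id"]),
        score=score,
        region=dict(payload["region"]),
        reason=str(payload["reason"]),
    )


req = build_scan_request(b"\x89PNG placeholder", "image/png", "test-key")
sample_body = json.dumps({
    "request_id": "req-42",
    "score": 87,
    "region": {"x": 12, "y": 30, "w": 220, "h": 48},
    "reason": "typographic instruction block in image",
}).encode()
print(req.get_method(), parse_scan_response(sample_body).score)
```

Sending the request would be a call to `urllib.request.urlopen(req)` against a real endpoint; it is left out here so the sketch stays runnable offline.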

The integration is provider-agnostic. Whether the AI application calls Anthropic, OpenAI, Google Gemini, AWS Bedrock, or a self-hosted multimodal model, Glyphward reads the bytes — not the chat-completion API the bytes are about to flow into. That makes Glyphward a clean runtime answer to the AML Taxonomy classes on the multimodal channels rather than a vendor coupling. It is also what makes it stack cleanly with whatever text-side guard already covers the text channel — the run-both pattern recommended on every comparison page (Lakera, LLM Guard, Azure Prompt Shields, Promptfoo) is the same architectural shape an AML-Taxonomy review tends to recommend: text control on text bytes, multimodal control on image and audio bytes, evidence stream from both keyed to the same request ID.
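The run-both pattern comes down to joining two evidence streams on a shared request ID. A minimal sketch, assuming each control emits records as dictionaries carrying a `request_id` key (the record fields and the `merge_evidence` name are hypothetical):

```python
def merge_evidence(text_records: list, multimodal_records: list) -> dict:
    """Group evidence from both controls under the shared request ID."""
    merged = {}
    streams = (("text", text_records), ("multimodal", multimodal_records))
    for control, records in streams:
        for record in records:
            entry = {**record, "control": control}  # tag which control produced it
            merged.setdefault(record["request_id"], []).append(entry)
    return merged


text_records = [{"request_id": "req-7", "modality": "text", "score": 5}]
multimodal_records = [
    {"request_id": "req-7", "modality": "image", "score": 91},
    {"request_id": "req-8", "modality": "audio", "score": 12},
]

evidence = merge_evidence(text_records, multimodal_records)
for request_id, entries in evidence.items():
    print(request_id, [(e["control"], e["modality"], e["score"]) for e in entries])
```

In this example `req-7` looks clean on the text channel and risky on the image channel, which is the case a single-stream view would miss.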

For the pre-deployment evaluation layer, Glyphward's scanner is exercisable from the same Promptfoo CI suite that exercises text-channel probes against the application. The image and audio test corpus runs through the scanner in CI; results land against the same AML Taxonomy classes the runtime control will produce evidence against; the corpus shrinks the false-confidence gap between "passed CI" and "passed in production" by exercising the same scanner against the same byte streams. See Promptfoo + multimodal scanning for the YAML provider configuration.
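Independently of any particular CI harness, the measurement that step produces can be sketched in plain Python. The `stub_scanner` below is a stand-in that scores on a marker byte string and has nothing to do with how a real scanner works; the corpus format (bytes plus a ground-truth label) and the `evaluate` function are likewise assumptions.

```python
def stub_scanner(sample: bytes) -> int:
    """Stand-in scanner returning a 0-100 score. A real one would inspect pixels or audio."""
    return 90 if b"PAYLOAD" in sample else 10


def evaluate(scanner, corpus, threshold: int) -> dict:
    """Run a labelled corpus through a scanner and summarise the outcome.

    corpus: iterable of (sample_bytes, is_attack) pairs.
    """
    tp = fp = fn = tn = 0
    for sample, is_attack in corpus:
        flagged = scanner(sample) >= threshold
        if is_attack and flagged:
            tp += 1
        elif is_attack:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return {
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


corpus = [
    (b"image-bytes PAYLOAD", True),
    (b"audio-bytes PAYLOAD", True),
    (b"benign holiday photo", False),
    (b"benign voice memo", False),
]
print(evaluate(stub_scanner, corpus, threshold=50))
```

Running the same `evaluate` call in CI and against sampled production bytes, with the same threshold, is one way to keep the two numbers comparable.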
