Blog · Attack Deep-Dives · 2026-06-11
FigStep, AgentTypo, WhisperInject — the three multimodal prompt injection attacks every text scanner gets wrong
Three attacks — FigStep, AgentTypo, and WhisperInject — define the outer boundary of what text-only prompt injection scanners can see. They do not exploit a tuning failure or a blocklist gap. They exploit the input channel itself: each attack delivers its payload in the pixel array or the audio waveform, a layer that every text scanner ever built has never inspected. Understanding exactly how each attack works — and why the defensive failure is structural, not incidental — is the prerequisite for building a scanner that actually covers them.
TL;DR
FigStep renders injection instructions as glyphs inside an image that OCR misreads or skips; AgentTypo distorts characters so OCR produces a benign string while the VLM reads the toxic original; WhisperInject hides commands in audio at frequencies or segments that Whisper discards before its transcript reaches any text scanner. All three attacks share the same root cause: text-only scanners have no visibility into the non-text channel where the payload lives. Defending against all three requires a scanner that operates on raw image bytes and raw audio bytes before any preprocessing step runs. The dedicated scanner pages are FigStep detection, AgentTypo detector, and WhisperInject detection; this post is the technical argument for why text scanners miss all three and what the detection architecture has to look like.
1. FigStep: instructions the OCR layer never decodes
FigStep was documented in the 2023 paper "FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts" (Gong et al., arXiv:2311.05608). The attack exploits a structural asymmetry between how OCR systems and vision-language models read images containing text.
The mechanism
A FigStep payload is a prompt-injection instruction rendered as text inside an image. The key design decision is font selection: the authors showed that instructions rendered in stylised, decorative, or intentionally OCR-hostile fonts — heavy serifs at small point sizes, outlined text over patterned backgrounds, handwritten typefaces, or text embedded inside tables and charts — are decoded correctly by a VLM's visual encoder but returned as garbage or empty string by standard OCR tools (Tesseract, Google Cloud Vision OCR, AWS Textract).
The asymmetry exists because OCR systems are trained to recognise specific glyph shapes under the assumption that text should be machine-readable. VLMs are trained on the much broader distribution of how text appears across the entire visual web — advertisements, handwritten notes, stylised graphics, memes, screenshots, diagrams — so their visual encoders have substantially higher tolerance for degraded or stylised text. FigStep exploits this gap: the attack font is chosen to be outside the OCR training distribution while remaining inside the VLM training distribution.
How it bypasses text scanners
The typical text-only PI scanner sits at the application layer and scans the string content of user messages and LLM outputs. In a multimodal chatbot that accepts image uploads, the user message received by the scanner contains: the user's text query (usually benign — "what does this image say?" or "explain this diagram"), plus a reference to the uploaded image. The scanner runs its classifier on the text and returns a low risk score. The image — containing the actual payload — is passed to the VLM unchanged. The VLM reads the rendered instruction, infers the attacker's intent, and complies.
There is no text anywhere in the pipeline that the scanner could have caught. The payload was never in the text layer. This is the structural problem: text scanners are blind to image-domain attacks not because they are under-tuned but because they have no input coverage of the channel the attack uses.
What detection requires
Detecting FigStep requires a scanner that receives the raw image bytes before the VLM call and runs a classifier that can identify: (a) the presence of text inside an image, independent of whether OCR recovers it; (b) semantic embedding of that text content even when rendered in OCR-hostile fonts; and (c) the injection-instruction intent of the recovered content. The detection pipeline is: image bytes → visual encoder → CLIP embedding → semantic similarity to known injection instruction patterns. This is distinct from OCR-based detection and is specifically not defeated by font selection. The full technical description of what Glyphward's scanner does for this attack is at FigStep detection.
2. AgentTypo: typography that survives OCR as a benign string
AgentTypo (documented in "AgentTypo: Adversarial Typographic Attacks for Multimodal LLM Agents", 2024) addresses a different point in the pipeline. While FigStep hides from OCR, AgentTypo specifically exploits OCR-then-text-scan defences — the most common workaround teams implement after reading about FigStep.
The mechanism
An AgentTypo payload applies character-level distortions to a prompt injection instruction that are calibrated along two axes simultaneously: (1) the distorted characters should OCR to benign or irrelevant strings, so that any text scanner running on OCR output sees nothing of concern; (2) the distorted characters should remain visually close enough to the originals that a VLM processing the image pixel-by-pixel reads the intended instruction with high probability.
Concretely: the attack may replace a toxic instruction like "ignore the above" with a visually identical string where the letter "i" is replaced with a Unicode lookalike, where letters are spaced with invisible zero-width characters, and where critical words have subtle stroke modifications that OCR classifiers map to different characters while VLMs — whose visual encoders process pixel features at a lower level of abstraction — recover the semantically intended character. The OCR output of the distorted image is a garbled, benign string. The VLM output reflects the intended instruction.
Why this defeats OCR-pipeline defences
The OCR-then-text-scan workaround fails against AgentTypo because it assumes that if OCR can recover the text, a text scanner can catch the payload — and if OCR cannot recover it, the payload cannot be executed. AgentTypo proves both halves of this assumption wrong: it produces distortions where OCR recovers a benign string (defeating the first case) while the VLM executes the payload (defeating the second case). The two-step pipeline has a gap at exactly the point the attack targets.
This matters especially for systems that process documents: PDFs, screenshots of application state, scanned contracts, invoices with embedded instructions. These inputs often pass through an OCR-then-text pipeline as a cost-saving measure. AgentTypo can embed injection instructions in a document that passes every OCR-based filter in the pipeline and is executed by the VLM the first time it reads the document directly. The AgentTypo detector page details the specific document-processing pipeline variants and the scanning coverage needed for each.
The agent-specific risk
AgentTypo's name reflects its primary risk surface: agents. A document-reading agent — one that summarises contracts, extracts data from invoices, processes screenshots to take actions — reads its inputs with a VLM. If a malicious document is planted in its input stream (via email attachment, web browsing, a file system the agent has access to), an AgentTypo payload can redirect the agent's tool calls, exfiltrate data it has access to, or instruct it to take actions the operator never intended. Because the distorted text reads as benign to every text-based filter, it passes routine input validation with no flag. The agent executes the instruction, and the audit log contains only the benign OCR output — creating a forensic dead end. The general taxonomy of how agents extend the injection attack surface is in agentic RAG pipeline prompt injection.
3. WhisperInject: the payload Whisper drops before your scanner sees it
WhisperInject targets voice agents — systems that accept audio input, run it through a speech-to-text model (predominantly OpenAI's Whisper or a fine-tuned derivative), and feed the transcript to an LLM for processing. The attack embeds adversarial content in the audio that the STT model's transcript omits but that other components of the voice pipeline may process independently.
The gap between audio and transcript
Whisper and its open-source forks discard significant portions of the audio input before producing a transcript. Voice activity detection (VAD) segments filter audio windows below an energy threshold — typically anything below −50 dBFS per 30ms frame. Language identification models route non-primary-language segments to lower-priority processing. Disfluency removal strips filler sounds and repeated words. Hallucination-suppression thresholds cause Whisper to output empty transcripts or repeated periods for audio it classifies as noise, music, or silence.
Each of these filters defines a channel that exists in the audio but is invisible in the transcript. An attacker who understands the filter parameters can encode a prompt injection instruction in audio content that these filters classify as discardable — and the instruction never appears in the transcript that the text PI scanner reads. From the text scanner's perspective, the audio contained benign or empty speech. From the voice pipeline's perspective, the audio triggered processing paths that the text scanner had no visibility into.
Attack variants
Below-VAD commands. Instructions recorded at amplitude slightly below Whisper's VAD threshold are not transcribed but can be decoded by a speaker-diarisation model or a custom wake-word detector running on the raw waveform. In voice agents that use diarisation to separate speaker turns, a below-VAD instruction can be attributed to a speaker role (e.g. "system" or "supervisor") that receives elevated trust in the downstream LLM prompt.
Ultrasonic overlays. Frequencies above 16 kHz are outside human hearing range and outside Whisper's training distribution for speech. Whisper truncates audio to 16 kHz for transcription (its mel-spectrogram covers 0–8 kHz). Some voice pipeline components that process the full-resolution audio — particularly those using raw waveform models (WaveNet derivatives, audio fingerprinting, wake-word detection running on 44.1 kHz audio) — can decode instructions encoded at ultrasonic frequencies that the STT layer never transcribes.
Reversed-speech and temporal injection. Content encoded in reversed audio, or content inserted during pauses longer than Whisper's silence-collapse threshold (Whisper collapses silences longer than approximately 1 second in transcription output), generates audio content that the waveform processor sees but the transcript omits. Temporal injection places a command in a segment that appears after the last transcript token — in the "tail" of the audio — which some pipeline implementations pass to the raw-audio processing stage after transcription is complete.
Why the text scanner is structurally blind
A text PI scanner running on the Whisper transcript receives the text that survived all of Whisper's filters. The injected content was specifically designed to be removed by those filters. The scanner has correct input coverage of what Whisper transcribed; it has zero coverage of what Whisper discarded. This is the same structural blindness as for images: the attack is in the channel the scanner never reads. The technical deep-dive on building waveform-layer scanning for voice agents is in building a prompt-injection scanner for voice agents. The scanner page for this specific attack class is at WhisperInject detection and audio prompt-injection detection.
4. The shared root cause: one scanner cannot cover three modalities if it only reads one
FigStep, AgentTypo, and WhisperInject look different on the surface: different input types, different attack mechanisms, different OCR-bypass techniques. At the root, they share a single structural property — the attacker's payload arrives in a non-text input channel, and the defender's text scanner reads only the text channel.
This is not a gap that can be closed with a bigger training corpus, a more sophisticated semantic classifier, or a more comprehensive prompt injection pattern library. Lakera Guard, LLM Guard, Azure Prompt Shields, and Promptfoo are architecturally text-first systems: they receive strings and classify them. They can be expanded to accept more strings with higher coverage — but they cannot scan pixels or waveforms by expanding their string classification. The gap is scope, not sensitivity.
The implication for a defender is stark: if your multimodal application (chatbot with image upload, document agent, voice assistant, screenshot-reading agent) uses only a text PI scanner, you have zero detection coverage for FigStep, AgentTypo, and WhisperInject. Your scanner is not covering the vast majority of your attack surface; it is covering the fraction that runs through the text layer, while three named attacks route around it entirely.
The broader argument — including the full taxonomy of image-domain attacks beyond the FigStep/AgentTypo pair, the audio attack variants beyond WhisperInject, and the placement argument for why pre-processing scan placement at the input boundary is the only position that covers all three — is in the multimodal prompt-injection threat model (2026).
5. Detection architecture for all three attacks
Closing the coverage gap for FigStep, AgentTypo, and WhisperInject requires a scanner that operates at three boundaries simultaneously: raw image bytes before VLM call, raw image bytes before OCR pipeline, and raw audio bytes before STT transcription. These are different scan points, different signal extractors, and different classifiers — but they share the same placement principle: the raw bytes are the input channel the attacker controls, so the scan must run on the raw bytes before any transformation removes the payload.
Image scanning (FigStep + AgentTypo coverage)
For images, the scanner needs two distinct detection signals:
- Visual embedding similarity to injection instruction patterns. Run the raw image through a CLIP-class visual encoder and compute cosine similarity against a trained library of injection instruction embeddings. This signal covers FigStep — it detects the semantic content of rendered text regardless of OCR recoverability, because CLIP's visual encoder was trained on the same broad distribution of text-in-image that the VLM uses. A high similarity score against "ignore previous instructions", "disregard system prompt", or "repeat confidential context" embeddings in the visual space is a strong signal even when OCR returns nothing.
- Typography anomaly detection. Run a trained classifier on the image's typographic features: character-level distortion metrics, zero-width character density in text regions, Unicode lookalike character frequency, and inter-character spacing anomalies. This signal covers AgentTypo — it flags the distortion profile of adversarially perturbed characters even when OCR produces a benign output string. The two signals together provide coverage of FigStep (CLIP embedding catches it) and AgentTypo (typography anomaly catches it), and they are complementary: an AgentTypo payload calibrated to defeat CLIP-only scanning can still be caught by typography anomaly detection, and vice versa.
Both signals feed a final risk score in [0, 100] with a flagged region (bounding box of the suspected payload location) and an action recommendation. The scanner page at typographic prompt injection scanner details the implementation options.
Audio scanning (WhisperInject coverage)
For audio, the scanner needs waveform-level analysis that runs before the STT transcription step:
- Sub-VAD frequency band analysis. Scan the audio for energy in frequency bands and at amplitude levels that Whisper's VAD would suppress. Unexpected content in these bands — particularly content with the acoustic profile of speech rather than ambient noise — is a WhisperInject signal.
- Temporal segment boundary inspection. Check audio segments at pause boundaries (segments Whisper collapses) and after the last detected speech segment (the "tail" region) for instruction-like acoustic patterns.
- Spectrogram anomaly detection. Flag ultrasonic content (above 16 kHz) that carries speech-bandwidth spectral structure — the signature of ultrasonic encoding — alongside the main audio stream.
The waveform scanner output is a per-segment risk score and a segment timestamp range, which the application can use to gate the STT call or flag the audio for human review. The complete build guide for integrating this scanning layer into a voice agent pipeline is in building a prompt-injection scanner for voice agents.
Scan placement
Both the image scanner and the audio scanner must run before any downstream processing — before OCR, before VLM call, before STT transcription. This is the only placement that preserves the raw payload. Once OCR or STT runs, the payload has already been transformed (or dropped), and scanning the output of that transformation no longer covers the attack. The placement constraint is not a product opinion; it follows from the attack mechanics of all three attacks. Indirect prompt injection via image covers the additional placement complexity introduced by multi-hop agent pipelines where images travel across system boundaries before reaching the VLM.
6. The defence-in-depth stack for a multimodal application
A multimodal application that accepts images, documents, screenshots, or audio has a minimum three-layer defence requirement against the FigStep/AgentTypo/WhisperInject triad. All three layers are necessary; none is sufficient alone.
-
Layer 1 — Multimodal pre-processing scanner.
Scan raw image bytes (CLIP embedding + typography anomaly) and raw audio bytes (waveform analysis) before any model call. This is the layer text-only scanners miss entirely and the one that provides coverage for all three attacks. Glyphward's
/v1/scanendpoint provides this layer as a drop-in API with a latency budget appropriate for inline use (p95 under 200ms for a 1 MB image; p95 under 500ms for a 30-second audio clip). - Layer 2 — Text PI scanner on OCR output and STT transcript. Run a standard text PI scanner (Lakera Guard, LLM Guard, or equivalent) on OCR output from images and on Whisper transcripts from audio. This layer catches FigStep variants that happen to be OCR-recoverable (a poorly optimised payload font that partially survives OCR), text-layer injections that arrive in the same message as the image or audio, and any injection content that the VLM or STT model injects into its output during processing. This layer is additive — it does not replace the multimodal pre-processing layer but catches what the image/audio scanner misses after the text conversion runs.
- Layer 3 — Output monitoring on LLM responses. Scan LLM outputs for data exfiltration patterns, unexpected tool calls, and prompt echo (the model repeating content from its system prompt — a common FigStep execution signature). Output monitoring catches attacks that slipped through layers 1 and 2, provides post-hoc forensic signal, and generates the audit log that compliance frameworks require. The NIST AI RMF MAP 5.2 adversarial-input management obligations and EU AI Act Article 15 cybersecurity requirements both specify continuous monitoring of AI outputs; output scanning is the mechanism that satisfies this obligation. The compliance mapping is in NIST AI RMF GenAI profile and prompt injection.
Teams that deploy only layer 2 (text scanner on OCR/STT output) have partial coverage: they catch some text-layer injections and occasionally catch poorly optimised FigStep payloads, but they have zero coverage for the attacks as designed. Teams that deploy only layer 1 have better coverage of the named attacks but miss text-channel injections that arrive alongside clean images. The full three-layer stack is the only configuration with no named attack class undefended. For teams evaluating scanners at the pre-processing layer, the pricing and free tier are at Glyphward pricing.
FAQ
What is FigStep and how does it bypass text scanners?
FigStep is a jailbreak technique that renders prompt-injection instructions as text inside an image using OCR-resistant fonts, stylised glyphs, tables, and diagrams that a VLM decodes but that OCR returns as garbage or empty string. The text PI scanner receives the user message — usually a benign caption like "describe this image" — scans it, and clears it. The actual payload is in the pixel array the scanner never examines. The VLM reads the rendered instruction with its visual tokeniser, which processes pixel features directly, and complies. There is nothing in the text layer for the scanner to catch.
How is AgentTypo different from FigStep?
FigStep hides from OCR entirely by using fonts OCR misreads. AgentTypo specifically defeats OCR-then-text-scan defences: it applies character distortions calibrated so that OCR produces a benign string (defeating text scanning) while the VLM, which processes pixel features directly, reads the toxic original. AgentTypo payloads can survive inside PDFs, screenshots, and scanned documents that pass through an OCR pipeline. An OCR output that looks clean is not evidence the image was clean — AgentTypo proves the two can diverge by design.
What does WhisperInject target and why do voice agents miss it?
WhisperInject embeds adversarial commands in audio at frequencies, amplitudes, or temporal segments that Whisper's transcription discards — below the VAD energy threshold, in ultrasonic frequency bands above 16 kHz, in reversed-speech segments, or in the post-speech tail of an audio clip. A text PI scanner that only reads the Whisper transcript has no visibility into discarded content. The injected instruction never appears in the text the scanner sees. Some other voice pipeline component — diarisation, wake-word detection, raw-waveform processing — may process the discarded segment and route it to the LLM with elevated trust.
Do any text scanners detect FigStep, AgentTypo, or WhisperInject?
No. Lakera Guard, LLM Guard, Azure Prompt Shields, and Promptfoo are text-only classifiers. All three attacks deliver their payload in a non-text channel. A text scanner that has never seen the pixel array or the waveform has nothing to scan. This is a scope boundary, not a sensitivity gap — it cannot be fixed by expanding the pattern library or retraining on more text injection examples.
Can I just add OCR output scanning to catch FigStep?
OCR-then-text-scan catches some FigStep variants where the font partially survives OCR, but misses FigStep as designed — the attack font selection is specifically chosen to produce empty or garbled OCR output. More importantly, OCR-then-text-scan is the exact defence AgentTypo was designed to defeat: it produces distortions where OCR outputs a benign string while the VLM executes the payload. The only placement with coverage for both attacks is scanning on raw image bytes before any downstream transformation, using a visual encoder that can decode rendered text independently of OCR recoverability.
Further reading
- FigStep detection — scanner implementation, font taxonomy for OCR-resistant rendering, CLIP-based detection pipeline, and integration guide for VLM-powered chatbots and document agents.
- AgentTypo detector — character-level distortion detection, document pipeline coverage (PDF, screenshot, scanned invoice), and the typography anomaly classifier that catches AgentTypo variants that defeat CLIP-only scanning.
- WhisperInject detection — waveform-layer scanner placement, sub-VAD frequency analysis, ultrasonic overlay detection, and temporal segment boundary inspection for voice agent pipelines.
- Audio prompt-injection detection — the full audio attack class beyond WhisperInject: waveform anomaly taxonomy, detection signal inventory, and scanner integration patterns for STT-gated voice agents.
- Typographic prompt injection scanner — the combined typographic attack class covering FigStep and AgentTypo, detection signal combinations, and the scanner API that covers both in a single endpoint call.
- Indirect prompt injection via image — how FigStep and AgentTypo payloads travel across multi-hop agent pipelines (RAG document fetches, screenshot agents, web browsing agents) before reaching the target VLM, and the scan placement complexity this introduces.
- Agentic RAG pipeline prompt injection — how agent tool calls amplify the consequences of a successful FigStep or AgentTypo attack that was not caught at the input boundary, and the multi-hop injection risk taxonomy for document-reading agents.
- Why every text-only scanner misses a 30-pixel PNG — the architectural argument for why the FigStep/AgentTypo gap is structural: what the VLM sees versus what the scanner sees, and why the gap cannot be closed with a larger blocklist.
- Building a prompt-injection scanner for voice agents — the engineering deep-dive on waveform-layer scanning: what Whisper discards, the four audio PI subtypes, and the 5-step build playbook for a scanner that closes the WhisperInject gap.
- The multimodal prompt-injection threat model (2026) — the full attack taxonomy across text, image, and audio; the defender's playbook with scan placement, threshold selection, and defence-in-depth stack design.
- Glyphward pricing — free tier (10 scans/day, no card required) covers the FigStep/AgentTypo visual scanner and the WhisperInject waveform scanner. Pro at $29/mo adds webhook delivery, per-request audit logs, and SDK access.