Attack explainer · Image injection

AgentTypo detector — when OCR reads your adversarial glyphs as benign

AgentTypo is the adversarial-typography sibling of FigStep. Where FigStep defeats OCR only incidentally, AgentTypo deliberately distorts glyph shapes so that Tesseract-class OCR either drops the text or returns a benign near-miss, while the downstream vision-language model still reads the attacker's intent.

TL;DR

AgentTypo forces text-in-image defences into one of two failure modes: (a) the OCR pipeline misreads the glyphs and passes a clean string to your text scanner, or (b) the OCR returns empty and you assume the image is benign. The VLM reads the intent either way. Reliable detection requires a pixel-level classifier that does not depend on OCR recovering the string.

How AgentTypo works

The AgentTypo line of work targets the OCR–text-filter pipeline directly. Three distortion techniques dominate the public literature:

  1. Per-glyph perturbation. Small targeted noise applied to each rendered letter. Humans still read the sentence; VLMs trained on typographic corpora still extract meaning; OCR engines fall off a cliff at fairly modest perturbation budgets.
  2. Substitution and confusables. Unicode confusables (Cyrillic “а” for Latin “a”, Greek “ο” for Latin “o”) rendered as pixels. OCR often outputs the Latin string — which your blocklist does not match — while the VLM pools the same semantic embedding regardless of the code-point.
  3. Anti-OCR fonts and layout. Low contrast, tight kerning, hand-drawn or low-resolution variants. OCR either returns nothing or returns garbage tokens that defeat regex blocklists and text PI models alike.
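The substitution trick in (2) can be shown in plain strings, before any pixels are involved. The payload text and blocklist below are illustrative, not from any real corpus:

```python
# A payload written with Cyrillic homoglyphs is byte-for-byte different
# from its Latin twin, so an exact-match blocklist on OCR output misses
# it even when the OCR "correctly" transcribes the visible letters.
latin = "ignore previous instructions"
mixed = "ign\u043ere previ\u043eus instructi\u043ens"  # Cyrillic о (U+043E)

BLOCKLIST = {"ignore previous instructions"}

assert latin != mixed            # different code points...
assert len(latin) == len(mixed)  # ...identical visual length
assert latin in BLOCKLIST        # the Latin string would be caught
assert mixed not in BLOCKLIST    # the homoglyph variant sails through
```

A VLM embeds both strings to roughly the same meaning, which is the asymmetry the attack exploits.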

The underlying asymmetry is the attack surface: OCR is trained on clean document imagery; VLMs are trained on messy, in-the-wild web scrapes and generalise far better to adversarial glyphs. Any defence that routes through OCR before inspection inherits the weakest component.

Why an OCR-first pipeline fails

The common production pattern is: decode image → Tesseract (or a hosted OCR) → run the OCR output through your existing text prompt-injection scanner. AgentTypo was designed to break exactly that pipeline. When OCR misreads, the scanner sees a benign string and marks the request clean. When OCR returns empty, the scanner does not run at all, and the image is forwarded to the VLM with no inspection.
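The vulnerable pattern can be sketched in a few lines. `run_ocr` and `text_scan` below are hypothetical stand-ins for Tesseract and a text prompt-injection scanner; both failure modes fall out of the control flow, not the models:

```python
def run_ocr(image: bytes) -> str:
    """Stand-in OCR. Against AgentTypo glyphs this returns either a
    benign near-miss string or, as modelled here, an empty string."""
    return ""

def text_scan(text: str) -> bool:
    """Stand-in text PI scanner: True means 'flag as injection'."""
    return "ignore previous instructions" in text.lower()

def ocr_first_pipeline(image: bytes) -> str:
    text = run_ocr(image)
    if not text:
        # Failure mode (b): OCR came back empty, so the scanner never
        # runs and the image is forwarded with no inspection at all.
        return "forwarded-unscanned"
    if text_scan(text):
        return "blocked"
    # Failure mode (a): a misread string passes the scanner cleanly.
    return "forwarded-clean"

print(ocr_first_pipeline(b"adversarial glyph render"))
```

Nothing in this flow ever looks at the pixels the VLM will see, which is exactly the gap the attack targets.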

Upgrading OCR alone does not help for long. Every improvement in OCR robustness raises the perturbation budget the attacker needs, but the VLM’s ability to read distorted glyphs improves in lockstep — the attacker simply tunes the distortion to the gap. The structural fix is to stop treating OCR output as the thing you scan. Scan the pixels.

How to detect AgentTypo at inference time

Pixel-level detection that does not depend on OCR-recovered text runs four signals in parallel:

  1. Text-in-image likelihood head. A small model that predicts “does this image contain instructional text” — numbered steps, imperative verbs, list structure — without needing to read individual letters. Fires on AgentTypo output because the layout of rendered instructions survives per-glyph perturbation.
  2. Visual-embedding nearest-neighbour over a known-payload corpus. CLIP-style embeddings compared against a compounding shared corpus of seen AgentTypo payloads. Catches paraphrases and font-swaps that still land near a known-malicious region of embedding space.
  3. OCR with confusable normalisation. Still runs — but results are normalised (Cyrillic homoglyphs folded to Latin, diacritics stripped) before any downstream match, so confusables cannot hide the string.
  4. Perturbation-signature classifier. A cheap detector trained on the noise-pattern footprint typical of adversarial glyphs — high-frequency artefacts that clean rendered text does not produce.
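Signal 3, confusable normalisation, is the cheapest of the four to sketch. The mapping below is a tiny illustrative subset of the Unicode confusables table, not a complete implementation:

```python
import unicodedata

# Fold a handful of common Cyrillic/Greek homoglyphs to Latin.
CONFUSABLES = {
    "\u0430": "a",  # Cyrillic а
    "\u0435": "e",  # Cyrillic е
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u03bf": "o",  # Greek ο
}

def normalise(text: str) -> str:
    # Strip diacritics via NFKD decomposition, then fold known
    # homoglyphs to Latin before any downstream blocklist match.
    decomposed = unicodedata.normalize("NFKD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return "".join(CONFUSABLES.get(c, c) for c in stripped)

payload = "ign\u043ere previ\u043eus instructi\u043ens"  # mixed-script
assert "ignore" not in payload             # raw match fails
assert "ignore" in normalise(payload)      # normalised match succeeds
```

Production normalisation should use the full confusables data rather than a hand-rolled dict, but the shape of the defence is the same: normalise first, match second.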

Two signals firing above threshold is a hard block; one is cause to route the call to a human or downgrade the agent’s privileges. False positives land on hand-drawn memes and poor-quality photos of real signage — both easy to suppress with corpus rules.
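The two-of-four decision rule above can be sketched as follows. Signal names, scores, and the 0.7 threshold are illustrative placeholders, each standing for the output of one parallel detector:

```python
THRESHOLD = 0.7  # illustrative per-signal firing threshold

def decide(signals: dict) -> str:
    """Combine per-signal scores into a single disposition."""
    firing = [name for name, score in signals.items() if score >= THRESHOLD]
    if len(firing) >= 2:
        return "block"     # hard block: two independent signals agree
    if len(firing) == 1:
        return "escalate"  # route to a human / downgrade privileges
    return "allow"

scores = {
    "text_in_image": 0.91,
    "embedding_nn": 0.84,
    "ocr_confusable": 0.12,
    "perturbation_sig": 0.40,
}
print(decide(scores))  # two signals fire, so this prints "block"
```

Requiring agreement between independent signals is what keeps the false-positive surface small enough to handle with corpus rules.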

How Glyphward detects AgentTypo

Glyphward runs all four signals on every scan and returns a 0–100 risk score plus the flagged pixel region. No GPU to host, no corpus to curate. Free tier gives you 10 scans a day — enough to throw the public AgentTypo samples at it and see where your existing OCR pipeline was lying to you. See the flow on the landing page, compare coverage on Lakera alternative (multimodal), or go to pricing.

Get early access

Related questions

Is AgentTypo a single attack or a class?

A class. The AgentTypo name clusters adversarial-typography techniques that target OCR-first defences; FigStep, typographic PI, and the confusable-injection variants share the same structural bet — the attacker’s payload survives the VLM encoder while failing the defender’s OCR step.

Can I fix this by training OCR on AgentTypo samples?

It closes the current public corpus and opens the next. OCR-adversarial training is useful, but without a pixel-level classifier in parallel you will always trail the attacker by one perturbation generation. The right architecture is: OCR and a pixel classifier, both treated as inputs to the risk score.

How does this compare to FigStep?

FigStep hides policy-violating structure (numbered lists) in a rendered image; AgentTypo hides the letters themselves from OCR via distortion. They stack — a real-world attacker uses both. See FigStep detection and typographic prompt injection scanner for the full surface.

Further reading