Category overview · LLM security

Multimodal LLM security API

An LLM security API is a service you call to decide whether a request to a model should proceed. A multimodal LLM security API extends that decision to inputs that are not strings — pixels and waveforms. The distinction matters because every public LLM-security service in the under-$100/mo tier reads strings only, while the production attack surface for image-understanding and voice products lives at the modality layer the service does not see.

TL;DR

Text PI scanners — Lakera Guard, LLM Guard, Azure Prompt Shields, Promptfoo Cloud — are necessary, useful, and widely deployed. They do not read pixels or waveforms. A multimodal LLM security API does. The interface is one HTTP endpoint that accepts a payload of any supported modality and returns a 0–100 risk score, the modality classification, the flagged region, and the signal-by-signal evidence. It sits between your input handler and your model call.

Why “multimodal LLM security” is its own category

The text-PI scanner category was forged in 2023 around tools like Lakera’s Gandalf game and the original LLM-Guard library. The threat surface at that time was a string sent to a chat model. Since then, two structural shifts have created a category gap:

  1. Vision-language and audio-language models in production. GPT-4V, Claude 3, Gemini 1.5+, the open-weight Qwen-VL line, and audio-LLMs from OpenAI and Google all accept non-string inputs. Once those inputs reach the model, the inspection layer that runs before them must speak the same modalities. A scanner that only reads strings cannot defend a model that reads pixels.
  2. Self-serve consolidation up-market. Public reporting indicates Lakera was acquired by Check Point in late 2025; the post-acquisition trajectory is enterprise-first. The under-$100/mo self-serve tier — the band most AI startups can budget for — has therefore narrowed at exactly the moment image and audio surfaces went mainstream. The OWASP LLM Top 10 recognises Prompt Injection (LLM01) at the category level; vendor coverage at the modality level is uneven.

“Multimodal LLM security API” is the category that closes that gap: a self-serve API that explicitly inspects pixels and waveforms in addition to text, and returns a single normalised risk score the rest of your pipeline can consume.

Four modalities, four threat surfaces

Treat the modalities as independent surfaces with overlapping but distinct attack patterns:

  1. Text. The legacy surface. System-prompt overrides, jailbreaks, indirect injection from RAG documents. Well-defended by every vendor. A multimodal API still inspects text — replacing your text-side filter is not the goal — but it is rarely the bottleneck of risk.
  2. Image (pixels). Typographic prompt injection (FigStep, AgentTypo), Unicode-confusable rendering, indirect injection via screenshots, adversarial-glyph payloads, perturbation-based confusion. See the typographic PI scanner page.
  3. Audio (waveform). WhisperInject-class out-of-band carriers, silence steganography, adversarial waveform perturbation, multi-speaker overlay. See audio prompt-injection detection.
  4. Composites. A single document combining a small typographic block with a covert audio attachment, or a screencast carrying both image-borne and audio-borne payloads. The composite case is the easiest to overlook because it sits across vendor boundaries; the multimodal API treats it as one scan.

The threat surface is not “one big bag of bytes”. It is four surfaces that share a structural property: the artefact a text-only defender reads is not the artefact the model acts on. See the 2026 threat model for the long-form treatment.

What a real multimodal LLM security API does

Five capabilities are table stakes for the category:

  1. Modality coverage. Image and audio at a minimum, in addition to text. A vendor that says “we are working on image” has not shipped it; ask for the public sample-set numbers.
  2. Single-endpoint surface. One URL, one auth header, one response shape regardless of modality. The integration cost of routing across three different vendors is itself a security risk — every routing branch is a place to forget to call the scanner.
  3. Per-signal evidence in the response. A score alone is not enough; integrators need to know which signals fired so they can write source-aware policies (block on ≥80, log on ≥60 from low-trust sources, etc.).
  4. A free tier that lets you reproduce the marketing claims. 10 scans a day against the public FigStep, AgentTypo, and WhisperInject samples is sufficient — see prompt-injection API with a real free tier.
  5. A compounding shared corpus. Every confirmed payload one tenant scans should become a near-neighbour signal for every tenant. A scanner that does not compound across users is a one-shot ML model dressed up as a service.
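The per-signal evidence in point 3 is what makes source-aware policies writable in a few lines. A minimal sketch of the thresholds described above — field names like `risk_score` and the trust-level labels are assumptions for illustration, not a documented schema:

```python
# Hypothetical scan-result fields; the real response schema may differ.
BLOCK_THRESHOLD = 80
LOG_THRESHOLD = 60
LOW_TRUST = {"user_upload", "third_party", "screenshare"}

def decide(scan: dict, source_trust: str) -> str:
    """Map a scan result plus a source-trust tag to a policy action."""
    score = scan["risk_score"]  # 0-100, per the category definition
    if score >= BLOCK_THRESHOLD:
        return "block"
    # Log-and-pass band applies only to low-trust sources.
    if score >= LOG_THRESHOLD and source_trust in LOW_TRUST:
        return "log"
    return "pass"
```

The same score can yield different actions depending on where the input came from, which is exactly why a bare score without per-signal evidence and a trust tag is not enough to write policy against.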

API shape — one endpoint, one response

The Glyphward `/v1/scan` endpoint accepts any supported modality: a text payload, an image (pixels), an audio file (waveform), or a composite document carrying more than one.

The response is a single JSON object: a 0–100 risk score, the inferred modality, the flagged region (bounding box for images, timestamp range for audio), and the per-signal confidences from the underlying ensemble. Latency is sub-200 ms p95 in the typical scan envelope. Authentication is a Bearer token; rate limits map to the plan tier.
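To make the "one response shape regardless of modality" point concrete, here is a hedged sketch of what consuming that object might look like — the exact field names (`risk_score`, `flagged_region`, `bbox`, `timestamps`, `signals`) are assumptions for illustration, not the published schema:

```python
import json

# Hypothetical image-scan response: bounding box for the flagged region,
# per-signal confidences from the ensemble. Field names are assumed.
image_scan = json.loads("""{
  "risk_score": 91,
  "modality": "image",
  "flagged_region": {"bbox": [120, 40, 480, 110]},
  "signals": {"typographic": 0.94, "unicode_confusable": 0.31}
}""")

def flagged_span(scan: dict):
    """Return the flagged region whatever the modality:
    a pixel bounding box for images, a timestamp range for audio."""
    region = scan["flagged_region"]
    return region.get("bbox") or region.get("timestamps")
```

Because the shape is constant across modalities, the consuming code needs no per-modality branches beyond the region accessor.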

Where it fits in the inference pipeline

The canonical placement for the scanner is between your input handler and the model call:

  1. Input arrives from the user — a chat message with an attached image, a voice call, a screenshot from a Computer Use agent.
  2. Scan with `/v1/scan`. Tag the source-trust level (own UI, user upload, third-party fetch, screenshare).
  3. Apply policy on the returned score. Block, downgrade, route to a human reviewer, or pass.
  4. Call the model only on pass. Cache the score with the input hash for short-window re-use.
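The four steps above can be sketched as a single guard function. `scan_stub` stands in for the real HTTP call to `/v1/scan`, and the threshold and trust labels are assumptions, not documented defaults:

```python
import hashlib

SCORE_CACHE: dict[str, int] = {}  # input hash -> score, for short-window re-use

def scan_stub(payload: bytes, source_trust: str) -> int:
    # Stand-in for the /v1/scan HTTP call; returns a 0-100 risk score.
    return 95 if b"IGNORE PREVIOUS" in payload else 5

def guarded_model_call(payload: bytes, source_trust: str, model) -> str:
    key = hashlib.sha256(payload).hexdigest()
    score = SCORE_CACHE.get(key)
    if score is None:
        score = scan_stub(payload, source_trust)  # step 2: scan + trust tag
        SCORE_CACHE[key] = score                  # step 4: cache by input hash
    if score >= 80:                               # step 3: apply policy
        return "blocked"
    return model(payload)                         # step 4: model call on pass
```

In a production version the policy branch would also carry the log/review/downgrade actions from step 3 rather than a binary block-or-pass.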

For text-on-text traffic, keep your existing text PI scanner — Glyphward is additive, not a replacement. For image and audio traffic, this is where the inspection layer lives.

How Glyphward implements the category

Glyphward is the multimodal LLM security API for AI products under $500 a month of model spend — where Lakera’s post-acquisition pricing and Azure’s Azure-only gating both leave a gap. The free tier (10 scans/day, no card) covers public-sample reproduction; Pro at $29/mo and Team at $99/mo cover production volume. Honest vendor comparisons are at vs Lakera Guard, vs Azure Prompt Shields, and vs LLM Guard. Among self-serve scanners under $100/mo, Glyphward is currently the only one with a production audio pipeline.

Get early access

Related questions

Is this a replacement for Lakera Guard or LLM Guard?

No. Run them on text — they do that well. Run Glyphward on image and audio. Many production stacks combine a text-PI scanner with Glyphward and treat the two as independent layers. See vs Lakera Guard and vs LLM Guard.

What about Azure Content Safety / Prompt Shields?

Azure’s image moderation flags policy-violating visual content (nudity, violence, etc.); it is not a prompt-injection detector. Prompt Shields itself is text-only and Azure-tenant-gated. Cross-cloud teams use Glyphward as the multimodal half. See vs Azure Prompt Shields.

Does this work for any model — Claude, GPT-4o, Gemini, Llama-vision?

Yes. The scanner runs on the input bytes before the model call, so it is model-agnostic. The same scan request defends a Claude pipeline and a GPT-4o pipeline identically.

What is the latency overhead?

Sub-200 ms p95 in the typical scan envelope. For voice agents already running Whisper-small you can share the transcript and the marginal cost is the waveform path alone — see prompt-injection scanner for voice agents for the latency budget breakdown.

Where does the corpus come from?

Public attack samples from the FigStep, AgentTypo, WhisperInject, and indirect-PI literature, plus the compounding corpus of confirmed payloads scanned through the service. Customer payloads are not exposed cross-tenant; only the resulting signal vectors contribute to the shared near-neighbour index.
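A shared near-neighbour index over signal vectors can be pictured with a toy sketch. Everything here is illustrative: the sample IDs, the three-dimensional feature vectors, and the 0.9 threshold are assumptions, not the production index:

```python
import math

# Toy index of confirmed payloads' signal vectors, e.g.
# [typographic, unicode_confusable, audio_carrier]. Values are made up.
CONFIRMED = {
    "figstep-sample": [0.94, 0.10, 0.02],
    "whisperinject-sample": [0.03, 0.05, 0.97],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_confirmed(vec, threshold=0.9):
    """Return the closest confirmed payload above threshold, else None."""
    best = max(CONFIRMED, key=lambda k: cosine(vec, CONFIRMED[k]))
    return best if cosine(vec, CONFIRMED[best]) >= threshold else None
```

Only the signal vectors live in the shared index, which is how a new tenant's scan can match a payload another tenant confirmed without any cross-tenant exposure of the payload bytes themselves.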

Further reading