Category overview · LLM security

Multimodal LLM security API

An LLM security API is a service you call to decide whether a request to a model should proceed. A multimodal LLM security API extends that decision to inputs that are not strings — pixels and waveforms. The distinction matters because every public LLM-security service in the under-$100/mo tier reads strings only, while the production attack surface for image-understanding and voice products lives at the modality layer the service does not see.

TL;DR

Text PI scanners — Lakera Guard, LLM Guard, Azure Prompt Shields, Promptfoo Cloud — are necessary, useful, and widely deployed. They do not read pixels or waveforms. A multimodal LLM security API does. The interface is one HTTP endpoint that accepts a payload of any supported modality and returns a 0–100 risk score, the modality classification, the flagged region, and the signal-by-signal evidence. It sits between your input handler and your model call.

Why “multimodal LLM security” is its own category

The text-PI scanner category was forged in 2023 around tools like Lakera’s Gandalf game and the original LLM-Guard library. The threat surface at that time was a string sent to a chat model. Since then, two structural shifts have created a category gap:

  1. Vision-language and audio-language models in production. GPT-4V, Claude 3, Gemini 1.5+, the open-weight Qwen-VL line, and audio-LLMs from OpenAI and Google all accept non-string inputs. Once those inputs reach the model, the inspection layer that runs before them must speak the same modalities. A scanner that only reads strings cannot defend a model that reads pixels.
  2. Self-serve consolidation up-market. Public reporting indicates Lakera was acquired by Check Point in late 2025; the post-acquisition trajectory is enterprise-first. The under-$100/mo self-serve tier — the band most AI startups can budget for — has therefore narrowed at exactly the moment image and audio surfaces went mainstream. The OWASP LLM Top 10 recognises Prompt Injection (LLM01) at the category level; vendor coverage at the modality level is uneven.

“Multimodal LLM security API” is the category that closes that gap: a self-serve API that explicitly inspects pixels and waveforms in addition to text, and returns a single normalised risk score the rest of your pipeline can consume.

Four modalities, four threat surfaces

Treat the modalities as independent surfaces with overlapping but distinct attack patterns:

  1. Text. The legacy surface. System-prompt overrides, jailbreaks, indirect injection from RAG documents. Well-defended by every vendor. A multimodal API still inspects text — replacing your text-side filter is not the goal — but it is rarely the bottleneck of risk.
  2. Image (pixels). Typographic prompt injection (FigStep, AgentTypo), Unicode-confusable rendering, indirect injection via screenshots, adversarial-glyph payloads, perturbation-based confusion. See the typographic PI scanner page.
  3. Audio (waveform). WhisperInject-class out-of-band carriers, silence steganography, adversarial waveform perturbation, multi-speaker overlay. See audio prompt-injection detection.
  4. Composites. A single document combining a small typographic block with a covert audio attachment, or a screencast carrying both image-borne and audio-borne payloads. The composite case is the easiest to overlook because it sits across vendor boundaries; the multimodal API treats it as one scan.

The threat surface is not “one big bag of bytes”. It is four surfaces that share a structural property: the artefact a text-only defender reads is not the artefact the model acts on. See the 2026 threat model for the long-form treatment.

What a real multimodal LLM security API does

Five capabilities are table stakes for the category:

  1. Modality coverage. Image and audio at a minimum, in addition to text. A vendor that says “we are working on image” has not shipped it; ask for the public sample-set numbers.
  2. Single-endpoint surface. One URL, one auth header, one response shape regardless of modality. The integration cost of routing across three different vendors is itself a security risk — every routing branch is a place to forget to call the scanner.
  3. Per-signal evidence in the response. A score alone is not enough; integrators need to know which signals fired so they can write source-aware policies (block on ≥80, log on ≥60 from low-trust sources, etc.).
  4. A free tier that lets you reproduce the marketing claims. 10 scans a day against the public FigStep, AgentTypo, and WhisperInject samples is sufficient — see prompt-injection API with a real free tier.
  5. A compounding shared corpus. Every confirmed payload one tenant scans should become a near-neighbour signal for every tenant. A scanner that does not compound across users is a one-shot ML model dressed up as a service.
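The per-signal evidence in point 3 is what makes source-aware policies writable in a few lines. A minimal sketch of the thresholds described above — field names like `risk_score` and the trust-level labels are assumptions for illustration, not a documented schema:

```python
# Hypothetical scan-result fields; the real response schema may differ.
BLOCK_THRESHOLD = 80
LOG_THRESHOLD = 60
LOW_TRUST = {"user_upload", "third_party", "screenshare"}

def decide(scan: dict, source_trust: str) -> str:
    """Map a scan result plus a source-trust tag to a policy action."""
    score = scan["risk_score"]  # 0-100, per the category definition
    if score >= BLOCK_THRESHOLD:
        return "block"
    # Log-and-pass band applies only to low-trust sources.
    if score >= LOG_THRESHOLD and source_trust in LOW_TRUST:
        return "log"
    return "pass"
```

The same score can yield different actions depending on where the input came from, which is exactly why a bare score without per-signal evidence and a trust tag is not enough to write policy against.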

API shape — one endpoint, one response

The Glyphward `/v1/scan` endpoint accepts any supported modality: a text payload, an image (pixels), an audio file (waveform), or a composite document carrying more than one.

The response is a single JSON object: a 0–100 risk score, the inferred modality, the flagged region (bounding box for images, timestamp range for audio), and the per-signal confidences from the underlying ensemble. Latency is sub-200 ms p95 in the typical scan envelope. Authentication is a Bearer token; rate limits map to the plan tier.
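To make the "one response shape regardless of modality" point concrete, here is a hedged sketch of what consuming that object might look like — the exact field names (`risk_score`, `flagged_region`, `bbox`, `timestamps`, `signals`) are assumptions for illustration, not the published schema:

```python
import json

# Hypothetical image-scan response: bounding box for the flagged region,
# per-signal confidences from the ensemble. Field names are assumed.
image_scan = json.loads("""{
  "risk_score": 91,
  "modality": "image",
  "flagged_region": {"bbox": [120, 40, 480, 110]},
  "signals": {"typographic": 0.94, "unicode_confusable": 0.31}
}""")

def flagged_span(scan: dict):
    """Return the flagged region whatever the modality:
    a pixel bounding box for images, a timestamp range for audio."""
    region = scan["flagged_region"]
    return region.get("bbox") or region.get("timestamps")
```

Because the shape is constant across modalities, the consuming code needs no per-modality branches beyond the region accessor.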

Where it fits in the inference pipeline

The canonical placement for the scanner is between your input handler and the model call:

  1. Input arrives from the user — a chat message with an attached image, a voice call, a screenshot from a Computer Use agent.
  2. Scan with `/v1/scan`. Tag the source-trust level (own UI, user upload, third-party fetch, screenshare).
  3. Apply policy on the returned score. Block, downgrade, route to a human reviewer, or pass.
  4. Call the model only on pass. Cache the score with the input hash for short-window re-use.
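The four steps above can be sketched as a single guard function. `scan_stub` stands in for the real HTTP call to `/v1/scan`, and the threshold and trust labels are assumptions, not documented defaults:

```python
import hashlib

SCORE_CACHE: dict[str, int] = {}  # input hash -> score, for short-window re-use

def scan_stub(payload: bytes, source_trust: str) -> int:
    # Stand-in for the /v1/scan HTTP call; returns a 0-100 risk score.
    return 95 if b"IGNORE PREVIOUS" in payload else 5

def guarded_model_call(payload: bytes, source_trust: str, model) -> str:
    key = hashlib.sha256(payload).hexdigest()
    score = SCORE_CACHE.get(key)
    if score is None:
        score = scan_stub(payload, source_trust)  # step 2: scan + trust tag
        SCORE_CACHE[key] = score                  # step 4: cache by input hash
    if score >= 80:                               # step 3: apply policy
        return "blocked"
    return model(payload)                         # step 4: model call on pass
```

In a production version the policy branch would also carry the log/review/downgrade actions from step 3 rather than a binary block-or-pass.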

For text-on-text traffic, keep your existing text PI scanner — Glyphward is additive, not a replacement. For image and audio traffic, this is where the inspection layer lives.

How Glyphward implements the category

Glyphward is the multimodal LLM security API for AI products under $500 a month of model spend — where Lakera’s post-acquisition pricing and Azure’s Azure-only gating both leave a gap. The free tier (10 scans/day, no card) covers public-sample reproduction; Pro at $29/mo and Team at $99/mo cover production volume. Honest vendor comparisons are at vs Lakera Guard, vs Azure Prompt Shields, and vs LLM Guard. Among self-serve scanners under $100/mo, Glyphward is currently the only one with a production audio pipeline.

Get early access

Related questions

Is this a replacement for Lakera Guard or LLM Guard?

No. Run them on text — they do that well. Run Glyphward on image and audio. Many production stacks combine a text-PI scanner with Glyphward and treat the two as independent layers. See vs Lakera Guard and vs LLM Guard.

What about Azure Content Safety / Prompt Shields?

Azure’s image moderation flags policy-violating visual content (nudity, violence, etc.); it is not a prompt-injection detector. Prompt Shields itself is text-only and Azure-tenant-gated. Cross-cloud teams use Glyphward as the multimodal half. See vs Azure Prompt Shields.

Does this work for any model — Claude, GPT-4o, Gemini, Llama-vision?

Yes. The scanner runs on the input bytes before the model call, so it is model-agnostic. The same scan request defends a Claude pipeline and a GPT-4o pipeline identically.

What is the latency overhead?

Sub-200 ms p95 in the typical scan envelope. For voice agents already running Whisper-small you can share the transcript and the marginal cost is the waveform path alone — see prompt-injection scanner for voice agents for the latency budget breakdown.

Where does the corpus come from?

Public attack samples from the FigStep, AgentTypo, WhisperInject, and indirect-PI literature, plus the compounding corpus of confirmed payloads scanned through the service. Customer payloads are not exposed cross-tenant; only the resulting signal vectors contribute to the shared near-neighbour index.
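A shared near-neighbour index over signal vectors can be pictured with a toy sketch. Everything here is illustrative: the sample IDs, the three-dimensional feature vectors, and the 0.9 threshold are assumptions, not the production index:

```python
import math

# Toy index of confirmed payloads' signal vectors, e.g.
# [typographic, unicode_confusable, audio_carrier]. Values are made up.
CONFIRMED = {
    "figstep-sample": [0.94, 0.10, 0.02],
    "whisperinject-sample": [0.03, 0.05, 0.97],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest_confirmed(vec, threshold=0.9):
    """Return the closest confirmed payload above threshold, else None."""
    best = max(CONFIRMED, key=lambda k: cosine(vec, CONFIRMED[k]))
    return best if cosine(vec, CONFIRMED[best]) >= threshold else None
```

Only the signal vectors live in the shared index, which is how a new tenant's scan can match a payload another tenant confirmed without any cross-tenant exposure of the payload bytes themselves.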

Further reading