Promptfoo + multimodal scanning: Glyphward as the image and audio detector in your Promptfoo pipeline
Promptfoo is the open-source eval-and-red-team harness most LLM-app teams reach for first. It is excellent at running text adversarial suites against your models and asserting on outputs. It is not a real-time inference-path scanner, and it is not pixel-aware. If your eval suite needs to cover image and audio prompt-injection — and your runtime needs to block those payloads at request time — Glyphward fits inside Promptfoo, not against it.
TL;DR
Promptfoo is eval-time. Glyphward is inference-time. They are different layers of the stack, not different products in the same category. Use Promptfoo to red-team your model with adversarial multimodal payloads (FigStep, AgentTypo, WhisperInject corpora) and assert on whether Glyphward catches them. Use Glyphward in production to actually catch them at request time.
Two different jobs
The category confusion is worth dispelling up front, because it shows up in every comparison shopper's notebook:
- Promptfoo is a test harness. You write a YAML config that describes a set of test cases (prompts, expected behaviours, providers, assertions), point it at one or more LLMs, and it runs the matrix and gives you a report. Its `redteam` command bundles known adversarial payloads to flush out jailbreaks. The model under test is the system you are evaluating; Promptfoo itself does not sit inline with your production traffic.
- Glyphward is an inline scanner. You call `POST /v1/scan` on every image and audio upload before it reaches your VLM or STT layer; you get back a 0–100 score plus modality-tagged reasons; you decide whether to pass, log, or block. Glyphward never sees your model — it sees the bytes the user sent.
You absolutely want both. Promptfoo tells you, at CI time, that FigStep payload variant 17 produces a successful jailbreak in your VLM. Glyphward tells you, at request time, that this specific upload from this specific user looks like a FigStep variant; here is the score, here is the flagged region. Different signals, different consumers, different latency budgets.
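The inline half of that split can be sketched in a few lines. The endpoint, `image_b64` field, and 0–100 score come from the description above; the exact response shape and the pass/log/block thresholds are assumptions for illustration, not product defaults:

```python
import base64
import json
import urllib.request

GLYPHWARD_URL = "https://glyphward.com/v1/scan"  # endpoint named above

def scan_upload(image_bytes: bytes, api_key: str) -> dict:
    """POST the raw upload to the scan endpoint and return the parsed JSON.
    Assumed response shape: {"score": 0-100, "reasons": [...]}."""
    payload = json.dumps(
        {"image_b64": base64.b64encode(image_bytes).decode()}
    ).encode()
    req = urllib.request.Request(
        GLYPHWARD_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=1.0) as resp:
        return json.load(resp)

def decide(score: float, block_at: float = 70, log_at: float = 30) -> str:
    """Map a 0-100 risk score onto a pass / log / block policy.
    The thresholds here are illustrative."""
    if score >= block_at:
        return "block"
    if score >= log_at:
        return "log"
    return "pass"
```

The policy function is deliberately separate from the HTTP call: your request handler stays testable without network access, and the thresholds live in one place when you tune them against your own corpus.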
The integration recipe
The cleanest pattern is to write Promptfoo assertions that check whether Glyphward correctly flags a known-malicious image, then run that suite continuously.
Step 1 — corpus. Maintain a folder of labelled adversarial samples. Public starting points include the FigStep paper's released payloads, the AgentTypo supplementary material, and any WhisperInject samples you have access to. Glyphward customers get a curated subset on request.
Step 2 — Promptfoo provider. Define an HTTP provider in your `promptfooconfig.yaml` that posts each sample to the Glyphward scan endpoint and surfaces the returned score:

```yaml
providers:
  - id: glyphward
    config:
      url: https://glyphward.com/v1/scan
      method: POST
      headers:
        Authorization: Bearer ${GLYPHWARD_API_KEY}
      body:
        image_b64: "{{image_b64}}"
      transformResponse: |
        return { output: String(json.score), score: json.score };

tests:
  - description: FigStep variant 17 should score >= 70
    vars:
      image_b64: file://corpus/figstep_17.png.b64
    assert:
      - type: javascript
        value: parseFloat(output) >= 70

  - description: Clean stock photo should score < 30
    vars:
      image_b64: file://corpus/clean_001.jpg.b64
    assert:
      - type: javascript
        value: parseFloat(output) < 30
```
Step 3 — run on every PR. Add the suite to CI. If recall on the malicious set drops or false-positive rate on the clean set climbs, the build fails before it ships. Glyphward's free tier is enough to run a small CI suite; the Pro tier covers a larger one.
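The `file://` vars in the config point at `.b64` text files sitting next to the raw samples. A minimal, hypothetical helper to generate them from a corpus folder (the directory layout and extensions are assumptions matching the example paths above):

```python
import base64
from pathlib import Path

def encode_corpus(src_dir: str, exts=(".png", ".jpg")) -> int:
    """Write <name>.<ext>.b64 beside each raw sample so the file:// vars
    in promptfooconfig.yaml resolve. Returns the number of files written."""
    written = 0
    for p in sorted(Path(src_dir).iterdir()):
        if p.suffix.lower() in exts:
            out = p.with_name(p.name + ".b64")
            out.write_text(base64.b64encode(p.read_bytes()).decode())
            written += 1
    return written
```

Run it once when you add samples, or as a pre-step in the CI job so the encoded files never need to be committed.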
Architectural difference
| | Promptfoo | Glyphward |
|---|---|---|
| Layer | Eval / red-team / CI | Inline scanner / inference path |
| Triggered by | Developer at CI time | End-user upload at request time |
| Subject under test | Your LLM (and any guardrails on it) | The uploaded image or audio bytes |
| Output | Pass/fail matrix per test case | Per-request risk score + flagged regions |
| Latency budget | Minutes per suite is fine | Sub-200ms p95 required |
| Pricing | OSS + Promptfoo Cloud (paid) | Free 10/day · $29/mo Pro · $99/mo Team |
| Multimodal coverage | Provider-dependent; harness handles bytes if your assertions do | Image + audio first-class |
The architectural lesson lands in one line: eval-time and inference-time are not substitutes. Removing the eval suite does not protect your users; removing the inline scanner means every malicious upload reaches the model. Removing both leaves you with nothing in either layer, which is roughly where most multimodal apps sit today.
What this looks like in production
A typical Glyphward + Promptfoo deployment, end-to-end:
- CI — Promptfoo runs the multimodal suite on every PR. Build fails if Glyphward's recall on the corpus regresses past your threshold.
- Pre-deploy — Promptfoo's `redteam` generates fresh adversarial variants and probes your full stack (Glyphward + your VLM) end-to-end. New variants that get through go into the corpus.
- Production — every image and audio upload calls Glyphward inline. Score + reasons + region land in your request log; flagged regions feed back into the corpus.
- Weekly review — last 7 days of flagged-but-passed samples (the grey-band middle scores) get reviewed manually and labelled. New labels feed the next CI run.
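The weekly-review step reduces to a filter over your request log. A sketch of the grey-band selection; the record shape (`score`, `ts`, `request_id`) and the 30/70 band edges are assumptions you would replace with your own log schema and thresholds:

```python
from datetime import datetime, timedelta, timezone

def grey_band(records, low=30, high=70, days=7):
    """Select the last week's flagged-but-passed scans — scores that fell
    between the log threshold and the block threshold — for manual labelling.
    Assumed record shape: {"score": int, "ts": datetime, "request_id": str}."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    return [
        r for r in records
        if low <= r["score"] < high and r["ts"] >= cutoff
    ]
```

Whatever comes back gets a human label and lands in the corpus folder, which is what turns production ambiguity into a CI assertion on the next run.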
This is the loop. Promptfoo enforces the contract; Glyphward provides the scanner that the contract is enforced against; the corpus updates from production reality. None of the three steps replaces another.
When to pick which
- Promptfoo alone works if your application is text-only, you red-team in CI, and you do not need an inline scanner because no untrusted bytes ever reach your model.
- Glyphward alone works if you want production protection and are not yet ready to invest in a CI eval suite. Most teams add Promptfoo on top within a quarter once the inline scanner is paying off.
- Both is the default for any multimodal app with users. Eval and inference are independent insurance — neither covers the other's failure mode.
What the integration costs
Adding Glyphward to an existing Promptfoo setup is a YAML provider definition and an API key. There is no Glyphward SDK to install in CI — Promptfoo's HTTP provider already speaks REST. The reverse direction is just as cheap: an existing Glyphward-protected app can adopt Promptfoo by writing one config file. The reason this combination works is that neither product reaches into the other's surface area.
Related questions
Doesn't Promptfoo already have a `redteam` image module?
Promptfoo's `redteam` generates adversarial test cases and runs them through providers; it does not implement a pixel-level PI detector. If your provider is your own VLM with no scanner in front of it, you are checking model robustness, not detector recall. Adding Glyphward as a provider — or as a wrapper provider that fronts your VLM — gives you a recall metric to assert on.
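A wrapper of that shape is small enough to sketch. Both callables and the threshold here are hypothetical stand-ins: `scan` is whatever returns the scanner's 0–100 score, `vlm` is your own model call:

```python
def guarded_vlm(image_bytes, scan, vlm, block_at=70):
    """Front the VLM with the scanner so a red-team suite probes the full
    stack, not the bare model. `scan(bytes)` is assumed to return a dict
    with a 0-100 "score"; `vlm(bytes)` is your model call."""
    result = scan(image_bytes)
    if result["score"] >= block_at:
        return {"blocked": True, "reasons": result.get("reasons", [])}
    return {"blocked": False, "output": vlm(image_bytes)}
```

Pointed at by a custom Promptfoo provider, this turns "did the payload jailbreak the model" and "did the scanner catch it first" into one end-to-end assertion.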
Why not just call Promptfoo's eval suite at request time?
Latency. A request-time call has a sub-200ms budget; a Promptfoo eval is minutes-per-suite by design because it is exercising a matrix, not scoring a single sample. They are optimised for opposite axes.
Can I export Glyphward's flagged samples back into Promptfoo?
Yes — that is the whole point of the corpus loop. Pro and Team tiers expose flagged-region exports you can drop into a Promptfoo test corpus, so production discoveries become CI assertions on the next deploy.
Is Promptfoo Cloud relevant to this comparison?
Promptfoo Cloud is the hosted/team version of the eval harness — it changes how you run the suite, not what category it is. The eval-vs-inference distinction holds whether you self-host Promptfoo or pay for Cloud.
Do you have a starter Promptfoo config we can clone?
Email hello@glyphward.com once you have a Pro key and we will send you the working `promptfooconfig.yaml` and a labelled starter corpus (FigStep, AgentTypo, WhisperInject samples) so you can stand up the CI suite in under an hour.
Further reading
- Multimodal PI scanner pricing comparison (2026) — full market table; Promptfoo and Glyphward live in adjacent rows, not the same one.
- Multimodal LLM security API — the broader category Glyphward fits in.
- FigStep detection · AgentTypo detector · WhisperInject detection — the corpora your Promptfoo suite should test against.
- The multimodal prompt-injection threat model for AI product teams (2026) — what Promptfoo's red-team output should be looking for.