Glyphward vs Promptfoo
Promptfoo is an open-source test harness and red-team eval framework — you run it at CI time to check what your model does when adversaries push on it. Glyphward is an inference-time scanner — it sits inline with production traffic and scores image and audio bytes before they reach your VLM or STT. Different layers, different latency budgets, different consumers. They are not substitutes; the strongest stacks run both.
TL;DR
Use Promptfoo to evaluate your defences against adversarial test cases on every PR. Use Glyphward as one of the defences Promptfoo evaluates and the runtime scanner that catches whatever slips through CI. Removing Promptfoo means you stop measuring your guardrail; removing Glyphward means malicious uploads reach the model unscored. Removing both is roughly where most multimodal apps sit today.
What each product actually is
Promptfoo is an MIT-licensed CLI and library — npx promptfoo — that runs a configurable matrix of test prompts against one or more model providers and reports pass/fail per assertion. Their redteam command bundles canonical jailbreak and prompt-injection payloads (DAN families, indirect-PI, role-play attacks, and more) so you can flush regressions before shipping. Promptfoo Cloud adds dashboards and managed history on top of the OSS core. The tool's job is to tell you whether your model still misbehaves.
Glyphward is a managed HTTPS API. You POST an image or audio file to /v1/scan and receive a 0–100 risk score, modality-tagged reasons, and bounding-box coordinates on flagged pixels or waveform windows. We run the detector models — FigStep and AgentTypo-trained text-in-image heads, waveform-anomaly plus Whisper-small transcript ensemble for audio — and we keep them current as new attack vectors land. The tool's job is to score the bytes a real user just sent.
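In code, the round trip is one POST plus a threshold check. The sketch below is illustrative only: the host, auth header, and response field names (`score`, `reasons`, `regions`) are assumptions based on the description above — check the API reference for the exact schema.

```python
import json
import urllib.request

GLYPHWARD_URL = "https://api.glyphward.example/v1/scan"  # hypothetical host

def scan(payload: bytes, api_key: str) -> dict:
    """POST raw image/audio bytes to /v1/scan and return the parsed response."""
    req = urllib.request.Request(
        GLYPHWARD_URL,
        data=payload,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/octet-stream",
        },
        method="POST",
    )
    # Tight timeout: this call sits on the request path (sub-200 ms p95).
    with urllib.request.urlopen(req, timeout=0.5) as resp:
        return json.load(resp)

def decide(result: dict, block_at: int = 70, log_at: int = 30) -> str:
    """Map the 0-100 risk score to a pass/log/block decision."""
    score = result["score"]
    if score >= block_at:
        return "block"
    return "log" if score >= log_at else "pass"
```

The thresholds (70/30) mirror the ones used in the integration sketch further down; tune them against your own false-positive tolerance.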
Honest feature table
| | Promptfoo | Glyphward |
|---|---|---|
| Layer in the stack | Eval / red-team / CI | Inline scanner / inference path |
| Triggered by | Developer at CI time | End-user upload at request time |
| Subject under test | Your LLM and any guardrails on it | The uploaded image or audio bytes |
| Output | Pass/fail matrix per test case | Per-request risk score + flagged regions |
| Latency budget | Minutes per suite is fine | Sub-200 ms p95 |
| Multimodal coverage | Provider-dependent; harness handles bytes if your assertions do | Image + audio first-class |
| Licence | MIT (OSS) + Promptfoo Cloud (paid) | Commercial managed service |
| Hosting | Self-host the OSS, or Cloud | Managed (our infra) |
| Pricing | OSS free · Cloud per quote | Free 10/day · $29/mo Pro · $99/mo Team |
| Owns the corpus | You curate yours | We curate, signatures shared across customers |
Where Glyphward wins
- Inline scoring of real user bytes. Promptfoo cannot block a malicious upload at request time — by design, it is not on the request path. Glyphward returns a risk score in time for you to decide whether to pass, log, or block before bytes reach your VLM.
- Multimodal first-class. Glyphward ships image and audio detectors with bounding boxes and flagged waveform windows. Promptfoo can drive multimodal test cases, but the detectors are still your responsibility — Promptfoo orchestrates, it does not classify pixels.
- Curated corpus that compounds. We benchmark against a labelled set of FigStep, AgentTypo, indirect image PI, and WhisperInject payloads. The corpus grows as customers scan; that compounding effect is a managed-service advantage Promptfoo cannot replicate without becoming a different product.
- Latency budget Promptfoo doesn't target. A request-path scanner has to answer fast. Glyphward targets sub-200 ms p95. Eval-time tooling has minutes per suite, which is the right budget for what eval-time tooling does — and the wrong budget for inline scanning.
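The pass-or-block decision from the first bullet amounts to a thin gate in front of the model call. A minimal sketch with stubbed scanner and model functions — the response field names (`score`, `reasons`) are assumptions, not the documented schema:

```python
def gate(scan_fn, model_fn, upload: bytes, threshold: int = 70) -> dict:
    """Score the uploaded bytes first; only clean uploads reach the model."""
    result = scan_fn(upload)
    if result["score"] >= threshold:
        # Never forward flagged bytes; surface the scanner's reasons instead.
        return {"blocked": True, "reasons": result.get("reasons", [])}
    return {"blocked": False, "output": model_fn(upload)}

# Stubbed usage — swap the lambdas for real Glyphward and VLM calls:
out = gate(lambda b: {"score": 4}, lambda b: "caption", b"pixels")
print(out)  # {'blocked': False, 'output': 'caption'}
```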
Where Promptfoo wins
- The right tool for measuring your defence. Promptfoo is how you discover that FigStep variant 17 still produces a successful jailbreak in your VLM, or that your latest system prompt regressed against a payload class. Glyphward does not do this — Glyphward is one of the defences you would point Promptfoo at.
- Provider breadth. Promptfoo runs against OpenAI, Anthropic, Google, Mistral, local Ollama, your custom HTTP endpoint, and dozens more. As a model-evaluation harness it has the broadest provider matrix in the category.
- CI-native. The output is YAML config and pass/fail assertions that drop into GitHub Actions. The unit of work is "did this PR regress against my red-team suite", which is exactly the question CI is built to answer.
- OSS, no vendor lock-in. MIT licence, runs entirely on your laptop or in your CI runner, never has to phone home. You own the corpus, you own the assertions, you own the report.
- Adjacent jobs. Promptfoo is also a hill-climb harness for prompt engineering, a regression detector for prompt changes, and a scoring framework for arbitrary LLM behaviour. The red-team mode is one of several uses; if you only need a scanner, that is a narrower problem.
When to pick which
Pick Promptfoo if you need pre-deploy assurance that your LLM and its guardrails behave under adversarial input — and especially if you do not yet have a red-team suite at all. Promptfoo is the lowest-friction path from "no eval" to "passing eval on every PR".
Pick Glyphward if you accept user-uploaded images or audio in production and want a managed scanner you don't have to operate. Most multimodal apps we see — avatar SaaS, voice agents, screenshot-reading agents — do not want to host another inference stack just to score uploads.
Running both is the default recommendation for any team taking multimodal PI seriously. Promptfoo at CI time, scoring whether Glyphward and your other defences still catch the latest payload classes; Glyphward at request time, scoring real user bytes before they reach your model. Two layers, two latency budgets, two consumers.
Integration sketch (running both)
The cleanest pattern is to register Glyphward as an HTTP provider inside your promptfooconfig.yaml and write assertions that confirm Glyphward correctly flags a known-malicious image at score ≥ 70 and clean stock photos at < 30. Run the suite in CI on every PR. When Glyphward's recall on the malicious set drops or the false-positive rate on the clean set climbs, the build fails before it ships. Promptfoo's free tier is enough; Glyphward's free tier (10 scans/day) covers a small CI suite, and the Pro tier ($29/mo, 100k scans) covers a much larger one.
A worked YAML example with the exact provider config, asserts, and how the corpus is laid out lives in our Promptfoo + multimodal scanning recipe.
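In outline, the config looks something like the fragment below. This is a sketch only — the provider fields follow the shape of Promptfoo's documented HTTP provider, but verify against their current docs, and the Glyphward URL, auth header, and `score` field are assumptions:

```yaml
# Sketch — check Promptfoo's HTTP provider docs for exact field names.
providers:
  - id: https://api.glyphward.example/v1/scan   # hypothetical endpoint
    config:
      method: POST
      headers:
        Authorization: Bearer ${GLYPHWARD_API_KEY}
      transformResponse: json.score             # expose the risk score as output

tests:
  - vars:
      image: corpus/malicious/figstep_variant.png
    assert:
      - type: javascript
        value: output >= 70    # known-malicious must score high
  - vars:
      image: corpus/clean/stock_photo.jpg
    assert:
      - type: javascript
        value: output < 30     # clean set must stay low
```

When recall on the malicious set drops or the clean set starts scoring above 30, these assertions fail the build — which is exactly the behaviour described above.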
FAQ
Does Promptfoo not include any image or audio detectors of its own?
Promptfoo orchestrates providers and asserts on outputs. The detectors that classify whether an image is a FigStep payload or whether audio carries a WhisperInject pattern are not part of the harness — that is what a scanner like Glyphward does, and that is what your assertions point at. If Promptfoo ships native multimodal detectors in the future we will update this page; their roadmap is theirs to announce.
Can I just call Glyphward from a Python test in pytest and skip Promptfoo?
Yes, and many teams do for a small suite. You give up the broader benefits Promptfoo brings — provider matrix, dashboards, prompt regression tracking, the bundled red-team payload library. For a suite of a few dozen samples on one provider, hand-rolled tests are fine; once you have multiple providers or want regression history, Promptfoo's harness pays for itself.
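For completeness, the hand-rolled version is only a few lines. `glyphward_scan` here is a hypothetical stub standing in for a real `/v1/scan` call; the thresholds match the integration sketch above:

```python
def glyphward_scan(path: str) -> int:
    """Stub: return a risk score for a file. Replace with a real /v1/scan call."""
    return 84 if "malicious" in path else 6

def test_malicious_sample_flagged():
    # Known-bad corpus samples must score at or above the block threshold.
    assert glyphward_scan("corpus/malicious/figstep_variant.png") >= 70

def test_clean_sample_passes():
    # Clean stock photos must stay well below the log threshold.
    assert glyphward_scan("corpus/clean/stock_photo.jpg") < 30
```

Run it with `pytest`; what you lose relative to Promptfoo is regression history and the provider matrix, not correctness.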
Is there overlap between Promptfoo's red-team payloads and Glyphward's training corpus?
Some — both draw on the public attack literature (FigStep, AgentTypo, WhisperInject, indirect-PI). The difference is what each of us does with those payloads. Promptfoo uses them as inputs to drive its eval matrix. We use them as labelled training and evaluation data for the detector models, and we add real flagged samples customers see in production. Different role for the same source material.
What about latency?
Promptfoo's latency budget is "however long the suite needs"; tens of seconds to minutes per run is normal. Glyphward targets sub-200 ms p95 because it sits on the request path. The two budgets are not interchangeable — a CI tool optimised for minute-scale runs is the wrong shape for a scanner that has to answer in 200 ms, and vice versa.
Will running both double my bill?
No. Promptfoo OSS is free; the cost is the CI runner time you already pay for. Glyphward starts at $0 and is a flat monthly rate above that. Running both does not compound — they sit at different points in your pipeline and bill against different budgets.
Further reading
- Promptfoo + multimodal scanning — full YAML config, provider definition, and CI recipe for using Glyphward inside Promptfoo.
- FigStep detection, AgentTypo detector, and Audio PI detection — the attack classes you would point Promptfoo at to test Glyphward's recall.
- Full multimodal PI scanner pricing comparison.
- Glyphward vs Lakera Guard · Glyphward vs LLM Guard · Glyphward vs Azure Prompt Shields.