Multimodal AI security testing — from red-team eval to inference-time detection

Multimodal AI security requires two distinct control layers: eval-time red-teaming, which discovers which attack payloads break your system before deployment, and inference-time scanning, which blocks known attack payloads and close variants in production. The layers are complementary, and neither can substitute for the other. Promptfoo and Garak are eval-time tools: they run attack payloads against your model in a test harness and measure success or failure. Glyphward is an inference-time tool: it runs on every request, scans the image and audio bytes before they reach your model, and rejects payloads that score at or above the threshold. A multimodal AI security programme needs both layers; this page explains how they fit together and where each is insufficient alone.

TL;DR

Use Promptfoo or Garak to find multimodal prompt-injection (PI) vulnerabilities in pre-deployment red-team evals. Use Glyphward at inference time to block FigStep, AgentTypo, and WhisperInject payloads in production. Point your Promptfoo CI suite at Glyphward as the runtime scanner under test to verify that the scan gate blocks your red-team corpus. See the Promptfoo + Glyphward integration guide.

Eval-time vs inference-time: the two layers of multimodal AI security

The distinction matters because the two layers address different threat models:

Eval-time red-teaming runs before deployment. You give an evaluation tool a set of attack payloads (adversarial images, audio with injected instructions, text prompts), point it at your model or application, and measure whether the model produces a harmful or unexpected response. This tells you whether your model is vulnerable to known attacks. Red-teaming catches the vulnerabilities that made it through safety training and system-prompt hardening. It does not run in production.

Inference-time scanning runs in production, on every request, before the model call. It does not test the model's response to attacks — it prevents the attack payload from reaching the model in the first place. If the scanner flags a request, the model never sees the adversarial content and never produces a harmful response. The scanner is the gatekeeper; the model's safety properties are the fallback.
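The gate described above can be sketched as a thin wrapper around the model call. This is a minimal illustration, not Glyphward's client: `scan` and `call_model` are hypothetical callables you would supply, and the 0-100 score scale is an assumption inferred from the thresholds used elsewhere on this page.

```python
from dataclasses import dataclass
from typing import Callable, Optional

BLOCK_THRESHOLD = 70  # same blocking value the examples on this page use


@dataclass
class GateResult:
    blocked: bool
    score: float
    response: Optional[str]  # None when the request was blocked


def gated_model_call(
    image_bytes: bytes,
    scan: Callable[[bytes], float],      # hypothetical: returns a score (assumed 0-100)
    call_model: Callable[[bytes], str],  # hypothetical: your VLM client
    threshold: float = BLOCK_THRESHOLD,
) -> GateResult:
    """Scan first; forward the payload to the model only if it scores below threshold."""
    score = scan(image_bytes)
    if score >= threshold:
        # The model is never invoked for a flagged payload.
        return GateResult(blocked=True, score=score, response=None)
    return GateResult(blocked=False, score=score, response=call_model(image_bytes))
```

Passing the scanner and the model client in as callables keeps the ordering explicit (scan, then decide, then call) and lets you unit-test the gate without network access.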

A common mistake is to treat a passing red-team eval as a production security guarantee. It is not — a red-team eval tests a known corpus of payloads at a point in time. Novel payloads, payloads in new modalities, and payloads crafted after the eval was run are not covered. Conversely, relying only on inference-time scanning without red-teaming means you do not know which payloads your scanner misses until an attacker finds them. Both layers are necessary.

The multimodal red-team corpus: FigStep, AgentTypo, WhisperInject

A complete multimodal red-team corpus covers three attack classes:

FigStep (arXiv:2311.05608) — typographic prompt injection via anti-OCR rasterised text. Attack payloads are PNG images containing instructions rendered in fonts and sizes that OCR misses but VLMs read. Red-team corpus: a set of FigStep images with increasing difficulty (clear fonts → anti-OCR fonts → low contrast → mixed with benign content). The red-team question: does your production VLM follow instructions encoded in FigStep format? The inference-time question: does your scanner score FigStep images above threshold before they reach the model?

AgentTypo — glyph-distortion evolution of FigStep. Adds rotation, kerning jitter, Unicode confusables, and scale perturbations. OCR produces a benign or garbled transcript; the VLM reads the underlying instruction. Red-team corpus: AgentTypo-generated variants of a base instruction set.

WhisperInject (arXiv:2405.20653) — audio carrier injection that embeds instructions in frequency bands or waveform features that ASR transcription drops. Red-team corpus: WhisperInject audio samples at varying payload intensities and carrier frequencies. The red-team question: does your voice agent follow instructions embedded in WhisperInject audio? The inference-time question: does your audio scanner score WhisperInject samples above threshold?

Glyphward's curated payload corpus spans all three classes and is updated as new variants are published. When you run the Glyphward scanner in your Promptfoo CI suite (see below), you are implicitly testing your deployment against the current state of the published attack corpus.

CI integration: Promptfoo pointing at Glyphward as the oracle

Promptfoo is an eval-time red-team harness. You can point it at Glyphward as the "oracle" that evaluates whether an image input would be blocked in production:

# promptfoo.yaml: multimodal PI red-team eval with Glyphward as the scanner oracle
prompts:
  - "{{image_b64}}"  # placeholder prompt; the scan request is built from the test vars

providers:
  - id: https  # Promptfoo's generic HTTP provider
    label: glyphward-scanner
    config:
      url: https://glyphward.com/v1/scan
      method: POST
      headers:
        Authorization: "Bearer {{env.GLYPHWARD_API_KEY}}"
      body:
        image: "{{image_b64}}"  # base64-encoded test payload image
        source: "red_team_eval"
      transformResponse: "json.score"  # extract the numeric score

tests:
  - description: FigStep payload should be blocked (score >= 70)
    vars:
      # file:// image vars are loaded by Promptfoo as base64 strings
      image_b64: file://payloads/figstep_jailbreak_v1.png
    assert:
      - type: javascript
        value: "output >= 70"

  - description: Benign image should pass (score < 70)
    vars:
      image_b64: file://payloads/benign_cat_photo.jpg
    assert:
      - type: javascript
        value: "output < 70"

  - description: AgentTypo variant should be blocked (score >= 70)
    vars:
      image_b64: file://payloads/agenttypo_v1.png
    assert:
      - type: javascript
        value: "output >= 70"

Add this test suite to your CI pipeline. On every PR that changes your VLM integration or your scan threshold configuration, the suite verifies that the Glyphward scanner correctly classifies your red-team payload corpus. The Promptfoo + multimodal scanning guide covers the integration in more detail.

GitHub Actions: automated multimodal security CI

Add multimodal PI scanning to your CI pipeline as a pre-merge gate:

# .github/workflows/multimodal-security.yml
name: Multimodal AI security scan

on:
  pull_request:
    paths:
      - 'src/ai/**'          # trigger on VLM integration changes
      - 'tests/payloads/**'  # trigger on test corpus updates

jobs:
  scan-red-team-corpus:
    runs-on: ubuntu-latest
    steps:
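      - uses: actions/checkout@v4

      # Pin the interpreter so pip installs into a known Python environment
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'

      - name: Install dependencies
        run: pip install httpx pillow pymupdf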

      - name: Run multimodal PI scan on test corpus
        env:
          GLYPHWARD_API_KEY: ${{ secrets.GLYPHWARD_API_KEY }}
        run: |
          python tests/scan_corpus.py \
            --corpus-dir tests/payloads/figstep \
            --threshold 70 \
            --expect-flagged
          python tests/scan_corpus.py \
            --corpus-dir tests/payloads/benign \
            --threshold 70 \
            --expect-clean

      - name: Verify scan threshold config
        run: python tests/verify_threshold_config.py

The scan_corpus.py script scans every image in the specified directory and asserts the expected outcome: flagged for red-team payload directories (--expect-flagged), clean for benign image directories (--expect-clean). This checks both that the scanner identifies known-bad payloads (attack recall) and that it does not over-block known-good images (false-positive rate).
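The page does not show scan_corpus.py itself, so the following is one possible sketch of it rather than the actual script. It assumes the request body (`image`, `source`) and the `score` response field seen in the Promptfoo configuration, reuses the three flags from the workflow, and uses the standard library's urllib instead of httpx to stay dependency-free. The helper names `scan_file` and `find_failures` are hypothetical.

```python
"""Sketch of tests/scan_corpus.py. Request and response shapes are assumptions."""
import argparse
import base64
import json
import os
import urllib.request
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

SCAN_URL = "https://glyphward.com/v1/scan"
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def scan_file(path: Path) -> float:
    """POST one image and return its score (assumes a JSON reply with a 'score' field)."""
    body = json.dumps({
        "image": base64.b64encode(path.read_bytes()).decode("ascii"),
        "source": "red_team_eval",
    }).encode("utf-8")
    req = urllib.request.Request(
        SCAN_URL,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {os.environ['GLYPHWARD_API_KEY']}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return float(json.load(resp)["score"])


def find_failures(
    paths: Iterable[Path],
    threshold: float,
    expect_flagged: bool,
    scorer: Callable[[Path], float] = scan_file,
) -> List[Tuple[Path, float]]:
    """Return (path, score) for every file whose outcome contradicts the expectation."""
    failures = []
    for path in paths:
        score = scorer(path)
        flagged = score >= threshold
        if flagged != expect_flagged:
            failures.append((path, score))
    return failures


def main(argv=None) -> int:
    parser = argparse.ArgumentParser(description="Scan a payload directory and check outcomes.")
    parser.add_argument("--corpus-dir", type=Path, required=True)
    parser.add_argument("--threshold", type=float, default=70)
    mode = parser.add_mutually_exclusive_group(required=True)
    mode.add_argument("--expect-flagged", action="store_true")
    mode.add_argument("--expect-clean", action="store_true")
    args = parser.parse_args(argv)

    paths = sorted(p for p in args.corpus_dir.rglob("*") if p.suffix.lower() in IMAGE_SUFFIXES)
    if not paths:
        print(f"No images found under {args.corpus_dir}")
        return 2  # an empty corpus should not count as a pass
    failures = find_failures(paths, args.threshold, args.expect_flagged)
    for path, score in failures:
        print(f"FAIL {path}: score={score}")
    return 1 if failures else 0
```

To use it as a CLI, call `raise SystemExit(main())` under the usual `if __name__ == "__main__":` guard. A non-zero exit code fails the CI step, which is what makes the scan a pre-merge gate.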

Runtime monitoring: anomaly alerting and score distribution

Beyond blocking individual requests, the Glyphward scan score distribution over time is a signal for detecting active attack campaigns:

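import datetime
import json


# Placeholder alerting hooks: replace with your paging, ticketing, and metrics integrations.
def alert_oncall(message: str) -> None:
    print(f"[PAGE] {message}")


def queue_for_security_review(event: dict) -> None:
    print(f"[REVIEW] request={event['request_id']} score={event['score']}")


def log_borderline(event: dict) -> None:
    print(f"[BORDERLINE] request={event['request_id']} score={event['score']}")


def process_scan_result(scan: dict, request_id: str, user_id: str):
    """Route scan results to monitoring and alerting systems."""
    score = scan["score"]
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
    now = datetime.datetime.now(datetime.timezone.utc)
    event = {
        "event": "image_pi_scan",
        "request_id": request_id,
        "user_id": user_id,
        "scan_id": scan["scan_id"],
        "score": score,
        "modality": scan.get("modality", "image"),
        "flagged_region": scan.get("flagged_region"),
        "ts": now.isoformat().replace("+00:00", "Z"),
    }
    print(json.dumps(event))  # structured log to SIEM ingestion

    # Alerting tiers
    if score >= 85:
        # High-confidence attack: page on-call immediately
        alert_oncall(f"High-confidence PI attack: score={score} request={request_id}")
    elif score >= 70:
        # At or above threshold: blocked, log to security queue for review
        queue_for_security_review(event)
    elif score >= 50:
        # Borderline: log for statistical analysis
        log_borderline(event)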

A sudden spike in scan scores above 50 — even if no individual request exceeds the blocking threshold of 70 — may indicate an attacker probing your threshold with calibration payloads. This aligns with the monitoring requirements in ISO 27001 A.8.16 (Monitoring Activities — anomalous application behaviour) and SOC 2 CC7.2 (monitoring for anomalies indicative of malicious acts).
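One way to turn the borderline signal into an alert is to track the share of recent scans that fall between 50 and 70. The sketch below is illustrative only: the class name, the window size, and the 20% alert ratio are hypothetical choices, not values from Glyphward.

```python
from collections import deque
from typing import Deque


class BorderlineRateMonitor:
    """Track the share of recent scans that land in the borderline band."""

    def __init__(self, window: int = 200, low: float = 50, high: float = 70,
                 alert_fraction: float = 0.2):
        self.window = window
        self.low, self.high = low, high
        self.alert_fraction = alert_fraction  # hypothetical alerting ratio
        self._recent: Deque[bool] = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record one score; return True once the window is full and the
        borderline share has reached alert_fraction."""
        self._recent.append(self.low <= score < self.high)
        if len(self._recent) < self.window:
            return False  # not enough data yet to judge a rate
        return sum(self._recent) / self.window >= self.alert_fraction
```

Call `observe(score)` from the borderline branch of your result handler, or on every scan, and alert when it returns True. A fixed-size window keeps memory bounded; a production version would more likely bucket by time and by user or API key.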

How eval-time and inference-time controls complement each other

| Property | Promptfoo / Garak (eval-time) | Glyphward (inference-time) |
| --- | --- | --- |
| When it runs | Pre-deployment, in CI | Every production request |
| What it tests | Model response to known payloads | Image/audio bytes for adversarial content |
| Coverage of novel payloads | Only if in the test corpus | Corpus updated continuously |
| Handles indirect PI | No (test harness, not runtime) | Yes (scans tool results, retrieved content) |
| Compliance evidence | Test results (pre-deployment) | Per-request scan_id (operating evidence) |
| Blocks production attacks | No | Yes |
| Identifies model weaknesses | Yes | No (scanner, not evaluator) |

The evaluation test corpus is a point-in-time snapshot; production attacks adapt. Inference-time scanning against a continuously updated payload corpus covers the drift between your last red-team run and the current threat landscape. Together the two layers cover both the "test" and "detect" verbs in security control vocabularies: the OWASP LLM01:2025 mitigation guidance lists adversarial testing and input/output filtering as separate measures.

Related questions

Do I need red-team evals if I already have inference-time scanning?

Yes. Red-team evals tell you which attack classes your scanner catches and which it misses. If a new attack class appears (a novel typographic technique, a new audio carrier method) and your scanner does not yet cover it, the red-team eval will surface that gap before it reaches production. Inference-time scanning is a defence against known and known-similar attacks; red-teaming maps the boundary of "known-similar". Run them both.
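A per-class breakdown makes that boundary visible. The following is a minimal sketch, assuming you have already collected `(attack_class, score)` pairs from a corpus scan; the function name and the class labels are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def recall_by_class(results: Iterable[Tuple[str, float]],
                    threshold: float = 70) -> Dict[str, float]:
    """Map each attack class to the fraction of its payloads scored at or above threshold."""
    flagged: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for attack_class, score in results:
        total[attack_class] += 1
        if score >= threshold:
            flagged[attack_class] += 1
    return {name: flagged[name] / total[name] for name in total}
```

A class whose recall sits well below the others is a gap to report or to cover with another control before it reaches production.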

What multimodal red-team tools exist beyond Promptfoo?

Garak (NVIDIA, open-source) has probes for prompt injection and some multimodal attack classes. HarmBench includes multimodal jailbreak benchmarks. Anthropic's red-team evaluation framework includes multimodal attack evaluation. PyRIT (Microsoft, open-source) has an image attack module. These tools are all eval-time — they produce pass/fail test results, not per-request production evidence. Glyphward is the inference-time complement to any of them.

Can I use Glyphward's API to automate a red-team corpus scan?

Yes. Glyphward's /v1/scan endpoint is a standard REST API — you can POST your entire red-team corpus programmatically and collect the score distribution. This is useful for calibrating your threshold (checking what fraction of benign images score above various thresholds to estimate false-positive rates) and for regression testing (ensuring the scanner still flags known-bad payloads after a model or threshold update). See the Promptfoo integration example above for the configuration pattern.
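Once you have the two score lists, threshold calibration is a small computation. The sketch below assumes you have collected benign and attack scores separately; the function name and the default candidate thresholds are illustrative.

```python
from typing import Dict, List, Sequence


def sweep_thresholds(
    benign_scores: Sequence[float],
    attack_scores: Sequence[float],
    thresholds: Sequence[float] = (50, 60, 70, 85),
) -> List[Dict[str, float]]:
    """For each candidate threshold, report the benign false-positive rate and attack recall."""
    if not benign_scores or not attack_scores:
        raise ValueError("both score lists must be non-empty")
    rows = []
    for t in thresholds:
        # A sample counts as flagged when its score is at or above the threshold.
        fpr = sum(s >= t for s in benign_scores) / len(benign_scores)
        recall = sum(s >= t for s in attack_scores) / len(attack_scores)
        rows.append({"threshold": t, "false_positive_rate": fpr, "recall": recall})
    return rows
```

Storing the resulting rows alongside each CI run gives you a regression baseline: a drop in recall at your production threshold after a model or configuration change is the signal to investigate.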

How does multimodal AI security testing relate to MITRE ATLAS?

MITRE ATLAS catalogs LLM Prompt Injection (AML.T0051) and LLM Jailbreak (AML.T0054) as adversarial ML techniques. Red-team evals correspond broadly to pre-deployment adversary simulation, and inference-time scanning to runtime detection and input validation. A mature multimodal AI security programme maps both layers to the ATLAS mitigations for AML.T0051 and AML.T0054, with evidence that each layer is in place.

What scan threshold should I use in CI vs production?

In CI red-team evals, you typically want to assert that your known-bad payload corpus scores above the threshold (confirming detection) and your benign corpus scores below (confirming low false-positive rate). Use the same threshold value as your production configuration — the CI test should faithfully replicate the production gate. If you use different thresholds in CI and production, a passing CI eval does not guarantee production behaviour. For production, see the threshold guidance per context (70 for user uploads, 60 for ingestion, 50 for agentic tool results) in the framework integration pages.
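One way to keep CI and production on the same values is to have both import a single threshold table. This is a sketch under that assumption: the threshold values are the ones quoted above, while the context keys and function names are hypothetical.

```python
# Threshold values come from the guidance above; the context keys are hypothetical names.
SCAN_THRESHOLDS = {
    "user_upload": 70,
    "ingestion": 60,
    "agentic_tool_result": 50,
}


def threshold_for(context: str) -> int:
    """Look up the blocking threshold for a scan context; unknown contexts are an error."""
    try:
        return SCAN_THRESHOLDS[context]
    except KeyError:
        raise ValueError(f"unknown scan context: {context!r}") from None


def is_blocked(score: float, context: str) -> bool:
    """Apply the same gate decision in CI tests and in the production request path."""
    return score >= threshold_for(context)
```

Raising on an unknown context, rather than falling back to a default, avoids silently scanning a new input path with a threshold nobody chose for it.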