Security guide · All modalities

Prompt injection prevention best practices

Prompt injection is the OWASP LLM Top 10's number-one risk — and the only one where the attack payload can arrive via text, image, audio, or any combination. Every major LLM application security guide recommends "input validation" and "system prompt hardening," but these alone are insufficient. Text-only validation misses pixel-level injection (FigStep, AgentTypo, typographic prompt injection). System prompt hardening raises the bar but does not reliably prevent instruction override when adversarial content bypasses the text layer entirely. This guide covers the complete defence-in-depth stack, layer by layer, in the order they should appear in your request pipeline. Each layer handles a distinct attack class; no single layer is sufficient on its own.

The six-layer defence stack

  1. Layer 1 — Input format validation (reject malformed inputs at the edge)
  2. Layer 2 — Pre-LLM scan gate (detect adversarial content before the model call)
  3. Layer 3 — System prompt hardening (raise the instruction-override threshold)
  4. Layer 4 — Privilege separation (limit what the model can do even if injected)
  5. Layer 5 — Output encoding and output-handling guards (prevent injected output from causing downstream harm)
  6. Layer 6 — Audit logging (detect incidents after the fact; support compliance)

Layer 1 — Input format validation

Before any model call, validate that the input matches the expected schema. This is cheap, fast, and eliminates entire classes of attack that rely on malformed inputs reaching the model.

For image inputs: validate file type (check the magic bytes, not just the extension), enforce a maximum file size (4 MB is a reasonable cap for most use cases), enforce a maximum resolution (1 500 × 1 500 px before model preprocessing), and check that the image is actually decodable (PIL Image.verify() or equivalent). Reject images that cannot be decoded; they may be crafted to exploit parser vulnerabilities in the image codec.

For audio inputs: validate sample rate (reject inputs outside the model's supported range), enforce a maximum duration (30 seconds for voice commands, 5 minutes for dictation), and check that the audio is decodable with the expected codec. Abnormally short audio files with unusual sample rates are a signature of WhisperInject-style attack probes.
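The audio checks above can be sketched with the standard library alone, assuming uncompressed WAV input. The function name `validate_wav` and the set of accepted sample rates are illustrative assumptions, not values from this guide; substitute the range your model actually supports and use a suitable decoder for other codecs.

```python
import io
import wave

# Assumed values for illustration; use your model's supported sample rates.
ALLOWED_SAMPLE_RATES = {8_000, 16_000, 22_050, 44_100, 48_000}
MAX_DURATION_SECONDS = 30.0  # voice-command cap; dictation would use a larger cap


def validate_wav(raw_bytes: bytes, max_seconds: float = MAX_DURATION_SECONDS) -> float:
    """Return the clip duration in seconds, or raise ValueError if a check fails."""
    try:
        with wave.open(io.BytesIO(raw_bytes), "rb") as wav:
            rate = wav.getframerate()
            frames = wav.getnframes()
    except (wave.Error, EOFError) as exc:
        raise ValueError("Audio could not be decoded as WAV") from exc
    # Check the rate first so a crafted zero rate never reaches the division.
    if rate not in ALLOWED_SAMPLE_RATES:
        raise ValueError(f"Unsupported sample rate: {rate}")
    duration = frames / rate
    if duration <= 0 or duration > max_seconds:
        raise ValueError(f"Audio duration out of range: {duration:.2f}s")
    return duration
```

Returning the duration lets the caller log it alongside the scan result without decoding the file a second time.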

For text inputs: enforce a maximum token count before the model call (not the API's limit — your application's tighter limit). Reject inputs that contain null bytes, unrecognised Unicode control characters, or injection-pattern signatures (e.g. "Ignore previous instructions").
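A minimal sketch of the text checks, under stated assumptions: the token limit is an arbitrary example, `estimate_tokens` is a crude whitespace-based stand-in for the model's real tokenizer, and the single pattern shown is only the example signature quoted above. Pattern matching of this kind is easy to evade, so it should be treated as a cheap first filter rather than a complete defence.

```python
import re
import unicodedata

MAX_INPUT_TOKENS = 2_000  # assumed application limit, tighter than the API's
ALLOWED_CONTROL_CHARS = {"\n", "\r", "\t"}
# Illustrative signature list; a real deployment would maintain a longer one.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?previous\s+instructions", re.IGNORECASE),
]


def estimate_tokens(text: str) -> int:
    """Rough whitespace-based estimate; swap in the model's tokenizer."""
    return len(text.split())


def validate_text(text: str) -> str:
    """Return the text unchanged if it passes all checks, else raise ValueError."""
    if "\x00" in text:
        raise ValueError("Input contains a null byte")
    for ch in text:
        # Category "Cc" covers the C0/C1 control characters.
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROL_CHARS:
            raise ValueError(f"Input contains control character U+{ord(ch):04X}")
    if estimate_tokens(text) > MAX_INPUT_TOKENS:
        raise ValueError("Input exceeds the application token limit")
    for pattern in INJECTION_PATTERNS:
        if pattern.search(text):
            raise ValueError("Input matches a known injection signature")
    return text
```

Only category `Cc` is rejected here; format characters (category `Cf`, such as zero-width joiners) are left alone because legitimate emoji and right-to-left text use them.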

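# Image checks for this layer: size cap, magic-byte type check, decodability, resolution cap.
import io

import magic  # python-magic: identifies the type from file content, not the extension
from PIL import Image

MAX_FILE_BYTES = 4 * 1024 * 1024  # 4 MB
MAX_PIXELS = 1_500 * 1_500  # matches the 1 500 × 1 500 px cap described above
ALLOWED_MIME_TYPES = {"image/jpeg", "image/png", "image/gif", "image/webp"}

def validate_image(raw_bytes: bytes) -> bytes:
    """Return the validated bytes; oversized images are downscaled and re-encoded as PNG."""
    if len(raw_bytes) > MAX_FILE_BYTES:
        raise ValueError("Image exceeds maximum size")
    mime = magic.from_buffer(raw_bytes, mime=True)
    if mime not in ALLOWED_MIME_TYPES:
        raise ValueError(f"Unsupported image type: {mime}")
    try:
        # verify() raises if the file looks broken; it does not decode the pixel data
        Image.open(io.BytesIO(raw_bytes)).verify()
    except Exception as exc:
        raise ValueError("Image could not be decoded") from exc
    # Re-open after verify(): the verified image object cannot be reused
    img = Image.open(io.BytesIO(raw_bytes))
    w, h = img.size
    if w * h > MAX_PIXELS:
        scale = (MAX_PIXELS / (w * h)) ** 0.5
        # Clamp to 1 px so extreme aspect ratios never produce a zero dimension
        new_size = (max(1, int(w * scale)), max(1, int(h * scale)))
        if img.mode not in ("1", "L", "LA", "P", "RGB", "RGBA"):
            img = img.convert("RGB")  # e.g. CMYK JPEGs cannot be written as PNG
        img = img.resize(new_size, Image.LANCZOS)
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        return buf.getvalue()
    return raw_bytes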

Layer 2 — Pre-LLM scan gate (multimodal)

After format validation, scan image and audio inputs for adversarial injection content before the model call. Text-only guards (LLM Guard, Lakera Guard, Azure Prompt Shields) inspect the text layer only; they cannot detect payloads hidden in pixels or waveforms. The scan gate must handle all input modalities.

import base64
import os

import requests

GLYPHWARD_KEY = os.environ["GLYPHWARD_API_KEY"]
INJECTION_THRESHOLD = 65  # images scoring at or above this value are rejected

def scan_image(image_bytes: bytes, source: str = "user_upload") -> None:
    """Raises ValueError if the scan score meets the threshold. Silent on safe images."""
    resp = requests.post(
        "https://glyphward.com/v1/scan",
        json={"image": base64.b64encode(image_bytes).decode(), "source": source},
        headers={"Authorization": f"Bearer {GLYPHWARD_KEY}"},
        timeout=8,
    )
    # Fails closed: timeouts and HTTP errors propagate as exceptions,
    # so an unscanned image is never passed on unless the caller swallows them.
    resp.raise_for_status()
    result = resp.json()
    if result["score"] >= INJECTION_THRESHOLD:
        raise ValueError(
            f"Image rejected by security scan (score={result['score']}, scan_id={result['scan_id']})"
        )

Use the source field to distinguish user uploads, retrieved document images, and tool outputs; this lets you tune thresholds per source. Because an image is rejected when its score meets or exceeds the threshold, a lower threshold is stricter. Uploads from anonymous users warrant a lower threshold (50–60) than internally ingested documents from trusted sources (70–80). See the real-time vs batch guide for latency trade-offs.
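One way to express per-source thresholds is a lookup table with a strict default. This is a sketch only: apart from "user_upload", the source labels are hypothetical, and the numbers are picked from the ranges suggested above rather than from any measured calibration.

```python
# Hypothetical source labels (other than "user_upload") and illustrative values.
SOURCE_THRESHOLDS = {
    "user_upload": 55,         # anonymous uploads: stricter (50-60)
    "retrieved_document": 65,
    "tool_output": 65,
    "trusted_internal": 75,    # trusted internal ingestion: more lenient (70-80)
}
DEFAULT_THRESHOLD = 50  # unknown sources get the strictest treatment


def threshold_for(source: str) -> int:
    """Return the rejection threshold for a given input source."""
    return SOURCE_THRESHOLDS.get(source, DEFAULT_THRESHOLD)


def is_rejected(score: float, source: str) -> bool:
    """An input is rejected when its score meets or exceeds the source threshold."""
    return score >= threshold_for(source)
```

Defaulting unknown sources to the lowest threshold means a mislabelled or newly added source is treated as untrusted until someone deliberately relaxes it.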

Layer 3 — System prompt hardening

System prompt hardening is the most widely recommended defence and the most widely misunderstood. It is a soft control — it raises the threshold for instruction override but does not prevent injection when the adversarial content bypasses the text layer. Use it as one layer of a stack, not a standalone defence.

Effective hardening patterns:

Ineffective patterns (do not rely on these alone):

Layer 4 — Privilege separation

Design the system so that even a fully successful injection can cause minimal harm. This means: do not give the LLM tools it does not need for the current task.
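A per-task tool allowlist is one simple way to apply this. The sketch below is illustrative: the task and tool names are hypothetical, and the check is meant to run in application code before any tool call the model requests is executed.

```python
# Hypothetical tasks and tools; each task is granted only what it needs.
TASK_TOOLS: dict[str, frozenset[str]] = {
    "summarise_document": frozenset(),  # pure text task: no tools at all
    "answer_support_ticket": frozenset({"search_knowledge_base"}),
    "schedule_meeting": frozenset({"read_calendar", "create_event"}),
}


def tools_for_task(task: str) -> frozenset[str]:
    """Return the tools exposed to the model for this task (none if unknown)."""
    return TASK_TOOLS.get(task, frozenset())


def authorise_tool_call(task: str, tool_name: str) -> None:
    """Raise PermissionError if the model requests a tool outside the allowlist."""
    if tool_name not in tools_for_task(task):
        raise PermissionError(f"Tool {tool_name!r} is not permitted for task {task!r}")
```

Enforcing the allowlist outside the model matters: an injected instruction can change what the model asks for, but not what the application agrees to run.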

Layer 5 — Output encoding and output-handling guards

Even if the injection succeeds (the model follows adversarial instructions), output encoding can prevent the injected response from causing downstream harm. This is analogous to XSS prevention via output escaping.
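For the HTML case, the XSS analogy translates directly into escaping model output before it is rendered. The sketch below uses the standard library's `html.escape`; the function name and the wrapper markup are assumptions for illustration, and other sinks (SQL, shell, Markdown renderers) each need their own context-specific encoding.

```python
import html


def render_model_output(text: str) -> str:
    """Escape model output so it is displayed as text, never parsed as markup."""
    # quote=True also escapes " and ', which matters inside attribute values.
    return f'<div class="llm-output">{html.escape(text, quote=True)}</div>'
```

With this in place, a response containing `<script>` tags is shown to the user literally instead of being executed by the browser.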

Layer 6 — Audit logging

Log every scan result, every model call (with the sanitised input hash, not raw content), every tool call, and every tool result. Effective audit logs enable:
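As an illustration of logging a hash rather than raw content, the sketch below builds one JSON line per event. The field names and the `audit_record` helper are assumptions, not a prescribed schema; adapt them to whatever your log pipeline expects.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(event: str, payload: bytes, **fields) -> str:
    """Build a JSON log line that identifies the input by SHA-256 hash only."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "event": event,  # e.g. "scan", "model_call", "tool_call", "tool_result"
        "input_sha256": hashlib.sha256(payload).hexdigest(),
        **fields,        # e.g. scan_id, score, source, tool name
    }
    return json.dumps(record, sort_keys=True)
```

A call such as `audit_record("scan", image_bytes, score=72, source="user_upload")` yields a line that can be correlated with a later incident without the log itself storing user content.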

Defence stack coverage matrix

| Attack type | Layer 1 (format validation) | Layer 2 (scan gate) | Layer 3 (system prompt) | Layer 4 (privilege separation) |
|---|---|---|---|---|
| FigStep / AgentTypo (text-in-image) | No: valid image format | Yes: pixel-level classifier | Partial: probabilistic | Limits blast radius |
| WhisperInject (audio) | No: valid audio format | Yes: waveform anomaly classifier | No: bypass at transcript layer | Limits blast radius |
| Direct text injection ("ignore previous instructions") | Partial: pattern matching | Partial: text scan | Yes: main defence | Limits blast radius |
| Indirect injection via retrieved document | No | Yes: scan at retrieval | Partial | Limits blast radius |
| Model DoS via adversarial texture | Partial: resolution cap | Yes: complexity score | No | No |

Related questions

Is system prompt hardening sufficient on its own?

No. System prompt hardening is a probabilistic control: it reduces the rate at which injections succeed but does not provide a hard block. It is most effective against text-layer injections; against pixel-layer injections (FigStep, AgentTypo, typographic prompt injection) it offers limited protection because the model cannot reliably distinguish an instruction rendered in pixel form from a benign image. The only layer in this stack that blocks pixel-layer injection before it reaches the model is a pre-LLM scan gate that operates on the image bytes. Use system prompt hardening as Layer 3, not as a substitute for Layer 2.

What order should these layers be applied?

The layers should be applied in the order listed: format validation first (cheapest, eliminates malformed inputs before any expensive processing), then scan gate (finds adversarial content in valid inputs), then system prompt (raises the threshold for what gets through), then privilege separation (limits blast radius), then output encoding (prevents downstream harm), then audit logging (detects what got through). Do not skip to Layer 3 because "Layer 2 is complex." Every omitted layer leaves a gap an attacker will find.

Which of these layers addresses OWASP LLM Top 10 risks?

Layer 2 (scan gate) addresses LLM01 (Prompt Injection) and, for the multimodal sub-case, LLM04 (Model Denial of Service). Layer 4 (privilege separation) addresses LLM08 (Excessive Agency) and partially LLM07 (Insecure Plugin Design). Layer 5 (output encoding) addresses LLM02 (Insecure Output Handling). Layer 6 (audit logging) supports compliance obligations named in LLM09 (Overreliance). These identifiers follow the 2023 (v1.1) numbering of the list; the 2025 edition renumbers several entries (Excessive Agency becomes LLM06 there). For a full mapping, see the OWASP LLM01 page.

How does this relate to the NIST AI RMF and the EU AI Act?

Of the NIST AI RMF's four functions (GOVERN, MAP, MEASURE, and MANAGE), three map directly onto this stack: MAP identifies the injection threat; MEASURE implements monitoring (Layer 6); MANAGE implements the scan gate and privilege controls (Layers 2–4). The EU AI Act Article 15 requires an appropriate level of accuracy, robustness and cybersecurity for high-risk AI systems; an injection attack that causes an AI system to make an incorrect automated decision falls within the scope of this requirement. Layers 2 and 6 together (scan gate + audit log) are the most defensible response to an Article 15 audit query about prompt injection controls. See the EU AI Act Article 15 page for details.

Further reading