ICP-by-platform · AWS Bedrock

Prompt-injection scanner for AWS Bedrock

Amazon Bedrock lets you invoke Claude, Llama, Titan, and other foundation models through a single managed API. When you pass image content blocks to a vision-capable model — Claude 3 Sonnet, Haiku, or Opus via bedrock:InvokeModel — those image bytes reach the model's vision encoder directly. Bedrock Guardrails, announced in 2023 and generally available in 2024, apply text-based content policies (topic denial, sensitive information redaction, grounding checks, word filters). They do not inspect the pixel layer of an image content block for FigStep-class typographic jailbreak instructions or AgentTypo-class glyph-distortion payloads. Scan those bytes before the InvokeModel call.

TL;DR

Between receiving a user-uploaded image and calling bedrock_runtime.invoke_model(), POST the image bytes to Glyphward's /v1/scan. If the risk score exceeds your threshold, reject the request before the image ever reaches the Bedrock model endpoint. One POST, under 200 ms, returns a 0–100 score and the flagged pixel region. Free tier: 10 scans/day, no card. Pro: 100,000/month at $29/mo. Start on the free tier while you wire the boto3 intercept.

Why Bedrock Guardrails don't close the multimodal PI gap

Bedrock Guardrails are a powerful content-safety layer. They can deny responses that match a topic (e.g., "competitor pricing"), redact PII categories (SSNs, credit card numbers), apply grounding checks against retrieved context, and block profanity or specific word lists. For text-based prompt injection — a user who types "ignore previous instructions" — a well-configured denied-topic Guardrail can intercept that in the input.

The gap opens when the injection is carried in an image. A FigStep attack renders the malicious instruction as text inside the image — characters drawn in a font that OCR misreads as benign but Claude's vision encoder reads as a clear instruction. Guardrails' input analysis reads the text fields in the request body (the "text" content blocks and the conversation history). It does not extract and analyse the rendered text inside the image bytes of an "image" content block. The Guardrail configuration does not currently include an image-PI detector — that capability is not in the Guardrails feature set.

The result: a user who submits an image with a typographic prompt injection payload passes Guardrails' input check because the injected instruction is invisible to the text-analysis path. The image content block reaches Claude's vision encoder, which reads the instruction clearly.

The boto3 intercept pattern

The correct intercept is server-side, before the request is dispatched to the Bedrock runtime endpoint. The following wrapper handles Claude-format content blocks (the anthropic_version: bedrock-2023-05-31 payload schema). The same pattern extends to Llama 3 vision and Amazon Titan Multimodal Embeddings if your pipeline uses those models for image inputs.

import boto3
import httpx
import base64
import json
import os

GLYPHWARD_API_KEY = os.environ["GLYPHWARD_API_KEY"]
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def _scan_image_bytes(img_bytes: bytes, label: str) -> None:
    """Raise if the image contains a multimodal PI payload."""
    resp = httpx.post(
        "https://api.glyphward.com/v1/scan",
        json={
            "data": base64.b64encode(img_bytes).decode(),
            "modality": "image",
            "source_trust": "low",
        },
        headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
        timeout=5,
    )
    result = resp.json()
    if result["score"] > 70:
        raise ValueError(
            f"{label}: multimodal PI score {result['score']} "
            f"(region: {result.get('region')})"
        )

def safe_invoke_claude_on_bedrock(messages: list, model_id: str = "anthropic.claude-3-sonnet-20240229-v1:0") -> dict:
    """Invoke a Claude model on Bedrock after scanning all image content blocks.

    messages: list of Anthropic-format message dicts with content arrays.
    Returns the parsed Bedrock response body.
    """
    for msg in messages:
        for i, block in enumerate(msg.get("content", [])):
            if not isinstance(block, dict):
                continue
            if block.get("type") != "image":
                continue
            src = block["source"]
            if src["type"] == "base64":
                img_bytes = base64.b64decode(src["data"])
            elif src["type"] == "url":
                img_bytes = httpx.get(src["url"], timeout=10).content
            else:
                continue
            _scan_image_bytes(img_bytes, f"message[{msg['role']}].content[{i}]")

    payload = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 1024,
        "messages": messages,
    }
    response = bedrock_runtime.invoke_model(
        modelId=model_id,
        body=json.dumps(payload),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())

Drop safe_invoke_claude_on_bedrock() wherever you currently call bedrock_runtime.invoke_model() with a Claude model. The scan adds ~150–200 ms per image block; Bedrock's own network round-trip adds 300–800 ms depending on model size and prompt length. The total latency overhead is imperceptible relative to the model's inference time.

Bedrock Knowledge Bases: the indirect-PI surface

Bedrock Knowledge Bases provides fully managed RAG: you point it at an S3 bucket, it chunks and embeds the documents into a vector store (OpenSearch Serverless or Pinecone), and exposes a RetrieveAndGenerate API. The chunking pipeline handles PDFs, Word documents, HTML, and Markdown. PDFs are chunked by text extraction — the text content of each page is extracted and embedded. Embedded images inside the PDF are not independently scanned for prompt-injection payloads.

A document that enters your S3 data source with a typographic PI payload embedded in an image on one of its pages is chunked, stored in the vector store, and returned as grounding context when a user's query retrieves the relevant chunk. The retrieved context arrives in Claude's context window as a trusted "retrieved document" — the model gives it the same epistemic weight as a system instruction. This is the indirect-PI vector via Knowledge Bases.

The correct remediation is a pre-ingestion scan placed in your S3 upload pipeline. Before any document lands in the data source bucket:

Extract all embedded images from the PDF (using pymupdf or pdfplumber).
POST each image to Glyphward's /v1/scan endpoint.
If any image scores above threshold, quarantine the document and alert your team. Do not write it to the data source bucket.
Log the scan ID, document hash, and ingestion decision as your LLM03-aligned provenance record — the same evidence your SOC 2 or ISO 27001 auditor will ask for.

This is the same architecture described in the OWASP LLM03:2025 RAG corpus poisoning page — the threat model is identical; only the managed-service wrapper changes.

Bedrock Agents: multimodal inputs in agent turns

Bedrock Agents orchestrates multi-step reasoning using Claude as the underlying model. An agent turn can include a sessionState.files[] array with user-uploaded files — images, PDFs, spreadsheets — that are injected into the agent's context for the current step. These files bypass Guardrails' text-analysis path in the same way that direct InvokeModel image content blocks do.

The intercept for Bedrock Agents is at the InvokeAgent call site. Before dispatching the agent action, scan each element of inputFiles (for the older Agents API) or sessionState.files[]:

def safe_invoke_agent(agent_id: str, agent_alias_id: str, session_id: str,
                      input_text: str, input_files: list | None = None) -> str:
    """Invoke a Bedrock Agent, scanning any image files first."""
    bedrock_agents_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

    if input_files:
        for f in input_files:
            if f.get("source", {}).get("sourceType") == "BYTE_CONTENT":
                content = f["source"]["byteContent"]
                img_bytes = base64.b64decode(content["data"])
                media = content.get("mediaType", "")
                if media.startswith("image/"):
                    _scan_image_bytes(img_bytes, f"agent input file: {f.get('name', '?')}")

    response = bedrock_agents_runtime.invoke_agent(
        agentId=agent_id,
        agentAliasId=agent_alias_id,
        sessionId=session_id,
        inputText=input_text,
        sessionState={"files": input_files} if input_files else {},
    )
    output_text = ""
    for event in response["completion"]:
        if "chunk" in event:
            output_text += event["chunk"]["bytes"].decode()
    return output_text

What Guardrails cover vs what Glyphward covers

Bedrock Guardrails and Glyphward are complements, not competitors. Use both:

Bedrock Guardrails: text-topic denial (keep the model on-task), PII redaction (remove sensitive data from responses), grounding check (detect hallucinations relative to retrieved context), word filters (block specific terms). Runs server-side inside Bedrock; requires no additional outbound call.
Glyphward: multimodal PI detection for the image bytes and audio waveforms that Guardrails does not inspect. Runs client-side in your application, before the InvokeModel or InvokeAgent call. Returns a 0–100 risk score, a flagged pixel region, and per-signal confidences for FigStep, AgentTypo, and WhisperInject attack classes.

The per-request Glyphward scan log (scan ID, risk score, flagged region, modality, timestamp) is also your evidence trail for AWS Shared Responsibility Model obligations, SOC 2 CC6.6, and ISO 27001 A.8.28 input-validation controls. Bedrock Guardrails logs to CloudWatch; Glyphward logs can be forwarded to the same CloudWatch log group via webhook to keep your audit trail consolidated.

Get early access

Related questions

Does Bedrock Guardrails detect prompt injection in images?

No. Bedrock Guardrails applies text-based analysis: denied-topic classifiers, PII detectors, grounding checks against retrieved context, and profanity/word-list filters. These operate on the text fields in the request body and the conversation history. The pixel-layer content of an image content block is not analysed by Guardrails for prompt-injection payloads. A FigStep jailbreak rendered inside an image will score 0 on all Guardrails content policies because it is not adult content, hate speech, violence, or PII — it is rendered instruction text.

Which Claude models on Bedrock accept image inputs?

As of 2026, Claude 3 Haiku, Sonnet, and Opus (all anthropic.claude-3-* model IDs) accept image content blocks via the Anthropic-format messages API on Bedrock. Claude 3.5 models also accept images. Claude 2 and Claude Instant are text-only. If you are on an older model ID, verify in the Bedrock documentation whether vision capability is supported before adding image content blocks to your requests.

What about Titan Multimodal Embeddings — is that a PI surface?

Amazon Titan Multimodal Embeddings (amazon.titan-embed-image-v1) converts images to vector embeddings for search and RAG use cases. The model does not generate text from image inputs, so there is no direct text-output PI surface on the embeddings endpoint itself. The PI risk is downstream: if an attacker can influence which images are embedded and retrieved, they can poison the retrieval corpus that feeds a generative model (the OWASP LLM03 indirect vector). Scan images at ingestion time, before they are embedded.

Does this work with Bedrock cross-region inference?

Yes. Cross-region inference changes which AWS region processes the request; the Bedrock runtime API surface and content block format are identical. The Glyphward scan happens in your application, before the Bedrock API call, regardless of which region endpoint you target. The scan latency (150–200 ms) is the same; Bedrock's cross-region round-trip adds additional latency on the back half of the request.

How do I handle audio inputs if the Bedrock model supports audio?

Amazon Bedrock does not currently expose audio-input content blocks in the public API (as of 2026, Nova Sonic processes audio via a separate streaming protocol). For speech-to-text pipelines that route through Transcribe before a Bedrock text call, the PI risk is at the Transcribe output boundary — see the voice-agents page for that pattern. If Bedrock adds audio content blocks in a future API revision, the Glyphward scan call is unchanged: POST bytes with modality: "audio" before the Bedrock call.