Prompt-injection scanner for GPT-4o vision

GPT-4o is among the most widely deployed multimodal LLMs. It processes images through image_url content blocks in the Chat Completions API, which carry either a remote URL or a base64 data URL. OpenAI's moderation layer (the /v1/moderations endpoint) detects harmful text content and certain categories of harmful image content, but it is not a prompt-injection detector for images. A FigStep-class adversarial image submitted to a GPT-4o vision endpoint can pass OpenAI's content moderation and deliver its instruction payload directly to the model's token stream. Your application receives a response that follows the adversarial instruction rather than your system prompt, without any flag, error, or indication from the API that the input was adversarial. The defence must be applied at the application layer, before the chat.completions.create call.

TL;DR

Before calling client.chat.completions.create with any user-supplied or externally-sourced image content block, POST the image bytes to Glyphward's /v1/scan. Score ≥ 70 → reject the input, do not call the OpenAI API. Score < 70 → proceed as normal. Under 200 ms scan overhead. Works for both url and base64 image block formats. Free tier — 10 scans/day, no card.

Why OpenAI's own moderation does not cover this

OpenAI's moderation API (/v1/moderations) classifies text and images into harm categories: hate, harassment, self-harm, sexual, violence, and their sub-categories. These categories describe the social harm classification of the content. A FigStep-class adversarial image does not produce a harmful-content-category hit — the image visually appears to be a normal photograph or document, with adversarial instructions encoded in the visual layer at a frequency or spatial scale that the vision encoder reads but that is invisible to the harm classifier. The harm classifier looks at the image as a whole and sees benign content. The vision encoder reads the pixel stream and sees an instruction. These are two different views of the same image bytes.

Other providers, including Anthropic with Claude, face the same architectural limitation: content moderation is a harm-category classifier trained on explicit content, not a prompt-injection detector for the visual token stream. PI detection requires a different model, purpose-trained on adversarial visual patterns.

The gap is explicitly acknowledged in the OWASP LLM Top 10 v2025 and in the MITRE ATLAS adversarial-ML threat catalogue. OpenAI's system card for GPT-4V (the predecessor to GPT-4o) notes that prompt injection via image is a known attack class that they continue to work on at the model training level — it is not solved at the API level.

Attack surface in GPT-4o applications

User-uploaded images in chatbots. Any GPT-4o chatbot that allows users to upload images accepts untrusted visual input. Users can upload screenshots, photos, documents, or synthetically generated images containing adversarial instructions. The instruction payload in the uploaded image can override the system prompt, cause the model to exfiltrate conversation history, or cause it to output content that violates your application's business rules.

GPT-4o in agentic pipelines reading screens. Computer-use agents that pass screenshots of desktop or browser UIs to GPT-4o for action planning have a screenshot-PI attack surface. A malicious web page or application can render adversarial text (white text on white background, or text in the CSS/canvas layer readable by the vision encoder) that instructs the agent to take unintended actions. See prompt-injection scanner for screenshot agents.

Retrieval-augmented generation with image-containing documents. RAG pipelines that use GPT-4o to analyse image-containing PDF documents — scanned reports, embedded charts, product catalogues — pass those embedded images to the model during retrieval. Documents sourced from external web pages, supplier feeds, or user uploads are untrusted. A payload embedded in a chart in an external document reaches the model during every retrieval call that returns that document. See indirect prompt injection via image.
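
One way to apply the gate in a RAG pipeline is to scan each retrieved image just before it is attached to the model call, and forward only the ones that pass. The sketch below is a minimal illustration under assumptions: `filter_retrieved_images` and `fake_scan` are hypothetical names, and `scan` stands for any callable that returns a dict with the `score` and `scan_id` fields described on this page (for example, a wrapper around POST /v1/scan).

```python
from typing import Callable, Iterable, List, Tuple

SCAN_THRESHOLD = 70  # same cut-off used elsewhere on this page


def filter_retrieved_images(
    images: Iterable[Tuple[str, bytes]],
    scan: Callable[[bytes], dict],
    threshold: int = SCAN_THRESHOLD,
) -> Tuple[List[Tuple[str, bytes]], List[Tuple[str, str]]]:
    """Split (image_id, bytes) pairs into forwardable and quarantined.

    `scan` is assumed to return {"score": int, "scan_id": str, ...}.
    """
    clean: List[Tuple[str, bytes]] = []
    quarantined: List[Tuple[str, str]] = []
    for image_id, data in images:
        result = scan(data)
        if result["score"] >= threshold:
            # Keep the scan_id so the block can be traced later.
            quarantined.append((image_id, result["scan_id"]))
        else:
            clean.append((image_id, data))
    return clean, quarantined


# Offline demonstration with a stand-in scanner (not the real service).
def fake_scan(data: bytes) -> dict:
    return {"score": 95 if b"INJECT" in data else 3, "scan_id": f"demo-{len(data)}"}


clean, quarantined = filter_retrieved_images(
    [("chart-1", b"plain chart"), ("chart-2", b"INJECT payload")],
    fake_scan,
)
```

Only the pairs in `clean` would be turned into image content blocks; the quarantined IDs can be logged alongside the document that contained them.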

GPT-4o Assistants API with file attachments. The OpenAI Assistants API supports file attachments that are passed to the model. When users upload images or PDFs to an Assistant thread, those files are processed by the model's vision capability. Each user-submitted file is an untrusted input. Scan before upload using the /v1/scan endpoint.

Real-time audio + vision in GPT-4o multi-modal sessions. GPT-4o's real-time API supports combined audio and vision sessions. The video frame stream is an additional attack surface — a malicious scene or screen share can deliver adversarial instructions via the video input. This is an emerging attack surface; Glyphward's per-frame scan capability applies here.
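
For a frame stream, one possible shape is a generator that yields only frames that were scanned and passed, and stops the session on the first flagged frame. This is a sketch under assumptions: `scanned_frames`, `FrameBlocked`, and the `stride` parameter are hypothetical, and `scan` is any callable returning `score` and `scan_id`. Frames skipped by the stride are dropped rather than forwarded unscanned.

```python
from typing import Callable, Iterable, Iterator


class FrameBlocked(Exception):
    """Raised when a scanned frame scores at or above the threshold."""

    def __init__(self, index: int, score: int, scan_id: str):
        super().__init__(f"frame {index} blocked: score={score}, scan_id={scan_id}")
        self.index = index
        self.score = score
        self.scan_id = scan_id


def scanned_frames(
    frames: Iterable[bytes],
    scan: Callable[[bytes], dict],
    threshold: int = 70,
    stride: int = 1,
) -> Iterator[bytes]:
    """Yield every `stride`-th frame, but only after it has passed the scan."""
    if stride < 1:
        raise ValueError("stride must be >= 1")
    for index, frame in enumerate(frames):
        if index % stride != 0:
            continue  # skipped frames are dropped, never forwarded unscanned
        result = scan(frame)
        if result["score"] >= threshold:
            raise FrameBlocked(index, result["score"], result["scan_id"])
        yield frame
```

A larger `stride` trades model-side frame rate for fewer scans; it does not create unscanned input, because unscanned frames never leave the generator.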

Python integration: scan before chat.completions.create

import base64

import httpx
from openai import OpenAI

openai_client = OpenAI()  # uses OPENAI_API_KEY env var
GLYPHWARD_API_KEY = "YOUR_GLYPHWARD_API_KEY"
GLYPHWARD_SCAN_URL = "https://glyphward.com/v1/scan"
SCAN_THRESHOLD = 70

def scan_image(image_bytes: bytes, source: str = "user_upload") -> dict:
    encoded = base64.b64encode(image_bytes).decode()
    resp = httpx.post(
        GLYPHWARD_SCAN_URL,
        headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
        json={"image": encoded, "source": source},
        timeout=5.0,
    )
    resp.raise_for_status()
    return resp.json()  # {score, flagged_region, scan_id, modality}

def safe_vision_completion(
    image_bytes: bytes,
    user_text: str,
    system_prompt: str,
    source: str = "user_upload",
) -> str:
    """Scan image, then call GPT-4o only if clean."""
    scan = scan_image(image_bytes, source)
    if scan["score"] >= SCAN_THRESHOLD:
        raise ValueError(
            f"Image blocked: PI score {scan['score']} >= {SCAN_THRESHOLD}. "
            f"scan_id={scan['scan_id']}"
        )
    # Only reach here if scan passed
    b64 = base64.b64encode(image_bytes).decode()
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_text},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        max_tokens=1000,
    )
    return response.choices[0].message.content or ""  # content can be None
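
The data URL above hardcodes image/png, which mislabels JPEG, GIF, or WebP uploads. A minimal sketch of deriving the MIME type from the file's leading bytes follows; `sniff_image_mime` and `to_data_url` are hypothetical helper names, and the check covers only the four formats listed in the comments.

```python
import base64


def sniff_image_mime(data: bytes) -> str:
    """Guess the MIME type from magic bytes; raise if unrecognised."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data[:6] in (b"GIF87a", b"GIF89a"):
        return "image/gif"
    # WebP is a RIFF container with "WEBP" at byte offset 8.
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unsupported or unrecognised image format")


def to_data_url(data: bytes) -> str:
    """Build a base64 data URL labelled with the sniffed MIME type."""
    encoded = base64.b64encode(data).decode()
    return f"data:{sniff_image_mime(data)};base64,{encoded}"
```

Rejecting unrecognised formats before scanning also keeps non-image payloads out of both the scanner and the model call.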

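For applications that receive image URLs rather than bytes, fetch the URL content first, scan the bytes, then pass the original URL to the OpenAI API (avoid re-encoding URLs you don't control). Note that OpenAI fetches the URL again on its side, so the model sees the scanned bytes only if the host serves the same content on both requests: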

def safe_vision_completion_from_url(
    image_url: str,
    user_text: str,
    system_prompt: str,
) -> str:
    """Fetch image URL, scan bytes, then pass URL to GPT-4o if clean."""
    fetched = httpx.get(image_url, follow_redirects=True, timeout=10.0)
    fetched.raise_for_status()  # don't scan an error page in place of the image
    img_bytes = fetched.content
    scan = scan_image(img_bytes, source="external_url")
    if scan["score"] >= SCAN_THRESHOLD:
        raise ValueError(
            f"External image blocked: score={scan['score']}, "
            f"scan_id={scan['scan_id']}, url={image_url}"
        )
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_text},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    )
    return response.choices[0].message.content or ""  # content can be None
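
Both functions above let a scanner timeout or HTTP error propagate as a raw exception. One design option, shown here as a sketch rather than a documented Glyphward behaviour, is to fail closed: treat "scan unavailable" the same as "scan failed" so the image never reaches the model. `ImageRejected` and `scan_or_fail_closed` are hypothetical names, and `scan` is any callable returning `score` and `scan_id`.

```python
from typing import Callable


class ImageRejected(Exception):
    """The image must not be forwarded to the model."""


def scan_or_fail_closed(
    image_bytes: bytes,
    scan: Callable[[bytes], dict],
    threshold: int = 70,
) -> dict:
    """Return the scan result only if the image is clean; otherwise raise."""
    try:
        result = scan(image_bytes)
        score = result["score"]
    except Exception as exc:  # timeout, HTTP error, malformed response
        raise ImageRejected("scan unavailable; refusing to forward image") from exc
    if score >= threshold:
        raise ImageRejected(
            f"PI score {score} >= {threshold}, scan_id={result.get('scan_id')}"
        )
    return result
```

Whether to fail closed or fail open is an availability trade-off for the application owner; the sketch only shows the stricter of the two policies.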

TypeScript / Node.js integration

import OpenAI from "openai";
import axios from "axios";

const openai = new OpenAI();
const GLYPHWARD_API_KEY = process.env.GLYPHWARD_API_KEY!;
const SCAN_THRESHOLD = 70;

async function scanImage(imageBuffer: Buffer, source = "user_upload") {
  const { data } = await axios.post(
    "https://glyphward.com/v1/scan",
    { image: imageBuffer.toString("base64"), source },
    { headers: { Authorization: `Bearer ${GLYPHWARD_API_KEY}` }, timeout: 5000 }
  );
  return data as { score: number; scan_id: string; flagged_region: unknown };
}

export async function safeVisionCompletion(
  imageBuffer: Buffer,
  userText: string,
  systemPrompt: string
): Promise<string> {
  const scan = await scanImage(imageBuffer);
  if (scan.score >= SCAN_THRESHOLD) {
    throw new Error(
      `Image blocked: PI score ${scan.score}. scan_id=${scan.scan_id}`
    );
  }
  const b64 = imageBuffer.toString("base64");
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: systemPrompt },
      {
        role: "user",
        content: [
          { type: "text", text: userText },
          { type: "image_url", image_url: { url: `data:image/png;base64,${b64}` } },
        ],
      },
    ],
  });
  return response.choices[0].message.content ?? "";
}

Coverage matrix

| Layer | Detects FigStep in image upload | Detects PI in RAG image documents | Detects PI in screenshot agent feeds | Blocks before OpenAI API call |
|---|---|---|---|---|
| OpenAI moderation API | No (harm categories, not PI) | No | No | No (different endpoint) |
| GPT-4o system prompt instruction | Unreliable (bypassed by payload) | Unreliable | Unreliable | No hard block |
| Lakera Guard (text) | No (text only) | No (text only) | No | Text only |
| Azure Prompt Shields | No (text only) | No | No | Text only, Azure-provider-gated |
| Glyphward | Yes (pixel-level) | Yes (image scan) | Yes (per-frame scan) | Yes (scan before API call) |

Related questions

Does this work with the Assistants API and Responses API, not just Chat Completions?

Yes. For the Assistants API, scan image bytes before uploading to the Files API (client.files.create). Reject the file if the scan score meets or exceeds the threshold, and do not upload it. For the Responses API (OpenAI's newer interface, which can hold conversation state server-side), scan before adding an image content block to the input array. The scan gate is always at the application layer, before any interaction with the OpenAI API.

Does GPT-4o-mini also support vision? Should I scan for that model too?

Yes. GPT-4o-mini supports vision input via the same image_url content-block mechanism as GPT-4o. The attack surface is identical. Scanning applies to any model that accepts image inputs: GPT-4o, GPT-4o-mini, GPT-4V, and future OpenAI vision models. The Glyphward scan is model-agnostic: it analyses the image bytes, not the model that will process them.

What about o1 and o3 models with vision?

OpenAI's o1 and o3 reasoning models also support vision inputs via the same content-block API as GPT-4o. The reasoning chain in o1/o3 makes them particularly interesting targets for PI — a payload that successfully injects an instruction into the visual token stream may be further amplified by the extended reasoning process before output. Scan all vision inputs regardless of which OpenAI model is downstream.

Is there a risk of scan results being stale if I cache them?

Do not cache scan results against image hashes and re-use them for subsequent identical images. An attacker who learns that you cache scan results can craft an image that passes the scan once and then modify the served image in a cache-poisoning pattern. Always scan the exact bytes at the exact time of the request. The scan_id is a point-in-time result for a specific byte sequence; treat it as such.
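
To make "a point-in-time result for a specific byte sequence" concrete within a single request, the application can record a digest of the bytes it scanned and confirm that the bytes it is about to forward are the same ones. This is a sketch under assumptions: `ScanRecord`, `record_scan`, and `is_same_bytes` are hypothetical names, and the record is meant for per-request integrity checks and audit logs, not for re-use across requests.

```python
import hashlib
import hmac
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class ScanRecord:
    scan_id: str       # identifier returned by the scan
    score: int         # score returned by the scan
    sha256: str        # digest of the exact bytes that were scanned
    scanned_at: float  # Unix timestamp of the scan


def record_scan(image_bytes: bytes, scan_result: dict) -> ScanRecord:
    """Bind a scan result to the exact bytes it was computed for."""
    return ScanRecord(
        scan_id=scan_result["scan_id"],
        score=scan_result["score"],
        sha256=hashlib.sha256(image_bytes).hexdigest(),
        scanned_at=time.time(),
    )


def is_same_bytes(record: ScanRecord, outgoing: bytes) -> bool:
    """True only if `outgoing` is byte-identical to what was scanned."""
    digest = hashlib.sha256(outgoing).hexdigest()
    return hmac.compare_digest(record.sha256, digest)
```

Calling `is_same_bytes` immediately before building the image content block catches any code path that re-fetches or re-encodes the image between the scan and the model call.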

Should I scan images retrieved via GPT-4o's browsing or search tool?

GPT-4o's built-in web search tool retrieves text, not images in the vision-input sense — images are not injected into the model's vision context by the search tool. However, if your application retrieves images from URLs as part of an agent pipeline and passes them as image_url content blocks, those are external-origin images that should be scanned. The key question is: does your code explicitly pass image bytes or URLs to the OpenAI API as image content blocks? If yes, scan those bytes first.
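
The "does your code pass image content blocks?" question can be answered mechanically by walking the messages list before it is sent. The sketch below assumes the Chat Completions message shape used in the examples on this page; `image_urls_in_messages` and `is_inline` are hypothetical helper names.

```python
from typing import Any, Dict, List


def image_urls_in_messages(messages: List[Dict[str, Any]]) -> List[str]:
    """Collect the URL of every image_url content block in a messages list."""
    urls: List[str] = []
    for message in messages:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no image blocks
        for block in content:
            if isinstance(block, dict) and block.get("type") == "image_url":
                urls.append(block["image_url"]["url"])
    return urls


def is_inline(url: str) -> bool:
    """True for base64 data URLs, False for remote URLs that must be fetched."""
    return url.startswith("data:")


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What is in these images?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/a.png"}},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,AAAA"}},
        ],
    },
]
found = image_urls_in_messages(messages)
```

Every entry in `found` corresponds to bytes that should pass through the scan gate first: inline data URLs can be decoded directly, and remote URLs need to be fetched.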

Further reading