
Prompt-injection scanner for GPT-4o vision

GPT-4o is one of the most widely deployed multimodal LLMs. It processes images through image_url content blocks in the Chat Completions API, which accept either a remote URL or a base64-encoded data URL. OpenAI's moderation layer (the /v1/moderations endpoint) detects harmful text content and certain categories of harmful image content, but it is not a prompt-injection detector for images. A FigStep-class adversarial image submitted to a GPT-4o vision endpoint passes OpenAI's content moderation and delivers its instruction payload directly to the model's token stream. Your application receives a response that follows the adversarial instruction rather than your system prompt, with no flag, error, or other indication from the API that the input was adversarial. The defence must be applied at the application layer, before the chat.completions.create call.

TL;DR

Before calling client.chat.completions.create with any user-supplied or externally-sourced image content block, POST the image bytes to Glyphward's /v1/scan. Score ≥ 70 → reject the input, do not call the OpenAI API. Score < 70 → proceed as normal. Under 200 ms scan overhead. Works for both url and base64 image block formats. Free tier — 10 scans/day, no card.
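
The accept/reject rule above can be kept in one small function so that every call site applies the same cut-off. The sketch below is illustrative only: the Decision type and the decide name are assumptions made for this example and are not part of any Glyphward SDK.

```python
from dataclasses import dataclass

SCAN_THRESHOLD = 70  # cut-off quoted in the TL;DR


@dataclass(frozen=True)
class Decision:
    allow: bool   # True means it is safe to call the OpenAI API
    reason: str   # short text suitable for logs


def decide(score: float, threshold: float = SCAN_THRESHOLD) -> Decision:
    """Apply the 'score >= threshold means reject' rule."""
    if score >= threshold:
        return Decision(False, f"blocked: score {score} >= {threshold}")
    return Decision(True, f"allowed: score {score} < {threshold}")
```

Note that the boundary value itself is rejected: a score of exactly 70 does not reach the model.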

Why OpenAI's own moderation does not cover this

OpenAI's moderation API (/v1/moderations) classifies text and images into harm categories: hate, harassment, self-harm, sexual, violence, and their sub-categories. These categories describe the social harm classification of the content. A FigStep-class adversarial image does not produce a harmful-content-category hit — the image visually appears to be a normal photograph or document, with adversarial instructions encoded in the visual layer at a frequency or spatial scale that the vision encoder reads but that is invisible to the harm classifier. The harm classifier looks at the image as a whole and sees benign content. The vision encoder reads the pixel stream and sees an instruction. These are two different views of the same image bytes.

Anthropic's Claude and other providers' models face the same architectural limitation: content moderation is a harm-category classifier trained on explicit content, not a prompt-injection detector for the visual token stream. Prompt-injection (PI) detection requires a different model, purpose-trained on adversarial visual patterns.

The gap is explicitly acknowledged in the OWASP LLM Top 10 v2025 and in the MITRE ATLAS adversarial-ML threat catalogue. OpenAI's system card for GPT-4V (the predecessor to GPT-4o) notes that prompt injection via image is a known attack class that they continue to work on at the model training level — it is not solved at the API level.

Attack surface in GPT-4o applications

User-uploaded images in chatbots. Any GPT-4o chatbot that allows users to upload images accepts untrusted visual input. Users can upload screenshots, photos, documents, or synthetically generated images containing adversarial instructions. The instruction payload in the uploaded image can override the system prompt, cause the model to exfiltrate conversation history, or cause it to output content that violates your application's business rules.

GPT-4o in agentic pipelines reading screens. Computer-use agents that pass screenshots of desktop or browser UIs to GPT-4o for action planning have a screenshot-PI attack surface. A malicious web page or application can render adversarial text (for example, very low-contrast text that a human overlooks, or text drawn on a CSS or canvas layer that the vision encoder still reads) that instructs the agent to take unintended actions. See prompt-injection scanner for screenshot agents.

Retrieval-augmented generation with image-containing documents. RAG pipelines that use GPT-4o to analyse image-containing PDF documents — scanned reports, embedded charts, product catalogues — pass those embedded images to the model during retrieval. Documents sourced from external web pages, supplier feeds, or user uploads are untrusted. A payload embedded in a chart in an external document reaches the model during every retrieval call that returns that document. See indirect prompt injection via image.
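
Because the same embedded image can come back on many retrieval calls, one option is to scan each distinct image once at ingestion time and cache the verdict by content hash. The following is a minimal sketch under that assumption. IngestionScanner and the ScanFn signature are hypothetical names for this example; in practice scan_fn would wrap a call to the /v1/scan endpoint and return its score.

```python
import hashlib
from typing import Callable, Dict, Iterable, List

SCAN_THRESHOLD = 70

# Hypothetical signature: raw image bytes in, 0-100 PI score out.
ScanFn = Callable[[bytes], float]


class IngestionScanner:
    """Scan each distinct image once and remember its score."""

    def __init__(self, scan_fn: ScanFn, threshold: float = SCAN_THRESHOLD):
        self._scan_fn = scan_fn
        self._threshold = threshold
        self._scores: Dict[str, float] = {}  # sha256 hex digest -> score

    def score(self, image_bytes: bytes) -> float:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest not in self._scores:
            # First time these exact bytes are seen: pay for one scan.
            self._scores[digest] = self._scan_fn(image_bytes)
        return self._scores[digest]

    def clean_images(self, images: Iterable[bytes]) -> List[bytes]:
        """Return only the images whose score is below the threshold."""
        return [img for img in images if self.score(img) < self._threshold]
```

Only images returned by clean_images would be indexed or passed to the model. The cache is keyed on exact bytes, so a re-encoded copy of the same chart counts as a new image and is scanned again.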

GPT-4o Assistants API with file attachments. The OpenAI Assistants API supports file attachments that are passed to the model. When users upload images or PDFs to an Assistant thread, those files are processed by the model's vision capability. Each user-submitted file is an untrusted input. Scan before upload using the /v1/scan endpoint.

Real-time audio + vision in GPT-4o multi-modal sessions. GPT-4o's real-time API supports combined audio and vision sessions. The video frame stream is an additional attack surface — a malicious scene or screen share can deliver adversarial instructions via the video input. This is an emerging attack surface; Glyphward's per-frame scan capability applies here.
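
For frame streams (screen shares or screenshot loops), a gate can sit between the capture source and the model so that no frame is forwarded without a verdict. The sketch below is an assumption about how such a gate might look and is not Glyphward's per-frame API. gate_frames, FrameBlocked, and ScanFn are names invented for this example. Consecutive byte-identical frames, which are common with a static screen, reuse the previous clean verdict instead of being scanned again.

```python
import hashlib
from typing import Callable, Iterable, Iterator, Optional

SCAN_THRESHOLD = 70

# Hypothetical signature: encoded frame bytes in, 0-100 PI score out.
ScanFn = Callable[[bytes], float]


class FrameBlocked(Exception):
    """Raised when a frame scores at or above the threshold."""


def gate_frames(
    frames: Iterable[bytes],
    scan_fn: ScanFn,
    threshold: float = SCAN_THRESHOLD,
) -> Iterator[bytes]:
    """Yield frames that passed the scan; raise on the first blocked frame."""
    last_clean_digest: Optional[str] = None
    for frame in frames:
        digest = hashlib.sha256(frame).hexdigest()
        if digest != last_clean_digest:
            score = scan_fn(frame)
            if score >= threshold:
                raise FrameBlocked(f"frame score {score} >= {threshold}")
            last_clean_digest = digest
        # Either freshly scanned and clean, or identical to the last clean frame.
        yield frame
```

Raising rather than silently ending the stream lets the caller tell a blocked session apart from a session that simply finished.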

Python integration: scan before chat.completions.create

import base64
import os

import httpx
from openai import OpenAI

openai_client = OpenAI()  # uses OPENAI_API_KEY env var
GLYPHWARD_API_KEY = os.environ["GLYPHWARD_API_KEY"]  # keep the key out of source
GLYPHWARD_SCAN_URL = "https://glyphward.com/v1/scan"
SCAN_THRESHOLD = 70  # scores at or above this value are rejected

def scan_image(image_bytes: bytes, source: str = "user_upload") -> dict:
    encoded = base64.b64encode(image_bytes).decode()
    resp = httpx.post(
        GLYPHWARD_SCAN_URL,
        headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
        json={"image": encoded, "source": source},
        timeout=5.0,
    )
    resp.raise_for_status()
    return resp.json()  # {score, flagged_region, scan_id, modality}

def safe_vision_completion(
    image_bytes: bytes,
    user_text: str,
    system_prompt: str,
    source: str = "user_upload",
) -> str:
    """Scan image, then call GPT-4o only if clean."""
    scan = scan_image(image_bytes, source)
    if scan["score"] >= SCAN_THRESHOLD:
        raise ValueError(
            f"Image blocked: PI score {scan['score']} >= {SCAN_THRESHOLD}. "
            f"scan_id={scan['scan_id']}"
        )
    # Only reach here if scan passed
    b64 = base64.b64encode(image_bytes).decode()
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_text},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}",
                            "detail": "high",
                        },
                    },
                ],
            },
        ],
        max_tokens=1000,
    )
    # message.content is Optional[str]; return "" so the declared type holds
    return response.choices[0].message.content or ""
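
The function above labels every image as image/png in the data URL. If uploads may also be JPEG, GIF, or WebP, the MIME type can be derived from the file's leading bytes instead. The helper below is a sketch using only the standard library. sniff_mime and to_data_url are hypothetical names for this example, and the signature list covers only these four formats.

```python
import base64

# Leading-byte signatures for common still-image formats.
_SIGNATURES = (
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
)


def sniff_mime(image_bytes: bytes) -> str:
    """Guess the MIME type from magic bytes; raise if the format is unknown."""
    for prefix, mime in _SIGNATURES:
        if image_bytes.startswith(prefix):
            return mime
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if image_bytes[:4] == b"RIFF" and image_bytes[8:12] == b"WEBP":
        return "image/webp"
    raise ValueError("unrecognised image format")


def to_data_url(image_bytes: bytes) -> str:
    """Build a base64 data URL with a MIME type that matches the bytes."""
    mime = sniff_mime(image_bytes)
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime};base64,{b64}"
```

The result of to_data_url(image_bytes) could then replace the hard-coded f-string in the image_url block. Rejecting unknown formats early also keeps non-image payloads away from both the scanner and the model.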

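For applications that receive image URLs (not bytes), fetch the URL content first, scan the bytes, then pass the original URL to the OpenAI API (avoid re-encoding URLs you don't control). Note that the OpenAI API fetches the URL again on its side, so a host you do not control could serve different bytes on that second fetch; where that risk matters, send the scanned bytes as a base64 data URL instead.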

def safe_vision_completion_from_url(
    image_url: str,
    user_text: str,
    system_prompt: str,
) -> str:
    """Fetch image URL, scan bytes, then pass URL to GPT-4o if clean."""
    fetched = httpx.get(image_url, follow_redirects=True, timeout=10.0)
    fetched.raise_for_status()  # do not scan an error page in place of the image
    img_bytes = fetched.content
    scan = scan_image(img_bytes, source="external_url")
    if scan["score"] >= SCAN_THRESHOLD:
        raise ValueError(
            f"External image blocked: score={scan['score']}, "
            f"scan_id={scan['scan_id']}, url={image_url}"
        )
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": user_text},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            },
        ],
    )
    return response.choices[0].message.content
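
Both functions above raise when the scanner itself errors, since raise_for_status propagates HTTP failures. Making that fail-closed behaviour explicit can help, so that a timeout or malformed response is handled the same way as a high score. The sketch below shows one way to do this. require_clean, ImageBlocked, and the injected scan_fn are hypothetical names for this example, with scan_fn standing in for scan_image.

```python
from typing import Any, Callable, Mapping

SCAN_THRESHOLD = 70

# Hypothetical signature: image bytes in, parsed scan response out.
ScanFn = Callable[[bytes], Mapping[str, Any]]


class ImageBlocked(Exception):
    """Raised when an image must not be forwarded to the model."""


def require_clean(
    image_bytes: bytes,
    scan_fn: ScanFn,
    threshold: float = SCAN_THRESHOLD,
) -> None:
    """Return normally only if the scan succeeded and the score is below threshold."""
    try:
        result = scan_fn(image_bytes)
        score = float(result["score"])
    except Exception as exc:  # network error, bad JSON, missing field
        # Fail closed: an unscanned image is treated as blocked.
        raise ImageBlocked(f"scan unavailable, failing closed: {exc!r}") from exc
    if score >= threshold:
        raise ImageBlocked(f"PI score {score} >= {threshold}")
```

Callers then catch a single exception type and return one user-facing error. Whether to fail closed or fail open during a scanner outage is a product decision; this sketch assumes closed.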


TypeScript / Node.js integration

import OpenAI from "openai";
import axios from "axios";

const openai = new OpenAI();
const GLYPHWARD_API_KEY = process.env.GLYPHWARD_API_KEY!;
const SCAN_THRESHOLD = 70;

async function scanImage(imageBuffer: Buffer, source = "user_upload") {
  const { data } = await axios.post(
    "https://glyphward.com/v1/scan",
    { image: imageBuffer.toString("base64"), source },
    { headers: { Authorization: `Bearer ${GLYPHWARD_API_KEY}` }, timeout: 5000 }
  );
  return data as { score: number; scan_id: string; flagged_region: unknown };
}

export async function safeVisionCompletion(
  imageBuffer: Buffer,
  userText: string,
  systemPrompt: string
): Promise<string> {
  const scan = await scanImage(imageBuffer);
  if (scan.score >= SCAN_THRESHOLD) {
    throw new Error(
      `Image blocked: PI score ${scan.score}. scan_id=${scan.scan_id}`
    );
  }
  const b64 = imageBuffer.toString("base64");
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: systemPrompt },
      {
        role: "user",
        content: [
          { type: "text", text: userText },
          { type: "image_url", image_url: { url: `data:image/png;base64,${b64}` } },
        ],
      },
    ],
  });
  return response.choices[0].message.content ?? "";
}

Coverage matrix

Layer	Detects FigStep in image upload	Detects PI in RAG image documents	Detects PI in screenshot agent feeds	Blocks before OpenAI API call
OpenAI moderation API	No (harm categories, not PI)	No	No	No (different endpoint)
GPT-4o system prompt instruction	Unreliable (bypassed by payload)	Unreliable	Unreliable	No hard block
Lakera Guard (text)	No (text only)	No (text only)	No	Text only
Azure Prompt Shields	No (text only)	No	No	Text only, Azure-provider-gated
Glyphward	Yes — pixel-level	Yes — image scan	Yes — per-frame scan	Yes — scan before API call