Prompt-injection scanner for OpenAI Assistants API

The OpenAI Assistants API lets users attach files to threads: PDFs for the file_search tool, images for vision-enabled runs, code output from the code_interpreter tool. Every file you upload via POST /v1/files and attach to a thread message lands in the assistant's context without OpenAI's Moderation API ever reading the bytes for prompt-injection content. The Moderation API checks for hate, violence, and adult content, not for FigStep-class typographic jailbreak payloads or AgentTypo-class glyph distortions that look innocuous to content moderation but deliver an instruction to the vision encoder. Scan the file bytes before they leave your server.

TL;DR

Between your user's file upload and your call to openai.files.create(), POST the bytes to Glyphward's /v1/scan. If the score exceeds your threshold, reject the upload and return an error to the user before the file ever reaches the OpenAI Files API. One POST, under 200 ms, returns a 0–100 risk score. Free tier: 10 scans/day, no card. Pro: 100,000/month at $29/mo. Start on the free tier while you wire the integration.

Why the OpenAI Files API is a PI surface

The Assistants API was designed to accept user-provided context. That is the product's value proposition — users upload their documents and the assistant answers questions about them. The same design feature is the attack surface.

Files attached to an Assistants API thread become part of the model's input in one of three ways:

  1. file_search tool. The file is chunked, embedded, and stored in a vector store. When a user asks a question, relevant chunks are retrieved and placed into the assistant's context window — the same indirect-PI channel as a RAG pipeline. An image embedded in a PDF whose OCR text is benign but whose pixel-layer contains a typographic instruction reaches the model when the surrounding chunk is retrieved.
  2. Vision-enabled runs. If the assistant's run configuration enables vision (GPT-4o), image files attached to the thread message are passed to the vision encoder on that turn. The image bytes go directly to the model — no extraction, no OCR, no intermediate text representation. A FigStep payload in a 30×30 pixel corner of an otherwise benign image is invisible to moderation and clearly legible to GPT-4o's vision encoder.
  3. code_interpreter output. When the code_interpreter tool generates a PNG, chart, or graph as part of its response, the image is written to the thread as a file and surfaces in subsequent turns. Code-generated artifacts carry the trust level of the inputs to the code — if the user prompted the code toward a specific plot or image output, that output may reflect attacker intent.

The scan goes before openai.files.create()

The correct intercept is server-side, before the file is uploaded to OpenAI. Once the file is in the OpenAI Files API, it can be attached to any thread in your organization. The pre-upload scan is a hard gate: if it fails, the file never reaches the Files API and cannot be attached to any thread.

import base64
import os

import httpx
import openai

# Assumes the key is supplied via an environment variable of this name.
GLYPHWARD_API_KEY = os.environ["GLYPHWARD_API_KEY"]

def safe_upload_file(file_bytes: bytes, filename: str, purpose: str = "assistants") -> str:
    """Upload a file to OpenAI after scanning for multimodal PI.
    Returns the file_id, or raises if the scan blocks it.
    """
    # Determine modality from extension
    ext = filename.rsplit(".", 1)[-1].lower()
    if ext in ("png", "jpg", "jpeg", "webp", "gif", "bmp"):
        modality = "image"
    elif ext in ("wav", "mp3", "ogg", "m4a"):
        modality = "audio"
    elif ext in ("pdf", "docx"):
        modality = "document"  # Glyphward extracts and scans embedded images
    else:
        modality = "image"  # default: scan as image

    scan_resp = httpx.post(
        "https://api.glyphward.com/v1/scan",
        json={
            "data": base64.b64encode(file_bytes).decode(),
            "modality": modality,
            "source_trust": "low",  # user-uploaded = untrusted
        },
        headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
        timeout=5,
    )
    scan_resp.raise_for_status()  # fail closed: a scanner error must not let the file through
    result = scan_resp.json()
    if result["score"] > 70:
        raise ValueError(
            f"File '{filename}' blocked: multimodal PI score {result['score']}. "
            f"Flagged region: {result.get('region')}."
        )

    # Clean — proceed with upload
    file_obj = openai.files.create(
        file=(filename, file_bytes),
        purpose=purpose,
    )
    return file_obj.id

Drop safe_upload_file() wherever you currently call openai.files.create(). The scan adds ~150–200 ms; the upload itself adds 200–500 ms depending on file size. From the user's perspective this is a single file-upload round-trip of roughly 350–700 ms.

What the scan covers for different file types

Glyphward's document-mode scan extracts all embedded images from PDFs and DOCX files and scores each. For a 20-page PDF with six embedded images and three scanned pages, the scan returns a single score: the maximum across the embedded images and the pixel layers of the scanned pages. The caller sees one score; the blocking logic is the same regardless of where in the document the payload was found.
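
The "one score per document" behavior can be pictured with a minimal sketch. This is an illustration only: the ComponentScore type and the per-component list are hypothetical names introduced here, not part of Glyphward's documented response, and the threshold of 70 simply mirrors the upload example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentScore:
    """One scanned component of a document (hypothetical shape)."""
    kind: str   # e.g. "embedded_image" or "scanned_page"
    page: int   # 1-based page number where the component was found
    score: int  # 0-100 risk score for this component


def document_score(components: list[ComponentScore]) -> int:
    """Collapse per-component scores into the single score the caller sees.

    A document is treated as being as risky as its riskiest component,
    so the result is the max. No scannable components yields 0.
    """
    return max((c.score for c in components), default=0)


def should_block(components: list[ComponentScore], threshold: int = 70) -> bool:
    """Apply the same rule as for a single image: block above the threshold."""
    return document_score(components) > threshold
```

Taking the maximum rather than an average means one hostile image cannot be diluted by many clean pages.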

For raw image files (PNG, JPEG, WebP, GIF, BMP), the scan runs the full typographic PI scanner: FigStep-class glyph detection, AgentTypo-class distortion detection, low-contrast instruction detection, and region-of-interest localization. For audio files (WAV, MP3, OGG, M4A), the scan runs the waveform-carrier analysis that catches WhisperInject-class out-of-band instructions Whisper drops from the transcript.

Handling file_search vector stores

The Assistants API's file_search tool creates a vector store from the uploaded files. Chunking and embedding happen server-side at OpenAI. If the file contains a typographic PI payload embedded in an image, the OCR extraction that feeds the chunker may miss it: the chunk text is clean, but the source page image the chunk was extracted from is not.

Pre-upload scanning catches this before any chunking happens. If the vector store was built before you wired the scan, the existing store may contain flagged content that the scan would have caught at ingestion. A one-time backfill (obtain each file's bytes, scan them, and delete files that fail) is the remediation path. OpenAI's Files API supports listing and deleting files by ID; content download may be restricted for some file purposes, so the backfill may need to read from your own retained copies of the originals. Either way, it can be scripted in an afternoon.
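
A backfill loop along these lines might look like the sketch below. It assumes nothing about where the bytes come from: fetch_bytes, scan, and delete are hypothetical callables you supply (a Files API download or your own storage, a wrapper around /v1/scan, and a file-deletion call respectively), which also lets you dry-run the loop against stubs first.

```python
from typing import Callable, Iterable


def backfill(
    file_ids: Iterable[str],
    fetch_bytes: Callable[[str], bytes],
    scan: Callable[[bytes], int],
    delete: Callable[[str], None],
    threshold: int = 70,
) -> dict[str, list[str]]:
    """Scan already-uploaded files and delete the ones that fail.

    Returns a report with the file IDs that were kept, deleted, or could
    not be checked at all.
    """
    report: dict[str, list[str]] = {"kept": [], "deleted": [], "errors": []}
    for file_id in file_ids:
        try:
            score = scan(fetch_bytes(file_id))
        except Exception:
            # Fetch or scan failed: leave the file in place and surface it
            # for manual review rather than guessing.
            report["errors"].append(file_id)
            continue
        if score > threshold:
            delete(file_id)
            report["deleted"].append(file_id)
        else:
            report["kept"].append(file_id)
    return report
```

With the OpenAI Python SDK, the IDs could come from iterating client.files.list() and delete could wrap client.files.delete. After a run, it is worth confirming that deleted files are also gone from any vector store that referenced them.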

Threads that accept image messages directly (vision runs)

In addition to file attachments, Assistants API threads accept image URLs in the content array of a message — the same format as the Chat Completions API. When a user-supplied URL or base64 image is included inline rather than uploaded via Files API, the pre-upload scan does not fire (there is no upload). The intercept point for inline images is the message-creation call:

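# Reuses httpx, openai, and GLYPHWARD_API_KEY from the upload example above.
def glyphward_scan_b64(b64: str, modality: str) -> int:
    """POST an already base64-encoded payload to /v1/scan and return its score."""
    resp = httpx.post(
        "https://api.glyphward.com/v1/scan",
        json={"data": b64, "modality": modality, "source_trust": "low"},
        headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()["score"]

def safe_add_message(thread_id: str, content: list) -> None:
    """Add a message to an Assistants thread after scanning inline images."""
    for part in content:
        if isinstance(part, dict) and part.get("type") == "image_url":
            url = part["image_url"].get("url", "")
            if url.startswith("data:image"):
                b64 = url.split(",", 1)[1]
                score = glyphward_scan_b64(b64, "image")
                if score > 70:
                    raise ValueError(f"Inline image blocked (score {score})")
            # URL-referenced images: fetch bytes first, then scan
    openai.beta.threads.messages.create(
        thread_id=thread_id, role="user", content=content
    )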

For URL-referenced images, fetch the bytes before the message is created, scan them, and reject the message if the score fails. Do not rely on OpenAI fetching the image to check it — the fetch happens server-side at OpenAI during the run, not before the message is accepted.
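
A minimal sketch of that fetch-then-scan step, using only the standard library for the download. The names fetch_image_bytes and check_image_url, the HTTPS-only rule, and the 20 MB cap are assumptions made for illustration; scan is any callable that returns a 0–100 score for raw bytes.

```python
import urllib.request
from typing import Callable
from urllib.parse import urlparse

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # assumed cap; tune to your own limits


def fetch_image_bytes(url: str, timeout: float = 5.0) -> bytes:
    """Download an image over HTTPS, refusing oversized bodies."""
    if urlparse(url).scheme != "https":
        raise ValueError("only https image URLs are accepted")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        # Read one byte past the cap so an oversized body is detectable.
        data = resp.read(MAX_IMAGE_BYTES + 1)
    if len(data) > MAX_IMAGE_BYTES:
        raise ValueError("image exceeds size cap")
    return data


def check_image_url(
    url: str,
    scan: Callable[[bytes], int],
    fetch: Callable[[str], bytes] = fetch_image_bytes,
    threshold: int = 70,
) -> None:
    """Raise ValueError if the image behind `url` scores above the threshold."""
    score = scan(fetch(url))
    if score > threshold:
        raise ValueError(f"URL image blocked (score {score})")
```

The bytes you scanned are not guaranteed to be the bytes OpenAI later fetches from the same URL. If that matters for your threat model, one option is to upload the scanned bytes yourself and reference the resulting file instead of passing the URL through.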

How Glyphward fits

Glyphward's /v1/scan endpoint is a one-call gate: POST bytes, get a score. No SDK installation required — a single httpx.post() or fetch() call is the entire integration. The scan result includes the risk score, the flagged pixel region (for images), the time window (for audio), and per-signal confidences for FigStep, AgentTypo, and WhisperInject classes.

OpenAI's Moderation API remains on the text-content path — the user's typed message, the assistant's response text. Glyphward covers the files, the inline images, and the audio attachments that the Moderation API was not designed to inspect. They are complementary. See pricing comparison for the per-scan cost math and free tier details for getting started.

Related questions

Does OpenAI's Moderation API catch prompt injection in images?

No. OpenAI's Moderation API scores for hate, violence, sexual, self-harm, and harassment content categories. A FigStep jailbreak payload on a white background is not objectionable content by any of those categories — it is a legible instruction that happens to be rendered as pixels. The Moderation API will return low scores on it. Prompt-injection detection and content moderation are different functions; both are needed for a fully guarded Assistants deployment.

What about the new OpenAI Responses API — does the same scan apply?

Yes. The Responses API (gpt-4o and later) accepts the same kinds of multimodal input as the Assistants API: image URLs, base64 blobs, and file references. The scan placement is the same: intercept before the image or file bytes reach the OpenAI API surface. The safe_add_message() pattern above translates directly to a pre-filter over the request's input array for the Responses API.
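
As a rough sketch of such a pre-filter: the function below walks a Responses-style input list and scans inline base64 images before the request is sent. It assumes image parts look like {"type": "input_image", "image_url": "data:image/...;base64,..."}, which is our reading of the Responses API format and should be checked against the current reference. prefilter_responses_input and scan_b64 are hypothetical names.

```python
from typing import Callable


def prefilter_responses_input(
    input_items: list[dict],
    scan_b64: Callable[[str], int],
    threshold: int = 70,
) -> list[dict]:
    """Scan inline base64 images in a Responses-style `input` list.

    Returns the list unchanged if every inline image passes; raises
    ValueError on the first image that scores above the threshold.
    """
    for item in input_items:
        content = item.get("content")
        if not isinstance(content, list):
            continue  # plain-string content carries no image parts
        for part in content:
            if not isinstance(part, dict) or part.get("type") != "input_image":
                continue
            url = part.get("image_url") or ""
            if url.startswith("data:image") and "," in url:
                score = scan_b64(url.split(",", 1)[1])
                if score > threshold:
                    raise ValueError(f"Inline image blocked (score {score})")
            # Remote URLs and file_id references are not handled here;
            # fetch and scan those bytes separately.
    return input_items
```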

How do I handle the code_interpreter output images?

Code-interpreter output images are generated by OpenAI's infrastructure; you cannot pre-scan them before they appear. Instead, scan them when you fetch the run results from the thread and before you display or act on them. If a run step produces an image output that scores over threshold, drop it from the displayed thread and flag it for review.
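
One way to structure that post-run check is sketched below. It assumes you have already collected the file IDs of generated images from the run's output; fetch_bytes and scan are hypothetical callables for downloading a generated file and scoring its bytes.

```python
from typing import Callable


def partition_output_images(
    image_file_ids: list[str],
    fetch_bytes: Callable[[str], bytes],
    scan: Callable[[bytes], int],
    threshold: int = 70,
) -> tuple[list[str], list[str]]:
    """Split generated image file IDs into (displayable, flagged_for_review)."""
    displayable: list[str] = []
    flagged: list[str] = []
    for file_id in image_file_ids:
        score = scan(fetch_bytes(file_id))
        # Anything above the threshold is withheld from the displayed thread.
        (flagged if score > threshold else displayable).append(file_id)
    return displayable, flagged
```

Only the IDs in the first list would be rendered to the user; the second list goes to whatever review queue you maintain.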

Is there an OpenAI-official tool for this?

As of 2026, OpenAI does not publish a per-file prompt-injection scanner in the Files API. The Moderation API classifies harmful-content categories rather than injected instructions, even where its newer models accept image input. This is the gap Glyphward closes: the bytes that reach GPT-4o's vision encoder through the Files API and thread messages, which the Moderation API was not built to inspect.

How is this different from what's needed for a chatbot with image upload?

The scan is identical. The difference is the downstream destination: a chatbot image upload goes to one Chat Completions call; an Assistants API file upload goes to a persistent thread that may be replayed, branched, and accessed by tools like file_search across multiple runs. The Assistants API pattern has a wider blast radius per file, which makes pre-upload scanning even more important.
