ICP-by-product · AutoGen multi-agent systems

Prompt-injection scanner for AutoGen agents

AutoGen's group-chat design sends every message to every participant. When one agent returns an image — a rendered chart, a browser screenshot, a diagram from a code executor — that image arrives in the context of every other agent in the group. A FigStep-class jailbreak payload embedded in the image reaches all of them. AutoGen's text-side guardrails, including the built-in content filtering hooks, operate on the message's text field. They do not read the image_url or file content blocks in the same message. The fix is a message-preprocessing hook that scans every multimodal content block before it enters the relay.

TL;DR

AutoGen v0.4 (AgentChat API) exposes on_messages preprocessing hooks. Wire a Glyphward scan there: extract every Image or Audio content block from the incoming ChatMessage, POST the bytes to /v1/scan, and raise a HandlerException if the score exceeds your threshold before the message reaches the agent's LLM call. For the legacy ConversableAgent pattern, override receive() with the same scan. Free tier: 10 scans/day, no card. Pro: 100,000/month at $29/mo.

How multimodal content flows through AutoGen

AutoGen's message format (both the v0.4 AgentChat MultiModalMessage and the legacy ConversableAgent dict-format) supports mixed-content messages: a single message can contain a text block, one or more image URLs or base64 blobs, and file attachments. The group-chat relay broadcasts this message as-is to every registered agent.

The agents that receive it pass the full message structure to their underlying LLM call. A GPT-4o or Claude 3 call that receives an image block runs its vision encoder on that image — the same encoder that indirect-prompt-injection research demonstrates can be tricked by typographic payloads in images that look visually innocuous to humans.

Four entry points introduce multimodal content into an AutoGen conversation:

User proxy initial message. When a human or a UserProxyAgent fires the first message with an attached image or file, the payload goes to every group-chat participant in the first round. Scan before initiate_chat().
Code-executor output. An ExecutorAgent that runs code producing a PNG, chart, or graph and embeds the output in its reply introduces a code-generated artifact into the conversation. Code execution is a powerful tool-call escalation surface: a crafted payload can survive through the code path unchanged. See the 2026 threat model for the tool-call escalation pattern.
Tool call results from external APIs. Agents with function tools that return images (a browser screenshot tool, a search result with embedded images, a chart-rendering API) introduce third-party-sourced bytes into the conversation. Third-party content carries implicit low source trust.
File attachments passed to a MultimodalConversableAgent. Documents, PDFs, and audio files passed via the file field trigger AutoGen's built-in file-reading path. The file bytes reach the LLM before any text extraction — if a PDF contains embedded images, those images are part of the LLM call payload.

AutoGen v0.4 — scan placement with the AgentChat API

AutoGen v0.4 restructures around the AssistantAgent / UserProxyAgent model with a cleaner message-processing API. The on_messages method on each agent accepts a list of ChatMessage objects; override it to intercept multimodal content:

from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import MultiModalMessage
import httpx, base64

class ScannedAssistantAgent(AssistantAgent):
    async def on_messages(self, messages, cancellation_token):
        for msg in messages:
            if isinstance(msg, MultiModalMessage):
                for item in msg.content:
                    if hasattr(item, "data"):  # Image content block
                        score = await glyphward_scan(item.data, "image")
                        if score > 70:
                            raise ValueError(
                                f"Multimodal PI blocked (score {score}) — "
                                f"message from {msg.source} quarantined."
                            )
        return await super().on_messages(messages, cancellation_token)

async def glyphward_scan(b64_data: str, modality: str) -> int:
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://api.glyphward.com/v1/scan",
            json={"data": b64_data, "modality": modality, "source_trust": "low"},
            headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
            timeout=5,
        )
        return resp.json()["score"]

Apply ScannedAssistantAgent as the base for every agent in the group chat. The scan runs asynchronously per-agent per-message, in parallel with other pre-processing — marginal latency over a standard round-trip is typically under 200 ms per image block.

Legacy ConversableAgent — override receive()

For AutoGen v0.2 / v0.3 (the ConversableAgent pattern), the intercept point is the receive() method, which is called each time an agent gets a message before its generate_reply() runs:

class ScannedConversableAgent(ConversableAgent):
    def receive(self, message, sender, request_reply=None, silent=False):
        # message is a dict; content may be a list with image_url entries
        if isinstance(message.get("content"), list):
            for part in message["content"]:
                if part.get("type") == "image_url":
                    url_or_b64 = part["image_url"].get("url", "")
                    if url_or_b64.startswith("data:image"):
                        b64 = url_or_b64.split(",", 1)[1]
                        score = glyphward_scan_sync(b64, "image")
                        if score > 70:
                            raise Exception(f"PI blocked (score {score})")
        super().receive(message, sender, request_reply, silent)

The same pattern applies to any framework that uses the OpenAI message format with image_url content blocks — the scan placement is identical, only the hook method name changes.

Threat model: why agentic escalation is worse than single-LLM PI

In a single-agent setup, a successful prompt injection causes one model to misbehave once. In an AutoGen group chat, a successful injection causes every downstream agent to operate under the attacker's instruction for the rest of the conversation. A researcher agent that produces a maliciously crafted chart passes that chart to a writer agent, a QA agent, and an executor agent — all of whom see the embedded instruction in their context. The executor agent in particular may have code-execution, file-write, or API-call privileges, turning an image PI into arbitrary code execution.

This is the agentic escalation pattern documented in MITRE ATLAS T0051 (LLM Prompt Injection) and in the OWASP LLM01:2025 multimodal sub-category. The blast radius in a group-chat architecture is larger than in a direct-chat architecture by the number of agents in the group minus one.

How Glyphward fits

Glyphward's /v1/scan accepts image bytes (base64 PNG / JPEG / WebP) or audio bytes (WAV / MP3 / OGG), returns a 0–100 risk score, the flagged bounding region or time window, and per-signal confidences. The async Python client fits naturally into AutoGen v0.4's async agent architecture. The sync client works in legacy receive() overrides.

Existing text-side guardrails — Lakera Guard, Azure Prompt Shields, LLM Guard — remain on the text and function_call message paths. Glyphward closes the image_url, file, and audio paths those guards cannot reach. See pricing comparison for the full cost-per-scan analysis across providers.

Get early access