Prompt-injection scanner for AutoGen agents

AutoGen's group-chat design sends every message to every participant. When one agent returns an image — a rendered chart, a browser screenshot, a diagram from a code executor — that image arrives in the context of every other agent in the group. A FigStep-class jailbreak payload embedded in the image reaches all of them. AutoGen's text-side guardrails, including the built-in content filtering hooks, operate on the message's text field. They do not read the image_url or file content blocks in the same message. The fix is a message-preprocessing hook that scans every multimodal content block before it enters the relay.

TL;DR

AutoGen v0.4 (AgentChat API) lets you override each agent's on_messages method. Wire a Glyphward scan there: extract every image or audio content block from the incoming message, POST the bytes to /v1/scan, and raise an exception if the score exceeds your threshold, before the message reaches the agent's LLM call. For the legacy ConversableAgent pattern, override receive() with the same scan. Free tier: 10 scans/day, no card. Pro: 100,000/month at $29/mo.

How multimodal content flows through AutoGen

AutoGen's message format (both the v0.4 AgentChat MultiModalMessage and the legacy ConversableAgent dict-format) supports mixed-content messages: a single message can contain a text block, one or more image URLs or base64 blobs, and file attachments. The group-chat relay broadcasts this message as-is to every registered agent.

The agents that receive it pass the full message structure to their underlying LLM call. A GPT-4o or Claude 3 call that receives an image block runs its vision encoder on that image — the same encoder that indirect-prompt-injection research demonstrates can be tricked by typographic payloads in images that look visually innocuous to humans.

Four entry points introduce multimodal content into an AutoGen conversation:

  1. User proxy initial message. When a human or a UserProxyAgent fires the first message with an attached image or file, the payload goes to every group-chat participant in the first round. Scan before initiate_chat().
  2. Code-executor output. An ExecutorAgent that runs code producing a PNG, chart, or graph and embeds the output in its reply introduces a code-generated artifact into the conversation. Code execution is a powerful tool-call escalation surface: a crafted payload can survive through the code path unchanged. See the 2026 threat model for the tool-call escalation pattern.
  3. Tool call results from external APIs. Agents with function tools that return images (a browser screenshot tool, a search result with embedded images, a chart-rendering API) introduce third-party-sourced bytes into the conversation. Third-party content carries implicit low source trust.
  4. File attachments passed to a MultimodalConversableAgent. Documents, PDFs, and audio files passed via the file field trigger AutoGen's built-in file-reading path. The file bytes reach the LLM before any text extraction — if a PDF contains embedded images, those images are part of the LLM call payload.

AutoGen v0.4 — scan placement with the AgentChat API

AutoGen v0.4 restructures around the AssistantAgent / UserProxyAgent model with a cleaner message-processing API. The on_messages method on each agent accepts a list of ChatMessage objects; override it to intercept multimodal content:

import os

import httpx
from autogen_agentchat.agents import AssistantAgent
from autogen_agentchat.messages import MultiModalMessage
from autogen_core import Image

GLYPHWARD_API_KEY = os.environ["GLYPHWARD_API_KEY"]
BLOCK_THRESHOLD = 70  # example value; tune it to the trust level of the source

class ScannedAssistantAgent(AssistantAgent):
    # Note: some team runners may call on_messages_stream() rather than
    # on_messages(); check which entry point your AutoGen version uses.
    async def on_messages(self, messages, cancellation_token):
        for msg in messages:
            if isinstance(msg, MultiModalMessage):
                for item in msg.content:
                    if isinstance(item, Image):  # content is a list of str | Image
                        score = await glyphward_scan(item.to_base64(), "image")
                        if score > BLOCK_THRESHOLD:
                            raise ValueError(
                                f"Multimodal PI blocked (score {score}): "
                                f"message from {msg.source} quarantined."
                            )
        return await super().on_messages(messages, cancellation_token)

async def glyphward_scan(b64_data: str, modality: str) -> int:
    async with httpx.AsyncClient() as client:
        resp = await client.post(
            "https://api.glyphward.com/v1/scan",
            json={"data": b64_data, "modality": modality, "source_trust": "low"},
            headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
            timeout=5,
        )
        resp.raise_for_status()  # surface HTTP errors instead of a KeyError below
        return resp.json()["score"]

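Use ScannedAssistantAgent as the base class for every agent in the group chat. Each scan is awaited before the agent's LLM call, so it adds to the turn's latency without blocking the event loop; marginal latency over a standard round-trip is typically under 200 ms per image block. Because the scan runs per agent, an image broadcast to several agents is scanned once per recipient.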

Legacy ConversableAgent — override receive()

For AutoGen v0.2 / v0.3 (the ConversableAgent pattern), the intercept point is the receive() method, which is called each time an agent gets a message before its generate_reply() runs:

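import os
import httpx
from autogen import ConversableAgent

def glyphward_scan_sync(b64_data: str, modality: str) -> int:
    # Blocking counterpart of glyphward_scan(), for use inside synchronous hooks
    payload = {"data": b64_data, "modality": modality, "source_trust": "low"}
    headers = {"Authorization": f"Bearer {os.environ['GLYPHWARD_API_KEY']}"}
    resp = httpx.post("https://api.glyphward.com/v1/scan", json=payload, headers=headers, timeout=5)
    resp.raise_for_status()
    return resp.json()["score"]

class ScannedConversableAgent(ConversableAgent):
    def receive(self, message, sender, request_reply=None, silent=False):
        # message may be a str or a dict; dict content may be a list with image_url entries
        content = message.get("content") if isinstance(message, dict) else None
        if isinstance(content, list):
            for part in content:
                if isinstance(part, dict) and part.get("type") == "image_url":
                    url_or_b64 = part["image_url"].get("url", "")
                    if url_or_b64.startswith("data:image"):  # remote URLs are not scanned here
                        b64 = url_or_b64.split(",", 1)[1]
                        score = glyphward_scan_sync(b64, "image")
                        if score > 70:
                            raise RuntimeError(f"PI blocked (score {score})")
        super().receive(message, sender, request_reply, silent)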

The same pattern applies to any framework that uses the OpenAI message format with image_url content blocks — the scan placement is identical, only the hook method name changes.
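
To make the extraction step reusable across such frameworks, it can be pulled out of the hook into a standalone helper. This is a sketch under the assumption that messages follow the OpenAI content-part layout shown above; iter_inline_images is a hypothetical name, not part of any library.

```python
from typing import Any, Iterator


def iter_inline_images(message: Any) -> Iterator[str]:
    """Yield the base64 payload of every inline image in an OpenAI-style message."""
    content = message.get("content") if isinstance(message, dict) else None
    if not isinstance(content, list):
        return  # plain-text message: nothing to scan
    for part in content:
        if not isinstance(part, dict) or part.get("type") != "image_url":
            continue
        image_url = part.get("image_url")
        url = image_url.get("url", "") if isinstance(image_url, dict) else ""
        # Only data URIs carry bytes inline; remote URLs would need fetching first.
        if url.startswith("data:image") and "," in url:
            yield url.split(",", 1)[1]
```

Each framework's hook then reduces to a loop over iter_inline_images(message) that calls the scanner on every payload. Remote image URLs are skipped by this helper, so a deployment that allows them needs a separate fetch-and-scan step.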

Threat model: why agentic escalation is worse than single-LLM PI

In a single-agent setup, a successful prompt injection causes one model to misbehave once. In an AutoGen group chat, a successful injection can put every downstream agent under the attacker's instruction for the rest of the conversation. A researcher agent that produces a maliciously crafted chart passes that chart to a writer agent, a QA agent, and an executor agent, all of which see the embedded instruction in their context. The executor agent in particular may have code-execution, file-write, or API-call privileges, turning an image PI into arbitrary code execution.

This is the agentic escalation pattern documented in MITRE ATLAS AML.T0051 (LLM Prompt Injection) and in the OWASP LLM01:2025 multimodal sub-category. An image injected into a group chat of N agents reaches N − 1 more agents than it would in a direct chat, so the blast radius grows with the size of the group.

How Glyphward fits

Glyphward's /v1/scan accepts image bytes (base64 PNG / JPEG / WebP) or audio bytes (WAV / MP3 / OGG), returns a 0–100 risk score, the flagged bounding region or time window, and per-signal confidences. The async Python client fits naturally into AutoGen v0.4's async agent architecture. The sync client works in legacy receive() overrides.

Existing text-side guardrails — Lakera Guard, Azure Prompt Shields, LLM Guard — remain on the text and function_call message paths. Glyphward closes the image_url, file, and audio paths those guards cannot reach. See pricing comparison for the full cost-per-scan analysis across providers.

Related questions

Does AutoGen have any built-in multimodal safety features?

AutoGen v0.4 supports content filtering via Azure OpenAI's built-in filters when the underlying model is Azure-hosted. Azure's content filters include image moderation for hate/violence/adult content, but they are not designed to detect prompt-injection payloads embedded in images. A FigStep glyph block on a white background passes Azure's content filter (it is not objectionable content) and still delivers the injection. Content moderation and prompt-injection detection are distinct functions.

What if the image is generated by the code executor, not uploaded by a user?

Code-generated images carry the trust level of the code that generated them, which is at most as trusted as the inputs that code received. If the code ran on attacker-controlled data, the output image may contain a crafted payload. Scan code-executor output images the same way you scan user-supplied images — the source is less direct but the risk is equivalent.

Is this relevant to the new AutoGen Studio / Teams interface?

Yes. AutoGen Studio and the Teams interface build on the same AgentChat API and ConversableAgent patterns. Any image or file that enters the team conversation goes through the same message relay. Apply the scan at the agent level rather than at the UI level so the protection applies regardless of which front-end or session mode is used.

How does this compare to what's needed for CrewAI?

The scanner is identical. Placement differs: CrewAI has a Task/callback hook; AutoGen v0.4 has the on_messages override; legacy AutoGen has receive(). The threat model is the same — multi-agent relaying amplifies single-agent PI risk. Choose the hook native to your framework; the Glyphward API call is the same regardless.

What threshold should I use for blocking vs. logging?

Block at score > 80 (high-confidence PI from a low-trust source). Log and downgrade permissions at 50–80. Pass through at < 50. Tighten for user-supplied files (lower threshold); loosen for internally generated images (higher threshold). The source_trust field on the scan request handles this at the API level without forking your threshold logic.
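
The three bands above can be written as one small decision function. This is a sketch only: the Action enum and decide are hypothetical names, and the defaults mirror the thresholds quoted in this answer rather than any value mandated by the API.

```python
from enum import Enum


class Action(Enum):
    PASS = "pass"
    LOG_AND_DOWNGRADE = "log_and_downgrade"
    BLOCK = "block"


def decide(score: int, block_above: int = 80, log_from: int = 50) -> Action:
    """Map a 0-100 risk score to an action using the bands described above."""
    if score > block_above:
        return Action.BLOCK
    if score >= log_from:
        return Action.LOG_AND_DOWNGRADE
    return Action.PASS
```

Passing a lower block_above for user-supplied files, or a higher one for internally generated images, keeps the tightening and loosening in one place.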

Further reading