Prompt-injection scanner for MCP servers

An MCP tool that returns an image is, from the model's point of view, an instruction. The Model Context Protocol's CallToolResult ships text, image and audio content blocks back to the host, and the host turns around and stuffs those blocks into the next turn of the LLM call. Your text-side guard — the one watching what the user typed, or what came back as plain text — never sees the bytes. FigStep glyphs, AgentTypo distortions, WhisperInject carriers, and indirect-PI screenshots all ride straight through any MCP server an agent is connected to. The fix is one middleware, mounted in the host between tools/call and the next sampling, that scans every non-text content block and drops or quarantines the ones that score over threshold.

TL;DR

The Model Context Protocol's Content union supports TextContent, ImageContent, AudioContent, and EmbeddedResource. Tool results return a list of these blocks, and well-behaved MCP hosts (Claude Desktop, Cursor, Cline, Continue, Goose, OpenAI's MCP-aware clients, and homegrown agents built on the official SDKs) pass image and audio blocks through to the model as multimodal turn input. A host that runs only a text-side scanner has no defence against an image returned by a poisoned, misconfigured, or third-party MCP server. Wrap a Glyphward call in a tool-result middleware, mount it once in the host, and you cover every server the agent connects to with under a hundred lines of Python.

Where the gap is in an MCP-host pipeline

The standard MCP host loop is: the host enumerates servers, the user (or the model) issues a request, the host walks the model through one or more tools/list + tools/call exchanges with each connected server, and the model produces a final response. Tool results come back as a CallToolResult with content as a list of typed blocks. Per the protocol specification, those blocks may be text, image bytes (base64 with a MIME type), audio bytes, or an embedded resource, which carries text or base64 blob contents identified by a URI. Hosts then insert those blocks directly into the next sampling turn, usually as a multimodal user-role message immediately following the tool-call assistant message, exactly as a vision-language model expects.

The text-safety pattern most hosts implement is the same as the LangChain pattern (covered in our LangChain agent scanner page): a guard runs over the user's typed prompt and, sometimes, over the textual portion of any tool output. It is a sound pattern. It is also blind to the bytes. When an MCP server returns an ImageContent with data = base64 PNG, no string-shaped guard inspects that PNG. The model does. The model decodes it, embeds it, attends over it, and treats whatever instructions are rendered onto it as part of the conversation.
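To make the blind spot concrete, here is a minimal sketch that uses plain dicts in the JSON shape of a tool result. The text_guard function and its phrase list are hypothetical stand-ins for whatever string-level guard a host runs, and the PNG bytes are a placeholder. The point is only that the image block never reaches the guard.

```python
import base64

# Hypothetical stand-in for a string-level guard: flags suspicious phrases.
SUSPICIOUS = ("ignore previous instructions", "system prompt")


def text_guard(text: str) -> bool:
    """Return True when the text looks like an injection attempt."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in SUSPICIOUS)


def guard_tool_result(result: dict) -> list:
    """Run the text guard the way a text-only host does.

    Returns one verdict per content block: "flagged", "clean", or
    "not inspected" for blocks that give the guard no string to read.
    """
    verdicts = []
    for block in result["content"]:
        if block["type"] == "text":
            verdicts.append("flagged" if text_guard(block["text"]) else "clean")
        else:
            verdicts.append("not inspected")
    return verdicts


# Placeholder bytes standing in for a PNG whose pixels render an instruction.
fake_png = base64.b64encode(b"\x89PNG placeholder bytes").decode()

tool_result = {
    "content": [
        {"type": "text", "text": "Here is the chart you asked for."},
        {"type": "image", "data": fake_png, "mimeType": "image/png"},
    ],
    "isError": False,
}

verdicts = guard_tool_result(tool_result)
print(verdicts)
```

The text block is judged clean and the image block is never looked at, which is the gap the rest of this page closes.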

This is the textbook indirect prompt-injection surface, the one Greshake and colleagues described in 2023, ported into 2026's tool ecosystem. The MCP-specific twist is that the set of attackers is no longer just "any web page the agent fetches" — it is also any MCP server the agent is configured to talk to, plus any upstream the server itself fetches from on a tool call. That set grows every time a user installs another community server.

Three MCP-specific attack patterns

Three patterns surface repeatedly in practice. Each rides through a text-side guard untouched.

1. Poisoned image content from a third-party server. A community-published MCP server — a screenshotter, a chart-renderer, a "fetch any URL and OCR it" helper, a CMS bridge that returns inline images — returns an ImageContent block whose pixels carry a FigStep-style instruction overlay or an AgentTypo-style adversarial-glyph block. The host pipes the block into the next model turn as a vision input. The model reads the rendered text. Your text guard saw a clean tool-call response ({"type": "image", "mimeType": "image/png"}) and waved it through.

2. Audio content from a voice / transcript MCP server. A voice-agent MCP server, or a "transcribe this clip" helper server, returns an AudioContent block carrying a WhisperInject-class out-of-band carrier or an inter-word jailbreak. If the host's downstream model is audio-aware (audio-input GPT-4o, Gemini-audio, Claude voice modes, or a chained STT-then-text pipeline), the bytes reach the model. The transcript-only filter the host might run produces innocuous text. The audio file does not. See WhisperInject detection and audio prompt-injection detection for the byte-level threat model that applies identically when the carrier is delivered by a tool result rather than a direct upload.

3. Embedded resources and resource references. An EmbeddedResource block carries resource contents inline, as text or as a base64 blob with a MIME type, identified by a URI the server publishes; a host can also fetch the same URI itself with resources/read. Either way the contents can be image or audio bytes. A subscription-aware host that re-reads the resource on a resources/updated notification can be re-poisoned on every update, a common pattern for "this server tracks a live document" servers. The attack surface is the same as the direct-image case, but the time-of-use shift means the guard must run on every fetch, not just once at install.
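The three patterns share one property: somewhere in the result there is a base64 payload that decodes to image or audio bytes. As a rough sketch, the helper below walks a tool result in its JSON shape and yields every such payload, including blobs inside embedded resources. The function name scannable_payloads is hypothetical, and the field names (data, mimeType, resource, blob) are assumed to follow the MCP schema; check them against the SDK version you run.

```python
from typing import Iterator, Tuple


def scannable_payloads(result: dict) -> Iterator[Tuple[str, str]]:
    """Yield (kind, base64_data) for every block that carries media bytes.

    Covers image blocks, audio blocks, and embedded resources whose
    contents are a base64 blob with an image/* or audio/* MIME type.
    Text blocks and text resources are skipped.
    """
    for block in result.get("content", []):
        block_type = block.get("type")
        if block_type in ("image", "audio"):
            yield block_type, block["data"]
        elif block_type == "resource":
            resource = block.get("resource", {})
            family = (resource.get("mimeType") or "").split("/", 1)[0]
            if "blob" in resource and family in ("image", "audio"):
                yield family, resource["blob"]


example = {
    "content": [
        {"type": "text", "text": "rendered"},
        {"type": "image", "data": "AAAA", "mimeType": "image/png"},
        {"type": "audio", "data": "BBBB", "mimeType": "audio/wav"},
        {"type": "resource", "resource": {
            "uri": "file:///live/page.png", "mimeType": "image/png", "blob": "CCCC"}},
        {"type": "resource", "resource": {
            "uri": "file:///notes.txt", "mimeType": "text/plain", "text": "hi"}},
    ]
}

payloads = list(scannable_payloads(example))
print(payloads)
```

Feeding each yielded pair to the scanner gives the embedded-resource case the same coverage as direct image and audio blocks.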

Why filtering the text portion is not enough

It is tempting to argue that a strict host should reject all non-text content blocks, or convert images to OCR text first and only pass the text. Both fail in production for the same reason: real MCP servers return images for legitimate reasons. A chart-rendering server returns a chart. A document-rendering server returns a page. A camera-frame server returns a frame. Stripping non-text content kills the legitimate use cases that motivated MCP's multimodal content support in the first place.

OCR-then-pass is worse. As argued in why every text-only scanner misses a 30-pixel PNG, OCR has a ceiling that sits below the FigStep / AgentTypo attack class. An adversarial-glyph PNG that OCRs as garbage but renders to a legible jailbreak passes any OCR-then-text-scan pipeline. The right contract is bytes in, score and region out — exactly what a multimodal scanner does and a text scanner with an OCR adapter cannot.

The middleware that closes it

The pattern below uses the official MCP Python SDK client surface. It wraps the host's call_tool path with a scan over every image and audio content block in the result. Drop it once in the host. It applies to every server the host connects to; the middleware does not need per-server configuration.

from mcp import ClientSession
from mcp.types import (
    CallToolResult,
    TextContent,
    ImageContent,
    AudioContent,
    EmbeddedResource,
)
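import hashlib
import logging
import os

import requests

log = logging.getLogger("mcp_scan")

GLYPHWARD_KEY = os.environ["GLYPHWARD_API_KEY"]
SCAN_URL = "https://glyphward.com/v1/scan"
BLOCK_AT = 70  # 0-100; tune to your false-positive budget
_cache: dict[str, float] = {}  # content-hash → score, in-memory

def _scan_bytes(b64: str, kind: str) -> float:
    """Return the scanner's 0-100 score for one base64 payload.
    Fails open: on any scanner error, log a warning and return 0.0."""
    h = hashlib.sha256(b64.encode()).hexdigest()
    if h in _cache:
        return _cache[h]
    try:
        r = requests.post(
            SCAN_URL,
            headers={"Authorization": f"Bearer {GLYPHWARD_KEY}"},
            json={f"{kind}_b64": b64},
            timeout=8,
        )
        r.raise_for_status()
        score = float(r.json()["score"])
    except (requests.RequestException, KeyError, TypeError, ValueError) as exc:
        # Fail open: pass the block through and leave the cache untouched,
        # so the next call with the same bytes retries the scan.
        log.warning("scan failed for %s block, passing through: %s", kind, exc)
        return 0.0
    _cache[h] = score
    return score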

def scrub_tool_result(result: CallToolResult, server_name: str) -> CallToolResult:
    """Replace any image/audio block whose score >= BLOCK_AT with a
    safe TextContent placeholder, leaving text and clean media intact."""
    safe: list = []
    for block in result.content:
        if isinstance(block, TextContent):
            safe.append(block)
        elif isinstance(block, ImageContent):
            score = _scan_bytes(block.data, "image")
            if score >= BLOCK_AT:
                safe.append(TextContent(
                    type="text",
                    text=f"[blocked image from {server_name}: PI score {score:.0f}/100]",
                ))
            else:
                safe.append(block)
        elif isinstance(block, AudioContent):
            score = _scan_bytes(block.data, "audio")
            if score >= BLOCK_AT:
                safe.append(TextContent(
                    type="text",
                    text=f"[blocked audio from {server_name}: PI score {score:.0f}/100]",
                ))
            else:
                safe.append(block)
        elif isinstance(block, EmbeddedResource):
            # Decide policy: read-then-scan, or skip if resource is text-only.
            safe.append(block)
        else:
            safe.append(block)
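    # Copy the result so fields other than content (isError and any others
    # the SDK version defines) are carried over unchanged.
    return result.model_copy(update={"content": safe})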

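# Host wrapper around the SDK's session.call_tool:
import asyncio

async def guarded_call_tool(session: ClientSession, server_name: str, name: str, arguments: dict):
    raw = await session.call_tool(name, arguments=arguments)
    # The scan uses blocking HTTP, so run it in a worker thread
    # rather than stalling the host's event loop.
    return await asyncio.to_thread(scrub_tool_result, raw, server_name)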

That is the entire middleware. guarded_call_tool replaces the host's existing session.call_tool call site: it takes the session and a server name alongside the tool name and arguments, and returns the same CallToolResult shape with only the bytes screened. Fail-open semantics are deliberate: a network error against the scanner returns the original bytes plus a logged warning rather than blocking the user, and the cache means repeated tool calls returning the same image (a chart that has not changed, an idle camera frame) do not pay the network cost.

Where to mount it: three patterns

  1. Host-wide single mount. Wrap the one place the host calls session.call_tool across all connected servers. This is the cleanest mount in custom hosts (any agent built on mcp.ClientSession) and matches Cline / Continue / Goose-style architectures where the host is in your code. One mount, every server covered.
  2. Per-server policy. Some servers are first-party and trusted (your in-house MCP bridge to your own DB), others are third-party. The middleware above takes server_name as an argument so per-server thresholds and quarantine policies are trivial: trust internal servers at BLOCK_AT=90, treat community servers at BLOCK_AT=50. The same scanning call, two different bands.
  3. Sampling sidecar. If the host is opaque to your code (a vendor's host process you cannot wrap), fall back to a sidecar that proxies the SDK's transport. The MCP transport layer (stdio, SSE, streamable-http) is well-defined; a transparent transport proxy that scans CallToolResult messages on the wire works for any host that talks the protocol. This is heavier than the wrapper above and only worth it when the host is not yours to modify.

None of the three require touching server code. That is the point — most servers in an agent's tool set are not yours, and the host is the only enforcement point that gets a clean view of every result.
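The per-server policy in pattern 2 can be as small as a lookup table keyed on the server name. The sketch below is one way to do it; the THRESHOLDS table, the "internal-*" naming convention, and the helper names are hypothetical, and the 90/50 bands are the ones mentioned above.

```python
from fnmatch import fnmatchcase

# Hypothetical policy table: the first matching pattern wins.
THRESHOLDS = [
    ("internal-*", 90),  # first-party servers: near pass-through
    ("*", 50),           # everything else: conservative quarantine
]

DEFAULT_BLOCK_AT = 50  # used only if no pattern matches


def block_at(server_name: str) -> int:
    """Return the blocking threshold (0-100) for a given server name."""
    for pattern, threshold in THRESHOLDS:
        if fnmatchcase(server_name, pattern):
            return threshold
    return DEFAULT_BLOCK_AT


def should_block(server_name: str, score: float) -> bool:
    """True when a scanned block from this server should be replaced."""
    return score >= block_at(server_name)


# The same score of 70 lands on different sides of the two bands.
print(should_block("internal-db-bridge", 70))
print(should_block("community-screenshotter", 70))
```

In the middleware, replacing the comparison against the global BLOCK_AT with should_block(server_name, score) is enough to switch from one band to two.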

Why MCP servers cannot enforce this themselves

A defender's first instinct is to push the check into the server: "the chart-rendering server should sanitise its output." That is fine for the server's own first-party renders, but it does nothing about the threat model. The threat is not that a well-behaved server has a bug; it is that an MCP server is, by design, a trust boundary the host crosses, and any number of servers in any host's tool set may be compromised, malicious, or simply downstream of an attacker-controlled URL. A user installs a community server in good faith. The server fetches a third-party image. The image carries the payload. The server returns it because that is what it was asked to return. The defence has to live in the host, on the result-receive path, because that is the one place that sees every tool result for every server.

Latency budget for a tool-loop host

An image scan returns in tens of milliseconds for typical chart and screenshot sizes (200 KB–2 MB). For a single-tool turn that is below the noise floor of any production model call. For tool-loop hosts that fire several tools per turn, parallelise: asyncio.gather over the content blocks of each result, or pre-warm the cache against known-clean content hashes. The middleware above caches by SHA-256 of the base64 payload, so repeated calls returning the same chart amortise to a hash lookup after the first scan.
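For hosts that return several media blocks per result, the parallel variant could look like the sketch below. It assumes an async scanner call; fake_scan is a hypothetical stand-in that sleeps briefly instead of making a network request, so the example runs without any service behind it.

```python
import asyncio
from typing import List, Optional


async def fake_scan(kind: str, b64: str) -> float:
    """Hypothetical stand-in for an async scanner call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return 95.0 if b64.startswith("evil") else 5.0


async def scan_blocks(blocks: List[dict], scan=fake_scan) -> List[Optional[float]]:
    """Scan all image/audio blocks concurrently.

    Returns one entry per block, in the original order: a score for
    media blocks, None for blocks that were not scanned.
    """
    async def one(block: dict) -> Optional[float]:
        if block["type"] in ("image", "audio"):
            return await scan(block["type"], block["data"])
        return None

    # gather preserves input order, so scores line up with blocks.
    return await asyncio.gather(*(one(b) for b in blocks))


blocks = [
    {"type": "text", "text": "two charts attached"},
    {"type": "image", "data": "cleanchart", "mimeType": "image/png"},
    {"type": "image", "data": "evilchart", "mimeType": "image/png"},
]

scores = asyncio.run(scan_blocks(blocks))
print(scores)
```

Because the scans overlap, the wall-clock cost of a result with several images stays close to the cost of the slowest single scan rather than the sum.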

For audio, the same shape applies. Audio content blocks tend to be larger (a 30-second clip at 16 kHz mono PCM-16 is ~960 KB) but the per-attachment scan cost is still a fixed constant per turn rather than a per-token tax. The scanner is not in the model's generation path; it is in the tool-result handoff, called once per non-text block, before the next sampling.
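The 960 KB figure follows directly from the clip parameters, and it is worth remembering that the payload travels as base64, which adds a third on top. A quick worked calculation (the helper names are illustrative):

```python
def pcm_bytes(seconds: int, sample_rate_hz: int, bytes_per_sample: int, channels: int = 1) -> int:
    """Size of uncompressed PCM audio in bytes."""
    return seconds * sample_rate_hz * bytes_per_sample * channels


def base64_len(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes bytes."""
    return 4 * ((n_bytes + 2) // 3)


raw = pcm_bytes(30, 16_000, 2)  # 30 s, 16 kHz, PCM-16 (2 bytes), mono
encoded = base64_len(raw)

print(raw)      # 960000
print(encoded)  # 1280000
```

So a 30-second clip is roughly 0.96 MB of audio and about 1.28 MB as the base64 string carried in the content block.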

How Glyphward fits

Glyphward's scoring contract is bytes in, score and region out. That maps cleanly onto the MCP Content shape: data is already base64, mimeType already tells the scanner whether it is image or audio. Pricing is the same flat-rate self-serve as for any other multimodal surface — see pricing and the side-by-side at multimodal PI scanner pricing comparison. The free tier (10 scans/day) covers prototyping a host integration; Pro at $29/month covers a small to mid-sized production deployment, and Team at $99 adds the audit log most security-aware host vendors want for their own incident-response paper trail.

The integration above is provider-agnostic. It works for Anthropic Claude, OpenAI, Google Gemini, AWS Bedrock-fronted models, or any local model the host wraps — because the scanner reads bytes, not the chat-completion API the bytes are about to go to. See the multimodal LLM security API for the broader API surface this middleware is calling into, and the embed widget preview for the same scoring exposed in a JS wrapper if the host is browser-side.

Related questions

Does this work with the TypeScript MCP SDK as well as Python?

Yes. The HTTP scan call is one fetch; mount the wrapper around client.callTool in the TS host the same way. The protocol-level shape (CallToolResult.content as a typed union) is identical, so the scrub function ports cleanly.

What about resources/subscribe notifications — do I scan on every update?

If your host re-reads the resource on each resources/updated and re-feeds the bytes into the model context, scan on each fetch. The cache by content hash keeps the cost flat when the resource has not actually changed. If your host only displays the resource and does not re-feed it to the model, you can skip the rescan and rely on the initial scan at first read.
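A small sketch of the scan-on-every-fetch rule with a content-hash cache is below. The RescanOnUpdate class and the injected scan callable are hypothetical; the scanner is stubbed so the example can count how many real scans happen.

```python
import base64
import hashlib
from typing import Callable, Dict


class RescanOnUpdate:
    """Scan resource bytes on every fetch, paying only when they changed."""

    def __init__(self, scan: Callable[[str], float]) -> None:
        self._scan = scan
        self._scores: Dict[str, float] = {}
        self.scan_calls = 0  # how many times the real scanner was hit

    def on_fetch(self, b64: str) -> float:
        """Call this each time the host re-reads the resource."""
        digest = hashlib.sha256(b64.encode()).hexdigest()
        if digest not in self._scores:
            self.scan_calls += 1
            self._scores[digest] = self._scan(b64)
        return self._scores[digest]


clean = base64.b64encode(b"clean page render").decode()
poisoned = base64.b64encode(b"poisoned page render").decode()

# Stub scanner: scores the poisoned payload high, everything else low.
guard = RescanOnUpdate(lambda b64: 88.0 if b64 == poisoned else 3.0)

first = guard.on_fetch(clean)      # initial read: scanned
second = guard.on_fetch(clean)     # update with identical bytes: cache hit
third = guard.on_fetch(poisoned)   # update with new bytes: scanned again

print(first, second, third, guard.scan_calls)
```

An update that changes nothing costs one hash; an update that swaps in new bytes is scanned before those bytes reach the model.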

I trust my own MCP server but not the community ones. Can I split the policy?

Yes — that is the per-server-policy mount in the patterns above. Pass server_name through and key the threshold off it. A common pattern is two bands: trusted (high BLOCK_AT, near-pass-through) and community (lower BLOCK_AT, conservative quarantine).

What if the host strips images entirely and only forwards text to the model?

Then the byte-level threat model does not apply at the host. But check carefully: many MCP-aware hosts do forward image and audio content as multimodal user messages, especially the agentic ones (Cline, Continue, Goose, custom code-running agents). If you are unsure, log the content-block types your host actually forwards on a representative day before deciding the threat does not affect you.
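The logging suggested above needs very little code. A minimal sketch, assuming you can hook the point where your host forwards tool results to the model; record_forwarded and the in-memory Counter are illustrative names:

```python
from collections import Counter

# Running tally of content-block types the host forwards to the model.
forwarded_types: Counter = Counter()


def record_forwarded(result: dict) -> None:
    """Count each content block in a forwarded tool result by its type."""
    for block in result.get("content", []):
        forwarded_types[block.get("type", "unknown")] += 1


# Two sample results as they might pass through the host in a day.
record_forwarded({"content": [{"type": "text", "text": "ok"}]})
record_forwarded({"content": [
    {"type": "text", "text": "chart attached"},
    {"type": "image", "data": "AAAA", "mimeType": "image/png"},
]})

print(dict(forwarded_types))
```

Any non-zero count for image, audio, or resource at the end of the day means the byte-level threat model applies to your host.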

Will this slow the agent's tool loop perceptibly?

For a single image per tool call it adds the scan time, typically tens of milliseconds, which is small next to a production model call. For text-only tool calls (the majority) the middleware is a no-op isinstance walk. If first-token latency on the next sampling matters, overlap the scan with the host's prompt assembly: start the scan as soon as the result returns, assemble the prompt concurrently, and withhold the block from the next sampling only if the score crosses the threshold.
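One way to overlap the scan with prompt assembly is sketched below. Both fake_scan and assemble_prompt are hypothetical stand-ins that sleep instead of doing real work, and this variant always waits for the verdict before sampling, which is the conservative reading of the advice above.

```python
import asyncio
from typing import List


async def fake_scan(b64: str) -> float:
    """Hypothetical scanner stub: sleeps, then returns a score."""
    await asyncio.sleep(0.02)
    return 92.0 if b64.startswith("evil") else 4.0


async def assemble_prompt(history: List[str]) -> List[str]:
    """Hypothetical prompt-assembly stub with a slot for the tool result."""
    await asyncio.sleep(0.02)
    return history + ["<image block>"]


async def next_turn(history: List[str], image_b64: str, block_at: float = 70.0) -> List[str]:
    # Start the scan immediately; it runs while the prompt is assembled.
    scan_task = asyncio.create_task(fake_scan(image_b64))
    prompt = await assemble_prompt(history)
    # Sampling must not start before the verdict is in.
    score = await scan_task
    if score >= block_at:
        prompt[-1] = f"[blocked image: PI score {score:.0f}/100]"
    return prompt


clean_prompt = asyncio.run(next_turn(["user: show me the chart"], "cleanchart"))
blocked_prompt = asyncio.run(next_turn(["user: show me the chart"], "evilchart"))

print(clean_prompt[-1])
print(blocked_prompt[-1])
```

When assembly takes at least as long as the scan, the scan adds nothing to first-token latency; when it does not, the host waits only for the remainder.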

Further reading