Prompt injection in autonomous AI research agents

A chatbot handles one exchange, then waits. An autonomous AI research agent (GPT-Researcher, AutoGPT, Perplexica, OpenDevin, or any custom LangGraph research loop) runs for minutes to hours, fires dozens of tool calls, writes intermediate findings to files, and synthesises a final report, all with no human reviewing each step. These agents are multimodal by necessity: they browse web pages containing charts, download arXiv PDFs that are rendered page by page so a vision model can extract figures, and take Playwright screenshots of the sources they visit. Every external image that enters that pipeline is potentially attacker-controlled. The agent never chose to trust those images; it simply fetched them as part of its research. Greshake et al. (2023) named this class of attack indirect prompt injection, and the OWASP Top 10 for LLM Applications covers it under LLM01 (Prompt Injection). The multimodal variant, where the injection payload is carried inside an image rather than in HTML text, is invisible to text-only defences, because they never inspect image bytes.

TL;DR

Autonomous research agents pass external images directly to a multimodal LLM with no content inspection. An adversarial image embedded in a web page, PDF, browser screenshot, or Wikipedia infobox can redirect the agent's entire research goal, exfiltrate partial findings to an attacker URL, or corrupt the final report — all within a single autonomous session lasting hours. Text output monitors, URL allowlists, and source-quality filters do not inspect image bytes and cannot catch this. Call POST https://glyphward.com/v1/scan on every image before it enters the agent's context window. Use the async pattern below to batch-scan all images from a single page in parallel so the research loop does not stall. Free tier — 10 scans/day, no card required.

The four multimodal injection surfaces in AI research agents

1. Web search result images. Research agents query Google, Bing, or SerpAPI and receive both text snippets and image results: charts, infographics, and thumbnail previews from indexed pages. When the agent follows a result URL to read the source, its browser tool or HTML scraper also fetches the images on that page. If the agent passes the full page rendering (or individual images) to its multimodal model for comprehension, any image on the page is now inside the model's context window. An attacker who controls a page indexed by these search engines, or who can place an image on a high-ranking page (via a comment, an embedded widget, or an ad), can craft an adversarial image that instructs the agent to do any of the following: redirect its research goal to a different topic; exfiltrate the partial findings accumulated so far by issuing a GET request to an attacker-controlled endpoint; or produce a biased summary that favours a particular conclusion. Because the agent has already done significant autonomous work before reaching that source, the injection arrives late in the pipeline, when the context is dense with trusted-looking intermediate findings, which may make the model more likely to comply.
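As a minimal sketch of the first step of a gate for this surface, the snippet below enumerates the images a fetched page references, so that each one can be downloaded and scanned before it reaches the model. It uses only the Python standard library. The names ImageURLCollector and collect_image_urls are hypothetical, and the sketch assumes images arrive via `<img>` `src` and `srcset` attributes only; CSS backgrounds, `<picture>` sources, and script-injected images would need a browser-level hook instead.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageURLCollector(HTMLParser):
    """Collects an absolute URL for every <img> reference in a page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr_map = dict(attrs)
        candidates = []
        if attr_map.get("src"):
            candidates.append(attr_map["src"])
        # srcset is a comma-separated list of "url descriptor" entries.
        # (Assumption: no data: URIs, which may themselves contain commas.)
        for entry in (attr_map.get("srcset") or "").split(","):
            parts = entry.split()
            if parts:
                candidates.append(parts[0])
        for candidate in candidates:
            absolute = urljoin(self.base_url, candidate)
            if absolute not in self.urls:
                self.urls.append(absolute)


def collect_image_urls(html: str, base_url: str) -> list[str]:
    """Return de-duplicated absolute image URLs in document order."""
    collector = ImageURLCollector(base_url)
    collector.feed(html)
    return collector.urls


sample_html = (
    '<p>Quarterly results</p><img src="/a.png">'
    '<img src="https://cdn.example/b.jpg" srcset="/b-2x.jpg 2x, /b-3x.jpg 3x">'
    '<div><img alt="no source"></div>'
)
found = collect_image_urls(sample_html, "https://example.org/page")
```

Each URL in `found` would then be fetched and paired with its bytes before scanning; every `srcset` variant is listed because the agent's browser may pick any of them.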

2. PDF page rendering during academic paper retrieval. A core capability of research agents is downloading academic papers from arXiv, Semantic Scholar, or institutional repositories and extracting their content. Most agents render PDF pages as images using pdf2image or PyMuPDF, then pass the rendered page images to the vision model to extract figures, captions, equations, and tables that are lost in plain text extraction. Any figure inside the PDF is a potential injection vector. An adversary who publishes an arXiv-style preprint — or who compromises a legitimate paper's supplementary materials — can embed an adversarial image as a figure with carefully crafted pixel-level content. When the agent renders that page and passes it to the vision model, the instruction in the image may direct the agent to include fabricated citations in its final report, attribute false claims to real authors, or subtly alter a numerical finding in the synthesised output. The agent's provenance chain traces the injected information to a legitimate-looking academic source, making the attack difficult to audit after the fact.
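The rendering step described above can be sketched as follows, assuming PyMuPDF is installed (imported as `fitz`). The sketch turns a downloaded PDF into (image_bytes, source_label) pairs, one per page, so every rendered page can be scanned before the vision model sees it. The helper names render_pdf_pages and pdf_to_scan_inputs, the 150 dpi default, and the `#page=N` labelling are illustrative choices, not part of any framework.

```python
from typing import Callable, Iterable, Iterator


def render_pdf_pages(pdf_bytes: bytes, dpi: int = 150) -> Iterator[bytes]:
    """Render each page of a PDF to PNG bytes using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the pairing helper works without it

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            yield page.get_pixmap(dpi=dpi).tobytes("png")


def pdf_to_scan_inputs(
    pdf_bytes: bytes,
    pdf_url: str,
    render: Callable[[bytes], Iterable[bytes]] = render_pdf_pages,
) -> list[tuple[bytes, str]]:
    """Pair every rendered page with a label that identifies it in scan logs."""
    return [
        (png, f"{pdf_url}#page={number}")
        for number, png in enumerate(render(pdf_bytes), start=1)
    ]


# Usage with a stand-in renderer, so the pairing logic runs without PyMuPDF.
pairs = pdf_to_scan_inputs(
    b"%PDF-stub",
    "https://example.org/paper.pdf",
    render=lambda _pdf: [b"page-one", b"page-two"],
)
```

Labelling each page with its number keeps a redaction traceable to a specific page of a specific paper, which helps with the audit problem noted above.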

3. Web page screenshots via browser automation. Research agents that use Playwright or Selenium to screenshot visited pages do so because the vision model can then read page structure, tables, and visual content that does not survive HTML parsing; the whole page is rendered as a single image that the model interprets holistically. This creates the broadest attack surface of the four: anything the browser paints ends up in the screenshot, including content a human visitor is unlikely to notice. An attacker can serve a page with an adversarially styled element containing an injection instruction, such as near-white text on a white background, text at very low (but non-zero) opacity, or a block placed far below the fold that a full-page screenshot still captures. (Strictly invisible content, such as a zero-opacity div or text in exactly the background colour, paints no distinguishable pixels; it is a risk for the text-extraction path rather than the screenshot path.) The legitimate content of the page is preserved, so the page does not look malicious to a human visitor, but the screenshot passed to the vision model contains the instruction. A Content Security Policy (CSP) does not help here: it governs which scripts and resources the page may load, not what the renderer paints into the screenshot.

4. Wikipedia and knowledge-base image infoboxes. Wikipedia is among the most frequently consulted sources for factual research queries. Many Wikipedia articles have an infobox with a thumbnail image (a portrait, a map, a diagram, a chart) sourced from Wikimedia Commons. Wikimedia Commons is community-editable: registered users can upload new files and new versions of existing files. An adversarial image uploaded to Commons and transcluded into a Wikipedia article enters the context of every research agent that visits that article and passes its images to a vision model. Unlike a single attacker-controlled web page, a compromised Wikimedia Commons image propagates simultaneously to every agent querying every article that uses it: potentially thousands of concurrent research sessions across GPT-Researcher deployments, AutoGPT instances, and custom research loops. The blast radius of a single Commons upload is not bounded by the attacker's own infrastructure; it scales with the popularity of every article that displays the image.
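Because the same Commons image is served to many sessions, and because a file can be replaced under an unchanged URL, one reasonable design is to cache scan verdicts by content hash rather than by URL. The sketch below is a hypothetical in-memory cache (VerdictCache is not part of any library); `scan` stands in for whatever scanner call the agent uses and is assumed to return True when the image is safe.

```python
import hashlib
from typing import Callable


class VerdictCache:
    """Caches scan verdicts keyed by the SHA-256 digest of the image bytes."""

    def __init__(self, scan: Callable[[bytes], bool]) -> None:
        self._scan = scan
        self._verdicts: dict[str, bool] = {}
        self.scans_performed = 0

    def is_safe(self, image_bytes: bytes) -> bool:
        digest = hashlib.sha256(image_bytes).hexdigest()
        if digest in self._verdicts:
            return self._verdicts[digest]
        try:
            verdict = self._scan(image_bytes)
        except Exception:
            # Fail closed, and do not cache: a scanner outage says nothing
            # about the image itself, so the next fetch should retry.
            return False
        self.scans_performed += 1
        self._verdicts[digest] = verdict
        return verdict


# Stand-in scanner for illustration only.
def stub_scan(image_bytes: bytes) -> bool:
    return not image_bytes.startswith(b"evil")


cache = VerdictCache(stub_scan)
first = cache.is_safe(b"portrait-v1")         # scanned
second = cache.is_safe(b"portrait-v1")        # served from cache
replaced = cache.is_safe(b"evil-portrait-v2")  # new bytes, so scanned again
```

Keying on the bytes means a re-uploaded file is always rescanned, even though its URL has not changed. A production cache would probably also expire entries so that scanner improvements apply to previously cleared images.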

Integration: scan gate for web-browsing research agents

import asyncio, base64, logging
from dataclasses import dataclass
from typing import Optional
import httpx

GLYPHWARD_KEY = "<your-glyphward-api-key>"
GLYPHWARD_URL = "https://glyphward.com/v1/scan"
SCAN_THRESHOLD = 70        # scores ≥ 70 are treated as adversarial
REQUEST_TIMEOUT = 8.0      # seconds — well inside a research loop's page-fetch budget

logger = logging.getLogger("research_agent.scan_gate")


@dataclass
class ScanResult:
    safe: bool
    score: int
    scan_id: str
    source_url: str


async def scan_image_async(
    client: httpx.AsyncClient,
    image_bytes: bytes,
    source_url: str,
) -> ScanResult:
    """Scan a single image. Returns ScanResult(safe=False) if scanner is unreachable (fail-closed)."""
    encoded = base64.b64encode(image_bytes).decode()
    try:
        resp = await client.post(
            GLYPHWARD_URL,
            json={"image": encoded, "source": source_url},
            headers={"Authorization": f"Bearer {GLYPHWARD_KEY}"},
            timeout=REQUEST_TIMEOUT,
        )
        resp.raise_for_status()
        data = resp.json()
        score = data["score"]
        scan_id = data["scan_id"]
        safe = score < SCAN_THRESHOLD
        if not safe:
            logger.warning(
                "Adversarial image detected",
                extra={"scan_id": scan_id, "score": score, "source_url": source_url},
            )
        return ScanResult(safe=safe, score=score, scan_id=scan_id, source_url=source_url)
    except Exception as exc:
        # Fail-closed: if the scanner cannot be reached, treat the image as unsafe.
        logger.error("Glyphward scan failed for %s: %s — image redacted", source_url, exc)
        return ScanResult(safe=False, score=100, scan_id="error", source_url=source_url)


async def scan_page_images(
    images: list[tuple[bytes, str]],   # list of (image_bytes, source_url) pairs
) -> list[Optional[bytes]]:
    """
    Batch-scan all images from a single web page in parallel.
    Returns a list aligned with the input: safe image bytes, or None if redacted.

    Using asyncio.gather() keeps total scan latency near max(individual_scan_times)
    rather than sum — typically <200 ms per image, so a page with 10 images
    adds ~200 ms total, not 2 000 ms.
    """
    async with httpx.AsyncClient() as client:
        tasks = [
            scan_image_async(client, img_bytes, url)
            for img_bytes, url in images
        ]
        results: list[ScanResult] = await asyncio.gather(*tasks)

    return [
        img_bytes if result.safe else None
        for (img_bytes, _), result in zip(images, results)
    ]


# ── Research agent loop integration ─────────────────────────────────────────

async def process_page_for_research_agent(
    page_text: str,
    page_images: list[tuple[bytes, str]],  # (image_bytes, image_url_for_logging)
    agent_context: list,
) -> list:
    """
    Called once per page the research agent visits.
    Scans all images in parallel, redacts adversarial ones, then appends
    safe content to the agent context for the next LLM planning call.
    """
    # Scan all images from this page concurrently — fail-closed on error
    scanned_images = await scan_page_images(page_images)

    safe_images = [
        img for img in scanned_images if img is not None
    ]
    redacted_count = len(page_images) - len(safe_images)
    if redacted_count:
        logger.info(
            "%d image(s) redacted from page before entering agent context",
            redacted_count,
        )

    # Build the context entry: text always included; only safe images appended
    context_entry = {"text": page_text, "images": safe_images}
    agent_context.append(context_entry)

    return agent_context

The scan_page_images() function uses asyncio.gather() to fire all Glyphward scan requests for a given page simultaneously. For a typical research-agent page fetch (3–12 images per page), the additional latency is bounded by the slowest single scan — under 200 ms — rather than multiplied by image count. The fail-closed branch in scan_image_async() ensures that a transient network error or scanner outage never silently lets an unscanned image into the agent context: it returns safe=False and logs the event, causing the image to be redacted. The agent continues its research loop without the redacted image, degrading gracefully rather than halting. Apply process_page_for_research_agent() at every page-fetch step — web search result pages, academic PDF pages rendered as images, Playwright screenshots, and Wikipedia article views — so that no external image reaches the vision model's context without a scan.
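The fail-closed behaviour described above can be exercised without network access. The following sketch mirrors the pattern with a stand-in scanner and adds an asyncio.Semaphore to cap in-flight requests, which may matter if the scanner enforces rate limits (an assumption; the limits are not specified here). gate_images, stub_scanner, and max_concurrency are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Scanner = Callable[[bytes], Awaitable[int]]  # returns a 0-100 risk score


async def gate_images(
    images: list[bytes],
    scanner: Scanner,
    threshold: int = 70,
    max_concurrency: int = 8,
) -> list[Optional[bytes]]:
    """Scan images concurrently; return the bytes if safe, None if redacted."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def check(image: bytes) -> Optional[bytes]:
        async with semaphore:
            try:
                score = await scanner(image)
            except Exception:
                return None  # fail closed: an unscanned image is redacted
        return image if score < threshold else None

    # gather() preserves input order, so results align with `images`.
    return await asyncio.gather(*(check(image) for image in images))


async def stub_scanner(image: bytes) -> int:
    """Stand-in for a real scan call; simulates one outage and one detection."""
    if image == b"outage":
        raise ConnectionError("scanner unreachable")
    return 95 if image.startswith(b"evil") else 5


gated = asyncio.run(gate_images([b"chart", b"evil-figure", b"outage"], stub_scanner))
```

Here the benign image passes through, the flagged image is redacted, and the image whose scan failed is also redacted rather than admitted unscanned.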

Coverage matrix

| Defence layer | Web search result images | PDF page rendering (arXiv figures) | Browser screenshot (Playwright) | Wikipedia infobox images |
|---|---|---|---|---|
| Text output monitor (LLM Guard, Lakera) | No: inspects LLM text output; image bytes never reach the monitor | No | No | No |
| URL allowlist / domain filter | No: adversarial images served from allowlisted domains (Google, arXiv, Wikipedia) | No: arXiv is typically allowlisted | No: Wikipedia is typically allowlisted | No: wikimedia.org is allowlisted |
| GPT-Researcher source quality filter | No: relevance and credibility scoring operates on text snippets, not image content | No | No | No |
| Browser Content Security Policy (CSP) | No: governs which resources load; does not inspect rendered image content | No: CSP not applicable to downloaded PDF rendering | No: screenshot is taken after CSP-compliant page loads | No |
| Glyphward scan gate (async, per-image) | Yes: scan before image enters agent context | Yes: scan each rendered PDF page image | Yes: scan screenshot before LLM planning call | Yes: scan infobox image on Wikipedia fetch |

Related questions

What is the blast radius of a successful injection in a research agent compared with a chatbot?

In a chatbot, a successful prompt injection corrupts one response in one turn. The user sees the output immediately and can correct or discard it. In an autonomous research agent, a successful injection at any point in a session that runs for minutes or hours corrupts everything that follows: subsequent tool calls, retrieval decisions, file writes, API calls, and the final synthesised report. The agent may have queried dozens of sources, accumulated thousands of tokens of intermediate findings, and written partial results to disk — all of which are now downstream of the injection point. If the injection caused the agent to exfiltrate findings via a GET request to an attacker URL, the data left the system before any human reviewed the output. If it caused the agent to include fabricated citations in a report, the report may be used downstream (in a blog post, a legal memo, a business decision) before the provenance is checked. The asymmetry between chatbot injection and research-agent injection is hours of autonomous work versus a single response — and the damage is proportional to the tools and write-access the agent has.

Which research agent frameworks are most at risk?

Any framework that (a) passes external images to a multimodal model and (b) uses the model's output to drive subsequent tool calls is at risk. The highest-risk frameworks today are: GPT-Researcher, which autonomously browses the web and processes images from retrieved pages as part of its research loop; AutoGPT with web browsing enabled, which uses a browser tool and can pass page screenshots to GPT-4o or a similar vision-capable model; and any custom LangGraph or LangChain agent that combines a browser tool or PDF loader with a multimodal LLM (GPT-4o, Claude 3, Gemini 1.5 Pro) for content understanding. Agents that use text-only search APIs (SerpAPI returning only text snippets, Tavily text search) and never pass images to the model are not exposed to the multimodal injection surface, but this exception narrows as multimodal models become the default. OpenDevin and similar code-execution agents are also at risk if they browse documentation sites or load GitHub preview images as part of their research phase.

Can you scan images in real time during a long research loop without slowing it down?

Yes. The async pattern in the integration section above uses asyncio.gather() to scan all images from a single page fetch in parallel. Glyphward's scan endpoint returns in under 200 ms per image under normal load. For a page with 8 images, the total additional latency is therefore approximately 200 ms (the time of the slowest single scan) rather than 8 × 200 ms = 1 600 ms. A typical research agent already spends 2–5 seconds per page fetch (network round-trip, HTML parsing, text extraction), so 200 ms of parallel image scanning adds roughly 4–10% to the per-page cost. For research loops that process many pages concurrently (multiple async workers fetching different sources simultaneously), each worker calls scan_page_images() independently, so the scans for different pages also run in parallel across workers. The scan gate adds security without materially extending the research session.
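The max-versus-sum arithmetic can be checked locally with simulated scans. This is an illustrative timing sketch only: fake_scan sleeps instead of calling any scanner, and the delay is scaled down from 200 ms to 50 ms to keep the run short.

```python
import asyncio
import time

SCAN_DELAY_S = 0.05  # stand-in for one scan round trip (scaled down from 200 ms)
IMAGE_COUNT = 8


async def fake_scan() -> None:
    """Simulates the wait for a single scan response."""
    await asyncio.sleep(SCAN_DELAY_S)


async def run_sequential() -> float:
    """Scan one image after another; total time is the sum of all scans."""
    start = time.perf_counter()
    for _ in range(IMAGE_COUNT):
        await fake_scan()
    return time.perf_counter() - start


async def run_parallel() -> float:
    """Scan all images at once; total time is close to the slowest scan."""
    start = time.perf_counter()
    await asyncio.gather(*(fake_scan() for _ in range(IMAGE_COUNT)))
    return time.perf_counter() - start


sequential_s = asyncio.run(run_sequential())
parallel_s = asyncio.run(run_parallel())
print(f"sequential: {sequential_s * 1000:.0f} ms, parallel: {parallel_s * 1000:.0f} ms")
```

With these settings the sequential run takes roughly IMAGE_COUNT × SCAN_DELAY_S, while the parallel run takes roughly one SCAN_DELAY_S plus scheduling overhead. Real scans vary in duration, so the parallel figure tracks the slowest response.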

Does this apply to AI research agents that only use text search?

No — if the agent exclusively calls text-only search APIs (e.g. SerpAPI returning JSON snippets, Brave Search text results, Tavily text search) and never downloads or renders images, the multimodal injection surface does not exist for that agent. The risk is specific to agents that pass image bytes to a vision-capable model (GPT-4o, Claude 3 Opus/Sonnet, Gemini 1.5 Pro, Mistral Pixtral). However, most research agent deployments in 2025–2026 use multimodal models by default because they handle tables, charts, and structured page content better than text-only models. If your agent calls a multimodal model — even only occasionally, for image-heavy pages — every image that reaches that model call is an attack surface. It is safer to assume your agent is multimodal and add the scan gate than to audit each code path for image passthrough.

How does this relate to indirect prompt injection in web content — isn't that already a known problem?

Indirect prompt injection via text in web pages is a well-documented attack: the adversary hides instructions in HTML comments, hidden divs, or white-on-white text, and the agent's HTML scraper extracts those instructions as plain text that the LLM then follows. This is the attack described by Greshake et al. (2023) and catalogued in OWASP LLM01. The multimodal image variant is distinct and more dangerous in two ways. First, the injection payload is entirely inside the image — it produces zero tokens in the extracted HTML text, so any text-based indirect-PI scanner (including Lakera Guard, Prompt Shield, and LLM Guard) sees nothing. Second, vision models interpret image content holistically: a well-crafted adversarial image can embed an instruction in a way that appears to be part of a legitimate chart or diagram, making the injection more persuasive to the model than a naked text instruction the model might be trained to refuse. The two attack surfaces are additive, not substitutes — defending against text indirect-PI does not cover image indirect-PI, and vice versa. Glyphward covers the image surface; use a text-PI scanner in parallel for the text surface.

Further reading