Prompt-injection scanner for LlamaIndex agents

LlamaIndex (now llama-index-core) is one of the most widely used Python frameworks for RAG pipelines and LLM agents. Its SimpleDirectoryReader ingests PDFs, images, and other documents; its MultiModalVectorStoreIndex indexes images alongside text and passes retrieved images to a multimodal LLM at query time; its ReActAgent and FunctionCallingAgent can receive image data as tool outputs. At each of these stages, image bytes reach the vision encoder of GPT-4o, Claude, or Gemini without a prompt-injection (PI) scan. A FigStep-class adversarial instruction embedded in a PDF image or a standalone image file is invisible to LlamaIndex's text extraction and to text-only PI scanners: it enters the vector store, persists in the index, and delivers its payload on every retrieval that surfaces it. Scan at ingestion time and at response-generation time.

TL;DR

Two intercept points for LlamaIndex multimodal pipelines: (1) pre-ingestion: scan image files and PDF-embedded images before SimpleDirectoryReader.load_data() reads them and they are written to the index; (2) pre-generation: scan retrieved image nodes before the response LLM receives them in a MultiModalVectorStoreIndex query. Each image takes one POST to /v1/scan, under 200 ms. Free tier: 10 scans/day, no card.

The two PI surfaces in LlamaIndex multimodal pipelines

LlamaIndex pipelines typically have two places where image bytes interact with a vision LLM:

Ingestion time (document loading + indexing). SimpleDirectoryReader loads files from a directory. For PDFs, it uses PDFReader (pypdf under the hood) which extracts text from each page. Embedded images on PDF pages are handled only if you configure a multimodal document reader — and even then, the extracted image bytes are passed to a vision model for captioning or indexing without a PI scan on the raw bytes first. A PDF with a FigStep payload on an image-heavy page enters the index with its payload intact.
Query / generation time (retrieval + response LLM). A MultiModalVectorStoreIndex stores both text chunks and image embeddings (via CLIP or similar). When a user query retrieves image nodes, those images are included in the prompt to the response LLM (GPT-4o, Claude, Gemini) as image content blocks. The image bytes travel from the vector store to the vision encoder without a PI scan in between.

The first surface produces a persistent threat: one poisoned document infects the index and fires on every subsequent retrieval that surfaces it (the OWASP LLM03 RAG corpus poisoning pattern). The second surface is a per-query threat: any image in the retrieval corpus can deliver its payload whenever a query retrieves it. A complete defense scans at both points.

Pre-ingestion scan: the document loading intercept

The ingestion-time scan belongs before SimpleDirectoryReader.load_data() reads the files, so that nothing flagged reaches the index. The following helper scans every image embedded in a PDF, or a standalone image file, and raises if any image exceeds the score threshold:

import base64
import os
from pathlib import Path

import httpx

try:
    import fitz  # PyMuPDF, used to extract embedded images from PDFs
except ImportError:
    fitz = None

GLYPHWARD_API_KEY = os.environ["GLYPHWARD_API_KEY"]

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp")

def _scan_bytes(img_bytes: bytes, label: str, threshold: int = 70) -> None:
    """POST one image to the scanner; raise ValueError if its score exceeds threshold."""
    resp = httpx.post(
        "https://api.glyphward.com/v1/scan",
        json={
            "data": base64.b64encode(img_bytes).decode(),
            "modality": "image",
            "source_trust": "low",
        },
        headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
        timeout=5,
    )
    resp.raise_for_status()  # surface HTTP errors instead of a KeyError on "score"
    result = resp.json()
    if result["score"] > threshold:
        raise ValueError(
            f"{label}: multimodal PI score {result['score']} "
            f"(region: {result.get('region')})"
        )

def scan_document_for_pi(file_path: str | Path) -> None:
    """Scan all embedded images in a document before indexing.
    Supports PDF (embedded images), PNG, JPEG, WebP, GIF, BMP.
    Raises ValueError if any image exceeds the PI score threshold.
    """
    path = Path(file_path)
    suffix = path.suffix.lower()

    if suffix == ".pdf":
        if fitz is None:
            # Fail closed: without PyMuPDF the PDF's images cannot be scanned.
            raise RuntimeError(f"{path.name}: PyMuPDF is required to scan PDF images")
        with fitz.open(str(path)) as doc:  # closes the file even if a scan raises
            for page_num, page in enumerate(doc):
                for img_index, img_ref in enumerate(page.get_images(full=True)):
                    xref = img_ref[0]
                    base_image = doc.extract_image(xref)
                    img_bytes = base_image["image"]
                    _scan_bytes(img_bytes, f"{path.name} page {page_num + 1} image {img_index + 1}")

    elif suffix in IMAGE_SUFFIXES:
        _scan_bytes(path.read_bytes(), path.name)

# Usage:
# for doc_path in Path("./data").iterdir():
#     scan_document_for_pi(doc_path)  # raises on PI detection
# documents = SimpleDirectoryReader("./data").load_data()
# index = MultiModalVectorStoreIndex.from_documents(documents)

Run scan_document_for_pi() on every file before passing the directory to SimpleDirectoryReader. Quarantine flagged files in a separate directory pending review; do not load them into the index. Log the scan ID and file hash as your LLM03 dataset-provenance record and your ISO 27001 A.8.28 input-validation evidence.
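
The quarantine-and-log step described above could be sketched roughly as follows. This is a minimal illustration, not part of any library: quarantine_flagged, its parameters, and the JSON-lines log format are hypothetical names chosen for the example. The scanner is passed in as a callable with the same contract as scan_document_for_pi() (it raises ValueError on a flagged file). The scan ID is not recorded here because the field name in the scan response is not given on this page.

```python
import hashlib
import json
import shutil
from pathlib import Path
from typing import Callable


def quarantine_flagged(
    data_dir: Path,
    quarantine_dir: Path,
    scan: Callable[[Path], None],
    log_path: Path,
) -> list[Path]:
    """Scan every file in data_dir; move flagged files to quarantine_dir.

    `scan` is assumed to raise ValueError when a file is flagged.
    One JSON line per file (name, SHA-256, verdict) is appended to log_path,
    which should live outside data_dir so it is not scanned itself.
    Returns the paths that passed the scan.
    """
    quarantine_dir.mkdir(parents=True, exist_ok=True)
    clean: list[Path] = []
    with log_path.open("a", encoding="utf-8") as log:
        for path in sorted(p for p in data_dir.iterdir() if p.is_file()):
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            record = {"file": path.name, "sha256": digest, "verdict": "clean"}
            try:
                scan(path)
                clean.append(path)
            except ValueError as exc:
                record["verdict"] = "quarantined"
                record["reason"] = str(exc)
                # Move the file out of the directory that will be indexed.
                shutil.move(str(path), str(quarantine_dir / path.name))
            log.write(json.dumps(record) + "\n")
    return clean
```

One way to use the result is to pass only the clean paths to the reader, for example SimpleDirectoryReader(input_files=[str(p) for p in clean]), so the index is built from an explicit allow-list rather than from whatever remains in the directory.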

Pre-generation scan: the retrieval intercept

At query time, a MultiModalVectorStoreIndex retriever returns NodeWithScore objects — some containing text chunks, some containing image paths or image bytes. Before those image nodes are assembled into the prompt for the response LLM, scan each image node:

from llama_index.core.indices import MultiModalVectorStoreIndex
from llama_index.core.schema import ImageNode, QueryBundle

# Reuses base64, Path, and _scan_bytes from the ingestion snippet above.

def safe_multimodal_query(index: MultiModalVectorStoreIndex, query_str: str) -> str:
    """Query a MultiModalVectorStoreIndex with PI scanning of retrieved image nodes."""
    retriever = index.as_retriever(similarity_top_k=5, image_similarity_top_k=3)
    retrieved_nodes = retriever.retrieve(query_str)

    # Scan all image nodes; keep only nodes that are safe to show the response LLM
    safe_nodes = []
    for node_with_score in retrieved_nodes:
        node = node_with_score.node
        if isinstance(node, ImageNode):
            # ImageNode stores image bytes (often base64) or a local file path
            if node.image is not None:
                img_bytes = (
                    base64.b64decode(node.image)
                    if isinstance(node.image, str)
                    else node.image
                )
            elif node.image_path:
                img_bytes = Path(node.image_path).read_bytes()
            else:
                continue  # no local bytes to scan (e.g. URL-only node): exclude it
            try:
                _scan_bytes(img_bytes, f"retrieved image node {node.node_id[:8]}", threshold=60)
            except ValueError:
                continue  # flagged: exclude this image node from the prompt
        safe_nodes.append(node_with_score)

    # Synthesize from the scanned nodes only. Calling query_engine.query() here
    # would run retrieval again and bypass the scan.
    query_engine = index.as_query_engine()
    return str(query_engine.synthesize(QueryBundle(query_str), safe_nodes))

The threshold at generation time (60) is tighter than at ingestion time (70) because retrieved images go directly into the response LLM's context in the current turn, so the blast radius is higher. If a retrieved image is flagged (its score exceeds the threshold), that image node is excluded from the response LLM's context and the query proceeds with the remaining nodes.

LlamaIndex agents: image tool results

LlamaIndex's ReActAgent and FunctionCallingAgent can call tools that return multimodal data. A web-scraping tool that returns a screenshot, a chart-generation tool that returns a PNG, or a file-reading tool that returns an image — all of these deliver image bytes into the agent's context as tool results. These tool results carry implicit trust: the agent's reasoning loop treats them as authoritative feedback from the environment.

Scan tool results that contain images before appending them to the agent's message history. The intercept point is in your tool implementation, before returning the result to the agent framework:

from llama_index.core.tools import FunctionTool

def screenshot_tool(url: str) -> dict:
    """Fetch and return a screenshot of a URL — with PI scanning."""
    screenshot_bytes = _take_screenshot(url)  # your screenshot implementation
    _scan_bytes(screenshot_bytes, f"screenshot of {url}", threshold=50)
    return {
        "type": "image",
        "data": base64.b64encode(screenshot_bytes).decode(),
        "mime_type": "image/png",
    }

screenshot_fn_tool = FunctionTool.from_defaults(fn=screenshot_tool)
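
If several tools return images, the same intercept could be factored into a decorator instead of being repeated in each tool body. The sketch below is one possible shape, not a LlamaIndex feature: scan_image_output is a hypothetical name, and it assumes tools return the same dict layout as screenshot_tool above ("type", "data" as base64, "mime_type"). The scanner is injected as a callable with the same positional signature as _scan_bytes(img_bytes, label, threshold).

```python
import base64
import functools
from typing import Any, Callable

# Same call shape as _scan_bytes: raises ValueError when the image is flagged.
Scanner = Callable[[bytes, str, int], None]


def scan_image_output(scanner: Scanner, threshold: int = 50):
    """Decorator: scan image results of a tool before they are returned."""

    def decorator(tool_fn: Callable[..., Any]) -> Callable[..., Any]:
        @functools.wraps(tool_fn)  # keeps the tool's name, docstring, signature
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            result = tool_fn(*args, **kwargs)
            # Only dict results tagged as images are scanned; other
            # results pass through unchanged.
            if isinstance(result, dict) and result.get("type") == "image":
                img_bytes = base64.b64decode(result["data"])
                scanner(img_bytes, f"output of {tool_fn.__name__}", threshold)
            return result

        return wrapper

    return decorator
```

A tool would then be declared with @scan_image_output(_scan_bytes, threshold=50) above its def and wrapped with FunctionTool.from_defaults as before. Because the scanner raises before the wrapper returns, a flagged image never reaches the agent's message history.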

The threshold for agentic tool results (50) is tighter than for retrieval (60) and ingestion (70) because agents take actions based on tool results — the tightest threshold applies where the blast radius is greatest.
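
The three thresholds above can be kept in one place so they do not drift apart across call sites. The following is a small sketch under that assumption; Surface, THRESHOLDS, and is_blocked are illustrative names, and the comparison mirrors the strict `score > threshold` check used in _scan_bytes.

```python
from enum import Enum


class Surface(Enum):
    INGESTION = "ingestion"
    RETRIEVAL = "retrieval"
    TOOL_RESULT = "tool_result"


# Values taken from the text: a lower threshold blocks more images.
THRESHOLDS = {
    Surface.INGESTION: 70,
    Surface.RETRIEVAL: 60,
    Surface.TOOL_RESULT: 50,
}


def is_blocked(score: float, surface: Surface) -> bool:
    """Return True if an image with this score should be rejected on this surface."""
    return score > THRESHOLDS[surface]
```

With this policy, an image scoring 65 would be accepted at ingestion but rejected at retrieval and as a tool result, which is the intended ordering: the closer the image is to driving an action, the less suspicion is tolerated.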
