Unstructured.io prompt-injection detection

Unstructured.io is a widely used document-parsing library for RAG pipelines. It turns PDFs, Word documents, HTML, Excel files, and dozens of other formats into clean element streams that feed vector databases and LLM context windows. When the source documents come from external parties (customers, vendors, the public web), the image elements that Unstructured.io extracts are untrusted inputs. A PDF with an adversarial image page, a Word document with an embedded figure containing a low-contrast text overlay, or a scanned form whose scan carries a typographic prompt-injection (PI) payload: all of these pass through Unstructured.io's partition functions as Image elements and reach the vision LLM without any content inspection. Glyphward plugs into the element pipeline between partition_pdf() and the LLM ingestion call, scanning each Image and Figure element before it enters the model's context.

TL;DR

After calling partition_pdf() or any other Unstructured partition function, filter the element list for Image and Figure elements, POST each element's metadata.image_base64 to Glyphward's /v1/scan endpoint, and drop any element that scores 70 or higher before passing the list to your LLM. It is a one-time integration: one short Python function that fits into any existing Unstructured pipeline without changing the partition or LLM call. Free tier: 10 scans/day, no card.

How image elements enter an Unstructured RAG pipeline

PDF image pages. When partition_pdf() is called with strategy="hi_res", Unstructured renders each page as a raster image and uses a document layout model (detectron2 or similar) to classify page regions. Pages that are primarily images (scanned documents, invoice PDFs, engineering drawings) produce Image elements. With extract_image_block_to_payload=True, these elements carry metadata.image_base64 (the extracted image, base64-encoded, with its format recorded in metadata.image_mime_type) and are passed directly to vision LLMs for multimodal RAG. The source document is often external and untrusted.

Embedded figures and charts. PDFs and Word documents with embedded images (product photos, architectural diagrams, charts, scanned attachments) produce Figure and Image elements for each embedded image. These are extracted from the document binary at parse time. A document from a counterparty, a regulatory filing, or a customer submission can contain adversarial payloads in its embedded figures. The Unstructured pipeline routes these to vision LLMs without checking the image content.

HTML pages with inline images. partition_html() extracts inline image elements from web pages. A RAG pipeline that indexes web content (news articles, competitor pages, product documentation) may ingest images from those pages as part of the element stream. An adversarial image planted in a web page is indirect prompt injection via image, applied to web-content RAG. An attacker who controls a page that the pipeline indexes can plant a payload image that Unstructured will extract and pass to the vision LLM during the next indexing run.

Table screenshots and diagram pages in enterprise documents. Enterprise knowledge bases frequently contain documents where tables and diagrams were saved as images (screenshot of a spreadsheet, scanned engineering drawing, photo of a whiteboard). These elements are common in the internal enterprise documents that Unstructured-based RAG pipelines are deployed to index. While internal documents are generally lower-risk, external-originated files that enter the enterprise knowledge base — vendor documentation, regulatory guidance, customer-provided materials — are untrusted inputs and should be scanned.
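
Before wiring in a scanner, it can help to check how many image-bearing elements a partition run actually produced and whether their base64 payloads are populated. The following is a minimal sketch, not part of Unstructured or Glyphward: summarize_image_elements is a hypothetical helper name, and it assumes each element exposes a category string and a metadata object, as the element types described above do.

```python
from collections import Counter
from typing import Any, Iterable

# Assumption: the category labels this article treats as image-bearing.
IMAGE_CATEGORIES = {"Image", "Figure"}


def summarize_image_elements(elements: Iterable[Any]) -> dict:
    """Count image-bearing elements and how many carry an inline base64 payload."""
    by_category: Counter = Counter()
    with_payload = 0
    without_payload = 0
    for element in elements:
        category = getattr(element, "category", None)
        if category not in IMAGE_CATEGORIES:
            continue  # text, tables, titles, etc. are not counted here
        by_category[category] += 1
        metadata = getattr(element, "metadata", None)
        if getattr(metadata, "image_base64", None):
            with_payload += 1
        else:
            without_payload += 1
    return {
        "by_category": dict(by_category),
        "with_payload": with_payload,
        "without_payload": without_payload,
    }
```

A non-zero without_payload count usually means the images were written to disk or not extracted at all, so a scan step that reads metadata.image_base64 would silently skip them.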

Python integration: scan Unstructured image elements before LLM ingestion

This pattern fits into any Unstructured pipeline that uses the hi_res strategy and passes image elements to a vision LLM. Insert scan_image_elements() between the partition call and the LLM ingest:

import os

import httpx
from unstructured.partition.pdf import partition_pdf

GLYPHWARD_API_KEY = os.environ["GLYPHWARD_API_KEY"]
PI_SCAN_THRESHOLD = 70
# Filter on each element's category string rather than importing element
# classes, so the check does not depend on which classes the installed
# Unstructured release exports.
IMAGE_CATEGORIES = {"Image", "Figure"}

def scan_image_elements(
    elements: list,
    source_doc: str = "unknown",
) -> tuple[list, list]:
    """
    Scan Image and Figure elements for PI payloads.
    Returns (clean_elements, blocked_elements).
    """
    clean, blocked = [], []

    for element in elements:
        if getattr(element, "category", None) not in IMAGE_CATEGORIES:
            clean.append(element)
            continue

        image_b64 = getattr(element.metadata, "image_base64", None)
        if not image_b64:
            # No image data — pass through unchanged
            clean.append(element)
            continue

        try:
            resp = httpx.post(
                "https://glyphward.com/v1/scan",
                headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
                json={
                    "image": image_b64,
                    "source": "unstructured_pipeline",
                    "metadata": {
                        "doc": source_doc,
                        "element_id": element.id,
                        "element_type": type(element).__name__,
                    },
                },
                timeout=5.0,
            )
            resp.raise_for_status()
            result = resp.json()

            if result["score"] >= PI_SCAN_THRESHOLD:
                print(
                    f"[glyphward] BLOCKED element={element.id} "
                    f"scan_id={result['scan_id']} score={result['score']} "
                    f"doc={source_doc}"
                )
                blocked.append((element, result))
            else:
                clean.append(element)

        except (httpx.HTTPError, KeyError, ValueError) as e:
            # Fail closed: if the scan is unreachable or the response is
            # malformed (bad JSON, missing "score"), exclude the image element
            print(f"[glyphward] scan error for element={element.id}: {e}; excluding")
            blocked.append((element, {"error": str(e)}))

    return clean, blocked


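# Usage in your pipeline:
raw_elements = partition_pdf(
    filename="external_doc.pdf",
    strategy="hi_res",
    # Recent Unstructured releases select the block types to extract with
    # extract_image_block_types; older releases used extract_images_in_pdf=True.
    extract_image_block_types=["Image"],
    extract_image_block_to_payload=True,  # ensures image_base64 is populated
)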

clean_elements, blocked = scan_image_elements(
    raw_elements,
    source_doc="external_doc.pdf",
)

if blocked:
    print(f"[glyphward] {len(blocked)} image element(s) blocked from LLM ingestion")

# Only clean elements proceed to the LLM ingestion step.
# ingest_to_vector_store is a placeholder for your own ingestion function.
ingest_to_vector_store(clean_elements)

The extract_image_block_to_payload=True flag is required to populate metadata.image_base64. Without it, Unstructured writes image files to disk rather than including the base64 in the element metadata. If you use disk-based extraction, adapt the scan to read the file at the path stored in metadata.image_path instead.
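
For disk-based extraction, one way to adapt the scan is a small helper that prefers the inline payload and falls back to reading the file named in metadata.image_path. This is a minimal sketch under that assumption; image_b64_for_element is a hypothetical helper name, not an Unstructured or Glyphward function.

```python
import base64
from pathlib import Path
from typing import Any, Optional


def image_b64_for_element(element: Any) -> Optional[str]:
    """Return base64 image data from the inline payload or, failing that, from disk."""
    metadata = getattr(element, "metadata", None)

    # Prefer the inline payload when the partition call populated it.
    inline = getattr(metadata, "image_base64", None)
    if inline:
        return inline

    # Otherwise fall back to the extracted file on disk, if one is recorded.
    image_path = getattr(metadata, "image_path", None)
    if not image_path:
        return None
    path = Path(image_path)
    if not path.is_file():
        return None  # missing file: the caller decides whether to fail closed
    return base64.b64encode(path.read_bytes()).decode("ascii")
```

In scan_image_elements(), the getattr(element.metadata, "image_base64", None) lookup could then be swapped for image_b64_for_element(element), leaving the rest of the function unchanged.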

LangChain integration via UnstructuredLoader

If you use Unstructured through LangChain's UnstructuredPDFLoader or UnstructuredFileLoader, add a scan step in the document post-processing chain before the documents are split and embedded:

import os

import httpx
from langchain_community.document_loaders import UnstructuredPDFLoader

def scan_langchain_documents(documents: list, threshold: int = 70) -> list:
    """Drop LangChain Documents whose image payload scores at or above threshold."""
    clean = []
    for doc in documents:
        # In mode="elements", the loader copies each element's category into
        # doc.metadata["category"]
        if doc.metadata.get("category") not in ("Image", "Figure"):
            clean.append(doc)
            continue

        # page_content holds the element's text; the base64 image is carried
        # in the metadata when extract_image_block_to_payload=True
        image_b64 = doc.metadata.get("image_base64")
        if not image_b64:
            clean.append(doc)
            continue

        try:
            resp = httpx.post(
                "https://glyphward.com/v1/scan",
                headers={"Authorization": f"Bearer {os.environ['GLYPHWARD_API_KEY']}"},
                json={"image": image_b64, "source": "langchain_unstructured"},
                timeout=5.0,
            )
            resp.raise_for_status()
            result = resp.json()
            score = result["score"]
        except (httpx.HTTPError, KeyError, ValueError) as e:
            # Fail closed: exclude the document if the scan cannot be completed
            print(f"[glyphward] scan error: {e}; excluding LangChain doc")
            continue

        if score < threshold:
            clean.append(doc)
        else:
            print(f"[glyphward] blocked LangChain doc scan_id={result.get('scan_id')}")

    return clean


loader = UnstructuredPDFLoader(
    "external_report.pdf",
    mode="elements",
    strategy="hi_res",
    # Recent Unstructured releases select the block types to extract with
    # extract_image_block_types; older releases used extract_images_in_pdf=True.
    extract_image_block_types=["Image"],
    extract_image_block_to_payload=True,  # populates metadata["image_base64"]
)
documents = loader.load()
clean_documents = scan_langchain_documents(documents)

# Proceed with text splitting and embedding on clean_documents only

Coverage matrix

Defence layer	PDF image page (scanned document)	Embedded figure in Word/PDF	HTML inline image (web RAG)	Table screenshot element
Unstructured.io built-in	Extracts — no PI check	Extracts — no PI check	Extracts — no PI check	Extracts — no PI check
Text extraction (PyMuPDF, pdfminer)	No — renders text layer only	No — OCR misses overlays	No	No
Text-only scanner (Lakera, LLM Guard)	No — image bytes ignored	No	No	No
Glyphward element scan	Yes — page-render scan	Yes — pixel-level	Yes	Yes

Related questions

Does Glyphward's scan work with Unstructured's serverless API (not self-hosted)?

Yes. The Glyphward scan is a standalone API call that is independent of whether you use Unstructured self-hosted, Unstructured API, or LangChain's Unstructured loaders. The integration point is the element list returned by the partition function — wherever that list is produced, you insert the scan step on the Image and Figure elements before they proceed to the LLM ingestion step.
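
When the element list arrives as JSON rather than as Python element objects, the same filter can be expressed over dictionaries. The sketch below assumes each element is a JSON object with a "type" key and a "metadata" object that may hold "image_base64"; verify that shape against the responses you actually receive. split_api_elements is a hypothetical helper name.

```python
from typing import Any

# Assumption: the element type labels this article treats as image-bearing.
IMAGE_TYPES = {"Image", "Figure"}


def split_api_elements(
    elements: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Split JSON-style element dicts into (needs_scan, pass_through)."""
    needs_scan: list[dict[str, Any]] = []
    pass_through: list[dict[str, Any]] = []
    for element in elements:
        metadata = element.get("metadata") or {}
        if element.get("type") in IMAGE_TYPES and metadata.get("image_base64"):
            needs_scan.append(element)
        else:
            # Text elements, and image elements with no inline payload,
            # pass through unchanged, as in scan_image_elements()
            pass_through.append(element)
    return needs_scan, pass_through
```

Each entry in needs_scan can then be posted to /v1/scan exactly as in the self-hosted example, using element["metadata"]["image_base64"] as the image field.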

What about text elements that might contain encoded PI instructions?

Text-element PI (instructions embedded in the plain text content of a PDF) is the domain of text-only scanners like Lakera Guard and LLM Guard — they are well-suited to that task. Glyphward focuses on the visual channel: Image and Figure elements where the PI payload is in the pixel stream rather than in the extracted text. For full coverage of an Unstructured pipeline, use both: a text-only scanner on NarrativeText, Title, Table, and other text elements, and Glyphward on Image and Figure elements.
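
The two-scanner arrangement can be sketched as a small router that sends each element to one of two checks by category. This is an illustrative sketch only: route_elements, text_is_safe, and image_is_safe are hypothetical names, and the two callables stand in for whatever wrappers you write around your text scanner and the Glyphward image scan.

```python
from typing import Any, Callable, Iterable

# Assumption: the category labels this article treats as image-bearing.
IMAGE_CATEGORIES = {"Image", "Figure"}


def route_elements(
    elements: Iterable[Any],
    text_is_safe: Callable[[Any], bool],
    image_is_safe: Callable[[Any], bool],
) -> tuple[list, list]:
    """Send image elements to one check and all other elements to another.

    Returns (kept, blocked), preserving the original element order in each list.
    """
    kept, blocked = [], []
    for element in elements:
        category = getattr(element, "category", None)
        if category in IMAGE_CATEGORIES:
            safe = image_is_safe(element)
        else:
            safe = text_is_safe(element)
        (kept if safe else blocked).append(element)
    return kept, blocked
```

Keeping the scanners behind plain callables means either one can be replaced or stubbed in tests without touching the routing logic.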

Our RAG pipeline processes thousands of PDFs per day — is batch scanning available?

Yes. For high-volume document pipelines, use the Glyphward batch scan endpoint (/v1/scan/batch), which accepts an array of image inputs and returns an array of results in a single HTTP call. This amortises connection overhead across multiple elements from the same document. For PDFs with many image pages, send all per-page images from one document in a single batch request rather than one scan request per page.
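
Grouping one document's images into batch requests might look like the sketch below. The request schema for /v1/scan/batch is not specified here, so the "images" field name and the default batch size of 16 are assumptions for illustration only; check the API reference for the real field names and size limits. Both function names are hypothetical.

```python
from typing import Iterator, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def build_batch_payloads(
    images_b64: Sequence[str],
    batch_size: int = 16,  # assumption: illustrative limit, not a documented one
    source: str = "unstructured_pipeline",
) -> list[dict]:
    """Build one request body per batch of base64 images from a single document."""
    # ASSUMPTION: the "images" key is a guess at the batch schema.
    return [
        {"images": chunk, "source": source}
        for chunk in chunked(images_b64, batch_size)
    ]
```

Each payload would then be sent with a single POST, and the returned results matched back to elements by position within the batch, assuming the endpoint preserves input order.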

We use Unstructured to ingest internal company documents — do we need to scan those?

Internal documents authored entirely by trusted employees carry a lower risk than externally-sourced documents. For purely internal knowledge bases, scanning is optional. However, "internal" documents that include external attachments (vendor quotes forwarded via email, customer-submitted PDFs merged into internal records, regulatory filings downloaded from government portals) are partially external in origin. The conservative approach is to scan all documents from any source that involves external parties — internal authorship alone is not sufficient provenance in a pipeline that also handles externally-originated attachments.
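
The conservative policy can be written down as a small provenance check. This is a sketch of one possible encoding, not a Glyphward feature: DocumentProvenance, its fields, and the "internal" origin label are all hypothetical, and a real pipeline would populate them from its own document-management metadata.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentProvenance:
    """Hypothetical provenance record for a document entering the pipeline."""
    author_internal: bool
    # Origin label for each attachment or merged file, e.g. "internal",
    # "vendor", "customer", "regulator".
    attachment_origins: list[str] = field(default_factory=list)


def needs_scan(provenance: DocumentProvenance) -> bool:
    """Scan unless the document and every attachment are internal in origin."""
    if not provenance.author_internal:
        return True
    return any(origin != "internal" for origin in provenance.attachment_origins)
```

Under this rule an internally authored memo with a forwarded vendor quote attached is scanned, while a memo with no external attachments is not.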

Can Glyphward scan Unstructured elements that contain audio clips?

If your Unstructured pipeline processes documents with embedded audio (e.g., Word documents with narration clips, or multimedia content), use Glyphward's audio scan endpoint ("audio": "<base64>") on any audio binary data before passing it to a speech-to-text step. See audio prompt-injection detection for the waveform attack class and WhisperInject detection for the specific threat model.