PDF prompt-injection detection — scan embedded images before RAG ingestion
PDFs are the most common document format in enterprise RAG pipelines and multimodal AI applications. A PDF page can contain embedded raster images alongside its text layer. Standard text-extraction libraries — PyMuPDF (fitz), pdfplumber, Unstructured.io, LangChain's PyPDFLoader, LlamaIndex's SimpleDirectoryReader — read the text and skip the image pixels. An attacker embeds a FigStep-class adversarial instruction in a 30×30 PNG inside a PDF, uploads the document, and that instruction reaches your vision model on every retrieval — as trusted context, without any per-request user action. Scan the page-rendered image bytes before the document enters your corpus.
TL;DR
Use PyMuPDF (fitz) to render each PDF page to PNG bytes, then POST those bytes to Glyphward's /v1/scan. Quarantine any document in which a page scores at or above your threshold before it enters your vector store. Each page takes one scan, which completes in under 200 ms and returns a 0–100 score and the flagged pixel region. Free tier: 10 scans/day, no card.
Why PDFs are the primary indirect-PI vector
PDFs are not text files with a wrapper. The PDF specification supports multiple content streams on a single page: text operators, vector graphics, and embedded raster images (XObject resources of subtype Image). A document can have a completely clean text extraction — the kind that would pass any string-based PI scanner — while simultaneously embedding a PNG that a vision-language model reads as an instruction. Three concrete attack patterns:
Pattern 1: embedded image on an otherwise empty page. A one-page PDF where the entire visible content is a raster image. There is no text layer, so text extraction returns nothing suspicious. An OCR fallback such as Tesseract recovers the printed words from the image but misses a 30-pixel FigStep region rendered in an anti-OCR font inside it. The vision model reads the full pixel layer.
Pattern 2: image overlay on a text-bearing page. A multi-page report where most pages have normal text content. One page carries a small, low-contrast raster overlay, perhaps presented as a decorative logo or separator. Text extraction reads the surrounding text and ignores the image XObject. The retrieval returns the page with its surrounding context intact. The vision model reads both the text context (trusted) and the embedded instruction (attacker-controlled).
Pattern 3: scanned document disguised as native PDF. A PDF where all content is a raster image (a scan). AgentTypo-class glyph distortions produce characters that Tesseract reads as benign text while a vision encoder reads the underlying adversarial instruction from the pixel layer. This is the structural ceiling described in Why every text-only scanner misses a 30-pixel PNG: OCR operates on a derived representation, not on the bytes the vision encoder consumes.
All three patterns share the key property that text extraction — the foundation of every text-only PI scanner integration — produces a clean result while the visual attack layer is still present in the document.
Python: scan a PDF with PyMuPDF before ingestion
PyMuPDF (pip install pymupdf) renders PDF pages to pixel arrays without relying on text extraction:
import base64

import fitz  # PyMuPDF
import httpx

GLYPHWARD_API_KEY = "YOUR_GLYPHWARD_API_KEY"
GLYPHWARD_SCAN_URL = "https://glyphward.com/v1/scan"
SCORE_THRESHOLD = 70  # use 60 for ingestion-time (conservative); 70 for inference-time


def scan_page_image(page_png_bytes: bytes, doc_id: str, page_num: int) -> dict:
    """POST one rendered page as a base64-encoded PNG and return the scan result."""
    encoded = base64.b64encode(page_png_bytes).decode()
    resp = httpx.post(
        GLYPHWARD_SCAN_URL,
        headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
        json={"image": encoded, "source": f"pdf:{doc_id}:page:{page_num}"},
        timeout=5.0,
    )
    resp.raise_for_status()
    return resp.json()  # {score, flagged_region, scan_id, modality}


def scan_pdf_for_pi(pdf_bytes: bytes, doc_id: str, dpi: int = 150) -> list[dict]:
    """
    Render each PDF page to PNG at `dpi` resolution and scan for PI.
    Returns the scan results for pages whose score met or exceeded the threshold.
    """
    flagged_pages = []
    mat = fitz.Matrix(dpi / 72, dpi / 72)  # PDF user space is 72 units per inch
    # The context manager closes the document even if a scan request raises.
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page_num in range(len(doc)):  # page_num is 0-based
            page = doc[page_num]
            pix = page.get_pixmap(matrix=mat)
            png_bytes = pix.tobytes("png")
            result = scan_page_image(png_bytes, doc_id, page_num)
            result["page"] = page_num
            if result["score"] >= SCORE_THRESHOLD:
                flagged_pages.append(result)
    return flagged_pages  # empty list = no page reached the threshold


def safe_ingest_pdf(pdf_bytes: bytes, doc_id: str) -> bool:
    """Returns True if safe to ingest; raises on flagged content."""
    flagged = scan_pdf_for_pi(pdf_bytes, doc_id)
    if flagged:
        raise ValueError(
            f"PDF {doc_id} contains {len(flagged)} page(s) with PI score "
            f">= threshold. Pages: {[p['page'] for p in flagged]}. "
            f"First scan_id: {flagged[0]['scan_id']}"
        )
    return True
Render at 150 DPI for a balance of detection sensitivity and file size. For high-stakes corpora (regulated documents, shared knowledge bases with anonymous write access), raise to 200 DPI. For bulk-ingestion pipelines with tight throughput requirements, 100 DPI is an acceptable compromise.
LangChain integration: pre-ingestion gate before PyPDFLoader
If you use LangChain's document loaders, add the scan gate before the loader reads the file:
# Import paths for current LangChain releases; older releases exposed the same
# classes under langchain.document_loaders, langchain.vectorstores, and
# langchain.text_splitter.
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_text_splitters import RecursiveCharacterTextSplitter


def safe_langchain_pdf_ingest(pdf_path: str, doc_id: str, vectorstore: Chroma):
    # Step 1: read raw bytes and scan (scan_pdf_for_pi is defined in the PyMuPDF section)
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()
    flagged = scan_pdf_for_pi(pdf_bytes, doc_id)
    if flagged:
        raise ValueError(f"Quarantine: {pdf_path} has {len(flagged)} flagged page(s).")

    # Step 2: normal LangChain ingest (only reached if scan passed)
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
    chunks = splitter.split_documents(docs)
    vectorstore.add_documents(chunks)
The same pattern works with UnstructuredPDFLoader, PDFMinerLoader, and any other loader that accepts a file path. The scan runs on the raw bytes before any loader call.
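Because the gate only needs the raw bytes and a file path, it can be factored out so that any loader plugs in behind it. The following is a minimal sketch, not part of LangChain or Glyphward: `gated_load` and `QuarantinedDocument` are hypothetical names, `scan` stands for any function with the shape of `scan_pdf_for_pi`, and `load` stands for whatever loader call you use.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


class QuarantinedDocument(Exception):
    """Raised when the scan reports at least one flagged page (hypothetical name)."""


def gated_load(
    path: str,
    scan: Callable[[bytes, str], List[dict]],
    load: Callable[[str], T],
) -> T:
    """Scan the raw file bytes first; call the loader only if nothing was flagged."""
    with open(path, "rb") as f:
        raw = f.read()
    flagged = scan(raw, path)  # e.g. scan_pdf_for_pi(pdf_bytes, doc_id)
    if flagged:
        raise QuarantinedDocument(f"{path}: {len(flagged)} flagged page(s)")
    return load(path)
```

With the functions from this page, a call might look like `gated_load(pdf_path, scan_pdf_for_pi, lambda p: PyPDFLoader(p).load())`. Keeping the scan and the loader as parameters means the quarantine decision lives in one place regardless of which loader a pipeline uses.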
LlamaIndex integration: scan before SimpleDirectoryReader
For LlamaIndex pipelines, add the gate before SimpleDirectoryReader.load_data():
import os

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex


def safe_llamaindex_pdf_ingest(pdf_dir: str, vectorstore_index: VectorStoreIndex):
    for filename in os.listdir(pdf_dir):
        if not filename.lower().endswith(".pdf"):  # also match .PDF
            continue
        path = os.path.join(pdf_dir, filename)
        with open(path, "rb") as f:
            pdf_bytes = f.read()
        flagged = scan_pdf_for_pi(pdf_bytes, doc_id=filename)
        if flagged:
            print(f"QUARANTINE: {filename}: {len(flagged)} flagged page(s). Skipping.")
            continue
        # only clean documents proceed to the index
        reader = SimpleDirectoryReader(input_files=[path])
        for doc in reader.load_data():
            # load_data() returns Documents, not nodes; insert() parses each
            # Document into nodes before adding it to the index
            vectorstore_index.insert(doc)
Compliance evidence: the per-document scan record
The scan record produced by scan_page_image() includes scan_id, score, flagged_region, and modality. Persist this record alongside your document metadata:
import datetime
import json


def write_scan_provenance(doc_id: str, scan_results: list[dict], status: str):
    """Write per-document scan provenance to your audit log store."""
    now_utc = datetime.datetime.now(datetime.timezone.utc)  # utcnow() is deprecated
    record = {
        "doc_id": doc_id,
        "scanned_at": now_utc.isoformat().replace("+00:00", "Z"),
        # Counts the results passed in. Pass every page's result, not only the
        # flagged ones, if this should equal the document's page count.
        "page_count": len(scan_results),
        "status": status,  # "allowed" | "quarantined"
        "pages": scan_results,
    }
    # write to your audit log (database, S3, append-only log, etc.)
    print(json.dumps(record))
This per-document record can serve as supporting evidence for:
- ISO/IEC 27001:2022 Annex A control 8.28 (Secure coding): secure-coding practice includes treating input from external sources as untrusted until validated; the scan record is the validation evidence.
- SOC 2 CC6.6: logical access security measures that protect against threats from sources outside the system boundary; the ingestion-time scan log is the control evidence an auditor will request when sampling documents that contained image content.
- OWASP Top 10 for LLM Applications: training-data and retrieval-corpus provenance. In the 2025 list, LLM03 is Supply Chain and LLM04 is Data and Model Poisoning (Training Data Poisoning was LLM03 in the 2023 list); the scan record maps to the dataset-provenance step in the five-step LLM03-aligned architecture.
- EU AI Act Article 15(5): for high-risk AI systems, the per-document record is "detect and control for" evidence for the adversarial-examples vulnerability class named in Article 15(5) of Regulation (EU) 2024/1689.
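An audit record is most useful when it covers every scanned page, whereas `scan_pdf_for_pi()` as written returns only the flagged ones. The sketch below assumes you keep one result dict per scanned page, each with the `page` and `score` keys used earlier, and derives the document-level status from that full list. `summarize_scan` is a hypothetical helper, not part of any Glyphward client.

```python
from typing import Dict, List


def summarize_scan(all_page_results: List[Dict], threshold: int) -> Dict:
    """Derive a document-level summary from one result dict per scanned page."""
    flagged = [r["page"] for r in all_page_results if r["score"] >= threshold]
    return {
        "page_count": len(all_page_results),
        "flagged_pages": flagged,
        # default=0 keeps an empty document from raising in max()
        "max_score": max((r["score"] for r in all_page_results), default=0),
        "status": "quarantined" if flagged else "allowed",
    }
```

The returned `status` can be passed straight to `write_scan_provenance()`, and storing `max_score` alongside it makes it possible to re-evaluate old records if you later lower the threshold.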
Coverage matrix: text extractors vs Glyphward on PDF attack patterns
| Tool / approach | Pattern 1 (image-only page) | Pattern 2 (overlay image) | Pattern 3 (scanned OCR) | Per-page provenance record |
|---|---|---|---|---|
| PyMuPDF text extraction | Not detected (text layer clean) | Not detected (image XObject ignored) | Not detected (OCR ceiling) | No |
| Unstructured.io | Not detected | Not detected | Not detected (OCR ceiling) | No |
| Azure Form Recognizer / Doc Intelligence | Not detected (PI ≠ content moderation) | Not detected | Not detected (OCR ceiling) | No |
| Text-only PI scanner (Lakera, LLM Guard) | Not detected (receives clean text) | Not detected | Not detected | No |
| Glyphward (page-render scan) | Detected (scans pixel layer) | Detected (full-page render) | Detected (pixel-level, not OCR) | Yes — per-page scan_id + score |
The key difference is that Glyphward scans the rendered page pixels, not the extracted text. PDF text extraction is a lossy transform — it discards the visual layer. A scan that operates on the output of text extraction cannot recover the information that was discarded. The scan must operate on the bytes that reach the vision encoder, which are the full rendered page pixels.
Related questions
Does this apply to PDFs processed by Azure Form Recognizer or Google Document AI?
Yes. Document extraction services (Azure AI Document Intelligence, formerly Form Recognizer; Google Document AI; AWS Textract) convert PDF content to structured text and key-value pairs. They are not prompt-injection scanners: they extract structured data, and any content checks they apply enforce their own usage policies rather than look for instructions aimed at a downstream LLM. If you feed their output to a vision model or include page images in an LLM context, scan those images with Glyphward before they enter the model call.
Should I scan at ingestion time, retrieval time, or both?
Ingestion-time scanning is the most cost-effective option because you scan once per document rather than once per retrieval; a document retrieved thousands of times over its lifetime would otherwise be scanned thousands of times. It does leave a gap: when the detector is later updated to recognize new payloads, documents that passed an earlier scan are not re-evaluated. For high-stakes corpora with anonymous write access, combine ingestion-time scanning (gate the corpus) with periodic re-scans of the existing store. See the five-step architecture in the OWASP LLM03 page for the full pattern.
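Periodic re-scans need a way to pick which stored documents are due. The following is a minimal sketch under stated assumptions: each stored provenance record has the `doc_id` and `scanned_at` fields shown earlier, with `scanned_at` as an ISO 8601 UTC timestamp ending in "Z". `docs_due_for_rescan` and the 30-day interval in the usage note are illustrative choices, not recommendations from Glyphward.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List


def docs_due_for_rescan(records: List[Dict], now: datetime, max_age: timedelta) -> List[str]:
    """Return the doc_ids whose most recent scan is older than max_age."""
    due = []
    for rec in records:
        # fromisoformat() on older Python versions does not accept a trailing "Z"
        scanned_at = datetime.fromisoformat(rec["scanned_at"].replace("Z", "+00:00"))
        if now - scanned_at > max_age:
            due.append(rec["doc_id"])
    return due
```

A scheduled job could call `docs_due_for_rescan(records, datetime.now(timezone.utc), timedelta(days=30))` and feed the result back through the same scan gate used at ingestion.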
What DPI should I render PDFs at?
150 DPI is a good default. FigStep attacks typically embed instructions in regions of at least 30×30 pixels, which remain detectable at 150 DPI. Rendering at 300 DPI preserves finer detail and so catches payloads drawn at smaller physical sizes, but it quadruples the pixel count per page (twice the width and twice the height), which increases PNG size and scan latency. For most enterprise document corpora (reports, contracts, presentations), 150 DPI is sufficient. For corpora where documents may include very small embedded images (e.g., scanned receipts with dense text), use 200–300 DPI.
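The DPI trade-off is plain arithmetic: PDF page sizes are expressed in points (1/72 inch), so the rendered size in pixels is the size in points multiplied by dpi / 72. The sketch below works this out for a US Letter page (612 × 792 points). `rendered_size_px` is a hypothetical helper, and the dimensions a renderer actually produces may differ by a pixel because of rounding.

```python
def rendered_size_px(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Approximate pixel dimensions of a region rendered at `dpi`."""
    return round(width_pt * dpi / 72), round(height_pt * dpi / 72)


def pixel_count(width_pt: float, height_pt: float, dpi: int) -> int:
    """Total pixels in the rendered region."""
    w, h = rendered_size_px(width_pt, height_pt, dpi)
    return w * h


LETTER_PT = (612, 792)  # US Letter, 8.5 x 11 inches

for dpi in (100, 150, 200, 300):
    w, h = rendered_size_px(*LETTER_PT, dpi)
    print(f"{dpi} DPI: {w} x {h} px, {w * h:,} pixels")
```

The same formula applies to a region within the page: an image placed at 14.4 points square (0.2 inch) renders at about 30 × 30 pixels at 150 DPI and about 60 × 60 at 300 DPI.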
What about DOCX, PPTX, and other office formats?
DOCX and PPTX files are ZIP-based containers, so there is no need to render pages: python-docx and python-pptx can extract the embedded images directly. Extract each embedded image as bytes and POST it to Glyphward. The attack pattern is the same: an attacker embeds an adversarial PNG inside the OOXML archive, text extractors miss it, and vision models read it. PDF is the most common enterprise format, but DOCX is a close second in many RAG corpora.
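As an illustration of the extraction step, here is a minimal sketch that uses only the standard-library `zipfile` module rather than python-docx or python-pptx. It assumes the conventional OOXML layout, where Word stores media under `word/media/` and PowerPoint under `ppt/media/`; a file that places media elsewhere would need its relationships resolved instead. `extract_ooxml_images` is a hypothetical helper, and vector formats such as EMF and WMF are deliberately skipped.

```python
import io
import zipfile

MEDIA_PREFIXES = ("word/media/", "ppt/media/")
RASTER_EXTENSIONS = (".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tif", ".tiff")


def extract_ooxml_images(file_bytes: bytes) -> dict[str, bytes]:
    """Return {archive_path: image_bytes} for raster images in a DOCX or PPTX."""
    images = {}
    with zipfile.ZipFile(io.BytesIO(file_bytes)) as archive:
        for name in archive.namelist():
            lower = name.lower()
            if lower.startswith(MEDIA_PREFIXES) and lower.endswith(RASTER_EXTENSIONS):
                images[name] = archive.read(name)
    return images
```

Each value in the returned dict can then be sent to the scan endpoint in the same way as a rendered PDF page, with the archive path recorded as the source.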
Does scanning work for encrypted PDFs?
Glyphward cannot scan pages in password-protected PDFs that have not been decrypted. Your ingestion pipeline typically decrypts documents with an owner or user password before processing — scan after decryption, not before. Never log or persist the decrypted bytes beyond the scan call.
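The decrypt-then-scan ordering can be enforced with a small guard before rendering. The sketch below is written against the two PyMuPDF `Document` members it relies on, `needs_pass` and `authenticate()`, and is duck-typed so that it does not import PyMuPDF itself. `PdfLike` and `ensure_decrypted` are hypothetical names, and how your pipeline obtains the password is left as an assumption.

```python
from typing import Optional, Protocol


class PdfLike(Protocol):
    """The two PyMuPDF Document members this sketch relies on."""

    needs_pass: bool

    def authenticate(self, password: str) -> int: ...


def ensure_decrypted(doc: PdfLike, password: Optional[str]) -> bool:
    """Return True if pages can be rendered, False if the document must be skipped."""
    if not doc.needs_pass:
        return True  # not password-protected, or already authenticated
    if password is None:
        return False
    # PyMuPDF's authenticate() returns 0 when the password is rejected
    return doc.authenticate(password) > 0
```

In a pipeline, you would open the document with `fitz.open(stream=pdf_bytes, filetype="pdf")`, call `ensure_decrypted(doc, password)`, and render and scan pages only when it returns True. Documents that cannot be decrypted are better quarantined than ingested unscanned.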
Further reading
- Prompt-injection scanner for RAG pipelines — full RAG indirect-PI architecture covering PDF, audio, and multi-tenant RAG patterns.
- OWASP LLM03:2025 training data poisoning — the dataset-level attack that targets RAG corpora and fine-tuning sets.
- FigStep detection — the typographic attack class embedded in PDF images.
- Indirect prompt injection via images — the retrieval-path attack model explained.
- Prompt-injection scanner for LangChain agents — LangChain-specific integration pattern.
- Prompt-injection scanner for LlamaIndex agents — LlamaIndex-specific integration pattern with PyMuPDF.
- ISO 27001:2022 AI security controls — A.8.28 evidence requirements for external data sources.
- SOC 2 AI security controls — CC6.6 evidence requirements for inference-boundary input scanning.