Prompt-injection scanner for Hugging Face Transformers
Hugging Face transformers is the most widely used Python library for running vision-language models locally or on managed inference endpoints. When you load LLaVA, InstructBLIP, Idefics, PaliGemma, or any other multimodal checkpoint from the Hub, the library's AutoProcessor encodes user-supplied images into pixel tensors and the model's generate() call passes those tensors to the vision encoder — without any prompt-injection scan in between. A FigStep-class typographic payload hidden in the image bytes reaches the vision encoder unfiltered. The Hugging Face pipeline() helper, the Trainer API, and Inference Endpoints all share this gap. Scan the image bytes before they become tensors.
TL;DR
Before calling model.generate(**inputs) on any vision-language model loaded with Hugging Face Transformers, POST the raw image bytes to Glyphward's /v1/scan endpoint. If the score meets or exceeds your threshold, reject the request before it reaches the model. One POST, under 200 ms, returns a 0–100 score plus the flagged pixel region. Free tier: 10 scans/day, no card.
How Hugging Face Transformers handles multimodal inputs
The typical multimodal inference pattern in Transformers is:
from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
image = Image.open("user_upload.png")
prompt = "USER: <image>\nDescribe this image.\nASSISTANT:"
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200)
The processor step converts the PIL.Image object into a pixel_values tensor. From that point forward the image exists only as floating-point numbers — any upstream scan that operates on text (a string filter, a Lakera Guard call, an Azure Prompt Shields check) never sees the pixel bytes. The bytes are the attack surface. The generate() call materialises the injected instruction by passing those pixels directly to the model's vision encoder.
This is not a Hugging Face oversight — the transformers library is a model-serving layer, not a security layer. Safety work in the Transformers ecosystem (content filters, RLHF, Constitutional AI training) happens at model training time and cannot retroactively block a FigStep payload because the model never saw adversarial pixel-encoded prompts in its safety training distribution. The scan must operate on the image bytes before the processor call, while you still have the raw file.
Python intercept — before AutoProcessor and model.generate()
Add a scan helper that receives the raw bytes before the processor converts them to tensors:
import io
import base64
import httpx
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

GLYPHWARD_API_KEY = "YOUR_GLYPHWARD_API_KEY"  # use env var in production
GLYPHWARD_SCAN_URL = "https://glyphward.com/v1/scan"
SCORE_THRESHOLD = 70  # reject at or above this; lower for agentic pipelines


def scan_image_bytes(image_bytes: bytes, source: str = "user") -> dict:
    """POST raw image bytes to Glyphward. Returns scan result dict."""
    encoded = base64.b64encode(image_bytes).decode()
    resp = httpx.post(
        GLYPHWARD_SCAN_URL,
        headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
        json={"image": encoded, "source": source},
        timeout=5.0,
    )
    resp.raise_for_status()
    return resp.json()  # {score, flagged_region, scan_id, modality}


def safe_llava_generate(image_bytes: bytes, prompt: str, source: str = "user") -> str:
    """Scan image then run LLaVA generate. Raises ValueError on threshold breach."""
    result = scan_image_bytes(image_bytes, source=source)
    if result["score"] >= SCORE_THRESHOLD:
        raise ValueError(
            f"Image PI scan score {result['score']} meets or exceeds threshold {SCORE_THRESHOLD}. "
            f"scan_id={result['scan_id']} flagged_region={result.get('flagged_region')}"
        )
    # only reach here if scan passed
    image = Image.open(io.BytesIO(image_bytes))
    processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
    model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=200)
    return processor.decode(output[0], skip_special_tokens=True)
In production, load the processor and model once at application startup rather than per-request. The scan call is the only addition to the normal inference path — it operates on the original bytes before any tensor conversion, so no information is lost before the scan.
The pipeline() API: same gap, same fix
Hugging Face's high-level pipeline() API is convenient but does not add a scan layer:
from transformers import pipeline
from PIL import Image
vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
result = vqa(Image.open("user_upload.png"), question="What does this say?")
The pipeline call accepts PIL.Image objects directly. Add the scan before passing the image to the pipeline:
import io
from PIL import Image
from transformers import pipeline


def safe_vqa_pipeline(image_bytes: bytes, question: str) -> dict:
    result = scan_image_bytes(image_bytes)  # same helper as above
    if result["score"] >= SCORE_THRESHOLD:
        raise ValueError(f"PI scan blocked image: score={result['score']}")
    image = Image.open(io.BytesIO(image_bytes))
    vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
    return vqa(image, question=question)
The pattern is the same regardless of the pipeline task — image-to-text, visual-question-answering, document-question-answering, and zero-shot-image-classification all accept images that could carry adversarial payloads. Add the scan gate wherever the raw bytes arrive.
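Because the gate is identical for every task, it can be factored into one reusable wrapper. The sketch below assumes a scan callable that returns a dict with a `score` key, as the scan helper in this article does; `gated` and `ImageBlocked` are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Dict


class ImageBlocked(ValueError):
    """Raised when a scan score meets or exceeds the threshold."""


def gated(
    scan: Callable[[bytes], Dict[str, Any]],
    threshold: int,
    run: Callable[..., Any],
) -> Callable[..., Any]:
    """Return a version of `run` that scans the raw image bytes first."""

    def wrapper(image_bytes: bytes, *args: Any, **kwargs: Any) -> Any:
        result = scan(image_bytes)
        if result["score"] >= threshold:
            raise ImageBlocked(f"PI scan blocked image: score={result['score']}")
        # Only reached when the scan passed; `run` decodes the bytes itself.
        return run(image_bytes, *args, **kwargs)

    return wrapper
```

Any function that takes raw image bytes as its first argument (a VQA call, a captioning call, a document-QA call) can be wrapped once at startup and used in place of the ungated version.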
InstructBLIP, Idefics, PaliGemma, and other VLMs
The same intercept pattern applies across all vision-language model architectures available on the Hugging Face Hub. Different models use different processor classes, but the scan point is always the same — before the processor call, while you still have raw image bytes:
# InstructBLIP (Salesforce)
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration


def safe_instructblip_generate(image_bytes: bytes, prompt: str) -> str:
    result = scan_image_bytes(image_bytes)
    if result["score"] >= SCORE_THRESHOLD:
        raise ValueError(f"Blocked: score={result['score']}")
    image = Image.open(io.BytesIO(image_bytes))
    processor = InstructBlipProcessor.from_pretrained(
        "Salesforce/instructblip-vicuna-7b"
    )
    model = InstructBlipForConditionalGeneration.from_pretrained(
        "Salesforce/instructblip-vicuna-7b"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(outputs, skip_special_tokens=True)[0]


# PaliGemma (Google, via transformers >= 4.41)
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor as PGProcessor


def safe_paligemma_generate(image_bytes: bytes, prompt: str) -> str:
    result = scan_image_bytes(image_bytes)
    if result["score"] >= SCORE_THRESHOLD:
        raise ValueError(f"Blocked: score={result['score']}")
    image = Image.open(io.BytesIO(image_bytes))
    processor = PGProcessor.from_pretrained("google/paligemma-3b-pt-224")
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        "google/paligemma-3b-pt-224"
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    return processor.decode(outputs[0], skip_special_tokens=True)
For Idefics2 and Idefics3 (from Hugging Face / HuggingFaceM4), the input is a messages list containing image and text content blocks, analogous to the OpenAI Chat Completions format. The processor's apply_chat_template() renders that list into a prompt string, and the images themselves are passed to the processor separately. Scan every image referenced by the messages list before the processor call:
from transformers import AutoProcessor, AutoModelForVision2Seq


def safe_idefics_generate(messages: list, image_bytes_map: dict) -> str:
    """
    messages: Idefics-format list with {"role": ..., "content": [...]} items.
    image_bytes_map: {image_id: bytes}, raw bytes for each image in the messages list,
    in the same order as the image blocks appear in messages.
    """
    for img_id, img_bytes in image_bytes_map.items():
        r = scan_image_bytes(img_bytes)
        if r["score"] >= SCORE_THRESHOLD:
            raise ValueError(f"Image {img_id} blocked: score={r['score']}")
    processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")
    images = [Image.open(io.BytesIO(b)) for b in image_bytes_map.values()]
    # add_generation_prompt=True appends the assistant turn marker so the model answers
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=images, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=500)
    return processor.decode(output[0], skip_special_tokens=True)
Hugging Face Inference Endpoints and the Serverless API
When you call a multimodal model through Hugging Face Inference Endpoints (your own dedicated endpoint) or the Serverless Inference API, the request travels from your application code to the Hugging Face infrastructure. Add the Glyphward scan in your application before dispatching the request — the scan operates on the raw bytes before they leave your process, regardless of where the model runs:
import httpx, base64, io
from PIL import Image

HF_API_TOKEN = "YOUR_HF_API_TOKEN"
ENDPOINT_URL = "https://api-inference.huggingface.co/models/llava-hf/llava-1.5-7b-hf"


def safe_hf_inference_endpoint(image_bytes: bytes, prompt: str) -> dict:
    # Step 1: scan before the network call
    scan = scan_image_bytes(image_bytes)
    if scan["score"] >= SCORE_THRESHOLD:
        raise ValueError(f"Blocked by PI scan: score={scan['score']}")
    # Step 2: dispatch to Hugging Face endpoint
    encoded = base64.b64encode(image_bytes).decode()
    resp = httpx.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {HF_API_TOKEN}"},
        json={"inputs": {"image": encoded, "prompt": prompt}},
        timeout=60.0,  # generation can outlast httpx's 5 s default timeout
    )
    resp.raise_for_status()
    return resp.json()
Hugging Face's Inference Endpoints do not add a PI scan layer at the endpoint level. The scan must live in your application code.
Indirect PI: Hugging Face Datasets with image columns
A subtler attack surface is the Hugging Face Datasets library. Teams commonly fine-tune multimodal models on datasets that include image columns, and RAG pipelines sometimes use datasets as their document store. An adversary who contributes to a dataset (or poisons a third-party dataset your pipeline downloads) can embed FigStep-class payloads in individual image rows. Those images then appear in fine-tuning batches or retrieval results. This is the data poisoning vector that the OWASP Top 10 for LLM Applications 2025 lists as LLM04:2025 (Data and Model Poisoning), in its Hugging Face form.
Scan images from datasets before they enter training or retrieval loops:
import io
from typing import Optional
from datasets import load_dataset
from PIL import Image


def scan_dataset_images(dataset_name: str, image_col: str = "image",
                        split: str = "train", sample_n: Optional[int] = None):
    """Scan all (or first sample_n) images in a Hugging Face dataset."""
    ds = load_dataset(dataset_name, split=split)
    flagged = []
    for i, row in enumerate(ds):
        if sample_n is not None and i >= sample_n:
            break
        img: Image.Image = row[image_col]
        buf = io.BytesIO()
        img.save(buf, format="PNG")  # re-encode the decoded image to PNG bytes for the scan
        scan = scan_image_bytes(buf.getvalue(), source=f"dataset:{dataset_name}")
        if scan["score"] >= SCORE_THRESHOLD:
            flagged.append({"index": i, "score": scan["score"], "scan_id": scan["scan_id"]})
    return flagged
Run this scan against any third-party dataset before ingestion, and log the results as a dataset-provenance record that can serve as supporting evidence in ISO 27001 A.8.28 and SOC 2 CC6.6 reviews.
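One way to keep that record is an append-only JSON Lines file with one line per dataset scan. The sketch below is an assumption about what such a record might contain, not a format prescribed by either standard or by Glyphward; `provenance_line`, `append_provenance`, and the field names are hypothetical.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List, Optional


def provenance_line(
    dataset_name: str,
    split: str,
    scanned: int,
    flagged: List[Dict[str, Any]],
    threshold: int,
    scanned_at: Optional[str] = None,
) -> str:
    """Serialise one dataset scan as a single JSON line."""
    record = {
        "dataset": dataset_name,
        "split": split,
        # Default to the current UTC time when the caller supplies no timestamp.
        "scanned_at": scanned_at or datetime.now(timezone.utc).isoformat(),
        "threshold": threshold,
        "images_scanned": scanned,
        "images_flagged": len(flagged),
        "flagged": flagged,  # e.g. [{"index": ..., "score": ..., "scan_id": ...}]
    }
    return json.dumps(record, sort_keys=True)


def append_provenance(path: str, line: str) -> None:
    """Append one record to a JSON Lines log file."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")
```

The `flagged` argument matches the list of dicts a dataset scan would collect; the number of images scanned has to be tracked by the caller.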
Coverage matrix
How Glyphward compares to other tools in the Hugging Face ecosystem for multimodal PI detection:
| Tool | Image-PI detection | HF pipeline integration | Dataset pre-scan | Self-serve free tier |
|---|---|---|---|---|
| Lakera Guard | Text inputs only | Text path only | No | No (enterprise) |
| LLM Guard | Text inputs only | Text path only | No | Yes (OSS, text) |
| Azure Prompt Shields | Text inputs only | No (Azure-gated) | No | No (Azure-gated) |
| Promptfoo | Eval-time only | No (test harness) | No | Yes (eval-time) |
| Glyphward | Image + audio bytes | Pre-generate() wrapper | Dataset column scan | Yes — 10 scans/day free |
None of the text-only tools has an intercept point in the Hugging Face Transformers pipeline that reaches pixel bytes; they operate on string-format prompts. Glyphward operates on the raw image bytes, before the AutoProcessor call converts them to tensors.
Related questions
Does Hugging Face have any built-in safety for vision model inputs?
Hugging Face provides content moderation tools through the evaluate library and recommends RLHF / Constitutional AI training for chat-tuned models. These operate at model-weight level and affect output distribution for text-format adversarial inputs. They do not include an inference-time image PI scanner. A FigStep or AgentTypo payload embedded in image pixels is unlikely to be covered by the adversarial training distribution of current Hub models, so safety training does not reliably block it. You must add an explicit scan gate before the generate() call.
Does this apply to models run locally via Ollama or llama.cpp?
Yes, if the locally-served model accepts image inputs. Ollama supports LLaVA and Moondream2 via its /api/generate endpoint with a base64 images field. Add the Glyphward scan before the Ollama API call, not after the model returns. The attack surface is the bytes entering the vision encoder, not the model's server implementation.
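A minimal sketch of that ordering, assuming a local Ollama server on its default address and a scan callable that returns a dict with a `score` key. `build_ollama_payload` and `safe_ollama_generate` are hypothetical helper names; the request fields (`model`, `prompt`, `images`, `stream`) and the `response` field in the reply follow Ollama's /api/generate endpoint.

```python
import base64
import json
import urllib.request
from typing import Any, Callable, Dict

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed: Ollama's default local address


def build_ollama_payload(image_bytes: bytes, prompt: str, model: str = "llava") -> Dict[str, Any]:
    """Build the JSON body for /api/generate with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode()],
        "stream": False,  # ask for a single JSON reply instead of a token stream
    }


def safe_ollama_generate(
    image_bytes: bytes,
    prompt: str,
    scan: Callable[[bytes], Dict[str, Any]],
    threshold: int = 70,
    model: str = "llava",
) -> str:
    """Scan the raw bytes first; call Ollama only if the scan passes."""
    result = scan(image_bytes)
    if result["score"] >= threshold:
        raise ValueError(f"Blocked by PI scan: score={result['score']}")
    body = json.dumps(build_ollama_payload(image_bytes, prompt, model)).encode()
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]
```

A blocked image raises before any request is built, so the bytes never reach the local vision encoder.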
What about the transformers Agents class?
Hugging Face's experimental transformers.agents module (ReactCodeAgent, ToolCallingAgent) can invoke tools that return images as outputs, feeding them back into the agent's multimodal context. Scan image tool results before they re-enter the agent loop — the agentic escalation risk is higher than for single-turn inference because a poisoned image can hijack downstream tool calls. Use a threshold of 50 (vs 70 for user uploads) to be conservative on agent-loop content.
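The two thresholds can live in one lookup keyed by where the image came from. This is a sketch under the thresholds stated above; the source labels (`user`, `agent_tool`) and the function names are hypothetical, and falling back to the strictest threshold for unknown sources is a design choice of this example.

```python
from typing import Dict

# Thresholds from the text: 70 for user uploads, 50 for agent-loop content.
THRESHOLDS: Dict[str, int] = {"user": 70, "agent_tool": 50}


def threshold_for(source: str) -> int:
    """Return the threshold for a source, defaulting to the strictest one."""
    return THRESHOLDS.get(source, min(THRESHOLDS.values()))


def is_blocked(score: int, source: str) -> bool:
    """True when the score meets or exceeds the threshold for this source."""
    return score >= threshold_for(source)
```

With these values a score of 60 passes for a user upload but is blocked when the same image arrives as a tool result.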
Do I need to scan images in the training data?
If you are fine-tuning on a dataset that includes user-contributed or third-party images (especially a dataset downloaded from the Hub, where multiple parties can contribute), yes: scan at least a representative sample. A training data poisoning attack embeds adversarial pixel triggers in fine-tuning images. After training completes, the poisoned behaviour is baked into the model weights, where no per-request scan can remove it. Scanning before training starts is the correct control point.
What is the latency impact of the scan on inference throughput?
Glyphward's scan endpoint returns in under 200 ms at p95. For batch-inference pipelines, you can pre-scan the full input batch concurrently before starting generation; Python's asyncio or a thread pool executor works well for this. For interactive single-turn inference, 200 ms is typically within an acceptable latency budget. For streaming generation, scan the image once before the stream starts; the scan does not need to be repeated for each generated token.
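A concurrent pre-scan with a thread pool might look like the sketch below. It assumes a blocking scan callable that returns a dict with a `score` key; `prescan_batch` and the `max_workers` default are hypothetical choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List, Tuple


def prescan_batch(
    images: List[bytes],
    scan: Callable[[bytes], Dict[str, Any]],
    threshold: int = 70,
    max_workers: int = 8,
) -> Tuple[List[int], List[int]]:
    """Scan all images concurrently; return (passed_indices, blocked_indices)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so result i belongs to images[i].
        results = list(pool.map(scan, images))
    passed = [i for i, r in enumerate(results) if r["score"] < threshold]
    blocked = [i for i, r in enumerate(results) if r["score"] >= threshold]
    return passed, blocked
```

Because the scans are network-bound, threads overlap their waiting time, and the batch costs roughly as long as its slowest scans rather than the sum of all of them. Only the indices in `passed` should go on to generation.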
Further reading
- FigStep detection — the typographic attack class vision encoders are exposed to.
- AgentTypo detector — glyph-distortion variant that defeats standard OCR.
- Vision language model security — category overview of the VLM inference-boundary attack surface.
- OWASP LLM04:2025 data and model poisoning — the dataset-level attack that targets fine-tuning corpora.
- Prompt-injection scanner for RAG pipelines — pre-ingestion scan for document corpora fed to retrieval pipelines.
- Prompt-injection scanner for LangChain agents — the equivalent pattern for LangChain-wrapped inference.
- Indirect prompt injection via images — the RAG / tool-result retrieval attack path explained.
- Multimodal LLM security API — category-level overview of the scan endpoint.