Prompt-injection scanner for Hugging Face Transformers

Hugging Face transformers is the most widely used Python library for running vision-language models locally or on managed inference endpoints. When you load LLaVA, InstructBLIP, Idefics, PaliGemma, or any other multimodal checkpoint from the Hub, the library's AutoProcessor encodes user-supplied images into pixel tensors and the model's generate() call passes those tensors to the vision encoder — without any prompt-injection scan in between. A FigStep-class typographic payload hidden in the image bytes reaches the vision encoder unfiltered. The Hugging Face pipeline() helper, the Trainer API, and Inference Endpoints all share this gap. Scan the image bytes before they become tensors.

TL;DR

Before calling model.generate(**inputs) on any vision-language model loaded with Hugging Face Transformers, POST the raw image bytes to Glyphward's /v1/scan endpoint. If the score meets or exceeds your threshold, reject the request before it reaches the model. One POST, under 200 ms, returns a 0–100 score plus the flagged pixel region. Free tier: 10 scans/day, no card. Start on the free tier.

How Hugging Face Transformers handles multimodal inputs

The typical multimodal inference pattern in Transformers is:

from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

image = Image.open("user_upload.png")
prompt = "USER: <image>\nDescribe this image.\nASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=200)

The processor step converts the PIL.Image object into a pixel_values tensor. From that point forward the image exists only as floating-point numbers — any upstream scan that operates on text (a string filter, a Lakera Guard call, an Azure Prompt Shields check) never sees the pixel bytes. The bytes are the attack surface. The generate() call materialises the injected instruction by passing those pixels directly to the model's vision encoder.

This is not a Hugging Face oversight — the transformers library is a model-serving layer, not a security layer. Safety work in the Transformers ecosystem (content filters, RLHF, Constitutional AI training) happens at model training time and cannot retroactively block a FigStep payload because the model never saw adversarial pixel-encoded prompts in its safety training distribution. The scan must operate on the image bytes before the processor call, while you still have the raw file.

Python intercept — before AutoProcessor and model.generate()

Add a scan helper that receives the raw bytes before the processor converts them to tensors:

import io
import base64
import httpx
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

GLYPHWARD_API_KEY = "YOUR_GLYPHWARD_API_KEY"  # use env var in production
GLYPHWARD_SCAN_URL = "https://glyphward.com/v1/scan"
SCORE_THRESHOLD = 70  # reject at or above this; lower for agentic pipelines

def scan_image_bytes(image_bytes: bytes, source: str = "user") -> dict:
    """POST raw image bytes to Glyphward. Returns scan result dict."""
    encoded = base64.b64encode(image_bytes).decode()
    resp = httpx.post(
        GLYPHWARD_SCAN_URL,
        headers={"Authorization": f"Bearer {GLYPHWARD_API_KEY}"},
        json={"image": encoded, "source": source},
        timeout=5.0,
    )
    resp.raise_for_status()
    return resp.json()  # {score, flagged_region, scan_id, modality}

def safe_llava_generate(image_bytes: bytes, prompt: str, source: str = "user") -> str:
    """Scan image then run LLaVA generate. Raises ValueError on threshold breach."""
    result = scan_image_bytes(image_bytes, source=source)
    if result["score"] >= SCORE_THRESHOLD:
        raise ValueError(
            f"Image PI scan score {result['score']} meets or exceeds threshold {SCORE_THRESHOLD}. "
            f"scan_id={result['scan_id']} flagged_region={result.get('flagged_region')}"
        )
    # only reach here if scan passed
    image = Image.open(io.BytesIO(image_bytes))
    processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
    model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=200)
    return processor.decode(output[0], skip_special_tokens=True)

In production, load the processor and model once at application startup rather than per-request. The scan call is the only addition to the normal inference path — it operates on the original bytes before any tensor conversion, so no information is lost before the scan.
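One way to follow the load-once advice is to memoise the loader. The sketch below is a minimal illustration and not part of Transformers or Glyphward: `make_cached_loader` and `load_fn` are hypothetical names, and `load_fn` stands in for whatever application function calls `from_pretrained` for the processor and model.

```python
from typing import Any, Callable, Dict, Tuple

Bundle = Tuple[Any, Any]  # (processor, model)


def make_cached_loader(load_fn: Callable[[str], Bundle]) -> Callable[[str], Bundle]:
    """Wrap a (processor, model) loader so each model id is loaded only once.

    load_fn is a hypothetical callable supplied by the application, for example
    one that calls AutoProcessor.from_pretrained and
    LlavaForConditionalGeneration.from_pretrained and returns both objects.
    """
    cache: Dict[str, Bundle] = {}

    def get(model_id: str) -> Bundle:
        if model_id not in cache:
            cache[model_id] = load_fn(model_id)  # slow path: runs once per model id
        return cache[model_id]

    return get
```

A request handler would then call the returned `get("llava-hf/llava-1.5-7b-hf")` instead of calling `from_pretrained` directly. This version is not thread-safe; a multi-threaded server would likely want to warm the cache at startup or guard it with a lock.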

The pipeline() API: same gap, same fix

Hugging Face's high-level pipeline() API is convenient but does not add a scan layer:

from transformers import pipeline
from PIL import Image

vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
result = vqa(Image.open("user_upload.png"), question="What does this say?")

The pipeline call accepts PIL.Image objects directly. Add the scan before passing the image to the pipeline:

import io, base64, httpx
from PIL import Image
from transformers import pipeline

def safe_vqa_pipeline(image_bytes: bytes, question: str) -> dict:
    result = scan_image_bytes(image_bytes)  # same helper as above
    if result["score"] >= SCORE_THRESHOLD:
        raise ValueError(f"PI scan blocked image: score={result['score']}")
    image = Image.open(io.BytesIO(image_bytes))
    vqa = pipeline("visual-question-answering", model="Salesforce/blip-vqa-base")
    return vqa(image, question=question)

The pattern is the same regardless of the pipeline task — image-to-text, visual-question-answering, document-question-answering, and zero-shot-image-classification all accept images that could carry adversarial payloads. Add the scan gate wherever the raw bytes arrive.
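Because the gate is identical for every task, it can be factored into a decorator. The following is a sketch under stated assumptions: `scan_gate` and `ImageBlocked` are hypothetical names introduced here, and `scan_fn` is assumed to behave like the `scan_image_bytes` helper from this article, returning a dict with a numeric `score` key.

```python
import functools
from typing import Any, Callable, Dict


class ImageBlocked(ValueError):
    """Raised when a scan score meets or exceeds the threshold."""


def scan_gate(scan_fn: Callable[[bytes], Dict[str, Any]], threshold: int = 70):
    """Decorator factory: scan the first argument (raw image bytes) before
    the wrapped function runs.

    scan_fn is assumed to return a dict containing a numeric "score" key.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(image_bytes: bytes, *args, **kwargs):
            result = scan_fn(image_bytes)
            if result["score"] >= threshold:
                raise ImageBlocked(f"PI scan blocked image: score={result['score']}")
            # Only reached when the scan passed.
            return func(image_bytes, *args, **kwargs)
        return wrapper
    return decorator
```

Decorating a pipeline wrapper with `@scan_gate(scan_image_bytes, threshold=SCORE_THRESHOLD)` keeps the scan in one place, provided the wrapped function takes the raw bytes as its first positional argument.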

InstructBLIP, Idefics, PaliGemma, and other VLMs

The same intercept pattern applies across all vision-language model architectures available on the Hugging Face Hub. Different models use different processor classes, but the scan point is always the same — before the processor call, while you still have raw image bytes:

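# InstructBLIP (Salesforce)
# Reuses scan_image_bytes and SCORE_THRESHOLD from the helper above.
import io
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration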

def safe_instructblip_generate(image_bytes: bytes, prompt: str) -> str:
    result = scan_image_bytes(image_bytes)
    if result["score"] >= SCORE_THRESHOLD:
        raise ValueError(f"Blocked: score={result['score']}")
    image = Image.open(io.BytesIO(image_bytes))
    processor = InstructBlipProcessor.from_pretrained(
        "Salesforce/instructblip-vicuna-7b"
    )
    model = InstructBlipForConditionalGeneration.from_pretrained(
        "Salesforce/instructblip-vicuna-7b"
    )
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=256)
    return processor.batch_decode(outputs, skip_special_tokens=True)[0]

# PaliGemma (Google, via transformers >= 4.41)
from transformers import PaliGemmaForConditionalGeneration, AutoProcessor as PGProcessor

def safe_paligemma_generate(image_bytes: bytes, prompt: str) -> str:
    result = scan_image_bytes(image_bytes)
    if result["score"] >= SCORE_THRESHOLD:
        raise ValueError(f"Blocked: score={result['score']}")
    image = Image.open(io.BytesIO(image_bytes))
    processor = PGProcessor.from_pretrained("google/paligemma-3b-pt-224")
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        "google/paligemma-3b-pt-224"
    )
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=100)
    return processor.decode(outputs[0], skip_special_tokens=True)

For Idefics2 and Idefics3 (from Hugging Face / HuggingFaceM4), the processor's chat template takes a messages list containing image and text content blocks, analogous to the OpenAI Chat Completions format. The image blocks are placeholders; the image data itself is passed to the processor separately. Scan every image that backs an image block before the processor call:

import io
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

def safe_idefics_generate(messages: list, image_bytes_map: dict) -> str:
    """
    messages: Idefics-format list with {"role": ..., "content": [...]} items.
    image_bytes_map: {image_id: bytes} — raw bytes for each image in the messages list.
    """
    for img_id, img_bytes in image_bytes_map.items():
        r = scan_image_bytes(img_bytes)
        if r["score"] >= SCORE_THRESHOLD:
            raise ValueError(f"Image {img_id} blocked: score={r['score']}")
    processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
    model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")
    images = [Image.open(io.BytesIO(b)) for b in image_bytes_map.values()]
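    # add_generation_prompt=True appends the assistant turn marker so the model answers
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=images, return_tensors="pt")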
    output = model.generate(**inputs, max_new_tokens=500)
    return processor.decode(output[0], skip_special_tokens=True)
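A cheap pre-check is to confirm that the number of image blocks in the messages list matches the number of images that were scanned, so that no image reaches the processor without a corresponding scan. The sketch below assumes the content-block layout where each image appears as a `{"type": "image"}` placeholder; `count_image_blocks` and `check_images_match` are hypothetical helper names.

```python
from typing import Any, Dict, List


def count_image_blocks(messages: List[Dict[str, Any]]) -> int:
    """Count {"type": "image"} content blocks across all messages."""
    total = 0
    for message in messages:
        content = message.get("content", [])
        if isinstance(content, str):
            continue  # plain-string content carries no image placeholders
        total += sum(1 for block in content if block.get("type") == "image")
    return total


def check_images_match(messages: List[Dict[str, Any]],
                       image_bytes_map: Dict[str, bytes]) -> None:
    """Raise ValueError if placeholders and supplied images differ in number."""
    expected = count_image_blocks(messages)
    supplied = len(image_bytes_map)
    if expected != supplied:
        raise ValueError(
            f"messages reference {expected} image(s) but {supplied} were supplied"
        )
```

Calling `check_images_match(messages, image_bytes_map)` at the top of `safe_idefics_generate` would reject mismatched requests before any scan or model call runs.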

Hugging Face Inference Endpoints and the Serverless API

When you call a multimodal model through Hugging Face Inference Endpoints (your own dedicated endpoint) or the Serverless Inference API, the request travels from your application code to the Hugging Face infrastructure. Add the Glyphward scan in your application before dispatching the request — the scan operates on the raw bytes before they leave your process, regardless of where the model runs:

import httpx, base64, io
from PIL import Image

HF_API_TOKEN = "YOUR_HF_API_TOKEN"
ENDPOINT_URL = "https://api-inference.huggingface.co/models/llava-hf/llava-1.5-7b-hf"

def safe_hf_inference_endpoint(image_bytes: bytes, prompt: str) -> dict:
    # Step 1: scan before the network call
    scan = scan_image_bytes(image_bytes)
    if scan["score"] >= SCORE_THRESHOLD:
        raise ValueError(f"Blocked by PI scan: score={scan['score']}")
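    # Step 2: dispatch to Hugging Face endpoint.
    # The payload shape below is illustrative; match it to your endpoint's task or custom handler.
    encoded = base64.b64encode(image_bytes).decode()
    resp = httpx.post(
        ENDPOINT_URL,
        headers={"Authorization": f"Bearer {HF_API_TOKEN}"},
        json={"inputs": {"image": encoded, "prompt": prompt}},
        timeout=60.0,  # generation often exceeds httpx's 5-second default timeout
    )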
    resp.raise_for_status()
    return resp.json()

Hugging Face's Inference Endpoints do not add a PI scan layer at the endpoint level. The scan must live in your application code.
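Since the scan is itself a network call made from application code, the application also has to decide what happens when that call fails. The sketch below makes the policy explicit and defaults to failing closed. It is an illustration only: `is_image_allowed` and `fail_open` are hypothetical names, and `scan_fn` is assumed to behave like `scan_image_bytes`, which raises on timeouts and non-2xx responses.

```python
from typing import Any, Callable, Dict


def is_image_allowed(scan_fn: Callable[[bytes], Dict[str, Any]],
                     image_bytes: bytes,
                     threshold: int = 70,
                     fail_open: bool = False) -> bool:
    """Return True if the image may be forwarded to the model.

    If the scan call raises or returns no usable score, the image is rejected
    unless fail_open=True. With the httpx-based helper, failures surface as
    httpx.HTTPError subclasses; a broad except is used here only because
    scan_fn is an injected, hypothetical callable.
    """
    try:
        score = scan_fn(image_bytes)["score"]
    except Exception:
        return fail_open
    return score < threshold
```

Failing closed trades availability for safety: a scanner outage blocks image requests instead of letting unscanned images through.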

Indirect PI: Hugging Face Datasets with image columns

A subtler attack surface is the Hugging Face Datasets library. Teams commonly fine-tune multimodal models on datasets that include image columns, and RAG pipelines sometimes use datasets as their document store. An adversary who contributes to a dataset (or poisons a third-party dataset your pipeline downloads) can embed FigStep-class payloads in individual image rows. Those images then appear in fine-tuning batches or retrieval results. This is the poisoning vector that OWASP lists as LLM04:2025 (Data and Model Poisoning), in its Hugging Face form.

Scan images from datasets before they enter training or retrieval loops:

import io
from typing import Optional

from datasets import load_dataset
from PIL import Image

# Reuses scan_image_bytes and SCORE_THRESHOLD from the helper above.
def scan_dataset_images(dataset_name: str, image_col: str = "image",
                         split: str = "train", sample_n: Optional[int] = None):
    """Scan all (or first sample_n) images in a Hugging Face dataset."""
    ds = load_dataset(dataset_name, split=split)
    flagged = []
    for i, row in enumerate(ds):
        if sample_n is not None and i >= sample_n:
            break
        img: Image.Image = row[image_col]
        buf = io.BytesIO()
        img.save(buf, format="PNG")
        scan = scan_image_bytes(buf.getvalue(), source=f"dataset:{dataset_name}")
        if scan["score"] >= SCORE_THRESHOLD:
            flagged.append({"index": i, "score": scan["score"], "scan_id": scan["scan_id"]})
    return flagged

Run this scan against any third-party dataset before ingestion, and log the results as a dataset-provenance record that can support ISO 27001 A.8.28 and SOC 2 CC6.6 evidence.
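The logged record can be as simple as one JSON line per scan run. The sketch below is one possible shape and not a format prescribed by Glyphward or by any compliance framework: `build_provenance_record` and every field name in it are assumptions. The `flagged` argument is the list returned by `scan_dataset_images`, and the caller supplies the number of rows scanned.

```python
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def build_provenance_record(dataset_name: str,
                            split: str,
                            scanned_count: int,
                            flagged: List[Dict[str, Any]],
                            threshold: int) -> str:
    """Serialise one dataset scan run as a single JSON line."""
    record = {
        "dataset": dataset_name,
        "split": split,
        "scanned_count": scanned_count,
        "flagged_count": len(flagged),
        "threshold": threshold,
        "flagged": flagged,  # entries with index, score, scan_id
        "scanned_at": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)
```

Appending each returned line to an append-only log file gives a dated trail of which dataset splits were scanned and which rows were flagged.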

Coverage matrix

How Glyphward compares to other tools in the Hugging Face ecosystem for multimodal PI detection:

Tool	Image-PI detection	HF pipeline integration	Dataset pre-scan	Self-serve free tier
Lakera Guard	Text inputs only	Text path only	No	No (enterprise)
LLM Guard	Text inputs only	Text path only	No	Yes (OSS, text)
Azure Prompt Shields	Text inputs only	No (Azure-gated)	No	No (Azure-gated)
Promptfoo	Eval-time only	No (test harness)	No	Yes (eval-time)
Glyphward	Image + audio bytes	Pre-generate() wrapper	Dataset column scan	Yes — 10 scans/day free

None of the text-only tools have an intercept point in the Hugging Face Transformers pipeline that reaches pixel bytes — they operate on string-format prompts. Glyphward operates on the raw image bytes, before the AutoProcessor call converts them to tensors.