Prompt injection scanner for HuggingFace Inference Endpoints
HuggingFace Inference Endpoints lets teams deploy any open-weight model — including vision-language models (VLMs) like LLaVA-1.6, IDEFICS-3, InternVL2, Phi-3.5-Vision, and Qwen2-VL — as a scalable production API in minutes, without managing GPU infrastructure. Unlike the HuggingFace Transformers library used for local inference, Inference Endpoints is a hosted service: you point it at a Hub model, configure compute, and receive a dedicated HTTPS endpoint that your application POSTs image+text payloads to directly. HuggingFace's platform-level safeguards focus on content filtering at the output layer and Hub model access controls — they do not inspect the pixel content of images submitted to VLM endpoints for adversarially embedded instructions. An attacker who can influence what images your application forwards to an Inference Endpoint — through user uploads, document ingestion, web scraping, or data pipeline inputs — can inject natural-language directives hidden in pixel patterns that the model interprets as instructions, bypassing your system prompt without any network-level signal. Glyphward's pre-model scan gate intercepts image bytes before they reach the Inference Endpoint, blocking pixel-level injections that HuggingFace's own infrastructure cannot see.
TL;DR
Before forwarding any image to a HuggingFace Inference Endpoint (serverless or dedicated), POST the image bytes to https://glyphward.com/v1/scan and reject the request if score >= 65. The scan adds under 200 ms p95 latency — less than a typical cold-start on a serverless endpoint. The same scan call covers all four attack surfaces described below regardless of which VLM is deployed. Free tier — 10 scans/day, no card required.
The four multimodal attack surfaces in HuggingFace Inference Endpoints
1. Serverless Inference API — shared VLM endpoints with no per-tenant isolation. HuggingFace's Serverless Inference API provides free or low-rate-limited access to popular VLMs hosted on the Hub, including LLaVA, IDEFICS, and Phi-3-Vision, via a simple OpenAI-compatible chat/completions endpoint. Teams prototyping multimodal features frequently start with the Serverless API before graduating to dedicated endpoints — and many production-adjacent integrations remain on the serverless tier to avoid dedicated endpoint costs. Because the serverless tier auto-provisions shared compute, there is no per-tenant ingestion filter: every image submitted to https://api-inference.huggingface.co/models/<model-id> is passed directly to the model's input pipeline. An adversarially crafted image embedded in a user-uploaded document or screenshotted from a malicious web page will reach the model's vision encoder unscanned, giving the injection payload a direct channel to the model's instruction-following stack.
2. Dedicated Inference Endpoints with custom inference handlers. Dedicated Inference Endpoints allow teams to deploy any Hub model with a custom handler.py that pre-processes inputs, applies custom tokenisation, or adds application-specific logic before model inference. A common pattern is to override the default pipeline handler to add structured output parsing, custom system prompt injection, or multi-image batch support. These custom handlers run inside the endpoint container with no platform-enforced content inspection step — HuggingFace's platform trusts that the handler author has implemented their own input validation. In practice, most custom handlers focus on formatting and parsing, not security scanning. An image that arrives at a dedicated endpoint with a custom handler bypasses both the default HuggingFace content filters (which apply at the model-hosted-API layer, not the custom container layer) and any security scanning that the handler author has not explicitly added. The combination of elevated capabilities (custom system prompts, tool-call formatting, structured output) and absent input scanning makes custom-handler endpoints the highest-risk HuggingFace deployment configuration.
3. Multi-image batch inference for document processing pipelines. Both the Serverless API and dedicated endpoints support multi-image batching — submitting an array of images in a single request for batch extraction (invoices, product catalogue pages, scanned forms, slides). Document processing pipelines built on HuggingFace Inference Endpoints commonly fetch a set of PDF page images from S3, GCS, or a document management system and submit them as a batch to a VLM for structured field extraction. In a batch of 30 PDF page images, a single adversarially crafted page — planted by a supplier who can influence what documents enter the pipeline, or introduced through a compromised ingestion step — will be processed alongside the legitimate pages with the same model context and system prompt. The adversarial instruction on that one page can override the extraction behaviour for the entire batch or for all subsequent pages in a long-context multi-image call, since VLMs process images sequentially within a single context window. Scanning images individually before batch submission is the only reliable mitigation.
4. HuggingFace Spaces inference backends exposed as API targets. HuggingFace Spaces frequently serve dual duty as both interactive demos and production inference backends — teams embed a Gradio or Streamlit Space's /predict endpoint in their application code, bypassing the Spaces UI. Spaces running VLMs (a common pattern for open-source multimodal demos) accept arbitrary image inputs from the application layer with no HuggingFace platform-level PI scanning; the Space owner is solely responsible for input validation. When a Space's endpoint is used as a production backend — often for cost reasons, since public Spaces run on free HuggingFace infrastructure — it inherits all the access-control and input-scanning gaps of the underlying Gradio or FastAPI handler. Any application POSTing images to a Space-backed VLM endpoint is exposed to the same multimodal injection risk as a dedicated endpoint, with the additional risk that the Space's system prompt may be visible in the public source code, giving attackers prior knowledge of the instruction context they are trying to override.
Integration: HuggingFace Inference Client with Glyphward pre-scan gate
import base64

import requests
from huggingface_hub import InferenceClient

HF_TOKEN = "<your-huggingface-token>"
ENDPOINT_URL = "<your-dedicated-endpoint-url>"  # or use model ID for serverless
GLYPHWARD_KEY = "<your-glyphward-api-key>"
GLYPHWARD_THRESHOLD = 65  # fail-closed for document and user-upload workloads

client = InferenceClient(
    model=ENDPOINT_URL,
    token=HF_TOKEN,
)


def scan_image_for_injection(image_bytes: bytes) -> dict:
    """Scan image bytes for multimodal prompt injection before HF inference call."""
    encoded = base64.b64encode(image_bytes).decode()
    resp = requests.post(
        "https://glyphward.com/v1/scan",
        json={"image": encoded, "source": "huggingface_inference_endpoints"},
        headers={"Authorization": f"Bearer {GLYPHWARD_KEY}"},
        timeout=8,
    )
    resp.raise_for_status()
    return resp.json()


def _forward_to_endpoint(image_bytes: bytes, prompt: str) -> str:
    """Send an already-scanned image to the HuggingFace Inference Endpoint."""
    encoded = base64.b64encode(image_bytes).decode()
    response = client.chat_completion(
        messages=[
            {
                "role": "user",
                "content": [
                    # The MIME type is hard-coded; change it if you forward JPEG bytes.
                    {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{encoded}"}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        max_tokens=512,
    )
    return response.choices[0].message.content


def describe_image(image_bytes: bytes, prompt: str = "Describe the content of this image.") -> str:
    """Gate image through Glyphward before forwarding to HuggingFace Inference Endpoint."""
    try:
        scan = scan_image_for_injection(image_bytes)
    except Exception as exc:
        # Fail closed: an unreachable scanner blocks the request.
        raise RuntimeError(
            "Image security scan unavailable; request blocked. Please retry."
        ) from exc
    if scan["score"] >= GLYPHWARD_THRESHOLD:
        raise ValueError(
            f"Image blocked: adversarial content detected "
            f"(score {scan['score']}/100, scan_id={scan['scan_id']})"
        )
    # Safe: forward to HuggingFace Inference Endpoint
    return _forward_to_endpoint(image_bytes, prompt)


def process_document_batch(image_bytes_list: list[bytes], prompt: str) -> list[str]:
    """Scan all batch images before any inference call; reject entire batch on first fail."""
    for i, img_bytes in enumerate(image_bytes_list):
        scan = scan_image_for_injection(img_bytes)
        if scan["score"] >= GLYPHWARD_THRESHOLD:
            raise ValueError(
                f"Batch aborted: image at index {i} blocked "
                f"(score {scan['score']}/100, scan_id={scan['scan_id']})"
            )
    # All images passed. Forward them directly: calling describe_image() here
    # would scan every image a second time.
    return [_forward_to_endpoint(img, prompt) for img in image_bytes_list]
The process_document_batch() function scans all images in a batch before submitting any to the endpoint. This is the correct pattern for document pipeline batch jobs, because a single adversarial image in the middle of a 30-page batch should abort the entire batch, not just its own inference call. Partial batches (rejecting only the flagged image and continuing with the rest) are appropriate only if your downstream processing is strictly per-page with no shared context; most VLM multi-image calls share context across all images in the request. The source field in the scan request lets Glyphward's corpus tagging distinguish HuggingFace Inference Endpoint traffic for reporting and false-positive tuning. For serverless API use, replace ENDPOINT_URL with the model ID string (e.g. "llava-hf/llava-1.6-mistral-7b-hf"); the scan gate is identical.
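The choice between aborting a whole batch and dropping only the flagged pages can be written down as a small policy function. The following is a minimal sketch, not part of any Glyphward or HuggingFace API: `triage_batch`, `BatchDecision`, and the `shared_context` flag are hypothetical names, and `scan` stands in for any callable that returns a dict with a `score` key, as the scan response in this guide does.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

THRESHOLD = 65  # same cut-off as the integration example in this guide


@dataclass
class BatchDecision:
    passed: list[int]   # indices that may be forwarded to the endpoint
    blocked: list[int]  # indices that scored at or above the threshold


def triage_batch(
    images: Sequence[bytes],
    scan: Callable[[bytes], dict],  # hypothetical scanner callable
    shared_context: bool = True,
) -> BatchDecision:
    """Scan every image, then apply an all-or-nothing or per-page policy."""
    blocked = [i for i, img in enumerate(images) if scan(img)["score"] >= THRESHOLD]
    if blocked and shared_context:
        # One flagged page taints a shared context window: forward nothing.
        return BatchDecision(passed=[], blocked=blocked)
    # Per-page processing: forward everything that was not flagged.
    passed = [i for i in range(len(images)) if i not in blocked]
    return BatchDecision(passed=passed, blocked=blocked)
```

With `shared_context=True` (the default, matching the abort-on-first-failure guidance above) a single flagged page empties `passed`; setting it to `False` is only reasonable when each page is sent to the model in its own request.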
Coverage matrix
| Defence layer | Serverless API (shared VLMs) | Dedicated Endpoint (custom handler) | Batch document processing | Spaces-backed inference API |
|---|---|---|---|---|
| HuggingFace Hub content policy | No — model hosting policy, not runtime input scanning | No | No | No |
| HuggingFace Inference Endpoints output filters | Partial — optional output moderation, not input PI detection | No — bypassed by custom handler container | No | No |
| Gradio/Streamlit input validation | N/A | N/A | N/A | No — framework validates type/size, not pixel content |
| Model system prompt / instruction tuning | No — PI payloads override system prompts by design | No | No | No |
| Glyphward pre-model scan | Yes — scan before every InferenceClient call | Yes — scan before POST to dedicated endpoint URL | Yes — scan each image before batch submission, abort on first failure | Yes — scan before POST to Space /predict endpoint |
Related questions
Does HuggingFace's Inference Endpoints platform scan image inputs for prompt injection?
No. HuggingFace Inference Endpoints provides infrastructure for hosting and scaling models — GPU provisioning, autoscaling, TLS termination, and optional output-layer content filtering (available on some Hub-hosted models). It does not provide input-side inspection of image pixel content for adversarially embedded instructions. The platform's security model assumes that application developers are responsible for validating and sanitising inputs before forwarding them to the model. This is consistent with how cloud ML inference services generally work: the infrastructure layer handles availability and access control; the application layer handles input security. Glyphward sits at the application layer, between your application code and the Inference Endpoint URL, scanning image bytes before they are forwarded.
How is this different from the HuggingFace Transformers library?
The HuggingFace Transformers page covers the local Python library (from transformers import pipeline) used for self-hosted inference on your own GPU or CPU. In that deployment model, image preprocessing, model loading, and inference all happen inside your own process — you control the entire stack and can add scan gates anywhere. HuggingFace Inference Endpoints is a different product: you call an external HTTPS endpoint that HuggingFace provisions and manages. The security boundary is the HTTP request: you POST image bytes across a network to a model running on HuggingFace's infrastructure. The scan gate pattern (POST to Glyphward before POST to the Inference Endpoint) is the same regardless of the model, but the deployment context — managed external API vs local library — is what distinguishes the two pages.
Can the Glyphward scan gate be added inside a custom inference handler?
Yes, and this is the recommended pattern for dedicated endpoints with custom handler.py files. In your EndpointHandler.__call__ method, call the Glyphward API with the raw image bytes extracted from the input payload before passing the image to the model's tokeniser or pipeline. This places the scan inside the endpoint container itself, which ensures that any call to the endpoint — regardless of whether it originated from your authorised application or an unexpected API client — passes through the scan gate. For serverless endpoints where you cannot add a custom handler, the scan gate must be added in the application code that calls the Inference API, before the InferenceClient call is made.
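As a rough illustration of that placement, the sketch below follows the `EndpointHandler` shape used for custom handlers (`__init__(self, path="")` and `__call__(self, data)`). Everything else is an assumption made for the example: the `scan_fn` and `infer_fn` constructor arguments are hypothetical injection points so the logic can be exercised without a model or network, and the payload layout (`inputs.image` as base64, `inputs.text` as the prompt) is one possible convention, not a platform requirement. In a deployed handler the platform constructs the class itself, so the scanner client and model would be created inside `__init__`.

```python
import base64
from typing import Any, Callable, Dict, List, Optional

GLYPHWARD_THRESHOLD = 65  # same cut-off as the application-side gate


class EndpointHandler:
    """Custom handler sketch: scan the image first, run the model second."""

    def __init__(
        self,
        path: str = "",
        scan_fn: Optional[Callable[[bytes], dict]] = None,       # hypothetical
        infer_fn: Optional[Callable[[bytes, str], str]] = None,  # hypothetical
    ):
        # A real handler would load the model from `path` and build a
        # scanner client here instead of receiving callables.
        self.scan_fn = scan_fn
        self.infer_fn = infer_fn

    def __call__(self, data: Dict[str, Any]) -> List[Dict[str, Any]]:
        inputs = data.get("inputs", {})
        image_bytes = base64.b64decode(inputs["image"])  # assumed payload key
        try:
            scan = self.scan_fn(image_bytes)
        except Exception:
            # Fail closed: no scan result means no inference.
            return [{"error": "scan unavailable, request blocked"}]
        if scan["score"] >= GLYPHWARD_THRESHOLD:
            return [{"error": "image blocked", "score": scan["score"]}]
        text = self.infer_fn(image_bytes, inputs.get("text", ""))
        return [{"generated_text": text}]
```

Because the check runs inside `__call__`, the model is never invoked for a flagged image regardless of which client sent the request.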
What VLMs on HuggingFace Hub are most commonly targeted?
The highest-risk VLMs for multimodal prompt injection are those with strong instruction-following capabilities that are also widely deployed for document and image understanding tasks: LLaVA-1.6-Mistral-7B, IDEFICS-3-8B-Llama3, InternVL2-8B, Phi-3.5-Vision-Instruct, Qwen2-VL-7B-Instruct, and SmolVLM. These models are instruction-tuned, meaning they are specifically trained to follow directives embedded in their input. That is exactly what makes them useful for document analysis, and also what makes them susceptible to adversarial directives hidden in image pixels. The same capability that allows Phi-3.5-Vision to extract structured fields from a scanned invoice also allows it to follow a directive hidden in a pixel pattern on that invoice. The scan threshold of 65 is calibrated against the known attack corpora (FigStep, AgentTypo, typographic PI) that these instruction-tuned models are most sensitive to.
Does scanning before the Inference Endpoint call add meaningful latency?
The Glyphward scan API returns results in under 200 ms at p95 for typical document image sizes (JPEG/PNG under 5 MB). A HuggingFace Inference Endpoint cold-start on a serverless deployment can take 10–30 seconds; a warm dedicated endpoint call for a 7B VLM typically takes 1–4 seconds per image. Against those figures, a 200 ms pre-scan adds roughly 5–20% to a warm endpoint call (200 ms on top of 4 seconds and 1 second respectively) and is negligible against a cold start. For latency-sensitive workloads, the scan can be issued in parallel with the endpoint warm-up ping (a lightweight empty POST to keep the endpoint hot), so the scan completes before the model is ready to receive the image. For batch workloads, scans across a 30-image batch can be issued concurrently with asyncio.gather(), either by wrapping the blocking HTTP call in asyncio.to_thread() or by using an async HTTP client. The total scan latency for 30 images in parallel is then close to that of the slowest single scan, not the sum of all 30.
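One way to issue the batch scans concurrently is sketched below. It assumes the scanner is a blocking callable (such as a function built on `requests`), so each call is moved to a worker thread with `asyncio.to_thread()` before being combined with `asyncio.gather()`. The function name `scan_batch_concurrently` and the `max_in_flight` limit are illustrative choices, not part of any Glyphward client.

```python
import asyncio
from typing import Callable, Sequence


async def scan_batch_concurrently(
    images: Sequence[bytes],
    scan: Callable[[bytes], dict],  # blocking scanner callable (assumed)
    max_in_flight: int = 10,        # illustrative cap on parallel requests
) -> list[dict]:
    """Scan all images concurrently and return results in input order."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def scan_one(image: bytes) -> dict:
        async with limiter:
            # to_thread keeps the blocking HTTP call off the event loop.
            return await asyncio.to_thread(scan, image)

    # gather preserves the order of its arguments, so result i belongs to
    # image i even when scans finish out of order.
    return await asyncio.gather(*(scan_one(img) for img in images))
```

A caller would run `asyncio.run(scan_batch_concurrently(pages, scan_image_for_injection))` and then apply the threshold to every result before any page is forwarded. Note that `asyncio.to_thread()` uses the default thread pool, whose size also bounds how many scans actually run at once.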
Further reading
- Prompt injection scanner for HuggingFace Transformers: the same multimodal gate pattern for self-hosted Transformers pipelines (local GPU or CPU inference, no external API call)
- Prompt injection scanner for Gradio and Streamlit apps: covering gr.Image() and st.file_uploader() entry points, often backed by HuggingFace Inference Endpoints or Spaces
- Vision-language model security: architecture overview of how VLM vision encoders process adversarial pixel patterns and why CLIP-space embeddings are the correct inspection point
- Agentic RAG pipeline prompt injection: how adversarial document images propagate through retrieval-augmented generation pipelines, including HuggingFace-hosted embedding and reranking models
- Prompt injection in autonomous AI research agents: covering multi-image batch injection in long-running research agents that use HuggingFace models to process arXiv papers and web-scraped content