OWASP LLM03:2025 Training Data Poisoning — multimodal dataset attack surface

The OWASP Top 10 for LLM Applications 2025 places Training Data Poisoning at LLM03, an attack category that covers adversarial manipulation of the data that shapes how an LLM application behaves. The attack surface has expanded well beyond academic model pre-training: in production deployments, every document ingested into a RAG retrieval corpus and every image-text pair assembled into a fine-tuning dataset is a training data input in the sense that LLM03 covers.

For multimodal AI applications, that expansion has a specific consequence that no text-only data validator addresses. PDFs, PPTX decks, and web pages brought into a knowledge base routinely contain embedded images; image-text fine-tuning datasets contain pixel layers alongside their text annotations. Text chunking pipelines (LangChain document loaders, LlamaIndex readers, Unstructured.io splitters) extract text content and pass it through the ingestion pipeline. They do not inspect the pixel content of embedded images for adversarial-instruction payloads. A text-only prompt-injection (PI) scanner applied to chunked text inherits exactly the same blind spot. The result is a persistent, stealthy poisoning vector: a malicious image embedded in a retrieved PDF carries its payload into every multimodal inference that surfaces that document, without any per-request user action and without any text-side control ever seeing it. This page covers what LLM03:2025 requires for the multimodal dataset attack surface, why text-only validators cannot satisfy it, and the pre-ingestion scanning control that closes the gap.

TL;DR

OWASP LLM03:2025 Training Data Poisoning covers adversarial manipulation of fine-tuning datasets and RAG retrieval corpora, not only foundational-model pre-training. For multimodal AI systems, the attack surface includes image layers inside ingested PDFs and image-text pairs inside fine-tuning sets — content that text chunkers and text-only validators never process. A FigStep-class payload embedded in an image inside a retrieved document will reach the vision encoder on every subsequent retrieval without triggering any text-side control. Glyphward closes the multimodal LLM03 gap with pre-ingestion scanning: image bytes extracted from each document before it enters the retrieval corpus, scored against a curated adversarial payload corpus, quarantined on threshold-crossing results. The same scan doubles as fine-tuning dataset validation. The per-document scan_id is the LLM03 dataset-provenance evidence record that text-only validators cannot produce for image content.

What LLM03:2025 covers beyond foundational-model pre-training

The "training data" in OWASP LLM03:2025 is a broader category than the weight-update data a GPU cluster consumes during pre-training. The OWASP category covers any data that shapes LLM application behaviour — which in deployed systems includes fine-tuning datasets, alignment datasets, retrieval corpora, and the few-shot examples embedded in system prompts. The 2025 revision of LLM03 reflects this expanded scope: an organisation that uses a retrieval-augmented generation pipeline has a training data attack surface in the RAG corpus that is as material to its security posture as the fine-tuning data a model was adapted on.

The three principal LLM03 attack vectors in production deployments map as follows. The first is foundational pre-training poisoning: an adversary contributes malicious content to a public dataset (web crawl, code corpus, instruction-following set) that the model vendor uses during pre-training. This vector is largely outside the control of AI application developers — it concerns model selection and supply chain due diligence rather than application-layer controls. The second is fine-tuning and alignment data poisoning: an organisation assembles a proprietary dataset from internal documents, user interactions, or third-party data feeds and uses it to adapt a foundation model for their specific task. Adversarial content introduced at this stage influences the model's learned behaviour in ways that survive deployment. The third is retrieval corpus poisoning: documents are ingested into a vector store or search index that a RAG pipeline queries at inference time. Adversarial content introduced into the corpus is retrieved and presented to the model as trusted context, influencing inference-time output without any direct user injection in the current conversation.

For text-only LLM applications, text-side data validators cover vectors two and three: a text PI scanner applied during dataset ingestion can inspect the text content of documents and training examples for adversarial instructions. For multimodal applications — including chatbots with image upload, document-reading agents, avatar SaaS products, and voice agents that process image-bearing attachments — the same pipeline has a multimodal gap that text validators do not close and that LLM03:2025's dataset security requirement squarely covers.

The multimodal attack surface in RAG retrieval corpora

A retrieval-augmented generation pipeline ingests documents from internal and external sources (policy PDFs, technical manuals, support articles, slide decks, research papers, customer contracts) and indexes their content for retrieval at query time. Modern enterprise knowledge bases contain highly heterogeneous document types, and a significant fraction of those documents contain embedded images: figures and diagrams in technical PDFs, screenshots in support articles, tables rendered as images in PPTX exports, photographs in policy documents, charts in financial reports.

Standard RAG ingestion pipelines — including the most widely deployed implementations in LangChain, LlamaIndex, and Haystack — process document content at the text layer. A PDF loader extracts text from each page and splits it into chunks; embedded images on those pages are either discarded, extracted separately as binary files without content inspection, or passed to an optional image captioning step that produces a text description for indexing. In none of these paths does a text PI scanner see the raw pixel content of the embedded image. The scanner receives text — either the extracted text surrounding the image or an OCR- or caption-derived text representation — not the image bytes themselves.

The LLM03 attack exploits this gap. An adversary who can introduce a document into the retrieval corpus — by uploading a malicious PDF to a shared knowledge base, by poisoning a third-party data feed the pipeline subscribes to, or by manipulating a web page that a crawler ingests — can embed an adversarial instruction in an image layer of that document. The image renders normally to a human reviewer: it looks like a legitimate figure or diagram. The text extracted by the PDF loader is clean. A text PI scanner applied to the ingested chunks scores them low. The document enters the corpus without triggering any control.

At inference time, a user query causes the retrieval system to surface the poisoned document. The retrieved chunk, including the embedded image or an image reference in the multimodal payload, is presented to the model alongside the user's question. The vision encoder processes the image bytes, and the model reads the adversarial instruction embedded in the pixel layer and follows it. The attack is complete. The user's text prompt contained nothing adversarial; every text-side inference-time control sees a clean query and clean text context. The payload was delivered through the image layer of a document that entered the corpus at ingestion time, days or weeks before the inference request that triggered it.

This is the canonical LLM03 indirect poisoning scenario for multimodal RAG: the attack happens at ingestion time, the effect is at inference time, and the persistence is indefinite, because every subsequent retrieval that surfaces the poisoned document delivers the payload afresh. The RAG pipeline prompt-injection scanner page covers the inference-time complement to the pre-ingestion control described here.

The multimodal attack surface in fine-tuning datasets

Organisations that fine-tune or instruction-tune multimodal foundation models on proprietary datasets face the second LLM03 vector. Fine-tuning datasets for multimodal models consist of image-text pairs: images paired with captions, instructions, question-answer pairs, or preference labels. The image content is what the fine-tuned model learns to reason about; the text annotations guide the learning objective.

Text-only data validation applied to a fine-tuning dataset inspects the text annotation side of each pair. A text PI scanner run over the instruction column or the response column of the training set will flag text annotations that contain adversarial instructions. It will not inspect the image content of each image-text pair. An adversarial trigger in a training image, whether a specific pixel pattern that causes the fine-tuned model to emit a target output or an instruction rendered as typography, will survive text-only validation and enter the model's training pass. The model learns an association between the pixel trigger and the attacker's target output; in production, the attacker can include the trigger in any submitted image to elicit the trained adversarial response.
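The gap can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `TrainingPair` shape, the regex heuristic standing in for a text PI scanner, the `stub_image_scanner` standing in for a real image scan, and the threshold of 70 on a 0 to 100 risk scale. It is not the behaviour of any particular product.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Deliberately naive stand-in for a text PI scanner (hypothetical heuristic).
_SUSPICIOUS_TEXT = re.compile(r"ignore (all )?previous instructions", re.IGNORECASE)


@dataclass
class TrainingPair:
    image_bytes: bytes
    instruction: str
    response: str


def text_only_validate(pair: TrainingPair) -> bool:
    """Return True when the text annotations look clean. Never reads image_bytes."""
    return not (
        _SUSPICIOUS_TEXT.search(pair.instruction)
        or _SUSPICIOUS_TEXT.search(pair.response)
    )


def validate_pair(
    pair: TrainingPair,
    image_scanner: Callable[[bytes], int],
    threshold: int = 70,  # assumed cut-off on a 0-100 risk scale
) -> bool:
    """Return True only when the text is clean AND the image scores below threshold."""
    return text_only_validate(pair) and image_scanner(pair.image_bytes) < threshold


def stub_image_scanner(image_bytes: bytes) -> int:
    """Hypothetical image scanner: treats a marker byte string as a high-risk trigger."""
    return 95 if b"TRIGGER" in image_bytes else 5
```

A pair whose image carries the marker but whose annotations are benign passes `text_only_validate` and fails `validate_pair`, which is the difference the paragraph above describes.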

The attack classes relevant here are the same as in the inference-time context (FigStep-style typographic triggers, adversarial perturbations that exploit the vision encoder's learned features, steganographic payloads), but the threat model differs from inference-time PI. At inference time, the attacker needs to get a malicious image in front of the model in the current conversation. In the fine-tuning poisoning scenario, the attacker needs to get a malicious image into the training dataset, after which the attack is baked into the model's learned weights and does not require any per-request image delivery. The persistent capability is worth more to an attacker and is harder to remediate: removing the poisoning effect requires identifying the poisoned training examples and then retraining, or fine-tuning further, to overwrite the learned association.

Practical fine-tuning dataset sources that carry this risk include: user-generated content used for instruction following (avatar SaaS products that fine-tune on user-submitted selfies and preference labels; document understanding models fine-tuned on scanned contracts submitted by customers); third-party image datasets assembled from web crawls or image-sharing platforms; PDF corpora used for document understanding fine-tuning; and synthetic image-text pairs generated by other multimodal models that may themselves have been fine-tuned on poisoned data. For product types whose ingestion patterns are covered by other pages in this cluster, see prompt injection for avatar SaaS and for chatbots with image upload.

Why text validators and OCR-before-scan cannot close the multimodal LLM03 gap

The two most common responses to the multimodal dataset validation gap are (1) running a text PI scanner over the text layer of each ingested document and (2) inserting an OCR step that transcribes embedded images so the text scanner can inspect the transcript. Neither closes the LLM03 multimodal gap. The reasons are architectural and parallel the LLM01 inference-time OCR argument, applied to the dataset-ingestion context.

The text validator limitation is straightforward: a text PI scanner receives text strings. In a RAG ingestion pipeline, text strings come from the text extraction layer of the document loader. The document loader extracts the text layer of a PDF page — the characters encoded in the PDF's content stream — and discards or separately handles the binary image objects embedded in that page. The text PI scanner never receives image bytes. Even if an attacker embeds an adversarial instruction as rendered typography in an image, that content is not in the text extraction output unless OCR is run. The text scanner has no coverage path to the pixel content of embedded images.
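As a rough illustration of that coverage path, the sketch below models a loaded document as pages with a text layer and a list of embedded image blobs. The `Page` and `LoadedDocument` types and the fixed-size chunker are hypothetical simplifications, not the data model of any real loader; they only show that the strings handed to a text scanner never include the image bytes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Page:
    text: str
    images: List[bytes] = field(default_factory=list)


@dataclass
class LoadedDocument:
    source_uri: str
    pages: List[Page]


def chunk_text(doc: LoadedDocument, size: int = 200) -> List[str]:
    """Mimic a text-layer loader: join page text and split into fixed-size chunks.

    Embedded images are dropped on the floor here, as in a text-only pipeline.
    """
    text = "\n".join(page.text for page in doc.pages)
    return [text[i:i + size] for i in range(0, len(text), size)]


def extract_images(doc: LoadedDocument) -> List[bytes]:
    """Collect the image blobs that a text chunker ignores."""
    return [image for page in doc.pages for image in page.images]


def coverage_report(doc: LoadedDocument) -> Dict[str, int]:
    """Count what a text scanner would see versus what it would never see."""
    return {
        "text_chunks_seen_by_text_scanner": len(chunk_text(doc)),
        "images_never_seen_by_text_scanner": len(extract_images(doc)),
    }
```

Any non-zero value for `images_never_seen_by_text_scanner` marks content that needs a separate image-level scan before ingestion.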

The OCR limitation is more subtle. Running Tesseract or an equivalent OCR engine over embedded images produces a text transcript of the visible characters in those images. A text PI scanner applied to the OCR transcript provides partial coverage: it catches adversarial instructions that are rendered clearly enough for OCR to transcribe accurately. The FigStep and AgentTypo attack families are specifically designed to defeat this pipeline. They render adversarial instructions using anti-OCR fonts, adversarial glyph distortions, Unicode confusable characters, or rendering techniques that exploit the gap between what a neural vision encoder reads from pixel values and what an OCR character recognition system extracts from the same image. A FigStep-class payload produces a clean or misleading OCR transcript while the vision encoder reads the embedded instruction directly from the pixel layer. The OCR transcript scan scores the image as clean; the image enters the corpus or training set; the vision encoder processes the payload at inference or training time.

The structural ceiling here is the same as at inference time: OCR is a character recognition system that extracts a text representation of image content. It is not an adversarial payload detector. It does not inspect the image bytes for adversarial-instruction patterns. It does not model what a neural vision encoder will see when it processes the same pixel values. A scanner that inspects the image bytes directly — against a curated corpus of known-malicious multimodal payloads — is the only control architecture that closes the gap. For the detailed technical argument on the OCR structural ceiling, see Why every text-only prompt-injection scanner misses a 30-pixel PNG and FigStep detection.

Coverage matrix against the LLM03:2025 multimodal evidence requirement

Mapping the self-serve prompt-injection scanner landscape to the LLM03:2025 dataset poisoning evidence requirement for multimodal AI systems reveals a gap that runs across every text-only tool.

Tool	Text content in ingested docs	Image layers in ingested PDFs	Image-text fine-tuning pairs	Audio in training corpora	Pre-ingestion scan record
Lakera Guard	Yes — text PI scoring	No (text-only)	Text annotations only	No	Text layer only
LLM Guard (OSS)	Yes — text scanners	No (text-only by design)	Text annotations only	No	Text layer only
Azure Prompt Shields	Yes (Azure-gated)	Content moderation, not PI	Text annotations only	No	Content moderation only
Promptfoo	Eval-time test harness	Not designed for ingestion gates	Not designed for ingestion gates	Not designed for ingestion gates	Not a dataset provenance record
Glyphward	Run-both with text scanner	Yes — per-image scan + scan_id	Yes — per-image scan + scan_id	Yes — per-audio scan + scan_id	All modalities; immutable scan_id

Three clarifications apply. First, the image analysis available alongside Azure Prompt Shields in Azure AI Content Safety covers content moderation harm categories (hate, sexual content, violence, self-harm), not prompt-injection payload detection. A content moderation "no harmful content" result for an embedded PDF image does not constitute LLM03 dataset validation for adversarial-instruction payloads; the two functions have different threat models. The Azure Prompt Shields comparison page covers this distinction. Second, Promptfoo is a pre-deployment evaluation harness that tests model behaviour against adversarial probe sets. It is not designed to serve as a dataset ingestion gate that scans documents before they enter a retrieval corpus; its test records are CI artefacts, not per-document dataset provenance records. Third, Lakera Guard was acquired by Check Point in 2025 and is moving upmarket; the Lakera alternative page tracks the practical availability implications. For RAG pipelines already using LLM Guard on the text path, the run-both pattern is the recommended addition; see LLM Guard alternative (multimodal).

Five-step LLM03-aligned architecture for multimodal dataset security

Implementing LLM03:2025-compliant dataset validation for a multimodal AI application requires five steps that map onto the OWASP LLM03 control objectives: data source integrity, ingestion-time validation, provenance records, quarantine workflow, and continuous corpus review.

1. Inventory every data source feeding fine-tuning sets and RAG corpora, and identify which sources contain or may contain image or audio content. OWASP LLM03 begins with data source visibility: you cannot validate what you have not enumerated. For each data source (internal document stores, third-party data feeds, web crawlers, user-submitted uploads, exported CRM or ticketing records), determine whether the documents that source produces contain image layers (PDFs with figures, PPTX files, DOCX with embedded images) or audio content (transcribed call recordings paired with audio files). Mark each source as text-only, image-bearing, or audio-bearing. Sources marked image-bearing or audio-bearing have a multimodal LLM03 attack surface that requires pre-ingestion scanning; text-only sources are adequately covered by your existing text validator. A crawled web source should be treated as image-bearing by default: modern web pages routinely contain images that a scraper captures alongside text. Per-product data source inventories are available for RAG pipelines, MCP servers, and chatbots with image upload.
2. Place the pre-ingestion multimodal scan between document receipt and corpus entry, before any chunk enters the vector store or training dataset. The scan must be positioned before ingestion, not after. A document that enters the vector store and is later found to contain a malicious image layer has already been indexed; every retrieval that surfaces that document from ingestion to detection has exposed the model to the payload. The correct control placement is at the ingestion gate: extract all embedded images from the document (using a PDF image extractor such as PyMuPDF or pdfplumber, or the equivalent for PPTX and DOCX), post each extracted image to the Glyphward scan endpoint, receive the risk score, and decide (allow ingestion, flag for human review, or quarantine) before the document enters the corpus. For fine-tuning datasets, the scan gate sits before the training run: scan each image in the dataset batch before submitting the batch to the fine-tuning pipeline. Documents or training examples that clear the scan enter the corpus or dataset with a linked scan_id; those that do not are routed to the quarantine workflow in step 4. For LangChain, LlamaIndex, and CrewAI implementations, this is a middleware step in the document loading pipeline.
3. Write per-document scan records as the LLM03 dataset-provenance evidence. LLM03 controls require that organisations can demonstrate the integrity of their training and retrieval data. The pre-ingestion scan record is the concrete implementation of that requirement for multimodal content. For each document ingested into the RAG corpus, store: document_id (or chunk_id), source_uri, ingestion_timestamp, and, for each image extracted from the document, the Glyphward scan_id, the image_hash (for deduplication and audit), the risk_score, and the ingestion_action (allowed, flagged, quarantined). For fine-tuning datasets, store the same fields at the training-example level with a dataset_version tag. These records constitute the dataset provenance trail that an LLM03 control review or a SOC 2 / ISO 27001 audit can inspect to confirm that multimodal content was validated before influencing model behaviour. The scan_id links the application-side provenance record to the immutable server-side scan record at Glyphward, providing a corroborating reference that cannot be retroactively modified. The same per-request evidence structure that the SOC 2 CC6.6 evidence pattern and the ISO 27001 A.8.28 evidence pattern describe at inference time applies here at ingestion time.
4. Implement a quarantine and review workflow for documents that exceed the risk threshold. A pre-ingestion scan that flags a document as high-risk requires a disposition decision: the document either does not enter the corpus, enters a quarantine store pending human review, or is forwarded to a sanitisation step (image removal or redaction) before re-ingestion. For automated pipelines ingesting high volumes of documents (web crawlers, bulk CRM exports), the quarantine workflow must be designed for throughput: a high-risk document triggers a webhook or queue entry that routes to an alert channel (Slack, PagerDuty, SIEM) while the document is held in a staging store. Human reviewers examine quarantined documents against the flagged image regions and risk reasons provided in the scan response; they approve clean documents for ingestion and discard confirmed poisoned documents. For fine-tuning datasets, quarantined training examples are excluded from the training batch and logged with the dataset_version; if the quarantine rate for a source exceeds a threshold (for example, more than 0.1% of a dataset batch), the entire source is suspended for investigation. The quarantine scan_id is the evidence record that demonstrates the LLM03 control functioned for flagged items: it shows the system identified and held the document rather than ingesting it unchecked.
5. Run periodic re-scans of the existing corpus as the adversarial payload corpus is updated. LLM03:2025's dataset security requirement is ongoing, not one-time. A document that passed the scan at ingestion time (because the attack vector it contained was not yet in the payload corpus) may fail a re-scan after the corpus is updated with the new attack family. Glyphward continuously expands the curated payload corpus as new multimodal PI attack variants are published and observed. A quarterly or monthly re-scan of the retrieval corpus, scanning each embedded image against the current payload corpus and quarantining documents whose risk scores have risen above threshold, ensures that historical ingestion decisions remain valid under the current threat model. The re-scan record updates the provenance trail: each re-scanned image receives a new scan_id, timestamp, and score; the document's provenance entry is updated to reflect the most recent scan verdict. For compliance programmes that align with the NIST AI RMF Measure and Manage functions or the MITRE ATLAS threat-intelligence update cycle, the periodic re-scan is the implementation of continuous monitoring for the dataset attack surface.
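The ingestion gate, provenance record, and quarantine-rate rule from the steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the two thresholds, the `scan` callable (which stands in for whatever performs the real scan and is assumed to return a dict with `scan_id` and `risk_score`), the `stub_scan` helper, and the rule that the strictest per-image action decides the document's fate are all hypothetical. The record field names follow the list in step 3.

```python
import hashlib
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Callable, Dict, List

# Assumed cut-offs on the 0-100 risk scale; tune these for your own pipeline.
FLAG_THRESHOLD = 40
QUARANTINE_THRESHOLD = 70
SEVERITY = ["allowed", "flagged", "quarantined"]


@dataclass
class ImageScanRecord:
    scan_id: str
    image_hash: str        # SHA-256 of the image bytes, for deduplication and audit
    risk_score: int
    ingestion_action: str  # one of SEVERITY


def decide(risk_score: int) -> str:
    """Map a risk score to an ingestion action (step 2)."""
    if risk_score >= QUARANTINE_THRESHOLD:
        return "quarantined"
    if risk_score >= FLAG_THRESHOLD:
        return "flagged"
    return "allowed"


def gate_document(document_id: str, source_uri: str, images: List[bytes],
                  scan: Callable[[bytes], Dict]) -> Dict:
    """Scan every extracted image before corpus entry and build the step 3 record."""
    records = []
    for image in images:
        result = scan(image)  # assumed keys: "scan_id", "risk_score"
        records.append(ImageScanRecord(
            scan_id=result["scan_id"],
            image_hash=hashlib.sha256(image).hexdigest(),
            risk_score=result["risk_score"],
            ingestion_action=decide(result["risk_score"]),
        ))
    # Assumption: the strictest per-image action applies to the whole document.
    doc_action = max((r.ingestion_action for r in records),
                     key=SEVERITY.index, default="allowed")
    return {
        "document_id": document_id,
        "source_uri": source_uri,
        "ingestion_timestamp": datetime.now(timezone.utc).isoformat(),
        "ingestion_action": doc_action,
        "images": [asdict(r) for r in records],
    }


def source_should_be_suspended(quarantined: int, total: int,
                               max_rate: float = 0.001) -> bool:
    """Step 4 example rule: suspend a source when more than 0.1% of a batch is quarantined."""
    return total > 0 and quarantined / total > max_rate


def stub_scan(image: bytes) -> Dict:
    """Hypothetical stand-in for the real scan call, for local testing only."""
    score = 90 if b"PAYLOAD" in image else 5
    return {"scan_id": "scan-" + hashlib.sha256(image).hexdigest()[:8], "risk_score": score}
```

In a real pipeline the `images` list would come from an extractor such as PyMuPDF or pdfplumber, and only documents whose record carries `ingestion_action == "allowed"` would be written to the vector store.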

How Glyphward fits

Glyphward is designed primarily as an inference-time scanner — bytes in, 0–100 risk score and flagged regions out, <200 ms p95 — but the API contract is equally suited to the pre-ingestion use case that LLM03:2025 requires. A POST to the image scan endpoint with an extracted image from a PDF returns the same scan_id, risk_score, flagged_region, and modality_reason that an inference-time call returns. The scan runs the same detection pipeline: CLIP embedding plus a typographic-PI detection head plus Tesseract OCR cross-referenced against a curated adversarial payload corpus. A pre-ingestion call to this endpoint before a document enters the retrieval corpus produces a provenance record that is structurally identical to an inference-time scan record — the same evidence format that satisfies SOC 2 CC6.6 at inference time satisfies the LLM03 dataset-integrity requirement at ingestion time.
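A client-side sketch of that contract might look like the following. The response field names (scan_id, risk_score, flagged_region, modality_reason) and the 0 to 100 score range come from the description above; everything else is an assumption made for illustration, including the placeholder URL, the bearer-token header, and sending the image as a raw octet-stream body. Consult the published API surface for the actual request format.

```python
import json
import urllib.request
from dataclasses import dataclass
from typing import Any, Optional

# HYPOTHETICAL placeholder endpoint; the real URL and auth scheme are not given here.
SCAN_URL = "https://api.example.invalid/v1/scan/image"


@dataclass
class ScanResult:
    scan_id: str
    risk_score: int                 # 0-100
    flagged_region: Optional[Any]   # shape not specified in this page
    modality_reason: Optional[str]


def build_scan_request(image_bytes: bytes, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST carrying the extracted image bytes."""
    return urllib.request.Request(
        SCAN_URL,
        data=image_bytes,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",        # assumed auth scheme
            "Content-Type": "application/octet-stream",  # assumed body encoding
        },
    )


def parse_scan_response(body: bytes) -> ScanResult:
    """Parse a JSON scan response and sanity-check the score range."""
    payload = json.loads(body)
    score = int(payload["risk_score"])
    if not 0 <= score <= 100:
        raise ValueError(f"risk_score out of range: {score}")
    return ScanResult(
        scan_id=str(payload["scan_id"]),
        risk_score=score,
        flagged_region=payload.get("flagged_region"),
        modality_reason=payload.get("modality_reason"),
    )
```

Sending the request would be a call such as `urllib.request.urlopen(request, timeout=5)`, with the response body passed to `parse_scan_response`; the sketch stops short of that so it can run without network access.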

For fine-tuning dataset validation, the same endpoint accepts images from training pairs before the training batch is submitted. The scan_id persists on the Glyphward server as an immutable record linked to the submitted image hash; when auditors or MLSec reviewers ask for evidence that image X in dataset version Y was validated before training, the scan_id and server-side record provide that evidence. The free tier supports 10 scans per day — enough for development and corpus-sampling experiments; the Pro tier at $29/mo provides 100k scans per month suitable for continuous ingestion pipelines. Full pricing is at the pricing page.

Glyphward does not replace the text-side PI scanner already in your ingestion pipeline. The run-both pattern applies at ingestion just as it does at inference time: the text scanner covers the text content of ingested documents; Glyphward covers the image and audio layers. Together they close the LLM03 multimodal gap without replacing existing text-side controls.

Related questions

Does OWASP LLM03:2025 specifically cover RAG corpus poisoning, or only model pre-training?

The 2025 revision of OWASP LLM03 explicitly covers fine-tuning datasets, RLHF preference data, retrieval corpora, and any data source that shapes how the deployed LLM application behaves — not only the web-scale corpora used to pre-train foundational models. RAG corpus poisoning is among the most practically relevant LLM03 vectors for AI application developers because it operates at the application layer, not the model-vendor layer. A developer who uses a third-party foundation model (GPT-4o, Claude 3.5, Gemini 1.5) has no control over the pre-training data; they do have full control over the retrieval corpus their RAG pipeline queries. LLM03 dataset security for the retrieval corpus is therefore a developer responsibility, and the multimodal image-layer attack surface within that corpus is the gap that no text-only ingestion validator closes. For the inference-time view of the same threat — where a retrieved image-bearing document delivers its payload in the current conversation — see the RAG pipeline prompt-injection scanner page.

Do we need to re-scan documents already in our knowledge base, or only new ingestions?

Both. Documents already in your retrieval corpus were ingested at a point in time when the adversarial payload corpus available for scanning was smaller than it is today. A payload family that was not in the corpus at ingestion time — and would therefore have received a low risk score — may be detectable today if it has since been added. A periodic re-scan of the existing corpus is the practical implementation of LLM03's ongoing dataset integrity requirement. For most production RAG systems, a full re-scan of the image layers in the corpus is feasible at a monthly or quarterly cadence using the Glyphward API's batch throughput. The re-scan should be prioritised for documents from external or third-party sources (web crawls, third-party data feeds, user-uploaded content) before internally generated documents, since external sources have the highest adversarial exposure. Documents that fail a re-scan are quarantined from the active retrieval index until reviewed; the updated scan record replaces the original provenance entry with a current scan_id. Continuous ingestion pipelines should scan new documents at ingest time; the periodic re-scan covers the historical corpus.
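A re-scan pass along those lines could be sketched as below. The `ProvenanceEntry` shape, the source-type labels, the threshold of 70, and the `scan` callable are hypothetical; keeping the image bytes inside the entry is a simplification, since a real store would hold a reference and load the image on demand. The ordering puts external sources first and, within each group, the oldest scan first (ISO 8601 timestamps in the same timezone sort correctly as strings).

```python
from dataclasses import dataclass, replace
from datetime import datetime, timezone
from typing import Callable, Dict, List, Tuple

# Assumed labels for sources with the highest adversarial exposure.
EXTERNAL_SOURCE_TYPES = {"web_crawl", "third_party_feed", "user_upload"}


@dataclass(frozen=True)
class ProvenanceEntry:
    document_id: str
    source_type: str
    image_bytes: bytes   # simplification: a real store would keep a reference
    scan_id: str
    risk_score: int
    scanned_at: str      # ISO 8601, UTC


def rescan_order(entries: List[ProvenanceEntry]) -> List[ProvenanceEntry]:
    """External sources first, then oldest previous scan first."""
    return sorted(
        entries,
        key=lambda e: (e.source_type not in EXTERNAL_SOURCE_TYPES, e.scanned_at),
    )


def rescan(
    entries: List[ProvenanceEntry],
    scan: Callable[[bytes], Dict],
    threshold: int = 70,  # assumed quarantine cut-off on the 0-100 scale
) -> Tuple[List[ProvenanceEntry], List[str]]:
    """Re-scan each image; return updated entries and document_ids to quarantine."""
    updated, quarantined = [], []
    for entry in rescan_order(entries):
        result = scan(entry.image_bytes)  # assumed keys: "scan_id", "risk_score"
        fresh = replace(
            entry,
            scan_id=result["scan_id"],
            risk_score=result["risk_score"],
            scanned_at=datetime.now(timezone.utc).isoformat(),
        )
        updated.append(fresh)
        if fresh.risk_score >= threshold:
            quarantined.append(fresh.document_id)
    return updated, quarantined
```

Because the entries are frozen, the original provenance records are left untouched and the caller decides how to persist the updated ones.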

How does LLM03 Training Data Poisoning differ from LLM01 Prompt Injection for multimodal applications?

The distinction is where the adversarial content enters the system and how it persists. In the LLM01 Prompt Injection scenario, an attacker submits a malicious image in the current conversation — the image arrives through user input in the active inference request, and removing it from the conversation removes the attack. In the LLM03 Training Data Poisoning scenario, an attacker introduces a malicious image into the retrieval corpus or fine-tuning dataset at some prior point in time — the image persists in the corpus indefinitely and delivers its payload to every subsequent inference that retrieves that document, without any per-request action by the attacker. The stealthiness and persistence of LLM03 vectors are what distinguish them from LLM01 direct injection: a user-initiated conversation query contains no adversarial content, every text-side inference control sees a clean query, and the payload arrives through retrieved context that the model treats as trusted. The two attack classes complement each other, and the multimodal scanner addresses both: at inference time for LLM01 (scan every image in the current multimodal request before the model call), and at ingestion time for LLM03 (scan every image in each document before it enters the corpus). See OWASP LLM01:2025 prompt injection — multimodal for the inference-time case and OWASP LLM02:2025 Insecure Output Handling for the downstream execution risk when either vector succeeds.

Can a text chunker plus text PI scanner substitute for multimodal pre-ingestion scanning?

No, for the same structural reason that OCR-before-text-scan does not satisfy the inference-time PI control requirement. A text chunker extracts the text layer of a document and produces text strings. A text PI scanner applied to those strings inspects text content. Neither step inspects the image bytes of images embedded in the document. Even if an OCR step is added — extracting text from embedded images before scanning — the FigStep and AgentTypo attack families are explicitly designed to defeat OCR-before-text-scanner pipelines by using anti-OCR rendering techniques that produce a clean OCR transcript while preserving the adversarial instruction in the pixel layer that the neural vision encoder reads. The OCR transcript scan produces a per-image clean record; the image enters the corpus with a passing validation entry; the vision encoder processes the payload on retrieval. The scan must operate on the image bytes themselves, against a model that understands what a neural vision encoder will see, not on a derived text representation. This is the core architectural difference between a text PI scanner and a multimodal PI scanner: the scanning model must process the same representation the target model processes, which for image inputs is the pixel values, not a character-recognition transcript. See FigStep detection and Why every text-only scanner misses a 30-pixel PNG for the detailed argument.

Which other OWASP LLM 2025 risks intersect with multimodal dataset poisoning?

Three OWASP LLM 2025 categories have direct intersections with the multimodal LLM03 dataset attack surface. First, LLM01 Prompt Injection: as described above, the LLM03 corpus-poisoning vector is the persistent-retrieval delivery mechanism for what is architecturally an LLM01 payload — the same adversarial instruction reaches the model, but through retrieved context rather than current user input. Second, LLM02 Insecure Output Handling: when a multimodal PI payload delivered through a poisoned retrieved document causes the model to emit output that flows to a code interpreter, Markdown renderer, or downstream API client, the resulting execution vulnerability is an LLM02 insecure output handling event whose root cause is an LLM03 corpus-poisoning event; see OWASP LLM02:2025 multimodal. Third, supply chain risks: a third-party data feed, a community-maintained knowledge base, or a model fine-tuned by a third party and accessed via API are all supply chain inputs that carry LLM03 dataset poisoning risk. The pre-ingestion scan applies to all third-party data sources that enter the retrieval corpus, regardless of the source's provenance reputation. Together, the LLM01, LLM02, and LLM03 pages in this cluster cover the full input-to-output attack chain for multimodal AI applications — from adversarial content entering the system (LLM01 or LLM03) through model output reaching downstream execution (LLM02).