Blog · Engineering deep-dive · 2026-04-30

Building a prompt-injection scanner for voice agents: what Whisper drops, and why it matters

A voice agent acts on what an STT decided your audio meant. The attacker writes the bytes. Between those two facts sits a stack of audio-engineering filters — band limits, sample-rate downsampling, voice-activity detection, beam-search smoothing — that exists to make speech recognition cleaner. Each filter is also a place a payload can hide. This post is the engineering side of building a scanner that runs before those filters fire, so the inspection sees the same bytes the attacker wrote.

TL;DR

Speech-to-text systems are lossy compressors with quality goals — clean transcripts — that conflict with the goals of a security inspection. By the time text reaches your prompt-injection filter, the bands and timings the audio prompt-injection payload was hiding in have already been filtered away. The defender's move is a small CNN over a full-band log-mel spectrogram of the raw waveform, run in parallel with whatever transcript filter you already use. It adds tens of milliseconds, closes the WhisperInject-class gap, and the corpus of flagged samples compounds. This post walks through the four audio-PI subtypes, the build steps, the trade-offs, and the limitations we are still working through. Companion content: the architectural argument is in why every text-only PI scanner misses a 30-pixel PNG; the integration patterns at the product level are on the voice-agent ICP page; the threat model overview is in the 2026 multimodal PI threat model.

1. The artefact mismatch you cannot reason your way around

Every defender has a contract with the artefact it inspects. A text PI scanner reads strings; its threat model is the set of attacks that are visible in strings. That contract is fine when the model also acts on a string. It breaks the moment the model acts on something a string was decoded from. In a voice-agent pipeline, that is exactly what happens. The transcript is a decoded artefact, not the source artefact, and the decoder is tuned for transcript quality — not for preserving evidence of an attack.

Concretely, the typical pipeline runs microphone → STT (Whisper, Deepgram, AWS Transcribe, Gemini Audio) → text PI filter → LLM with tool access → action. The text filter at step three is fine; it stops people who read a jailbreak aloud, the casual half of the attack surface. What it cannot stop is the half hidden in steps one and two. That is not a quality problem with the filter. It is a contract problem: the filter is reading the wrong artefact. This is the same architectural pattern as the image side — a text scanner reading the OCR output of a FigStep image is also defending the wrong artefact, and we wrote up that argument in why every text-only scanner misses a 30-pixel PNG. The voice case is the audio mirror of it.

The build implication is brutal in its simplicity. You cannot fix this by picking a better STT or by writing a smarter text filter on top. You have to inspect the source artefact — the waveform — before any STT touches it. Everything else in this post is about how to do that without breaking your latency budget, your cost model, or your STT choice.

2. The four audio-PI subtypes, at the byte level

Audio prompt injection is not one attack; it is a class with at least four documented subtypes, each hiding in a different region of the byte stream that the STT is structurally incentivised to discard.

Out-of-band carrier injection. The payload is rendered into spectrum bands above or below the speech range — typically >14 kHz or <200 Hz. STTs band-pass-filter to roughly 300–8000 Hz before the acoustic model runs, because anything outside that range is not human speech. WhisperInject, described in arXiv:2405.20653, is the canonical case: ultrasonic instructions audible to the model's pre-decoder front-end on some hardware paths but invisible in the transcript. Detection signal: spectral energy concentration outside the speech band that does not match expected room-noise statistics.
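A minimal sketch of the out-of-band detection signal, assuming a short mono frame of float samples at 48 kHz. The function names, the naive DFT, and the fixed 300–8000 Hz band edges (taken from the paragraph above) are illustrative assumptions. A real inspection path would use an FFT library and compare against room-noise statistics rather than a bare energy ratio.

```python
import cmath
import math

def out_of_band_ratio(frame, sample_rate, lo_hz=300.0, hi_hz=8000.0):
    """Fraction of spectral energy outside [lo_hz, hi_hz].

    Naive O(n^2) DFT, so only suitable for short frames in a sketch.
    """
    n = len(frame)
    in_band = 0.0
    out_band = 0.0
    for k in range(1, n // 2 + 1):  # skip the DC bin
        freq = k * sample_rate / n
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        energy = abs(coeff) ** 2
        if lo_hz <= freq <= hi_hz:
            in_band += energy
        else:
            out_band += energy
    total = in_band + out_band
    return out_band / total if total > 0 else 0.0

def tone(freq_hz, sample_rate, n, amplitude=1.0):
    """Synthetic sine tone used as a stand-in for real audio."""
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]

SAMPLE_RATE = 48_000
N = 480  # 10 ms frame, 100 Hz bin spacing

speech_like = tone(1_000, SAMPLE_RATE, N)               # inside the speech band
carrier = tone(18_000, SAMPLE_RATE, N, amplitude=0.5)   # above the speech band
mixed = [a + b for a, b in zip(speech_like, carrier)]

clean_ratio = out_of_band_ratio(speech_like, SAMPLE_RATE)
mixed_ratio = out_of_band_ratio(mixed, SAMPLE_RATE)
```

With these synthetic tones the clean frame has essentially no out-of-band energy, while the mixed frame puts 20% of its energy out of band (0.25 of 1.25 in squared-amplitude terms).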

Silence steganography. The payload is encoded into the inter-word silences, the gaps the STT's voice-activity detector (VAD) trims before the acoustic model runs. A 200 ms silence between two words can carry up to 200 ms of low-amplitude carrier signal that the VAD treats as noise, and an utterance with several such gaps carries correspondingly more. Detection signal: non-trivial low-frequency or modulated energy in regions classified as silence by a parallel VAD pass.
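The silence check can be sketched as follows, assuming a parallel VAD has already produced a per-frame speech/silence mask. The function names, the noise-floor value, and the 4x margin are hypothetical placeholders. Broadband RMS is the simplest stand-in for the "non-trivial energy" test, and a fuller version would also look at modulation structure.

```python
import math

def rms(frame):
    """Root-mean-square amplitude of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame)) if frame else 0.0

def suspicious_silence_frames(frames, is_speech, noise_floor_rms, margin=4.0):
    """Indices of frames the VAD called silence whose RMS exceeds margin * noise floor."""
    flagged = []
    for idx, (frame, speech) in enumerate(zip(frames, is_speech)):
        if not speech and rms(frame) > margin * noise_floor_rms:
            flagged.append(idx)
    return flagged

SAMPLE_RATE = 48_000
FRAME = 480  # 10 ms

# Three synthetic frames: speech, true silence, and "silence" hiding a 200 Hz carrier.
speech = [0.5 * math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(FRAME)]
quiet = [0.0005 if i % 2 == 0 else -0.0005 for i in range(FRAME)]
carrier = [0.01 * math.sin(2 * math.pi * 200 * i / SAMPLE_RATE) for i in range(FRAME)]

frames = [speech, quiet, carrier]
is_speech = [True, False, False]  # assumed output of a parallel VAD pass
flagged = suspicious_silence_frames(frames, is_speech, noise_floor_rms=0.001)
```

Only the third frame is flagged here: it is classified as silence but carries roughly seven times the assumed noise floor.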

Adversarial waveform perturbation. The payload is mixed into the speech itself as a low-amplitude perturbation crafted to make the STT mis-transcribe in a specific direction. The transcript reads as the attacker wanted, but the underlying waveform deviates from natural speech statistics in measurable ways. Detection signal: residual after subtracting the model-predicted clean waveform from the observed one — well-studied in adversarial-audio literature.

Multi-speaker overlay. A second speaker reads the payload at low amplitude under the primary speaker. Beam search in the STT will usually decode only the dominant track, leaving the payload spoken by the secondary track untranscribed — but an audio-LLM that ingests the waveform directly may attend to both. Detection signal: speaker-diarisation output indicating more than one speaker in segments where the transcript shows only one. We cover the broader class in audio prompt-injection detection with each subtype's signature broken out.
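One way to turn the diarisation signal into a check is a sweep over speaker turns that reports every range where two or more speakers are active at once. This is a sketch under the assumption that your diarisation step emits (start, end, speaker) tuples in seconds. That tuple format and the function name are illustrative, not any particular tool's output.

```python
def overlapping_speaker_ranges(turns):
    """Return [(start_s, end_s)] ranges where at least two distinct speakers are active.

    turns: iterable of (start_s, end_s, speaker_id).
    """
    events = []
    for start, end, speaker in turns:
        events.append((start, 1, speaker))
        events.append((end, -1, speaker))
    # At equal timestamps, process ends before starts so touching turns do not overlap.
    events.sort(key=lambda e: (e[0], e[1]))

    active = {}          # speaker_id -> number of open turns
    ranges = []
    overlap_start = None
    for time, delta, speaker in events:
        active[speaker] = active.get(speaker, 0) + delta
        if active[speaker] == 0:
            del active[speaker]
        if len(active) >= 2 and overlap_start is None:
            overlap_start = time
        elif len(active) < 2 and overlap_start is not None:
            ranges.append((overlap_start, time))
            overlap_start = None
    return ranges

turns = [(0.0, 4.0, "A"), (1.5, 2.5, "B"), (4.0, 5.0, "B")]
overlaps = overlapping_speaker_ranges(turns)
```

Any returned range that the transcript attributes to a single speaker is a candidate overlay segment worth scoring further.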

3. The build: a four-stage pipeline you can wire in two weeks

The architecture that closes the gap is not exotic. It is a standard supervised-learning pipeline with one important constraint: every stage has to operate on bytes the STT has not yet touched. Once the STT has seen the audio, the evidence is gone.

Stage 1 — full-fidelity decode. Capture the audio bytes upstream of the STT and decode at the source sample rate (typically 44.1 or 48 kHz, sometimes 22.05). The single most common mistake in a first build is to decode at 16 kHz because that is what the STT wants. At 16 kHz you have already thrown away every signal above 8 kHz, which is exactly where out-of-band carrier injection lives. Decode at the source rate; downsample only for the STT path, never for the inspection path.
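As a rough illustration of the decode step, the sketch below reads a 16-bit mono WAV at whatever rate its header declares and never resamples. It uses only Python's standard `wave` module. The helper names and the in-memory test tone are assumptions for the example, and real capture paths will involve other containers and codecs.

```python
import io
import math
import struct
import wave

def read_wav_native(wav_bytes):
    """Decode 16-bit mono PCM WAV at its own sample rate, with no resampling."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getsampwidth() != 2 or wf.getnchannels() != 1:
            raise ValueError("this sketch handles 16-bit mono PCM only")
        rate = wf.getframerate()
        raw = wf.readframes(wf.getnframes())
    samples = [s / 32768.0 for (s,) in struct.iter_unpack("<h", raw)]
    return rate, samples

def nyquist_hz(sample_rate):
    """Highest frequency representable at a given sample rate."""
    return sample_rate / 2

def make_tone_wav(freq_hz, sample_rate, n_frames):
    """Build a small 16-bit mono WAV in memory (fixture for this sketch)."""
    pcm = b"".join(
        struct.pack("<h", int(16000 * math.sin(2 * math.pi * freq_hz * i / sample_rate)))
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()

carrier_hz = 18_000
rate, samples = read_wav_native(make_tone_wav(carrier_hz, 48_000, 480))

survives_inspection_path = carrier_hz < nyquist_hz(rate)   # 24 kHz ceiling at 48 kHz
survives_16k_path = carrier_hz < nyquist_hz(16_000)        # 8 kHz ceiling at 16 kHz
```

An 18 kHz carrier is representable on the source-rate inspection path but sits above the Nyquist limit of a 16 kHz STT path, which is the point of keeping the two paths separate.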

Stage 2 — full-band feature extraction. Compute a log-mel spectrogram with 128 mels across the full 0–24 kHz range. Speech-tuned pipelines use 80 mels across 0–8 kHz, which collapses the bands you actually need to inspect into a handful of coarse bins. Use a 25 ms window with 10 ms hop; this is small enough to localise short-burst payloads and large enough that the spectrogram resolution is meaningful for sub-300 Hz signals.
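The framing and band arithmetic behind those numbers can be checked with a few lines. This is a sketch assuming a 48 kHz source (so the band tops out at 24 kHz; a 44.1 kHz source tops out at 22.05 kHz) and the common HTK-style mel formula. The post does not say which mel variant is used, so treat the formula and helper names as assumptions.

```python
import math

def hz_to_mel(f_hz):
    """HTK-style mel scale (an assumption; other variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_band_edges(n_mels, f_min, f_max):
    """n_mels triangular filters need n_mels + 2 edges, evenly spaced in mel."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_mels + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_mels + 2)]

SAMPLE_RATE = 48_000
win = round(0.025 * SAMPLE_RATE)   # samples per 25 ms window
hop = round(0.010 * SAMPLE_RATE)   # samples per 10 ms hop
bin_hz = SAMPLE_RATE / win         # FFT bin spacing for that window length

full_band = mel_band_edges(128, 0.0, SAMPLE_RATE / 2)
speech_band = mel_band_edges(80, 0.0, 8000.0)

# Filter centres are the interior edges; count those above the speech band.
centres_above_8k_full = sum(1 for c in full_band[1:-1] if c > 8000.0)
centres_above_8k_speech = sum(1 for c in speech_band[1:-1] if c > 8000.0)
```

At 48 kHz the window is 1200 samples, the hop is 480, and the bin spacing is 40 Hz. The speech-tuned layout has no filter centred above 8 kHz, while the full-band layout dedicates several dozen filters to that region.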

Stage 3 — small CNN classifier. A four-block CNN with around two million parameters trained on a labelled corpus of clean speech, out-of-band carriers, silence-steganography samples, adversarial perturbations, and multi-speaker overlays gets to >90% recall on each subtype with sensible precision. Bigger models are tempting but cost real latency on a streaming voice path; the per-class signal is loud enough that a small classifier saturates quickly. We hold back the architecture details for the production model, but the gist is in the public benchmarks we plan to publish at GA.
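The production architecture is not published, so the following is only a back-of-envelope check that a four-block CNN can land near two million parameters. The channel widths, the 3x3 kernels, one convolution per block, and the five-way output head are all assumptions made up for this illustration.

```python
def conv2d_params(in_ch, out_ch, kernel=3):
    """Weights plus biases for one 2-D convolution layer."""
    return in_ch * out_ch * kernel * kernel + out_ch

def linear_params(in_features, out_features):
    """Weights plus biases for one fully connected layer."""
    return in_features * out_features + out_features

# Hypothetical widths, NOT the production model.
CHANNELS = [1, 72, 144, 288, 576]   # the input spectrogram has one channel
N_CLASSES = 5                       # clean speech plus the four subtypes

conv_total = sum(conv2d_params(i, o) for i, o in zip(CHANNELS, CHANNELS[1:]))
head_total = linear_params(CHANNELS[-1], N_CLASSES)  # after global pooling
total_params = conv_total + head_total
```

Almost all of the budget sits in the last block, which is why widening the early layers is cheap and widening the late ones is not.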

Stage 4 — corpus indexing. Every flagged utterance enters a labelled corpus indexed by subtype, signature features, and source. The corpus is the long-term moat. A scanner with a corpus of ten thousand labelled samples per subtype outperforms a scanner with the same architecture and a hundred samples; the gap widens as the corpus grows. This is the same compounding-data thesis that applies to malware sandboxes and email-spam filters, and it is the reason a managed scanner with a shared corpus tends to beat a fresh in-house build at the per-customer level even when the architecture is identical.
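A minimal in-memory sketch of such an index, keyed by subtype and by source, might look like this. The class and field names are hypothetical, and a real corpus would live in a database with audio stored separately; the point is only the shape of the record and the two lookups.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class FlaggedSample:
    sample_id: str
    subtype: str           # e.g. "out_of_band", "silence_stego", "adversarial", "overlay"
    source: str            # where the utterance came from
    features: tuple = ()   # signature features, kept opaque in this sketch

class Corpus:
    """Flagged utterances indexed by subtype and by source."""

    def __init__(self):
        self._by_subtype = defaultdict(list)
        self._by_source = defaultdict(list)

    def add(self, sample):
        self._by_subtype[sample.subtype].append(sample)
        self._by_source[sample.source].append(sample)

    def by_subtype(self, subtype):
        return list(self._by_subtype.get(subtype, []))

    def by_source(self, source):
        return list(self._by_source.get(source, []))

    def counts(self):
        """Labelled samples per subtype, the number the compounding argument turns on."""
        return {subtype: len(items) for subtype, items in self._by_subtype.items()}

corpus = Corpus()
corpus.add(FlaggedSample("u-001", "out_of_band", "tenant-a", (0.2,)))
corpus.add(FlaggedSample("u-002", "silence_stego", "tenant-a"))
corpus.add(FlaggedSample("u-003", "out_of_band", "tenant-b", (0.35,)))
```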

4. Trade-offs we made (and didn't)

The honest part of any architecture write-up is the trade-offs. Several of ours surprised us; some are in the public version of this stack and some are not.

We chose a CNN over a transformer. A small audio transformer would in principle generalise better, especially across out-of-distribution carrier patterns. In practice, on commodity GPUs a 2 M-param CNN adds roughly 20 ms of inference latency where a 50 M-param audio transformer adds roughly 200 ms, and 200 ms is the latency budget for the entire response in a streaming voice agent. The CNN-plus-feature-engineering approach is uglier but ships.

We chose run-both over replace. The waveform classifier does not replace your transcript-side filter; it runs alongside it, and either signal above threshold blocks. We were tempted to position this as a unified scanner that could deprecate the text filter — that would have been the better marketing story. The honest answer is that the two filters cover different attack surfaces and you want both. The architectural read on this is also on Glyphward vs LLM Guard and vs Lakera Guard; both pages walk through the run-both pattern with concrete integration sketches.

We did not chain. A natural-looking architecture is text-filter-first, fall through to waveform-classifier-on-flagged-only. This saves compute on clean traffic. It also re-introduces the problem we set out to solve: the text filter does not know what it does not see, so a payload that the text filter reads as benign is exactly the case where the waveform classifier is needed and gets skipped. Both filters run on every utterance, in parallel, no exceptions.
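The run-both, no-chaining rule can be sketched as two tasks submitted for every utterance, with a block if either score crosses its threshold. The callables passed in (`waveform_score`, `transcribe`, `text_score`) and the 0.5 thresholds are placeholders for whatever classifier, STT, and text filter you actually run.

```python
from concurrent.futures import ThreadPoolExecutor

def decide(raw_audio, waveform_score, transcribe, text_score,
           wave_threshold=0.5, text_threshold=0.5):
    """Run both filters on every utterance in parallel; block if either trips."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The waveform path sees the raw bytes, untouched by the STT.
        wave_future = pool.submit(waveform_score, raw_audio)
        # The text path sees only what the STT decoded.
        text_future = pool.submit(lambda: text_score(transcribe(raw_audio)))
        wave_risk = wave_future.result()
        text_risk = text_future.result()
    blocked = wave_risk >= wave_threshold or text_risk >= text_threshold
    return {"blocked": blocked, "wave_risk": wave_risk, "text_risk": text_risk}

# Stub components standing in for real models.
def fake_waveform_score(raw_audio):
    return 0.9 if b"CARRIER" in raw_audio else 0.1

def fake_transcribe(raw_audio):
    return "what is my account balance"

def fake_text_score(transcript):
    return 0.8 if "ignore previous instructions" in transcript else 0.05

verdict = decide(b"speech+CARRIER", fake_waveform_score, fake_transcribe, fake_text_score)
clean = decide(b"speech", fake_waveform_score, fake_transcribe, fake_text_score)
```

In the first call the transcript looks benign and the text score stays low, yet the utterance is still blocked. That is exactly the case a text-first chain would have skipped.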

We deferred the full-attention model. Multi-speaker overlay is the subtype where a model with cross-track attention would help most, because the whole signal is in the relationship between two tracks. We kept it on the roadmap and shipped the four-block CNN first, because shipping the worse classifier today and improving against a real corpus beats shipping the better classifier in six months against a synthetic corpus.

5. What still doesn't work, in plain language

No security scanner is complete; the honest version is to say what is still uncovered. Three holes are open in this stack today. First, very-low-amplitude adversarial perturbations crafted against the specific classifier weights — the white-box adversarial-robustness problem — still bypass at higher rates than we want. Defence is a moving target; we ship classifier updates monthly and rely on the corpus compounding. Second, deepfake voice cloning is out of scope by design — it answers a different question (is this the right person speaking?) and we do not pretend the same model covers both. Pair us with an authenticity-focused detector if you need both layers. Third, the streaming co-inspection pattern (rolling 250 ms windows) trades a small amount of recall on long-horizon payloads for sub-100 ms latency; for telephony post-call review, the async batch pattern with full-fidelity scan recovers that recall. The trade-off is documented on the voice-agent integration page.

The fourth thing worth saying: the corpus is the slowest-moving part of this whole architecture. Models we can ship in days; corpora take months. Anyone building this from scratch will spend the bulk of their time on the corpus and the test rig that scores it, not on the classifier. Plan accordingly.

6. Where this fits in your stack

Glyphward's /v1/scan endpoint accepts a waveform — raw bytes, WAV, or a common container — and returns a 0–100 risk score, the modality flag, the classifier confidence per subtype, and a flagged timestamp range. The contract is the same one the image path returns; the response shape does not change between modalities. Drop it in front of your LLM call, behind your STT, or in parallel with both. Free tier covers 10 scans a day with no card; $29 covers 100,000 scans a month, and the comparison matrix covers how that lands against Lakera Guard, LLM Guard, Azure Prompt Shields, and Promptfoo. As of this writing, Glyphward's audio path is the only self-serve scanner under $100/mo with a production audio pipeline; the category overview is on multimodal LLM security API. If you would rather see the engine running before you wire the API, the embed widget preview mounts a working demo on the page.

The argument of this post is not that you should buy our scanner. It is that the build is straightforward, the trade-offs are well-understood, and the longer you wait to defend the waveform the longer your voice agent is acting on the artefact the attacker controls. Build it, integrate it, or run our API — but do not assume your transcript-side filter is enough.

FAQ

Why can't I just run my text PI scanner over the STT output and call it done?

Because the STT is a lossy compressor that discards the bands where audio prompt-injection payloads live. By the time the transcript reaches your scanner, the carrier signal — sub-300 Hz, ultrasonic, in-silence, or adversarial-perturbation — has been filtered out. You are inspecting the wrong artefact. The transcript-side filter is still useful for the lazy half of the attack space (people reading a jailbreak aloud), but it cannot cover the half hidden in the bytes.

How much latency does a waveform classifier add to a real-time voice agent?

On a typical 1–5 second utterance and commodity inference hardware, a 2 M-parameter CNN over a log-mel spectrogram completes in tens of milliseconds. That is well inside the latency envelope of any pipeline that already accepts an STT round-trip. For sub-100 ms streaming voice, the rolling-window co-inspection pattern (run the classifier on 250 ms windows in parallel with STT, short-circuit if any window crosses the block threshold) adds no perceptible latency on a clean path.
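The rolling-window pattern can be sketched as a generator of 250 ms windows plus a scan that stops at the first window over threshold. The function names, the non-overlapping hop, and the peak-amplitude stand-in scorer are assumptions for illustration; in this simplified version a trailing remainder shorter than a full window is not scanned.

```python
def rolling_windows(samples, sample_rate, window_s=0.25, hop_s=0.25):
    """Yield (start_time_s, window) pairs of window_s seconds each."""
    size = int(window_s * sample_rate)
    hop = int(hop_s * sample_rate)
    for start in range(0, max(len(samples) - size, 0) + 1, hop):
        yield start / sample_rate, samples[start:start + size]

def first_blocking_window(samples, sample_rate, score_window, threshold):
    """Return the start time of the first window at or above threshold, else None."""
    for start_s, window in rolling_windows(samples, sample_rate):
        if score_window(window) >= threshold:
            return start_s  # short-circuit: later windows are never scored
    return None

# Stand-in scorer that records how many windows were actually evaluated.
calls = []
def peak_score(window):
    calls.append(len(window))
    return max(abs(x) for x in window)

samples = [0.0] * 8000          # one second at 8 kHz
samples[5000] = 1.0             # a single spike inside the third window
hit = first_blocking_window(samples, 8000, peak_score, threshold=0.5)
```

The scan returns after the third of four windows, so on a flagged path the decision arrives before the utterance ends, and on a clean path the classifier simply keeps pace with the STT.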

Do I need to rebuild this from scratch, or can I integrate an existing scanner?

Build-vs-buy here is the same calculus as for text PI scanners. The pipeline is well-documented enough to build, but the labelled corpus is the part that compounds — and a fresh build starts with no corpus. A managed scanner gets you the corpus on day one for the price of an API call. Glyphward's audio path is the only self-serve option under $100/mo today, but the architectural argument in this post applies regardless of whether you call our API or write your own.

What about audio-LLMs that consume the waveform directly without STT?

The argument gets stronger, not weaker. Audio-LLMs like Gemini Audio or GPT-4o Audio act on the waveform itself, so the transcript-side filter is bypassed entirely — there is no transcript stage. A waveform-side scanner is no longer additive; it is the only inspection point you have. Run the classifier on the bytes before the audio-LLM call, with the same threshold contract as for STT-based stacks.

How do you stop attackers from adversarially evolving past the classifier?

You don't, in the absolute sense — this is an arms race like every other classifier-based defence. What you do is shorten the loop between attack and detection: every blocked utterance feeds the corpus, every corpus refresh tightens the threshold, every release ships a new public benchmark. The asymmetry favours the defender once the corpus crosses a few thousand labelled samples per subtype, because the attacker must now produce a payload that bypasses every prior known variant simultaneously.

Further reading