Prompt-injection scanner for voice agents
A voice agent does not act on a transcript. It acts on a waveform that was decoded into a transcript and forwarded to a tool-using LLM. Anything your defender misses on the way in lands in the model. WhisperInject and the broader audio prompt-injection class hide instructions in exactly the bytes the STT throws away — which means a transcript-only filter is structurally one step too late.
TL;DR
If your voice agent has a microphone, a phone line, or a file upload, you have an audio prompt-injection surface. The defence has to inspect the waveform itself, in parallel with whatever transcript filter you already run. A drop-in scanner that returns a 0–100 risk score on raw audio adds tens of milliseconds and closes the gap that picking a “better” STT cannot.
The voice-agent attack surface, in order
A typical production voice agent looks like this: microphone → STT (Whisper, Deepgram, AWS Transcribe, Gemini audio) → text prompt-injection filter → LLM with tool access → action. Every arrow is a place for the defender to inspect, and every box is a place for the attacker to hide. The structural problem is that the only inspection most teams run is the text filter in the middle of the chain, after the STT and before the LLM, and the STT upstream of it is a lossy compressor.
STTs trade signal fidelity for transcript cleanliness. Audio is downsampled to 16 kHz, bands above and below the speech range are filtered out, voice-activity detection truncates inter-word silences, and beam search smooths low-confidence tokens into something readable. Each step is a filter. Each filter is something the attacker can target. By the time the transcript reaches your text PI scanner, the signal you wanted to inspect has already been removed.
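The shape of that pipeline, and where the inspection gap sits, can be sketched in a few lines. Every function here is an illustrative placeholder, not a real API: the point is that the filter only ever sees what `stt` chose to keep.

```python
def stt(waveform: bytes) -> str:
    """Lossy placeholder: a real STT downsamples, band-filters, and VAD-truncates."""
    return "transcribed text"

def transcript_filter(text: str) -> bool:
    """Text-only PI check: it can only inspect what the STT kept."""
    return "ignore previous instructions" in text.lower()

def call_llm_with_tools(prompt: str) -> str:
    return f"action for: {prompt}"      # placeholder tool-using LLM call

def handle_utterance(waveform: bytes) -> str:
    transcript = stt(waveform)          # signal loss happens here
    if transcript_filter(transcript):   # inspection happens one step later
        return "blocked"
    return call_llm_with_tools(transcript)
```

Anything the attacker placed in `waveform` but not in `transcript` sails straight through `transcript_filter` by construction.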
What a transcript filter sees, and what it cannot see
Transcript-side filters are necessary. They catch the easy cases: someone reading a jailbreak aloud, an explicit system-prompt-override the STT cleanly transcribed, policy-violating content in the spoken request. Use them. They are cheap, fast, and stop the lazy half of the attack space.
What they cannot see is the half hidden in the bytes. WhisperInject — described in arXiv:2405.20653 — places instructions into spectrum bands or temporal locations that the STT discards before transcription. Out-of-band carriers, silence steganography, adversarial waveform perturbations, and quiet multi-speaker overlays all share the same structural property: the artefact your filter reads (a transcript) and the artefact the model acts on (the same transcript, or in the case of audio-LLMs, the raw bytes themselves) are no longer the same thing. You are defending the wrong artefact. See WhisperInject detection and audio prompt-injection detection for the technical depth on each subtype.
A scanner that runs before — and alongside — the STT
The architecture that closes the gap is a pre-STT inspection running on the raw waveform, in parallel with the transcript-side filter you already have. Two signals, both required:
- Waveform anomaly classifier. A small convolutional model over the full-band spectrogram, trained on out-of-band energy patterns, adversarial-perturbation artefacts, silence steganography, and below-noise-floor multi-speaker overlays. Returns a score independent of the transcript.
- Transcript-side PI filter. Standard text PI scanner over whatever your STT outputs — keep what you have. The waveform classifier is additive, not a replacement.
Either signal above threshold is cause to block, route to a human reviewer, or downgrade the agent’s privileges (read-only mode, no tool calls, no payments). Run the two in parallel, not in series: they inspect different artefacts, the raw bytes and the transcript, so chaining one after the other adds latency without adding coverage.
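The decision logic above is small enough to write down. This is a minimal sketch; the thresholds and the three-way outcome are illustrative choices, not Glyphward's documented values:

```python
from dataclasses import dataclass

BLOCK_THRESHOLD = 80    # assumed 0-100 scale, matching the scanner's risk score
REVIEW_THRESHOLD = 50

@dataclass
class Decision:
    action: str           # "allow" | "review" | "block"
    tools_enabled: bool

def decide(waveform_score: int, transcript_score: int) -> Decision:
    # Parallel signals: either one above threshold is decisive, so take the max.
    worst = max(waveform_score, transcript_score)
    if worst >= BLOCK_THRESHOLD:
        return Decision("block", tools_enabled=False)
    if worst >= REVIEW_THRESHOLD:
        # Downgrade instead of blocking: the agent keeps answering, loses tools.
        return Decision("review", tools_enabled=False)
    return Decision("allow", tools_enabled=True)
```

Taking the max is the OR-combination: a clean transcript never rescues a hot waveform, and vice versa.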
Integration patterns for real-time voice
Three patterns dominate, depending on how interactive your voice agent is.
- Async batch (telephony post-call review). The waveform is captured, then scanned alongside transcription. Latency is irrelevant. Use the full-fidelity scan and surface flagged calls to a human reviewer queue. This is the cheapest first integration; it generates a labelled corpus that you can then use to tune real-time thresholds.
- Pre-LLM gate (most production voice agents). Run the scanner on each captured utterance before the LLM call. Block or downgrade in-flight if a high-risk score lands. Adds latency to the first token of the response.
- Streaming co-inspection (low-latency voice). Run the waveform classifier on rolling 250 ms windows in parallel with STT, surface a running risk score, and short-circuit the LLM call if any window crosses the block threshold. Adds no perceptible latency on a clean path; cuts off the model on a hot path before it can act.
None of these change your STT choice. The scanner sits beside the STT, not in front of it.
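The streaming co-inspection pattern is the least obvious of the three, so here is a sketch of its control flow. The window scorer is a stand-in for the real classifier (which scores a spectrogram, not raw amplitudes), and the thresholds are illustrative:

```python
SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE // 4              # 250 ms of 16 kHz samples

def score_window(samples: list[int]) -> int:
    """Placeholder classifier: flags implausibly hot samples. The real model
    scores spectral features; this only demonstrates the streaming shape."""
    return 95 if any(abs(s) > 30_000 for s in samples) else 5

def stream_risk(samples: list[int], block_at: int = 80):
    """Yield (window_index, score) per rolling window; stop on a hot one."""
    for i in range(0, len(samples), WINDOW):
        score = score_window(samples[i:i + WINDOW])
        yield i // WINDOW, score
        if score >= block_at:
            return                     # short-circuit: the LLM call never fires
```

Because the generator returns as soon as a window crosses the threshold, downstream code that consumes it never reaches the LLM call on a hot path.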
Latency budget
The waveform classifier in Glyphward’s production pipeline runs in tens of milliseconds for a typical 1–5 second utterance on commodity inference hardware — well inside the latency envelope of a voice agent that already tolerates an STT round-trip. If your pipeline already runs Whisper-small for transcription, you can share the model output, and the marginal cost on top is the waveform path alone. Public benchmarks land in the API docs at GA. The free tier lets you run the same calls against the public WhisperInject sample set today.
How Glyphward fits a voice-agent stack
Glyphward’s `/v1/scan` endpoint accepts a waveform — raw bytes, WAV, or common containers — and returns a 0–100 risk score, the modality flag, the classifier confidence, and a flagged timestamp range. Drop it in front of your LLM call, behind your STT, or in parallel with both — the response shape does not change. Free tier: 10 scans a day, no card. Pro: 100,000 scans/month at $29. Team: 1,000,000 at $99 with audit log. See the full pricing page or the comparison vs Lakera, LLM Guard, Azure Prompt Shields and Promptfoo. Among self-serve scanners under $100/mo, Glyphward is currently the only one with a production audio pipeline.
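A call to the endpoint could look like the following sketch. The base URL, auth scheme, and response field names (`risk_score` and friends) are assumptions for illustration — check the API docs for the real schema before wiring this up:

```python
import json

API_BASE = "https://api.glyphward.example"   # hypothetical base URL

def build_scan_request(audio: bytes, api_key: str) -> dict:
    """Assemble the HTTP call; send it with any client (requests, httpx, urllib)."""
    return {
        "method": "POST",
        "url": f"{API_BASE}/v1/scan",
        "headers": {
            "Authorization": f"Bearer {api_key}",        # assumed auth scheme
            "Content-Type": "application/octet-stream",  # raw waveform bytes
        },
        "body": audio,
    }

def should_block(response_body: str, threshold: int = 80) -> bool:
    """Gate on the 0-100 risk score; the field name is an assumption."""
    result = json.loads(response_body)
    return result["risk_score"] >= threshold
```

The same two functions serve all three integration patterns; only where you call them (post-call, pre-LLM, or per-window) changes.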
Related questions
Does the scanner work with Deepgram, AWS, or Gemini audio — not just Whisper?
Yes. The waveform classifier inspects the raw bytes before any STT touches them, so the choice of downstream STT does not matter for coverage. The transcript-side filter is STT-agnostic by design (it runs over whatever string comes out). Any combination works.
Can I run this purely client-side in the browser for a web-voice agent?
Partially. The widget at /embed/preview is a client-only demo of the same scoring logic for the upload-and-score flow. Production voice-agent integrations call the API server-side, because the model weights and the corpus the score depends on are not shipped to the browser.
What about deepfakes or voice cloning?
Out of scope. Deepfake detection answers “is this the right person?”; PI detection answers “is this an instruction payload?”. Both can be true at once — pair Glyphward with an authenticity-focused detector if you need both layers.
How does this compare to running Lakera or LLM Guard on the transcript?
It is additive, not a replacement. Lakera and LLM Guard are excellent text PI scanners; both, by design, run on the transcript. Glyphward sits one step earlier in the pipeline and runs on the bytes. The recommended stack is a transcript-side scanner you already trust plus Glyphward on the waveform — see vs Lakera Guard and vs LLM Guard.
Does this work for pre-recorded audio (file upload) as well as live mic?
Yes — the API does not distinguish. Files larger than the request limit can be chunked client-side. The async batch pattern above is the canonical recipe for batch-scanning archived call recordings.
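Client-side chunking for large files is straightforward for raw PCM. This sketch assumes 16-bit mono 16 kHz samples and leaves container handling (WAV headers and the like) out; the actual request limit is whatever the API docs specify:

```python
SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2                   # 16-bit mono PCM

def chunk_pcm(pcm: bytes, seconds: float = 30.0) -> list[bytes]:
    """Split raw PCM into fixed-duration chunks for separate scan requests."""
    step = int(SAMPLE_RATE * BYTES_PER_SAMPLE * seconds)
    step -= step % BYTES_PER_SAMPLE    # keep boundaries on sample boundaries
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]
```

One caveat worth noting: a payload that straddles a chunk boundary is scored in two halves, so overlapping the chunks slightly is a reasonable hardening step.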
Further reading
- WhisperInject detection — the canonical attack on the audio side, explained.
- Audio prompt-injection detection — the broader class beyond WhisperInject (out-of-band, silence steg, adversarial perturbation, multi-speaker overlay).
- The multimodal prompt-injection threat model for AI product teams (2026) — full threat model and a 5-step defender’s playbook.
- Multimodal LLM security API — the category page covering image and audio in one endpoint.
- Lakera alternative (multimodal) — why text-first defenders leave audio uncovered.