Prompt-injection scanner for voice agents
A voice agent does not act on a transcript. It acts on a waveform that was decoded into a transcript and forwarded to a tool-using LLM. Anything your defender misses on the way in lands in the model. WhisperInject and the broader audio prompt-injection class hide instructions in exactly the bytes the STT throws away — which means a transcript-only filter is structurally one step too late.
TL;DR
If your voice agent has a microphone, a phone line, or a file upload, you have an audio prompt-injection surface. The defence has to inspect the waveform itself, in parallel with whatever transcript filter you already run. A drop-in scanner that returns a 0–100 risk score on raw audio adds tens of milliseconds and closes the gap that picking a “better” STT cannot.
The voice-agent attack surface, in order
A typical production voice agent looks like this: microphone → STT (Whisper, Deepgram, AWS Transcribe, Gemini audio) → text prompt-injection filter → LLM with tool access → action. Every arrow is a place for the defender to inspect, and every box is a place for the attacker to hide. The structural problem is that the only inspection most teams run is the text filter in the middle of the chain, after the STT and before the LLM, and the STT upstream of it is a lossy compressor.
STTs trade signal fidelity for transcript cleanliness. Audio is downsampled to 16 kHz, bands above and below the speech range are filtered out, voice-activity detection truncates inter-word silences, and beam search smooths low-confidence tokens into something readable. Each step is a filter. Each filter is something the attacker can target. By the time the transcript reaches your text PI scanner, the signal you wanted to inspect has already been removed.
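The shape of that pipeline, and where the inspection gap sits, can be sketched in a few lines. Every function here is an illustrative placeholder, not a real API: the point is that the filter only ever sees what `stt` chose to keep.

```python
def stt(waveform: bytes) -> str:
    """Lossy placeholder: a real STT downsamples, band-filters, and VAD-truncates."""
    return "transcribed text"

def transcript_filter(text: str) -> bool:
    """Text-only PI check: it can only inspect what the STT kept."""
    return "ignore previous instructions" in text.lower()

def call_llm_with_tools(prompt: str) -> str:
    return f"action for: {prompt}"      # placeholder tool-using LLM call

def handle_utterance(waveform: bytes) -> str:
    transcript = stt(waveform)          # signal loss happens here
    if transcript_filter(transcript):   # inspection happens one step later
        return "blocked"
    return call_llm_with_tools(transcript)
```

Anything the attacker placed in `waveform` but not in `transcript` sails straight through `transcript_filter` by construction.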
What a transcript filter sees, and what it cannot see
Transcript-side filters are necessary. They catch the easy cases: someone reading a jailbreak aloud, an explicit system-prompt-override the STT cleanly transcribed, policy-violating content in the spoken request. Use them. They are cheap, fast, and stop the lazy half of the attack space.
What they cannot see is the half hidden in the bytes. WhisperInject — described in arXiv:2405.20653 — places instructions into spectrum bands or temporal locations that the STT discards before transcription. Out-of-band carriers, silence steganography, adversarial waveform perturbations, and quiet multi-speaker overlays all share the same structural property: the artefact your filter reads (a transcript) and the artefact the model acts on (the same transcript, or in the case of audio-LLMs, the raw bytes themselves) are no longer the same thing. You are defending the wrong artefact. See WhisperInject detection and audio prompt-injection detection for the technical depth on each subtype.
A scanner that runs before — and alongside — the STT
The architecture that closes the gap is a pre-STT inspection running on the raw waveform, in parallel with the transcript-side filter you already have. Two signals, both required:
- Waveform anomaly classifier. A small convolutional model over the full-band spectrogram, trained on out-of-band energy patterns, adversarial-perturbation artefacts, silence steganography, and below-noise-floor multi-speaker overlays. Returns a score independent of the transcript.
- Transcript-side PI filter. Standard text PI scanner over whatever your STT outputs — keep what you have. The waveform classifier is additive, not a replacement.
Either signal above threshold is cause to block, route to a human reviewer, or downgrade the agent’s privileges (read-only mode, no tool calls, no payments). Run the two in parallel, not in series: they inspect different artefacts, the raw bytes and the transcript, so chaining one after the other adds latency without adding coverage.
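The decision logic above is small enough to write down. This is a minimal sketch; the thresholds and the three-way outcome are illustrative choices, not Glyphward's documented values:

```python
from dataclasses import dataclass

BLOCK_THRESHOLD = 80    # assumed 0-100 scale, matching the scanner's risk score
REVIEW_THRESHOLD = 50

@dataclass
class Decision:
    action: str           # "allow" | "review" | "block"
    tools_enabled: bool

def decide(waveform_score: int, transcript_score: int) -> Decision:
    # Parallel signals: either one above threshold is decisive, so take the max.
    worst = max(waveform_score, transcript_score)
    if worst >= BLOCK_THRESHOLD:
        return Decision("block", tools_enabled=False)
    if worst >= REVIEW_THRESHOLD:
        # Downgrade instead of blocking: the agent keeps answering, loses tools.
        return Decision("review", tools_enabled=False)
    return Decision("allow", tools_enabled=True)
```

Taking the max is the OR-combination: a clean transcript never rescues a hot waveform, and vice versa.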
Integration patterns for real-time voice
Three patterns dominate, depending on how interactive your voice agent is.
- Async batch (telephony post-call review). The waveform is captured, then scanned alongside transcription. Latency is irrelevant. Use the full-fidelity scan and surface flagged calls to a human reviewer queue. This is the cheapest first integration; it generates a labelled corpus that you can then use to tune real-time thresholds.
- Pre-LLM gate (most production voice agents). Run the scanner on each captured utterance before the LLM call. Block or downgrade in-flight if a high-risk score lands. Adds latency to the first token of the response.
- Streaming co-inspection (low-latency voice). Run the waveform classifier on rolling 250 ms windows in parallel with STT, surface a running risk score, and short-circuit the LLM call if any window crosses the block threshold. Adds no perceptible latency on a clean path; cuts off the model on a hot path before it can act.
None of these change your STT choice. The scanner sits beside the STT, not in front of it.
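The streaming co-inspection pattern is the least obvious of the three, so here is a sketch of its control flow. The window scorer is a stand-in for the real classifier (which scores a spectrogram, not raw amplitudes), and the thresholds are illustrative:

```python
SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE // 4              # 250 ms of 16 kHz samples

def score_window(samples: list[int]) -> int:
    """Placeholder classifier: flags implausibly hot samples. The real model
    scores spectral features; this only demonstrates the streaming shape."""
    return 95 if any(abs(s) > 30_000 for s in samples) else 5

def stream_risk(samples: list[int], block_at: int = 80):
    """Yield (window_index, score) per rolling window; stop on a hot one."""
    for i in range(0, len(samples), WINDOW):
        score = score_window(samples[i:i + WINDOW])
        yield i // WINDOW, score
        if score >= block_at:
            return                     # short-circuit: the LLM call never fires
```

Because the generator returns as soon as a window crosses the threshold, downstream code that consumes it never reaches the LLM call on a hot path.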
Latency budget
The waveform classifier in Glyphward’s production pipeline runs in tens of milliseconds for a typical 1–5 second utterance on commodity inference hardware — well inside the latency envelope of a voice agent that already tolerates an STT round-trip. If your pipeline already runs Whisper-small for transcription, you can share the model output, and the marginal cost on top is the waveform path alone. Public benchmarks land in the API docs at GA. The free tier lets you run the same calls against the public WhisperInject sample set today.
How Glyphward fits a voice-agent stack
Glyphward’s `/v1/scan` endpoint accepts a waveform — raw bytes, WAV, or common containers — and returns a 0–100 risk score, the modality flag, the classifier confidence, and a flagged timestamp range. Drop it in front of your LLM call, behind your STT, or in parallel with both — the response shape does not change. Free tier: 10 scans a day, no card. Pro: 100,000 scans/month at $29. Team: 1,000,000 at $99 with audit log. See the full pricing page or the comparison vs Lakera, LLM Guard, Azure Prompt Shields and Promptfoo. Among self-serve scanners under $100/mo, Glyphward is currently the only one with a production audio pipeline.
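A call to the endpoint could look like the following sketch. The base URL, auth scheme, and response field names (`risk_score` and friends) are assumptions for illustration — check the API docs for the real schema before wiring this up:

```python
import json

API_BASE = "https://api.glyphward.example"   # hypothetical base URL

def build_scan_request(audio: bytes, api_key: str) -> dict:
    """Assemble the HTTP call; send it with any client (requests, httpx, urllib)."""
    return {
        "method": "POST",
        "url": f"{API_BASE}/v1/scan",
        "headers": {
            "Authorization": f"Bearer {api_key}",        # assumed auth scheme
            "Content-Type": "application/octet-stream",  # raw waveform bytes
        },
        "body": audio,
    }

def should_block(response_body: str, threshold: int = 80) -> bool:
    """Gate on the 0-100 risk score; the field name is an assumption."""
    result = json.loads(response_body)
    return result["risk_score"] >= threshold
```

The same two functions serve all three integration patterns; only where you call them (post-call, pre-LLM, or per-window) changes.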
Related questions
Does the scanner work with Deepgram, AWS, or Gemini audio — not just Whisper?
Yes. The waveform classifier inspects the raw bytes before any STT touches them, so the choice of downstream STT does not matter for coverage. The transcript-side filter is STT-agnostic by design (it runs over whatever string comes out). Any combination works.
Can I run this purely client-side in the browser for a web-voice agent?
Partially. The widget at /embed/preview is a client-only demo of the same scoring logic for the upload-and-score flow. Production voice-agent integrations call the API server-side, because the model weights and the corpus the score depends on are not shipped to the browser.
What about deepfakes or voice cloning?
Out of scope. Deepfake detection answers “is this the right person?”; PI detection answers “is this an instruction payload?”. Both can be true at once — pair Glyphward with an authenticity-focused detector if you need both layers.
How does this compare to running Lakera or LLM Guard on the transcript?
It is additive, not a replacement. Lakera and LLM Guard are excellent text PI scanners; both, by design, run on the transcript. Glyphward sits one step earlier in the pipeline and runs on the bytes. The recommended stack is a transcript-side scanner you already trust plus Glyphward on the waveform — see vs Lakera Guard and vs LLM Guard.
Does this work for pre-recorded audio (file upload) as well as live mic?
Yes — the API does not distinguish. Files larger than the request limit can be chunked client-side. The async batch pattern above is the canonical recipe for batch-scanning archived call recordings.
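Client-side chunking for large files is straightforward for raw PCM. This sketch assumes 16-bit mono 16 kHz samples and leaves container handling (WAV headers and the like) out; the actual request limit is whatever the API docs specify:

```python
SAMPLE_RATE = 16_000
BYTES_PER_SAMPLE = 2                   # 16-bit mono PCM

def chunk_pcm(pcm: bytes, seconds: float = 30.0) -> list[bytes]:
    """Split raw PCM into fixed-duration chunks for separate scan requests."""
    step = int(SAMPLE_RATE * BYTES_PER_SAMPLE * seconds)
    step -= step % BYTES_PER_SAMPLE    # keep boundaries on sample boundaries
    return [pcm[i:i + step] for i in range(0, len(pcm), step)]
```

One caveat worth noting: a payload that straddles a chunk boundary is scored in two halves, so overlapping the chunks slightly is a reasonable hardening step.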
Further reading
- WhisperInject detection — the canonical attack on the audio side, explained.
- Audio prompt-injection detection — the broader class beyond WhisperInject (out-of-band, silence steg, adversarial perturbation, multi-speaker overlay).
- The multimodal prompt-injection threat model for AI product teams (2026) — full threat model and a 5-step defender’s playbook.
- Multimodal LLM security API — the category page covering image and audio in one endpoint.
- Lakera alternative (multimodal) — why text-first defenders leave audio uncovered.