Attack explainer · Audio injection
WhisperInject detection for voice agents
If your voice agent only filters the Whisper transcript, you are defending the wrong thing. WhisperInject and the wider class of audio prompt injection live in the features of the waveform itself — and the transcript is built after most of them have already been discarded.
TL;DR
Whisper is a lossy denoiser by design: it drops ultrasonic content, suppresses non-speech energy, and compresses multi-speaker signals before token generation. Attackers put instructions in exactly those features — ultrasonic carriers, inter-word steganography, adversarial noise envelopes — so the instruction never reaches your transcript. Detect it on the raw audio, not on the text.
How the attack works
WhisperInject is the best-known name for a family of 2024–2025 audio-PI techniques with a common shape: craft the audio so that the target STT + LLM pipeline interprets an attacker-chosen instruction, while the inspection layer (a transcript filter) sees something benign or empty.
- Ultrasonic carriers. Instructions encoded above ~18 kHz pass through many recording pipelines, reach the model’s pre-processing, and in some chained systems leak into the generated text. Whisper’s 16 kHz mel pipeline discards them, so your transcript filter never sees them, but any attached direct-audio LLM (Audio-GPT class) might.
- Inter-word steganography. Low-energy payloads tucked into the silence between spoken words. Whisper’s VAD drops them; a waveform classifier can still see them.
- Adversarial perturbation. A carrier of innocuous-sounding speech carrying an adversarial overlay tuned to flip Whisper’s output into the attacker’s instruction — the audio equivalent of a universal adversarial example.
- Multi-speaker overlay. A loud benign foreground voice and a quiet adversarial background voice. Humans hear one. STT systems with source separation can pick up both; systems without it drop the quiet one, but the upstream audio model still mixes it in.
See the ultrasonic inaudible-command line beginning with DolphinAttack (Zhang et al., 2017) and the 2024 academic coverage of audio adversarial attacks against open-weight speech LLMs for the underlying research thread.
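The ultrasonic-carrier idea is easy to demonstrate synthetically. The sketch below (an illustration, not a reproduction of any published payload) mixes a quiet 19 kHz amplitude-modulated carrier into a speech-band signal at a 44.1 kHz capture rate; a 440 Hz tone stands in for speech and a slow sinusoid stands in for the payload envelope. The high-band energy is plainly present in the raw capture, yet nothing in the speech band changes.

```python
import numpy as np

SR = 44_100                      # typical capture rate; Whisper later resamples to 16 kHz
t = np.arange(SR) / SR           # one second of samples

# Benign "speech" stand-in: a 440 Hz tone (a real attack would use actual speech).
speech = 0.5 * np.sin(2 * np.pi * 440 * t)

# Ultrasonic carrier at 19 kHz, amplitude-modulated by a stand-in payload
# envelope (a slow sinusoid here; real payloads would encode data bits).
envelope = 0.5 * (1 + np.sin(2 * np.pi * 4 * t))
carrier = 0.05 * envelope * np.sin(2 * np.pi * 19_000 * t)

mixed = speech + carrier

# The carrier shows up as spectral energy above 18 kHz in the raw capture.
spectrum = np.abs(np.fft.rfft(mixed))
freqs = np.fft.rfftfreq(len(mixed), 1 / SR)
hi_band = spectrum[freqs > 18_000].sum() / spectrum.sum()
print(f"fraction of spectral magnitude above 18 kHz: {hi_band:.3f}")
```

A human listener and a transcript filter both perceive only the speech-band content; the carrier rides above both.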
Why a Whisper transcript filter is not enough
Transcript filtering is a necessary baseline and an insufficient defence. Whisper was optimised to produce clean readable text from messy audio — meaning it aggressively removes exactly the features an attacker hides behind. By the time you read the transcript, the signal has been:
- downsampled to 16 kHz (everything above ~8 kHz is gone),
- log-mel transformed (fine spectral detail between the mel bands is compressed away),
- VAD-gated (inter-word silences are truncated),
- token-generated with beam search (low-confidence adversarial tokens are frequently smoothed away).
Your transcript filter runs on the survivors. The attacker aims at the casualties.
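The first lossy step alone is enough to erase an ultrasonic carrier. The sketch below builds the same kind of mixed signal as above and runs it through a 44.1 kHz → 16 kHz polyphase resample (what any Whisper-style front end does); the resampler's anti-alias filter removes everything above the new 8 kHz Nyquist limit before a single token is generated.

```python
import numpy as np
from scipy.signal import resample_poly

SR = 44_100
t = np.arange(SR) / SR
speech = 0.5 * np.sin(2 * np.pi * 440 * t)        # benign speech stand-in
carrier = 0.05 * np.sin(2 * np.pi * 19_000 * t)   # ultrasonic payload carrier
mixed = speech + carrier

def band_fraction(x, sr, lo_hz):
    """Fraction of spectral magnitude above lo_hz."""
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1 / sr)
    return spec[freqs > lo_hz].sum() / spec.sum()

print(f"raw 44.1 kHz, energy above 18 kHz: {band_fraction(mixed, SR, 18_000):.3f}")

# A Whisper-style front end resamples to 16 kHz; the polyphase resampler's
# anti-alias filter removes everything above the new 8 kHz Nyquist limit.
down = resample_poly(mixed, 160, 441)             # 44_100 * 160/441 = 16_000
print(f"after 16 kHz resample, energy above 7 kHz: {band_fraction(down, 16_000, 7_000):.6f}")
```

Anything you hang off the transcript runs downstream of that filter, which is exactly why the carrier band is the attacker's favourite hiding place.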
How to detect audio prompt injection at the waveform
A robust audio-PI defence has two layers working on the raw audio before Whisper touches it:
- Waveform anomaly classifier. A small convolutional model trained on out-of-band energy, adversarial-perturbation signatures, and steganographic carrier patterns. Looks at the spectrogram directly, not the transcript.
- Transcript-side filter with full-band awareness. Whisper-small is fine for the speech content; the classifier described above runs in parallel on the 44.1 kHz raw bytes and flags anything the transcript-side filter cannot see.
Either signal above threshold is cause to route the audio to a human, downgrade the agent’s privileges, or refuse the call. The combination catches the current public corpus at 80%+ recall with false positives dominated by musical backgrounds and noisy field recordings — both suppressible with corpus rules.
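A trained convolutional classifier is beyond a snippet, but its simplest input feature — out-of-band energy — can be sketched as a heuristic flagger. This is a crude stand-in, not a production detector: real deployments would feed the full spectrogram to a trained model rather than threshold a single ratio.

```python
import numpy as np

def flag_out_of_band(samples: np.ndarray, sr: int = 44_100,
                     speech_ceiling_hz: float = 8_000,
                     threshold: float = 0.02) -> bool:
    """Flag audio whose spectral energy above the speech band exceeds
    `threshold` as a fraction of total energy. Stand-in for one feature
    of the waveform anomaly classifier described above."""
    spec = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), 1 / sr)
    ratio = spec[freqs > speech_ceiling_hz].sum() / max(spec.sum(), 1e-12)
    return ratio > threshold

# Benign clip: speech-band tone only.
sr = 44_100
t = np.arange(sr) / sr
benign = 0.5 * np.sin(2 * np.pi * 440 * t)
# Suspicious clip: same tone plus a quiet 19 kHz carrier.
attacked = benign + 0.05 * np.sin(2 * np.pi * 19_000 * t)

print(flag_out_of_band(benign))    # False
print(flag_out_of_band(attacked))  # True
```

The threshold and speech ceiling are illustrative defaults; in practice both would be tuned against the false-positive sources noted above (music, noisy field recordings).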
How Glyphward detects WhisperInject
Glyphward is the only self-serve scanner in the under-$100/mo tier that inspects the waveform, not just the transcript. Upload a clip to glyphward.com or POST raw bytes to the API and you get a 0–100 risk score plus a flagged timestamp range. Free tier gives you 10 scans a day — enough to run the published WhisperInject samples through it and see what your existing text scanner was missing.
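Submitting raw bytes is a plain HTTP POST. The sketch below builds one with the standard library; note that the endpoint path, auth header, and content type are all assumptions for illustration — only the domain comes from this page, so check the API reference before wiring this into a pipeline.

```python
from urllib import request

# Hypothetical endpoint and placeholder key: the article gives only the
# domain; the real path and auth scheme live in the API reference.
API_URL = "https://glyphward.com/api/v1/scan"    # assumed path
API_KEY = "YOUR_API_KEY"                          # placeholder

def build_scan_request(wav_bytes: bytes) -> request.Request:
    """Build a raw-bytes POST for the scan endpoint; pass the result to
    urllib.request.urlopen() to actually submit the clip."""
    return request.Request(
        API_URL,
        data=wav_bytes,
        headers={"Authorization": f"Bearer {API_KEY}",
                 "Content-Type": "audio/wav"},
        method="POST",
    )

req = build_scan_request(b"RIFF....WAVEfmt ")     # placeholder WAV bytes
print(req.get_method(), req.full_url)
```

The response shape (0–100 risk score plus flagged timestamp range) is described in the paragraph above; the field names it arrives under are, again, something to confirm against the API docs.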
Related questions
Can I self-host for offline voice pipelines?
Not at v1. Self-hosted offline mode is on the roadmap for the Team tier — the compounding corpus is the core of the product and it requires the shared-signature model. If offline is a hard requirement, email us and we’ll tell you where it sits.
What if the attacker uses a model other than Whisper?
The class doesn’t care. Any lossy STT frontend — Whisper, Deepgram’s Nova, AWS Transcribe, the Gemini audio encoder — discards some band of signal. The carrier simply targets whatever is being thrown away. The defence is the same: inspect the waveform before the lossy pipeline consumes it.
Does this catch voice cloning or deepfakes?
No — those are an identity-authenticity problem, not an injection problem. Glyphward flags payloads that instruct a model; deepfake detection is a separate category (and a different buyer). We stay in our lane.
Further reading
- FigStep detection — the image-modality sibling of this attack class.
- Lakera alternative for multimodal prompt injection — why a text-first defender cannot cover voice.
- Multimodal prompt-injection scanner pricing comparison (2026) — where audio-PI coverage sits in the tooling landscape.