Category explainer · Audio injection

Audio prompt-injection detection

Audio prompt injection is the class of attacks that embed an instruction in a waveform so that it reaches the model while remaining absent from the transcript your filter reads. WhisperInject is the most-cited example, but the class is broader — it covers every STT or audio-LLM pipeline that discards any band of signal, which is all of them.

STT drops signal by design. Attackers put instructions in what gets dropped. Detect at the waveform, not the transcript.

Any STT frontend — Whisper, Deepgram, AWS Transcribe, Gemini’s audio encoder — is a lossy compressor. Ultrasonic content, inter-word silences, low-confidence tokens, and quiet background voices are thrown away before you see a string. A transcript filter cannot detect instructions the transcript never contained. A scanner has to look at the raw bytes.
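A toy illustration of that lossiness, using an amplitude-gated voice-activity detector (a simplification, not any vendor's actual VAD): frames quieter than the gate never reach the transcription stage, and a payload hidden in them vanishes with them.

```python
def rms(frame):
    """Root-mean-square amplitude of one audio frame."""
    return (sum(x * x for x in frame) / len(frame)) ** 0.5

def vad_gate(frames, threshold=0.1):
    """Toy voice-activity gate: keep only frames loud enough to look like speech."""
    return [f for f in frames if rms(f) >= threshold]

loud_speech   = [0.5, -0.4, 0.6, -0.5]      # foreground voice, RMS ~0.5
quiet_payload = [0.02, -0.03, 0.02, -0.02]  # hidden instruction, RMS ~0.02

kept = vad_gate([loud_speech, quiet_payload])
print(len(kept))  # 1 -- the quiet payload frame never reaches the STT
```

A transcript filter downstream of this gate sees one frame's worth of text and nothing else; the dropped frame is exactly where the attacker lives.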

The subtypes of audio prompt injection

Beyond the specific WhisperInject recipe, audio PI has four recurring shapes in the 2024–2026 literature and incident reports:

  1. Out-of-band carriers. Ultrasonic (above ~18 kHz) or near-infrasonic signal encoding instructions that a 16 kHz downsample discards (a 16 kHz sample rate keeps nothing above its 8 kHz Nyquist limit). Survives to any downstream audio LLM that ingests raw bytes directly.
  2. Silence steganography. Low-amplitude payload tucked into the gaps the STT’s voice-activity detection truncates. Inaudible to humans; visible in the spectrogram.
  3. Adversarial waveform perturbation. Crafted noise overlaid on benign speech that flips the STT output into the attacker’s instruction — the audio equivalent of a universal adversarial example. Well-documented since Carlini and Wagner’s 2018 targeted audio adversarial examples against speech-to-text, with follow-ups through 2024–2025 targeting open-weight speech LLMs.
  4. Multi-speaker overlay. Benign foreground voice plus an adversarial background voice below the noise-floor threshold. Source-separating systems pick up both; single-voice STTs drop the quiet one. Either way, direct-audio LLMs mix it in.
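Of the four, the out-of-band carrier is the easiest to make concrete. A sketch (pure-tone signals and a naive DFT, nothing like a production classifier) of how out-of-band energy exposes a quiet ultrasonic carrier in full-rate audio:

```python
import cmath
import math

def out_of_band_ratio(samples, sample_rate, cutoff_hz=8000.0):
    """Fraction of spectral energy above cutoff_hz, via a naive DFT (demo only)."""
    n = len(samples)
    low = high = 0.0
    for k in range(1, n // 2):  # positive-frequency bins, skip DC
        s = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        energy = abs(s) ** 2
        if k * sample_rate / n >= cutoff_hz:
            high += energy
        else:
            low += energy
    return high / (high + low)

FS, N = 48_000, 512  # ~11 ms of full-rate audio
speech = [math.sin(2 * math.pi * 468.75 * t / FS) for t in range(N)]
carrier = [s + 0.05 * math.sin(2 * math.pi * 20_250 * t / FS)
           for t, s in enumerate(speech)]

print(f"{out_of_band_ratio(carrier, FS):.4f}")  # quiet 20 kHz carrier shows up
print(f"{out_of_band_ratio(speech, FS):.4f}")   # clean speech: essentially zero
```

The same statistic is invisible to anything downstream of a 16 kHz resample, which is the whole point of the subtype.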

All four share a structural feature: the signal your defender reads (a transcript) and the signal the model acts on (either the same transcript, or raw audio fed to an audio LLM) are not the same artefact. You are defending the wrong artefact.

Why transcript filtering is a necessary-but-insufficient defence

Transcript filters are cheap and useful — they catch overt jailbreaks in spoken prompts and policy violations that survive transcription. They fail in exactly one way: by not existing where the attacker is hiding. By the time your transcript filter runs, Whisper has already downsampled to 16 kHz, band-pass filtered, VAD-gated, and beam-search-smoothed. Each step is a filter that could have been the attack surface. The defender that runs only on the output has no purchase on the signal that was removed.
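The downsampling step alone is enough to erase the carrier. A toy windowed-sinc decimator standing in for the 48 kHz to 16 kHz resample (real resamplers differ in filter design, but the effect is the same): the speech band survives, and a 20 kHz carrier is gone before any transcript exists.

```python
import math

def lowpass_decimate(samples, factor=3, taps=101):
    """Windowed-sinc low-pass at the new Nyquist, then keep every factor-th sample.
    A toy stand-in for a 48 kHz -> 16 kHz resampler."""
    fc = 0.5 / factor  # normalized cutoff at the decimated Nyquist
    m = taps - 1
    h = []
    for n in range(taps):
        x = n - m / 2
        sinc = 2 * fc if x == 0 else math.sin(2 * math.pi * fc * x) / (math.pi * x)
        h.append(sinc * (0.54 - 0.46 * math.cos(2 * math.pi * n / m)))  # Hamming
    gain = sum(h)
    h = [c / gain for c in h]
    return [sum(h[k] * samples[i + k] for k in range(taps))
            for i in range(0, len(samples) - taps, factor)]

def rms(xs):
    return (sum(x * x for x in xs) / len(xs)) ** 0.5

FS = 48_000
audible    = [math.sin(2 * math.pi * 1_000 * t / FS) for t in range(2_000)]
ultrasonic = [math.sin(2 * math.pi * 20_000 * t / FS) for t in range(2_000)]

print(f"{rms(lowpass_decimate(audible)):.3f}")     # ~0.707: speech band survives
print(f"{rms(lowpass_decimate(ultrasonic)):.3f}")  # near zero: carrier is erased
```

Everything after this point in the STT pipeline, including your transcript filter, operates on the left column only.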

Upgrading STTs does not close the gap. Every STT trade-off — latency, transcript cleanliness, robustness to noise — is a preference about what to drop. The attacker simply targets whatever is being dropped by the STT you pick. The fix is to inspect the waveform in parallel with the transcript, not to pick a “better” STT.

How to detect audio PI at the waveform

The two-signal ensemble:

  1. Waveform anomaly classifier. A small convolutional model over the full-band (44.1 kHz or native-rate) spectrogram. Trained on out-of-band energy patterns, adversarial-perturbation artefacts, and the hallmarks of silence steganography and multi-speaker overlays. Returns an anomaly score independent of any transcription.
  2. Transcript-side PI filter. Whisper-small produces a transcript; a standard text PI classifier runs over it. Catches overt audible jailbreaks that were never hidden in the waveform.

Either signal above threshold is cause to block, route to a human, or downgrade the voice agent’s privileges. Run in parallel, not in series — routing the transcript through the waveform detector does not work, because the transcript is the feature the waveform detector is defending against.
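A minimal sketch of that policy, with hypothetical score names (the field names and threshold are illustrative assumptions, not a documented interface):

```python
from dataclasses import dataclass

@dataclass
class ScanSignals:
    waveform_score: float    # anomaly score from the waveform classifier, 0.0-1.0
    transcript_score: float  # PI score from the transcript-side filter, 0.0-1.0

def policy(signals: ScanSignals, threshold: float = 0.5) -> str:
    """Either signal over threshold triggers containment: a parallel OR,
    never the waveform detector downstream of the transcript."""
    if signals.waveform_score >= threshold or signals.transcript_score >= threshold:
        return "block"  # or: route to a human / downgrade agent privileges
    return "allow"

# A waveform-only hit must block even when the transcript looks clean:
print(policy(ScanSignals(waveform_score=0.9, transcript_score=0.1)))  # block
print(policy(ScanSignals(waveform_score=0.1, transcript_score=0.1)))  # allow
```

The OR is the load-bearing design choice: ANDing the two signals would let each subtype slip past the detector that cannot see it.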

How Glyphward does audio PI detection

Glyphward’s audio endpoint accepts a waveform (raw bytes, WAV, or common container) and returns a 0–100 risk score plus a flagged timestamp range. It runs the waveform classifier and the transcript-side filter on every scan and returns both signals in the response so your integration can apply its own policy. The free tier gives 10 scans a day with no card. Pro and Team tiers cover production-scale voice-agent volumes; see pricing or the full comparison page. Among self-serve scanners under $100/mo, we are currently the only one with an audio pipeline in production.
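As a sketch only (the field names below are assumptions based on the description above, not the documented schema), an integration applying its own policy over the returned signals might look like:

```python
import json

# Hypothetical response shape: a 0-100 risk score, a flagged timestamp range,
# and both detection signals. Field names are illustrative, not the real schema.
raw = """{
  "risk_score": 87,
  "flagged_range": {"start_ms": 1200, "end_ms": 1850},
  "signals": {"waveform": 0.91, "transcript": 0.12}
}"""

resp = json.loads(raw)

# Apply your own thresholds over the two returned signals:
action = ("block"
          if resp["risk_score"] >= 70 or resp["signals"]["transcript"] >= 0.5
          else "allow")
print(action, resp["flagged_range"]["start_ms"], resp["flagged_range"]["end_ms"])
```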

Get early access

Related questions

Does this work with Deepgram / AWS / Gemini audio, not just Whisper?

Yes. The waveform classifier inspects the bytes before any STT touches them, so the choice of downstream STT does not matter. The transcript-side filter is STT-agnostic (it runs on whatever text comes out). Coverage is consistent across STT vendors.

What latency does this add to a real-time voice pipeline?

A typical scan adds tens of milliseconds for the waveform classifier, plus the time to run Whisper-small on the transcript side. Many voice pipelines already run Whisper-small for transcription — in that case you can reuse that transcript, and the marginal latency is just the waveform classifier. Full numbers land in the public API docs at launch.

Does this cover deepfake or voice-cloning attacks?

No. Deepfake detection is an identity-authenticity problem; audio PI detection is an instruction-payload problem. They often co-occur (a cloned voice delivering an adversarial payload) but the defences are orthogonal. We stay in our lane; pair us with a deepfake detector if you need both.

Further reading