whisper-test

Transcribe WAV audio files using OpenAI Whisper for intelligibility testing. Use when transcribing audio, testing speech output quality, running whisper, or checking if generated audio is intelligible.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the skill with: npx skills add trevors/dot-claude/trevors-dot-claude-whisper-test

Whisper Audio Intelligibility Test

Transcribe WAV audio files using OpenAI Whisper and report whether the speech is intelligible. Optionally compare against expected text.

Setup

Whisper is installed as a uv tool: uv tool install openai-whisper.

Since this machine may lack ffmpeg, always use the Python API approach, which loads WAV files with scipy and so bypasses the ffmpeg requirement entirely.
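The loading step presumably looks something like the sketch below (this is illustrative, not the actual transcribe.py code; the function name is an assumption). Whisper's model.transcribe() accepts a mono float32 NumPy array at 16 kHz, so a scipy-based loader only needs to normalize, downmix, and resample:

```python
# Sketch of a scipy-based WAV loader for Whisper (assumed, not the real
# transcribe.py). Produces mono float32 at 16 kHz -- no ffmpeg involved.
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample

def load_wav_for_whisper(path: str, target_sr: int = 16000) -> np.ndarray:
    sr, data = wavfile.read(path)
    # Normalize integer PCM to float32 in [-1, 1]
    if np.issubdtype(data.dtype, np.integer):
        data = data.astype(np.float32) / np.iinfo(data.dtype).max
    else:
        data = data.astype(np.float32)
    # Downmix stereo to mono
    if data.ndim == 2:
        data = data.mean(axis=1)
    # Resample to Whisper's expected 16 kHz
    if sr != target_sr:
        data = resample(data, int(len(data) * target_sr / sr)).astype(np.float32)
    return data

# Usage (inside the uv environment described above):
#   import whisper
#   model = whisper.load_model("large-v3")
#   result = model.transcribe(load_wav_for_whisper("output.wav"), language="en")
```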

Running Transcription

Use uv run --no-project --with openai-whisper --with scipy --python 3.11 to execute the transcription script:

uv run --no-project --with openai-whisper --with scipy --python 3.11 \
  python3 ~/.claude/skills/whisper-test/transcribe.py \
  [--model tiny|base|small|medium|large-v3] \
  [--language en] \
  [--expected "expected text"] \
  [--json] \
  file1.wav [file2.wav ...]

Arguments

  • --model: Whisper model size (default: large-v3). See model selection guide below.
  • --language: Language hint (default: en).
  • --expected: Expected transcription text. When provided, calculates Word Error Rate (WER).
  • --json: Output results as JSON instead of human-readable text.
  • Positional: One or more WAV file paths.
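The WER reported with --expected is presumably the standard word-level edit distance divided by the reference length. A minimal reference implementation (the exact text normalization in transcribe.py may differ; lowercasing and punctuation stripping are assumptions here):

```python
# Word Error Rate as a word-level Levenshtein distance (a sketch of what
# transcribe.py likely computes; its normalization rules are assumed).
import re

def wer(expected: str, actual: str) -> float:
    ref = re.sub(r"[^\w\s]", "", expected.lower()).split()
    hyp = re.sub(r"[^\w\s]", "", actual.lower()).split()
    # Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("Hello world, this is a test.", "hello world this is a test"))  # 0.0
print(wer("hello world", "hello word"))  # 0.5
```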

Model Selection

Use large-v3 for TTS quality verification. Smaller models hallucinate or miss words in synthesized speech, making them unreliable for judging output quality.

Model      VRAM     When to use
large-v3   ~10 GB   Default. TTS evaluation, quality gating, regression testing
medium     ~5 GB    GPU memory constrained, still decent accuracy
small      ~2 GB    Quick smoke tests only
base       ~1 GB    Not recommended for TTS (high hallucination rate)
tiny       ~1 GB    Not recommended for TTS (unreliable)

Observed with identical Qwen3-TTS 1.7B voice-cloned output:

  • large-v3: "That's one tank. Flash attention pipeline." (key phrase captured)
  • base: "That's one thing, flash attention pipeline." (close but hallucinated)

For poor-quality 0.6B output, base hallucinated "Charging Wheel" while large-v3 gave "Flat, splashes." — honest about the poor quality instead of confabulating plausible words.

Output Format

For each file, prints:

filename.wav:
  transcription: "Hello world, this is a test."
  duration: 2.96s
  rms: 0.0866
  peak: 0.6832
  silence: 49.2%
  [wer: 0.0%]  (if --expected provided)

Interpreting Results

Transcription           Meaning
Matches expected text   Audio is intelligible and correct
Partial match           Audio has some speech but quality issues
Empty string ""         Audio is unintelligible (noise, silence, or garbage)
Hallucinated text       Model heard something in noise (common with Whisper, especially smaller models)

Audio Quality Indicators

  • RMS < 0.01: Essentially silent
  • silence > 80%: Mostly silence, likely no speech
  • peak < 0.05: Very quiet, may not contain useful audio
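These metrics are straightforward to reproduce from the raw samples. A sketch (the silence threshold is an assumption; transcribe.py may use a different cutoff):

```python
# Rough reproduction of the reported audio metrics. The 0.01 amplitude
# cutoff for counting a sample as "silent" is an assumption, not taken
# from the upstream script.
import numpy as np

def audio_metrics(audio: np.ndarray, sr: int, silence_thresh: float = 0.01) -> dict:
    return {
        "duration": len(audio) / sr,                                  # seconds
        "rms": float(np.sqrt(np.mean(audio ** 2))),                   # loudness
        "peak": float(np.max(np.abs(audio))),                         # max amplitude
        "silence": float(np.mean(np.abs(audio) < silence_thresh)),    # silent fraction
    }
```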

TTS-Specific Patterns

Voice-cloned TTS output often has these characteristics:

  • Garbled opening, clear ending: Common with ICL voice cloning on short references. The model needs a few frames to "lock in" to the target voice.
  • Key phrases preserved: Even when WER is high, domain-specific terms (e.g. "flash attention pipeline") often come through clearly.
  • Smaller models produce worse audio: 0.6B models produce significantly less intelligible output than 1.7B — expect Whisper to reflect this.

Batch Testing (TTS Variant Comparison)

When testing multiple TTS outputs against expected text:

uv run --no-project --with openai-whisper --with scipy --python 3.11 \
  python3 ~/.claude/skills/whisper-test/transcribe.py \
  --expected "Hello world, this is a test." \
  variant1.wav variant2.wav variant3.wav

This produces a comparison table showing which variants produce intelligible speech.
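For larger sweeps, the --json output can be post-processed to rank variants by WER. Note the JSON schema below (a list of objects with "file" and "wer" keys) is an assumption for illustration, not confirmed by the upstream script:

```python
# Hypothetical post-processing of --json output: rank TTS variants by WER.
# The schema (keys "file" and "wer") is assumed, not documented upstream.
import json

raw = """[
  {"file": "variant1.wav", "wer": 0.12},
  {"file": "variant2.wav", "wer": 0.0},
  {"file": "variant3.wav", "wer": 0.45}
]"""
results = json.loads(raw)
for r in sorted(results, key=lambda r: r["wer"]):
    print(f'{r["file"]}: wer={r["wer"]:.1%}')
```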

Docker / NGC Container Usage

When testing on a GPU box inside an NGC container (e.g. for CUDA flash-attn builds), ffmpeg isn't available and apt can be slow. Two workarounds:

  1. Static ffmpeg binary (fast, no apt):

    curl -sL https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-arm64-static.tar.xz \
      | tar xJ --strip-components=1 -C /usr/local/bin/ --wildcards "*/ffmpeg" "*/ffprobe"
    pip install openai-whisper
    
  2. Use scipy loader (this script's default — no ffmpeg needed):

    pip install openai-whisper scipy
    python3 ~/.claude/skills/whisper-test/transcribe.py --model large-v3 output.wav
    

The script loads WAV files directly via scipy, bypassing Whisper's ffmpeg dependency entirely. This works for WAV files (the standard TTS output format).

