whisper-test

Transcribe WAV audio files using OpenAI Whisper for intelligibility testing. Use when transcribing audio, testing speech output quality, running whisper, or checking if generated audio is intelligible.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the skill with: npx skills add trevors/dot-claude/trevors-dot-claude-whisper-test

Whisper Audio Intelligibility Test

Transcribe WAV audio files using OpenAI Whisper and report whether the speech is intelligible. Optionally compare against expected text.

Setup

Whisper is installed as a uv tool: uv tool install openai-whisper.

Since this machine may lack ffmpeg, always use the Python API approach, which loads WAV files with scipy and so bypasses the ffmpeg requirement entirely.
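The loading step presumably looks something like the sketch below (this is illustrative, not the actual transcribe.py code; the function name is an assumption). Whisper's model.transcribe() accepts a mono float32 NumPy array at 16 kHz, so a scipy-based loader only needs to normalize, downmix, and resample:

```python
# Sketch of a scipy-based WAV loader for Whisper (assumed, not the real
# transcribe.py). Produces mono float32 at 16 kHz -- no ffmpeg involved.
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample

def load_wav_for_whisper(path: str, target_sr: int = 16000) -> np.ndarray:
    sr, data = wavfile.read(path)
    # Normalize integer PCM to float32 in [-1, 1]
    if np.issubdtype(data.dtype, np.integer):
        data = data.astype(np.float32) / np.iinfo(data.dtype).max
    else:
        data = data.astype(np.float32)
    # Downmix stereo to mono
    if data.ndim == 2:
        data = data.mean(axis=1)
    # Resample to Whisper's expected 16 kHz
    if sr != target_sr:
        data = resample(data, int(len(data) * target_sr / sr)).astype(np.float32)
    return data

# Usage (inside the uv environment described above):
#   import whisper
#   model = whisper.load_model("large-v3")
#   result = model.transcribe(load_wav_for_whisper("output.wav"), language="en")
```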

Running Transcription

Use uv run --no-project --with openai-whisper --with scipy --python 3.11 to execute the transcription script:

uv run --no-project --with openai-whisper --with scipy --python 3.11 \
  python3 ~/.claude/skills/whisper-test/transcribe.py \
  [--model tiny|base|small|medium|large-v3] \
  [--language en] \
  [--expected "expected text"] \
  [--json] \
  file1.wav [file2.wav ...]

Arguments

  • --model: Whisper model size (default: large-v3). See model selection guide below.
  • --language: Language hint (default: en).
  • --expected: Expected transcription text. When provided, calculates Word Error Rate (WER).
  • --json: Output results as JSON instead of human-readable text.
  • Positional: One or more WAV file paths.
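The WER reported with --expected is presumably the standard word-level edit distance divided by the reference length. A minimal reference implementation (the exact text normalization in transcribe.py may differ; lowercasing and punctuation stripping are assumptions here):

```python
# Word Error Rate as a word-level Levenshtein distance (a sketch of what
# transcribe.py likely computes; its normalization rules are assumed).
import re

def wer(expected: str, actual: str) -> float:
    ref = re.sub(r"[^\w\s]", "", expected.lower()).split()
    hyp = re.sub(r"[^\w\s]", "", actual.lower()).split()
    # Levenshtein distance via dynamic programming
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("Hello world, this is a test.", "hello world this is a test"))  # 0.0
print(wer("hello world", "hello word"))  # 0.5
```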

Model Selection

Use large-v3 for TTS quality verification. Smaller models hallucinate or miss words in synthesized speech, making them unreliable for judging output quality.

Model      VRAM     When to use
large-v3   ~10 GB   Default. TTS evaluation, quality gating, regression testing
medium     ~5 GB    GPU memory constrained, still decent accuracy
small      ~2 GB    Quick smoke tests only
base       ~1 GB    Not recommended for TTS (high hallucination rate)
tiny       ~1 GB    Not recommended for TTS (unreliable)

Observed with identical Qwen3-TTS 1.7B voice-cloned output:

  • large-v3: "That's one tank. Flash attention pipeline." (key phrase captured)
  • base: "That's one thing, flash attention pipeline." (close but hallucinated)

For poor-quality 0.6B output, base hallucinated "Charging Wheel" while large-v3 gave "Flat, splashes." — honest about the poor quality instead of confabulating plausible words.

Output Format

For each file, prints:

filename.wav:
  transcription: "Hello world, this is a test."
  duration: 2.96s
  rms: 0.0866
  peak: 0.6832
  silence: 49.2%
  [wer: 0.0%]  (if --expected provided)

Interpreting Results

Transcription           Meaning
Matches expected text   Audio is intelligible and correct
Partial match           Audio has some speech but quality issues
Empty string ""         Audio is unintelligible (noise, silence, or garbage)
Hallucinated text       Model heard something in noise (common with Whisper, especially smaller models)

Audio Quality Indicators

  • RMS < 0.01: Essentially silent
  • silence > 80%: Mostly silence, likely no speech
  • peak < 0.05: Very quiet, may not contain useful audio
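These metrics are straightforward to reproduce from the raw samples. A sketch (the silence threshold is an assumption; transcribe.py may use a different cutoff):

```python
# Rough reproduction of the reported audio metrics. The 0.01 amplitude
# cutoff for counting a sample as "silent" is an assumption, not taken
# from the upstream script.
import numpy as np

def audio_metrics(audio: np.ndarray, sr: int, silence_thresh: float = 0.01) -> dict:
    return {
        "duration": len(audio) / sr,                                  # seconds
        "rms": float(np.sqrt(np.mean(audio ** 2))),                   # loudness
        "peak": float(np.max(np.abs(audio))),                         # max amplitude
        "silence": float(np.mean(np.abs(audio) < silence_thresh)),    # silent fraction
    }
```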

TTS-Specific Patterns

Voice-cloned TTS output often has these characteristics:

  • Garbled opening, clear ending: Common with ICL voice cloning on short references. The model needs a few frames to "lock in" to the target voice.
  • Key phrases preserved: Even when WER is high, domain-specific terms (e.g. "flash attention pipeline") often come through clearly.
  • Smaller models produce worse audio: 0.6B models produce significantly less intelligible output than 1.7B — expect Whisper to reflect this.

Batch Testing (TTS Variant Comparison)

When testing multiple TTS outputs against expected text:

uv run --no-project --with openai-whisper --with scipy --python 3.11 \
  python3 ~/.claude/skills/whisper-test/transcribe.py \
  --expected "Hello world, this is a test." \
  variant1.wav variant2.wav variant3.wav

This produces a comparison table showing which variants produce intelligible speech.
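For larger sweeps, the --json output can be post-processed to rank variants by WER. Note the JSON schema below (a list of objects with "file" and "wer" keys) is an assumption for illustration, not confirmed by the upstream script:

```python
# Hypothetical post-processing of --json output: rank TTS variants by WER.
# The schema (keys "file" and "wer") is assumed, not documented upstream.
import json

raw = """[
  {"file": "variant1.wav", "wer": 0.12},
  {"file": "variant2.wav", "wer": 0.0},
  {"file": "variant3.wav", "wer": 0.45}
]"""
results = json.loads(raw)
for r in sorted(results, key=lambda r: r["wer"]):
    print(f'{r["file"]}: wer={r["wer"]:.1%}')
```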

Docker / NGC Container Usage

When testing on a GPU box inside an NGC container (e.g. for CUDA flash-attn builds), ffmpeg isn't available and apt can be slow. Two workarounds:

  1. Static ffmpeg binary (fast, no apt):

    curl -sL https://johnvansickle.com/ffmpeg/releases/ffmpeg-release-arm64-static.tar.xz \
      | tar xJ --strip-components=1 -C /usr/local/bin/ --wildcards "*/ffmpeg" "*/ffprobe"
    pip install openai-whisper
    
  2. Use scipy loader (this script's default — no ffmpeg needed):

    pip install openai-whisper scipy
    python3 ~/.claude/skills/whisper-test/transcribe.py --model large-v3 output.wav
    

The script loads WAV files directly via scipy, bypassing Whisper's ffmpeg dependency entirely. This works for WAV files (the standard TTS output format).

