Audio Transcription Pipeline
VAD-first, evidence-backed Whisper transcription running locally on Apple Silicon. Produces high-quality transcripts with zero hallucinations, correct domain terminology, and LLM-polished output.
Architecture: Audio → ffmpeg (16kHz WAV) → Silero-VAD (speech segmentation) → MLX Whisper → Dictionary replacement → Claude LLM correction → Output.
Prerequisites
- macOS with Apple Silicon (M1/M2/M3/M4) — required for MLX
- Python 3.9+ — Xcode Python at `/Library/Developer/CommandLineTools/usr/bin/python3` or Homebrew Python
- ffmpeg — for audio preprocessing (`brew install ffmpeg`)
- Anthropic API key — for the LLM correction step (optional but recommended)
Setup
Detection Order
Check if the pipeline is already installed:
- Check config — `ls ~/.config/transcribe/client.py`
- Check Python packages — `python3 -c "import mlx_whisper, torch, scipy, anthropic"`
- No installation — run the First-Time Setup below
Using Existing Installation
If ~/.config/transcribe/client.py exists, the pipeline is ready:
# CLI usage
PYTHON="/Library/Developer/CommandLineTools/usr/bin/python3"
$PYTHON ~/.config/transcribe/client.py "/path/to/audio.m4a"
# With options
$PYTHON ~/.config/transcribe/client.py "/path/to/audio.m4a" --language no --no-llm
Or via Python API:
import sys
from pathlib import Path
sys.path.insert(0, str(Path.home() / '.config/transcribe'))
from client import transcribe
result = transcribe("/path/to/audio.m4a")
print(f"Output: {result.corrected_path}")
print(f"Words: {result.corrected_text.split().__len__()}")
print(f"Speed: {result.audio_duration_s / result.total_time_s:.1f}x realtime")
First-Time Setup
Step 1: Identify the Python environment
MLX requires Apple Silicon. Find the right Python:
# Option A: Xcode Python (most reliable for ML packages)
PYTHON="/Library/Developer/CommandLineTools/usr/bin/python3"
$PYTHON --version
# Option B: Homebrew Python
PYTHON="python3"
$PYTHON --version
Use whichever has mlx installed, or install from scratch with one of them.
Step 2: Install Python packages
# Core ML packages
$PYTHON -m pip install --user mlx mlx_whisper
# VAD (Voice Activity Detection) — requires torch
$PYTHON -m pip install --user torch torchaudio
# Audio processing
$PYTHON -m pip install --user numpy scipy
# LLM correction (optional but recommended)
$PYTHON -m pip install --user anthropic
Verify installation:
$PYTHON -c "
import mlx_whisper; print('mlx_whisper OK')
import torch; print(f'torch {torch.__version__} OK')
import scipy; print('scipy OK')
import numpy; print('numpy OK')
try:
    import anthropic; print('anthropic OK')
except ImportError:
    print('anthropic NOT installed (LLM correction will be unavailable)')
"
Step 3: Install ffmpeg
# macOS
brew install ffmpeg
# Verify
ffmpeg -version | head -1
Step 4: Create the pipeline directory
mkdir -p ~/.config/transcribe/models
Step 5: Create configuration files
Create ~/.config/transcribe/config.json:
{
"models": {
"norwegian": "nb-whisper-medium-mlx",
"norwegian_fallback": "aalst/nb-whisper-large-distil-turbo-beta-mlx",
"english": "mlx-community/whisper-large-v3-turbo",
"multilingual": "mlx-community/whisper-large-v3-turbo"
},
"local_model_dir": "~/.config/transcribe/models",
"language": "auto",
"auto_detect_segments": 5,
"auto_detect_norwegian_threshold": 0.8,
"vad_enabled": true,
"vad_chunk_max_seconds": 28,
"vad_padding_ms": 400,
"llm_correction": true,
"llm_model": "claude-sonnet-4-20250514",
"llm_chunk_words": 500,
"output_dir": "/tmp/transcribe",
"whisper_params": {
"condition_on_previous_text": false,
"best_of": 1,
"compression_ratio_threshold": 1.8,
"no_speech_threshold": 0.5,
"temperature": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
"hallucination_silence_threshold": 2.0
}
}
Create ~/.config/transcribe/dictionary.json:
{
"initial_prompts": {
"no": "Velkommen til møtet. Vi diskuterer teknologi, innovasjon og prosjektledelse.",
"en": "Welcome to the meeting. We discuss technology, innovation and project management.",
"auto": "Meeting about technology and innovation."
},
"replacements": {},
"context_hints": {
"no": [
"This is a Norwegian meeting transcript",
"Fix obvious speech-to-text errors only"
],
"en": [
"This is an English meeting transcript",
"Fix obvious speech-to-text errors only"
],
"auto": [
"This transcript may contain both Norwegian and English",
"Fix obvious speech-to-text errors only"
]
}
}
Customize the dictionary for the user's domain:
- `initial_prompts` — Natural sentences with domain vocabulary (max 224 tokens). These bias Whisper toward correct recognition.
- `replacements` — Regex-based fixes for known misrecognitions (e.g., `"cloud code": "Claude Code"`). Case-insensitive.
- `context_hints` — Instructions for the LLM correction pass describing the recording's context.
Step 6: Create the .env file (for LLM correction)
echo "ANTHROPIC_API_KEY=sk-ant-..." > ~/.config/transcribe/.env
The pipeline loads this automatically if ANTHROPIC_API_KEY is not in the shell environment.
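For illustration, a minimal loader with that fallback behavior (a hypothetical helper, not necessarily how client.py implements it):

```python
import os
from pathlib import Path
from typing import Optional

def load_api_key(env_path: Path = Path.home() / ".config/transcribe/.env") -> Optional[str]:
    """Return ANTHROPIC_API_KEY from the environment, falling back to the .env file."""
    key = os.environ.get("ANTHROPIC_API_KEY")
    if key:
        return key
    if env_path.exists():
        for line in env_path.read_text().splitlines():
            if line.startswith("ANTHROPIC_API_KEY="):
                return line.split("=", 1)[1].strip()
    return None
```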
Step 7: Write the pipeline script
Create ~/.config/transcribe/client.py with the full pipeline code:
- If the user already has it: check `~/.config/transcribe/client.py`
- If not: the complete pipeline (~400 lines) must be written. It implements the 6-stage architecture described in the Pipeline Architecture section below.
The client.py shebang should match the Python used in Step 1:
#!/Library/Developer/CommandLineTools/usr/bin/python3
Step 8: Download models (happens automatically on first run)
The first transcription will download the required Whisper model from HuggingFace (~1.5GB). Subsequent runs use the cached model.
# Pre-download Norwegian model (optional)
$PYTHON -c "import mlx_whisper; mlx_whisper.transcribe('/dev/null', path_or_hf_repo='aalst/nb-whisper-large-distil-turbo-beta-mlx')" 2>/dev/null
# Pre-download English/multilingual model (optional)
$PYTHON -c "import mlx_whisper; mlx_whisper.transcribe('/dev/null', path_or_hf_repo='mlx-community/whisper-large-v3-turbo')" 2>/dev/null
Usage
CLI
PYTHON="/Library/Developer/CommandLineTools/usr/bin/python3"
# Basic (auto language detection, VAD, LLM correction)
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a"
# Force Norwegian
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --language no
# Force English
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --language en
# Skip LLM correction (faster, raw output only)
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --no-llm
# Skip VAD (process as single chunk — NOT recommended)
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --no-vad
# Re-run only LLM correction on existing transcript
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --llm-only
# Custom output directory
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --output-dir /path/to/output
Python API
import sys
from pathlib import Path
sys.path.insert(0, str(Path.home() / '.config/transcribe'))
from client import transcribe
# Full pipeline
result = transcribe("/path/to/audio.m4a")
# Access results
print(result.corrected_text) # Final corrected transcript
print(result.raw_text) # Raw Whisper output
print(result.corrected_path) # Path to corrected.txt
print(result.raw_path) # Path to raw.txt
print(result.metadata_path) # Path to metadata.json
print(result.language) # Detected language
print(result.model_used) # Which model was used
print(result.audio_duration_s) # Audio length in seconds
print(result.num_vad_segments) # Number of VAD chunks
print(result.transcribe_time_s) # Whisper processing time
print(result.llm_correct_time_s) # LLM correction time
print(result.total_time_s) # Total pipeline time
# With overrides
result = transcribe(
"/path/to/audio.m4a",
language="no", # Force Norwegian
vad_enabled=True, # Default
llm_correction=False, # Skip LLM step
output_dir="/custom/path", # Custom output
initial_prompt="Meeting about renewable energy at Equinor..."
)
Running as Background Task
Transcription of long audio files takes significant time. Run as a background task:
PYTHON="/Library/Developer/CommandLineTools/usr/bin/python3"
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" 2>&1
Set run_in_background: true when invoking via Bash tool. Monitor output for progress:
- `[1/6] Preprocessing...` — ffmpeg conversion
- `[2/6] VAD Segmentation...` — speech detection, shows chunk count
- `[3/6] Transcribing...` — chunk-by-chunk progress
- `[5/6] LLM Correction...` — parallel API calls
- `COMPLETE` — with final stats
Output
The pipeline saves to {output_dir}/{audio_name}/:
| File | Contents | When |
|---|---|---|
| `raw.txt` | Raw Whisper output (no corrections) | Always |
| `dict_corrected.txt` | After dictionary replacements | Always |
| `corrected.txt` | After LLM correction | When LLM enabled |
| `metadata.json` | Stats, timing, config snapshot | Always |
| `input.wav` | 16kHz mono WAV (preprocessed) | Always |
| `chunks/` | Individual chunk WAVs and transcripts | Always |
metadata.json contains:
{
"audio_path": "/original/path.m4a",
"audio_duration_s": 3261.8,
"language_mode": "auto",
"language_detected": "no",
"model_used": "nb-whisper-medium-mlx",
"vad_enabled": true,
"num_vad_segments": 140,
"llm_correction": true,
"llm_model": "claude-sonnet-4-20250514",
"transcribe_time_s": 298.8,
"llm_correct_time_s": 41.2,
"total_time_s": 340.0,
"raw_word_count": 7005,
"corrected_word_count": 6975
}
Report key stats to the user: duration, processing speed (Xx realtime), word count, output paths.
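As a small sketch of that report, reading the metadata.json fields shown above (the function name is illustrative):

```python
import json
from pathlib import Path

def report_stats(result_dir):
    """Print the key stats from a finished run's metadata.json."""
    result_dir = Path(result_dir)
    meta = json.loads((result_dir / "metadata.json").read_text())
    speed = meta["audio_duration_s"] / meta["total_time_s"]
    print(f"Duration:   {meta['audio_duration_s'] / 60:.1f} min")
    print(f"Speed:      {speed:.1f}x realtime")
    print(f"Words:      {meta['corrected_word_count']}")
    print(f"Transcript: {result_dir / 'corrected.txt'}")
```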
Pipeline Architecture
Stage 1: Preprocess (ffmpeg)
Converts any audio format (.m4a, .mp3, .wav, .ogg, .flac, etc.) to 16kHz mono WAV for Whisper. Skips if input.wav already exists (safe to re-run).
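A minimal sketch of this stage (the exact ffmpeg flags in client.py may differ):

```python
import subprocess
from pathlib import Path

def preprocess(audio_path, out_dir):
    """Convert any input to 16 kHz mono 16-bit PCM WAV; skip if already converted."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    wav_path = out_dir / "input.wav"
    if wav_path.exists():
        return wav_path  # safe to re-run
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(audio_path),
         "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(wav_path)],
        check=True, capture_output=True,
    )
    return wav_path
```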
Stage 2: VAD Segmentation (Silero-VAD)
The #1 quality intervention. Silero-VAD detects speech segments and merges them into chunks of maximum 28 seconds (matching Whisper's 30s internal window). Each chunk gets 400ms padding at boundaries.
Why this matters:
- Whisper hallucinates during silence (repeats text, generates phantom speech)
- VAD eliminates silence from Whisper's input entirely
- Reduces hallucination rate from ~20% to 0.2% (Baranski et al., 2025)
- 687 raw VAD segments → 140 merged chunks on a 54-min meeting
Audio loading uses scipy.io.wavfile (not torchaudio — no audio backends on macOS).
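For orientation, a minimal sketch of this stage, assuming the torch.hub interface to Silero-VAD (`get_speech_timestamps` returning start/end sample indices); the merge loop is illustrative rather than the exact logic in client.py:

```python
import numpy as np
import torch
from scipy.io import wavfile

SR, MAX_CHUNK_S, PAD_S = 16000, 28, 0.4

# Load with scipy (torchaudio has no audio backends on macOS), int16 -> float32 in [-1, 1]
sr, audio = wavfile.read("input.wav")
wav = torch.from_numpy(audio.astype(np.float32) / 32768.0)

# Silero-VAD via torch.hub; get_speech_timestamps returns [{'start': s, 'end': e}, ...] in samples
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]
speech = get_speech_timestamps(wav, model, sampling_rate=SR)

# Merge adjacent speech segments into chunks of at most 28 s, then pad 400 ms on each side
chunks, current = [], None
for seg in speech:
    start, end = seg["start"] / SR, seg["end"] / SR
    if current is not None and end - current[0] <= MAX_CHUNK_S:
        current[1] = end                      # extend the open chunk
    else:
        if current is not None:
            chunks.append(current)
        current = [start, end]                # open a new chunk
if current is not None:
    chunks.append(current)
chunks = [(max(0.0, s - PAD_S), e + PAD_S) for s, e in chunks]
```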
Stage 3: Transcription (MLX Whisper)
Language-aware model routing:
| Language | Model | When |
|---|---|---|
| Norwegian (`no`) | NB-Whisper Medium MLX | Pure Norwegian audio |
| English (`en`) | whisper-large-v3-turbo | Pure English audio |
| Auto (default) | Detect → route | Mixed or unknown |
Auto-detection: Runs Whisper language detection on the first 5 VAD segments. If >80% detected as Norwegian → NB-Whisper. Otherwise → standard whisper-large-v3-turbo.
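A hedged sketch of that routing, assuming the result dict from `mlx_whisper.transcribe` carries a `language` key (as in OpenAI Whisper); transcribing a few short chunks with the multilingual model is one simple way to obtain the detection, though client.py may do it differently:

```python
import mlx_whisper

def route_model(chunk_paths, config):
    """Detect the dominant language on the first few VAD chunks and pick a model."""
    sample = chunk_paths[:config["auto_detect_segments"]]
    langs = [
        mlx_whisper.transcribe(p, path_or_hf_repo=config["models"]["multilingual"])["language"]
        for p in sample
    ]
    norwegian_ratio = langs.count("no") / max(len(langs), 1)
    if norwegian_ratio >= config["auto_detect_norwegian_threshold"]:
        # client.py falls back to models.norwegian_fallback if the local model is not converted
        return config["models"]["norwegian"], "no"
    return config["models"]["multilingual"], None  # standard turbo model for English/mixed audio
```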
Evidence-backed Whisper parameters:
- `condition_on_previous_text=False` — prevents hallucination propagation between chunks
- No `beam_size` parameter — mlx_whisper only supports greedy decoding (beam search not implemented)
- `temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)` — fallback tuple for recovery when the model gets stuck
- `hallucination_silence_threshold=2.0` — marginal but helps catch remaining silence artifacts
- `initial_prompt` — natural sentence with domain vocabulary (not a term list), max 224 tokens
Memory management between chunks:
mx.set_cache_limit(100_000_000) # 100MB cache limit
mx.clear_cache() # Between each chunk
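Putting the decode parameters and cache management together, a per-chunk loop could look roughly like this; `model_repo`, `language`, `initial_prompt`, and `chunk_paths` are placeholder names, and the exact call in client.py may differ:

```python
import mlx.core as mx
import mlx_whisper

mx.set_cache_limit(100_000_000)                 # cap the MLX buffer cache at ~100 MB
texts = []
for chunk_path in chunk_paths:                  # one WAV file per VAD chunk
    result = mlx_whisper.transcribe(
        chunk_path,
        path_or_hf_repo=model_repo,             # chosen by the language routing above
        language=language,                      # "no", "en", or None in auto mode
        initial_prompt=initial_prompt,          # natural sentence, max 224 tokens
        condition_on_previous_text=False,
        temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
        compression_ratio_threshold=1.8,
        no_speech_threshold=0.5,
        hallucination_silence_threshold=2.0,
        best_of=1,
        # no beam_size here: mlx_whisper only supports greedy decoding
    )
    texts.append(result["text"].strip())
    mx.clear_cache()                            # release cached buffers between chunks
raw_text = "\n".join(texts)
```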
Stage 4: Dictionary Replacement
Regex-based case-insensitive replacements from dictionary.json. Catches systematic ASR errors that are consistent across runs:
{
"cloud code": "Claude Code",
"entropic": "Anthropic",
"terroristiske": "deterministiske",
"co-pilot": "CoPilot"
}
This is the only stage with 100% precision — it fixes known errors every time.
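A minimal sketch of this pass, assuming the dictionary.json layout from Step 5:

```python
import json
import re
from pathlib import Path

def apply_replacements(text, dict_path=Path.home() / ".config/transcribe/dictionary.json"):
    """Apply case-insensitive regex find/replace pairs from dictionary.json."""
    replacements = json.loads(dict_path.read_text())["replacements"]
    for pattern, fix in replacements.items():
        text = re.sub(pattern, fix, text, flags=re.IGNORECASE)
    return text
```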
Stage 5: LLM Correction (Claude Sonnet)
Splits text into ~500-word chunks and sends them in parallel (4 concurrent API calls) to Claude Sonnet for correction. The system prompt instructs strict error-fixing only:
- Fix obvious misrecognitions
- Fix punctuation and capitalization
- Preserve the speaker's actual meaning
- Do NOT rephrase, summarize, or restructure
- Return ONLY the corrected text
Context hints from dictionary.json provide domain knowledge (company names, technical terms, meeting context).
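A hedged sketch of this stage using the anthropic SDK and a thread pool; the real system prompt and chunking in client.py may differ:

```python
from concurrent.futures import ThreadPoolExecutor
import anthropic

SYSTEM_PROMPT = (
    "You fix speech-to-text errors in a transcript chunk. "
    "Fix obvious misrecognitions, punctuation and capitalization. "
    "Preserve the speaker's meaning. Do NOT rephrase, summarize or restructure. "
    "Return ONLY the corrected text."
)

def llm_correct(text, hints, model="claude-sonnet-4-20250514", chunk_words=500):
    """Correct a transcript in ~500-word chunks with 4 parallel Claude calls."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY from the environment or .env fallback

    def correct(chunk):
        msg = client.messages.create(
            model=model,
            max_tokens=2048,
            system=SYSTEM_PROMPT + "\nContext:\n" + "\n".join(hints),
            messages=[{"role": "user", "content": chunk}],
        )
        return msg.content[0].text.strip()

    with ThreadPoolExecutor(max_workers=4) as pool:  # 4 concurrent API calls
        return "\n".join(pool.map(correct, chunks))
```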
Loads ANTHROPIC_API_KEY from:
- Environment variable (if set)
- `~/.config/transcribe/.env` file (fallback)
Stage 6: Output
Saves raw.txt, dict_corrected.txt, corrected.txt, and metadata.json. Each intermediate file serves as a resume point — if the pipeline fails at stage 5, the raw transcript is preserved.
Configuration Reference
config.json
| Key | Default | Description |
|---|---|---|
| `models.norwegian` | nb-whisper-medium-mlx | Model for Norwegian audio |
| `models.norwegian_fallback` | aalst/nb-whisper-large-distil-turbo-beta-mlx | Fallback if local model not converted |
| `models.english` | mlx-community/whisper-large-v3-turbo | Model for English audio |
| `models.multilingual` | mlx-community/whisper-large-v3-turbo | Model for auto-detection and mixed audio |
| `local_model_dir` | ~/.config/transcribe/models | Where converted models are stored |
| `language` | auto | Default language mode (no, en, auto) |
| `auto_detect_segments` | 5 | Number of VAD segments used for language detection |
| `auto_detect_norwegian_threshold` | 0.8 | Ratio of Norwegian-detected segments to use NB-Whisper |
| `vad_enabled` | true | Enable VAD segmentation (strongly recommended) |
| `vad_chunk_max_seconds` | 28 | Maximum chunk duration in seconds |
| `vad_padding_ms` | 400 | Padding added to each side of speech segments |
| `llm_correction` | true | Enable Claude LLM correction pass |
| `llm_model` | claude-sonnet-4-20250514 | Claude model for LLM correction |
| `llm_chunk_words` | 500 | Words per LLM correction chunk |
| `output_dir` | /tmp/transcribe | Base output directory |
dictionary.json
| Key | Description |
|---|---|
| `initial_prompts` | Per-language natural sentences with domain vocabulary (max 224 tokens) |
| `replacements` | Regex-based case-insensitive find/replace pairs |
| `context_hints` | Per-language lists of context lines for LLM correction prompts |
Customizing for a Domain
To adapt the pipeline for a new domain:
- Add initial prompts with relevant vocabulary as natural sentences:
  `"no": "Velkommen til styremøte i Equinor. Vi diskuterer havvind, karbonfangst, og produksjonsrapporter."`
- Add replacements for words you know Whisper consistently misrecognizes:
  `"equi nor": "Equinor", "havvin": "havvind"`
- Add context hints so the LLM correction understands the domain:
  `"no": [ "This is a Norwegian board meeting at an energy company", "Key terms: havvind, karbonfangst, CCS, produksjonsvolum" ]`
Performance
Benchmarked on a 54-minute Norwegian meeting recording (M1 MacBook Pro 16GB):
| Metric | Value |
|---|---|
| Audio duration | 3,262s (54 min) |
| Transcription time | 299s |
| LLM correction time | 41s |
| Total time | 340s |
| Processing speed | 9.6x realtime |
| VAD segments | 140 chunks (from 687 raw) |
| Word count | 6,975 (corrected) |
| Hallucinations | Zero |
| Brand name accuracy | 100% (via dictionary) |
Comparison with Other Approaches
| Approach | Hallucinations | Brand Names | Words |
|---|---|---|---|
| This pipeline | None | 100% | 6,975 |
| OpenAI Whisper API | None | 0% | 7,974 |
| MLX basic (single pass) | Severe (44x repeats) | 0% | 7,973 |
| whisper-large-v3-turbo | Moderate (7x repeats) | 0% | 9,077 |
Critical Gotchas
1. mlx_whisper uses kwargs, not a dict
# WRONG — causes TypeError
mlx_whisper.transcribe(path, decode_options={"language": "no"})
# CORRECT — spread as kwargs
mlx_whisper.transcribe(path, language="no", best_of=1)
The mlx_whisper.transcribe() signature uses **decode_options (variadic kwargs), not decode_options: dict. The docstring is misleading.
2. mlx_whisper does NOT implement beam search
# WRONG — raises NotImplementedError
mlx_whisper.transcribe(path, beam_size=1)
# CORRECT — omit beam_size entirely (greedy decoding only)
mlx_whisper.transcribe(path)
Any non-None beam_size triggers NotImplementedError: Beam search decoder is not yet implemented. Greedy decoding (the default) is equivalent to beam_size=1.
3. torchaudio has no audio backends on macOS
# WRONG — fails with "No audio backend is available"
import torchaudio
wav, sr = torchaudio.load("audio.wav")
# CORRECT — use scipy instead
import numpy as np
import torch
from scipy.io import wavfile

sr, audio_np = wavfile.read("audio.wav")
audio_float = audio_np.astype(np.float32) / 32768.0  # int16 → float32
wav = torch.from_numpy(audio_float)
torchaudio.list_audio_backends() returns [] on macOS with torchaudio 2.8.0. Use scipy.io.wavfile for WAV loading.
4. MLX cache API is deprecated
# WRONG — deprecated
mx.metal.set_cache_limit(100_000_000)
mx.metal.clear_cache()
# CORRECT — new API
mx.set_cache_limit(100_000_000)
mx.clear_cache()
5. ANTHROPIC_API_KEY not in subprocess environment
The Python process spawned by the agent may not have the API key in its environment. The pipeline loads from ~/.config/transcribe/.env as fallback. If LLM correction fails with auth errors, check:
# Verify key exists
cat ~/.config/transcribe/.env
# Or set in environment before running
export ANTHROPIC_API_KEY="sk-ant-..."
6. Temperature must be a tuple, not a float
# WRONG — no fallback recovery
temperature=0.0
# CORRECT — fallback tuple for when model gets stuck
temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
When Whisper gets stuck on a chunk (high compression ratio), it retries with progressively higher temperatures.
7. initial_prompt must be natural text
# WRONG — term list (Whisper treats as transcript prefix)
initial_prompt="Claude Code, Anthropic, MCP, CoPilot, Vercel"
# CORRECT — natural sentence (max 224 tokens)
initial_prompt="Velkommen til møte om teknologi. Vi diskuterer Claude Code, Anthropic, CoPilot og MCP-servere."
Whisper uses the initial prompt as if it were previous transcript text. A term list creates unnatural conditioning. A natural sentence biases vocabulary without distorting output.
8. NB-Whisper Medium MLX needs conversion
The NB-Whisper Medium model (NbAiLab/nb-whisper-medium) is in HuggingFace format, not MLX. Until converted, the pipeline falls back to aalst/nb-whisper-large-distil-turbo-beta-mlx. To convert:
$PYTHON ~/.config/transcribe/convert_model.py
This creates ~/.config/transcribe/models/nb-whisper-medium-mlx/. The fallback model works well — conversion is optional but may improve Norwegian WER.
Supported Audio Formats
Any format ffmpeg can decode: .m4a, .mp3, .wav, .ogg, .flac, .aac, .wma, .webm, .opus, .amr, .mp4 (audio track).
Evidence Basis
The architecture decisions are backed by empirical research:
| Decision | Evidence | Source |
|---|---|---|
| VAD preprocessing | Reduces hallucinations from ~20% to 0.2% | Baranski et al., 2025 (arXiv:2501.11378) |
| 28s chunk length | Matches Whisper's 30s internal window | NB-Whisper recommendation, WhisperX design |
| Greedy decoding (beam=1) | beam=1 → 19.5% hallucination vs beam=5 → 28-37% | Baranski et al., 2025 |
| NB-Whisper for Norwegian | WER 7.2% vs OpenAI Large-v3 10.4% | NB-Whisper benchmarks |
| condition_on_previous=False | Prevents hallucination propagation between chunks | Standard Whisper best practice |
| Temperature fallback tuple | Recovery when model stuck on dialect variation | Default Whisper behavior |