Audio Transcription Pipeline
VAD-first, evidence-backed Whisper transcription running locally on Apple Silicon. Produces high-quality transcripts with zero hallucinations, correct domain terminology, and LLM-polished output.
Architecture: Audio → ffmpeg (16kHz WAV) → Silero-VAD (speech segmentation) → MLX Whisper → Dictionary replacement → Claude LLM correction → Output.
Prerequisites
- macOS with Apple Silicon (M1/M2/M3/M4) — required for MLX
- Python 3.9+ — Xcode Python at `/Library/Developer/CommandLineTools/usr/bin/python3` or Homebrew Python
- ffmpeg — for audio preprocessing (`brew install ffmpeg`)
- Anthropic API key — for the LLM correction step (optional but recommended)
Setup
Detection Order
Check if the pipeline is already installed:
- Check config — `ls ~/.config/transcribe/client.py`
- Check Python packages — `python3 -c "import mlx_whisper, torch, scipy, anthropic"`
- No installation — run the First-Time Setup below
Using Existing Installation
If ~/.config/transcribe/client.py exists, the pipeline is ready:
# CLI usage
PYTHON="/Library/Developer/CommandLineTools/usr/bin/python3"
$PYTHON ~/.config/transcribe/client.py "/path/to/audio.m4a"
# With options
$PYTHON ~/.config/transcribe/client.py "/path/to/audio.m4a" --language no --no-llm
Or via Python API:
import sys
from pathlib import Path
sys.path.insert(0, str(Path.home() / '.config/transcribe'))
from client import transcribe
result = transcribe("/path/to/audio.m4a")
print(f"Output: {result.corrected_path}")
print(f"Words: {result.corrected_text.split().__len__()}")
print(f"Speed: {result.audio_duration_s / result.total_time_s:.1f}x realtime")
First-Time Setup
Step 1: Identify the Python environment
MLX requires Apple Silicon. Find the right Python:
# Option A: Xcode Python (most reliable for ML packages)
PYTHON="/Library/Developer/CommandLineTools/usr/bin/python3"
$PYTHON --version
# Option B: Homebrew Python
PYTHON="python3"
$PYTHON --version
Use whichever has mlx installed, or install from scratch with one of them.
Step 2: Install Python packages
# Core ML packages
$PYTHON -m pip install --user mlx mlx_whisper
# VAD (Voice Activity Detection) — requires torch
$PYTHON -m pip install --user torch torchaudio
# Audio processing
$PYTHON -m pip install --user numpy scipy
# LLM correction (optional but recommended)
$PYTHON -m pip install --user anthropic
Verify installation:
$PYTHON -c "
import mlx_whisper; print('mlx_whisper OK')
import torch; print(f'torch {torch.__version__} OK')
import scipy; print('scipy OK')
import numpy; print('numpy OK')
try:
    import anthropic; print('anthropic OK')
except ImportError:
    print('anthropic NOT installed (LLM correction will be unavailable)')
"
Step 3: Install ffmpeg
# macOS
brew install ffmpeg
# Verify
ffmpeg -version | head -1
Step 4: Create the pipeline directory
mkdir -p ~/.config/transcribe/models
Step 5: Create configuration files
Create ~/.config/transcribe/config.json:
{
"models": {
"norwegian": "nb-whisper-medium-mlx",
"norwegian_fallback": "aalst/nb-whisper-large-distil-turbo-beta-mlx",
"english": "mlx-community/whisper-large-v3-turbo",
"multilingual": "mlx-community/whisper-large-v3-turbo"
},
"local_model_dir": "~/.config/transcribe/models",
"language": "auto",
"auto_detect_segments": 5,
"auto_detect_norwegian_threshold": 0.8,
"vad_enabled": true,
"vad_chunk_max_seconds": 28,
"vad_padding_ms": 400,
"llm_correction": true,
"llm_model": "claude-sonnet-4-20250514",
"llm_chunk_words": 500,
"output_dir": "/tmp/transcribe",
"whisper_params": {
"condition_on_previous_text": false,
"best_of": 1,
"compression_ratio_threshold": 1.8,
"no_speech_threshold": 0.5,
"temperature": [0.0, 0.2, 0.4, 0.6, 0.8, 1.0],
"hallucination_silence_threshold": 2.0
}
}
Create ~/.config/transcribe/dictionary.json:
{
"initial_prompts": {
"no": "Velkommen til møtet. Vi diskuterer teknologi, innovasjon og prosjektledelse.",
"en": "Welcome to the meeting. We discuss technology, innovation and project management.",
"auto": "Meeting about technology and innovation."
},
"replacements": {},
"context_hints": {
"no": [
"This is a Norwegian meeting transcript",
"Fix obvious speech-to-text errors only"
],
"en": [
"This is an English meeting transcript",
"Fix obvious speech-to-text errors only"
],
"auto": [
"This transcript may contain both Norwegian and English",
"Fix obvious speech-to-text errors only"
]
}
}
Customize the dictionary for the user's domain:
- `initial_prompts` — Natural sentences with domain vocabulary (max 224 tokens). These bias Whisper toward correct recognition.
- `replacements` — Regex-based fixes for known misrecognitions (e.g., `"cloud code": "Claude Code"`). Case-insensitive.
- `context_hints` — Instructions for the LLM correction pass describing the recording's context.
Step 6: Create the .env file (for LLM correction)
echo "ANTHROPIC_API_KEY=sk-ant-..." > ~/.config/transcribe/.env
The pipeline loads this automatically if ANTHROPIC_API_KEY is not in the shell environment.
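For illustration, a minimal loader with that fallback behavior (a hypothetical helper, not necessarily how client.py implements it):

```python
import os
from pathlib import Path
from typing import Optional

def load_api_key(env_path: Path = Path.home() / ".config/transcribe/.env") -> Optional[str]:
    """Return ANTHROPIC_API_KEY from the environment, falling back to the .env file."""
    key = os.environ.get("ANTHROPIC_API_KEY")
    if key:
        return key
    if env_path.exists():
        for line in env_path.read_text().splitlines():
            if line.startswith("ANTHROPIC_API_KEY="):
                return line.split("=", 1)[1].strip()
    return None
```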
Step 7: Write the pipeline script
Create ~/.config/transcribe/client.py with the full pipeline code:
- If the user already has it: check `~/.config/transcribe/client.py`
- If not: the complete pipeline (~400 lines) must be written. It implements the 6-stage architecture described in the Pipeline Architecture section below.
The client.py shebang should match the Python used in Step 1:
#!/Library/Developer/CommandLineTools/usr/bin/python3
Step 8: Download models (happens automatically on first run)
The first transcription will download the required Whisper model from HuggingFace (~1.5GB). Subsequent runs use the cached model.
# Pre-download Norwegian model (optional)
$PYTHON -c "import mlx_whisper; mlx_whisper.transcribe('/dev/null', path_or_hf_repo='aalst/nb-whisper-large-distil-turbo-beta-mlx')" 2>/dev/null
# Pre-download English/multilingual model (optional)
$PYTHON -c "import mlx_whisper; mlx_whisper.transcribe('/dev/null', path_or_hf_repo='mlx-community/whisper-large-v3-turbo')" 2>/dev/null
Usage
CLI
PYTHON="/Library/Developer/CommandLineTools/usr/bin/python3"
# Basic (auto language detection, VAD, LLM correction)
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a"
# Force Norwegian
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --language no
# Force English
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --language en
# Skip LLM correction (faster, raw output only)
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --no-llm
# Skip VAD (process as single chunk — NOT recommended)
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --no-vad
# Re-run only LLM correction on existing transcript
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --llm-only
# Custom output directory
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" --output-dir /path/to/output
Python API
import sys
from pathlib import Path
sys.path.insert(0, str(Path.home() / '.config/transcribe'))
from client import transcribe
# Full pipeline
result = transcribe("/path/to/audio.m4a")
# Access results
print(result.corrected_text) # Final corrected transcript
print(result.raw_text) # Raw Whisper output
print(result.corrected_path) # Path to corrected.txt
print(result.raw_path) # Path to raw.txt
print(result.metadata_path) # Path to metadata.json
print(result.language) # Detected language
print(result.model_used) # Which model was used
print(result.audio_duration_s) # Audio length in seconds
print(result.num_vad_segments) # Number of VAD chunks
print(result.transcribe_time_s) # Whisper processing time
print(result.llm_correct_time_s) # LLM correction time
print(result.total_time_s) # Total pipeline time
# With overrides
result = transcribe(
"/path/to/audio.m4a",
language="no", # Force Norwegian
vad_enabled=True, # Default
llm_correction=False, # Skip LLM step
output_dir="/custom/path", # Custom output
initial_prompt="Meeting about renewable energy at Equinor..."
)
Running as Background Task
Transcription of long audio files takes significant time. Run as a background task:
PYTHON="/Library/Developer/CommandLineTools/usr/bin/python3"
$PYTHON ~/.config/transcribe/client.py "/path/to/meeting.m4a" 2>&1
Set run_in_background: true when invoking via Bash tool. Monitor output for progress:
- `[1/6] Preprocessing...` — ffmpeg conversion
- `[2/6] VAD Segmentation...` — speech detection, shows chunk count
- `[3/6] Transcribing...` — chunk-by-chunk progress
- `[5/6] LLM Correction...` — parallel API calls
- `COMPLETE` — with final stats
Output
The pipeline saves to {output_dir}/{audio_name}/:
| File | Contents | When |
|---|---|---|
| `raw.txt` | Raw Whisper output (no corrections) | Always |
| `dict_corrected.txt` | After dictionary replacements | Always |
| `corrected.txt` | After LLM correction | When LLM enabled |
| `metadata.json` | Stats, timing, config snapshot | Always |
| `input.wav` | 16kHz mono WAV (preprocessed) | Always |
| `chunks/` | Individual chunk WAVs and transcripts | Always |
metadata.json contains:
{
"audio_path": "/original/path.m4a",
"audio_duration_s": 3261.8,
"language_mode": "auto",
"language_detected": "no",
"model_used": "nb-whisper-medium-mlx",
"vad_enabled": true,
"num_vad_segments": 140,
"llm_correction": true,
"llm_model": "claude-sonnet-4-20250514",
"transcribe_time_s": 298.8,
"llm_correct_time_s": 41.2,
"total_time_s": 340.0,
"raw_word_count": 7005,
"corrected_word_count": 6975
}
Report key stats to the user: duration, processing speed (Xx realtime), word count, output paths.
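As a small sketch of that report, reading the metadata.json fields shown above (the function name is illustrative):

```python
import json
from pathlib import Path

def report_stats(result_dir):
    """Print the key stats from a finished run's metadata.json."""
    result_dir = Path(result_dir)
    meta = json.loads((result_dir / "metadata.json").read_text())
    speed = meta["audio_duration_s"] / meta["total_time_s"]
    print(f"Duration:   {meta['audio_duration_s'] / 60:.1f} min")
    print(f"Speed:      {speed:.1f}x realtime")
    print(f"Words:      {meta['corrected_word_count']}")
    print(f"Transcript: {result_dir / 'corrected.txt'}")
```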
Pipeline Architecture
Stage 1: Preprocess (ffmpeg)
Converts any audio format (.m4a, .mp3, .wav, .ogg, .flac, etc.) to 16kHz mono WAV for Whisper. Skips if input.wav already exists (safe to re-run).
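A minimal sketch of this stage (the exact ffmpeg flags in client.py may differ):

```python
import subprocess
from pathlib import Path

def preprocess(audio_path, out_dir):
    """Convert any input to 16 kHz mono 16-bit PCM WAV; skip if already converted."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    wav_path = out_dir / "input.wav"
    if wav_path.exists():
        return wav_path  # safe to re-run
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(audio_path),
         "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", str(wav_path)],
        check=True, capture_output=True,
    )
    return wav_path
```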
Stage 2: VAD Segmentation (Silero-VAD)
The #1 quality intervention. Silero-VAD detects speech segments and merges them into chunks of maximum 28 seconds (matching Whisper's 30s internal window). Each chunk gets 400ms padding at boundaries.
Why this matters:
- Whisper hallucinates during silence (repeats text, generates phantom speech)
- VAD eliminates silence from Whisper's input entirely
- Reduces hallucination rate from ~20% to 0.2% (Baranski et al., 2025)
- 687 raw VAD segments → 140 merged chunks on a 54-min meeting
Audio loading uses scipy.io.wavfile (not torchaudio — no audio backends on macOS).
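For orientation, a minimal sketch of this stage, assuming the torch.hub interface to Silero-VAD (`get_speech_timestamps` returning start/end sample indices); the merge loop is illustrative rather than the exact logic in client.py:

```python
import numpy as np
import torch
from scipy.io import wavfile

SR, MAX_CHUNK_S, PAD_S = 16000, 28, 0.4

# Load with scipy (torchaudio has no audio backends on macOS), int16 -> float32 in [-1, 1]
sr, audio = wavfile.read("input.wav")
wav = torch.from_numpy(audio.astype(np.float32) / 32768.0)

# Silero-VAD via torch.hub; get_speech_timestamps returns [{'start': s, 'end': e}, ...] in samples
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps = utils[0]
speech = get_speech_timestamps(wav, model, sampling_rate=SR)

# Merge adjacent speech segments into chunks of at most 28 s, then pad 400 ms on each side
chunks, current = [], None
for seg in speech:
    start, end = seg["start"] / SR, seg["end"] / SR
    if current is not None and end - current[0] <= MAX_CHUNK_S:
        current[1] = end                      # extend the open chunk
    else:
        if current is not None:
            chunks.append(current)
        current = [start, end]                # open a new chunk
if current is not None:
    chunks.append(current)
chunks = [(max(0.0, s - PAD_S), e + PAD_S) for s, e in chunks]
```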
Stage 3: Transcription (MLX Whisper)
Language-aware model routing:
| Language | Model | When |
|---|---|---|
| Norwegian (`no`) | NB-Whisper Medium MLX | Pure Norwegian audio |
| English (`en`) | whisper-large-v3-turbo | Pure English audio |
| Auto (default) | Detect → route | Mixed or unknown |
Auto-detection: Runs Whisper language detection on the first 5 VAD segments. If >80% detected as Norwegian → NB-Whisper. Otherwise → standard whisper-large-v3-turbo.
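A hedged sketch of that routing, assuming the result dict from `mlx_whisper.transcribe` carries a `language` key (as in OpenAI Whisper); transcribing a few short chunks with the multilingual model is one simple way to obtain the detection, though client.py may do it differently:

```python
import mlx_whisper

def route_model(chunk_paths, config):
    """Detect the dominant language on the first few VAD chunks and pick a model."""
    sample = chunk_paths[:config["auto_detect_segments"]]
    langs = [
        mlx_whisper.transcribe(p, path_or_hf_repo=config["models"]["multilingual"])["language"]
        for p in sample
    ]
    norwegian_ratio = langs.count("no") / max(len(langs), 1)
    if norwegian_ratio >= config["auto_detect_norwegian_threshold"]:
        # client.py falls back to models.norwegian_fallback if the local model is not converted
        return config["models"]["norwegian"], "no"
    return config["models"]["multilingual"], None  # standard turbo model for English/mixed audio
```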
Evidence-backed Whisper parameters:
- `condition_on_previous_text=False` — prevents hallucination propagation between chunks
- No `beam_size` parameter — mlx_whisper only supports greedy decoding (beam search not implemented)
- `temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)` — fallback tuple for recovery when the model gets stuck
- `hallucination_silence_threshold=2.0` — marginal but helps catch remaining silence artifacts
- `initial_prompt` — natural sentence with domain vocabulary (not a term list), max 224 tokens
Memory management between chunks:
mx.set_cache_limit(100_000_000) # 100MB cache limit
mx.clear_cache() # Between each chunk
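Putting the decode parameters and cache management together, a per-chunk loop could look roughly like this; `model_repo`, `language`, `initial_prompt`, and `chunk_paths` are placeholder names, and the exact call in client.py may differ:

```python
import mlx.core as mx
import mlx_whisper

mx.set_cache_limit(100_000_000)                 # cap the MLX buffer cache at ~100 MB
texts = []
for chunk_path in chunk_paths:                  # one WAV file per VAD chunk
    result = mlx_whisper.transcribe(
        chunk_path,
        path_or_hf_repo=model_repo,             # chosen by the language routing above
        language=language,                      # "no", "en", or None in auto mode
        initial_prompt=initial_prompt,          # natural sentence, max 224 tokens
        condition_on_previous_text=False,
        temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
        compression_ratio_threshold=1.8,
        no_speech_threshold=0.5,
        hallucination_silence_threshold=2.0,
        best_of=1,
        # no beam_size here: mlx_whisper only supports greedy decoding
    )
    texts.append(result["text"].strip())
    mx.clear_cache()                            # release cached buffers between chunks
raw_text = "\n".join(texts)
```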
Stage 4: Dictionary Replacement
Regex-based case-insensitive replacements from dictionary.json. Catches systematic ASR errors that are consistent across runs:
{
"cloud code": "Claude Code",
"entropic": "Anthropic",
"terroristiske": "deterministiske",
"co-pilot": "CoPilot"
}
This is the only stage with 100% precision — it fixes known errors every time.
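A minimal sketch of this pass, assuming the dictionary.json layout from Step 5:

```python
import json
import re
from pathlib import Path

def apply_replacements(text, dict_path=Path.home() / ".config/transcribe/dictionary.json"):
    """Apply case-insensitive regex find/replace pairs from dictionary.json."""
    replacements = json.loads(dict_path.read_text())["replacements"]
    for pattern, fix in replacements.items():
        text = re.sub(pattern, fix, text, flags=re.IGNORECASE)
    return text
```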
Stage 5: LLM Correction (Claude Sonnet)
Splits text into ~500-word chunks and sends them in parallel (4 concurrent API calls) to Claude Sonnet for correction. The system prompt instructs strict error-fixing only:
- Fix obvious misrecognitions
- Fix punctuation and capitalization
- Preserve the speaker's actual meaning
- Do NOT rephrase, summarize, or restructure
- Return ONLY the corrected text
Context hints from dictionary.json provide domain knowledge (company names, technical terms, meeting context).
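A hedged sketch of this stage using the anthropic SDK and a thread pool; the real system prompt and chunking in client.py may differ:

```python
from concurrent.futures import ThreadPoolExecutor
import anthropic

SYSTEM_PROMPT = (
    "You fix speech-to-text errors in a transcript chunk. "
    "Fix obvious misrecognitions, punctuation and capitalization. "
    "Preserve the speaker's meaning. Do NOT rephrase, summarize or restructure. "
    "Return ONLY the corrected text."
)

def llm_correct(text, hints, model="claude-sonnet-4-20250514", chunk_words=500):
    """Correct a transcript in ~500-word chunks with 4 parallel Claude calls."""
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]
    client = anthropic.Anthropic()  # uses ANTHROPIC_API_KEY from the environment or .env fallback

    def correct(chunk):
        msg = client.messages.create(
            model=model,
            max_tokens=2048,
            system=SYSTEM_PROMPT + "\nContext:\n" + "\n".join(hints),
            messages=[{"role": "user", "content": chunk}],
        )
        return msg.content[0].text.strip()

    with ThreadPoolExecutor(max_workers=4) as pool:  # 4 concurrent API calls
        return "\n".join(pool.map(correct, chunks))
```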
Loads ANTHROPIC_API_KEY from:
- Environment variable (if set)
- `~/.config/transcribe/.env` file (fallback)
Stage 6: Output
Saves raw.txt, dict_corrected.txt, corrected.txt, and metadata.json. Each intermediate file serves as a resume point — if the pipeline fails at stage 5, the raw transcript is preserved.
Configuration Reference
config.json
| Key | Default | Description |
|---|---|---|
| `models.norwegian` | nb-whisper-medium-mlx | Model for Norwegian audio |
| `models.norwegian_fallback` | aalst/nb-whisper-large-distil-turbo-beta-mlx | Fallback if local model not converted |
| `models.english` | mlx-community/whisper-large-v3-turbo | Model for English audio |
| `models.multilingual` | mlx-community/whisper-large-v3-turbo | Model for auto-detection and mixed audio |
| `local_model_dir` | ~/.config/transcribe/models | Where converted models are stored |
| `language` | auto | Default language mode (no, en, auto) |
| `auto_detect_segments` | 5 | Number of VAD segments used for language detection |
| `auto_detect_norwegian_threshold` | 0.8 | Ratio of Norwegian-detected segments to use NB-Whisper |
| `vad_enabled` | true | Enable VAD segmentation (strongly recommended) |
| `vad_chunk_max_seconds` | 28 | Maximum chunk duration in seconds |
| `vad_padding_ms` | 400 | Padding added to each side of speech segments |
| `llm_correction` | true | Enable Claude LLM correction pass |
| `llm_model` | claude-sonnet-4-20250514 | Claude model for LLM correction |
| `llm_chunk_words` | 500 | Words per LLM correction chunk |
| `output_dir` | /tmp/transcribe | Base output directory |
dictionary.json
| Key | Description |
|---|---|
| `initial_prompts` | Per-language natural sentences with domain vocabulary (max 224 tokens) |
| `replacements` | Regex-based case-insensitive find/replace pairs |
| `context_hints` | Per-language lists of context lines for LLM correction prompts |
Customizing for a Domain
To adapt the pipeline for a new domain:
- Add initial prompts with relevant vocabulary as natural sentences:
  `"no": "Velkommen til styremøte i Equinor. Vi diskuterer havvind, karbonfangst, og produksjonsrapporter."`
- Add replacements for words you know Whisper consistently misrecognizes:
  `"equi nor": "Equinor", "havvin": "havvind"`
- Add context hints so the LLM correction understands the domain:
  `"no": [ "This is a Norwegian board meeting at an energy company", "Key terms: havvind, karbonfangst, CCS, produksjonsvolum" ]`
Performance
Benchmarked on a 54-minute Norwegian meeting recording (M1 MacBook Pro 16GB):
| Metric | Value |
|---|---|
| Audio duration | 3,262s (54 min) |
| Transcription time | 299s |
| LLM correction time | 41s |
| Total time | 340s |
| Processing speed | 9.6x realtime |
| VAD segments | 140 chunks (from 687 raw) |
| Word count | 6,975 (corrected) |
| Hallucinations | Zero |
| Brand name accuracy | 100% (via dictionary) |
Comparison with Other Approaches
| Approach | Hallucinations | Brand Names | Words |
|---|---|---|---|
| This pipeline | None | 100% | 6,975 |
| OpenAI Whisper API | None | 0% | 7,974 |
| MLX basic (single pass) | Severe (44x repeats) | 0% | 7,973 |
| whisper-large-v3-turbo | Moderate (7x repeats) | 0% | 9,077 |
Critical Gotchas
1. mlx_whisper uses kwargs, not a dict
# WRONG — causes TypeError
mlx_whisper.transcribe(path, decode_options={"language": "no"})
# CORRECT — spread as kwargs
mlx_whisper.transcribe(path, language="no", best_of=1)
The mlx_whisper.transcribe() signature uses **decode_options (variadic kwargs), not decode_options: dict. The docstring is misleading.
2. mlx_whisper does NOT implement beam search
# WRONG — raises NotImplementedError
mlx_whisper.transcribe(path, beam_size=1)
# CORRECT — omit beam_size entirely (greedy decoding only)
mlx_whisper.transcribe(path)
Any non-None beam_size triggers NotImplementedError: Beam search decoder is not yet implemented. Greedy decoding (the default) is equivalent to beam_size=1.
3. torchaudio has no audio backends on macOS
# WRONG — fails with "No audio backend is available"
import torchaudio
wav, sr = torchaudio.load("audio.wav")
# CORRECT — use scipy instead
import numpy as np
import torch
from scipy.io import wavfile

sr, audio_np = wavfile.read("audio.wav")
audio_float = audio_np.astype(np.float32) / 32768.0  # int16 → float32
wav = torch.from_numpy(audio_float)
torchaudio.list_audio_backends() returns [] on macOS with torchaudio 2.8.0. Use scipy.io.wavfile for WAV loading.
4. MLX cache API is deprecated
# WRONG — deprecated
mx.metal.set_cache_limit(100_000_000)
mx.metal.clear_cache()
# CORRECT — new API
mx.set_cache_limit(100_000_000)
mx.clear_cache()
5. ANTHROPIC_API_KEY not in subprocess environment
The Python process spawned by the agent may not have the API key in its environment. The pipeline loads from ~/.config/transcribe/.env as fallback. If LLM correction fails with auth errors, check:
# Verify key exists
cat ~/.config/transcribe/.env
# Or set in environment before running
export ANTHROPIC_API_KEY="sk-ant-..."
6. Temperature must be a tuple, not a float
# WRONG — no fallback recovery
temperature=0.0
# CORRECT — fallback tuple for when model gets stuck
temperature=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0)
When Whisper gets stuck on a chunk (high compression ratio), it retries with progressively higher temperatures.
7. initial_prompt must be natural text
# WRONG — term list (Whisper treats as transcript prefix)
initial_prompt="Claude Code, Anthropic, MCP, CoPilot, Vercel"
# CORRECT — natural sentence (max 224 tokens)
initial_prompt="Velkommen til møte om teknologi. Vi diskuterer Claude Code, Anthropic, CoPilot og MCP-servere."
Whisper uses the initial prompt as if it were previous transcript text. A term list creates unnatural conditioning. A natural sentence biases vocabulary without distorting output.
8. NB-Whisper Medium MLX needs conversion
The NB-Whisper Medium model (NbAiLab/nb-whisper-medium) is in HuggingFace format, not MLX. Until converted, the pipeline falls back to aalst/nb-whisper-large-distil-turbo-beta-mlx. To convert:
$PYTHON ~/.config/transcribe/convert_model.py
This creates ~/.config/transcribe/models/nb-whisper-medium-mlx/. The fallback model works well — conversion is optional but may improve Norwegian WER.
Supported Audio Formats
Any format ffmpeg can decode: .m4a, .mp3, .wav, .ogg, .flac, .aac, .wma, .webm, .opus, .amr, .mp4 (audio track).
Evidence Basis
The architecture decisions are backed by empirical research:
| Decision | Evidence | Source |
|---|---|---|
| VAD preprocessing | Reduces hallucinations from ~20% to 0.2% | Baranski et al., 2025 (arXiv:2501.11378) |
| 28s chunk length | Matches Whisper's 30s internal window | NB-Whisper recommendation, WhisperX design |
| Greedy decoding (beam=1) | beam=1 → 19.5% hallucination vs beam=5 → 28-37% | Baranski et al., 2025 |
| NB-Whisper for Norwegian | WER 7.2% vs OpenAI Large-v3 10.4% | NB-Whisper benchmarks |
| condition_on_previous=False | Prevents hallucination propagation between chunks | Standard Whisper best practice |
| Temperature fallback tuple | Recovery when model stuck on dialect variation | Default Whisper behavior |