Local TTS Skill

Generate high-quality speech audio locally on Apple Silicon, using MLX acceleration and the Kokoro-82M model. No API keys or recurring costs are required.

Quick Start

# Generate MP3 from text
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --text "Hello, this is a test." \
    --output ~/Desktop/test.mp3

# Generate from file
uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --file /tmp/script.txt \
    --voice af_heart \
    --output ~/Desktop/podcast.mp3

# List available voices
uv run --with mlx-audio skills/local-tts/scripts/list_voices.py

Parameters

Parameter	Required	Default	Description
`--text`	Yes, unless `--file` is given	-	Text to convert
`--file`	Yes, unless `--text` is given	-	Path to text file
`--voice`	No	`af_heart`	Voice preset
`--output`	Yes	-	Output file path (e.g. .mp3, .wav)
`--model`	No	`Kokoro-82M-bf16`	Model to use
`--list-voices`	No	-	Show available voices
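
The source of generate_audio.py is not shown here, so the following is only a sketch of how the parameters in the table could be declared with argparse. The function names `build_parser` and `parse_args` are hypothetical, and skipping the `--output` requirement when `--list-voices` is passed is an assumption, not documented behavior.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare the flags from the Parameters table (illustrative only)."""
    parser = argparse.ArgumentParser(description="Generate speech audio locally.")
    # --text and --file are alternatives, so they share an exclusive group.
    source = parser.add_mutually_exclusive_group()
    source.add_argument("--text", help="Text to convert")
    source.add_argument("--file", help="Path to text file")
    parser.add_argument("--voice", default="af_heart", help="Voice preset")
    parser.add_argument("--output", help="Output file path")
    parser.add_argument("--model", default="Kokoro-82M-bf16", help="Model to use")
    parser.add_argument("--list-voices", action="store_true",
                        help="Show available voices")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv and enforce the 'Required' column of the table."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: listing voices needs neither an input nor an output path.
    if not args.list_voices:
        if args.text is None and args.file is None:
            parser.error("one of --text or --file is required")
        if args.output is None:
            parser.error("--output is required")
    return args
```

Calling `parse_args(["--text", "Hello", "--output", "out.mp3"])` leaves `voice` and `model` at the defaults listed in the table.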

Voice Presets

American English Female (prefix: af_)

af_heart - Warm, friendly (default)
af_bella - Soft, calm
af_nova - Clear, professional
af_river - Clear, confident
af_sarah - Soft, expressive

American English Male (prefix: am_)

am_adam - Clear, professional
am_echo - Deep, smooth
am_liam - Articulate, conversational
am_michael - Soft, measured

British English (prefix: bf_, bm_)

bf_emma - Clear, refined female
bm_daniel - Clear, professional male
bm_george - Distinguished male

See references/voices.md for the full list.
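
The two-letter prefixes above encode accent and gender. As a small illustration, the helper below decodes them for the prefixes listed in this section; `describe_voice` is a hypothetical name, not part of the skill, and prefixes that appear only in references/voices.md are not covered.

```python
# Prefix letters taken from the headings above: a/b for accent, f/m for gender.
ACCENTS = {"a": "American English", "b": "British English"}
GENDERS = {"f": "female", "m": "male"}


def describe_voice(voice: str) -> tuple[str, str, str]:
    """Split a preset such as 'bm_george' into (accent, gender, name)."""
    prefix, sep, name = voice.partition("_")
    if not sep or len(prefix) != 2 or not name:
        raise ValueError(f"unexpected voice format: {voice!r}")
    accent_code, gender_code = prefix
    if accent_code not in ACCENTS or gender_code not in GENDERS:
        raise ValueError(f"unknown voice prefix: {prefix!r}")
    return ACCENTS[accent_code], GENDERS[gender_code], name
```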

Output Format

{
  "success": true,
  "file": "/Users/hagelk/Desktop/podcast.mp3",
  "voice": "af_heart",
  "model": "Kokoro-82M-bf16",
  "characters": 9824,
  "chunks": 20,
  "duration_seconds": 612.5,
  "generation_time": 45.2
}
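
A caller can parse this summary to check for success and see how fast generation ran. The sketch below assumes the JSON is available as a string and that `generation_time` is measured in seconds (the unit is not stated above); `realtime_factor` is a hypothetical helper.

```python
import json


def realtime_factor(summary_json: str) -> float:
    """Return seconds of audio produced per second of generation time."""
    summary = json.loads(summary_json)
    if not summary.get("success"):
        raise RuntimeError("TTS generation did not report success")
    # Assumption: generation_time is in seconds, like duration_seconds.
    return summary["duration_seconds"] / summary["generation_time"]


example = """
{
  "success": true,
  "file": "/Users/hagelk/Desktop/podcast.mp3",
  "voice": "af_heart",
  "model": "Kokoro-82M-bf16",
  "characters": 9824,
  "chunks": 20,
  "duration_seconds": 612.5,
  "generation_time": 45.2
}
"""
factor = realtime_factor(example)
```

Under that unit assumption, the sample values work out to roughly 13.5 seconds of audio per second of generation.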

Performance

Hardware	Speed	Notes
M3 Pro 36GB	~3-4x realtime	First run slower (model loading)
M1/M2 Mac Mini 8GB	~1.5x realtime	Works well for briefings
M1/M2 Mac Mini 16GB	~2x realtime	Comfortable headroom
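
To turn these figures into a wait time, the sketch below assumes "Nx realtime" means N seconds of audio are produced per second of wall-clock time. The function name is hypothetical, and the first-run model loading overhead is ignored.

```python
def estimated_generation_seconds(audio_seconds: float, realtime_speed: float) -> float:
    """Estimate wall-clock generation time for a given audio length."""
    if realtime_speed <= 0:
        raise ValueError("realtime_speed must be positive")
    return audio_seconds / realtime_speed


# A 10-minute briefing at ~1.5x realtime (the 8GB Mac Mini row above).
wait = estimated_generation_seconds(600, 1.5)  # 400.0 seconds
```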

Technical Details

Model: Kokoro-82M-bf16 (~200MB download on first run)
Sample rate: 24kHz mono
Chunking: Text is split into chunks of ~400 characters each to preserve quality
Concatenation: Chunks joined seamlessly via pydub
Formats: MP3, WAV, M4A, OGG
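
The exact splitting logic in generate_audio.py is not shown here. The sketch below is one plausible way to get chunks of at most ~400 characters: it packs whole sentences into each chunk and only hard-splits a sentence that is itself too long. `chunk_text` and the sentence regex are assumptions for illustration.

```python
import re


def chunk_text(text: str, max_chars: int = 400) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The sentence does not fit: close the current chunk first.
        if current:
            chunks.append(current)
            current = ""
        # A single over-long sentence is cut at the last space before the limit,
        # or at the limit itself when it contains no space.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            chunks.append(sentence[:cut].rstrip())
            sentence = sentence[cut:].lstrip()
        current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk would then be synthesized separately and the resulting audio segments joined in order.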

Important Notes

MUST use --with flags - Pass dependencies on the command line; do not declare them as PEP 723 inline script metadata. mlx-audio requires uv's cached environment.
First run is slower - The model (~200MB) is downloaded and the espeak dependencies are initialized.
Model cached at: ~/.cache/huggingface/hub/models--mlx-community--Kokoro-82M-bf16/
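
To tell in advance whether a run will trigger the download, a script can check for the cache directory. This sketch assumes the default Hugging Face cache location given above (a custom cache location would need a different `hub_dir`); `model_is_cached` is a hypothetical helper.

```python
from pathlib import Path
from typing import Optional

# Directory name taken from the cache path noted above.
MODEL_DIR_NAME = "models--mlx-community--Kokoro-82M-bf16"


def model_is_cached(hub_dir: Optional[Path] = None) -> bool:
    """Return True if the Kokoro model directory already exists in the hub cache."""
    if hub_dir is None:
        # Assumption: the default cache location is in use.
        hub_dir = Path.home() / ".cache" / "huggingface" / "hub"
    return (hub_dir / MODEL_DIR_NAME).is_dir()
```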

Integration with Morning Briefing

The morning-briefing skill uses this for podcast generation:

uv run --with mlx-audio --with pydub skills/local-tts/scripts/generate_audio.py \
    --file /tmp/morning_briefing_podcast.txt \
    --voice af_heart \
    --output ~/Desktop/morning_briefing.mp3
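
If the briefing is driven from Python rather than a shell, the same command can be assembled as an argument list. This is a sketch, not the morning-briefing skill's actual code: both function names are hypothetical, and the relative script path assumes the working directory is the repository root. Because no shell is involved, `~` is expanded explicitly.

```python
import subprocess
from pathlib import Path

SCRIPT = "skills/local-tts/scripts/generate_audio.py"


def build_tts_command(script_file: str, output: str, voice: str = "af_heart") -> list[str]:
    """Build the uv command shown above as an argument list."""
    return [
        "uv", "run", "--with", "mlx-audio", "--with", "pydub", SCRIPT,
        "--file", script_file,
        "--voice", voice,
        # subprocess does not expand "~" the way a shell does.
        "--output", str(Path(output).expanduser()),
    ]


def generate_briefing_audio(script_file: str, output: str) -> None:
    """Run the TTS script and raise CalledProcessError if it fails."""
    subprocess.run(build_tts_command(script_file, output), check=True)
```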
