local-tts

Local text-to-speech using Qwen3-TTS with mlx_audio (macOS Apple Silicon) or qwen-tts (Linux/Windows). Privacy-first offline TTS with natural, realistic voice cloning and voice design. Use for local, secure, high-quality multilingual speech synthesis.

Install skill "local-tts" with this command: npx skills add irachex/local-tts

Local TTS with Qwen3-TTS

Privacy-First | Offline | High-Quality | Natural Real Voices

Local text-to-speech synthesis using Qwen3-TTS models. Your text never leaves your machine.

Why Local TTS?

Unlike cloud TTS (Google, AWS, Azure), local-tts ensures:

  • Zero data transmission - 100% on-device processing
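  • Works offline - No network needed once the model weights have been downloaded
  • No API keys - No accounts or cloud services to configure
  • GDPR/HIPAA friendly - Text and audio stay on your machine, which simplifies compliance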

See privacy & security details.

Platform Overview

| Platform | Backend | Installation | Best For |
| --- | --- | --- | --- |
| macOS (Apple Silicon) | mlx_audio | pip install mlx-audio | M1/M2/M3/M4 Macs |
| Linux/Windows | qwen-tts | pip install qwen-tts | CUDA GPUs |
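
The platform table can be turned into a small lookup. This is a minimal sketch, not part of the skill: `pick_backend` is a hypothetical helper, and it assumes `platform.machine()` reports `arm64` on Apple Silicon (it reports `x86_64` when Python runs under Rosetta).

```python
import platform


def pick_backend(system: str, machine: str) -> str:
    """Return the pip package the platform table suggests for this machine."""
    if system == "Darwin" and machine == "arm64":
        return "mlx-audio"  # Apple Silicon Macs
    if system in ("Linux", "Windows"):
        return "qwen-tts"  # CUDA GPUs
    raise ValueError(f"no backend listed for {system}/{machine}")


# Typical call on the current machine:
#   pick_backend(platform.system(), platform.machine())
```

Intel Macs are not listed in the table, so the sketch raises instead of guessing.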

Quick Start

macOS

pip install mlx-audio
brew install ffmpeg

# Natural female voice
python -m mlx_audio.tts.generate \
    --text "Hello world" \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
    --voice Chelsie

Linux/Windows

pip install qwen-tts

# With optimizations (FlashAttention, bfloat16, auto-device)
python scripts/tts_linux.py "Hello world" --female

Key Concepts

--voice vs --instruct (Important)

| Model | --voice | --instruct | Notes |
| --- | --- | --- | --- |
| CustomVoice | Select preset voice | Add style/emotion | Can use together - voice + style control |
| VoiceDesign | N/A | Create voice from description | --instruct only |
| Base | N/A | N/A | For voice cloning with --ref_audio |

CustomVoice with style control:

python -m mlx_audio.tts.generate \
    --text "Hello there!" \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
    --voice Serena \
    --instruct "excited and enthusiastic"
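
The flag rules above can be checked before launching a generation. The following is a sketch under assumptions: `ALLOWED_FLAGS`, `model_family`, and `build_command` are hypothetical names, and the model family is inferred from the model ID string rather than from any mlx_audio API.

```python
import sys

# Flags each model family accepts, taken from the table above.
ALLOWED_FLAGS = {
    "CustomVoice": {"voice", "instruct"},
    "VoiceDesign": {"instruct"},
    "Base": {"ref_audio", "ref_text"},
}


def model_family(model_id: str) -> str:
    """Infer the family from a model ID such as ...-1.7B-CustomVoice-8bit."""
    for family in ALLOWED_FLAGS:
        if f"-{family}" in model_id:
            return family
    raise ValueError(f"unrecognized model ID: {model_id}")


def build_command(text: str, model_id: str, **flags: str) -> list:
    """Build the argv for mlx_audio.tts.generate, rejecting unsupported flags."""
    family = model_family(model_id)
    unsupported = set(flags) - ALLOWED_FLAGS[family]
    if unsupported:
        raise ValueError(f"{family} does not accept: {sorted(unsupported)}")
    cmd = [sys.executable, "-m", "mlx_audio.tts.generate",
           "--text", text, "--model", model_id]
    for name, value in flags.items():
        cmd += [f"--{name}", value]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`.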

9 Preset Voices (Open Source CustomVoice)

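| Voice | Gender | Language | Character |
| --- | --- | --- | --- |
| Chelsie | Female | English (American) | Gentle, empathetic |
| Serena | Female | English | Warm, gentle |
| Ono Anna | Female | Japanese | Playful |
| Sohee | Female | Korean | Warm |
| Aiden | Male | English (American) | Sunny |
| Dylan | Male | English | Natural |
| Eric | Male | English | Real |
| Ryan | Male | English | Natural |
| Uncle Fu | Male | Chinese | Youthful Beijing |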

Defaults: Female=Serena, Male=Aiden
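
For scripting, the voice table and defaults can be kept as plain data. This is an illustrative sketch: `PRESET_VOICES`, `voices_for`, and `default_voice` are hypothetical names, and the exact spelling the CLI expects for two-word voices such as "Ono Anna" is not confirmed by this page.

```python
# Voice name -> (gender, language), copied from the table above.
PRESET_VOICES = {
    "Chelsie": ("Female", "English (American)"),
    "Serena": ("Female", "English"),
    "Ono Anna": ("Female", "Japanese"),
    "Sohee": ("Female", "Korean"),
    "Aiden": ("Male", "English (American)"),
    "Dylan": ("Male", "English"),
    "Eric": ("Male", "English"),
    "Ryan": ("Male", "English"),
    "Uncle Fu": ("Male", "Chinese"),
}

DEFAULT_VOICE = {"Female": "Serena", "Male": "Aiden"}


def voices_for(gender=None, language=None) -> list:
    """List preset voices, optionally filtered by gender and language prefix."""
    return [
        name
        for name, (voice_gender, voice_language) in PRESET_VOICES.items()
        if (gender is None or voice_gender == gender)
        and (language is None or voice_language.startswith(language))
    ]


def default_voice(gender: str) -> str:
    """Return the documented default voice for 'female' or 'male'."""
    return DEFAULT_VOICE[gender.capitalize()]
```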

Usage Examples

CustomVoice (Preset Voices)

# Default female voice (Serena)
python -m mlx_audio.tts.generate \
    --text "Your text" --voice Serena --lang_code en \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit

# Default male voice (Aiden)
python -m mlx_audio.tts.generate \
    --text "Your text" --voice Aiden --lang_code en \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit

VoiceDesign (Text-Based)

python -m mlx_audio.tts.generate \
    --text "Hello" \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-8bit \
    --instruct "A warm female voice, professional and clear"

Long Text Generation

For long text, increase --max_tokens and enable --join_audio (macOS/MLX only):

python -m mlx_audio.tts.generate \
    --text "Your very long text here..." \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit \
    --voice Serena \
    --max_tokens 4096 \
    --join_audio \
    --output long_audio.wav
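
Long text usually lives in a file rather than on the command line. The sketch below builds the same command as above from a string; `long_text_command` and `MODEL` are hypothetical names, and collapsing whitespace before passing the text to `--text` is an assumption made for convenience, not something the skill requires.

```python
import sys

MODEL = "mlx-community/Qwen3-TTS-12Hz-1.7B-CustomVoice-8bit"


def long_text_command(text: str, output: str,
                      voice: str = "Serena", max_tokens: int = 4096) -> list:
    """Build the long-text argv shown above from an arbitrary string."""
    # Collapse newlines and repeated spaces so the text is a single argument line.
    flat_text = " ".join(text.split())
    return [
        sys.executable, "-m", "mlx_audio.tts.generate",
        "--text", flat_text,
        "--model", MODEL,
        "--voice", voice,
        "--max_tokens", str(max_tokens),
        "--join_audio",
        "--output", output,
    ]
```

A typical call reads the file with `pathlib.Path("chapter.txt").read_text(encoding="utf-8")` (the file name is a placeholder) and hands the result to `subprocess.run(cmd, check=True)`.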

Voice Cloning

python -m mlx_audio.tts.generate \
    --text "Cloned voice speaking" \
    --model mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit \
    --ref_audio sample.wav --ref_text "Sample transcript"
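
Before cloning, it can help to confirm that the reference clip is readable. This sketch uses only the standard-library `wave` module, so it handles uncompressed PCM WAV files only; `describe_reference` is a hypothetical helper, and the skill does not state any required length or sample rate for the reference.

```python
import wave


def describe_reference(path: str) -> dict:
    """Report channel count, sample rate, and duration of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate": rate,
            "seconds": frames / rate,
        }
```

`wave.open` raises `wave.Error` for files it cannot parse, which is a quick signal that the clip should be converted (for example with ffmpeg) first.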

Parameters

| Parameter | Description | Values |
| --- | --- | --- |
| --text | Text to speak | Required |
| --model | Model ID | See table below |
| --voice | Preset voice (CustomVoice) | Chelsie, Serena, Aiden, Ryan... |
| --instruct | Voice description (VoiceDesign) or style/emotion (CustomVoice) | e.g., "excited", "calm", "professional" |
| --speed | Speaking rate | 0.5-2.0 (default: 1.0) |
| --pitch | Voice pitch | 0.5-2.0 (default: 1.0) |
| --lang_code | Language | en, cn, ja, ko, de, fr... |
| --ref_audio | Reference for cloning | File path |
| --output | Output file | Path (auto-generated if omitted) |
| --max_tokens | Max generation tokens | Integer (default: 2048) - Increase for long text |
| --join_audio | Merge audio segments | true (default) or false - Recommended for long text |
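
The documented ranges for `--speed` and `--pitch` can be enforced in a wrapper before the command runs. This is a sketch only: `checked` and `rate_flags` are hypothetical helpers, and whether the backend itself rejects out-of-range values is not stated here.

```python
def checked(name: str, value: float, low: float = 0.5, high: float = 2.0) -> float:
    """Return value unchanged if it lies in the documented range, else raise."""
    if not low <= value <= high:
        raise ValueError(f"--{name} must be between {low} and {high}, got {value}")
    return value


def rate_flags(speed: float = 1.0, pitch: float = 1.0) -> list:
    """Return the --speed/--pitch arguments, omitting values left at the default."""
    flags = []
    for name, value in (("speed", speed), ("pitch", pitch)):
        value = checked(name, value)
        if value != 1.0:  # 1.0 is the documented default, so the flag can be omitted
            flags += [f"--{name}", str(value)]
    return flags
```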

Models

| Model | Size | Purpose |
| --- | --- | --- |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | 1.7B | 9 preset voices + style control |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | 1.7B | Text-based voice creation |
| Qwen3-TTS-12Hz-1.7B-Base | 1.7B | Voice cloning |
| Qwen3-TTS-12Hz-0.6B-* | 0.6B | Lightweight versions |

macOS: Add the mlx-community/ prefix and a quantization suffix (e.g., mlx-community/Qwen3-TTS-12Hz-1.7B-Base-8bit)
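
The naming pattern in the example can be captured in a helper. This is a sketch based only on the model IDs shown on this page: `mlx_model_id` is a hypothetical function, and it is an assumption that every family/size combination is published under `mlx-community/` with an `8bit` suffix.

```python
FAMILIES = ("CustomVoice", "VoiceDesign", "Base")
SIZES = ("1.7B", "0.6B")


def mlx_model_id(family: str, size: str = "1.7B", quant: str = "8bit") -> str:
    """Compose an MLX model ID following the pattern used in the examples."""
    if family not in FAMILIES:
        raise ValueError(f"unknown family: {family}")
    if size not in SIZES:
        raise ValueError(f"unknown size: {size}")
    # Only the 8bit suffix appears on this page; other values are unverified.
    return f"mlx-community/Qwen3-TTS-12Hz-{size}-{family}-{quant}"
```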

Scripts

  • scripts/tts_macos.py - macOS wrapper
  • scripts/tts_linux.py - Linux/Windows wrapper with optimizations

Optimizations (Linux/Windows)

tts_linux.py automatically enables:

  • FlashAttention - Faster, less memory
  • bfloat16 - Half the memory of float32 with the same exponent range
  • Auto device - CUDA → CPU fallback
  • Mixed precision - Speed + quality
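
The selection logic behind these optimizations might look like the sketch below. It is not the contents of tts_linux.py: `choose_runtime` and `detect_runtime` are hypothetical, and the `flash_attention_2` and `sdpa` strings follow the Hugging Face Transformers `attn_implementation` convention, which the script is only assumed to use.

```python
import importlib.util


def choose_runtime(cuda_available: bool, bf16_supported: bool,
                   flash_attn_installed: bool) -> dict:
    """Pick device, dtype, and attention backend with a CPU fallback."""
    if not cuda_available:
        return {"device": "cpu", "dtype": "float32", "attn_implementation": "sdpa"}
    return {
        "device": "cuda",
        "dtype": "bfloat16" if bf16_supported else "float16",
        "attn_implementation": "flash_attention_2" if flash_attn_installed else "sdpa",
    }


def detect_runtime() -> dict:
    """Probe the local environment; falls back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return choose_runtime(False, False, False)
    cuda = torch.cuda.is_available()
    bf16 = cuda and torch.cuda.is_bf16_supported()
    flash = importlib.util.find_spec("flash_attn") is not None
    return choose_runtime(cuda, bf16, flash)
```

Keeping the decision in a pure function makes the fallback order easy to test without a GPU.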

Troubleshooting

| Issue | Solution |
| --- | --- |
| macOS: Model not found | Use mlx-community/ prefix |
| macOS: Audio format | brew install ffmpeg |
| Linux: CUDA OOM | Use 0.6B models |
| Linux: Slow | Check CUDA: torch.cuda.is_available() |
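
Some of these checks can be automated before a run. The sketch below mirrors the table rows that can be tested without loading a model; `has_ffmpeg` and `troubleshooting_hints` are hypothetical helpers, and the CUDA status is passed in by the caller (for example from `torch.cuda.is_available()`).

```python
import shutil


def has_ffmpeg() -> bool:
    """Return True if an ffmpeg executable is on PATH."""
    return shutil.which("ffmpeg") is not None


def troubleshooting_hints(model_id: str, on_macos: bool,
                          ffmpeg_found: bool, cuda_available: bool = False) -> list:
    """Return the table's suggested fixes that apply to this setup."""
    hints = []
    if on_macos:
        if not model_id.startswith("mlx-community/"):
            hints.append("Use mlx-community/ prefix")
        if not ffmpeg_found:
            hints.append("brew install ffmpeg")
    elif not cuda_available:
        hints.append("CUDA not available: expect slow generation")
    return hints
```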

Version

1.0.0 - See VERSION and package.json
