qwen3-tts-mlx

Local Qwen3-TTS speech synthesis on Apple Silicon via MLX. Use for offline narration, audiobooks, video voiceovers, and multilingual TTS.

Install skill "qwen3-tts-mlx" with this command: npx skills add agiseek/agent-skills/agiseek-agent-skills-qwen3-tts-mlx

Qwen3-TTS MLX

Run Qwen3-TTS locally on Apple Silicon (M1/M2/M3/M4) using MLX. Supports 10 languages plus automatic language detection, 9 built-in voices, voice cloning, and voice design from text descriptions.

When to Use

  • Generate speech fully offline on a Mac
  • Produce narration, audiobooks, podcasts, or video voiceovers
  • Create multilingual TTS with controllable style and emotion
  • Clone any voice from a short audio sample
  • Design custom voices from text descriptions

Quick Start

Install

pip install mlx-audio
brew install ffmpeg

Basic Usage

python scripts/run_tts.py custom-voice \
  --text "Hello, welcome to local text to speech." \
  --voice Ryan \
  --output output.wav

With Style Control

python scripts/run_tts.py custom-voice \
  --text "Breaking news: local AI model achieves human-level speech." \
  --voice Uncle_Fu \
  --instruct "news anchor tone, calm and authoritative" \
  --output news.wav
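
When these commands are launched from another Python program, it can help to assemble the argument list in code instead of by string concatenation. The following is a minimal sketch, assuming scripts/run_tts.py accepts the flags exactly as shown in the commands above; build_tts_command is a hypothetical helper and not part of the skill.

```python
import shlex


def build_tts_command(mode, text, output, voice=None, instruct=None):
    """Assemble an argument list for scripts/run_tts.py.

    Hypothetical helper: it only mirrors the flags shown in this document.
    """
    cmd = ["python", "scripts/run_tts.py", mode, "--text", text]
    if voice is not None:
        cmd += ["--voice", voice]
    if instruct is not None:
        cmd += ["--instruct", instruct]
    cmd += ["--output", output]
    return cmd


cmd = build_tts_command(
    "custom-voice",
    text="Breaking news: local AI model achieves human-level speech.",
    output="news.wav",
    voice="Uncle_Fu",
    instruct="news anchor tone, calm and authoritative",
)

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems when the text contains quotes or punctuation.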

Model Variants

| Variant | Model | Size | Memory | Use Case |
| --- | --- | --- | --- | --- |
| CustomVoice | mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit | ~1 GB | ~4 GB | Built-in voices + style control (recommended) |
| VoiceDesign | mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit | ~2 GB | ~5 GB | Create voices from text descriptions |
| Base | mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit | ~1 GB | ~4 GB | Voice cloning from reference audio |
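
The variant table can be expressed as a small lookup when a script needs to choose a model per mode. This is a sketch only: the mode names are taken from the CLI examples in this document, and the pairing of each mode with a default model is an assumption inferred from the table, not read from the script itself. resolve_model is a hypothetical helper.

```python
# Mode -> model mapping inferred from the table above (assumption).
DEFAULT_MODELS = {
    "custom-voice": "mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    "voice-design": "mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    "voice-clone": "mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
}


def resolve_model(mode, override=None):
    """Return the override if one is given, else the assumed default for the mode."""
    if override:
        return override
    try:
        return DEFAULT_MODELS[mode]
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None


print(resolve_model("voice-design"))
```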

Supported Languages

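| Language | Code | Notes |
| --- | --- | --- |
| Auto-detect | auto | Default, detects from text |
| Chinese | Chinese | Mandarin |
| English | English | |
| Japanese | Japanese | |
| Korean | Korean | |
| French | French | |
| German | German | |
| Spanish | Spanish | |
| Portuguese | Portuguese | |
| Italian | Italian | |
| Russian | Russian | |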

Built-in Voices

| Voice | Language | Character |
| --- | --- | --- |
| Vivian | Chinese | Female, bright, young |
| Serena | Chinese | Female, gentle, soft |
| Uncle_Fu | Chinese | Male, authoritative, news anchor |
| Dylan | Chinese | Male, Beijing dialect |
| Eric | Chinese | Male, Sichuan dialect |
| Ryan | English | Male, energetic |
| Aiden | English | Male, clear, neutral |
| Ono_Anna | Japanese | Female |
| Sohee | Korean | Female |

Voice Selection Guide:

| Scenario | Recommended Voice |
| --- | --- |
| Chinese news/narration | Uncle_Fu |
| Chinese casual/lively | Eric |
| Chinese female, professional | Vivian |
| Chinese female, storytelling | Serena |
| English energetic content | Ryan |
| English neutral/educational | Aiden |
| Japanese content | Ono_Anna |
| Korean content | Sohee |
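
For scripts that pick a voice automatically, the built-in voice list can be kept as plain data and filtered by language. The sketch below copies the names and descriptions from the Built-in Voices table; voices_for is a hypothetical helper, not part of the skill.

```python
# Voice -> (language, character), copied from the Built-in Voices table.
VOICES = {
    "Vivian": ("Chinese", "Female, bright, young"),
    "Serena": ("Chinese", "Female, gentle, soft"),
    "Uncle_Fu": ("Chinese", "Male, authoritative, news anchor"),
    "Dylan": ("Chinese", "Male, Beijing dialect"),
    "Eric": ("Chinese", "Male, Sichuan dialect"),
    "Ryan": ("English", "Male, energetic"),
    "Aiden": ("English", "Male, clear, neutral"),
    "Ono_Anna": ("Japanese", "Female"),
    "Sohee": ("Korean", "Female"),
}


def voices_for(language):
    """List built-in voice names for a language, in table order."""
    return [name for name, (lang, _) in VOICES.items() if lang == language]


print(voices_for("English"))
```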

Modes

1) CustomVoice

Use built-in voices with optional emotion/style control via --instruct.

python scripts/run_tts.py custom-voice \
  --text "This is amazing news!" \
  --voice Vivian \
  --instruct "excited and happy" \
  --output excited.wav

Style instruction examples:

  • "calm and warm" - Soft, friendly delivery
  • "news anchor, authoritative" - Professional broadcast style
  • "excited and energetic" - High energy, enthusiastic
  • "sad and melancholic" - Emotional, somber tone
  • "whispering, intimate" - Quiet, close-mic feel

2) VoiceDesign

Create a completely new voice by describing it in natural language.

python scripts/run_tts.py voice-design \
  --text "Welcome to our podcast." \
  --instruct "warm, mature male narrator with low pitch and gentle tone" \
  --output podcast_intro.wav

Voice description examples:

  • "young cheerful female with high pitch"
  • "elderly wise male with deep resonant voice"
  • "professional female news anchor, clear articulation"
  • "friendly young male, casual and relaxed"

3) VoiceClone

Clone any voice from a reference audio sample (5-10 seconds recommended).

python scripts/run_tts.py voice-clone \
  --text "This is my cloned voice speaking new content." \
  --ref_audio reference.wav \
  --ref_text "The exact transcript of the reference audio" \
  --output cloned.wav

Tips for voice cloning:

  • Use clean audio without background noise
  • 5-10 seconds of speech works best
  • Provide an accurate transcript of the reference audio
  • Reference and output language should match
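
The 5-10 second recommendation can be checked before cloning. Here is a minimal sketch using only the standard library wave module, which reads uncompressed PCM WAV files; other formats would need converting first (for example with ffmpeg). check_reference and its thresholds are illustrative, not part of the skill.

```python
import os
import tempfile
import wave


def wav_duration_seconds(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def check_reference(path, min_s=5.0, max_s=10.0):
    """Return (within_range, duration) for a reference clip."""
    duration = wav_duration_seconds(path)
    return min_s <= duration <= max_s, duration


# Demo: write 6 seconds of silence to a temp file and check it.
demo_path = os.path.join(tempfile.mkdtemp(), "reference.wav")
with wave.open(demo_path, "wb") as wf:
    wf.setnchannels(1)       # mono
    wf.setsampwidth(2)       # 16-bit samples
    wf.setframerate(24000)
    wf.writeframes(b"\x00\x00" * (6 * 24000))

ok, seconds = check_reference(demo_path)
print(f"{seconds:.1f}s, within recommended range: {ok}")
```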

CLI Parameters

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| --text | Yes | - | Text to synthesize |
| --voice | No | Vivian | Built-in voice (CustomVoice only) |
| --lang_code | No | auto | Language code |
| --instruct | No | - | Style control or voice description |
| --speed | No | 1.0 | Speech speed multiplier |
| --temperature | No | 0.7 | Sampling temperature (higher = more variation) |
| --model | No | (per mode) | Override default model |
| --output | No | - | Output file path |
| --out-dir | No | ./outputs | Output directory when --output not set |
| --ref_audio | VoiceClone | - | Reference audio file |
| --ref_text | VoiceClone | - | Reference audio transcript |
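
To make the required/default rules in the table concrete, here is an argparse sketch that mirrors them. It is not the actual parser in scripts/run_tts.py; it is a reconstruction from the table, and details such as which flags each subcommand accepts are assumptions.

```python
import argparse


def build_parser():
    """Illustrative parser mirroring the CLI Parameters table (not the real script)."""
    parser = argparse.ArgumentParser(prog="run_tts.py")
    sub = parser.add_subparsers(dest="mode", required=True)
    for mode in ("custom-voice", "voice-design", "voice-clone"):
        p = sub.add_parser(mode)
        p.add_argument("--text", required=True)
        p.add_argument("--lang_code", default="auto")
        p.add_argument("--instruct")
        p.add_argument("--speed", type=float, default=1.0)
        p.add_argument("--temperature", type=float, default=0.7)
        p.add_argument("--model")
        p.add_argument("--output")
        p.add_argument("--out-dir", default="./outputs")
        if mode == "custom-voice":
            # --voice applies to CustomVoice only, per the table.
            p.add_argument("--voice", default="Vivian")
        if mode == "voice-clone":
            # Reference audio and transcript are required for cloning.
            p.add_argument("--ref_audio", required=True)
            p.add_argument("--ref_text", required=True)
    return parser


args = build_parser().parse_args(["custom-voice", "--text", "Hello"])
print(args.voice, args.lang_code, args.speed, args.temperature, args.out_dir)
```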

Python API

Using generate_audio (recommended)

from mlx_audio.tts.generate import generate_audio

# CustomVoice with style control
generate_audio(
    text="Hello from Qwen3-TTS!",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit",
    voice="Ryan",
    lang_code="english",
    instruct="friendly and warm",
    output_path=".",
    file_prefix="hello",
    audio_format="wav",
    join_audio=True,
    verbose=True,
)

Using Model directly

from mlx_audio.tts.utils import load
import soundfile as sf
import numpy as np

# Load model
model = load("mlx-community/Qwen3-TTS-12Hz-0.6B-CustomVoice-4bit")

# Generate audio (returns a generator)
audio_chunks = []
for chunk in model.generate_custom_voice(
    text="Hello from Qwen3-TTS.",
    speaker="Ryan",
    language="english",
    instruct="clear, steady delivery"
):
    if hasattr(chunk, 'audio') and chunk.audio is not None:
        audio_chunks.append(chunk.audio)

# Combine and save
audio = np.concatenate(audio_chunks)
sf.write("output.wav", audio, 24000)
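
If soundfile is not installed, the same combine-and-save step can be sketched with the standard library alone. This assumes each chunk is an iterable of mono float samples in the range [-1.0, 1.0] at 24,000 Hz; write_wav is a hypothetical helper, and the demo below feeds it silence instead of model output.

```python
import array
import os
import sys
import tempfile
import wave

SAMPLE_RATE = 24000  # matches the sample rate used in this document


def write_wav(path, chunks, sample_rate=SAMPLE_RATE):
    """Concatenate float chunks and save them as 16-bit mono PCM.

    Returns the duration of the written audio in seconds.
    """
    pcm = array.array("h")
    for chunk in chunks:
        for sample in chunk:
            clipped = max(-1.0, min(1.0, float(sample)))
            pcm.append(int(round(clipped * 32767)))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm.tobytes())
    return len(pcm) / sample_rate


# Demo with two half-second chunks of silence standing in for model output.
out_path = os.path.join(tempfile.mkdtemp(), "output.wav")
duration = write_wav(out_path, [[0.0] * 12000, [0.0] * 12000])
print(f"wrote {duration:.1f}s to {out_path}")
```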

VoiceDesign

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="Welcome to the show.",
    model="mlx-community/Qwen3-TTS-12Hz-1.7B-VoiceDesign-5bit",
    instruct="warm, friendly female narrator with medium pitch",
    lang_code="english",
    output_path=".",
    file_prefix="voice_design",
    join_audio=True,
)

VoiceClone

from mlx_audio.tts.generate import generate_audio

generate_audio(
    text="New content in the cloned voice.",
    model="mlx-community/Qwen3-TTS-12Hz-0.6B-Base-4bit",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference audio",
    output_path=".",
    file_prefix="cloned",
    join_audio=True,
)

Batch Processing

Use scripts/batch_dubbing.py for processing multiple lines:

python scripts/batch_dubbing.py \
  --input dubbing.json \
  --out-dir outputs

See references/dubbing_format.md for the JSON format.

Performance

| Metric | Value |
| --- | --- |
| Sample rate | 24,000 Hz |
| Real-time factor | ~0.7x (faster than real-time) |
| Peak memory | ~4-6 GB |
| First run | Downloads model (~1-2 GB) |
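
The real-time factor can be turned into a rough time budget. The sketch below assumes the factor is defined as generation time divided by audio duration, which is consistent with the "faster than real-time" note; actual speed will vary with hardware and model, and estimate_generation_seconds is a hypothetical helper.

```python
def estimate_generation_seconds(audio_seconds, rtf=0.7):
    """Estimated wall-clock seconds to synthesize `audio_seconds` of speech."""
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


for label, seconds in [("1-minute clip", 60), ("1-hour chapter", 3600)]:
    estimate = estimate_generation_seconds(seconds)
    print(f"{label}: about {estimate / 60:.1f} min of generation")
```

At ~0.7x, one hour of finished audio corresponds to roughly 42 minutes of generation, not counting the first-run model download.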

Troubleshooting

| Issue | Solution |
| --- | --- |
| Slow generation | Use 4-bit CustomVoice model |
| Unnatural pauses | Add punctuation, keep sentences short |
| Wrong language detected | Specify --lang_code explicitly |
| Voice cloning quality | Use cleaner reference audio, accurate transcript |
| Tokenizer warnings | Harmless, can be ignored |
| Out of memory | Close other apps, use 4-bit model |
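
For the "keep sentences short" advice, long input can be split at sentence boundaries and synthesized piece by piece. The following is a minimal sketch; the 200-character limit is an arbitrary illustration, and split_sentences and chunk_text are hypothetical helpers, not part of the skill.

```python
import re

# Split after ASCII sentence punctuation followed by whitespace,
# or directly after CJK sentence punctuation (which has no trailing space).
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+|(?<=[。！？])")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [part for part in _SENTENCE_BOUNDARY.split(text.strip()) if part]


def chunk_text(text, max_chars=200):
    """Group whole sentences into chunks of at most max_chars where possible.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


for piece in chunk_text("First sentence. Second one! Third?", max_chars=20):
    print(piece)
```

Each chunk can then be passed as a separate --text value, and the resulting audio files joined afterwards.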
