qwen3-tts

High-quality text-to-speech using Qwen3-TTS: 10 built-in speakers with emotional instruct control, voice cloning from as little as 3 seconds of audio, natural-language voice design, 10+ languages, persistent named voices, and audio delivery via Telegram/WhatsApp as native voice messages. Auto-detects the compute backend (CUDA, ROCm, Intel XPU, or CPU-only).

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "qwen3-tts" with this command: npx skills add damustermann/claw-qwen3-tts

Qwen3-TTS Skill

You have access to a powerful text-to-speech system that can generate human-quality speech with 10 built-in speakers, design new voices from descriptions, clone existing voices from audio samples, and send audio via Telegram/WhatsApp as native voice messages.

First-Time Setup

If the skill is not yet installed (no ~/clawd/skills/qwen3-tts directory), run:

bash <(curl -fsSL https://raw.githubusercontent.com/daMustermann/claw-qwen3-tts/main/install.sh)

Or if already cloned but not set up (no .venv/ directory):

bash ~/clawd/skills/qwen3-tts/install.sh

This auto-detects the GPU (CUDA, ROCm, Intel XPU, or CPU-only), creates a Python venv, and installs all dependencies. It takes 5–15 minutes on first run.
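
The two install paths above depend on what is already on disk, so a small check can decide which one applies. The following is a minimal sketch, not part of the skill: skill_setup_state is a hypothetical helper, and it assumes the .venv directory lives directly inside the skill directory.

```shell
# Hypothetical helper: report which setup step (if any) is still needed.
# Assumes .venv sits directly inside the skill directory.
skill_setup_state() {
  skill_dir=${1:-$HOME/clawd/skills/qwen3-tts}
  if [ ! -d "$skill_dir" ]; then
    # Nothing cloned yet: the curl-based installer applies.
    printf '%s\n' "not-installed"
  elif [ ! -d "$skill_dir/.venv" ]; then
    # Cloned but not set up: run install.sh from the skill directory.
    printf '%s\n' "needs-setup"
  else
    printf '%s\n' "ready"
  fi
}
```

Called with no argument it inspects ~/clawd/skills/qwen3-tts; pass another path to check a different location.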

Starting & Stopping the Server

Before any TTS operation, ensure the server is running:

# Start (idempotent — won't restart if already running)
bash ~/clawd/skills/qwen3-tts/scripts/start_server.sh

# Check health
bash ~/clawd/skills/qwen3-tts/scripts/health_check.sh

# Stop (when done)
bash ~/clawd/skills/qwen3-tts/scripts/stop_server.sh

The server runs at http://localhost:8880.
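
Model loading may take a while after start_server.sh returns, so it can help to poll before sending the first request. This is a rough sketch under that assumption: wait_until is a hypothetical helper that retries any command, and the source does not say how long startup actually takes.

```shell
# Hypothetical helper: wait_until MAX_TRIES DELAY_SECONDS COMMAND [ARGS...]
# Re-runs COMMAND until it succeeds or MAX_TRIES attempts have failed.
wait_until() {
  max_tries=$1
  delay=$2
  shift 2
  attempt=1
  while [ "$attempt" -le "$max_tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Only pause when another attempt will follow.
    if [ "$attempt" -lt "$max_tries" ]; then
      sleep "$delay"
    fi
    attempt=$((attempt + 1))
  done
  return 1
}
```

For example, wait_until 30 2 curl -fsS http://localhost:8880/health would probe the health endpoint up to 30 times, two seconds apart; the retry counts are illustrative.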


Available Models

| Model ID | Use Case | Notes |
|---|---|---|
| custom-voice-1.7b | High-quality TTS with built-in speakers (default) | Best quality, ~5 GB VRAM |
| custom-voice-0.6b | Fast TTS with built-in speakers | Lightweight, ~2 GB VRAM |
| voice-design | Design new voices from natural language descriptions | Uses VoiceDesign model |
| base-1.7b | Basic TTS (auto-corrected to custom-voice-1.7b) | Use custom-voice-* instead |
| base-0.6b | Basic TTS (auto-corrected to custom-voice-0.6b) | Use custom-voice-* instead |

Important: On the /v1/audio/speech endpoint, base-* and voice-design models are automatically corrected to the corresponding custom-voice-* model. Always prefer custom-voice-1.7b or custom-voice-0.6b for speech generation.
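
Mirroring that correction on the client side keeps logs and user-facing messages consistent with the model the server actually uses. The sketch below is an illustration only: normalize_speech_model is a hypothetical name, and voice-design is passed through unchanged because the text above does not say which custom-voice-* model it maps to.

```shell
# Hypothetical helper: apply the documented base-* correction before calling
# /v1/audio/speech. Any other model ID is returned unchanged.
normalize_speech_model() {
  case "$1" in
    base-1.7b) printf '%s\n' "custom-voice-1.7b" ;;
    base-0.6b) printf '%s\n' "custom-voice-0.6b" ;;
    *)         printf '%s\n' "$1" ;;
  esac
}
```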

Built-in Speakers

The custom-voice-* models include 10 built-in voices:

Chelsie · Ethan · Aidan · Serena · Ryan · Vivian · Claire · Lucas · Eleanor · Benjamin

You can discover speakers dynamically: curl http://localhost:8880/v1/speakers
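
When a user names a speaker, checking the name locally avoids a failed request. The following sketch hard-codes the ten names listed above; resolve_speaker is a hypothetical helper, the match is case-sensitive by assumption, and the live list from /v1/speakers should take precedence if it differs.

```shell
# Hypothetical helper: print the speaker to use, defaulting to Chelsie when
# none is given. Unknown names fail with a non-zero status.
resolve_speaker() {
  case "${1:-}" in
    "")
      printf '%s\n' "Chelsie" ;;
    Chelsie|Ethan|Aidan|Serena|Ryan|Vivian|Claire|Lucas|Eleanor|Benjamin)
      printf '%s\n' "$1" ;;
    *)
      printf 'unknown speaker: %s\n' "$1" >&2
      return 1 ;;
  esac
}
```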


Capabilities

1. Generate Speech from Text

When to use: User asks to speak text, read something aloud, generate audio, do a voiceover, narrate, or say something.

curl -X POST http://localhost:8880/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "custom-voice-1.7b",
    "input": "TEXT_HERE",
    "voice": "default",
    "speaker": "Chelsie",
    "language": "en",
    "instruct": "",
    "response_format": "wav"
  }' \
  --output ~/clawd/skills/qwen3-tts/output/speech.wav

Parameters:

| Parameter | Required | Default | Description |
|---|---|---|---|
| model | no | custom-voice-1.7b | TTS model to use |
| input | yes | | The text to synthesize |
| voice | no | default | "default" for built-in speakers, or a saved voice name (e.g. "Angie") |
| speaker | no | Chelsie | Built-in speaker name (only when voice is "default") |
| language | no | en | Language code: en, zh, ja, ko, de, fr, ru, pt, es, it |
| instruct | no | "" | Emotional/style instruction (see below) |
| response_format | no | wav | Output format: wav, mp3, ogg, flac |
| speed | no | 1.0 | Speech speed multiplier |

Language codes: en, zh, ja, ko, de, fr, ru, pt, es, it — or full names like English, Chinese, German, etc.

Instruct examples (controls tone, emotion, and style):

  • "Speak happily and with excitement"
  • "Whisper softly, as if telling a secret"
  • "Read this in a calm, professional news anchor tone"
  • "用愤怒的语气" (Speak angrily — works in target language too)
  • "" (empty string = neutral default)

When voice is a saved name: If you pass "voice": "Angie" and a voice named "Angie" exists, the server uses voice cloning with the saved reference audio instead of a built-in speaker. The speaker field is ignored in this case.
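
User-supplied text often contains double quotes or backslashes, which would break a hand-written JSON body like the one in the curl example. The sketch below shows one way to build the request body safely in plain shell. Both json_escape and speech_payload are hypothetical helpers, and the escaping deliberately covers only backslashes and double quotes, so the input is assumed to be a single line.

```shell
# Hypothetical helper: escape backslashes first, then double quotes.
# Newlines and other control characters are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: speech_payload TEXT [SPEAKER] [INSTRUCT]
# Prints a JSON body for /v1/audio/speech using the documented defaults.
speech_payload() {
  text=$(json_escape "$1")
  speaker=$(json_escape "${2:-Chelsie}")
  instruct=$(json_escape "${3:-}")
  printf '{"model":"custom-voice-1.7b","input":"%s","voice":"default","speaker":"%s","language":"en","instruct":"%s","response_format":"wav"}\n' \
    "$text" "$speaker" "$instruct"
}
```

The output can be piped straight into the request, for example: speech_payload "Hello there" Ryan | curl -X POST http://localhost:8880/v1/audio/speech -H "Content-Type: application/json" -d @- --output speech.wav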

2. Design a New Voice

When to use: User wants to create a custom voice, describe how a character should sound, design a persona's voice.

curl -X POST http://localhost:8880/v1/audio/voice-design \
  -H "Content-Type: application/json" \
  -d '{
    "model": "voice-design",
    "input": "TEXT_TO_SPEAK",
    "voice_description": "DESCRIBE THE VOICE IN NATURAL LANGUAGE",
    "language": "en",
    "response_format": "wav"
  }' \
  --output ~/clawd/skills/qwen3-tts/output/designed.wav

Parameters:

| Parameter | Required | Default | Description |
|---|---|---|---|
| model | no | voice-design | Must be voice-design |
| input | yes | | Text to synthesize with the designed voice |
| voice_description | yes | | Natural language description of the desired voice |
| language | no | en | Target language |
| response_format | no | wav | Output format |

Example descriptions:

  • "A warm, deep male voice with a slight British accent, calm and authoritative, like a BBC presenter in his 40s"
  • "A young, energetic female voice, bright and cheerful, with a slight rasp"
  • "An old wizard with a slow, mysterious, gravelly voice"

The response includes an X-Voice-Id header; capture it to save the voice (see §4).

3. Clone a Voice

When to use: User provides a reference audio clip and wants to generate new speech in that voice.

curl -X POST http://localhost:8880/v1/audio/voice-clone \
  -F "reference_audio=@/path/to/reference.wav" \
  -F "reference_text=Transcript of the reference audio" \
  -F "input=New text to speak in the cloned voice" \
  -F "language=en" \
  -F "response_format=wav" \
  --output ~/clawd/skills/qwen3-tts/output/cloned.wav

Parameters:

| Parameter | Required | Default | Description |
|---|---|---|---|
| reference_audio | yes | | Audio file to clone the voice from |
| input | yes | | New text to synthesize in the cloned voice |
| reference_text | no | "" | Transcription of the reference audio (improves quality) |
| language | no | en | Target language |
| response_format | no | wav | Output format |

Guidelines:

  • Minimum 3 seconds of reference audio
  • Recommended 10–30 seconds for best quality
  • Providing an accurate reference_text transcription significantly improves results
  • Supports cross-language cloning (clone from English → speak in Japanese)
  • If reference_text is empty, the server falls back to x-vector-only mode (audio features only)

The response includes an X-Voice-Id header; capture it to save the voice (see §4).
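
The curl examples for voice-design and voice-clone write only the audio body to disk, so the X-Voice-Id header is lost unless it is dumped separately. One possible approach is to add curl's -D option (for example -D headers.txt) to the request and read the header back afterwards. The helper below is a sketch of that second step; extract_voice_id and the headers.txt file name are illustrative, not part of the skill.

```shell
# Hypothetical helper: print the X-Voice-Id value from a header dump
# written by curl -D. Header names are matched case-insensitively and the
# trailing carriage returns of HTTP line endings are stripped.
extract_voice_id() {
  tr -d '\r' < "$1" \
    | grep -i '^x-voice-id:' \
    | head -n 1 \
    | sed 's/^[^:]*:[[:space:]]*//'
}
```

After a request made with -D headers.txt, voice_id=$(extract_voice_id headers.txt) yields the value to pass as source_voice_id when saving. An empty result means the header was not present.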

4. ⭐ CRITICAL: Voice Save Prompting Rules

YOU MUST FOLLOW THESE RULES:

  1. After EVERY voice-design or voice-clone request, ask the user:

    "Would you like to save this voice for future use? What name should I give it?"

  2. If the user says yes, capture the X-Voice-Id from the response headers and save it:

    curl -X POST http://localhost:8880/v1/voices \
      -H "Content-Type: application/json" \
      -d '{
        "name": "USER_CHOSEN_NAME",
        "source_voice_id": "VOICE_ID_FROM_X_VOICE_ID_HEADER",
        "description": "Description of the voice",
        "tags": ["tag1", "tag2"],
        "language": "en"
      }'
    
  3. When user requests TTS with a voice name (e.g. "say this with Angie"):

    • Use "voice": "Angie" in the /v1/audio/speech request
    • The server automatically loads the saved reference audio and uses voice cloning
    • If the name doesn't exist, tell the user and offer to design or clone one
  4. When user asks to list voices:

    curl http://localhost:8880/v1/voices
    

    Present the results as a formatted list with name, description, source, language, tags, and usage count. Voices are sorted by usage count (most used first).

  5. When user asks to delete a voice: Confirm with the user first, then:

    curl -X DELETE http://localhost:8880/v1/voices/VOICE_NAME
    
  6. When user asks to rename a voice:

    curl -X PATCH http://localhost:8880/v1/voices/OLD_NAME \
      -H "Content-Type: application/json" \
      -d '{"name": "NEW_NAME"}'
    
  7. When user asks to update a voice's metadata (description, tags, language):

    curl -X PATCH http://localhost:8880/v1/voices/VOICE_NAME \
      -H "Content-Type: application/json" \
      -d '{"description": "Updated description", "tags": ["new", "tags"]}'
    
  8. Voice names are case-insensitive but stored in the casing the user provided.

  9. Duplicate names are not allowed. If a name already exists, the save request fails with HTTP status 409. Ask the user for a different name or offer to delete the existing one first.

  10. Voice profiles are stored locally in ~/clawd/skills/qwen3-tts/voices/ and persist across server restarts. Each voice consists of:

    • <name>.json — metadata
    • <name>.pt — embedding tensor
    • <name>_sample.wav — reference audio sample (used for re-cloning)
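
The rules above imply a small decision after each save attempt: confirm on success, ask for another name on a 409, and report anything else. The sketch below maps a status code to that next step. save_voice_next_step is a hypothetical helper, and treating every 2xx code as success is an assumption, since the source names only the 409 case.

```shell
# Hypothetical helper: map the HTTP status of POST /v1/voices to a next step.
# Assumes any 2xx status means the voice was saved.
save_voice_next_step() {
  case "$1" in
    2??) printf '%s\n' "confirm-saved" ;;
    409) printf '%s\n' "ask-for-different-name" ;;
    *)   printf '%s\n' "report-error" ;;
  esac
}
```

The status itself can be captured by adding -s -o /dev/null -w '%{http_code}' to the save request and storing curl's output in a variable.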

5. Convert Audio Formats

When to use: User needs audio in a specific format, or you need to prepare audio for messaging.

curl -X POST http://localhost:8880/v1/audio/convert \
  -F "audio=@input.wav" \
  -F "target_format=mp3" \
  --output output.mp3

Supported formats: wav, mp3, ogg (Opus), flac

You can also use the shell script directly:

bash ~/clawd/skills/qwen3-tts/scripts/convert_to_ogg_opus.sh input.wav output.ogg

6. Send via Telegram (PTT Voice Message)

When to use: User is interacting via Telegram, or explicitly asks to send audio to a Telegram chat.

curl -X POST http://localhost:8880/v1/audio/send/telegram \
  -H "Content-Type: application/json" \
  -d '{
    "audio_file": "/path/to/audio.wav",
    "chat_id": "CHAT_ID",
    "bot_token": "BOT_TOKEN",
    "caption": "Optional caption"
  }'

Notes:

  • bot_token is optional if already configured in config.json
  • Audio is auto-converted to OGG/Opus and sent via Telegram's sendVoice API
  • Displays as a native PTT waveform voice message in the chat
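
Because bot_token is optional, the request body is best built so that the field is left out entirely when the token comes from config.json, rather than sent as an empty string. The following is a minimal sketch of that idea; telegram_payload and json_escape are hypothetical helpers, the optional caption is omitted for brevity, and whether the server tolerates an empty bot_token is not stated in the source.

```shell
# Hypothetical helper: escape backslashes and double quotes for JSON.
# Control characters such as newlines are NOT handled.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Hypothetical helper: telegram_payload AUDIO_FILE CHAT_ID [BOT_TOKEN]
# Includes bot_token only when one is passed explicitly.
telegram_payload() {
  audio=$(json_escape "$1")
  chat=$(json_escape "$2")
  if [ -n "${3:-}" ]; then
    token=$(json_escape "$3")
    printf '{"audio_file":"%s","chat_id":"%s","bot_token":"%s"}\n' \
      "$audio" "$chat" "$token"
  else
    printf '{"audio_file":"%s","chat_id":"%s"}\n' "$audio" "$chat"
  fi
}
```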

7. Send via WhatsApp (PTT Voice Message)

When to use: User is interacting via WhatsApp, or explicitly asks to send audio there.

curl -X POST http://localhost:8880/v1/audio/send/whatsapp \
  -H "Content-Type: application/json" \
  -d '{
    "audio_file": "/path/to/audio.wav",
    "phone_number_id": "PHONE_ID",
    "recipient": "+14155551234",
    "access_token": "ACCESS_TOKEN"
  }'

Notes:

  • phone_number_id and access_token are optional if already configured in config.json
  • Audio is auto-converted to OGG/Opus and sent as a native WhatsApp voice message
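
The example recipient is written in international format with a leading plus sign. Assuming the endpoint expects that E.164-style shape (the source does not say so explicitly), a quick local check can catch numbers with spaces or a missing country code before the request is sent. is_e164 is a hypothetical helper.

```shell
# Hypothetical helper: succeed only for a plus sign followed by 2 to 15
# digits with no leading zero, which is the general E.164 shape.
is_e164() {
  printf '%s\n' "$1" | grep -Eq '^\+[1-9][0-9]{1,14}$'
}
```

For instance, is_e164 "+14155551234" succeeds, while a value containing spaces or lacking the plus sign fails and should be corrected with the user first.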

8. Discovery Endpoints

Use these to dynamically discover available models and speakers:

# List all available TTS models
curl http://localhost:8880/v1/models

# List built-in speakers
curl http://localhost:8880/v1/speakers

# Server health check (device info, voice count, version)
curl http://localhost:8880/health

How to Respond

After generating speech:

  1. Tell the user the audio has been generated
  2. Provide the output file path
  3. If it was voice-design or voice-clone, always ask to save the voice (Rule §4.1)
  4. If the user is on Telegram/WhatsApp, offer to send it as a voice message

After saving a voice:

  • Confirm the name and tell the user they can use it anytime with that name
  • Example: "Voice saved as 'Captain Hook'! You can reference it anytime with voice: Captain Hook."

After sending via Telegram/WhatsApp:

  • Confirm successful delivery

When choosing a speaker: If the user doesn't specify, default to "Chelsie". If they describe the kind of voice they want (but not a full voice-design request), pick the most fitting built-in speaker.

When choosing a model: Default to custom-voice-1.7b. Only use custom-voice-0.6b if the user asks for speed, or if the system has limited VRAM/memory.
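
That rule can be sketched as a small function. pick_model is a hypothetical helper, and the 5000 MB cut-off is an assumption taken from the approximate 5 GB figure in the model table; a real deployment may want extra headroom.

```shell
# Hypothetical helper: pick_model FREE_VRAM_MB [quality|speed]
# Chooses the 0.6b model when speed is requested or free VRAM is below
# roughly 5 GB (assumed threshold); otherwise the default 1.7b model.
pick_model() {
  free_vram_mb=${1:-0}
  preference=${2:-quality}
  if [ "$preference" = "speed" ] || [ "$free_vram_mb" -lt 5000 ]; then
    printf '%s\n' "custom-voice-0.6b"
  else
    printf '%s\n' "custom-voice-1.7b"
  fi
}
```

On NVIDIA hardware the first argument could come from nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits; other backends need their own query.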


Configuration

The agent can update ~/clawd/skills/qwen3-tts/config.json to set:

  • Telegram: bot token and default chat ID
  • WhatsApp: phone number ID and access token
  • Default model: custom-voice-1.7b or custom-voice-0.6b
  • Default audio format: wav, mp3, ogg, flac
  • Device override: auto, cuda:0, xpu:0, cpu

If config.json doesn't exist, copy the template:

cp ~/clawd/skills/qwen3-tts/config.json.template ~/clawd/skills/qwen3-tts/config.json
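
Running that cp unconditionally would overwrite an existing configuration, including any saved tokens. A guarded variant is sketched below; ensure_config is a hypothetical helper that copies the template only when config.json is missing.

```shell
# Hypothetical helper: create config.json from the template only if it
# does not exist yet, so existing settings are never overwritten.
ensure_config() {
  skill_dir=${1:-$HOME/clawd/skills/qwen3-tts}
  if [ -f "$skill_dir/config.json" ]; then
    return 0
  fi
  cp "$skill_dir/config.json.template" "$skill_dir/config.json"
}
```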
