
Installation

npx skills add darknoah/qwen3-audio

Qwen3-Audio

Overview

Qwen3-Audio is a high-performance audio processing library optimized for Apple Silicon (M1/M2/M3/M4). It delivers fast, efficient TTS and STT with support for multiple models, languages, and audio formats.

Prerequisites

  • Python 3.10+
  • Apple Silicon Mac (M1/M2/M3/M4)

Environment checks

Before using any capability, verify that all items in ./references/env-check-list.md are complete.

Capabilities

Text to Speech

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav"

Returns (JSON):

{
  "audio_path": "/path_to_save.wav",
  "duration": 1.234,
  "sample_rate": 24000
}
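The command prints its result as JSON on stdout, so it is easy to drive from Python. A minimal sketch for assembling the call and checking the result — the helper names `build_tts_cmd` and `parse_tts_result` are illustrative, not part of the skill; only the flags and JSON fields come from the usage above:

```python
import json


def build_tts_cmd(text: str, output: str) -> list[str]:
    # Assemble the documented TTS invocation (flags as shown above).
    return [
        "uv", "run", "--python", ".venv/bin/python",
        "./scripts/mlx-audio.py", "tts",
        "--text", text, "--output", output,
    ]


def parse_tts_result(stdout: str) -> dict:
    # The tool reports audio_path, duration, and sample_rate as JSON.
    result = json.loads(stdout)
    for key in ("audio_path", "duration", "sample_rate"):
        if key not in result:
            raise ValueError(f"missing field in TTS result: {key}")
    return result
```

Pass the list to `subprocess.run(cmd, capture_output=True, text=True)` and feed `proc.stdout` to `parse_tts_result`.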

Voice Cloning

Clone a voice from a reference audio sample. Provide the WAV file and its transcript:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_audio "sample_audio.wav" --ref_text "This is what my voice sounds like."

  • ref_audio - reference audio to clone
  • ref_text - transcript of the reference audio

Use Created Voice (Shortcut)

Use a voice created with voice create by its ID:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --ref_voice "my-voice-id"

This automatically loads ref_audio and ref_text from the voice profile.

CustomVoice (Emotion Control)

Use predefined voices with emotion/style instructions:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --speaker "Ryan" --language "English" --instruct "Very happy and excited."

VoiceDesign (Create Any Voice)

Create any voice from a text description:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "hello world" --output "/path_to_save.wav" --language "English" --instruct "A cheerful young female voice with high pitch and energetic tone."

Automatic Speech Recognition (STT)

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" stt --audio "/sample_audio.wav" --output "/path_to_save.txt" --output-format srt

Test audio: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-ASR-Repo/asr_en.wav
output-format: "txt" | "ass" | "srt" | "all"

Returns (JSON):

{
  "text": "transcribed text content",
  "duration": 10.5,
  "sample_rate": 16000,
  "files": ["/path_to_save.txt", "/path_to_save.srt"]
}
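With `--output-format srt` the transcript is written as SubRip subtitles. A small helper can flatten that into plain text — a sketch, not part of the skill; it only relies on the standard SRT layout of cue number, timestamp line, then text:

```python
def srt_to_text(srt: str) -> str:
    # Collapse an SRT file into a plain transcript: drop cue indices
    # and timestamp lines, keep the subtitle text.
    lines = []
    for block in srt.strip().split("\n\n"):
        for line in block.splitlines():
            line = line.strip()
            # Skip cue numbers ("1") and "00:00:01,000 --> 00:00:04,000".
            if not line or line.isdigit() or "-->" in line:
                continue
            lines.append(line)
    return " ".join(lines)
```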

Voice Management

Voices are stored in the voices/ directory at the skill root level. Each voice has its own folder containing:

  • ref_audio.wav - Reference audio file
  • ref_text.txt - Reference text transcript
  • ref_instruct.txt - Voice style description
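Given that layout, a voice folder can be read back directly. A sketch assuming only the three files listed above — `load_voice_profile` is illustrative, not part of the skill's CLI:

```python
from pathlib import Path


def load_voice_profile(voice_dir: str) -> dict:
    # Read one voice folder: ref_audio.wav is required, the two
    # text files are treated as optional here.
    d = Path(voice_dir)
    audio = d / "ref_audio.wav"
    if not audio.is_file():
        raise FileNotFoundError(f"no ref_audio.wav in {d}")
    profile = {"id": d.name, "ref_audio": str(audio)}
    for key, name in (("ref_text", "ref_text.txt"),
                      ("instruct", "ref_instruct.txt")):
        path = d / name
        profile[key] = path.read_text().strip() if path.is_file() else None
    return profile
```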

Create a Voice

Create a reusable voice profile using the VoiceDesign model. The --instruct parameter is required to describe the voice style:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice create --text "This is a sample voice reference text." --instruct "A warm, friendly female voice with a professional tone." --language "English"

Optional: --id "my-voice-id" to specify a custom voice ID.

Returns (JSON):

{
  "id": "abc12345",
  "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
  "ref_text": "This is a sample voice reference text.",
  "instruct": "A warm, friendly female voice with a professional tone.",
  "duration": 3.456,
  "sample_rate": 24000
}

List Voices

List all created voice profiles:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" voice list

Returns (JSON):

[
  {
    "id": "abc12345",
    "ref_audio": "/path/to/skill/voices/abc12345/ref_audio.wav",
    "ref_text": "This is a sample voice reference text.",
    "instruct": "A warm, friendly female voice with a professional tone.",
    "duration": 3.456,
    "sample_rate": 24000
  }
]

Use a Created Voice

After creating a voice, use it for TTS with the --ref_voice parameter. The instruct will be automatically loaded:

uv run --python ".venv/bin/python" "./scripts/mlx-audio.py" tts --text "New text to speak" --output "/output.wav" --ref_voice "abc12345"

Predefined Speakers (CustomVoice)

For Qwen3-TTS-12Hz-1.7B/0.6B-CustomVoice models, the supported speakers and their descriptions are listed below. We recommend using each speaker's native language for best quality. Each speaker can still speak any language supported by the model.

| Speaker | Voice Description | Native Language |
| --- | --- | --- |
| Vivian | Bright, slightly edgy young female voice. | Chinese |
| Serena | Warm, gentle young female voice. | Chinese |
| Uncle_Fu | Seasoned male voice with a low, mellow timbre. | Chinese |
| Dylan | Youthful Beijing male voice with a clear, natural timbre. | Chinese (Beijing Dialect) |
| Eric | Lively Chengdu male voice with a slightly husky brightness. | Chinese (Sichuan Dialect) |
| Ryan | Dynamic male voice with strong rhythmic drive. | English |
| Aiden | Sunny American male voice with a clear midrange. | English |
| Ono_Anna | Playful Japanese female voice with a light, nimble timbre. | Japanese |
| Sohee | Warm Korean female voice with rich emotion. | Korean |
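To follow the native-language recommendation programmatically, the speaker table can be transcribed into a lookup — a convenience sketch that mirrors the table above and is not part of the skill:

```python
# Native language per CustomVoice speaker, from the table above.
NATIVE_LANGUAGE = {
    "Vivian": "Chinese",
    "Serena": "Chinese",
    "Uncle_Fu": "Chinese",
    "Dylan": "Chinese",   # Beijing dialect
    "Eric": "Chinese",    # Sichuan dialect
    "Ryan": "English",
    "Aiden": "English",
    "Ono_Anna": "Japanese",
    "Sohee": "Korean",
}


def speakers_for(language: str) -> list[str]:
    # Speakers whose native language matches the requested one.
    return [s for s, lang in NATIVE_LANGUAGE.items() if lang == language]
```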

Released Models

| Model | Features | Language Support |
| --- | --- | --- |
| Qwen3-TTS-12Hz-1.7B-VoiceDesign | Performs voice design based on user-provided descriptions. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian |
| Qwen3-TTS-12Hz-1.7B-CustomVoice | Provides style control over target timbres via user instructions; supports 9 premium timbres covering various combinations of gender, age, language, and dialect. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian |
| Qwen3-TTS-12Hz-1.7B-Base | Base model capable of 3-second rapid voice cloning from user audio input; can be used for fine-tuning (FT) other models. | Chinese, English, Japanese, Korean, German, French, Russian, Portuguese, Spanish, Italian |
