Text2Speech Skill

Generate high-quality text-to-speech audio using Qwen3-TTS models.

Prerequisites

Python 3.8+
requests package
Access to TTSWeb API (https://mc.agaii.org/TTS)

Installation

Via npm (Node.js)

npm install -g @catfishw/text2speech-skill

Via pip (Python)

pip install git+https://github.com/CatfishW/TTSAgentSkill.git

Direct Usage

python3 -m text2speech_skill.cli --help

Quick Start

Speak with Preset Speaker

text2speech speak "Hello world" -s vivian -o hello.wav

Design Custom Voice

text2speech design "Welcome to the future" \
  -d "futuristic female AI assistant, clear and professional" \
  -o welcome.wav

Clone Voice from Audio

text2speech clone "This is my cloned voice speaking" \
  -a reference.wav \
  -r "original transcript of reference audio" \
  -o cloned.wav

Clone with Preset Timbre

text2speech clone "Hello" -t ryan -o output.wav

Commands

speak

Text-to-speech with preset speaker voices.

text2speech speak <text> [options]

Options:
  -s, --speaker     Speaker name (default: vivian)
  -l, --language    Language, e.g. French (default: Auto)
  -i, --instruct    Style instruction (e.g., "speak cheerfully")
  -o, --output      Output audio file (required)

Speakers: vivian, ryan, aiden, dylan, eric, ono_anna, serena, sohee, uncle_fu

Examples:

text2speech speak "Hello" -s vivian -o hello.wav
text2speech speak "Bonjour" -s serena -l French -o bonjour.wav
text2speech speak "Hi" -s ryan -i "speak like a news anchor" -o hi.wav
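When calling the CLI from a script, it can help to build the argument list in one place. The following is a minimal sketch that uses only the flags documented above; the helper name build_speak_command is hypothetical and is not part of the package.

```python
from typing import List, Optional


def build_speak_command(text: str, output: str, speaker: str = "vivian",
                        language: Optional[str] = None,
                        instruct: Optional[str] = None) -> List[str]:
    """Return the argv list for `text2speech speak` (hypothetical helper)."""
    cmd = ["text2speech", "speak", text, "-s", speaker]
    if language:
        # Omitting -l leaves the CLI on its documented default (Auto).
        cmd += ["-l", language]
    if instruct:
        cmd += ["-i", instruct]
    cmd += ["-o", output]
    return cmd
```

The returned list can be passed to subprocess.run(cmd, check=True) once the text2speech command is installed and on PATH.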

design

Create a voice from a natural-language description.

text2speech design <text> -d <description> [options]

Options:
  -d, --description  Voice description (required)
  -l, --language     Language code
  -o, --output       Output audio file (required)

Examples:

text2speech design "Hello" -d "old man with raspy voice" -o oldman.wav
text2speech design "Welcome" -d "young energetic female, enthusiastic" -o welcome.wav

clone

Clone a voice from reference audio or a preset timbre.

text2speech clone <text> [options]

Options:
  -a, --audio           Reference audio file
  -t, --timbre          Preset timbre speaker (alternative to audio)
  -r, --ref-text        Reference transcript (for ICL mode)
  -x, --x-vector-only   Use x-vector only mode
  -i, --instruct        Style instruction
  -l, --language        Language code
  -o, --output          Output audio file (required)

Examples:

# Clone from audio with transcript (ICL mode)
text2speech clone "Hello" -a ref.wav -r "original text" -o out.wav

# Clone from audio (x-vector only, faster)
text2speech clone "Hello" -a ref.wav -x -o out.wav

# Clone using preset timbre
text2speech clone "Hello" -t ryan -o out.wav
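The three clone variants above differ only in which flags are passed. As a rough sketch, a script could assemble them as follows. The helper build_clone_command is hypothetical, and treating --audio and --timbre as mutually exclusive is an assumption based on the option text "alternative to audio".

```python
from typing import List, Optional


def build_clone_command(text: str, output: str, audio: Optional[str] = None,
                        timbre: Optional[str] = None,
                        ref_text: Optional[str] = None,
                        x_vector_only: bool = False) -> List[str]:
    """Return the argv list for `text2speech clone` (hypothetical helper)."""
    # Assumption: exactly one voice source (reference audio or preset timbre).
    if (audio is None) == (timbre is None):
        raise ValueError("provide exactly one of audio or timbre")
    cmd = ["text2speech", "clone", text]
    if audio is not None:
        cmd += ["-a", audio]
    else:
        cmd += ["-t", timbre]
    if ref_text:
        cmd += ["-r", ref_text]   # transcript enables ICL mode
    if x_vector_only:
        cmd.append("-x")          # speaker characteristics only
    cmd += ["-o", output]
    return cmd
```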

batch-speak

Batch process multiple text files.

text2speech batch-speak <input_dir> <output_dir> [options]

Options:
  -s, --speaker   Speaker name (default: vivian)
  -l, --language  Language code
  -i, --instruct  Style instruction

Input: Directory containing .txt files
Output: Audio files + batch_report.json

Example:

mkdir -p texts output
echo "Hello" > texts/1.txt
echo "World" > texts/2.txt
text2speech batch-speak texts/ output/ -s vivian
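The same preparation can be done from Python when the texts come from a program rather than the shell. This is a sketch only: both helper names are hypothetical, the location of batch_report.json inside the output directory is an assumption, and the report is returned as parsed JSON without assuming any particular keys, since its schema is not described here.

```python
import json
from pathlib import Path
from typing import Any, List


def write_batch_inputs(texts: List[str], input_dir: str) -> List[Path]:
    """Write each text to a numbered .txt file (1.txt, 2.txt, ...)."""
    directory = Path(input_dir)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, text in enumerate(texts, start=1):
        path = directory / f"{index}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths


def load_batch_report(output_dir: str) -> Any:
    """Load batch_report.json as-is (assumed to sit in the output directory)."""
    report_path = Path(output_dir) / "batch_report.json"
    with report_path.open(encoding="utf-8") as handle:
        return json.load(handle)
```

After write_batch_inputs(["Hello", "World"], "texts"), run text2speech batch-speak texts/ output/ -s vivian and inspect the result with load_batch_report("output").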

batch-clone

Clone a voice for multiple text files in one run.

text2speech batch-clone <input_dir> <output_dir> -a <audio> [options]

Options:
  -a, --audio     Reference audio (required)
  -r, --ref-text  Reference transcript
  -l, --language  Language code

Example:

text2speech batch-clone texts/ output/ -a reference.wav -r "transcript"

encode

Encode an audio file into tokens using the tokenizer.

text2speech encode <audio> [-o output.json]

Example:

text2speech encode audio.wav -o tokens.json
cat tokens.json | jq '.count'
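If jq is not available, the same check can be done in Python. The sketch below assumes only what the jq example implies, namely that tokens.json is a JSON object with a top-level "count" field; the function name token_count is hypothetical and other fields are not inspected.

```python
import json
from pathlib import Path


def token_count(tokens_path: str):
    """Return the top-level "count" value from an encode output file."""
    data = json.loads(Path(tokens_path).read_text(encoding="utf-8"))
    # Raises KeyError if the file does not have the expected "count" field.
    return data["count"]
```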

decode

Decode tokens to audio.

text2speech decode <tokens_file> -o <output>

Example:

text2speech decode tokens.json -o output.wav

status

Check service status.

text2speech status

Shows:

API health
GPU availability
Loaded models
Speaker count

speakers

List available preset speakers.

text2speech speakers

languages

List supported languages.

text2speech languages

API Configuration

Default API: https://mc.agaii.org/TTS/api/v1

To use a local backend, edit API_BASE in text2speech_skill/cli.py:

API_BASE = "http://localhost:24536/api/v1"

Voice Cloning Modes

ICL Mode (In-Context Learning)

Requires reference transcript (--ref-text)
Higher quality, follows reference prosody
Default mode when transcript provided

X-Vector Mode

Use --x-vector-only flag
Faster; captures only speaker characteristics
No transcript needed
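The choice between the two modes can be sketched as a small decision function. The name select_clone_mode is hypothetical, and falling back to x-vector mode when no transcript is available is a policy assumed here for illustration, not behaviour confirmed for the CLI.

```python
from typing import List, Optional, Tuple


def select_clone_mode(ref_text: Optional[str],
                      force_x_vector: bool = False) -> Tuple[str, List[str]]:
    """Return (mode name, CLI flags) for a clone call with reference audio."""
    transcript = (ref_text or "").strip()
    if force_x_vector or not transcript:
        # No usable transcript: use speaker characteristics only.
        return "x-vector", ["--x-vector-only"]
    # A transcript is available: ICL mode follows the reference prosody.
    return "icl", ["--ref-text", transcript]
```

The returned flags can be appended to a text2speech clone command that already carries -a and -o.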

Tips

Use the @file.txt syntax to read text from a file: text2speech speak @input.txt -o out.wav
Reference audio should be clear and 5-30 seconds long for best cloning results
ICL mode produces better results than x-vector mode when the transcript is accurate
Batch operations save a batch_report.json file with the results
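The 5-30 second guideline for reference audio can be checked before uploading. The sketch below uses Python's standard wave module, so it only handles uncompressed PCM WAV files; both function names are hypothetical.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        frames = wav_file.getnframes()
        rate = wav_file.getframerate()
    return frames / float(rate)


def is_good_reference_length(path: str, min_seconds: float = 5.0,
                             max_seconds: float = 30.0) -> bool:
    """Check the clip against the 5-30 second guideline from the tips."""
    return min_seconds <= wav_duration_seconds(path) <= max_seconds
```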

Troubleshooting

Job fails with "ref_text required" → Add --ref-text with transcript or use --x-vector-only

Audio quality is poor → Use clearer reference audio, or try different speaker/timbre

Timeout on long text → Break into smaller chunks, or use batch mode
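For the timeout case, one way to break long text into smaller chunks is to group whole sentences up to a character limit. This is a sketch: the function name split_into_chunks and the 300-character default are assumptions chosen for illustration, since no length limit is stated here. A single sentence longer than the limit is kept whole.

```python
import re
from typing import List


def split_into_chunks(text: str, max_chars: int = 300) -> List[str]:
    """Group whole sentences into chunks of at most max_chars where possible."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be written to its own .txt file and processed with batch-speak.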
