# TTS (Text to Speech) Skill

This skill guides the implementation of text-to-speech (TTS) functionality using the z-ai-web-dev-sdk package, enabling conversion of text into natural-sounding speech audio.
## Skills Path

Skill Location: `{project_path}/skills/TTS`

This skill is located at the above path in your project.

Reference Scripts: Example test scripts are available in the `{Skill Location}/scripts/` directory for quick testing and reference. See `{Skill Location}/scripts/tts.ts` for a working example.
## Overview

Text-to-Speech allows you to build applications that generate spoken audio from text input, supporting various voices, speeds, and output formats for diverse use cases.

**IMPORTANT**: z-ai-web-dev-sdk MUST be used in backend code only. Never use it in client-side code.
## API Limitations and Constraints

Before implementing TTS functionality, be aware of these important limitations:

### Input Text Constraints

- Maximum length: 1024 characters per request
- Text exceeding this limit must be split into smaller chunks

### Audio Parameters

- Speed range: 0.5 to 2.0
  - 0.5 = half speed (slower)
  - 1.0 = normal speed (default)
  - 2.0 = double speed (faster)
- Volume range: greater than 0 (exclusive), up to 10 (inclusive)
  - Default: 1.0

### Format and Streaming

- Streaming limitation: when `stream: true` is enabled, only `pcm` format is supported
- Non-streaming: supports `wav`, `pcm`, and `mp3` formats
- Sample rate: 24000 Hz (recommended)
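These constraints can be checked up front, before a request ever reaches the API. A minimal sketch of such a check (the `validateTtsParams` helper is illustrative, not part of the SDK):

```javascript
// Validate TTS parameters against the documented API constraints.
// Returns { valid: true } or { valid: false, errors: [...] }.
function validateTtsParams({ text, speed = 1.0, volume = 1.0, format = 'wav', stream = false } = {}) {
  const errors = [];
  if (!text || text.trim().length === 0) errors.push('Text must not be empty');
  if (text && text.length > 1024) errors.push('Text exceeds 1024 characters');
  if (speed < 0.5 || speed > 2.0) errors.push('Speed must be in [0.5, 2.0]');
  if (volume <= 0 || volume > 10) errors.push('Volume must be in (0, 10]');
  if (stream && format !== 'pcm') errors.push('Streaming supports only the pcm format');
  return errors.length === 0 ? { valid: true } : { valid: false, errors };
}
```

Rejecting bad input locally gives clearer error messages than waiting for an API error.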
### Best Practice for Long Text

```javascript
function splitTextIntoChunks(text, maxLength = 1000) {
  const chunks = [];
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];

  let currentChunk = '';
  for (const sentence of sentences) {
    if ((currentChunk + sentence).length <= maxLength) {
      currentChunk += sentence;
    } else {
      if (currentChunk) chunks.push(currentChunk.trim());
      currentChunk = sentence;
    }
  }
  if (currentChunk) chunks.push(currentChunk.trim());

  return chunks;
}
```
## Prerequisites

The z-ai-web-dev-sdk package is already installed. Import it as shown in the examples below.
## CLI Usage (For Simple Tasks)

For simple text-to-speech conversions, you can use the z-ai CLI instead of writing code. This is ideal for quick audio generation, testing voices, or simple automation.

### Basic TTS

```bash
# Convert text to speech (default WAV format)
z-ai tts --input "Hello, world" --output ./hello.wav

# Using short options
z-ai tts -i "Hello, world" -o ./hello.wav
```

### Different Voices and Speed

```bash
# Use a specific voice
z-ai tts -i "Welcome to our service" -o ./welcome.wav --voice tongtong

# Adjust speech speed (0.5-2.0)
z-ai tts -i "This is faster speech" -o ./fast.wav --speed 1.5

# Slower speech
z-ai tts -i "This is slower speech" -o ./slow.wav --speed 0.8
```

### Different Output Formats

```bash
# MP3 format
z-ai tts -i "Hello World" -o ./hello.mp3 --format mp3

# WAV format (default)
z-ai tts -i "Hello World" -o ./hello.wav --format wav

# PCM format
z-ai tts -i "Hello World" -o ./hello.pcm --format pcm
```

### Streaming Output

```bash
# Stream audio generation (streaming supports only the pcm format)
z-ai tts -i "This is a longer text that will be streamed" -o ./stream.pcm --format pcm --stream
```
### CLI Parameters

- `--input, -i <text>`: Required. Text to convert to speech (max 1024 characters)
- `--output, -o <path>`: Required. Output audio file path
- `--voice, -v <voice>`: Optional. Voice type (default: tongtong)
- `--speed, -s <number>`: Optional. Speech speed, 0.5-2.0 (default: 1.0)
- `--format, -f <format>`: Optional. Output format: wav, mp3, pcm (default: wav)
- `--stream`: Optional. Enable streaming output (only supports pcm format)
### When to Use CLI vs SDK

Use CLI for:

- Quick text-to-speech conversions
- Testing different voices and speeds
- Simple batch audio generation
- Command-line automation scripts

Use SDK for:

- Dynamic audio generation in applications
- Integration with web services
- Custom audio processing pipelines
- Production applications with complex requirements
## Basic TTS Implementation

### Simple Text to Speech

```javascript
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function textToSpeech(text, outputPath) {
  const zai = await ZAI.create();

  const response = await zai.audio.tts.create({
    input: text,
    voice: 'tongtong',
    speed: 1.0,
    response_format: 'wav',
    stream: false
  });

  // Get array buffer from Response object
  const arrayBuffer = await response.arrayBuffer();
  const buffer = Buffer.from(new Uint8Array(arrayBuffer));

  fs.writeFileSync(outputPath, buffer);

  console.log(`Audio saved to ${outputPath}`);

  return outputPath;
}

// Usage
await textToSpeech('Hello, world!', './output.wav');
```
### Multiple Voice Options

```javascript
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function generateWithVoice(text, voice, outputPath) {
  const zai = await ZAI.create();

  const response = await zai.audio.tts.create({
    input: text,
    voice: voice, // Available voices: tongtong, chuichui, xiaochen, jam, kazi, douji, luodo
    speed: 1.0,
    response_format: 'wav',
    stream: false
  });

  // Get array buffer from Response object
  const arrayBuffer = await response.arrayBuffer();
  const buffer = Buffer.from(new Uint8Array(arrayBuffer));

  fs.writeFileSync(outputPath, buffer);
  return outputPath;
}

// Usage
await generateWithVoice('Welcome to our service', 'tongtong', './welcome.wav');
```
### Adjustable Speed

```javascript
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function generateWithSpeed(text, speed, outputPath) {
  const zai = await ZAI.create();

  // Speed range: 0.5 to 2.0 (API constraint)
  // 0.5 = half speed (slower)
  // 1.0 = normal speed (default)
  // 2.0 = double speed (faster)
  // Values outside this range will cause API errors
  const response = await zai.audio.tts.create({
    input: text,
    voice: 'tongtong',
    speed: speed,
    response_format: 'wav',
    stream: false
  });

  // Get array buffer from Response object
  const arrayBuffer = await response.arrayBuffer();
  const buffer = Buffer.from(new Uint8Array(arrayBuffer));

  fs.writeFileSync(outputPath, buffer);
  return outputPath;
}

// Usage - slower narration
await generateWithSpeed('This is an important announcement', 0.8, './slow.wav');

// Usage - faster narration
await generateWithSpeed('Quick update', 1.3, './fast.wav');
```
### Adjustable Volume

```javascript
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function generateWithVolume(text, volume, outputPath) {
  const zai = await ZAI.create();

  // Volume range: greater than 0, up to 10 (API constraint)
  // Values must be > 0 (exclusive) and <= 10 (inclusive)
  // Default: 1.0 (normal volume)
  const response = await zai.audio.tts.create({
    input: text,
    voice: 'tongtong',
    speed: 1.0,
    volume: volume, // Optional parameter
    response_format: 'wav',
    stream: false
  });

  // Get array buffer from Response object
  const arrayBuffer = await response.arrayBuffer();
  const buffer = Buffer.from(new Uint8Array(arrayBuffer));

  fs.writeFileSync(outputPath, buffer);
  return outputPath;
}

// Usage - louder audio
await generateWithVolume('This is an announcement', 5.0, './loud.wav');

// Usage - quieter audio
await generateWithVolume('Whispered message', 0.5, './quiet.wav');
```
## Advanced Use Cases

### Batch Processing

```javascript
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
import path from 'path';

async function batchTextToSpeech(textArray, outputDir) {
  const zai = await ZAI.create();
  const results = [];

  // Ensure output directory exists
  if (!fs.existsSync(outputDir)) {
    fs.mkdirSync(outputDir, { recursive: true });
  }

  for (let i = 0; i < textArray.length; i++) {
    try {
      const text = textArray[i];
      const outputPath = path.join(outputDir, `audio_${i + 1}.wav`);

      const response = await zai.audio.tts.create({
        input: text,
        voice: 'tongtong',
        speed: 1.0,
        response_format: 'wav',
        stream: false
      });

      // Get array buffer from Response object
      const arrayBuffer = await response.arrayBuffer();
      const buffer = Buffer.from(new Uint8Array(arrayBuffer));

      fs.writeFileSync(outputPath, buffer);

      results.push({
        success: true,
        text,
        path: outputPath
      });
    } catch (error) {
      results.push({
        success: false,
        text: textArray[i],
        error: error.message
      });
    }
  }

  return results;
}

// Usage
const texts = [
  'Welcome to chapter one',
  'Welcome to chapter two',
  'Welcome to chapter three'
];

const results = await batchTextToSpeech(texts, './audio-output');
console.log('Generated:', results.length, 'audio files');
```
### Dynamic Content Generation

```javascript
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

class TTSGenerator {
  constructor() {
    this.zai = null;
  }

  async initialize() {
    this.zai = await ZAI.create();
  }

  async generateAudio(text, options = {}) {
    const { voice = 'tongtong', speed = 1.0, format = 'wav' } = options;

    const response = await this.zai.audio.tts.create({
      input: text,
      voice: voice,
      speed: speed,
      response_format: format,
      stream: false
    });

    // Get array buffer from Response object
    const arrayBuffer = await response.arrayBuffer();
    return Buffer.from(new Uint8Array(arrayBuffer));
  }

  async saveAudio(text, outputPath, options = {}) {
    const buffer = await this.generateAudio(text, options);
    if (buffer) {
      fs.writeFileSync(outputPath, buffer);
      return outputPath;
    }
    return null;
  }
}

// Usage
const generator = new TTSGenerator();
await generator.initialize();

await generator.saveAudio(
  'Hello, this is a test',
  './output.wav',
  { speed: 1.2 }
);
```
### Next.js API Route Example

```typescript
import { NextRequest, NextResponse } from 'next/server';

export async function POST(req: NextRequest) {
  try {
    const { text, voice = 'tongtong', speed = 1.0 } = await req.json();

    // Import ZAI SDK
    const ZAI = (await import('z-ai-web-dev-sdk')).default;

    // Create SDK instance
    const zai = await ZAI.create();

    // Generate TTS audio
    const response = await zai.audio.tts.create({
      input: text.trim(),
      voice: voice,
      speed: speed,
      response_format: 'wav',
      stream: false,
    });

    // Get array buffer from Response object
    const arrayBuffer = await response.arrayBuffer();
    const buffer = Buffer.from(new Uint8Array(arrayBuffer));

    // Return audio as response
    return new NextResponse(buffer, {
      status: 200,
      headers: {
        'Content-Type': 'audio/wav',
        'Content-Length': buffer.length.toString(),
        'Cache-Control': 'no-cache',
      },
    });
  } catch (error) {
    console.error('TTS API Error:', error);

    return NextResponse.json(
      {
        error: error instanceof Error ? error.message : 'Speech generation failed, please try again later',
      },
      { status: 500 }
    );
  }
}
```
Best Practices
- Text Preparation
function prepareTextForTTS(text) { // Remove excessive whitespace text = text.replace(/\s+/g, ' ').trim();
// Expand common abbreviations for better pronunciation const abbreviations = { 'Dr.': 'Doctor', 'Mr.': 'Mister', 'Mrs.': 'Misses', 'etc.': 'et cetera' };
for (const [abbr, full] of Object.entries(abbreviations)) { text = text.replace(new RegExp(abbr, 'g'), full); }
return text; }
### Error Handling

```javascript
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';

async function safeTTS(text, outputPath) {
  try {
    // Validate input
    if (!text || text.trim().length === 0) {
      throw new Error('Text input cannot be empty');
    }

    if (text.length > 1024) {
      throw new Error('Text input exceeds maximum length of 1024 characters');
    }

    const zai = await ZAI.create();

    const response = await zai.audio.tts.create({
      input: text,
      voice: 'tongtong',
      speed: 1.0,
      response_format: 'wav',
      stream: false
    });

    // Get array buffer from Response object
    const arrayBuffer = await response.arrayBuffer();
    const buffer = Buffer.from(new Uint8Array(arrayBuffer));

    fs.writeFileSync(outputPath, buffer);

    return {
      success: true,
      path: outputPath,
      size: buffer.length
    };
  } catch (error) {
    console.error('TTS Error:', error);
    return { success: false, error: error.message };
  }
}
```
### SDK Instance Reuse

```javascript
import ZAI from 'z-ai-web-dev-sdk';

// Create a singleton instance
let zaiInstance = null;

async function getZAIInstance() {
  if (!zaiInstance) {
    zaiInstance = await ZAI.create();
  }
  return zaiInstance;
}

// Usage
const zai = await getZAIInstance();
const response = await zai.audio.tts.create({ ... });
```
## Common Use Cases

- Audiobooks & Podcasts: Convert written content to audio format
- E-learning: Create narration for educational content
- Accessibility: Provide audio versions of text content
- Voice Assistants: Generate dynamic responses
- Announcements: Create automated audio notifications
- IVR Systems: Generate phone system prompts
- Content Localization: Create audio in different languages
## Integration Examples

### Express.js API Endpoint

```javascript
import express from 'express';
import ZAI from 'z-ai-web-dev-sdk';
import fs from 'fs';
import path from 'path';

const app = express();
app.use(express.json());

let zaiInstance;
const outputDir = './audio-output';

async function initZAI() {
  zaiInstance = await ZAI.create();
  if (!fs.existsSync(outputDir)) {
    fs.mkdirSync(outputDir, { recursive: true });
  }
}

app.post('/api/tts', async (req, res) => {
  try {
    const { text, voice = 'tongtong', speed = 1.0 } = req.body;

    if (!text) {
      return res.status(400).json({ error: 'Text is required' });
    }

    const filename = `tts_${Date.now()}.wav`;
    const outputPath = path.join(outputDir, filename);

    const response = await zaiInstance.audio.tts.create({
      input: text,
      voice: voice,
      speed: speed,
      response_format: 'wav',
      stream: false
    });

    // Get array buffer from Response object
    const arrayBuffer = await response.arrayBuffer();
    const buffer = Buffer.from(new Uint8Array(arrayBuffer));

    fs.writeFileSync(outputPath, buffer);

    res.json({
      success: true,
      audioUrl: `/audio/${filename}`,
      size: buffer.length
    });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.use('/audio', express.static('audio-output'));

initZAI().then(() => {
  app.listen(3000, () => {
    console.log('TTS API running on port 3000');
  });
});
```
## Troubleshooting

**Issue: "Input text exceeds maximum length"**
- Solution: Text input is limited to 1024 characters. Split longer text into chunks using the `splitTextIntoChunks` function shown in the API Limitations section.

**Issue: "Invalid speed parameter" or unexpected speed behavior**
- Solution: Speed must be between 0.5 and 2.0. Check that your speed value is within this range.

**Issue: "Invalid volume parameter"**
- Solution: Volume must be greater than 0 and up to 10. Ensure the volume value is in the range (0, 10].

**Issue: "Stream format not supported" with WAV/MP3**
- Solution: Streaming mode only supports PCM format. Either use `response_format: 'pcm'` with streaming, or disable streaming (`stream: false`) for WAV/MP3 output.

**Issue: "SDK must be used in backend"**
- Solution: Ensure z-ai-web-dev-sdk is only imported in server-side code.

**Issue: "TypeError: response.audio is undefined"**
- Solution: The SDK returns a standard Response object; use `await response.arrayBuffer()` instead of accessing `response.audio`.

**Issue: Generated audio file is empty or corrupted**
- Solution: Ensure you're calling `await response.arrayBuffer()` and properly converting to a Buffer: `Buffer.from(new Uint8Array(arrayBuffer))`.

**Issue: Audio sounds unnatural**
- Solution: Prepare text properly (remove special characters, expand abbreviations).

**Issue: Long processing times**
- Solution: Break long text into smaller chunks and process them in parallel.

**Issue: Next.js caching an old API route**
- Solution: Create a new API route endpoint or restart the dev server.
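The "process in parallel" tip above can be sketched with a small concurrency-limited mapper, so many chunks are in flight at once without flooding the API. `mapWithConcurrency` is an illustrative helper, not part of the SDK:

```javascript
// Map an async function over items, running at most `limit` calls concurrently.
// Results are returned in the same order as the input items.
async function mapWithConcurrency(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index.
  async function worker() {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i], i);
    }
  }

  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, worker));
  return results;
}
```

Usage with the helpers above might look like `await mapWithConcurrency(chunks, 3, (chunk, i) => textToSpeech(chunk, `./chunk_${i}.wav`))`.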
## Performance Tips

- Reuse SDK Instance: Create the ZAI instance once and reuse it
- Implement Caching: Cache generated audio for repeated text
- Batch Processing: Process multiple texts efficiently
- Optimize Text: Remove unnecessary content before generation
- Async Processing: Use queues for handling multiple requests
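The caching tip can be implemented with a content-addressed key: hash the text together with the generation options, and reuse the audio file on disk if that key already exists. A sketch (the `ttsCacheKey` helper and file-naming scheme are illustrative):

```javascript
import { createHash } from 'node:crypto';

// Build a deterministic cache key from the text and generation options,
// so identical requests map to the same audio file.
function ttsCacheKey(text, { voice = 'tongtong', speed = 1.0, format = 'wav' } = {}) {
  return createHash('sha256')
    .update(JSON.stringify({ text, voice, speed, format }))
    .digest('hex');
}
```

Before calling the API, check whether `path.join(cacheDir, ttsCacheKey(text, opts) + '.wav')` already exists and return it directly if so.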
## Important Notes

### API Constraints

Input Text Length: Maximum 1024 characters per request. For longer text:

```javascript
// Split long text into chunks
const longText = "..."; // Your long text here
const chunks = splitTextIntoChunks(longText, 1000);

for (const chunk of chunks) {
  const response = await zai.audio.tts.create({
    input: chunk,
    voice: 'tongtong',
    speed: 1.0,
    response_format: 'wav',
    stream: false
  });
  // Process each chunk...
}
```

Streaming Format Limitation: When using `stream: true`, only `pcm` format is supported. For `wav` or `mp3` output, use `stream: false`.

Sample Rate: Audio is generated at a 24000 Hz sample rate (recommended setting for playback).
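Since streaming yields raw PCM with no container, most playback tools need the bytes wrapped in a WAV header before they can play the file. A sketch of that wrapping, assuming 16-bit signed little-endian mono PCM at 24000 Hz (bit depth and channel count are assumptions here; adjust them to match the actual stream):

```javascript
// Wrap raw PCM bytes in a minimal 44-byte WAV (RIFF) header.
function pcmToWav(pcmBuffer, sampleRate = 24000, channels = 1, bitsPerSample = 16) {
  const byteRate = sampleRate * channels * (bitsPerSample / 8);
  const blockAlign = channels * (bitsPerSample / 8);
  const header = Buffer.alloc(44);

  header.write('RIFF', 0);
  header.writeUInt32LE(36 + pcmBuffer.length, 4); // total size minus first 8 bytes
  header.write('WAVE', 8);
  header.write('fmt ', 12);
  header.writeUInt32LE(16, 16);                   // fmt chunk size
  header.writeUInt16LE(1, 20);                    // audio format 1 = uncompressed PCM
  header.writeUInt16LE(channels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write('data', 36);
  header.writeUInt32LE(pcmBuffer.length, 40);

  return Buffer.concat([header, pcmBuffer]);
}
```

This converts a `.pcm` file collected from streaming output into a playable `.wav` without re-calling the API.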
### Response Object Format

The `zai.audio.tts.create()` method returns a standard Response object (not a custom object with an `audio` property). Always use:

```javascript
// ✅ CORRECT
const response = await zai.audio.tts.create({ ... });
const arrayBuffer = await response.arrayBuffer();
const buffer = Buffer.from(new Uint8Array(arrayBuffer));

// ❌ WRONG - This will not work
const response = await zai.audio.tts.create({ ... });
const buffer = Buffer.from(response.audio); // response.audio is undefined
```
### Available Voices

- tongtong: warm and friendly
- chuichui: lively and cute
- xiaochen: calm and professional
- jam: British-accented gentleman
- kazi: clear and standard
- douji: natural and fluent
- luodo: expressive and engaging
### Speed Range

- Minimum: 0.5 (half speed)
- Default: 1.0 (normal speed)
- Maximum: 2.0 (double speed)

Important: Speed values outside the range [0.5, 2.0] will result in API errors.

### Volume Range

- Minimum: greater than 0 (exclusive)
- Default: 1.0 (normal volume)
- Maximum: 10 (inclusive)

Note: The volume parameter is optional. When not specified, it defaults to 1.0.
## Remember

- Always use z-ai-web-dev-sdk in backend code only
- Input text is limited to 1024 characters maximum; split longer text into chunks
- Speed must be between 0.5 and 2.0; values outside this range will cause errors
- Volume must be greater than 0 and up to 10; this optional parameter defaults to 1.0
- Streaming only supports PCM format; use non-streaming for WAV or MP3 output
- The SDK returns a standard Response object; use `await response.arrayBuffer()`
- Convert ArrayBuffer to Buffer using `Buffer.from(new Uint8Array(arrayBuffer))`
- Handle audio buffers properly when saving to files
- Implement error handling for production applications
- Consider caching for frequently generated content
- Clean up old audio files periodically to manage storage