VibeVoice TTS
Local text-to-speech using Microsoft's VibeVoice model. Generates natural Spanish voice audio, perfect for WhatsApp voice messages.
Quick Start
# Basic usage
{baseDir}/scripts/vv.sh "Hola, esto es una prueba" -o /tmp/audio.ogg
# From file
{baseDir}/scripts/vv.sh -f texto.txt -o /tmp/audio.ogg
# Different voice
{baseDir}/scripts/vv.sh "Texto" -v en-Wayne -o /tmp/audio.ogg
# Adjust speed (0.5-2.0)
{baseDir}/scripts/vv.sh "Texto" -s 1.2 -o /tmp/audio.ogg
Configuration
| Setting | Default | Description |
|---|---|---|
| Voice | sp-Spk1_man | Spanish male voice (slight Mexican accent) |
| Speed | 1.15 | 15% faster than normal |
| Format | .ogg | Opus codec for WhatsApp |
Available Voices
Spanish:
sp-Spk1_man- Male, slight Mexican accent (default)
English:
en-Wayne- Maleen-Denise- Female- Other voices in
~/VibeVoice/demo/voices/streaming_model/
Output Formats
.ogg- Opus codec (WhatsApp compatible, recommended).mp3- MP3 format.wav- Uncompressed WAV
For WhatsApp
Always use .ogg format with asVoice=true in the message tool:
# Generate
{baseDir}/scripts/vv.sh "Tu mensaje aquí" -o /tmp/mensaje.ogg
# Send via message tool
message action=send channel=whatsapp to="+34XXXXXXXXX" filePath=/tmp/mensaje.ogg asVoice=true
Requirements
- GPU: NVIDIA with ~2GB VRAM
- VibeVoice: Installed at
~/VibeVoice - ffmpeg: For audio conversion
- Python 3.10+: With torch, torchaudio
Performance
- RTF: ~0.24x (generates faster than realtime)
- 1 minute of audio ≈ 15 seconds to generate
Notes
- First run loads model (~10s), subsequent runs are faster
- Audio rule: Only send voice if user requests it or speaks via audio
- Keep text under 1500 chars for best quality