Say + FFmpeg TTS Pipeline
Use say (macOS native TTS) + ffmpeg to generate Opus voice messages for Telegram/Matrix.
Why not just say?
- Telegram/Matrix require Opus codec voice messages
sayoutputs AIFF/m4a; must convert to.ogg(Opus) before sending- Telegram accepts: OGG/MP3/M4A as voice — but Opus OGG is the native format
Workflow
say -v "<voice>" -o <tmpdir>/<name>.aiff "<text>"
ffmpeg -i <tmpdir>/<name>.aiff -acodec libopus <tmpdir>/<name>.ogg -y
Send with message tool:
{
"action": "send",
"channel": "telegram",
"media": "<tmpdir>/<name>.ogg",
"asVoice": true,
"target": "<chat_id>"
}
Recommended workspace directory
~/.openclaw/workspace/tmp/audio/
(Whitelist this path in exec permissions for faster approval)
Voice selection
Use say -v '?' to list available voices. Notable ones:
Trinoids— robotic/electronic voice (popular for bots)Samantha— warm US female voiceAlex— US male voiceFred— neutral US male voiceKaren— Australian female voice
Note: pass just the voice name (e.g. "Trinoids"), not the full en_US suffix.
Example: send a hello voice message
VOICE="Trinoids"
TEXT="Hello!"
DIR="$HOME/.openclaw/workspace/tmp/audio"
mkdir -p "$DIR"
say -v "$VOICE" -o "$DIR/hello.aiff" "$TEXT"
ffmpeg -i "$DIR/hello.aiff" -acodec libopus "$DIR/hello.ogg" -y
# Then send via message tool with asVoice: true
Format notes
- Input to ffmpeg: AIFF (
.aiff) works reliably; avoid.m4awithsay - Output: Opus in Ogg container (
libopuscodec) — required for Telegram voice messages - Telegram
sendVoiceaccepts: OGG, MP3, M4A — but native is Opus OGG - Sample rate:
sayoutputs 24kHz AIFF; ffmpeg re-encodes to Opus at 24kHz
Integration with OpenClaw TTS
OpenClaw's built-in messages.tts only supports: ElevenLabs, Microsoft Edge, MiniMax, OpenAI.
This say+ffmpeg pipeline is a workaround for local-only TTS without API keys or cloud services. It's not auto-triggered by OpenClaw — call it manually via exec + message tool.
Language Detection → Voice Mapping
When responding to a voice message, detect the language from the STT output (Parakeet auto-detects). Then pick the matching say voice using i18n locale codes.
Finding voices by language:
say -v '?' 2>&1 | grep -E "cs_CZ|en_US|de_DE|fr_FR|it_IT|es_ES"
Language → voice selection priority:
- Use
<voice> (Premium)if available - Fall back to
<voice> (Enhanced)if available - Fall back to base
<voice>name - Never use a voice that doesn't match the language
| Language | i18n code | Preferred Voice |
|---|---|---|
| Czech | cs_CZ | Zuzana (Premium) |
| English (US) | en_US | Trinoids (no Premium/Enhanced available) |
| German | de_DE | Grandma (Premium) if available |
| French | fr_FR | Grandma (Premium) if available |
| Spanish | es_ES | Grandma (Premium) if available |
| Italian | it_IT | Grandma (Premium) if available |
Key: Always use just the voice name (e.g. "Trinoids", "Zuzana"), not the full locale suffix. The locale suffix in say -v '?' output is for grepping/identification only.
Example workflow:
LANG="cs_CZ"
# Find best available voice for this language (Premium > Enhanced > base)
VOICE=$(say -v '?' 2>&1 | grep "$LANG" | head -3 | awk '{print $1}' | sed -n '1p')
say -v "$VOICE" -o reply.aiff "Česká odpověď"
ffmpeg -i reply.aiff -acodec libopus reply.ogg -y
TODOs
- Detect language from STT transcription and auto-select appropriate
sayvoice - Explore integrating into OpenClaw via custom TTS provider plugin
- Investigate if OpenClaw supports post-processing TTS output via a hook
- Test Matrix channel voice message format compatibility