# Local Voice Reply
Use this skill to turn text into a cloned/custom-voice audio reply and deliver it reliably to Feishu or Discord.
## Structured skill definition

- Purpose: local low-latency voice replies in Opus/Ogg.
- Channels: Feishu + Discord.
- Default voice: `juno` (reference file: `voice/juno_ref.wav`).
- Custom voice modes:
  - File-based: replace/update `voice/juno_ref.wav`.
  - Registry-based: upload/register voices via `POST /voice/register`, then call by `voice_name`.
- Output: `.opus` (Ogg container) under `.openclaw/media/outbound/voice-server-v3/` (or `TARVIS_VOICE_OUTPUT_DIR`).
- Control scripts:
  - `scripts/send_voice_reply.ps1` (server API path)
  - `scripts/generate_cuda_voice.ps1` (stable local CUDA generation path)

Server implementation is kept with the skill (not workspace root):

- `server/voice_server_v3.py` (FastAPI routes)
- `server/voice_engine.py` (generation and cache engine)

Voice assets are also colocated with the skill:

- `voice/`
## Runtime requirements

- `ffmpeg` must be installed and available on `PATH` (required for Opus encoding).
- Python packages required by the server: `fastapi`, `uvicorn`, `python-multipart`, `chatterbox-tts`, `torch`, `torchaudio`, `numpy`.
- On first startup, `ChatterboxTTS.from_pretrained()` may download model assets, so the initial run can require network access and additional disk space.
- Optional env vars:
  - `TARVIS_VOICE_OUTPUT_DIR` to override where generated Opus files are written.
  - `TARVIS_VOICE_DEVICE` to force device selection (`cuda`/`gpu`, `mps`, or `cpu`).
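The `TARVIS_VOICE_DEVICE` handling can be sketched as below. This is a minimal illustration of the documented values; the fallback-to-`cpu` default is an assumption, and the real auto-detection lives in `server/voice_engine.py`.

```python
import os

def pick_device(env: dict = os.environ) -> str:
    """Resolve the synthesis device from TARVIS_VOICE_DEVICE.

    Accepted values mirror the documented ones (cuda/gpu, mps, cpu).
    The default when unset or unrecognized is an assumption here.
    """
    forced = env.get("TARVIS_VOICE_DEVICE", "").strip().lower()
    if forced in {"cuda", "gpu"}:
        return "cuda"          # both spellings map to the CUDA device
    if forced in {"mps", "cpu"}:
        return forced
    return "cpu"               # conservative default; real auto-detect is in voice_engine.py
```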
## Persistence behavior

- Uploaded voice samples from `POST /voice/register` are persisted under `server/voices/`.
- Cache and registry data are persisted under `server/voice_cache/`.
- Generated Opus outputs are written under `.openclaw/media/outbound/voice-server-v3/` by default (or `TARVIS_VOICE_OUTPUT_DIR` when set).
- `POST /output/cleanup` only deletes staged `.opus` files inside the configured output directory and their `.json` sidecar files.
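The cleanup behavior described above can be sketched as a small function: delete only `.opus` files in the output directory plus their `.json` sidecars, and nothing else. This is an illustration of the documented contract, not the actual route implementation in `server/voice_server_v3.py`.

```python
from pathlib import Path

def cleanup_output_dir(output_dir: Path) -> list[Path]:
    """Delete staged .opus files and their .json sidecars only.

    Sketch of what POST /output/cleanup is documented to do; other
    files in the directory are left untouched.
    """
    removed: list[Path] = []
    for opus in output_dir.glob("*.opus"):
        sidecar = opus.with_suffix(".json")
        opus.unlink()
        removed.append(opus)
        if sidecar.exists():       # sidecar is optional
            sidecar.unlink()
            removed.append(sidecar)
    return removed
```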
## Use this workflow

- Ensure the local v3.3 TTS server is running from this skill folder:
  `python -m uvicorn --app-dir server voice_server_v3:app --host 127.0.0.1 --port 8000`
- Call `/speak` with `text` (and optional `speed`, `exaggeration`, `cfg`). `voice_name` defaults to `juno`.
- Receive Opus directly from the server (`audio/ogg`) in the Juno voice.
- Save the final media into an allowed path: `C:\Users\hanli\.openclaw\media\outbound\`
- Send with the `message` tool: `action=send`, `filePath=<allowed-path>`, `asVoice=true`.
  - For Feishu: `channel=feishu`
  - For Discord: `channel=discord`
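The `/speak` call in the workflow above can be sketched as a small stdlib client. This assumes `/speak` accepts a JSON body; the field names used are only the ones documented (`text`, `voice_name`, `speed`, `exaggeration`, `cfg`), and the actual request shape should be confirmed against `server/voice_server_v3.py`.

```python
import json
import urllib.request

SPEAK_URL = "http://127.0.0.1:8000/speak"

def build_speak_request(text: str, voice_name: str = "juno",
                        speed: float = 1.2, **extra) -> urllib.request.Request:
    """Build a POST to /speak (JSON body is an assumption)."""
    payload = {"text": text, "voice_name": voice_name, "speed": speed, **extra}
    return urllib.request.Request(
        SPEAK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def speak_to_file(text: str, out_path: str, **kwargs) -> None:
    """POST the request and write the returned audio/ogg bytes to disk."""
    req = build_speak_request(text, **kwargs)
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        f.write(resp.read())
```

Write `out_path` under `.openclaw/media/outbound/` so the subsequent `message` send is allowed.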
## Voice customization guide

### A) Replace default Juno reference

- Replace `voice/juno_ref.wav` with your target reference voice sample.
- Keep the sample clean (single speaker, low noise, clear pronunciation).
- Restart the server and test with `voice_name=juno`.

### B) Register additional named voices

- Call `POST /voice/register` with a reference sample and target `voice_name`.
- Confirm registration under `server/voices/`.
- Generate with that `voice_name` in `/speak` or `/speak_stream`.
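Since `POST /voice/register` takes an uploaded sample (the server requires `python-multipart`), the request body is multipart form data. Below is a sketch of building such a body by hand with the stdlib; the field names `voice_name` and `file` are assumptions, so check the route in `server/voice_server_v3.py` for the actual parameter names.

```python
import uuid

REGISTER_URL = "http://127.0.0.1:8000/voice/register"

def build_register_body(voice_name: str, wav_bytes: bytes,
                        filename: str = "ref.wav") -> tuple[dict, bytes]:
    """Build multipart/form-data headers and body for /voice/register.

    Field names "voice_name" and "file" are hypothetical placeholders.
    """
    boundary = uuid.uuid4().hex
    body = b"".join([
        # text field carrying the target voice name
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="voice_name"\r\n\r\n{voice_name}\r\n'.encode(),
        # file field carrying the reference WAV sample
        f'--{boundary}\r\nContent-Disposition: form-data; '
        f'name="file"; filename="{filename}"\r\n'
        f'Content-Type: audio/wav\r\n\r\n'.encode() + wav_bytes + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return headers, body
```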
## Defaults

- `voice_name`: `juno`
- `speed`: `1.2`
- Output format: Opus in an Ogg container from the server's `/speak` (no post-conversion)
- Discord compatibility: Ogg/Opus is supported and can be sent as voice/audio with `asVoice=true`
## Speed Improvements In This Version

- Caches model capability lookups once at startup.
- Uses `torch.inference_mode()` during synthesis to reduce overhead.
- Reuses the phrase cache for both `/speak` and `/speak_stream`.
- Improves chunking behavior for long CJK text to avoid oversized chunks.
- Keeps latency metrics for benchmarking and tuning.
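The CJK chunking improvement can be illustrated with a simplified splitter: break on sentence punctuation (including CJK full stops), with a hard length cap so punctuation-free runs never produce an oversized chunk. The real implementation lives in `server/voice_engine.py`; the break set and `max_chars=80` here are illustrative assumptions.

```python
def chunk_text(text: str, max_chars: int = 80) -> list[str]:
    """Split text into synthesis-sized chunks (simplified sketch).

    Prefers sentence punctuation as break points; falls back to a
    hard split so long unpunctuated CJK runs stay under max_chars.
    """
    breaks = set("。！？!?.;；\n")
    chunks: list[str] = []
    current = ""
    for ch in text:
        current += ch
        if ch in breaks and current.strip():
            chunks.append(current.strip())   # natural sentence boundary
            current = ""
        elif len(current) >= max_chars:
            chunks.append(current)           # hard split, no punctuation seen
            current = ""
    if current.strip():
        chunks.append(current.strip())
    return chunks
```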
## Common failure and fix

- Error: `LocalMediaAccessError ... path-not-allowed`
- Fix: copy the file into `.openclaw/media/outbound` before sending.
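The documented fix can be sketched as a small staging helper that copies a generated file into the allowed outbound directory before sending. The outbound root below is derived from the workflow section; the helper name is illustrative.

```python
import shutil
from pathlib import Path

# Allowed root per the workflow section (~/.openclaw/media/outbound)
ALLOWED_OUTBOUND = Path.home() / ".openclaw" / "media" / "outbound"

def stage_for_send(src: str, outbound: Path = ALLOWED_OUTBOUND) -> Path:
    """Copy src into the allowed outbound dir and return the new path."""
    outbound.mkdir(parents=True, exist_ok=True)
    dest = outbound / Path(src).name
    shutil.copy2(src, dest)    # preserve timestamps/metadata
    return dest
```

Pass the returned path as `filePath` to the `message` tool to avoid `LocalMediaAccessError`.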
## Script

Use `scripts/send_voice_reply.ps1` to generate Opus directly with defaults (`voice_name=juno`, `speed=1.2`).
It auto-selects `/speak_stream` for longer text (or when `-Stream` is passed) for better throughput.

For stable CUDA generation command patterns under stricter exec approval policies, use
`scripts/generate_cuda_voice.ps1 -Text "..."`.
This keeps the outer command shape fixed so `allow-always` is more reusable.