Smallest AI — Ultra-Fast Voice Suite
Text-to-speech (sub-100ms) via Lightning v3.1 and speech-to-text (64ms TTFT) via Pulse.
Setup
- Get API key from https://waves.smallest.ai → click "API Key" in left panel
- Set SMALLEST_API_KEY in your environment:
export SMALLEST_API_KEY="your_key_here"
Defaults
- Default female voice: sophia (American English)
- Default male voice: robert (American English)
- Default language: en
- Default speed: 1.0
- Default sample rate: 24000
Voice Selection Rules
Follow these rules to select the voice:
- If user explicitly names a voice (e.g. "use advika"), use that voice.
- If user asks for a male voice, use the configured defaultVoiceMale.
- If user asks for a female voice, use the configured defaultVoiceFemale.
- If no gender preference, use defaultVoiceFemale (sophia by default).
- For Hindi content: use advika (female) or vivaan (male).
- For Spanish content: use camilla (female) or carlos (male).
- For Tamil content: use anitha (female) or raju (male).
Always pass the configured defaultLanguage, defaultSpeed, and defaultSampleRate as --lang, --speed, and --rate flags unless the user overrides them.
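The selection rules above can be sketched as a small function. This is a minimal illustration, not the skill's actual implementation: the name select_voice, its parameters, and the "ta" code for Tamil are assumptions, and it assumes that a language-specific pair takes precedence over the configured defaults when the content is Hindi, Spanish, or Tamil.

```python
from typing import Optional

# (female, male) voice pairs from the rules above, keyed by language code.
# "ta" for Tamil is an assumed ISO 639-1 code; the source names only the language.
LANGUAGE_VOICES = {
    "hi": ("advika", "vivaan"),
    "es": ("camilla", "carlos"),
    "ta": ("anitha", "raju"),
}


def select_voice(
    explicit: Optional[str] = None,
    gender: Optional[str] = None,
    lang: str = "en",
    default_female: str = "sophia",
    default_male: str = "robert",
) -> str:
    """Hypothetical helper: return a voice id following the rules above."""
    # Rule 1: an explicitly named voice always wins.
    if explicit:
        return explicit.strip().lower()
    # Language-specific pair if one exists, otherwise the configured defaults.
    female, male = LANGUAGE_VOICES.get(lang, (default_female, default_male))
    # A male request picks the male voice; anything else falls back to female.
    if gender == "male":
        return male
    return female
```

For example, select_voice(lang="hi", gender="male") picks vivaan, while a call with no arguments falls back to sophia.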
Text-to-Speech
Generate speech audio from text using the Lightning v3.1 model.
Shell (preferred — zero dependencies)
{baseDir}/scripts/tts.sh "Text to speak" --voice sophia --rate 24000 --speed 1.0 --lang en
Python (requires pip install smallestai or just requests)
python3 {baseDir}/scripts/tts.py "Text to speak" --voice sophia --speed 1.0 --lang en --out speech.wav
Voices
| Voice | Gender | Accent | Best For |
|---|---|---|---|
| sophia | Female | American | General use (default) |
| robert | Male | American | Professional, reports (default) |
| advika | Female | Indian | Hindi content, code-switch |
| vivaan | Male | Indian | Bilingual English/Hindi |
| camilla | Female | Mexican/Latin | Spanish content |
| zara | Female | American | Conversational |
| melody | Female | American | Storytelling, greetings |
| arjun | Male | Indian | English/Hindi bilingual |
| stella | Female | American | Expressive, warm |
80+ more voices available. List all with: {baseDir}/scripts/voices.sh
Options
- --voice <id>: Voice identifier (default: sophia)
- --rate <hz>: Sample rate, one of 8000 | 16000 | 24000 | 44100 (default: 24000)
- --speed <n>: Playback speed 0.5–2.0 (default: 1.0)
- --lang <code>: Language code (default: en). See {baseDir}/references/languages.md
- --out <path>: Output file (default: auto-named media/tts_<timestamp>.wav)
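A caller may want to validate these options before invoking a script. The sketch below builds the argument list from the documented flags and rejects values outside the documented ranges. The function name build_tts_args is hypothetical, and the script path is left to the caller because {baseDir} is a placeholder.

```python
from typing import List, Optional

# Sample rates listed under Options.
ALLOWED_RATES = (8000, 16000, 24000, 44100)


def build_tts_args(
    text: str,
    voice: str = "sophia",
    rate: int = 24000,
    speed: float = 1.0,
    lang: str = "en",
    out: Optional[str] = None,
) -> List[str]:
    """Hypothetical helper: return the arguments that follow the script path."""
    if rate not in ALLOWED_RATES:
        raise ValueError(f"rate must be one of {ALLOWED_RATES}, got {rate}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError(f"speed must be within 0.5-2.0, got {speed}")
    args = [
        text,
        "--voice", voice,
        "--rate", str(rate),
        "--speed", str(speed),
        "--lang", lang,
    ]
    # --out is optional; the scripts auto-name the file when it is omitted.
    if out:
        args += ["--out", out]
    return args
```

The returned list can be appended to the script path and passed to subprocess.run, which avoids shell quoting problems with multilingual text.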
Output
Scripts print MEDIA: <filepath> on success. OpenClaw sends this as an audio attachment.
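If the script output is captured programmatically, the MEDIA: line can be picked out with a short parser. This is a sketch that assumes the marker appears at the start of its own line; extract_media_path is a hypothetical name.

```python
import re
from typing import Optional

# Matches a line such as "MEDIA: media/tts_1700000000.wav".
MEDIA_LINE = re.compile(r"^MEDIA:[ \t]*(\S.*?)[ \t]*$", re.MULTILINE)


def extract_media_path(stdout: str) -> Optional[str]:
    """Hypothetical helper: return the path after the first MEDIA: marker, or None."""
    match = MEDIA_LINE.search(stdout)
    return match.group(1) if match else None
```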
Multilingual
Supports 30+ languages. Pass --lang with ISO code:
{baseDir}/scripts/tts.sh "नमस्ते, कैसे हैं आप?" --voice advika --lang hi
{baseDir}/scripts/tts.sh "Bonjour le monde" --voice sophia --lang fr
{baseDir}/scripts/tts.sh "Hola, buenos días" --voice camilla --lang es
Code-switching (mixing languages) works automatically; no extra flag is needed beyond --lang:
{baseDir}/scripts/tts.sh "Hey, मुझे meeting remind कर दो" --voice advika --lang hi
Speech-to-Text
Transcribe audio files using the Pulse model. Supports WAV, MP3, OGG, FLAC.
Shell
{baseDir}/scripts/stt.sh /path/to/audio.wav
{baseDir}/scripts/stt.sh /path/to/audio.wav --diarize --timestamps --emotions
Python
python3 {baseDir}/scripts/stt.py /path/to/audio.wav --diarize --timestamps --lang en
Options
- --lang <code>: Language (default: en)
- --diarize: Identify different speakers
- --timestamps: Word-level timing
- --emotions: Detect emotional tone
Output
Returns JSON with a transcription field. With --diarize, the output also includes speaker labels per word.
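Reading the result could look like the sketch below. It relies only on the transcription field named above; the layout of the per-word speaker labels is not documented here, so the sketch does not touch them. The name read_transcription is hypothetical.

```python
import json


def read_transcription(raw: str) -> str:
    """Hypothetical helper: parse STT script output and return the transcript text."""
    payload = json.loads(raw)
    # Only the "transcription" field is documented; fail loudly if it is absent.
    if not isinstance(payload, dict) or "transcription" not in payload:
        raise ValueError("no 'transcription' field in STT output")
    return str(payload["transcription"]).strip()
```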
When to Use
Trigger this skill when the user:
- Asks to "say", "speak", "read aloud", or "generate speech/audio"
- Wants a "voice message", "voice note", or "audio file"
- Asks to "transcribe", "convert speech/audio to text"
- Mentions "Smallest AI", "Lightning TTS", or "Pulse STT"
- Needs fast or low-latency speech generation
- Wants Hindi, Spanish, multilingual, or code-switched voice output
- Asks to compare TTS providers or benchmark latency
Error Handling
- Missing API key → tell user to set SMALLEST_API_KEY
- HTTP 401 → invalid or expired API key
- HTTP 429 → rate limited, wait and retry
- HTTP 400 → check text length (max ~5000 chars per request). Split long text into chunks.
- Empty audio → verify voice_id is valid
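The cases above can be collected into one lookup so a caller reports them consistently. This is an illustrative sketch; diagnose and its parameters are hypothetical names, and the messages paraphrase the list above.

```python
from typing import Optional

# Status codes and hints taken from the error list above.
STATUS_HINTS = {
    400: "Bad request: check text length (max ~5000 chars) and split long text into chunks.",
    401: "Invalid or expired API key.",
    429: "Rate limited: wait and retry.",
}


def diagnose(status: int, has_api_key: bool = True, audio_bytes: int = 1) -> Optional[str]:
    """Hypothetical helper: return a user-facing hint, or None if nothing looks wrong."""
    if not has_api_key:
        return "Set SMALLEST_API_KEY in your environment."
    if status in STATUS_HINTS:
        return STATUS_HINTS[status]
    # A successful response with no audio usually points at the voice id.
    if 200 <= status < 300 and audio_bytes == 0:
        return "Empty audio: verify that the voice id is valid."
    if status >= 400:
        return f"Unexpected HTTP status {status}."
    return None
```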
Limits
- Max text per request: ~5000 characters
- For longer text: split into sentences, synthesize each, concatenate with sox or ffmpeg
- Free tier: 30 minutes/month of TTS
- Basic ($5/mo): 3 hours of TTS + 1 voice clone
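Splitting longer text to fit the ~5000 character limit can be sketched as follows. The helper chunk_text is a hypothetical name, and the sentence-boundary regex is a simple assumption (period, question mark, exclamation mark, or the Devanagari danda followed by whitespace) rather than a full sentence tokenizer.

```python
import re
from typing import List

MAX_CHARS = 5000  # approximate per-request limit stated above


def chunk_text(text: str, limit: int = MAX_CHARS) -> List[str]:
    """Hypothetical helper: pack whole sentences into chunks of at most `limit` characters."""
    sentences = re.split(r"(?<=[.!?।])\s+", text.strip())
    chunks: List[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is hard-split.
        while len(sentence) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:limit])
            sentence = sentence[limit:]
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # The sentence does not fit; close the current chunk and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized separately and the resulting files joined with sox or ffmpeg.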