Audio Transcription with Sber Salute Speech
Transcribe audio/video files to text with timestamps via the Salute Speech async REST API.
Requirements
- API Key: Environment variable `SALUTE_AUTH_DATA` must be set (Base64-encoded `client_id:client_secret` or raw authorization key from https://developers.sber.ru/studio/).
- SSL note: The script disables SSL verification by default (`verify_ssl=False`) because Sber's certificate chain is non-standard. This is expected.
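As an illustration of the API key requirement, here is a minimal sketch of a fail-fast check that could run before any network call. The helper name `read_auth_data` is hypothetical and is not taken from `salute_transcribe.py`; only the variable name `SALUTE_AUTH_DATA` comes from this document.

```python
import os


def read_auth_data(env=os.environ):
    """Return the authorization key from SALUTE_AUTH_DATA or raise a clear error.

    Hypothetical helper: the real script may read the variable differently.
    """
    value = env.get("SALUTE_AUTH_DATA", "").strip()
    if not value:
        raise RuntimeError(
            "SALUTE_AUTH_DATA is not set; export the authorization key first"
        )
    return value
```

Passing the environment as a parameter keeps the check easy to exercise without touching the real process environment.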
Supported formats & encodings
| Audio encoding | Content-Type | Typical extensions |
|---|---|---|
| MP3 | audio/mpeg | .mp3 |
| PCM_S16LE | audio/wav | .wav |
| OPUS | audio/ogg | .ogg, .opus |
| FLAC | audio/flac | .flac |
| ALAW | audio/alaw | .alaw |
| MULAW | audio/mulaw | .mulaw |
Supported languages
ru-RU, en-US, kk-KZ (Kazakh), ky-KG (Kyrgyz), uz-UZ (Uzbek).
Workflow
- Identify input files: take them from the user request.
- Read the API key from the host environment.
- Run transcription: execute `salute_transcribe.py` with `uv` and the appropriate arguments.
- Deliver results: present a human-readable transcript with timestamps to the user and give a direct link to the output files.
Usage
uv run --with requests {baseDir}/salute_transcribe.py \
--file /path/to/audio.mp3 \
--output_dir ~/.openclaw/workspace/transcriptions \
--lang ru-RU
Arguments
| Argument | Required | Default | Description |
|---|---|---|---|
| --file | Yes | - | Path to audio/video file |
| --output_dir | No | ~/.openclaw/workspace/transcribations | Output directory for results |
| --lang | No | ru-RU | Language code: ru-RU, en-US, kk-KZ, ky-KG, uz-UZ |
| --audio-encoding | No | MP3 | Codec: MP3, PCM_S16LE, OPUS, FLAC, ALAW, MULAW |
| --model | No | general | Recognition model: general or callcenter |
| --hyp-count | No | 1 | Number of alternative hypotheses: 1 or 2 |
| --max-wait-time | No | 300 | Max seconds to wait for async result |
| --print | No | off | Also print transcription to stdout |
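When the script is launched from another program rather than a shell, the arguments above can be assembled into an argv list. The sketch below is one possible way to do that; `build_command` is a hypothetical helper, and it assumes `--print` is a bare on/off switch, which the table suggests but does not state outright.

```python
def build_command(script, audio, output_dir=None, lang="ru-RU",
                  audio_encoding=None, max_wait_time=None, print_result=False):
    """Assemble the argv list for `uv run --with requests <script> ...`.

    Optional flags are only added when a value is given, so the script's
    own defaults apply otherwise.
    """
    cmd = ["uv", "run", "--with", "requests", str(script),
           "--file", str(audio), "--lang", lang]
    if output_dir is not None:
        cmd += ["--output_dir", str(output_dir)]
    if audio_encoding is not None:
        cmd += ["--audio-encoding", audio_encoding]
    if max_wait_time is not None:
        cmd += ["--max-wait-time", str(max_wait_time)]
    if print_result:
        cmd.append("--print")  # assumption: flag takes no value
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`; using a list instead of a shell string avoids quoting problems with paths that contain spaces.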
Content-Type mapping
The script currently defaults `content_type` to `audio/mpeg` (MP3). For any other input format, adjust `content_type` in the script or add mapping logic: for example, use `audio/wav` for `.wav` files.
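One way to add such mapping logic is a lookup keyed by file extension, built from the formats table in this document. This is a sketch only: `guess_format` and `FORMATS` are hypothetical names, and the extension alone does not prove the actual codec (a `.wav` container can hold data other than PCM_S16LE).

```python
from pathlib import Path

# Extension -> (audio encoding, Content-Type), taken from the formats table.
FORMATS = {
    ".mp3": ("MP3", "audio/mpeg"),
    ".wav": ("PCM_S16LE", "audio/wav"),
    ".ogg": ("OPUS", "audio/ogg"),
    ".opus": ("OPUS", "audio/ogg"),
    ".flac": ("FLAC", "audio/flac"),
    ".alaw": ("ALAW", "audio/alaw"),
    ".mulaw": ("MULAW", "audio/mulaw"),
}


def guess_format(path, default=("MP3", "audio/mpeg")):
    """Return (encoding, content_type) for a file, falling back to MP3."""
    return FORMATS.get(Path(path).suffix.lower(), default)
```

The first element of the result would go to `--audio-encoding` and the second to `content_type`; unknown extensions fall back to the script's current MP3 default.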
Output files
For input file meetingABC.mp3 the script produces:
| File | Description |
|---|---|
| meetingABC_recognition_orig.json | Raw API response (full JSON with all hypotheses, timing, confidence) |
| meetingABC_pretty.txt | Formatted human-readable transcript with timestamps |
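To locate the results for a given input, the naming scheme above can be reproduced from the input file's stem. The helper `output_paths` is hypothetical; it only mirrors the two file names listed in the table and assumes both land directly in the output directory.

```python
from pathlib import Path


def output_paths(audio, output_dir):
    """Return (raw_json_path, pretty_txt_path) for an input audio file."""
    stem = Path(audio).stem  # "meetingABC.mp3" -> "meetingABC"
    out = Path(output_dir)
    return (out / f"{stem}_recognition_orig.json",
            out / f"{stem}_pretty.txt")
```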
Output text format
[00:01 - 00:20]:
Ну, даже если сосредоточиться на идее узкой щели.
[00:20 - 00:45]:
Следующий фрагмент текста здесь.
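The layout above can be produced from a list of timed segments. The sketch below assumes each segment is a `(start_seconds, end_seconds, text)` tuple and that timestamps are truncated to whole seconds and shown as `MM:SS`; how the real script handles recordings longer than an hour is not stated here, so minutes are simply allowed to exceed 59.

```python
def fmt_ts(seconds):
    """Format a time offset in seconds as MM:SS (fractions are truncated)."""
    total = int(seconds)
    return f"{total // 60:02d}:{total % 60:02d}"


def format_segments(segments):
    """Render (start, end, text) tuples in the '[MM:SS - MM:SS]:' layout."""
    lines = []
    for start, end, text in segments:
        lines.append(f"[{fmt_ts(start)} - {fmt_ts(end)}]:")
        lines.append(text)
    return "\n".join(lines)
```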
Notes
- Token is valid for ~30 minutes; the script fetches a new one each run.
- Large files (>1 hour) may need `--max-wait-time` increased beyond 300 s.
- The `callcenter` model is optimized for telephony audio (8 kHz, mono).
- Profanity filter is disabled by default (`enable_profanity_filter=False`).
- The script uses normalized text by default (numbers as digits, abbreviations expanded). Raw text is also available in the JSON output.
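The role of `--max-wait-time` can be illustrated with a generic polling loop that gives up after a deadline. This is a sketch under assumptions, not the script's actual code: `wait_for_result` and `check_status` are hypothetical names, and the status strings `DONE`, `ERROR`, and `CANCELED` are assumed values for the async task state.

```python
import time


def wait_for_result(check_status, max_wait_time=300, interval=2.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll check_status() until the task finishes or the deadline passes.

    check_status is any callable returning the current task status string.
    """
    deadline = clock() + max_wait_time
    while True:
        status = check_status()
        if status == "DONE":
            return status
        if status in ("ERROR", "CANCELED"):
            raise RuntimeError(f"recognition task ended with status {status}")
        if clock() >= deadline:
            raise TimeoutError(
                f"no result after {max_wait_time} s; consider a larger "
                "--max-wait-time for long recordings"
            )
        sleep(interval)
```

Injecting `sleep` and `clock` keeps the loop testable without real delays; in normal use the defaults apply.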