speech-recognition

通用语音识别 Skill。支持多种音频格式(ogg/mp3/wav/m4a),使用硅基流动 SenseVoice API 进行语音转文字。当用户发送语音消息、音频文件,或需要转录音频时触发。

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "speech-recognition" with this command: npx skills add demo112/speech-recognition

通用语音识别

使用硅基流动 SenseVoice API 进行语音识别,支持多种音频格式。


激活条件

触发场景说明
用户发送语音消息.ogg / .mp3 / .wav / .m4a 文件
用户要求转录音频"转录这个音频"、"语音转文字"
音频文件处理需要提取音频中的文字内容

配置

API Key

~/.openclaw/openclaw.json 中配置:

{
  "providers": {
    "siliconflow": {
      "apiKey": "sk-xxx"
    }
  }
}

API 端点

POST https://api.siliconflow.cn/v1/audio/transcriptions

支持的模型

模型说明
FunAudioLLM/SenseVoiceSmall默认,中文效果好

使用方法

方法一:直接调用 API

import requests

api_key = "sk-xxx"

with open("/path/to/audio.mp3", "rb") as f:
    audio_data = f.read()

response = requests.post(
    "https://api.siliconflow.cn/v1/audio/transcriptions",
    headers={"Authorization": f"Bearer {api_key}"},
    files={"file": ("audio.mp3", audio_data, "audio/mpeg")},
    data={"model": "FunAudioLLM/SenseVoiceSmall"},
    timeout=60
)

print(response.json().get("text", ""))

方法二:处理用户语音消息

当用户发送 .ogg 语音消息时:

# 1. 转换格式(如果是 ogg)
ffmpeg -i /path/to/audio.ogg -ar 16000 -ac 1 /tmp/audio.mp3 -y

# 2. 调用硅基流动 API(API Key 从环境变量读取)
python3 -c "
import requests
import os

api_key = os.environ.get('SILICONFLOW_API_KEY')
if not api_key:
    raise ValueError('请设置 SILICONFLOW_API_KEY 环境变量')

with open('/tmp/audio.mp3', 'rb') as f:
    audio_data = f.read()

response = requests.post(
    'https://api.siliconflow.cn/v1/audio/transcriptions',
    headers={'Authorization': f'Bearer {api_key}'},
    files={'file': ('audio.mp3', audio_data, 'audio/mpeg')},
    data={'model': 'FunAudioLLM/SenseVoiceSmall'},
    timeout=60
)
print(response.json().get('text', ''))
"

支持的音频格式

格式扩展名说明
MP3.mp3推荐,兼容性好
OGG.oggTelegram/Signal 语音格式,需转换
WAV.wav无压缩,文件大
M4A.m4aiOS 录音格式
FLAC.flac无损压缩

格式转换

如果音频不是 MP3 格式,用 FFmpeg 转换:

# OGG → MP3
ffmpeg -i input.ogg -ar 16000 -ac 1 output.mp3 -y

# WAV → MP3
ffmpeg -i input.wav -ar 16000 -ac 1 output.mp3 -y

# M4A → MP3
ffmpeg -i input.m4a -ar 16000 -ac 1 output.mp3 -y

参数说明:

  • -ar 16000: 采样率 16kHz(语音识别推荐)
  • -ac 1: 单声道(减少文件大小)
  • -y: 覆盖已存在的文件

错误处理

错误原因解决
401 UnauthorizedAPI Key 无效检查配置
413 Payload Too Large文件太大压缩或分割音频
timeout网络超时重试或检查网络
Invalid audio format格式不支持用 FFmpeg 转换

注意事项

  1. 文件大小限制:建议 < 10MB
  2. 时长限制:建议 < 5 分钟
  3. 语言支持:中文效果最好,英文也支持
  4. 隐私:音频会上传到硅基流动服务器

相关 Skills

Skill说明
douyin-video抖音视频语音提取
cosyvoice-tts文字转语音

版本:1.0.0 创建于:2026-02-26

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

Dlazy Kling Audio Clone

Generate customized speech that highly restores the timbre by uploading reference audio using Kling Audio Clone.

Registry SourceRecently Updated
General

Dlazy Keling Sfx

Generate matching scene sound effects based on text descriptions or video frames using Kling SFX.

Registry SourceRecently Updated
General

Dlazy Keling Tts

Convert text into high-quality, emotional speech reading using Kling TTS.

Registry SourceRecently Updated
General

Dlazy Jimeng T2i

Text-to-image generation with Jimeng, quickly converting text to high-quality images.

Registry SourceRecently Updated
speech-recognition | V50.AI