sensevoice-transcribe

Transcribe audio files (WAV/MP3/M4A/FLAC) to timestamped text using SenseVoice-Small + FSMN-VAD. Supports single-file and batch mode with VAD-anchored per-segment timestamps (~15s granularity). Use when the user wants to transcribe speech/audio, run batch transcription on daylog recordings, or re-transcribe specific dates. Replaces the old whisper-transcribe skill.

Install skill "sensevoice-transcribe" with this command: npx skills add ylongw/sensevoice-transcribe

SenseVoice Transcribe

Transcribe audio to timestamped text using FunASR's iic/SenseVoiceSmall model with fsmn-vad for timestamp anchoring.

Pipeline

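  1. FSMN-VAD segments the audio into speech regions (~258 segments for a 30-minute file).
  2. SenseVoice-Small transcribes the full audio with merge_vad=True.
  3. The raw text is split on <|zh|> tags, and each piece is cleaned with rich_transcription_postprocess().
  4. Text segments are mapped proportionally onto the VAD segments: text i of N takes the start time of VAD segment floor(i × M / N), where M is the number of VAD segments.
  5. Output: one "[HH:MM:SS] text" line per text segment, at ~15s granularity.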
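
The proportional mapping in step 4 can be sketched as a small pure function. This is a minimal illustration that mirrors the index arithmetic of the single-file script in this document; the function name align_texts_to_vad and the sample data are hypothetical. Because the mapping assumes text is spread evenly across the speech regions, the resulting timestamps are approximate.

```python
from typing import List, Tuple


def align_texts_to_vad(texts: List[str], vad_segs: List[List[int]]) -> List[Tuple[int, str]]:
    """Pair each text segment with the start time (ms) of a VAD segment.

    Text i of N is assigned VAD segment floor(i * M / N), clamped to M - 1,
    where M is the number of VAD segments.
    """
    if not texts or not vad_segs:
        return []
    ratio = len(vad_segs) / len(texts)
    aligned = []
    for i, text in enumerate(texts):
        vi = min(int(i * ratio), len(vad_segs) - 1)
        aligned.append((vad_segs[vi][0], text))
    return aligned


# Hypothetical data: six speech regions as [start_ms, end_ms], three text segments.
vad = [[0, 900], [1200, 4000], [5000, 9000], [9500, 14000], [15000, 21000], [22000, 30000]]
pairs = align_texts_to_vad(["a", "b", "c"], vad)
print(pairs)
```

With six regions and three texts the ratio is 2, so the texts land on regions 0, 2, and 4.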

Environment

Venv: ~/.openclaw/venvs/sensevoice/
Python: 3.12
Key packages: funasr==1.3.1, modelscope, onnxruntime
Model cache: ~/.cache/modelscope/hub/models/iic/SenseVoiceSmall
VAD cache: ~/.cache/modelscope/hub/models/iic/speech_fsmn_vad_zh-cn-16k-common-pytorch

First-time Setup

python3 -m venv ~/.openclaw/venvs/sensevoice
source ~/.openclaw/venvs/sensevoice/bin/activate
pip install funasr modelscope onnxruntime
# Models auto-download on first run (~234MB SenseVoice + ~4MB VAD)

Usage

Single File

source ~/.openclaw/venvs/sensevoice/bin/activate
python3 -c "
from funasr import AutoModel
from funasr.utils.postprocess_utils import rich_transcription_postprocess
from datetime import datetime, timedelta
import re

wav = '<WAV_PATH>'
# Parse start time from filename: TX01_MIC015_20260308_124130_orig.wav
m = re.search(r'(\d{8})_(\d{6})', wav)
start_dt = datetime.strptime(m.group(1)+m.group(2), '%Y%m%d%H%M%S') if m else None

vad_model = AutoModel(model='fsmn-vad', disable_update=True)
model = AutoModel(model='iic/SenseVoiceSmall', vad_model='fsmn-vad',
                  vad_kwargs={'max_single_segment_time': 30000}, device='cpu')

vad_segs = vad_model.generate(input=wav)[0].get('value', [])
res = model.generate(input=wav, cache={}, language='zh', use_itn=True,
                     batch_size_s=60, merge_vad=True)

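# Split the raw output on language tags, then strip the remaining rich-text tags
texts = [rich_transcription_postprocess(s).strip()
         for s in re.split(r'<\|zh\|>', res[0]['text']) if s.strip()]
texts = [s for s in texts if len(s) > 1]  # drop empty or single-character fragments

# Map text segment i proportionally onto the list of VAD segments
ratio = len(vad_segs) / len(texts) if texts else 1
for i, t in enumerate(texts):
    if not vad_segs:  # no speech regions detected: print the text without a timestamp
        print(t)
        continue
    vi = min(int(i * ratio), len(vad_segs) - 1)
    offset_ms = vad_segs[vi][0]  # segment start, in ms from the start of the file
    ts = ((start_dt + timedelta(milliseconds=offset_ms)).strftime('%H:%M:%S')
          if start_dt else f'{offset_ms // 1000}s')
    print(f'[{ts}] {t}')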
"

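The filename parsing and timestamp formatting used in the single-file script can be isolated into two helpers, which makes them easier to test without loading any model. This is a sketch under the assumption that filenames follow the TX01_MIC015_20260308_124130_orig.wav pattern shown above; parse_start and format_ts are hypothetical names, not part of the skill.

```python
import re
from datetime import datetime, timedelta
from typing import Optional


def parse_start(path: str) -> Optional[datetime]:
    """Extract the recording start time from a YYYYMMDD_HHMMSS stamp in the filename."""
    m = re.search(r'(\d{8})_(\d{6})', path)
    if not m:
        return None
    try:
        return datetime.strptime(m.group(1) + m.group(2), '%Y%m%d%H%M%S')
    except ValueError:
        # Digits matched the pattern but do not form a valid date or time.
        return None


def format_ts(start: Optional[datetime], offset_ms: int) -> str:
    """Wall-clock HH:MM:SS if the start time is known, else seconds from file start."""
    if start is None:
        return f'{offset_ms // 1000}s'
    return (start + timedelta(milliseconds=offset_ms)).strftime('%H:%M:%S')


start = parse_start('TX01_MIC015_20260308_124130_orig.wav')
print(format_ts(start, 6000))
print(format_ts(None, 6500))
```

Files without a recognizable stamp fall back to offsets such as 6s, matching the fallback branch of the script.
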
Batch Mode (daylog)

The bundled scripts/batch_transcribe.py handles the full daylog pipeline:

source ~/.openclaw/venvs/sensevoice/bin/activate
cd ~/Documents/dec/daylog

# Dry run — see what would be transcribed
python3 scripts/batch_transcribe.py --dry-run

# Transcribe all new files
python3 scripts/batch_transcribe.py

# Re-transcribe specific dates (deletes existing, then re-runs)
python3 scripts/batch_transcribe.py --force-dates 2026-03-07,2026-03-08

# With progress file + Discord webhook
python3 scripts/batch_transcribe.py \
  --progress-file /tmp/daylog-progress.json \
  --discord-webhook https://discord.com/api/webhooks/...

Flags:

Flag                            Description
--dry-run                       Preview without writing
--engine sensevoice|whisper     Engine (default: sensevoice)
--force-dates YYYY-MM-DD,...    Delete & re-transcribe these dates
--progress-file PATH            Write JSON progress for monitoring
--discord-webhook URL           Post start/milestone/finish to Discord
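
For readers who want to see how such a command line could be declared, here is a hypothetical argparse reconstruction of the flags listed above. It is not the code of the bundled scripts/batch_transcribe.py, which may parse its arguments differently; the date validation in parse_dates is an added assumption.

```python
import argparse
from datetime import datetime
from typing import List


def parse_dates(value: str) -> List[str]:
    """Split a comma-separated list and validate each item as YYYY-MM-DD."""
    dates = [d.strip() for d in value.split(",") if d.strip()]
    for d in dates:
        try:
            datetime.strptime(d, "%Y-%m-%d")
        except ValueError:
            raise argparse.ArgumentTypeError(f"not a YYYY-MM-DD date: {d!r}")
    return dates


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="batch_transcribe.py")
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview without writing")
    parser.add_argument("--engine", choices=["sensevoice", "whisper"],
                        default="sensevoice", help="Transcription engine")
    parser.add_argument("--force-dates", type=parse_dates, default=[],
                        metavar="YYYY-MM-DD,...",
                        help="Delete and re-transcribe these dates")
    parser.add_argument("--progress-file", metavar="PATH",
                        help="Write JSON progress for monitoring")
    parser.add_argument("--discord-webhook", metavar="URL",
                        help="Post start/milestone/finish to Discord")
    return parser


args = build_parser().parse_args(["--dry-run", "--force-dates", "2026-03-07,2026-03-08"])
print(args.dry_run, args.engine, args.force_dates)
```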

Directory layout:

daylog/
├── raw/                          # WAV input (DJI MIC 3, 48kHz/32bit, ~247MB/30min)
│   ├── TX01_MIC009_20260308_094129_orig.wav
│   └── ...
├── transcripts/                  # Output, grouped by date
│   └── 2026-03-08/
│       ├── 000_TX01_MIC009_20260308_094129_orig.txt
│       └── ...
└── notes/                        # Compiled daily notes (separate step)
    └── 2026-03-08.md
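
The mapping from raw/ filenames to transcripts/ paths implied by this layout can be sketched as follows. This is an illustration only: plan_outputs is a hypothetical helper, the second and third sample filenames are invented, and the bundled script may derive paths differently.

```python
import re
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List

# Matches the _YYYYMMDD_HHMMSS_ stamp in names like TX01_MIC009_20260308_094129_orig.wav
STAMP = re.compile(r'_(\d{8})_(\d{6})_')


def plan_outputs(wav_names: Iterable[str], out_root: str = "transcripts") -> Dict[str, List[Path]]:
    """Group recordings by date, sort by time of day, and assign indexed output paths."""
    by_date = defaultdict(list)
    for name in wav_names:
        m = STAMP.search(name)
        if not m:
            continue  # skip files without a recognizable timestamp
        ymd, hms = m.groups()
        date = f"{ymd[:4]}-{ymd[4:6]}-{ymd[6:]}"
        by_date[date].append((hms, name))

    plan = {}
    for date, items in sorted(by_date.items()):
        items.sort()  # chronological order within the day
        plan[date] = [
            Path(out_root) / date / f"{i:03d}_{Path(name).stem}.txt"
            for i, (_, name) in enumerate(items)
        ]
    return plan


names = [
    "TX01_MIC010_20260308_101500_orig.wav",  # invented sample
    "TX01_MIC009_20260308_094129_orig.wav",
    "TX01_MIC002_20260307_180000_orig.wav",  # invented sample
]
plan = plan_outputs(names)
for date, paths in plan.items():
    print(date, [p.as_posix() for p in paths])
```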

Behavior:

  • Groups WAV files by date extracted from filename (YYYYMMDD)
  • Sorts by timestamp within each date for correct chronological order
  • Skips already-transcribed files unless their date is listed in --force-dates
  • Indexed output filenames (000_, 001_, ...) for sort order
  • Discord milestones every 25% progress
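
One way to decide when a 25% milestone has been reached is to compare the progress bucket before and after each file completes. The sketch below is an assumption about how this could be done, not the bundled script's logic; milestones_crossed is a hypothetical name, and the actual Discord post is left out.

```python
from typing import List


def milestones_crossed(done_before: int, done_after: int, total: int, step: int = 25) -> List[int]:
    """Return the percentage milestones passed when progress moves from
    done_before to done_after completed files out of total."""
    if total <= 0:
        return []
    before = (100 * done_before // total) // step
    after = (100 * done_after // total) // step
    return [k * step for k in range(before + 1, after + 1)]


# Walk through a hypothetical batch of 10 files and report each milestone once.
total = 10
reported = []
for done in range(1, total + 1):
    reported.extend(milestones_crossed(done - 1, done, total))
print(reported)
```

With few files a single completion can cross more than one milestone (for example, the last of three files crosses both 75% and 100%), so the function returns a list rather than a single value.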

Output Format

[录音开始: 09:41:29]
[09:41:35] 到了,我们下车吧。
[09:41:48] 武康大楼,人好多啊。
[09:42:04] 你帮我在这里拍一张。
...
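
Downstream tools that consume this format need to separate the timestamp from the text and skip the header line ([录音开始: ...], "recording start"). A minimal parsing sketch, assuming every transcript line has the [HH:MM:SS] prefix shown above; parse_transcript is a hypothetical helper.

```python
import re
from typing import Iterator, Tuple

# A transcript line: [HH:MM:SS] followed by the text.
LINE = re.compile(r'^\[(\d{2}):(\d{2}):(\d{2})\]\s*(.*)$')


def parse_transcript(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (seconds since midnight, text) for each timestamped line.

    Lines that do not start with [HH:MM:SS], such as the header, are skipped.
    """
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            h, mi, s, body = m.groups()
            yield int(h) * 3600 + int(mi) * 60 + int(s), body


sample = """[录音开始: 09:41:29]
[09:41:35] 到了,我们下车吧。
[09:41:48] 武康大楼,人好多啊。
[09:42:04] 你帮我在这里拍一张。
"""
entries = list(parse_transcript(sample))
for seconds, body in entries:
    print(seconds, body)
```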

Performance (Apple M4, 10-core CPU)

Metric            Value
RTF               ~0.04 (25x realtime)
CPU               ~1.2 cores (12%)
RAM               ~1.5GB
30min WAV         ~73s transcription + ~4s VAD
Accuracy          92% keyword accuracy (vs Whisper-medium 23%, turbo 38%)
Hallucinations    0 (vs Whisper hundreds per session)
Model size        234MB (vs Whisper-large-v3-turbo 1.5GB)
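
The RTF figure follows directly from the timings for a 30-minute file: real-time factor is processing time divided by audio duration, and its reciprocal is the speed-up over real time. The short check below uses only the numbers reported above.

```python
def real_time_factor(processing_s: float, audio_s: float) -> float:
    """RTF = processing time / audio duration; lower is faster."""
    return processing_s / audio_s


audio_s = 30 * 60                                # 30-minute recording
rtf_asr = real_time_factor(73, audio_s)          # transcription only
rtf_total = real_time_factor(73 + 4, audio_s)    # transcription plus the separate VAD pass

print(f"RTF (transcription only): {rtf_asr:.3f}, speed-up: {1 / rtf_asr:.1f}x")
print(f"RTF (transcription + VAD): {rtf_total:.3f}, speed-up: {1 / rtf_total:.1f}x")
```

Both values round to the ~0.04 quoted in the table.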

vs Old Whisper Skill

                   Whisper (old)           SenseVoice (new)
Model              mlx-whisper-medium      SenseVoice-Small (234MB)
Accuracy           23-38%                  92%
Hallucinations     Hundreds per session    0
Timestamp          Per-word (~2-4s)        VAD-anchored (~15s)
Duplicate lines    ~23%                    <0.2%
Chinese support    Weak                    Native (Mandarin-optimized)

Emoji Note

SenseVoice appends emotion tags (😊😔😡😮) to segments. These are model artifacts that reflect the detected speech emotion; they are not part of what was spoken. Downstream consumers (such as note compilation) should ignore or strip them.
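
A downstream consumer could strip these tags with a simple character filter. This sketch covers only the four emoji listed in this note; the model may emit others, so the set is an assumption to extend as needed, and strip_emotion_tags is a hypothetical helper.

```python
# Emotion emoji named in this note; extend the string if other tags show up in practice.
EMOTION_EMOJI = "😊😔😡😮"
_STRIP_TABLE = {ord(ch): None for ch in EMOTION_EMOJI}


def strip_emotion_tags(text: str) -> str:
    """Remove emotion emoji from a transcript segment and trim surrounding whitespace."""
    return text.translate(_STRIP_TABLE).strip()


cleaned = strip_emotion_tags("武康大楼,人好多啊。😊")
print(cleaned)
```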
