local-video-understanding

Local video comprehension skill. Use ffmpeg to extract audio and frames, FunASR for speech recognition, and qwen3-vl for image understanding.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.


Install skill "local-video-understanding" with this command: npx skills add tomuiv/local-video-understanding

⚠️ If you are human, please read README.md first!

Local Video Understanding

Use this skill when you need to understand the content of a video.

Prerequisites

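FunASR conda environment (asr-local) must exist; Step 3 invokes it with conda run -n asr-local, so it does not need to be activated first
Ollama must be running with the qwen3-vl:8b model available (e.g., pulled with ollama pull qwen3-vl:8b)
ffmpeg must be on PATH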
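The prerequisites can be checked before starting. Below is a minimal Python sketch. It assumes Ollama listens on its default address (http://localhost:11434) and exposes its /api/tags model-listing endpoint. It does not verify that the asr-local environment itself exists. All function names are hypothetical.

```python
import json
import shutil
import urllib.request

# Assumption: Ollama listens on its default local address and port.
OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def missing_tools(tools=("ffmpeg", "conda", "ollama"), which=shutil.which):
    """Return the command-line tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def has_model(tags: dict, model: str = "qwen3-vl:8b") -> bool:
    """Check a parsed /api/tags response for a model with exactly this name."""
    return any(m.get("name") == model for m in tags.get("models", []))


def ollama_has_model(model: str = "qwen3-vl:8b") -> bool:
    """Ask the running Ollama server; raises urllib.error.URLError if it is down."""
    with urllib.request.urlopen(OLLAMA_TAGS_URL, timeout=5) as resp:
        return has_model(json.load(resp), model)
```

Call missing_tools() and ollama_has_model() before Step 1. An empty list and True mean ffmpeg, conda, and ollama were found and the vision model is present.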

Workflow

Step 1: Extract Audio

ffmpeg -i "video.mp4" -vn -acodec pcm_s16le -ar 16000 -ac 1 "audio.wav" -y
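Step 1 can also be driven from Python. This sketch builds the same ffmpeg arguments as a list and passes it to subprocess.run. That bypasses the shell, so spaces in file names need no extra quoting. The 16 kHz mono PCM output matches the "16k" model named in Step 3. The helper names are hypothetical, and -y is moved to the front as a global option, which does not change the command's behavior.

```python
import subprocess


def audio_extract_cmd(video: str, wav: str) -> list:
    """Build the Step 1 ffmpeg command as an argument list."""
    return [
        "ffmpeg", "-y",          # overwrite the output file without prompting
        "-i", video,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM (WAV)
        "-ar", "16000",          # resample to 16 kHz
        "-ac", "1",              # downmix to mono
        wav,
    ]


def extract_audio(video: str, wav: str) -> None:
    # A list argument bypasses the shell, so spaces in paths need no quoting.
    subprocess.run(audio_extract_cmd(video, wav), check=True)
```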

Note: If the path contains Chinese characters, copy audio.wav to a path without them before running ASR.
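The copy in this note can be automated. Below is a hedged sketch. It checks whether the resolved path is pure ASCII, which is a slightly broader test than "contains Chinese characters". If the path is not ASCII, it copies the file into a caller-supplied ASCII-only directory. The target directory is validated as well, because default temp folders on Windows live under the user profile, and the profile name may itself contain Chinese characters. The function name and the fixed audio.wav target name are hypothetical.

```python
import shutil
from pathlib import Path


def ascii_safe_audio(wav: str, ascii_dir: str) -> str:
    """Return an ASCII-only path to the audio file, copying it if necessary."""
    src = Path(wav).resolve()
    if str(src).isascii():
        return str(src)  # path is already safe; no copy needed
    dest_dir = Path(ascii_dir).resolve()
    if not str(dest_dir).isascii():
        raise ValueError(f"target directory is not ASCII-only: {dest_dir}")
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / "audio.wav"  # hypothetical fixed file name
    shutil.copyfile(src, dest)
    return str(dest)
```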

Step 2: Extract Key Frames

mkdir frames
ffmpeg -i "video.mp4" -vf "fps=1/10" -q:v 2 "frames/frame_%03d.jpg" -y
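With fps=1/10, ffmpeg writes roughly one frame every 10 seconds and numbers the files from frame_001.jpg. This is uniform sampling, not codec keyframes. Knowing each frame's approximate time makes it easier to line descriptions up with the transcript later. The minimal sketch below assumes the first output frame falls near 0 s. Actual times depend on the video's start timestamp and on the fps filter's rounding. The helper names are hypothetical.

```python
import re
from pathlib import Path

INTERVAL_S = 10  # matches fps=1/10 in the Step 2 command


def frame_extract_cmd(video: str, out_dir: str = "frames", interval_s: int = INTERVAL_S) -> list:
    """Build the Step 2 ffmpeg command: one JPEG (quality 2) every interval_s seconds."""
    pattern = str(Path(out_dir) / "frame_%03d.jpg")
    return ["ffmpeg", "-y", "-i", video, "-vf", f"fps=1/{interval_s}", "-q:v", "2", pattern]


def frame_time(name: str, interval_s: int = INTERVAL_S) -> int:
    """Approximate time in seconds of frame_NNN.jpg (ffmpeg numbers files from 001)."""
    match = re.fullmatch(r"frame_(\d+)\.jpg", Path(name).name)
    if match is None:
        raise ValueError(f"unexpected frame file name: {name}")
    return (int(match.group(1)) - 1) * interval_s
```

Create the output directory first with Path("frames").mkdir(exist_ok=True). Unlike a bare mkdir, this does not fail when frames already exists.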

Step 3: Speech Recognition (FunASR)

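conda run -n asr-local python -c "
import os
# ModelScope cache for FunASR models; adjust to your own user directory
os.environ['MODELSCOPE_CACHE'] = 'C:/Users/TOM/.cache/modelscope'
from funasr import AutoModel  # imported after the cache path is set
model = AutoModel(
    model='iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch',
    model_revision='v2.0.4',
    disable_update=True,  # skip FunASR's update check
    ncpu=4                # CPU threads used for inference
)
result = model.generate(input='AUDIO_PATH')  # replace AUDIO_PATH with the path to audio.wav
print(result)
"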

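The inline command above can also live in a small script, which avoids multi-line quoting inside the shell. This sketch reuses the same model, revision, and options. It assumes generate() returns a list of dicts that each carry a 'text' key, which is what the printed result typically shows. Treat that shape as an assumption and check it against your FunASR version. The file name asr.py and the helper names are hypothetical.

```python
import os

# Same cache directory as the inline command; adjust for your machine.
# setdefault keeps any MODELSCOPE_CACHE value already set in the environment.
os.environ.setdefault("MODELSCOPE_CACHE", "C:/Users/TOM/.cache/modelscope")

MODEL = "iic/speech_seaco_paraformer_large_asr_nat-zh-cn-16k-common-vocab8404-pytorch"


def transcript_text(result: list) -> str:
    """Join the 'text' fields of a FunASR generate() result.

    Assumes a list of dicts with a 'text' key; empty entries are skipped.
    """
    parts = [item.get("text", "").strip() for item in result if isinstance(item, dict)]
    return "\n".join(p for p in parts if p)


def transcribe(audio_path: str) -> str:
    # Imported lazily so transcript_text can be used without FunASR installed.
    from funasr import AutoModel

    model = AutoModel(model=MODEL, model_revision="v2.0.4", disable_update=True, ncpu=4)
    return transcript_text(model.generate(input=audio_path))
```

Saved as asr.py, it can be run from the same directory with conda run -n asr-local python -c "from asr import transcribe; print(transcribe('audio.wav'))".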
Step 4: Image Understanding (qwen3-vl)

ollama run qwen3-vl:8b "Describe this image in detail: /path/to/frame.jpg"
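Calling ollama run once per frame works. Ollama's HTTP API, however, returns plain JSON, which is easier to collect when looping over the frames directory. This sketch assumes the default endpoint http://localhost:11434/api/generate with streaming disabled. Images are sent base64-encoded in the images field, and the prompt mirrors the CLI example above. The helper names and the 300-second timeout are hypothetical choices; the first call may be slow while the model loads.

```python
import base64
import json
import urllib.request
from pathlib import Path

# Assumption: Ollama's default local endpoint, non-streaming generate call.
OLLAMA_URL = "http://localhost:11434/api/generate"
PROMPT = "Describe this image in detail."


def build_request(image_bytes: bytes, model: str = "qwen3-vl:8b", prompt: str = PROMPT) -> dict:
    """Payload for /api/generate; images are passed as base64 strings."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # return one JSON object instead of a stream of chunks
    }


def describe_frame(path: str, timeout: float = 300) -> str:
    payload = json.dumps(build_request(Path(path).read_bytes())).encode("utf-8")
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]  # the generated description
```

For example: for f in sorted(Path("frames").glob("frame_*.jpg")): print(f.name, describe_frame(str(f))).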

Step 5: Combine Results

Audio transcription → FunASR (local, Chinese speech recognition)
Key frames → qwen3-vl:8b via Ollama (local image understanding)
Summary/Analysis → Cloud LLM API (if needed)
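Before the summary step, the transcript and frame descriptions can be merged into a single prompt. This is a minimal sketch: the prompt wording and layout are assumptions, not part of the skill. frame_notes is keyed by each frame's approximate time in seconds, so the notes come out in order with mm:ss labels.

```python
def build_summary_prompt(transcript: str, frame_notes: dict) -> str:
    """Merge a transcript and {seconds: description} frame notes into one prompt."""
    lines = [
        "Summarize this video using its transcript and key-frame descriptions.",
        "",
        "## Transcript",
        transcript.strip() or "(no speech detected)",
        "",
        "## Key frames",
    ]
    for t in sorted(frame_notes):
        # Label each note with mm:ss so the LLM sees frame order and timing.
        lines.append(f"[{t // 60:02d}:{t % 60:02d}] {frame_notes[t].strip()}")
    return "\n".join(lines)
```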

Important Notes

Reading an image with the Read tool does NOT provide image understanding; always use qwen3-vl
For Chinese audio, FunASR is preferred over Whisper
Check for existing subtitle files (.txt, .srt, .vtt) before running ASR
FunASR models are stored in the ModelScope cache at C:/Users/TOM/.cache/modelscope (set via MODELSCOPE_CACHE in Step 3; change it to a directory on your own machine)
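The subtitle check can be scripted. This sketch assumes subtitles sit next to the video with the same base name (for example talk.mp4 and talk.srt) and tries the extensions in the order the notes list them. It then strips cue numbers and timing lines so the text can stand in for an ASR transcript. It is a crude converter: it also drops lines made only of digits and ignores VTT styling. The names are hypothetical.

```python
import re
from pathlib import Path

SUBTITLE_EXTS = (".txt", ".srt", ".vtt")  # order as listed in the notes
# Cue timing line, e.g. "00:00:01,000 --> 00:00:02,500" (SRT) or "00:01.000 --> ..." (VTT)
TIMING = re.compile(r"^(\d{2}:)?\d{2}:\d{2}[,.]\d{3}\s+-->")


def find_subtitle(video: str):
    """Return the first sibling file with the video's base name and a subtitle extension, else None."""
    base = Path(video)
    for ext in SUBTITLE_EXTS:
        candidate = base.with_suffix(ext)
        if candidate.is_file():
            return candidate
    return None


def subtitle_text(path) -> str:
    """Crude SRT/VTT to plain text: drop headers, blank lines, cue numbers, and timings."""
    keep = []
    for line in Path(path).read_text(encoding="utf-8-sig").splitlines():
        s = line.strip()
        if not s or s.startswith("WEBVTT") or s.isdigit() or TIMING.match(s):
            continue
        keep.append(s)
    return "\n".join(keep)
```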

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

