Video Captions

Generate professional captions and subtitles with multi-engine transcription, word-level timing, styling presets, and burn-in.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "Video Captions" with this command: npx skills add ivangdavila/video-captions

When to Use

User needs captions or subtitles for video content. Agent handles transcription, timing, formatting, styling, translation, and burn-in across all major formats and platforms.

Quick Reference

Topic                   File
Transcription engines   engines.md
Output formats          formats.md
Styling presets         styling.md
Platform requirements   platforms.md

Core Rules

1. Engine Selection by Context

Scenario                Engine                Why
Default (recommended)   Whisper local         100% offline, no data leaves machine
Apple Silicon           MLX Whisper           Native acceleration, still local
Word timestamps         whisper-timestamped   DTW alignment, still local

Default: Whisper local (turbo model). See engines.md for optional cloud alternatives.
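
If the engines are not installed, a minimal setup sketch follows (PyPI package names; the mlx_whisper invocation is an assumption based on its Whisper-style CLI, so confirm flags with mlx_whisper --help):

# Install the local engines (pick what you need)
pip install -U openai-whisper        # default engine
pip install mlx-whisper              # Apple Silicon (MLX) build
pip install whisper-timestamped      # DTW word-level alignment

# Apple Silicon example; the model repo name is an assumption (see mlx-community on Hugging Face)
mlx_whisper video.mp4 --model mlx-community/whisper-turbo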

2. Format Selection by Platform

Platform              Format          Notes
YouTube               VTT or SRT      VTT preferred
Netflix/Pro           TTML            Strict timing rules
Social (TikTok, IG)   Burn-in (ASS)   Embedded in video
General               SRT             Universal compatibility
Karaoke/effects       ASS             Advanced styling

Ask user's target platform if not specified.

3. Professional Timing Standards

Netflix-compliant (default):

  • Min duration: 5/6 second (0.833s)
  • Max duration: 7 seconds
  • Max chars/line: 42
  • Max lines: 2
  • Gap between subtitles: 2+ frames
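
Recent openai-whisper releases can enforce these line limits at transcription time; the flags below require word timestamps:

# Netflix-style line limits applied during transcription
whisper video.mp4 --model turbo --word_timestamps True --max_line_width 42 --max_line_count 2 --output_format srt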

Social media:

  • Shorter segments (2-4 words)
  • More frequent breaks
  • Centered or dynamic positioning

4. Segmentation Rules

Break lines:

  • After punctuation marks
  • Before conjunctions (and, but, or)
  • Before prepositions

Never separate:

  • Article from noun
  • Adjective from noun
  • First name from last name
  • Verb from subject pronoun
  • Auxiliary from verb

5. Word-Level Timestamps

Use word timestamps for:

  • Karaoke-style highlighting
  • Precise sync verification
  • TikTok/Instagram animated captions
  • Quality checking transcript accuracy

Enable with the --word_timestamps True flag (the openai-whisper CLI uses underscores in flag names).
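
For example, the JSON output carries per-word start and end times when the flag is set:

# Word timings land in the "words" arrays of the JSON output
whisper video.mp4 --model turbo --word_timestamps True --output_format json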

6. Speaker Identification

For multi-speaker content:

  • Use diarization (pyannote local, or cloud APIs if configured)
  • Format: [Speaker 1] or [Name] if known
  • SDH format: JOHN: What do you think?

7. Quality Verification

Before delivering:

  • Check sync at start, middle, end
  • Verify character limits per line
  • Confirm speaker labels if multi-speaker
  • Test burn-in render quality
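
To spot-check sync without a full render, ffplay can apply the same subtitles filter live (assuming an FFmpeg build with libass):

# Preview subtitles in real time; seek around to check start, middle, end
ffplay -vf "subtitles=video.srt" video.mp4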

Workflow

Basic Transcription

# Auto-detect language, output SRT
whisper video.mp4 --model turbo --output_format srt

# Specify language
whisper video.mp4 --model turbo --language es --output_format srt

# Multiple formats
whisper video.mp4 --model turbo --output_format all

Word-Level Timestamps

# Using whisper-timestamped
whisper_timestamped video.mp4 --model large-v3 --output_format srt

# With VAD pre-processing (reduces hallucinations)
whisper_timestamped video.mp4 --vad silero --accurate

Styled Subtitles (ASS)

# Burn in an SRT with ASS styling applied on the fly via force_style
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Arial,FontSize=24,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2,Shadow=1,Alignment=2'" output.mp4

Burn-In for Social Media

# TikTok/Instagram style (bold, middle-centered; V4+ Alignment uses numpad values, so 5 = middle center)
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Montserrat-Bold,FontSize=32,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=3,Shadow=0,Alignment=5'" output.mp4

# Netflix style (bottom, clean)
ffmpeg -i video.mp4 -vf "subtitles=video.srt:force_style='FontName=Netflix Sans,FontSize=48,PrimaryColour=&HFFFFFF,OutlineColour=&H000000,Outline=2,Shadow=1,Alignment=2'" output.mp4

Translation

# Transcribe + translate to English
whisper video.mp4 --model turbo --task translate --output_format srt

Format Conversion

# SRT to VTT
ffmpeg -i video.srt video.vtt

# SRT to ASS (for styling)
ffmpeg -i video.srt video.ass
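
Recent FFmpeg builds also ship a TTML muxer, which covers the Netflix/professional workflow below; since it is a newer addition, verify availability first:

# Check for TTML support, then convert
ffmpeg -muxers | grep ttml
ffmpeg -i video.srt output.ttml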

Caption Traps

  • Hallucinations on silence → Use VAD pre-processing or trim silent sections
  • Wrong language detection → Specify --language explicitly for mixed content
  • Timing drift in long videos → Use word timestamps + manual spot-check
  • Character limit violations → Set --max_line_width 42 (requires --word_timestamps True) for Netflix compliance
  • Missing speaker IDs → Enable diarization for multi-speaker content
  • Burn-in quality loss → Use a high-bitrate output (e.g. -b:v 8M; see the sketch below)
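
For the last trap, a sketch of a higher-bitrate burn-in (the encoder and bitrate are reasonable defaults, not requirements):

# Burn in with an explicit encoder and bitrate to limit quality loss
ffmpeg -i video.mp4 -vf "subtitles=video.ass" -c:v libx264 -b:v 8M -c:a copy output.mp4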

Common Scenarios

YouTube Video

  1. Transcribe: whisper video.mp4 --output_format vtt
  2. Upload .vtt to YouTube Studio
  3. Review auto-sync suggestions

TikTok/Instagram Reel

  1. Transcribe with word timestamps
  2. Apply bold animated style
  3. Burn-in: ffmpeg -i video.mp4 -vf "subtitles=video.ass" -c:a copy output.mp4
  4. Export at platform resolution
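
For step 4, a hedged example letterboxing to 9:16 portrait (1080x1920 matches current TikTok/Reels specs; adjust if the platform's requirements differ):

# Fit into 1080x1920 with padding, leave audio untouched
ffmpeg -i output.mp4 -vf "scale=1080:1920:force_original_aspect_ratio=decrease,pad=1080:1920:(ow-iw)/2:(oh-ih)/2" -c:a copy tiktok.mp4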

Netflix/Professional

  1. Use Whisper large-v3 for best local accuracy
  2. Export TTML format
  3. Verify: 42 chars/line, 2 lines max, timing gaps
  4. Include translator credit as last subtitle

Podcast/Interview

  1. Enable speaker diarization
  2. Format as dialogue: [SPEAKER]: text
  3. SDH option: include [music], [laughter] descriptions

Foreign Film Translation

  1. Transcribe in original language
  2. Translate: --task translate for English
  3. Or use external translation + timing sync

External Endpoints

Default: 100% LOCAL processing. No network calls.

Endpoint             Data Sent      When Used
Whisper (local)      None (local)   Default (always)
api.assemblyai.com   Audio file     Only if user sets ASSEMBLYAI_API_KEY
api.deepgram.com     Audio file     Only if user sets DEEPGRAM_API_KEY

Cloud APIs are documented as alternatives but never used unless user explicitly provides API keys and requests cloud processing. By default, all processing stays on your machine.

Security & Privacy

Default workflow is 100% offline:

  • Whisper runs locally on your machine
  • Generated subtitle files stay local
  • Burned-in videos stay local
  • No network calls made

Cloud APIs are OPTIONAL and OPT-IN:

  • Only used if you set ASSEMBLYAI_API_KEY or DEEPGRAM_API_KEY
  • Only triggered when you explicitly use cloud engine commands
  • If you never set these keys, no audio ever leaves your machine

This skill does NOT:

  • Upload anything by default
  • Require internet connection for basic use
  • Store data externally

Related Skills

Install with clawhub install <slug> if user confirms:

  • ffmpeg — video/audio processing
  • video — general video tasks
  • video-edit — video editing
  • audio — audio processing

Feedback

  • If useful: clawhub star video-captions
  • Stay updated: clawhub sync
