youtube-transcribe

Transcribe YouTube videos with smart fallback: extracts captions first (fast, free) and falls back to local Whisper transcription when no captions are available. Auto-detects the best Whisper backend (MLX/faster-whisper/openai-whisper) and model size based on hardware. Use when the user shares a YouTube link and wants to know what it says, get a transcript, or summarize or analyze the video content. Keywords: YouTube, transcribe, transcript, subtitles, captions, speech-to-text, whisper, mlx, video to text.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "youtube-transcribe" with this command: npx skills add iml885203/youtube-transcribe

YouTube Transcribe

Smart YouTube video transcription with automatic fallback:

  1. Captions first — extracts existing subtitles (manual or auto-generated) via yt-dlp. Fast, free, no compute.
  2. Whisper fallback — when no captions exist, downloads audio and transcribes locally with the best available Whisper backend.
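
The two-step flow above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `fetch_captions` and `run_whisper` are hypothetical callables standing in for the yt-dlp caption step and the local Whisper step, and the `"whisper"` method label is an assumption (only `"captions"` appears in the documented JSON output).

```python
from typing import Callable, Optional


def transcribe_with_fallback(
    url: str,
    fetch_captions: Callable[[str], Optional[str]],  # hypothetical: returns None when no captions exist
    run_whisper: Callable[[str], str],               # hypothetical: downloads audio and transcribes it
    force_whisper: bool = False,
) -> dict:
    """Try captions first; fall back to Whisper when none are found."""
    if not force_whisper:
        text = fetch_captions(url)
        if text:
            return {"method": "captions", "full_text": text}
    # No captions were found (or the caption check was skipped): transcribe locally.
    return {"method": "whisper", "full_text": run_whisper(url)}
```

Passing the two steps in as callables keeps the fallback decision easy to test without network access or a Whisper install.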

When to Use

Use this skill when the user wants to:

  • Get a transcript or text version of a YouTube video
  • Understand what a YouTube video says without watching it
  • Summarize, analyze, or take notes from a YouTube video
  • Extract subtitles or captions from a video

Triggers

  • "transcribe this YouTube video"
  • "what does this video say"
  • "get the transcript of [YouTube URL]"
  • "summarize this YouTube video" (transcribe first, then process)
  • Any YouTube URL shared with a request to understand its content

Requirements

Required:

  • yt-dlp — for caption extraction and audio download
  • python3

For Whisper fallback (when no captions are available):

  • ffmpeg — for audio processing
  • One of these Whisper backends (auto-detected in priority order):
    1. mlx-whisper — Apple Silicon native, fastest on Mac (pip install mlx-whisper)
    2. faster-whisper — CTranslate2 backend, fast on CUDA/CPU (pip install faster-whisper)
    3. openai-whisper — Original Whisper, universal fallback (pip install openai-whisper)

Usage

Basic — transcribe a video

python3 {baseDir}/scripts/transcribe.py "https://www.youtube.com/watch?v=VIDEO_ID"

Specify language for captions

python3 {baseDir}/scripts/transcribe.py "URL" --language zh

Force Whisper (skip caption check)

python3 {baseDir}/scripts/transcribe.py "URL" --force-whisper

JSON output

python3 {baseDir}/scripts/transcribe.py "URL" --format json

Save to file

python3 {baseDir}/scripts/transcribe.py "URL" --output transcript.txt

Options

Flag             Default  Description
--language       auto     Preferred subtitle/transcription language (e.g. zh, en, ja)
--format         text     Output format: text, json, srt, vtt
--output         stdout   Save transcript to file
--force-whisper  false    Skip caption extraction, go straight to Whisper
--backend        auto     Whisper backend: auto, mlx, faster-whisper, whisper
--model          auto     Whisper model size: auto, large-v3, medium, small, base, tiny
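
When calling the script from other Python code, the flags above could be assembled into an argument list along these lines. `build_command` and its parameter names are hypothetical helpers for illustration; only the flag names come from the options listed above, and `script_path` stands for wherever `scripts/transcribe.py` lives on your machine.

```python
import sys
from typing import List, Optional


def build_command(
    script_path: str,
    url: str,
    language: Optional[str] = None,
    output_format: Optional[str] = None,
    output: Optional[str] = None,
    force_whisper: bool = False,
    backend: Optional[str] = None,
    model: Optional[str] = None,
) -> List[str]:
    """Build an argv list; options left as None fall back to the script defaults."""
    cmd = [sys.executable, script_path, url]
    for flag, value in (
        ("--language", language),
        ("--format", output_format),
        ("--output", output),
        ("--backend", backend),
        ("--model", model),
    ):
        if value is not None:
            cmd += [flag, value]
    if force_whisper:
        cmd.append("--force-whisper")  # boolean switch, takes no value
    return cmd
```

The resulting list can be handed to `subprocess.run(cmd, capture_output=True, text=True, check=True)`; passing a list rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.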

Environment Variables

Variable            Description
YT_WHISPER_BACKEND  Override Whisper backend selection
YT_WHISPER_MODEL    Override Whisper model size
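
One plausible way to combine a flag, an environment variable, and auto-detection is sketched below. The precedence shown (explicit flag, then environment variable, then auto-detection) is an assumption; this page does not say which wins when both a flag and a variable are set. `resolve_setting` is a hypothetical helper name.

```python
import os
from typing import Callable, Mapping, Optional


def resolve_setting(
    flag_value: str,
    env_name: str,
    auto_detect: Callable[[], str],
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Resolve a setting. Assumed precedence: explicit flag > env var > auto-detect."""
    env = os.environ if env is None else env
    if flag_value != "auto":
        return flag_value  # an explicit --backend / --model value wins (assumption)
    env_value = env.get(env_name, "").strip()
    if env_value and env_value != "auto":
        return env_value  # e.g. YT_WHISPER_BACKEND=faster-whisper
    return auto_detect()  # nothing specified: fall back to hardware-based detection
```

For example, `resolve_setting("auto", "YT_WHISPER_MODEL", lambda: "small")` returns the value of `YT_WHISPER_MODEL` if it is set, and `"small"` otherwise.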

Auto-Detection

Whisper Backend (priority order)

  1. MLX Whisper — detected via import mlx_whisper. Best for Apple Silicon.
  2. faster-whisper — detected via import faster_whisper. Best for CUDA GPU, good on CPU.
  3. OpenAI Whisper — detected via import whisper. Universal fallback.
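
A minimal sketch of priority-ordered detection follows. It uses `importlib.util.find_spec` to check whether each module is installed without actually importing it, which is a substitution for the plain `import` checks described above; the backend names mirror the values accepted by `--backend`, and `detect_backend` is a hypothetical function name.

```python
import importlib.util
from typing import Callable, Optional

# (name accepted by --backend, importable module name), in priority order
BACKENDS = [
    ("mlx", "mlx_whisper"),
    ("faster-whisper", "faster_whisper"),
    ("whisper", "whisper"),
]


def is_importable(module_name: str) -> bool:
    """Return True if a top-level module is installed, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def detect_backend(available: Callable[[str], bool] = is_importable) -> Optional[str]:
    """Return the first installed backend in priority order, or None if none is found."""
    for name, module in BACKENDS:
        if available(module):
            return name
    return None
```

Checking with `find_spec` avoids loading a heavy ML library just to find out that it exists.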

Model Size (based on available RAM)

RAM     Model     VRAM/RAM Usage
≥16GB   large-v3  ~6-10GB
≥8GB    medium    ~5GB
≥4GB    small     ~2.5GB
<4GB    base      ~1.5GB
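
The thresholds above translate into a simple lookup. In this sketch `pick_model` is a hypothetical name and the caller supplies the RAM figure in GB, since how the script measures available memory is not described here.

```python
def pick_model(ram_gb: float) -> str:
    """Map available RAM (in GB) to a Whisper model size using the thresholds above."""
    if ram_gb >= 16:
        return "large-v3"
    if ram_gb >= 8:
        return "medium"
    if ram_gb >= 4:
        return "small"
    return "base"
```

Note that `tiny` is never chosen automatically under these thresholds; it is only reachable through `--model tiny`.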

Caption Language Priority

When --language is not specified, captions are searched in this order:

  1. Video's original language
  2. Chinese variants: zh-Hant, zh-Hans, zh-TW, zh-CN, zh
  3. English: en
  4. Any available language
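
The search order above could look like this in code. It is a sketch under the assumption that the available caption languages arrive as a list of language codes; `pick_caption_language` is a hypothetical name.

```python
from typing import Iterable, Optional

ZH_VARIANTS = ["zh-Hant", "zh-Hans", "zh-TW", "zh-CN", "zh"]


def pick_caption_language(
    available: Iterable[str], original: Optional[str] = None
) -> Optional[str]:
    """Pick a caption track: original language, Chinese variants, English, then anything."""
    langs = list(available)
    candidates = ([original] if original else []) + ZH_VARIANTS + ["en"]
    for code in candidates:
        if code in langs:
            return code
    # Nothing preferred is available: take whatever track is listed first.
    return langs[0] if langs else None
```

A `None` result means no caption track exists at all, which is the case where the Whisper fallback would take over.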

Output Formats

text (default)

Plain text transcript, one continuous block.

json

{
  "video_id": "ZSnYlbIYpjs",
  "title": "Video Title",
  "channel": "Channel Name",
  "duration": 708,
  "language": "zh",
  "method": "captions",
  "transcript": [
    {"start": 0.0, "end": 5.2, "text": "..."},
    ...
  ],
  "full_text": "Complete transcript as single string"
}
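
As an illustration of consuming this output, the sketch below turns the `transcript` segments into timestamped lines. The field names come from the JSON shape above; `timestamped_lines` is a hypothetical helper and the sample segments are invented for the example.

```python
import json


def timestamped_lines(doc: dict) -> list:
    """Format each transcript segment as '[MM:SS] text' using its start time."""
    lines = []
    for seg in doc["transcript"]:
        minutes, seconds = divmod(int(seg["start"]), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg['text']}")
    return lines


# Invented sample data in the documented shape (only the fields used here).
sample = json.loads(
    '{"transcript": ['
    '{"start": 0.0, "end": 5.2, "text": "Welcome back."},'
    '{"start": 65.4, "end": 70.0, "text": "Next topic."}'
    "]}"
)
for line in timestamped_lines(sample):
    print(line)
```

In practice `sample` would come from `json.load` on a file written with `--format json --output transcript.json`.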

srt / vtt

Standard subtitle formats with timestamps.
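
For reference, segments in the JSON shape can be rendered as SRT with a few lines of code. This is a sketch of the SRT format itself (numbered cues, `HH:MM:SS,mmm` timestamps), not the script's own writer; `srt_timestamp` and `to_srt` are hypothetical names.

```python
def srt_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Render segments (dicts with start, end, text) as numbered SRT cues."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text']}\n"
        )
    return "\n".join(blocks)  # a blank line separates consecutive cues
```

VTT is close to this but starts with a `WEBVTT` header line and uses a period instead of a comma before the milliseconds.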
