baoyu-youtube-transcript

Downloads YouTube video transcripts/subtitles and cover images by URL or video ID. Supports multiple languages, translation, chapters, and speaker identification. Caches raw data for fast re-formatting. Use when user asks to "get YouTube transcript", "download subtitles", "get captions", "YouTube字幕", "YouTube封面", "视频封面", "video thumbnail", "video cover image", or provides a YouTube URL and wants the transcript/subtitle text or cover image extracted.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Installation

Install the skill with: npx skills add jimliu/baoyu-skills/jimliu-baoyu-skills-baoyu-youtube-transcript

YouTube Transcript

Downloads transcripts (subtitles/captions) from YouTube videos. Works with both manually created and auto-generated transcripts. No API key or browser required — uses YouTube's InnerTube API directly.

Fetches video metadata and cover image on first run, caches raw data for fast re-formatting.

Script Directory

Scripts live in the scripts/ subdirectory. {baseDir} is the directory containing this SKILL.md. Resolve the ${BUN_X} runtime as follows: if bun is installed, use bun; otherwise, if npx is available, use npx -y bun; otherwise suggest installing bun. Replace {baseDir} and ${BUN_X} with their actual values in the commands below.
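The resolution rule above can be sketched in shell (a minimal sketch; the final echo is only for illustration):

```shell
# Resolve the ${BUN_X} runtime, following the rule above.
if command -v bun >/dev/null 2>&1; then
  BUN_X="bun"                 # bun is installed: run scripts directly
elif command -v npx >/dev/null 2>&1; then
  BUN_X="npx -y bun"          # fall back to fetching bun via npx
else
  BUN_X=""                    # neither found: suggest installing bun
  echo "bun not found; install it from https://bun.sh" >&2
fi
echo "BUN_X=${BUN_X:-<unresolved>}"
```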

| Script | Purpose |
| --- | --- |
| scripts/main.ts | Transcript download CLI |

Usage

# Default: markdown with timestamps (English)
${BUN_X} {baseDir}/scripts/main.ts <youtube-url-or-id>

# Specify languages (priority order)
${BUN_X} {baseDir}/scripts/main.ts <url> --languages zh,en,ja

# Without timestamps
${BUN_X} {baseDir}/scripts/main.ts <url> --no-timestamps

# With chapter segmentation
${BUN_X} {baseDir}/scripts/main.ts <url> --chapters

# With speaker identification (requires AI post-processing)
${BUN_X} {baseDir}/scripts/main.ts <url> --speakers

# SRT subtitle file
${BUN_X} {baseDir}/scripts/main.ts <url> --format srt

# Translate transcript
${BUN_X} {baseDir}/scripts/main.ts <url> --translate zh-Hans

# List available transcripts
${BUN_X} {baseDir}/scripts/main.ts <url> --list

# Force re-fetch (ignore cache)
${BUN_X} {baseDir}/scripts/main.ts <url> --refresh

Options

| Option | Description | Default |
| --- | --- | --- |
| `<url-or-id>` | YouTube URL or video ID (multiple allowed) | Required |
| `--languages <codes>` | Language codes, comma-separated, in priority order | `en` |
| `--format <fmt>` | Output format: `text`, `srt` | `text` |
| `--translate <code>` | Translate to the specified language code | |
| `--list` | List available transcripts instead of fetching | |
| `--timestamps` | Include `[HH:MM:SS → HH:MM:SS]` timestamps per paragraph | on |
| `--no-timestamps` | Disable timestamps | |
| `--chapters` | Chapter segmentation from the video description | |
| `--speakers` | Raw transcript with metadata for speaker identification | |
| `--exclude-generated` | Skip auto-generated transcripts | |
| `--exclude-manually-created` | Skip manually created transcripts | |
| `--refresh` | Force re-fetch, ignoring cached data | |
| `-o, --output <path>` | Save to a specific file path | auto-generated |
| `--output-dir <dir>` | Base output directory | `youtube-transcript` |

Input Formats

Accepts any of these as video input:

  • Full URL: https://www.youtube.com/watch?v=dQw4w9WgXcQ
  • Short URL: https://youtu.be/dQw4w9WgXcQ
  • Embed URL: https://www.youtube.com/embed/dQw4w9WgXcQ
  • Shorts URL: https://www.youtube.com/shorts/dQw4w9WgXcQ
  • Video ID: dQw4w9WgXcQ
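All of these inputs reduce to the same 11-character video ID. A minimal sketch of that normalization (extractVideoId is a hypothetical helper for illustration, not the script's actual API):

```typescript
// 11 characters from YouTube's ID alphabet (letters, digits, -, _).
const ID_RE = /^[A-Za-z0-9_-]{11}$/;

export function extractVideoId(input: string): string | null {
  if (ID_RE.test(input)) return input; // bare video ID
  try {
    const url = new URL(input);
    // watch?v=ID
    const v = url.searchParams.get("v");
    if (v && ID_RE.test(v)) return v;
    // youtu.be/ID, /embed/ID, /shorts/ID
    const m = url.pathname.match(/^\/(?:embed\/|shorts\/)?([A-Za-z0-9_-]{11})$/);
    if (m) return m[1];
  } catch {
    // not a parseable URL
  }
  return null;
}
```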

Output Formats

| Format | Extension | Description |
| --- | --- | --- |
| text | .md | Markdown with frontmatter (incl. description), title heading, summary, optional TOC/cover/timestamps/chapters/speakers |
| srt | .srt | SubRip subtitle format for video players |

Output Directory

youtube-transcript/
├── .index.json                          # Video ID → directory path mapping (for cache lookup)
└── {channel-slug}/{title-full-slug}/
    ├── meta.json                        # Video metadata (title, channel, description, duration, chapters, etc.)
    ├── transcript-raw.json              # Raw transcript snippets from YouTube API (cached)
    ├── transcript-sentences.json        # Sentence-segmented transcript (split by punctuation, merged across snippets)
    ├── imgs/
    │   └── cover.jpg                    # Video thumbnail
    ├── transcript.md                    # Markdown transcript (generated from sentences)
    └── transcript.srt                   # SRT subtitle (generated from raw snippets, if --format srt)
  • {channel-slug}: Channel name in kebab-case
  • {title-full-slug}: Full video title in kebab-case

The --list mode outputs to stdout only (no file saved).
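The .index.json cache lookup can be sketched as follows (findCachedDir is a hypothetical helper; the real script's internals may differ):

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Look up a video's cached directory via .index.json
// (videoId → relative path), matching the layout above.
export function findCachedDir(baseDir: string, videoId: string): string | null {
  const indexPath = join(baseDir, ".index.json");
  if (!existsSync(indexPath)) return null;
  const index: Record<string, string> = JSON.parse(readFileSync(indexPath, "utf8"));
  const dir = index[videoId];
  return dir && existsSync(join(baseDir, dir)) ? join(baseDir, dir) : null;
}
```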

Caching

On first fetch, the script saves:

  • meta.json — video metadata, chapters, cover image path, language info
  • transcript-raw.json — raw transcript snippets from YouTube API ({ text, start, duration }[])
  • transcript-sentences.json — sentence-segmented transcript ({ text, start: "HH:mm:ss", end: "HH:mm:ss" }[]), split by sentence-ending punctuation (.?!…。?! etc.), timestamps proportionally allocated by character length, CJK-aware text merging
  • imgs/cover.jpg — video thumbnail
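The proportional timestamp allocation can be sketched for a single snippet (a simplified, hypothetical version; per the description above, the real script also merges text across snippets and handles CJK spacing):

```typescript
interface Snippet { text: string; start: number; duration: number; }
interface Sentence { text: string; start: string; end: string; }

const hms = (s: number): string => {
  const h = Math.floor(s / 3600), m = Math.floor((s % 3600) / 60), sec = Math.floor(s % 60);
  return [h, m, sec].map(n => String(n).padStart(2, "0")).join(":");
};

// Split one snippet at sentence-ending punctuation and give each
// sentence a time share proportional to its character length.
export function splitSnippet(sn: Snippet): Sentence[] {
  const parts = sn.text.match(/[^.?!…。？！]+[.?!…。？！]*/g) ?? [sn.text];
  const total = parts.reduce((n, p) => n + p.length, 0);
  let t = sn.start;
  return parts.map(p => {
    const dur = sn.duration * (p.length / total);
    const out = { text: p.trim(), start: hms(t), end: hms(t + dur) };
    t += dur;
    return out;
  });
}
```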

Subsequent runs for the same video use cached data (no network calls). Use --refresh to force re-fetch. If a different language is requested, the cache is automatically refreshed.

SRT output (--format srt) is generated from transcript-raw.json. Text/markdown output uses transcript-sentences.json for natural sentence boundaries.
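Generating SRT from raw snippets amounts to numbering each snippet and formatting HH:MM:SS,mmm timestamps. A hypothetical sketch (rounding at second boundaries is ignored here):

```typescript
interface Snippet { text: string; start: number; duration: number; }

// Format seconds as the SubRip timestamp HH:MM:SS,mmm.
const srtTime = (s: number): string => {
  const ms = Math.round((s % 1) * 1000);
  const h = Math.floor(s / 3600), m = Math.floor((s % 3600) / 60), sec = Math.floor(s % 60);
  const pad = (n: number, w: number) => String(n).padStart(w, "0");
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(sec, 2)},${pad(ms, 3)}`;
};

// Emit numbered SRT cues, one per raw snippet.
export function toSrt(snippets: Snippet[]): string {
  return snippets
    .map((sn, i) =>
      `${i + 1}\n${srtTime(sn.start)} --> ${srtTime(sn.start + sn.duration)}\n${sn.text}\n`)
    .join("\n");
}
```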

Workflow

When the user provides a YouTube URL and wants the transcript:

  1. Run with --list first if the user hasn't specified a language, to show available options
  2. Always single-quote the URL when running the script — zsh treats ? as a glob wildcard, so an unquoted YouTube URL causes "no matches found": use 'https://www.youtube.com/watch?v=ID'
  3. Default: run with --chapters --speakers for the richest output (chapters + speaker identification)
  4. The script auto-saves cached data + output file and prints the file path
  5. For --speakers mode: after the script saves the raw file, follow the speaker identification workflow below to post-process with speaker labels

If the user only wants the cover image or metadata, running the script with any options still caches meta.json and imgs/cover.jpg.

When re-formatting the same video (e.g., first text then SRT), the cached data is reused — no re-fetch needed.

Chapter & Speaker Workflow

Chapters (--chapters)

The script parses chapter timestamps from the video description (e.g., 0:00 Introduction), segments the transcript by chapter boundaries, groups snippets into readable paragraphs, and saves as .md with a Table of Contents. No further processing needed.

If no chapter timestamps exist in the description, the transcript is output as grouped paragraphs without chapter headings.
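Chapter detection of this kind can be sketched as a line-by-line scan of the description (parseChapters is a hypothetical helper, not the script's actual API):

```typescript
interface Chapter { title: string; start: number; }

// Parse "0:00 Introduction" / "1:02:30 Q&A" style lines from a
// video description into chapter titles with start offsets in seconds.
export function parseChapters(description: string): Chapter[] {
  const re = /^(?:(\d+):)?(\d{1,2}):(\d{2})\s+(.+)$/;
  const chapters: Chapter[] = [];
  for (const line of description.split("\n")) {
    const m = line.trim().match(re);
    if (!m) continue;
    const [, h, min, sec, title] = m;
    chapters.push({ title, start: (h ? +h * 3600 : 0) + +min * 60 + +sec });
  }
  return chapters;
}
```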

Speaker Identification (--speakers)

Speaker identification requires AI processing. The script outputs a raw .md file containing:

  • YAML frontmatter with video metadata (title, channel, date, cover, description, language)
  • Video description (for speaker name extraction)
  • Chapter list from description (if available)
  • Raw transcript in SRT format (pre-computed start/end timestamps, token-efficient)

After the script saves the raw file, spawn a sub-agent (use a cheaper model like Sonnet for cost efficiency) to process speaker identification:

  1. Read the saved .md file
  2. Read the prompt template at {baseDir}/prompts/speaker-transcript.md
  3. Process the raw transcript following the prompt:
    • Identify speakers using video metadata (title → guest, channel → host, description → names)
    • Detect speaker turns from conversation flow, question-answer patterns, and contextual cues
    • Segment into chapters (use description chapters if available, else create from topic shifts)
    • Format with **Speaker Name:** labels, paragraph grouping (2-4 sentences), and [HH:MM:SS → HH:MM:SS] timestamps
  4. Overwrite the .md file with the processed transcript (keep the YAML frontmatter)

When --speakers is used, --chapters is implied — the processed output always includes chapter segmentation.

Error Cases

| Error | Meaning |
| --- | --- |
| Transcripts disabled | Video has no captions at all |
| No transcript found | Requested language not available |
| Video unavailable | Video deleted, private, or region-locked |
| IP blocked | Too many requests; try again later |
| Age restricted | Video requires login for age verification |
