mk-youtube-audio-transcribe

Transcribe audio to text using local whisper.cpp. Use when the user wants to convert audio or video to text, get a transcription, or do speech-to-text.

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running.

To install, copy this command and send it to your AI assistant:

Install skill "mk-youtube-audio-transcribe" with this command: npx skills add kouko/monkey-knowledge-youtube-skills/kouko-monkey-knowledge-youtube-skills-mk-youtube-audio-transcribe

YouTube Audio Transcribe

Transcribe audio files to text using local whisper.cpp (no cloud API required).

Quick Start

/mk-youtube-audio-transcribe <audio_file> [model] [language] [--force]

Parameters

Parameter   Required  Default  Description
audio_file  Yes       -        Path to audio file
model       No        auto     Model: auto, tiny, base, small, medium, large-v3, belle-zh, kotoba-ja
language    No        auto     Language code: en, ja, zh, auto (auto-detect)
--force     No        false    Force re-transcribe even if cached file exists

Examples

  • /mk-youtube-audio-transcribe /path/to/audio/video.m4a - Transcribe with auto model selection
  • /mk-youtube-audio-transcribe video.m4a auto zh - Auto-select best model for Chinese → belle-zh
  • /mk-youtube-audio-transcribe video.m4a auto ja - Auto-select best model for Japanese → kotoba-ja
  • /mk-youtube-audio-transcribe audio.mp3 small en - Use small model, force English
  • /mk-youtube-audio-transcribe podcast.wav medium ja - Use medium model (explicit), Japanese

How it Works

  1. Execute: {baseDir}/scripts/transcribe.sh "<audio_file>" "<model>" "<language>"
  2. Auto-download model if not found (with progress)
  3. Convert audio to 16kHz mono WAV using ffmpeg
  4. Run whisper-cli for transcription
  5. Save full JSON to {baseDir}/data/<filename>.json
  6. Save plain text to {baseDir}/data/<filename>.txt
  7. Return file paths and metadata
┌─────────────────────────────┐
│      transcribe.sh          │
│  audio_file, [model], [lang]│
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│   ffmpeg: convert to WAV    │
│   16kHz, mono, pcm_s16le    │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│   whisper-cli: transcribe   │
│   with Metal acceleration   │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│   Save to files             │
│   .json (full) + .txt       │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│   Return file paths         │
│  {file_path, text_file_path}│
└─────────────────────────────┘
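The conversion and transcription steps above can be sketched in shell. This is a dry run that only prints the commands; the model path and whisper-cli flags are assumptions for illustration, and the real logic lives in scripts/transcribe.sh:

```shell
# Dry-run sketch of steps 3-4: print the commands transcribe.sh would run.
# Model path and whisper-cli flags are illustrative assumptions.
audio_file="video.m4a"
model_path="models/ggml-medium.bin"
wav_file="${audio_file%.*}.wav"          # -> video.wav

# Step 3: normalize to 16kHz mono 16-bit PCM, the input whisper.cpp expects
echo ffmpeg -y -i "$audio_file" -ar 16000 -ac 1 -c:a pcm_s16le "$wav_file"

# Step 4: transcribe with whisper-cli, auto-detecting the language
echo whisper-cli -m "$model_path" -f "$wav_file" -l auto
```

Echoing instead of executing keeps the sketch runnable without ffmpeg or whisper-cli installed.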

Output Format

Success:

{
  "status": "success",
  "file_path": "{baseDir}/data/20091025__VIDEO_ID.json",
  "text_file_path": "{baseDir}/data/20091025__VIDEO_ID.txt",
  "language": "en",
  "duration": "3:32",
  "model": "medium",
  "char_count": 12345,
  "line_count": 100,
  "text_char_count": 10000,
  "text_line_count": 50,
  "cached": false,
  "video_id": "dQw4w9WgXcQ",
  "title": "Video Title",
  "channel": "Channel Name",
  "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}

Cache hit (returns existing transcription):

{
  "status": "success",
  "file_path": "{baseDir}/data/20091025__VIDEO_ID.json",
  "cached": true,
  ...
}

Error (general):

{
  "status": "error",
  "message": "Error description"
}

Error (unknown model):

{
  "status": "error",
  "error_code": "UNKNOWN_MODEL",
  "message": "Unknown model: invalid-name",
  "available_models": ["tiny", "base", "small", "medium", "large-v3", "large-v3-turbo", "belle-zh", "kotoba-ja", "kotoba-ja-q5"]
}

When you receive an UNKNOWN_MODEL error, suggest a valid model from the available_models list.

Error (model not found):

{
  "status": "error",
  "error_code": "MODEL_NOT_FOUND",
  "message": "Model 'medium' not found. Please download it first.",
  "model": "medium",
  "model_size": "1.4GB",
  "download_url": "https://huggingface.co/...",
  "download_command": "curl -L --progress-bar -o '/path/to/models/ggml-medium.bin' 'https://...' 2>&1"
}

When you receive MODEL_NOT_FOUND error:

  1. Inform user: "Downloading model '{model}' ({model_size})..."
  2. Execute download_command using the Bash tool with timeout: 1800000 (30 minutes)
  3. After download completes: re-run the original transcribe command

Error (model corrupted):

{
  "status": "error",
  "error_code": "MODEL_CORRUPTED",
  "message": "Model 'medium' is corrupted or incomplete. Please re-download.",
  "model": "medium",
  "model_size": "1.4GB",
  "expected_sha256": "6c14d5adee5f86394037b4e4e8b59f1673b6cee10e3cf0b11bbdbee79c156208",
  "actual_sha256": "def456...",
  "model_path": "/path/to/models/ggml-medium.bin",
  "download_command": "rm '/path/to/models/ggml-medium.bin' && curl -L --progress-bar -o '/path/to/models/ggml-medium.bin' 'https://...' 2>&1"
}

When you receive MODEL_CORRUPTED error:

  1. Inform user: "Model '{model}' is corrupted. Re-downloading ({model_size})..."
  2. Execute download_command (removes the corrupted file and re-downloads) using the Bash tool with timeout: 1800000 (30 minutes)
  3. After download completes: re-run the original transcribe command
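The corruption check behind this error amounts to a checksum comparison. A minimal sketch, using a stand-in file and hash (on macOS, `shasum -a 256` replaces `sha256sum`):

```shell
# Sketch of the integrity check that triggers MODEL_CORRUPTED.
# The model file here is a stand-in; a real check compares the on-disk
# hash against a known-good expected_sha256 for that model.
model_path="/tmp/ggml-demo.bin"
printf 'demo model bytes' > "$model_path"

expected_sha256="$(sha256sum "$model_path" | cut -d' ' -f1)"   # known-good hash
actual_sha256="$(sha256sum "$model_path" | cut -d' ' -f1)"     # hash on disk

if [ "$actual_sha256" = "$expected_sha256" ]; then
  echo "model OK"
else
  echo "MODEL_CORRUPTED: remove $model_path and re-download"
fi
```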

Output Fields

Field            Description
file_path        Absolute path to JSON file (with segments)
text_file_path   Absolute path to plain text file
language         Detected language code
duration         Audio duration
model            Model used for transcription
char_count       Character count of JSON file
line_count       Line count of JSON file
text_char_count  Character count of plain text file
text_line_count  Line count of plain text file
video_id         YouTube video ID (from centralized metadata store)
title            Video title (from centralized metadata store)
channel          Channel name (from centralized metadata store)
url              Full video URL (from centralized metadata store)

Filename Format

Output files preserve the input audio file's unified naming format, with a date prefix: {YYYYMMDD}__{video_id}.{ext}

Example: 20091025__dQw4w9WgXcQ.json
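Building such a filename is a simple string concatenation. A small sketch, with the date and video id taken from the example above:

```shell
# Build the unified output filenames from upload date and video id.
upload_date="20091025"
video_id="dQw4w9WgXcQ"
json_file="${upload_date}__${video_id}.json"
txt_file="${upload_date}__${video_id}.txt"
echo "$json_file"   # 20091025__dQw4w9WgXcQ.json
```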

JSON File Format

The JSON file at file_path contains:

{
  "text": "Full transcription text...",
  "language": "en",
  "duration": "3:32",
  "model": "medium",
  "segments": [
    {
      "start": "00:00:00.000",
      "end": "00:00:05.000",
      "text": "First segment..."
    }
  ]
}
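As a small sketch of consuming this file, the snippet below writes a minimal transcript JSON and pulls one field out with POSIX tools only. This is a toy extraction for illustration; a real consumer would typically use the Read tool (or jq) for structured access:

```shell
# Write a minimal transcript JSON and extract one field with sed.
# Toy extraction only: it relies on one-key-per-line formatting.
cat > /tmp/transcript-demo.json <<'EOF'
{
  "text": "Full transcription text...",
  "language": "en",
  "duration": "3:32",
  "model": "medium"
}
EOF

lang="$(sed -n 's/.*"language": "\([^"]*\)".*/\1/p' /tmp/transcript-demo.json)"
echo "$lang"   # en
```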

Models

Standard Models

Model           Size   RAM     Speed     Accuracy
auto            -      -       -         Auto-select based on language (default)
tiny            74MB   ~273MB  Fastest   Low
base            141MB  ~388MB  Fast      Medium
small           465MB  ~852MB  Moderate  Good
medium          1.4GB  ~2.1GB  Slow      High
large-v3        2.9GB  ~3.9GB  Slowest   Best
large-v3-turbo  1.5GB  ~2.1GB  Moderate  High (optimized for speed)

Language-Specialized Models

Model         Language  Size   Description
belle-zh      Chinese   1.5GB  BELLE-2 Chinese-specialized model
kotoba-ja     Japanese  1.4GB  kotoba-tech Japanese-specialized model
kotoba-ja-q5  Japanese  513MB  Quantized version (faster, smaller)

Auto-Selection (model=auto)

When model is auto (default), the system automatically selects the best model based on language:

Language  Auto-Selected Model
zh        belle-zh (Chinese-specialized)
ja        kotoba-ja (Japanese-specialized)
others    medium (general purpose)

Example: /mk-youtube-audio-transcribe video.m4a auto zh → uses belle-zh
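The auto-selection rule above reduces to a simple case statement, sketched here (the function name is illustrative; the real mapping lives in scripts/transcribe.sh):

```shell
# Sketch of the model=auto selection rule.
select_model() {
  case "$1" in
    zh) echo "belle-zh"  ;;   # Chinese-specialized
    ja) echo "kotoba-ja" ;;   # Japanese-specialized
    *)  echo "medium"    ;;   # general purpose fallback
  esac
}

select_model zh   # belle-zh
select_model ja   # kotoba-ja
select_model en   # medium
```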

Notes

  • File caching: If transcription already exists for this video, it will be reused (returns cached: true)
  • Force refresh: Use --force flag to re-transcribe even if cached file exists
  • Specify language for best results - enables auto-selection of specialized models (zh→belle-zh, ja→kotoba-ja)
  • Use Read tool to get file content from file_path or text_file_path
  • Models must be downloaded before first use; a missing model returns a MODEL_NOT_FOUND error with a ready-to-run download command
  • Uses Metal acceleration on macOS for faster processing
  • Supports auto language detection
  • Audio is converted to 16kHz WAV for optimal results
  • Requires ffmpeg and whisper-cli (pre-built in bin/)

Model Download

Models must be downloaded before transcription. When you receive a MODEL_NOT_FOUND error, execute the download_command with timeout: 1800000.

# In terminal (to see progress bar)
./scripts/download-model.sh medium      # 1.4GB
./scripts/download-model.sh belle-zh    # 1.5GB (Chinese)
./scripts/download-model.sh kotoba-ja   # 1.4GB (Japanese)
./scripts/download-model.sh --list      # Show all available models

Next Step

After transcription completes, invoke /mk-youtube-transcript-summarize with the text_file_path from the output to generate a structured summary:

/mk-youtube-transcript-summarize <text_file_path>

IMPORTANT: Always use the Skill tool to invoke /mk-youtube-transcript-summarize. Do NOT generate summaries directly without loading the skill — it contains critical rules for compression ratio, section structure, data preservation, and language handling.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • Speech to Text - Transcribe or translate audio files to text using a public Hugging Face Whisper Space over Gradio. Use when the user sends voice notes, audio attachments, me...
  • Telegram Voice Transcribe - Transcribe Telegram voice messages and audio notes into text using the OpenAI Whisper API. Use when (1) a user sends a voice message or audio note via Telegr...
  • SpeakNotes: YouTube, Audio & Document Summaries - Use when OpenClaw needs to call SpeakNotes API routes directly using an API key and generate transcripts/summaries from YouTube URLs, media files, or documen...
  • Whisper Transcribe - Transcribe audio files to text using OpenAI Whisper. Supports speech-to-text with auto language detection, multiple output formats (txt, srt, vtt, json), batch processing, and model selection (tiny to large). Use when transcribing audio recordings, podcasts, voice messages, lectures, meetings, or any audio/video file to text. Handles mp3, wav, m4a, ogg, flac, webm, opus, aac formats.