mk-youtube-audio-transcribe

Transcribe audio to text using local whisper.cpp. Use when the user wants to convert audio or video to text, get a transcription, or do speech-to-text.

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running.

To install, copy this command and send it to your AI assistant:

Install skill "mk-youtube-audio-transcribe" with this command: npx skills add kouko/monkey-knowledge-youtube-skills/kouko-monkey-knowledge-youtube-skills-mk-youtube-audio-transcribe

YouTube Audio Transcribe

Transcribe audio files to text using local whisper.cpp (no cloud API required).

Quick Start

/mk-youtube-audio-transcribe <audio_file> [model] [language] [--force]

Parameters

Parameter   Required  Default  Description
audio_file  Yes       -        Path to audio file
model       No        auto     Model: auto, tiny, base, small, medium, large-v3, belle-zh, kotoba-ja
language    No        auto     Language code: en, ja, zh, auto (auto-detect)
--force     No        false    Force re-transcribe even if cached file exists

Examples

  • /mk-youtube-audio-transcribe /path/to/audio/video.m4a - Transcribe with auto model selection
  • /mk-youtube-audio-transcribe video.m4a auto zh - Auto-select best model for Chinese → belle-zh
  • /mk-youtube-audio-transcribe video.m4a auto ja - Auto-select best model for Japanese → kotoba-ja
  • /mk-youtube-audio-transcribe audio.mp3 small en - Use small model, force English
  • /mk-youtube-audio-transcribe podcast.wav medium ja - Use medium model (explicit), Japanese

How it Works

  1. Execute: {baseDir}/scripts/transcribe.sh "<audio_file>" "<model>" "<language>"
  2. Auto-download model if not found (with progress)
  3. Convert audio to 16kHz mono WAV using ffmpeg
  4. Run whisper-cli for transcription
  5. Save full JSON to {baseDir}/data/<filename>.json
  6. Save plain text to {baseDir}/data/<filename>.txt
  7. Return file paths and metadata
┌─────────────────────────────┐
│      transcribe.sh          │
│  audio_file, [model], [lang]│
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│   ffmpeg: convert to WAV    │
│   16kHz, mono, pcm_s16le    │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│   whisper-cli: transcribe   │
│   with Metal acceleration   │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│   Save to files             │
│   .json (full) + .txt       │
└──────────────┬──────────────┘
               │
               ▼
┌─────────────────────────────┐
│   Return file paths         │
│  {file_path, text_file_path}│
└─────────────────────────────┘
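The conversion and transcription steps above can be sketched in shell. This is a dry run that only prints the commands; the model path and whisper-cli flags are assumptions for illustration, and the real logic lives in scripts/transcribe.sh:

```shell
# Dry-run sketch of steps 3-4: print the commands transcribe.sh would run.
# Model path and whisper-cli flags are illustrative assumptions.
audio_file="video.m4a"
model_path="models/ggml-medium.bin"
wav_file="${audio_file%.*}.wav"          # -> video.wav

# Step 3: normalize to 16kHz mono 16-bit PCM, the input whisper.cpp expects
echo ffmpeg -y -i "$audio_file" -ar 16000 -ac 1 -c:a pcm_s16le "$wav_file"

# Step 4: transcribe with whisper-cli, auto-detecting the language
echo whisper-cli -m "$model_path" -f "$wav_file" -l auto
```

Echoing instead of executing keeps the sketch runnable without ffmpeg or whisper-cli installed.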

Output Format

Success:

{
  "status": "success",
  "file_path": "{baseDir}/data/20091025__VIDEO_ID.json",
  "text_file_path": "{baseDir}/data/20091025__VIDEO_ID.txt",
  "language": "en",
  "duration": "3:32",
  "model": "medium",
  "char_count": 12345,
  "line_count": 100,
  "text_char_count": 10000,
  "text_line_count": 50,
  "cached": false,
  "video_id": "dQw4w9WgXcQ",
  "title": "Video Title",
  "channel": "Channel Name",
  "url": "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
}

Cache hit (returns existing transcription):

{
  "status": "success",
  "file_path": "{baseDir}/data/20091025__VIDEO_ID.json",
  "cached": true,
  ...
}

Error (general):

{
  "status": "error",
  "message": "Error description"
}

Error (unknown model):

{
  "status": "error",
  "error_code": "UNKNOWN_MODEL",
  "message": "Unknown model: invalid-name",
  "available_models": ["tiny", "base", "small", "medium", "large-v3", "large-v3-turbo", "belle-zh", "kotoba-ja", "kotoba-ja-q5"]
}

When you receive an UNKNOWN_MODEL error, suggest a valid model from the available_models list.

Error (model not found):

{
  "status": "error",
  "error_code": "MODEL_NOT_FOUND",
  "message": "Model 'medium' not found. Please download it first.",
  "model": "medium",
  "model_size": "1.4GB",
  "download_url": "https://huggingface.co/...",
  "download_command": "curl -L --progress-bar -o '/path/to/models/ggml-medium.bin' 'https://...' 2>&1"
}

When you receive MODEL_NOT_FOUND error:

  1. Inform user: "Downloading model '{model}' ({model_size})..."
  2. Execute download_command using the Bash tool with timeout: 1800000 (30 minutes)
  3. After download completes: re-run the original transcribe command

Error (model corrupted):

{
  "status": "error",
  "error_code": "MODEL_CORRUPTED",
  "message": "Model 'medium' is corrupted or incomplete. Please re-download.",
  "model": "medium",
  "model_size": "1.4GB",
  "expected_sha256": "6c14d5adee5f86394037b4e4e8b59f1673b6cee10e3cf0b11bbdbee79c156208",
  "actual_sha256": "def456...",
  "model_path": "/path/to/models/ggml-medium.bin",
  "download_command": "rm '/path/to/models/ggml-medium.bin' && curl -L --progress-bar -o '/path/to/models/ggml-medium.bin' 'https://...' 2>&1"
}

When you receive MODEL_CORRUPTED error:

  1. Inform user: "Model '{model}' is corrupted. Re-downloading ({model_size})..."
  2. Execute download_command (removes the corrupted file and re-downloads) using the Bash tool with timeout: 1800000 (30 minutes)
  3. After download completes: re-run the original transcribe command
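The corruption check behind this error amounts to a checksum comparison. A minimal sketch, using a stand-in file and hash (on macOS, `shasum -a 256` replaces `sha256sum`):

```shell
# Sketch of the integrity check that triggers MODEL_CORRUPTED.
# The model file here is a stand-in; a real check compares the on-disk
# hash against a known-good expected_sha256 for that model.
model_path="/tmp/ggml-demo.bin"
printf 'demo model bytes' > "$model_path"

expected_sha256="$(sha256sum "$model_path" | cut -d' ' -f1)"   # known-good hash
actual_sha256="$(sha256sum "$model_path" | cut -d' ' -f1)"     # hash on disk

if [ "$actual_sha256" = "$expected_sha256" ]; then
  echo "model OK"
else
  echo "MODEL_CORRUPTED: remove $model_path and re-download"
fi
```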

Output Fields

Field            Description
file_path        Absolute path to JSON file (with segments)
text_file_path   Absolute path to plain text file
language         Detected language code
duration         Audio duration
model            Model used for transcription
char_count       Character count of JSON file
line_count       Line count of JSON file
text_char_count  Character count of plain text file
text_line_count  Line count of plain text file
video_id         YouTube video ID (from centralized metadata store)
title            Video title (from centralized metadata store)
channel          Channel name (from centralized metadata store)
url              Full video URL (from centralized metadata store)

Filename Format

Output files preserve the input audio file's unified naming format, with a date prefix: {YYYYMMDD}__{video_id}.{ext}

Example: 20091025__dQw4w9WgXcQ.json
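Building such a filename is a simple string concatenation. A small sketch, with the date and video id taken from the example above:

```shell
# Build the unified output filenames from upload date and video id.
upload_date="20091025"
video_id="dQw4w9WgXcQ"
json_file="${upload_date}__${video_id}.json"
txt_file="${upload_date}__${video_id}.txt"
echo "$json_file"   # 20091025__dQw4w9WgXcQ.json
```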

JSON File Format

The JSON file at file_path contains:

{
  "text": "Full transcription text...",
  "language": "en",
  "duration": "3:32",
  "model": "medium",
  "segments": [
    {
      "start": "00:00:00.000",
      "end": "00:00:05.000",
      "text": "First segment..."
    }
  ]
}
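As a small sketch of consuming this file, the snippet below writes a minimal transcript JSON and pulls one field out with POSIX tools only. This is a toy extraction for illustration; a real consumer would typically use the Read tool (or jq) for structured access:

```shell
# Write a minimal transcript JSON and extract one field with sed.
# Toy extraction only: it relies on one-key-per-line formatting.
cat > /tmp/transcript-demo.json <<'EOF'
{
  "text": "Full transcription text...",
  "language": "en",
  "duration": "3:32",
  "model": "medium"
}
EOF

lang="$(sed -n 's/.*"language": "\([^"]*\)".*/\1/p' /tmp/transcript-demo.json)"
echo "$lang"   # en
```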

Models

Standard Models

Model           Size   RAM     Speed     Accuracy
auto            -      -       -         Auto-select based on language (default)
tiny            74MB   ~273MB  Fastest   Low
base            141MB  ~388MB  Fast      Medium
small           465MB  ~852MB  Moderate  Good
medium          1.4GB  ~2.1GB  Slow      High
large-v3        2.9GB  ~3.9GB  Slowest   Best
large-v3-turbo  1.5GB  ~2.1GB  Moderate  High (optimized for speed)

Language-Specialized Models

Model         Language  Size   Description
belle-zh      Chinese   1.5GB  BELLE-2 Chinese-specialized model
kotoba-ja     Japanese  1.4GB  kotoba-tech Japanese-specialized model
kotoba-ja-q5  Japanese  513MB  Quantized version (faster, smaller)

Auto-Selection (model=auto)

When model is auto (default), the system automatically selects the best model based on language:

Language  Auto-Selected Model
zh        belle-zh (Chinese-specialized)
ja        kotoba-ja (Japanese-specialized)
others    medium (general purpose)

Example: /mk-youtube-audio-transcribe video.m4a auto zh → uses belle-zh
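The auto-selection rule above reduces to a simple case statement, sketched here (the function name is illustrative; the real mapping lives in scripts/transcribe.sh):

```shell
# Sketch of the model=auto selection rule.
select_model() {
  case "$1" in
    zh) echo "belle-zh"  ;;   # Chinese-specialized
    ja) echo "kotoba-ja" ;;   # Japanese-specialized
    *)  echo "medium"    ;;   # general purpose fallback
  esac
}

select_model zh   # belle-zh
select_model ja   # kotoba-ja
select_model en   # medium
```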

Notes

  • File caching: If transcription already exists for this video, it will be reused (returns cached: true)
  • Force refresh: Use --force flag to re-transcribe even if cached file exists
  • Specify language for best results - enables auto-selection of specialized models (zh→belle-zh, ja→kotoba-ja)
  • Use Read tool to get file content from file_path or text_file_path
  • Models must be downloaded before first use; a missing model returns a MODEL_NOT_FOUND error with a ready-to-run download command
  • Uses Metal acceleration on macOS for faster processing
  • Supports auto language detection
  • Audio is converted to 16kHz WAV for optimal results
  • Requires ffmpeg and whisper-cli (pre-built in bin/)

Model Download

Models must be downloaded before transcription. When you receive a MODEL_NOT_FOUND error, execute the download_command with timeout: 1800000.

# In terminal (to see progress bar)
./scripts/download-model.sh medium      # 1.4GB
./scripts/download-model.sh belle-zh    # 1.5GB (Chinese)
./scripts/download-model.sh kotoba-ja   # 1.4GB (Japanese)
./scripts/download-model.sh --list      # Show all available models

Next Step

After transcription completes, invoke /mk-youtube-transcript-summarize with the text_file_path from the output to generate a structured summary:

/mk-youtube-transcript-summarize <text_file_path>

IMPORTANT: Always use the Skill tool to invoke /mk-youtube-transcript-summarize. Do NOT generate summaries directly without loading the skill — it contains critical rules for compression ratio, section structure, data preservation, and language handling.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • Speech to Text - Transcribe or translate audio files to text using a public Hugging Face Whisper Space over Gradio. Use when the user sends voice notes, audio attachments, me...
  • Telegram Voice Transcribe - Transcribe Telegram voice messages and audio notes into text using the OpenAI Whisper API. Use when (1) a user sends a voice message or audio note via Telegr...
  • SpeakNotes: YouTube, Audio & Document Summaries - Use when OpenClaw needs to call SpeakNotes API routes directly using an API key and generate transcripts/summaries from YouTube URLs, media files, or documen...
  • Whisper Transcribe - Transcribe audio files to text using OpenAI Whisper. Supports speech-to-text with auto language detection, multiple output formats (txt, srt, vtt, json), batch processing, and model selection (tiny to large). Use when transcribing audio recordings, podcasts, voice messages, lectures, meetings, or any audio/video file to text. Handles mp3, wav, m4a, ogg, flac, webm, opus, aac formats.