audio-transcribe

Speech-to-text transcription using WhisperX with word-level timestamps. Use when users ask to transcribe audio or video to text, generate subtitles, or recognize speech.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Installation

Install the skill with:

npx skills add maxgent-ai/maxgent-plugin/maxgent-ai-maxgent-plugin-audio-transcribe

Audio Transcriber

Speech recognition using WhisperX with multi-language support and word-level timestamp alignment.

Prerequisites

Requires Python 3.12 (uv manages this automatically).
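If uv is not installed yet, it can be added with Astral's official installer:

# Install uv if it is not already available (official installer)
curl -LsSf https://astral.sh/uv/install.sh | sh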

Usage

When the user wants to transcribe audio/video: $ARGUMENTS

Instructions

Step 1: Get input file

If the user has not provided an input file path, ask them to provide one.

Supported formats:

  • Audio: MP3, WAV, FLAC, M4A, OGG, etc.
  • Video: MP4, MKV, MOV, AVI, etc. (audio is extracted automatically)

Verify the file exists:

ls -la "$INPUT_FILE"
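For video input, the script extracts the audio track automatically. If you want to pre-process the audio yourself (for example, to trim a clip first), a standard ffmpeg invocation works; 16 kHz mono is what Whisper-family models expect internally:

# Optional manual pre-processing: extract a 16 kHz mono WAV from a video
ffmpeg -i "video.mp4" -vn -ar 16000 -ac 1 "audio.wav"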

Step 2: Ask user for configuration

Warning: You MUST use AskUserQuestion to collect user preferences. Do not skip this step.

Use AskUserQuestion to collect the following (a sketch mapping these answers to script flags appears after the list):

  1. Model size: Choose the recognition model

    • Options:
      • "base - Balanced speed and accuracy (Recommended)"
      • "tiny - Fastest, lower accuracy"
      • "small - Faster, moderate accuracy"
      • "medium - Slower, higher accuracy"
      • "large-v2 - Slowest, highest accuracy"
  2. Language: What language is the audio?

    • Options:
      • "Auto-detect (Recommended)"
      • "Chinese (zh)"
      • "English (en)"
      • "Japanese (ja)"
      • "Other"
  3. Word-level alignment: Do you need word-level timestamps?

    • Options:
      • "Yes - Precise timing for each word (Recommended)"
      • "No - Sentence-level timing only (faster)"
  4. Output format: What format to output?

    • Options:
      • "TXT - Plain text with timestamps (Recommended)"
      • "SRT - Subtitle format"
      • "VTT - Web subtitle format"
      • "JSON - Structured data (with word-level info)"
  5. Output path: Where to save?

  • Default: same directory as the input file, named <original_name>.txt (or with the extension matching the chosen format)
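The collected answers map directly onto the transcribe.py flags documented in Step 3. A minimal sketch of that mapping in Python, assuming the option labels above; the function name and defaults are illustrative, not part of the skill:

def build_command(input_file, model="base", language=None,
                  align=True, vad=True, fmt="txt", output=None):
    # Only the flags themselves (--model, --language, --no-align,
    # --no-vad, --format, --output) come from the skill's interface;
    # everything else here is a hypothetical helper.
    cmd = ["uv", "run", "skills/audio-transcribe/transcribe.py", input_file,
           "--model", model, "--format", fmt]
    if language:              # omit the flag for auto-detect
        cmd += ["--language", language]
    if not align:             # "No - Sentence-level timing only"
        cmd.append("--no-align")
    if not vad:               # only when fixing time jumps
        cmd.append("--no-vad")
    if output:
        cmd += ["--output", output]
    return cmd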

Step 3: Run transcription script

Use the transcribe.py script in the skill directory:

uv run /path/to/skills/audio-transcribe/transcribe.py "INPUT_FILE" [OPTIONS]

Parameters:

  • --model, -m: Model size (tiny/base/small/medium/large-v2)
  • --language, -l: Language code (en/zh/ja/...), auto-detect if not specified
  • --no-align: Skip word-level alignment
  • --no-vad: Disable VAD (voice activity detection) filtering (use if the transcription has time jumps or missing segments)
  • --output, -o: Output file path
  • --format, -f: Output format (srt/vtt/txt/json)

Examples:

# Basic transcription (auto-detect language)
uv run skills/audio-transcribe/transcribe.py "video.mp4" -o "video.txt"

# Chinese transcription, output SRT subtitles
uv run skills/audio-transcribe/transcribe.py "audio.mp3" -l zh -f srt -o "subtitles.srt"

# Fast transcription, skip word alignment
uv run skills/audio-transcribe/transcribe.py "audio.wav" --no-align -o "transcript.txt"

# Use a larger model, output JSON (with word-level timestamps)
uv run skills/audio-transcribe/transcribe.py "speech.mp3" -m medium -f json -o "result.json"

# Disable VAD filtering (fix time jumps / missing segments)
uv run skills/audio-transcribe/transcribe.py "audio.mp3" --no-vad -o "transcript.txt"

Step 4: Present results

After transcription completes:

  1. Show the full output file path
  2. Display a preview of the transcription content
  3. Report total duration and segment count
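For JSON output, the duration and segment count can be read straight from the result file; a minimal sketch based on the JSON structure shown below:

import json

with open("result.json") as f:
    segments = json.load(f)

print(f"Segments: {len(segments)}")
print(f"Total duration: {segments[-1]['end']:.1f}s")
print(f"Preview: {segments[0]['text']}")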

Output format reference

TXT format

[00:00:00.000 - 00:00:03.500] This is the first sentence
[00:00:03.500 - 00:00:07.200] This is the second sentence

SRT format

1
00:00:00,000 --> 00:00:03,500
This is the first sentence

2
00:00:03,500 --> 00:00:07,200
This is the second sentence

JSON format (with word-level)

[
  {
    "start": 0.0,
    "end": 3.5,
    "text": "This is the first sentence",
    "words": [
      {"word": "This", "start": 0.0, "end": 0.5, "score": 0.95},
      ...
    ]
  }
]
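Since the JSON output carries full timing, it can be converted to SRT (or TXT) after the fact. A sketch of the SRT conversion; the helper name is illustrative:

import json

def srt_time(seconds):
    # SRT timestamps are HH:MM:SS,mmm with a comma before milliseconds
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

with open("result.json") as f:
    segments = json.load(f)

for i, seg in enumerate(segments, start=1):
    print(i)
    print(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}")
    print(seg["text"])
    print()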

Troubleshooting

Slow on first run:

  • WhisperX downloads model files on first use, so the first run is slower
  • Subsequent runs use the cached model

Out of memory:

  • Use a smaller model (tiny or base)
  • Ensure the system has enough memory

Low recognition accuracy:

  • Try a larger model (medium or large-v2)
  • Explicitly specify the language instead of auto-detect

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals. No summaries are provided by the upstream source for these skills.

  • video-gen
  • youtube-download
  • browser
  • media-understand