Video Subtitle Generator
Multilingual video subtitle generation and translation toolkit built on WhisperX.
Features
- Speech transcription: Extract audio from video and transcribe it into subtitles with automatic source language detection
- Multilingual translation: Translate subtitles from any source language into a configurable target language
- Bilingual subtitles: Generate source + target bilingual subtitles
Prerequisites
- Python 3.9+
- ffmpeg (required by WhisperX for audio extraction)
# macOS
brew install ffmpeg
# Ubuntu / Debian
sudo apt install ffmpeg
# Windows (Chocolatey)
choco install ffmpeg
# Windows (Scoop)
scoop install ffmpeg
Resource requirements
Before running, confirm the user is aware of the following costs:
| Resource | Details |
|---|---|
| Disk | ffmpeg ~80 MB; Python packages (torch, whisperx, etc.) 2–5 GB; Whisper model weights 39 MB – 1.5 GB depending on model size |
| CPU / GPU | WhisperX runs model inference locally. A CUDA GPU is strongly recommended for medium and large models. CPU and Apple MPS also work but are significantly slower |
| Network / API | Translation step calls a remote LLM API and incurs token-based charges. No network is needed for the transcription step once the model is downloaded |
Always confirm with the user before installing packages or downloading models, as these operations consume storage and bandwidth.
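As a rough illustration of the disk check implied above, the sketch below tests whether a path has enough free space before anything is installed. The function name and the 5 GB threshold are assumptions chosen to match the upper estimate in the table; they are not part of the toolkit.

```python
import shutil


def has_free_space(path: str, required_gb: float) -> bool:
    """Return True if the filesystem holding `path` has at least `required_gb` free."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_gb * 1024 ** 3


# Hypothetical pre-install check using the upper end of the 2-5 GB estimate.
if not has_free_space(".", 5):
    print("Less than 5 GB free; ask the user before installing packages.")
```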
Translation requires an LLM API and will incur costs. Before executing the translation step:
- Ask the user for the API provider, key, and base URL — or present any auto-discovered configuration for review
- Inform the user that translation calls a remote LLM and will consume tokens (i.e. real money)
- Do NOT proceed with translation until the user explicitly confirms the provider and acknowledges the cost
Usage
1. Environment setup
# Install dependencies (requires ~2–5 GB disk space for PyTorch and WhisperX)
pip install -r requirements.txt
# Set the API key (used for translation)
# macOS / Linux
export OPENAI_API_KEY="your-api-key"
export OPENAI_BASE_URL="https://openrouter.ai/api/v1" # Optional, defaults to OpenRouter
# Windows (PowerShell)
$env:OPENAI_API_KEY="your-api-key"
$env:OPENAI_BASE_URL="https://openrouter.ai/api/v1"
On Windows, use python instead of python3 in all commands below.
2. Transcribe video (auto-detect language)
python3 scripts/transcribe.py "/path/to/video.mp4" -o ./output -m small
Output: video.{detected_lang}.srt (e.g. video.en.srt, video.ja.srt)
Arguments:
- -o: Output directory
- -m: Model size (tiny, base, small, medium, large)
- -d: Device (cuda, cpu, mps); auto-detected by default
- -l: Force source language code (e.g. en, ja, zh); auto-detected if omitted
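For orientation, the flags above could be declared with argparse roughly as follows. This is a sketch, not the actual contents of scripts/transcribe.py: the long option names and the defaults for -o and -m are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser mirroring the documented short flags.
    parser = argparse.ArgumentParser(description="Transcribe video to SRT (sketch)")
    parser.add_argument("input", help="Video file or directory of videos")
    # Long names and the defaults for -o and -m are assumptions.
    parser.add_argument("-o", "--output", default="./output", help="Output directory")
    parser.add_argument("-m", "--model", default="small",
                        choices=["tiny", "base", "small", "medium", "large"],
                        help="Whisper model size")
    # None means "auto-detect", matching the documented behavior of -d and -l.
    parser.add_argument("-d", "--device", default=None, choices=["cuda", "cpu", "mps"])
    parser.add_argument("-l", "--language", default=None,
                        help="Source language code, e.g. en, ja, zh")
    return parser


args = build_parser().parse_args(["/path/to/video.mp4", "-o", "./output", "-m", "small"])
print(args.model, args.device, args.language)
```

Leaving -d and -l as None lets downstream code tell "not specified" apart from an explicit choice.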
3. Batch-process a directory
python3 scripts/transcribe.py "/path/to/video/folder" -o ./output -m small
4. Translate subtitles
Cost warning: This step calls a remote LLM API. Ensure the user has confirmed the API provider, key, and billing awareness before running.
# Translate to Chinese (default)
python3 scripts/translate.py ./output -o ./translated
# Translate to Japanese
python3 scripts/translate.py ./output -o ./translated -t ja
# Only generate bilingual subtitles
python3 scripts/translate.py ./output -o ./translated --bilingual
Arguments:
- -t, --target-lang: Target language code (default: zh)
- --bilingual: Generate bilingual (source + target) subtitles
- --target-only: Generate target-language-only subtitles
- --model: Translation model (default: google/gemini-3-flash-preview)
- --batch-size: Batch size (default: 10)
When neither --bilingual nor --target-only is specified, both are generated.
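The flag rule above can be written as a small helper. This is a minimal sketch; the function name is hypothetical and the real script may structure the check differently.

```python
def resolve_outputs(bilingual: bool, target_only: bool) -> tuple[bool, bool]:
    """Return (write_bilingual, write_target_only) for the given flags."""
    if not bilingual and not target_only:
        # Neither flag given: produce both kinds of subtitle file.
        return True, True
    return bilingual, target_only


print(resolve_outputs(False, False))
print(resolve_outputs(True, False))
```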
5. Run the full pipeline
python3 scripts/run.py
# Customize via environment variables
VIDEO_DIR="/path/to/videos" TARGET_LANG=en python3 scripts/run.py
Environment variables for run.py:
- VIDEO_DIR: Video source directory (default: ./videos)
- OUTPUT_DIR: Transcription output directory (default: ./output)
- TRANSLATED_DIR: Translation output directory (default: ./translated)
- TARGET_LANG: Target language code (default: zh)
- WHISPER_MODEL: Whisper model size (default: medium)
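One way to read these variables with their documented defaults is sketched below. The PipelineConfig class and load_config function are hypothetical names for illustration; only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    video_dir: str
    output_dir: str
    translated_dir: str
    target_lang: str
    whisper_model: str


def load_config(env=None) -> PipelineConfig:
    """Build the config from a mapping (os.environ by default)."""
    env = os.environ if env is None else env
    return PipelineConfig(
        video_dir=env.get("VIDEO_DIR", "./videos"),
        output_dir=env.get("OUTPUT_DIR", "./output"),
        translated_dir=env.get("TRANSLATED_DIR", "./translated"),
        target_lang=env.get("TARGET_LANG", "zh"),
        whisper_model=env.get("WHISPER_MODEL", "medium"),
    )


# Passing a plain dict keeps the example independent of the real environment.
config = load_config({"VIDEO_DIR": "/path/to/videos", "TARGET_LANG": "en"})
print(config.target_lang, config.whisper_model)
```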
Model selection
| Model | Size | Speed | Accuracy | Best for |
|---|---|---|---|---|
| tiny | 39 MB | Fastest | Fair | Quick tests |
| base | 74 MB | Fast | Good | Real-time usage |
| small | 244 MB | Medium | Good | Recommended |
| medium | 769 MB | Slower | Very good | Higher quality |
| large | 1550 MB | Slow | Best | Professional use |
Output files
For each video, the tool generates:
- *.{lang}.srt - Source-language subtitles (language auto-detected, e.g. video.en.srt)
- *.json - Full transcription data with timestamps
- *.bilingual.srt - Bilingual subtitles (source + target) after translation
- *.{target}.srt - Target-language-only subtitles after translation (e.g. video.zh.srt)
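To make the .srt outputs concrete, here is a minimal sketch of how one bilingual cue could be rendered in SubRip format. The function names are hypothetical, and placing the source line above the target line is an assumption; the toolkit may order them differently.

```python
def format_timestamp(seconds: float) -> str:
    """Convert seconds to the SRT timestamp form HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def bilingual_cue(index: int, start: float, end: float, source: str, target: str) -> str:
    """Render one SRT cue with the source text above the target text (assumed order)."""
    timing = f"{format_timestamp(start)} --> {format_timestamp(end)}"
    return f"{index}\n{timing}\n{source}\n{target}\n"


print(bilingual_cue(1, 0.0, 2.5, "Hello", "你好"))
```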
Script overview
scripts/transcribe.py
Uses WhisperX for transcription and supports:
- Automatic source language detection (or manual override via -l)
- Timestamp alignment
- Batch processing with model reuse across files
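The model-reuse idea can be sketched as below: collect the videos first, load the model once, then pass the same model to every file. The extension set, function names, and the load_model / transcribe_one callables are all assumptions standing in for the actual WhisperX calls.

```python
from pathlib import Path

# Assumed set of extensions; the real script may accept others.
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".mov", ".avi", ".webm"}


def select_videos(paths):
    """Keep only paths whose suffix looks like a video, in sorted order."""
    return sorted(p for p in map(Path, paths) if p.suffix.lower() in VIDEO_EXTENSIONS)


def find_videos(target):
    """Accept either a single file or a directory of videos."""
    target = Path(target)
    if target.is_dir():
        return select_videos(target.iterdir())
    return [target]


def transcribe_all(videos, load_model, transcribe_one):
    """Load the model once, then reuse it for every video."""
    if not videos:
        return []
    model = load_model()  # the expensive step happens a single time
    return [transcribe_one(model, video) for video in videos]
```

Taking load_model and transcribe_one as parameters keeps the loop testable without downloading any weights.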
scripts/translate.py
Uses an LLM API to translate subtitles and supports:
- Configurable target language (-t)
- Batch translation for better efficiency
- Bilingual or target-language-only output
- Custom models and API endpoints
- Automatic retry with exponential backoff on API failures
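Batching and retry with exponential backoff might look roughly like this. It is a sketch under assumptions: the helper names, the attempt limit, and the base delay are illustrative, and the real script may catch narrower exception types or add jitter.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of `items`, each holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def with_retry(call, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Run `call`, doubling the wait after each failure: 1s, 2s, 4s, ..."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# 25 subtitle lines with the documented default batch size of 10.
lines = [f"line {i}" for i in range(25)]
print([len(batch) for batch in chunked(lines, 10)])
```

Each batch would then be sent through with_retry, so a transient API failure costs one delayed request rather than the whole run.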
scripts/run.py
Cross-platform one-command runner that executes the transcription and translation pipeline automatically. Paths, target language, and model size are configurable via environment variables.
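A runner of this kind could chain the two scripts as sketched below. This is not the actual run.py: the function names are hypothetical, and using sys.executable is one assumed way to sidestep the python vs python3 difference between platforms.

```python
import subprocess
import sys


def build_commands(video_dir, output_dir, translated_dir, target_lang, model):
    """Return the transcription and translation commands as argument lists."""
    py = sys.executable  # the interpreter currently running, on any platform
    return [
        [py, "scripts/transcribe.py", video_dir, "-o", output_dir, "-m", model],
        [py, "scripts/translate.py", output_dir, "-o", translated_dir, "-t", target_lang],
    ]


def run_pipeline(commands):
    """Run each step in order; check=True stops at the first failing step."""
    for command in commands:
        subprocess.run(command, check=True)


commands = build_commands("./videos", "./output", "./translated", "zh", "medium")
for command in commands:
    print(" ".join(command[1:]))
```

Because translation incurs API charges, a real runner would call run_pipeline only after the user has confirmed the provider and cost.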