PPT Audio To Video
Use this skill when the source video has narration audio but no usable slide visuals, and the final deliverable should be a slide-based lecture video.
Resolve bundled scripts relative to this skill directory. If the runtime has already opened this SKILL.md, prefer paths like scripts/extract_slide_outline.py and scripts/render_from_timing_csv.py instead of machine-specific absolute paths.
Core workflow
- Inventory inputs.
  - Confirm which of these exist: audio-only `mp4`/`m4a`/`mp3`/`wav`, `ppt`/`pptx`, `pdf`, and any pre-rendered slide images.
  - Prefer an existing `pdf` or image directory for rendering. Treat `pptx` as the source of slide text and as a fallback for export.
- Prepare tools.
  - Required for deterministic steps: `ffmpeg`, `ffprobe`, `pdftoppm`.
  - Required for transcription: `whisper-cli` from `whisper-cpp` plus a multilingual model such as `ggml-small.bin`.
  - If only `pptx` exists and no `pdf`/images exist, prefer `Keynote` or `PowerPoint` export on macOS. Use `soffice` only as a fallback because profile or rendering issues are common.
- Produce slide images.
  - If `pdf` exists, render it to images:

        pdftoppm -png -r 200 "$PDF" "$OUTDIR/slide"

  - If only `pptx` exists, export to `pdf` or slide images with `Keynote` or `PowerPoint`, then continue from `pdf`.
  - Keep slide filenames ordered and stable, such as `slide-01.png`, `slide-02.png`, ...
- Extract slide text.
  - Run:

        python3 scripts/extract_slide_outline.py \
          --pptx "$PPTX" \
          --out "$WORKDIR/slide_outline.csv"

  - Use the output to identify slide titles, distinctive keywords, and section changes.
- Extract clean audio for ASR.
  - For audio-only `mp4`, extract mono `wav`:

        ffmpeg -y -i "$AUDIO_MP4" -ar 16000 -ac 1 -c:a pcm_s16le "$WORKDIR/audio.wav"

  - If the source is already `wav`/`mp3`/`m4a`, convert to the same mono `wav` form if needed.
- Transcribe with `whisper-cli`.
  - Example:

        whisper-cli -ng \
          -m "$MODEL" \
          -f "$WORKDIR/audio.wav" \
          -l zh \
          -ocsv -osrt -of "$WORKDIR/transcript"

  - Prefer `transcript.csv` for downstream parsing. `transcript.srt` is useful for manual review.
  - The `-ng` flag in the example forces CPU mode. Keep it if GPU allocation fails on macOS.
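For downstream parsing, a minimal sketch of loading `transcript.csv` into segments could look like the following. It assumes the file has a `start,end,text` header with times in milliseconds, so check the header of your own file before relying on it. `load_segments` is a hypothetical helper and is not one of the bundled scripts.

```python
import csv
import io


def load_segments(csv_text):
    """Parse transcript CSV text into (start_sec, end_sec, text) tuples.

    Assumption: a `start,end,text` header with integer millisecond times.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    segments = []
    for row in reader:
        start_sec = int(row["start"]) / 1000.0
        end_sec = int(row["end"]) / 1000.0
        segments.append((start_sec, end_sec, row["text"].strip()))
    return segments


# Inline sample data, made up for illustration only.
sample = 'start,end,text\n0,2500," Hello everyone"\n2500,6000," Today, we cover architecture"\n'
for segment in load_segments(sample):
    print(segment)
```

To use it on a real transcript, read the file as UTF-8 text first and pass the contents to `load_segments`.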
- Build `slide_timings.csv`.
  - Do not average slide durations unless the user explicitly asks for it.
  - Read the transcript and slide outline together, then create a monotonic timing plan by topic changes, section boundaries, and unique keywords.
  - Use this schema:

        slide,start_sec,end_sec,duration_sec,reason
        1,0.000,15.000,15.000,opening title and agenda
        2,15.000,100.000,85.000,architecture overview starts here

  - Keep slide numbers sequential and ensure `duration_sec = end_sec - start_sec`.
  - Validate that the last `end_sec` matches the audio duration or is within a small tolerance.
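The consistency rules for `slide_timings.csv` can be checked before rendering. The sketch below is one possible check and does not reproduce the bundled render script's own validation. It assumes slide numbers start at 1, that each slide starts where the previous one ends, and that a 0.5 second tolerance counts as "small". `validate_timings` is a hypothetical helper name.

```python
import csv
import io


def validate_timings(csv_text, audio_duration_sec, tolerance_sec=0.5):
    """Return a list of problems found in the body of a slide_timings.csv."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return ["no timing rows found"]
    problems = []
    prev_end = None
    for index, row in enumerate(rows, start=1):
        slide = int(row["slide"])
        start = float(row["start_sec"])
        end = float(row["end_sec"])
        duration = float(row["duration_sec"])
        # Assumption: slides are numbered 1, 2, 3, ... in file order.
        if slide != index:
            problems.append(f"row {index}: expected slide {index}, got {slide}")
        if end <= start:
            problems.append(f"slide {slide}: end_sec must be greater than start_sec")
        if abs(duration - (end - start)) > 1e-3:
            problems.append(f"slide {slide}: duration_sec != end_sec - start_sec")
        # Assumption: the plan is contiguous, with no gaps or overlaps.
        if prev_end is not None and abs(start - prev_end) > 1e-3:
            problems.append(f"slide {slide}: starts at {start}, previous ended at {prev_end}")
        prev_end = end
    if abs(prev_end - audio_duration_sec) > tolerance_sec:
        problems.append(
            f"last end_sec {prev_end} differs from audio duration {audio_duration_sec}"
        )
    return problems
```

An empty list means the plan passed every check. Otherwise each string names the slide to fix, which fits the practice of editing only the affected rows.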
- Render the final video.
  - Run:

        python3 scripts/render_from_timing_csv.py \
          --images "$SLIDE_IMAGES_DIR" \
          --timings "$WORKDIR/slide_timings.csv" \
          --audio "$WORKDIR/audio.wav" \
          --output "$OUT_VIDEO"

  - The script generates an `ffconcat` file, validates timing continuity, and calls `ffmpeg` to encode the final `mp4`.
- Verify and iterate.
  - Check output duration with `ffprobe`.
  - If a slide cuts too early or too late, edit only the affected rows in `slide_timings.csv` and rerun the render script.
  - Keep the transcript, outline, and timing CSV as reproducible working files.
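The duration check in the verification step can be scripted. The following is a rough sketch that calls `ffprobe` through `subprocess`. The helper names and the 0.5 second tolerance are assumptions for illustration, not values taken from the bundled scripts.

```python
import subprocess


def probe_duration_sec(path):
    """Ask ffprobe for the container duration of a media file, in seconds."""
    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-show_entries", "format=duration",
            "-of", "default=noprint_wrappers=1:nokey=1",
            path,
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return float(result.stdout.strip())


def durations_match(video_sec, audio_sec, tolerance_sec=0.5):
    """Return True when the two durations differ by at most the tolerance."""
    return abs(video_sec - audio_sec) <= tolerance_sec
```

A typical use is to call `probe_duration_sec` on both the rendered video and the extracted `audio.wav`, pass the results to `durations_match`, and review the last rows of the timing CSV if it returns `False`.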
Heuristics for timing alignment
- Hold section-divider slides only briefly, usually for 5-20 seconds.
- Use the first segment that clearly switches topic as the next slide start.
- Prefer exact topic transitions over title-word matching. ASR often distorts proper nouns and product names.
- Let the model infer timings, but keep the render step deterministic through `slide_timings.csv`.
- When confidence is low, produce a first-cut video and tell the user which slide boundaries likely need review.
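As a small illustration of the section-divider heuristic, the sketch below flags divider slides that hold for less than 5 or more than 20 seconds so they can be listed for review. Which slides count as dividers is not recorded in the timing CSV, so `divider_slides` is a hypothetical input that the caller supplies, for example from the slide outline.

```python
def flag_divider_holds(timings, divider_slides, min_sec=5.0, max_sec=20.0):
    """Return (slide, duration_sec) pairs for divider slides outside the usual hold.

    `timings` is a list of (slide, start_sec, end_sec) tuples.
    """
    flagged = []
    for slide, start_sec, end_sec in timings:
        duration = end_sec - start_sec
        if slide in divider_slides and not (min_sec <= duration <= max_sec):
            flagged.append((slide, duration))
    return flagged


# Made-up timings: slides 1 and 3 are treated as dividers here.
timings = [(1, 0.0, 15.0), (2, 15.0, 100.0), (3, 100.0, 103.0)]
print(flag_divider_holds(timings, divider_slides={1, 3}))  # prints [(3, 3.0)]
```

A flagged slide is not necessarily wrong. It is only a candidate to mention when reporting low-confidence boundaries.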
Common commands
Install dependencies on macOS if missing:
brew install ffmpeg poppler whisper-cpp
Typical multilingual model download:
mkdir -p .models
curl -L 'https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-small.bin' -o .models/ggml-small.bin
Bundled scripts
- `scripts/extract_slide_outline.py`: Extract slide text from `pptx` into CSV or JSON for timing analysis.
- `scripts/render_from_timing_csv.py`: Validate a timing CSV, generate an `ffconcat`, and render the final video with `ffmpeg`.
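For readers unfamiliar with the `ffconcat` format, the sketch below shows roughly what such a file contains for an image slideshow. It is an illustration of the format only and is not the actual output of `render_from_timing_csv.py`. `build_ffconcat` is a hypothetical name, and the code assumes image paths contain no single quotes.

```python
def build_ffconcat(entries):
    """Build ffconcat text from (image_path, duration_sec) pairs."""
    lines = ["ffconcat version 1.0"]
    for image_path, duration_sec in entries:
        lines.append(f"file '{image_path}'")
        lines.append(f"duration {duration_sec:.3f}")
    if entries:
        # Repeat the last image: a common workaround, since the concat
        # demuxer may otherwise not honor the final duration entry.
        lines.append(f"file '{entries[-1][0]}'")
    return "\n".join(lines) + "\n"


print(build_ffconcat([("slide-01.png", 15.0), ("slide-02.png", 85.0)]))
```

Each `file` line names an image and the following `duration` line says how long it stays on screen, which is why the per-slide durations in the timing CSV map directly onto this format.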