AI Multimodal

Process audio, images, videos, and documents, and generate images and videos, with Google Gemini's multimodal API.

Setup

export GEMINI_API_KEY="your-key"  # Get from https://aistudio.google.com/apikey
pip install google-genai python-dotenv pillow
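To confirm the key is visible to Python before running any script, a lookup along the following lines may help. This is a minimal sketch, not the skill's own resolver: the function name `resolve_api_key` is hypothetical, and it reads only the `GEMINI_API_KEY` environment variable (it does not load `.env` files).

```python
import os


def resolve_api_key(env=None):
    """Return the Gemini API key from the environment, or raise with a hint.

    `env` defaults to os.environ; passing a dict makes the function easy to test.
    """
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "GEMINI_API_KEY is not set; export it before running the scripts."
        )
    return key
```

Calling `resolve_api_key()` with no arguments checks the current process environment.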

Quick Start

Verify setup: python scripts/check_setup.py

Analyze media: python scripts/gemini_batch_process.py --files <file> --task <analyze|transcribe|extract>

TIP: When asked to analyze an image, first check whether the gemini command is available. If it is, run echo "<prompt to analyze image>" | gemini -y -m gemini-2.5-flash. If it is not, run python scripts/gemini_batch_process.py --files <file> --task analyze.

Generate content: python scripts/gemini_batch_process.py --task <generate|generate-video> --prompt "description"
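The fallback described in the TIP (prefer the gemini command, otherwise use the batch script) can be sketched as follows. The helper name `build_analyze_command` and its return shape are assumptions made for illustration. Only the two command lines themselves come from this document.

```python
import shutil


def build_analyze_command(prompt, image_path, which=shutil.which):
    """Return (argv, stdin_text) for analyzing an image.

    If the gemini CLI is on PATH, the prompt is meant to be piped to it on
    stdin; otherwise fall back to the batch script with --files.
    `which` is injectable so the choice can be tested without the CLI installed.
    """
    if which("gemini") is not None:
        return ["gemini", "-y", "-m", "gemini-2.5-flash"], prompt
    argv = [
        "python", "scripts/gemini_batch_process.py",
        "--files", image_path,
        "--task", "analyze",
    ]
    return argv, None
```

The result could then be run with `subprocess.run(argv, input=stdin_text, text=True)`.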

Stdin support: You can pipe files directly via stdin (auto-detects PNG/JPG/PDF/WAV/MP3).

cat image.png | python scripts/gemini_batch_process.py --task analyze --prompt "Describe this"
python scripts/gemini_batch_process.py --files image.png --task analyze  # traditional
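Auto-detection of piped input is commonly done by inspecting the first few bytes (the file signature). The sketch below shows one way this might work for the five formats listed. It is not taken from gemini_batch_process.py: the function name `sniff_mime` and the returned MIME strings are assumptions.

```python
def sniff_mime(data: bytes):
    """Guess a MIME type from leading magic bytes; return None if unknown."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith(b"%PDF-"):
        return "application/pdf"
    # WAV is a RIFF container whose form type at offset 8 is "WAVE".
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "audio/wav"
    # MP3 either starts with an ID3v2 tag or with an MPEG frame sync
    # (11 set bits: 0xFF followed by a byte whose top three bits are set).
    if data.startswith(b"ID3") or (
        len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0
    ):
        return "audio/mp3"  # assumption; "audio/mpeg" is the IANA-registered name
    return None
```

With piped input, the bytes could be read via `sys.stdin.buffer.read()` and passed to `sniff_mime` before deciding how to send them.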

Models

Image generation: imagen-4.0-generate-001 (standard), imagen-4.0-ultra-generate-001 (quality), imagen-4.0-fast-generate-001 (speed)
Video generation: veo-3.1-generate-preview (8s clips with audio)
Analysis: gemini-2.5-flash (recommended), gemini-2.5-pro (advanced)
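One way to picture per-task model selection is a simple lookup table. The mapping below is an illustrative assumption built from the model names listed above, not the script's confirmed defaults. In particular, using gemini-2.5-flash for transcribe and extract is a guess, and `pick_model` is a hypothetical name.

```python
# Hypothetical task-to-model defaults, assembled from the model list above.
DEFAULT_MODELS = {
    "analyze": "gemini-2.5-flash",
    "transcribe": "gemini-2.5-flash",   # assumption
    "extract": "gemini-2.5-flash",      # assumption
    "generate": "imagen-4.0-generate-001",
    "generate-video": "veo-3.1-generate-preview",
}


def pick_model(task, override=None):
    """Return the explicit override if given, else the default for the task."""
    if override:
        return override
    try:
        return DEFAULT_MODELS[task]
    except KeyError:
        raise ValueError(f"Unknown task: {task!r}") from None
```

An explicit override always wins, so a quality-focused run could call `pick_model("generate", "imagen-4.0-ultra-generate-001")`.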

Scripts

gemini_batch_process.py: CLI orchestrator for transcribe|analyze|extract|generate|generate-video. It auto-resolves API keys, picks sensible default models per task, sends each file inline or through the File API, and saves structured outputs (text/JSON/CSV/markdown plus generated assets) for Imagen 4 and Veo workflows.
media_optimizer.py: ffmpeg/Pillow-based preflight tool that compresses, resizes, and converts audio, image, and video inputs, enforces target sizes and bitrates, splits long clips into hour-long chunks, and batch-processes directories so media stays within Gemini limits.
document_converter.py: Gemini-powered converter that uploads PDFs, images, and Office docs, applies a markdown-preserving prompt, batches multiple files, auto-names outputs under docs/assets, and exposes CLI flags for model, prompt, auto-file naming, and verbose logging.
check_setup.py: Interactive readiness checker that verifies the directory layout, the centralized env resolver, required Python dependencies, and GEMINI_API_KEY availability and format, then performs a live Gemini API call and prints remediation instructions if anything fails.

Use --help for options.
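As an illustration of the hour-chunk splitting mentioned for media_optimizer.py, the boundary arithmetic might look like the sketch below. The function name `hour_chunks` and the 3600-second default are assumptions. The real script may choose boundaries differently.

```python
def hour_chunks(duration_s, chunk_s=3600):
    """Return (start, end) second offsets covering [0, duration_s).

    Every chunk is chunk_s long except possibly the last, which is shorter.
    """
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    chunks = []
    start = 0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks
```

For a clip of 2.5 hours (9000 seconds) this yields three chunks, the last one 30 minutes long. Each pair could then be handed to ffmpeg as `-ss` and `-to` offsets.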

References

Load for detailed guidance:

| Topic | File | Description |
|---|---|---|
| Audio | references/audio-processing.md | Audio formats and limits, transcription (timestamps, speakers, segments), non-speech analysis, File API vs inline input, TTS models, best practices, cost and token math, and concrete meeting/podcast/interview recipes. |
| Images | references/vision-understanding.md | Vision capabilities overview, supported formats and models, captioning/classification/VQA, detection and segmentation, OCR and document reading, multi-image workflows, structured JSON output, token costs, best practices, and common product/screenshot/chart/scene use cases. |
| Image Gen | references/image-generation.md | Imagen 4 and Gemini image model overview, generate_images vs generate_content APIs, aspect ratios and costs, text/image/both modalities, editing and composition, style and quality control, safety settings, best practices, troubleshooting, and common marketing/concept-art/UI scenarios. |
| Video | references/video-analysis.md | Video analysis capabilities and supported formats, model/context choices, local/inline/YouTube inputs, clipping and FPS control, multi-video comparison, temporal Q&A and scene detection, transcription with visual context, token and cost guidance, and optimization/best-practice patterns. |
| Video Gen | references/video-generation.md | Veo model matrix, text-to-video and image-to-video quick start, multi-reference and extension flows, camera and timing control, configuration (resolution, aspect, audio, safety), prompt design patterns, performance tips, limitations, troubleshooting, and cost estimates. |

Limits

Formats: Audio (WAV/MP3/AAC, up to 9.5 hours), Images (PNG/JPEG/WEBP, up to 3,600 images per request), Video (MP4/MOV, up to 6 hours), PDF (up to 1,000 pages)
Size: 20 MB inline, 2 GB via the File API
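The size limits suggest a simple routing rule: send small files inline and larger ones through the File API. The sketch below assumes the limits are measured in binary megabytes and gigabytes and apply per file. The names `choose_transport`, `INLINE_LIMIT_BYTES`, and `FILE_API_LIMIT_BYTES` are hypothetical.

```python
INLINE_LIMIT_BYTES = 20 * 1024 * 1024   # 20 MB inline (assumed binary units)
FILE_API_LIMIT_BYTES = 2 * 1024 ** 3    # 2 GB via the File API (assumed binary units)


def choose_transport(size_bytes):
    """Return "inline" or "file_api" for a file of the given size.

    Raises ValueError when the file exceeds the File API limit and would
    need to be compressed or split first.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes <= INLINE_LIMIT_BYTES:
        return "inline"
    if size_bytes <= FILE_API_LIMIT_BYTES:
        return "file_api"
    raise ValueError("File exceeds the 2 GB File API limit; compress or split it first.")
```

In practice the size could come from `os.path.getsize(path)`. Since the inline limit may cover the whole request rather than a single file, leaving some headroom below 20 MB is a reasonable precaution.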

Resources

API Docs
Pricing
