Grok Imagine Extended (xAI Image & Video Generation)

Generate images and videos from text prompts using the xAI API.

Image Generation

python3 {baseDir}/scripts/generate_image.py --prompt "your image description" --filename "output.png"

With options:

python3 {baseDir}/scripts/generate_image.py --prompt "a cyberpunk city at night" --filename "city.png" --resolution 2k --aspect-ratio 16:9

Image Editing

Single source image:

python3 {baseDir}/scripts/generate_image.py --prompt "make it a watercolor painting" --filename "edited.png" -i "/path/to/source.jpg"

Multiple source images (up to 3):

python3 {baseDir}/scripts/generate_image.py --prompt "combine into one scene" --filename "combined.png" -i img1.png -i img2.png

Video Generation

Text-to-video:

python3 {baseDir}/scripts/generate_image.py --prompt "a cat walking through flowers" --filename "cat.mp4" --video --duration 5

Image-to-video (animate a still):

python3 {baseDir}/scripts/generate_image.py --prompt "add gentle camera zoom and wind" --filename "animated.mp4" --video -i photo.jpg --duration 5

To choose a different image model, pass --model, for example --model grok-imagine-image-pro (the default is grok-imagine-image). Video mode always uses grok-imagine-video.
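The invocations above can also be assembled programmatically. The following is a minimal sketch, not part of the skill itself: `build_command` and its parameter names are hypothetical, `base_dir` stands in for the `{baseDir}` placeholder, and only flags that appear in this document are used.

```python
import shlex


def build_command(base_dir, prompt, filename, *, inputs=(), video=False,
                  duration=None, model=None, resolution=None, aspect_ratio=None):
    """Return the argv list for one generate_image.py invocation.

    Hypothetical helper: base_dir replaces the {baseDir} placeholder.
    """
    cmd = ["python3", f"{base_dir}/scripts/generate_image.py",
           "--prompt", prompt, "--filename", filename]
    for image in inputs:
        cmd += ["-i", str(image)]          # -i is repeatable
    if video:
        cmd.append("--video")
        if duration is not None:
            cmd += ["--duration", str(duration)]
    elif model is not None:
        # --model selects the image model; it is skipped in video mode,
        # which always uses grok-imagine-video
        cmd += ["--model", model]
    if resolution is not None:
        cmd += ["--resolution", resolution]
    if aspect_ratio is not None:
        cmd += ["--aspect-ratio", aspect_ratio]
    return cmd


if __name__ == "__main__":
    example = build_command("/path/to/skill", "a cat walking through flowers",
                            "cat.mp4", video=True, duration=5)
    # Print a shell-safe rendering instead of running it
    print(shlex.join(example))
```

Passing the returned list to `subprocess.run(cmd, check=True)` avoids shell quoting problems with prompts that contain spaces or quotes.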

Options

Flag	Description
`--prompt`, `-p`	Text description (required)
`--filename`, `-f`	Output path (required)
`-i`	Input image for editing or animation (repeatable; up to 3 in image mode, 1 in video mode)
`--model`, `-m`	Image model (default: grok-imagine-image)
`--aspect-ratio`, `-a`	Aspect ratio: 1:1, 16:9, 9:16, 4:3, 3:4, etc.
`--resolution`, `-r`	Image: 1k or 2k. Video: 480p or 720p
`--n`	Number of images to generate, 1-10 (default: 1)
`--video`	Generate a video instead of an image
`--duration`, `-d`	Video duration in seconds, 1-15 (default: 5)
`--api-key`, `-k`	API key; overrides XAI_API_KEY
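The limits in the table can be checked before the script is called. This is a sketch under assumptions: `validate_options` is a hypothetical helper, and how the real script reacts to out-of-range values is not documented here, so the function only reports problems rather than mirroring the script's behavior.

```python
def validate_options(*, video=False, inputs=(), n=1, duration=5, resolution=None):
    """Return a list of problems; an empty list means the options fit the table.

    Hypothetical pre-flight check based only on the limits listed above.
    """
    problems = []

    # -i: up to 3 images in image mode, 1 in video mode
    max_inputs = 1 if video else 3
    if len(inputs) > max_inputs:
        problems.append(f"too many input images: {len(inputs)} > {max_inputs}")

    if video:
        # --duration: 1-15 seconds
        if not 1 <= duration <= 15:
            problems.append(f"duration out of range 1-15: {duration}")
        # --resolution: 480p or 720p for video
        if resolution is not None and resolution not in ("480p", "720p"):
            problems.append(f"video resolution must be 480p or 720p: {resolution}")
    else:
        # --n: 1-10 images
        if not 1 <= n <= 10:
            problems.append(f"n out of range 1-10: {n}")
        # --resolution: 1k or 2k for images
        if resolution is not None and resolution not in ("1k", "2k"):
            problems.append(f"image resolution must be 1k or 2k: {resolution}")

    return problems


if __name__ == "__main__":
    for problem in validate_options(video=True, inputs=["a.jpg", "b.jpg"], duration=20):
        print(problem)
```

Returning a list rather than raising on the first error lets a caller report every problem in one pass.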

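API Key

Provide the xAI API key in one of these ways:
Set the XAI_API_KEY environment variable.
Set skills."grok-imagine".apiKey or skills."grok-imagine".env.XAI_API_KEY in ~/.openclaw/openclaw.json.
Store it in ~/keys.txt, which the script reads automatically.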
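A key lookup across these sources might look like the sketch below. Several details are assumptions rather than documented behavior: the precedence order (flag, then environment, then config, then keys.txt), that openclaw.json parses as plain JSON, and that ~/keys.txt holds `NAME=value` lines. `resolve_api_key` is a hypothetical name.

```python
import json
import os
from pathlib import Path


def resolve_api_key(cli_key=None, env=None, home=None):
    """Return the first API key found, or None.

    Assumed order: --api-key value, XAI_API_KEY, openclaw.json, keys.txt.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)

    if cli_key:
        return cli_key
    if env.get("XAI_API_KEY"):
        return env["XAI_API_KEY"]

    # skills."grok-imagine".apiKey or skills."grok-imagine".env.XAI_API_KEY
    config_path = home / ".openclaw" / "openclaw.json"
    if config_path.is_file():
        try:
            config = json.loads(config_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            config = {}  # assumption: skip a config that is not plain JSON
        if isinstance(config, dict):
            skill = config.get("skills", {}).get("grok-imagine", {})
            key = skill.get("apiKey") or skill.get("env", {}).get("XAI_API_KEY")
            if key:
                return key

    # Assumed keys.txt format: one NAME=value pair per line
    keys_path = home / "keys.txt"
    if keys_path.is_file():
        for line in keys_path.read_text(encoding="utf-8").splitlines():
            name, sep, value = line.partition("=")
            if sep and name.strip() == "XAI_API_KEY" and value.strip():
                return value.strip()

    return None


if __name__ == "__main__":
    print("key found" if resolve_api_key() else "no key found")
```

The `env` and `home` parameters exist only so the lookup can be exercised without touching the real environment or home directory.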

Notes

Use timestamps in filenames, for example 2026-03-01-cyberpunk-city.png.
The script prints a MEDIA: line, which OpenClaw uses to auto-attach the file on supported chat providers.
Do not read the image back; report only the saved path.
Image URLs returned by xAI are temporary, so the script downloads them immediately.
Video generation is asynchronous: the script polls until the job finishes, which can take 1-5 minutes.
2k resolution returns PNG; 1k returns JPEG.
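Two of the points above lend themselves to a short sketch: building a timestamped filename whose extension matches the returned format, and the general shape of a poll-until-done loop. Both helpers are hypothetical. Matching the extension to the resolution is a suggestion rather than something the script requires, and the polling loop illustrates the pattern only; it is not the script's actual implementation and makes no xAI API calls.

```python
import time
from datetime import date


def timestamped_filename(slug, resolution="1k", video=False, today=None):
    """Build a name like 2026-03-01-cyberpunk-city.png.

    Extension choice follows the note above: 2k returns PNG, 1k returns JPEG.
    """
    today = date.today() if today is None else today
    if video:
        ext = "mp4"
    else:
        ext = "png" if resolution == "2k" else "jpg"
    return f"{today.isoformat()}-{slug}.{ext}"


def poll_until_done(check, interval=5.0, timeout=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns a non-None result or the timeout passes.

    check is a caller-supplied function (hypothetical) that returns None
    while the job is still running and the finished result otherwise.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result is not None:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"job not finished after {timeout} seconds")
        sleep(interval)


if __name__ == "__main__":
    print(timestamped_filename("cyberpunk-city", resolution="2k"))
```

A timeout well above the 1-5 minute range mentioned above leaves room for slow jobs while still bounding the wait.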