image-to-video-runcomfy

Image-to-video generation on RunComfy. This image-to-video skill turns any still image into a short video clip via the RunComfy Model API. The image-to-video pipeline supports portrait animation, product reveal, scene motion, and synchronized-audio image-to-video output. Calls the right image-to-video endpoint for the user's intent (general image-to-video, lip-sync image-to-video, multi-modal image-to-video) through `runcomfy run <model>/image-to-video`. Triggers on "image to video", "image-to-video", "i2v", "animate image", "image2video", "make a video from image", "still to video", "still-to-video", or any explicit ask for image-to-video conversion.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to install:

Install skill "image-to-video-runcomfy" with this command: npx skills add kalvinrv/image-to-video-runcomfy

🫧 Image-to-Video — Pro Pack on RunComfy

runcomfy.com · docs · Image-to-video models

Image-to-video generation on RunComfy. This skill is the canonical image-to-video entry point for the RunComfy Model API: give it a still image and a motion description, and it returns a short video clip. Image-to-video on RunComfy means turning any image — portrait, product photo, environment, illustration — into a video, with the motion driven by your prompt.

What "image-to-video" means here

Image-to-video (often abbreviated i2v or image2video) is the task of generating a short video starting from a single still image. The image fixes the look — face, wardrobe, product, scene geometry — and the prompt drives the motion. Image-to-video is distinct from text-to-video (no input image) and from video-to-video (which transforms an existing clip).

Image-to-video on RunComfy supports three patterns:

  • General image-to-video: animate any still — portrait drift, product reveal, environment motion, illustration coming alive. The default image-to-video pipeline.
  • Lip-sync image-to-video: a custom voiceover drives mouth movement on a generated talking-head image-to-video clip. Input: image + audio. Output: lip-synced image-to-video.
  • Multi-modal image-to-video: combine subject image + reference scene video + reference voice audio into one image-to-video output.

This skill picks the right image-to-video endpoint for the user's intent and calls runcomfy run <model>/image-to-video with the matching schema.

When to use image-to-video on RunComfy

Pick image-to-video on RunComfy whenever:

  • You have a still image and want it to move — image-to-video is the right task.
  • You want identity-stable image-to-video — the face / product / brand from your input image must survive into the output video.
  • You want fast iteration on image-to-video — RunComfy hosts the GPU; you don't deploy or rent.
  • You're building image-to-video at scale — multi-language image-to-video dubs, multi-shot image-to-video sequences, batch image-to-video jobs.

If the user said "image to video", "i2v", "animate this image", "image2video", "make a video from this", or showed an image and asked for video — route here.

Image-to-video routes

| User intent | Image-to-video model | Why |
| --- | --- | --- |
| Default image-to-video — portraits, products, environments | happyhorse-1-0/image-to-video | #1 on Arena (Elo 1392 i2v); strong identity preservation; native synchronized audio in image-to-video output |
| Image-to-video with custom voiceover lip-sync | wan-ai/wan-2-7/text-to-video + audio_url | Drives lip-sync on the image-to-video frame from your audio file |
| Multi-modal image-to-video (image + ref video + ref audio) | bytedance/seedance-v2/pro | Multi-input image-to-video with up to 9 image refs and 3 audio refs |

The agent reads this table, classifies the user's image-to-video intent, and picks the matching endpoint.
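
A hedged sketch of that routing step, assuming a POSIX shell; the $INTENT, $INPUT_JSON, and $OUT_DIR variables are placeholders for whatever classification and inputs the agent has already produced:

# $INTENT, $INPUT_JSON, and $OUT_DIR are placeholders, not part of the CLI
case "$INTENT" in
  lipsync)    MODEL="wan-ai/wan-2-7/text-to-video" ;;              # image-to-video with custom voiceover
  multimodal) MODEL="bytedance/seedance-v2/pro" ;;                 # image + ref video + ref audio
  *)          MODEL="happyhorse/happyhorse-1-0/image-to-video" ;;  # default image-to-video
esac

runcomfy run "$MODEL" --input "$INPUT_JSON" --output-dir "$OUT_DIR"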

Prerequisites

  1. RunComfy CLI — npm i -g @runcomfy/cli
  2. RunComfy account — runcomfy login opens a browser device-code flow.
  3. CI / containers — set RUNCOMFY_TOKEN=<token>.
  4. A source image URL — JPEG/PNG/WebP, min 300px, ≤10MB; aspect 1:2.5 to 2.5:1 for the default image-to-video model.
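
A minimal setup sketch covering items 1–3, assuming a POSIX shell; the token value is a placeholder:

# install the CLI and sign in via the browser device-code flow
npm i -g @runcomfy/cli
runcomfy login

# CI / containers: skip the login flow and provide a token directly (placeholder value)
export RUNCOMFY_TOKEN=<token>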

Default image-to-video — HappyHorse 1.0 i2v

The default image-to-video endpoint. Use for any general image-to-video task: portrait drift, product reveal, environment motion, character animation. Image-to-video output includes synchronized audio in the same generation pass.

Schema

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| image_url | string | yes | | The source still for image-to-video. JPEG/PNG/WebP, min 300px, aspect 1:2.5–2.5:1, ≤10MB. |
| prompt | string | yes | | Motion / camera / lighting description for the image-to-video output. ≤5000 chars. |
| resolution | enum | no | 1080P | 720P or 1080P. |
| duration | int | no | 5 | 3–15 seconds per image-to-video clip. |
| seed | int | no | 0 | Reuse for image-to-video variant comparisons. |
| watermark | bool | no | true | Provider watermark on image-to-video output. |

Output aspect of the image-to-video clip equals input image aspect.

Invoke

runcomfy run happyhorse/happyhorse-1-0/image-to-video \
  --input '{
    "image_url": "https://.../portrait.jpg",
    "prompt": "Gentle camera drift around the subject'\''s face, subtle breathing motion, identity-stable features, soft natural light."
  }' \
  --output-dir <absolute/path>
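
The optional fields from the schema can be added to the same call. A hedged variant, with placeholder URLs and output path, that requests a 720P, 8-second clip, pins the seed for later comparisons, and turns the provider watermark off:

runcomfy run happyhorse/happyhorse-1-0/image-to-video \
  --input '{
    "image_url": "https://.../product.jpg",
    "prompt": "Slow orbit around the product, packaging unchanged, studio light with shadows shortening as the camera rises.",
    "resolution": "720P",
    "duration": 8,
    "seed": 42,
    "watermark": false
  }' \
  --output-dir <absolute/path>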

Lip-sync image-to-video — custom voiceover

When the image-to-video output needs to lip-sync to a custom audio track, use Wan 2.7 with audio_url. The image-to-video clip is generated around your voiceover so mouth movement matches.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| prompt | string | yes | Describe the talking-head shot for the image-to-video output. |
| audio_url | string | yes | WAV/MP3, 3–30s, ≤15MB. Drives lip-sync on the image-to-video frame. |
| aspect_ratio | enum | no | 16:9, 9:16, 1:1, 4:3, 3:4. |
| resolution | enum | no | 720p or 1080p. |
| duration | enum | no | 2–15 seconds. Match audio length for clean image-to-video lip-sync. |

runcomfy run wan-ai/wan-2-7/text-to-video \
  --input '{
    "prompt": "Medium close-up, soft key light, locked tripod, shallow DOF.",
    "audio_url": "https://.../voiceover-en.mp3",
    "duration": 12,
    "aspect_ratio": "9:16"
  }' \
  --output-dir <absolute/path>

For multi-language image-to-video dubs: same prompt, swap audio_url per call, lock seed for visual consistency across all image-to-video outputs.
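
A hedged sketch of that loop, assuming a POSIX shell; the language list, audio URLs, seed value, and output directory are placeholders, and the seed field is assumed to be accepted by this route as the tip above implies (it is not listed in the schema table):

# placeholder output directory
OUT_DIR=/abs/path/dubs

for lang in en es ja; do
  runcomfy run wan-ai/wan-2-7/text-to-video \
    --input '{
      "prompt": "Medium close-up, soft key light, locked tripod, shallow DOF.",
      "audio_url": "https://.../voiceover-'"$lang"'.mp3",
      "duration": 12,
      "aspect_ratio": "9:16",
      "seed": 7
    }' \
    --output-dir "$OUT_DIR/$lang"
done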

Multi-modal image-to-video — image + ref video + ref audio

When the image-to-video output should fuse a subject image with a scene reference and voice reference, use Seedance 2.0 Pro. Multi-modal image-to-video accepts up to 9 image refs.

| Field | Type | Required | Notes |
| --- | --- | --- | --- |
| prompt | string | yes | Description for the image-to-video output. EN ≤1000 words. |
| image_url | array | yes | 0–9 source images for image-to-video. First is the primary subject. |
| video_url | array | no | 0–3 reference clips (2–15s each) for image-to-video scene cues. |
| audio_url | array | no | 0–3 reference audio (2–15s, <15MB each) for image-to-video voice cues. |
| duration | int | no | 4–15 seconds. |
| resolution | enum | no | 480p or 720p. |

runcomfy run bytedance/seedance-v2/pro \
  --input '{
    "prompt": "Subject from image 1 walks through the scene from video 1, voice from audio 1.",
    "image_url": ["https://.../subject.jpg"],
    "video_url": ["https://.../scene.mp4"],
    "audio_url": ["https://.../voice.mp3"],
    "duration": 8
  }' \
  --output-dir <absolute/path>
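
Multiple image refs go into the same array; a hedged variant with placeholder URLs that adds a second image ref and pins the resolution:

runcomfy run bytedance/seedance-v2/pro \
  --input '{
    "prompt": "Subject from image 1, wearing the outfit from image 2, walks through the scene from video 1.",
    "image_url": ["https://.../subject.jpg", "https://.../outfit.jpg"],
    "video_url": ["https://.../scene.mp4"],
    "duration": 8,
    "resolution": "720p"
  }' \
  --output-dir <absolute/path>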

Prompting image-to-video — what works

Image-to-video prompts behave differently from text-to-video prompts. The image already fixes the look — your prompt should drive motion, not redescribe the image.

  • Lead with motion verbs. "drift", "dolly in", "orbit", "tilt up", "blink", "breathe" — front-load what's MOVING in the image-to-video output.
  • Don't restate the image. The image-to-video model sees the input. Spend tokens on what changes, not what already exists.
  • Make preservation goals explicit. "identity-stable features", "packaging unchanged", "background geometry stable" — tell the image-to-video model what NOT to change.
  • One beat per image-to-video clip. Single primary motion (orbit OR dolly OR tilt OR character action). Compound motion drifts.
  • Lighting evolution. "rim light intensifying", "shadows shortening as camera rises" — image-to-video output reads lighting cues well.
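
Putting those rules together, a hedged example for the default route (the image URL and output path are placeholders): one camera beat, explicit preservation goals, a lighting cue, and nothing that redescribes the image:

runcomfy run happyhorse/happyhorse-1-0/image-to-video \
  --input '{
    "image_url": "https://.../portrait.jpg",
    "prompt": "Slow dolly in toward the subject, subtle breathing and a single blink, identity-stable features, background geometry stable, rim light intensifying."
  }' \
  --output-dir <absolute/path>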

Image-to-video FAQ

What's the max duration of an image-to-video clip? 15 seconds across all image-to-video routes here. For longer image-to-video sequences, generate multiple clips and stitch.
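
One way to stitch is ffmpeg's concat demuxer; a minimal sketch, assuming the clips share codec, resolution, and frame rate (file names are placeholders):

# list the clips in playback order in the concat demuxer's input format
printf "file '%s'\n" clip1.mp4 clip2.mp4 clip3.mp4 > clips.txt

# join without re-encoding (requires matching codec / resolution / frame rate)
ffmpeg -f concat -safe 0 -i clips.txt -c copy stitched.mp4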

What image formats does image-to-video accept? JPEG, PNG, WebP. Min 300px, ≤10MB, aspect 1:2.5 to 2.5:1.

Does image-to-video preserve face identity? Yes — the default image-to-video model has strong identity preservation. For best identity hold, the face should fill at least 5% of the frame in the input image.

Can image-to-video include audio? Yes. The default image-to-video model generates synchronized audio in the same pass. The lip-sync image-to-video route accepts your custom audio. The multi-modal image-to-video route accepts reference audio.

Image-to-video vs text-to-video on RunComfy? Image-to-video starts from your image (look fixed). Text-to-video starts from your prompt only (look generated). Use image-to-video when you have an exact reference; use text-to-video for novel content.

Image-to-video output resolution? 720p or 1080p depending on the route.

Limitations

  • Image-to-video clip length is 15s per call. Longer image-to-video output requires stitching multiple calls.
  • Image-to-video output aspect = input image aspect on the default route. For independent reframing, crop the input first.
  • Image-to-video doesn't blend across routes in one call. If you need multi-modal image-to-video + custom voiceover lip-sync in one clip, that's two image-to-video calls plus a stitch.

Exit codes

| code | meaning |
| --- | --- |
| 0 | image-to-video succeeded |
| 64 | bad CLI args |
| 65 | bad input JSON for image-to-video / schema mismatch |
| 69 | upstream 5xx |
| 75 | retryable: timeout / 429 |
| 77 | not signed in or token rejected |

Full reference: docs.runcomfy.com/cli/troubleshooting.
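
A hedged retry sketch built on those codes, assuming a POSIX shell; $INPUT_JSON and $OUT_DIR are placeholders. It retries only on exit code 75 (timeout / 429) with a growing back-off and surfaces every other code unchanged:

for attempt in 1 2 3; do
  runcomfy run happyhorse/happyhorse-1-0/image-to-video \
    --input "$INPUT_JSON" \
    --output-dir "$OUT_DIR"
  code=$?
  # 0 = success; 75 = retryable; anything else is a hard failure
  [ "$code" -ne 75 ] && break
  sleep $((attempt * 10))
done
exit "$code"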

How it works

The skill picks one of three image-to-video endpoints based on user intent (general image-to-video, lip-sync image-to-video, or multi-modal image-to-video) and invokes runcomfy run <endpoint> with the matching JSON body. The CLI POSTs to the RunComfy Model API, polls the image-to-video request status every 2 seconds, and downloads the resulting image-to-video file from the *.runcomfy.net / *.runcomfy.com URL into --output-dir. Ctrl-C cancels the in-flight image-to-video request.
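
For orientation only, a rough sketch of that submit / poll / download loop, assuming a POSIX shell with curl and jq; every endpoint path, header, and JSON field name below is a hypothetical placeholder rather than the documented RunComfy Model API (only the model-api.runcomfy.net host and the 2-second poll interval come from this page):

# HYPOTHETICAL paths and field names -- illustrates the flow, not the real API
REQ_ID=$(curl -s -X POST "https://model-api.runcomfy.net/<submit-path>" \
  -H "Authorization: Bearer $RUNCOMFY_TOKEN" \
  --data @input.json | jq -r '.request_id')          # field name assumed

while true; do
  STATUS=$(curl -s "https://model-api.runcomfy.net/<status-path>/$REQ_ID" \
    -H "Authorization: Bearer $RUNCOMFY_TOKEN" | jq -r '.status')   # field name assumed
  [ "$STATUS" = "succeeded" ] && break
  sleep 2   # the CLI polls every 2 seconds
done

# result URL lives on *.runcomfy.net / *.runcomfy.com; field name assumed
curl -sL -o video.mp4 "$(curl -s "https://model-api.runcomfy.net/<status-path>/$REQ_ID" \
  -H "Authorization: Bearer $RUNCOMFY_TOKEN" | jq -r '.output_url')"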

Security & Privacy

  • Token storage: runcomfy login writes the API token to ~/.config/runcomfy/token.json with mode 0600. Set RUNCOMFY_TOKEN env var to bypass the file in CI.
  • Input boundary: the image-to-video prompt is passed as JSON via --input. The CLI does NOT shell-expand. No shell-injection surface.
  • Third-party content: image / video / audio URLs are fetched by the RunComfy server. Treat external URLs as untrusted; image-based prompt injection is a known risk for any image-to-video model.
  • Outbound endpoints: only model-api.runcomfy.net and *.runcomfy.net / *.runcomfy.com. No telemetry.
  • Generated-file size cap: the CLI aborts any image-to-video download > 2 GiB.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
