ai-vision

Multimodal UI understanding and single-step planning via OpenAI-compatible Responses APIs. Use it when you need AIQuery, AIAssert, or plan-next to extract UI element coordinates, validate UI assertions, summarize screenshots, or decide the next UI action from an image. Execution (via adb/hdc) and multi-step loops are handled by external agents. Defaults to Doubao models but can be pointed at other multimodal providers via base URL, API key, and model name.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install the "ai-vision" skill with:

npx skills add httprunner/skills/httprunner-skills-ai-vision

AI Vision

Overview

This skill provides a standalone CLI that calls multimodal models for UI querying, assertion, and single-step planning. It does not depend on device type: you supply a screenshot and receive structured output (coordinates, decisions, or next actions). Execution and multi-step loops are handled externally by agents using adb/hdc or other drivers. Store screenshots in ~/.eval/screenshots/ and add a timestamp to each filename so that later captures do not overwrite earlier ones.
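The timestamped-filename convention could be sketched as follows. This is a minimal illustration, not part of the skill: the helper name `screenshotPath` is hypothetical, and the `ui_` prefix and `YYYYMMDD_HHMMSS` layout simply mirror the example paths used later in this document.

```typescript
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical helper: build a path such as ~/.eval/screenshots/ui_20250102_030405.png.
// The prefix and timestamp layout follow the example paths in this document.
function screenshotPath(now: Date = new Date(), prefix = "ui"): string {
  const pad = (n: number): string => String(n).padStart(2, "0");
  const date = `${now.getFullYear()}${pad(now.getMonth() + 1)}${pad(now.getDate())}`;
  const time = `${pad(now.getHours())}${pad(now.getMinutes())}${pad(now.getSeconds())}`;
  // os.homedir() stands in for "~", which only a shell would expand.
  return path.join(os.homedir(), ".eval", "screenshots", `${prefix}_${date}_${time}.png`);
}

console.log(screenshotPath());
```

Second-level resolution is enough for most capture loops; if an agent takes several screenshots per second, a counter or millisecond suffix would be needed to keep names unique.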

Path Convention

Canonical install and execution directory: ~/.agents/skills/ai-vision/. Run commands from this directory:

cd ~/.agents/skills/ai-vision

One-off invocation (the subshell leaves the caller's working directory unchanged, so it is safe in scripts and loops from any directory):

(cd ~/.agents/skills/ai-vision && npx tsx scripts/ai_vision.ts --help)

Model Configuration

Default Doubao configuration via environment variables:

  • ARK_BASE_URL (e.g. https://ark.cn-beijing.volces.com/api/v3)
  • ARK_API_KEY
  • ARK_MODEL_NAME

For non-Doubao providers, pass explicit flags:

  • --base-url, --api-key, --model

Default model if none provided: doubao-seed-1-6-vision-250815.
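The resolution order described above might look like the sketch below. It is an illustration under stated assumptions, not the script's actual code: the type and function names are hypothetical, and the rule that explicit flags take precedence over the ARK_* variables is an assumption inferred from the text rather than confirmed behavior.

```typescript
// Hypothetical shapes for the three settings named in this section.
interface ModelFlags {
  baseUrl?: string;
  apiKey?: string;
  model?: string;
}

interface ModelConfig {
  baseUrl?: string;
  apiKey?: string;
  model: string;
}

// Default model name as stated in this document.
const DEFAULT_MODEL = "doubao-seed-1-6-vision-250815";

// Assumption: --base-url/--api-key/--model win over ARK_* variables,
// and the documented default model is used when neither is set.
function resolveModelConfig(
  flags: ModelFlags,
  env: Record<string, string | undefined>,
): ModelConfig {
  return {
    baseUrl: flags.baseUrl ?? env["ARK_BASE_URL"],
    apiKey: flags.apiKey ?? env["ARK_API_KEY"],
    model: flags.model ?? env["ARK_MODEL_NAME"] ?? DEFAULT_MODEL,
  };
}

// Example: only the base URL comes from the environment; the model falls back to the default.
const sample = resolveModelConfig({}, { ARK_BASE_URL: "https://ark.cn-beijing.volces.com/api/v3" });
console.log(sample.model);
```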

Script

Path: scripts/ai_vision.ts

Run with:

npx tsx scripts/ai_vision.ts --help

Log level (for troubleshooting raw model response):

npx tsx scripts/ai_vision.ts --log-level debug <command> [flags]

Output formatting:

  • When --log-json is set, logs are emitted as JSON.
  • Otherwise, the final result is pretty-printed JSON, and logs are colorized when a TTY is available.

AIQuery

npx tsx scripts/ai_vision.ts query \
  --screenshot ~/.eval/screenshots/ui_YYYYMMDD_HHMMSS.png \
  --prompt "Identify the 'Search' button on the screen and return its coordinates"

AIAssert

npx tsx scripts/ai_vision.ts assert \
  --screenshot ~/.eval/screenshots/ui_YYYYMMDD_HHMMSS.png \
  --prompt "The current page contains a search box"

plan-next (single-step planning)

npx tsx scripts/ai_vision.ts plan-next \
  --screenshot ~/.eval/screenshots/ui_YYYYMMDD_HHMMSS.png \
  --prompt "Tap the magnifier icon to open the search page"
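An external agent could wrap the three subcommands above along these lines. This is a rough sketch: `buildVisionArgs` and `runVision` are hypothetical names, only the `--screenshot` and `--prompt` flags shown in this document are used, and the assumption that the final result arrives on stdout is not confirmed here, so the output is returned as raw text rather than parsed.

```typescript
import { execFileSync } from "node:child_process";

type VisionCommand = "query" | "assert" | "plan-next";

// Build the argument list passed to `npx`, using only flags shown in this document.
function buildVisionArgs(command: VisionCommand, screenshot: string, prompt: string): string[] {
  return ["tsx", "scripts/ai_vision.ts", command, "--screenshot", screenshot, "--prompt", prompt];
}

// Run the CLI from the skill directory and return whatever it writes to stdout.
// Assumption: stdout carries the final result; its JSON schema is not specified
// in this document, so the text is returned unparsed.
function runVision(
  skillDir: string,
  command: VisionCommand,
  screenshot: string,
  prompt: string,
): string {
  return execFileSync("npx", buildVisionArgs(command, screenshot, prompt), {
    cwd: skillDir,
    encoding: "utf8",
  });
}

// Print the command line that would be executed, without calling the model.
console.log(["npx", ...buildVisionArgs("plan-next", "/tmp/ui.png", "Tap the magnifier icon")].join(" "));
```

Because `execFileSync` does not go through a shell, `~` is not expanded; pass absolute paths for both `skillDir` and the screenshot.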

Output Notes

  • plan-next returns a normalized next action with absolute pixel coordinates.
  • If the model outputs relative coordinates on a 1000x1000 grid, the script scales them to screen pixels.
  • Combine with adb/hdc actions (e.g., adb shell input tap X Y) for device control.
  • Use --log-level debug to print the raw model response for troubleshooting.
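The scaling and tap steps in the notes above can be illustrated as follows. plan-next already returns absolute pixels, so this sketch only shows the kind of arithmetic involved; the function names are hypothetical, and the rounding and clamping choices are assumptions rather than the script's documented behavior.

```typescript
interface Point {
  x: number;
  y: number;
}

// Map a point on the model's 1000x1000 relative grid to absolute screen pixels.
// Assumption: linear scaling, rounded to the nearest pixel and clamped to the screen.
function scaleToScreen(rel: Point, screenWidth: number, screenHeight: number, grid = 1000): Point {
  const clamp = (v: number, max: number): number => Math.min(Math.max(v, 0), max);
  return {
    x: clamp(Math.round((rel.x / grid) * screenWidth), screenWidth - 1),
    y: clamp(Math.round((rel.y / grid) * screenHeight), screenHeight - 1),
  };
}

// Format the adb tap command mentioned above for an absolute pixel position.
function adbTapCommand(p: Point): string {
  return `adb shell input tap ${p.x} ${p.y}`;
}

// A relative point (500, 250) on a 1080x2400 screen.
const tap = scaleToScreen({ x: 500, y: 250 }, 1080, 2400);
console.log(adbTapCommand(tap)); // adb shell input tap 540 600
```

Clamping keeps a model output of exactly 1000 inside the screen (for example x = 1079 on a 1080-pixel-wide display) instead of one pixel past the edge.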

Default Models (Doubao)

  • doubao-seed-1-8-251228
  • doubao-seed-1-6-vision-250815

References

  • references/doubao-api.md
