UI Element Ops

Parse one or more screenshots into a machine-readable JSON schema with:

type (normalized UI element type)
bbox_px and bbox_norm
text (OCR/caption content when available)
clickable flag
optional overlay image with labeled boxes
desktop actions via scripts/operate_ui.py (click/type/key/hotkey/screenshot)
element query and orchestration via scripts/operate_ui.py (find, wait)
coordinate calibration profile for multi-display/DPI/window offset (calibrate)

Quick Start

skills/ui-element-ops/scripts/bootstrap_omniparser_env.sh "$PWD"

skills/ui-element-ops/scripts/run_parse_ui.sh /abs/path/to/1.jpeg

skills/ui-element-ops/scripts/capture_and_parse.sh

Confirm screenshot path and desired output path.
Run scripts/bootstrap_omniparser_env.sh when .venv or OmniParser weights are missing.
Run scripts/run_parse_ui.sh for standard parsing.
Report absolute output paths and summary counts: total, clickable, by_type.
Call out obvious quality risks for tiny text or dense icon layouts.
Execute desktop actions when requested:
- list elements: python3 skills/ui-element-ops/scripts/operate_ui.py list --elements <json>
- find elements: python3 skills/ui-element-ops/scripts/operate_ui.py find --elements <json> --type button --text-contains login
- wait for appear/disappear: python3 skills/ui-element-ops/scripts/operate_ui.py wait --elements <json> --state appear --text-contains continue
- click by id: python3 skills/ui-element-ops/scripts/operate_ui.py click --elements <json> --id e_0001
- screenshot: python3 skills/ui-element-ops/scripts/operate_ui.py screenshot (defaults to user tmp dir)
- calibrate coordinates: python3 skills/ui-element-ops/scripts/operate_ui.py calibrate --parsed-size <w> <h> --actual-size <w> <h>

Main JSON output:
- schema_version, pipeline, image, counts, elements
- each element has id, type, bbox_px, bbox_norm, text, clickable
Overlay PNG output:
- same screenshot with labeled detection boxes

Missing dependencies or weights: run bootstrap script again.
Permission/cache errors under $HOME: keep temporary caches under /tmp (handled by run script).
CPU-only machine: expect slower inference.
Performance note: parse/capture-and-parse commands are heavy; avoid very tight loops and reuse recent elements.json when possible.
Headless environment limitation:
- usable without GUI: parse/list/find/wait/calibrate on existing files.
- requires GUI session: click/click-xy/type/key/hotkey/screenshot/screen-info.