UI Element Ops
Parse one or more screenshots into a machine-readable JSON schema with:
type(normalized UI element type)bbox_pxandbbox_normtext(OCR/caption content when available)clickableflag- optional overlay image with labeled boxes
- desktop actions via
scripts/operate_ui.py(click/type/key/hotkey/screenshot) - element query and orchestration via
scripts/operate_ui.py(find,wait) - coordinate calibration profile for multi-display/DPI/window offset (
calibrate)
Quick Start
- Prepare runtime once per machine:
skills/ui-element-ops/scripts/bootstrap_omniparser_env.sh "$PWD"
- Parse one screenshot:
skills/ui-element-ops/scripts/run_parse_ui.sh /abs/path/to/1.jpeg
- Read outputs:
<image>.elements.json<image>.overlay.png
- One-step capture + parse with randomized names:
skills/ui-element-ops/scripts/capture_and_parse.sh
Workflow
- Confirm screenshot path and desired output path.
- Run
scripts/bootstrap_omniparser_env.shwhen.venvor OmniParser weights are missing. - Run
scripts/run_parse_ui.shfor standard parsing. - Report absolute output paths and summary counts:
total,clickable,by_type. - Call out obvious quality risks for tiny text or dense icon layouts.
- Execute desktop actions when requested:
- list elements:
python3 skills/ui-element-ops/scripts/operate_ui.py list --elements <json> - find elements:
python3 skills/ui-element-ops/scripts/operate_ui.py find --elements <json> --type button --text-contains login - wait for appear/disappear:
python3 skills/ui-element-ops/scripts/operate_ui.py wait --elements <json> --state appear --text-contains continue - click by id:
python3 skills/ui-element-ops/scripts/operate_ui.py click --elements <json> --id e_0001 - screenshot:
python3 skills/ui-element-ops/scripts/operate_ui.py screenshot(defaults to user tmp dir) - calibrate coordinates:
python3 skills/ui-element-ops/scripts/operate_ui.py calibrate --parsed-size <w> <h> --actual-size <w> <h>
- list elements:
Tunables
- Edit type mapping keywords in
references/type_rules.example.json. - Use advanced parser args via
scripts/parse_ui.py --help. - Use
--use-paddleocronly whenpaddleocr/paddlepaddleare installed.
Outputs
- Main JSON output:
schema_version,pipeline,image,counts,elements- each element has
id,type,bbox_px,bbox_norm,text,clickable
- Overlay PNG output:
- same screenshot with labeled detection boxes
Failure Handling
- Missing dependencies or weights: run bootstrap script again.
- Permission/cache errors under
$HOME: keep temporary caches under/tmp(handled by run script). - CPU-only machine: expect slower inference.
- Performance note: parse/capture-and-parse commands are heavy; avoid very tight loops and reuse recent
elements.jsonwhen possible. - Headless environment limitation:
- usable without GUI: parse/list/find/wait/calibrate on existing files.
- requires GUI session: click/click-xy/type/key/hotkey/screenshot/screen-info.