GUI Agent
STEP 0: Activate Platform (MANDATORY FIRST STEP)
Before any GUI operation, run:
python3 {baseDir}/scripts/activate.py
This detects your OS, sets up the correct action commands, and outputs platform context.
After running, {baseDir}/actions/_actions.yaml contains your platform's commands.
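A wrapper could enforce this step along the following lines. This is a minimal sketch, assuming `base_dir` holds the value substituted for {baseDir}. The helper names `activation_command`, `actions_file`, and `ensure_activated` are hypothetical; only the script path and the `_actions.yaml` location come from the text above.

```python
import subprocess
from pathlib import Path


def activation_command(base_dir: str) -> list:
    """Build the activation command described above."""
    return ["python3", str(Path(base_dir) / "scripts" / "activate.py")]


def actions_file(base_dir: str) -> Path:
    """Location where the platform commands are expected after activation."""
    return Path(base_dir) / "actions" / "_actions.yaml"


def ensure_activated(base_dir: str) -> Path:
    """Run activate.py, then confirm _actions.yaml exists before any GUI operation."""
    # check=True raises CalledProcessError if the script exits non-zero.
    subprocess.run(activation_command(base_dir), check=True)
    path = actions_file(base_dir)
    if not path.is_file():
        raise FileNotFoundError(f"activation did not produce {path}")
    return path
```

Failing loudly when the file is missing keeps later steps from running with commands meant for another platform.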
Workflow
OBSERVE → LEARN → ACT → VERIFY → SAVE
- OBSERVE: Take screenshot → run OCR + detector → understand current state → read {baseDir}/skills/gui-observe/SKILL.md
- LEARN: First time with an app? Save components to memory → read {baseDir}/skills/gui-learn/SKILL.md → learn_from_screenshot() auto-outputs app tips if available
- ACT: Pick target → execute using _actions.yaml commands → verify → read {baseDir}/skills/gui-act/SKILL.md → read {baseDir}/actions/_actions.yaml for available commands
- VERIFY: Screenshot again → confirm action succeeded
- SAVE: Record state transitions to memory → read {baseDir}/skills/gui-memory/SKILL.md for memory structure
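The five stages can be pictured as one ordered pass over a shared state. The sketch below is illustrative only: `STAGES`, `run_cycle`, and the handler signature are assumptions, not part of the skill's actual scripts, and each handler stands in for whatever the matching sub-skill document prescribes.

```python
from typing import Callable, Dict, Optional

# Stage order taken from the workflow line above.
STAGES = ("OBSERVE", "LEARN", "ACT", "VERIFY", "SAVE")


def run_cycle(
    handlers: Dict[str, Callable[[dict], dict]],
    state: Optional[dict] = None,
) -> dict:
    """Run one pass through the five stages in order, threading state through."""
    state = dict(state or {})
    trace = []
    for stage in STAGES:
        handler = handlers.get(stage)
        if handler is None:
            raise KeyError(f"no handler registered for stage {stage}")
        # Each handler receives the current state and returns the updated one.
        # A LEARN handler may return the state unchanged for an already-known app.
        state = handler(state)
        trace.append(stage)
    state["trace"] = trace
    return state
```

Requiring a handler for every stage makes it hard to skip VERIFY or SAVE by accident.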
Core Rules
- Coordinates from detection only — OCR or GPA-GUI-Detector, NEVER from guessing
- Look before you act — every action must be justified by what you observed
- image tool = understanding only — use it to decide WHAT to click, get WHERE from OCR/detector
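The first rule can be made concrete with a small lookup that refuses to guess. This is a sketch under stated assumptions: the detection shape (`{"text": ..., "bbox": (x, y, w, h)}`) and the name `click_point` are hypothetical and may not match what the OCR or GPA-GUI-Detector output actually looks like.

```python
from typing import List, Tuple


def click_point(detections: List[dict], label: str) -> Tuple[int, int]:
    """Return the center of the detected element whose text matches label.

    Assumed detection shape: {"text": str, "bbox": (x, y, width, height)}.
    Raises LookupError instead of guessing when nothing matches.
    """
    wanted = label.strip().lower()
    for det in detections:
        if str(det.get("text", "")).strip().lower() == wanted:
            x, y, w, h = det["bbox"]
            return (x + w // 2, y + h // 2)
    raise LookupError(f"{label!r} not found in detections; re-observe rather than guess")
```

For example, with a detection `{"text": "Send", "bbox": (100, 200, 60, 20)}`, `click_point(detections, "send")` returns `(130, 210)`; a label that was never detected raises an error, which is the cue to take a fresh screenshot.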
Sub-Skills Reference
| Sub-Skill | When to read |
|---|---|
| skills/gui-observe/SKILL.md | Before screenshots or detection |
| skills/gui-learn/SKILL.md | Before learning a new app |
| skills/gui-act/SKILL.md | Before any click/type action |
| skills/gui-memory/SKILL.md | For memory structure details |
| skills/gui-workflow/SKILL.md | For multi-step navigation |
| skills/gui-setup/SKILL.md | For first-time machine setup |
| skills/gui-report/SKILL.md | For task performance reporting |