Desktop Control Skill
Control the desktop GUI using semantic element targeting. Find and click UI elements by name instead of coordinates.
Key Features:
- AT-SPI - Primary method using accessibility tree (knows element roles, states, actions)
- OCR Fallback - Tesseract-based text finding when AT-SPI can't find the element
- Wait Utilities - Poll for elements to appear with exponential backoff
- Click Verification - Optional pre-click screenshot verification
Prerequisites
Install dependencies:
bash install.sh
Or manually:
# System packages
sudo apt-get install -y xdotool scrot imagemagick \
at-spi2-core libatk-adaptor python3-gi gir1.2-atspi-2.0 \
tesseract-ocr tesseract-ocr-eng python3-pip
# Python packages
pip3 install -r requirements.txt
For headless Xvfb sessions:
export GTK_MODULES=gail:atk-bridge
export QT_LINUX_ACCESSIBILITY_ALWAYS_ON=1
/usr/lib/at-spi2-core/at-spi-bus-launcher &
Commands
All commands use DISPLAY=:10.0 by default. Override with --display flag.
find-element
Find UI element via AT-SPI with OCR fallback.
python3 scripts/desktop.py find-element --name "Confirm" [--role button] [--app Firefox]
| Parameter | Type | Required | Description |
|---|---|---|---|
--name, -n | string | No | Element name/text to find |
--role, -r | string | No | Element role (button, entry, etc.) |
--app, -a | string | No | Application name filter |
--all | flag | No | Find all matches |
--clickable | flag | No | Only clickable elements |
--max-results | int | No | Maximum results (default: 50) |
Returns:
{
"element": {
"name": "Confirm",
"bounds": { "x": 400, "y": 300, "width": 100, "height": 30 },
"center": { "x": 450, "y": 315 },
"role": "push button",
"source": "atspi",
"visible": true,
"enabled": true,
"clickable": true
}
}
find-text
Find text on screen via OCR only.
python3 scripts/desktop.py find-text "I have an existing wallet" [--exact] [--all]
| Parameter | Type | Required | Description |
|---|---|---|---|
text | string | Yes | Text to find |
--exact | flag | No | Require exact match |
--case-sensitive | flag | No | Case-sensitive matching |
--all | flag | No | Find all occurrences |
--max-results | int | No | Maximum results (default: 50) |
Returns:
{
"match": {
"name": "I have an existing wallet",
"bounds": { "x": 200, "y": 400, "width": 180, "height": 20 },
"center": { "x": 290, "y": 410 },
"source": "ocr",
"confidence": 95.2
}
}
click-element
Click element by name/role (finds element first, then clicks at center).
python3 scripts/desktop.py click-element --name "Next" [--role button] [--verify]
| Parameter | Type | Required | Description |
|---|---|---|---|
--name, -n | string | No | Element name/text |
--role, -r | string | No | Element role |
--app, -a | string | No | Application name filter |
--right | flag | No | Right click |
--double | flag | No | Double click |
--verify | flag | No | OCR verify before click |
click-element requires at least one selector: --name or --role.
When --verify is used, OCR must be available and text must be provided (typically via --name).
Returns:
{
"clicked": {
"element": { "name": "Next", "..." },
"x": 450,
"y": 315,
"button": "left",
"double": false
}
}
wait-for
Wait for element or text to appear (with timeout and exponential backoff).
python3 scripts/desktop.py wait-for --name "Success" --timeout 30
python3 scripts/desktop.py wait-for --text "Transaction complete" --timeout 60
python3 scripts/desktop.py wait-for --name "Loading" --gone --timeout 30
| Parameter | Type | Required | Description |
|---|---|---|---|
--name, -n | string | No | Element name (AT-SPI + OCR) |
--role, -r | string | No | Element role (AT-SPI only) |
--app, -a | string | No | Application filter |
--text, -t | string | No | Text to find (OCR only) |
--exact | flag | No | Exact text match |
--gone | flag | No | Wait until disappears |
--timeout | float | No | Timeout in seconds (default: 30) |
Use either --text or element selectors (--name, --role, --app) for a single call, not both.
For element waits (with or without --gone), provide at least one of --name or --role.
Returns:
{ "found": { "name": "Success", "..." } }
// or
{ "gone": true, "name": "Loading" }
// or
{ "error": "Element not found within 30s", "timeout": true }
list-elements
List all interactive elements (buttons, inputs, links, etc.)
python3 scripts/desktop.py list-elements [--app Firefox] [--role button]
| Parameter | Type | Required | Description |
|---|---|---|---|
--app, -a | string | No | Application name filter |
--role, -r | string | No | Filter by role |
--include-hidden | flag | No | Include hidden elements |
--max-results | int | No | Maximum results (default: 100) |
Returns:
{
"elements": [
{ "name": "Sign In", "role": "push button", "..." },
{ "name": "Email", "role": "entry", "..." }
],
"count": 2
}
status
Check AT-SPI and OCR availability.
python3 scripts/desktop.py status
Returns:
{
"atspi": {
"available": true,
"applications": ["Firefox", "gnome-calculator"]
},
"ocr": { "available": true },
"display": ":10.0"
}
Original Commands
These coordinate-based commands are still available:
| Command | Description |
|---|---|
screenshot [--output PATH] | Take screenshot |
click X Y [--right] [--double] | Click at coordinates |
type "TEXT" [--type-delay MS] | Type text |
key "KEYS" | Press key combination |
move X Y | Move mouse |
active | Get active window info |
find-window "NAME" | Find windows by name |
focus "NAME" | Focus window by name |
position | Get mouse position |
windows | List all windows |
Example Workflows
MetaMask Transaction (Semantic)
# Wait for and click Confirm button
python3 scripts/desktop.py wait-for --name "Confirm" --role button --timeout 30
python3 scripts/desktop.py click-element --name "Confirm" --role button
# Wait for success message
python3 scripts/desktop.py wait-for --text "Transaction submitted" --timeout 60
Phantom Wallet Import (Semantic)
# Click "I have an existing wallet"
python3 scripts/desktop.py click-element --name "I already have a wallet"
# Wait for seed phrase input
python3 scripts/desktop.py wait-for --role entry --timeout 10
# Type seed phrase
python3 scripts/desktop.py type "word1 word2 word3..."
# Click Import
python3 scripts/desktop.py click-element --name "Import"
Hybrid Approach (Semantic + Coordinates)
# Use semantic for known buttons
python3 scripts/desktop.py click-element --name "Settings"
# Fall back to coordinates for unlabeled icons
python3 scripts/desktop.py screenshot --output /tmp/screen.png
# (analyze screenshot to get coordinates)
python3 scripts/desktop.py click 850 120
Tips
- Prefer semantic commands -
click-elementandwait-forare more robust than coordinates - Check status first - Run
statusto verify AT-SPI and OCR are available - Use --role for precision - Distinguish between buttons and text with same name
- Fall back to OCR - If AT-SPI doesn't expose an element,
find-textuses OCR - Wait instead of sleep -
wait-foris more reliable than fixed delays - Use --verify for critical clicks - Adds OCR verification before clicking