omni.vu - Visual Understanding & Automation

Overview

omni.vu gives you eyes and hands on the user's macOS screen. Use it to:

See what the user sees (screen capture)
Understand UI state with AI vision
Detect changes and wait for events
Act with mouse and keyboard automation

When to Use This Skill

Proactively Use omni.vu When:

Debugging UI issues - "The button isn't working" → capture and analyze
Verifying changes - After modifying UI code, check if it rendered correctly
Waiting for operations - Build finishing, deployment completing, tests running
Understanding context - User describes something on screen you can't see
Automating repetitive tasks - Clicking through UI flows, filling forms
Documentation - Capturing screenshots of features

Do NOT Use When:

Reading/writing files (use Read/Write tools)
Running terminal commands (use Bash)
Making API calls (use appropriate tools)
User hasn't granted screen recording permission

Tool Reference

Capture Tools

Full Screen Capture

vu(monitor=0, save_to_history=True)

Captures the entire screen. Returns base64 image.

Use when: You need to see everything on screen.

vu_window

Window Capture

vu_window(window_id=None, title="VS Code", include_frame=False)

Captures a specific window by ID or title (partial match).

Use when: You only need one application's content.

Workflow:

Call vu_list_windows to see available windows
Find the window_id or use title matching
Call vu_window with that ID/title

vu_region

Region Capture

vu_region(x=100, y=100, width=500, height=300)

Captures a specific rectangle of the screen.

Use when: You need a precise area (error message, specific component).

vu_list_windows

List Windows

vu_list_windows(filter_app="Chrome")

Lists all visible windows with metadata.

Returns: List of {window_id, title, owner, x, y, width, height}

vu_list_monitors

List Displays

vu_list_monitors()

Lists connected displays with resolution and scale factor.

Vision Tools

vu_describe

AI Vision Analysis

vu_describe( prompt="What errors are visible?", provider="claude", # or openai, gemini, ollama max_tokens=1024, capture_first=True )

Captures screen and analyzes with AI.

Best prompts:

"Describe what you see on screen"
"Are there any error messages visible?"
"What is the state of the build/test output?"
"Is the login form filled correctly?"
"What color is the status indicator?"

Providers:

Provider Speed Quality Cost

claude Medium Excellent $$

openai Fast Very Good $$

gemini Fast Good $

ollama Slow Varies Free

Utility Tools

vu_diff

Change Detection

vu_diff(threshold=0.02, monitor=0)

Detects if screen changed since last capture.

Returns: {changed: bool, diff_percentage: float}

Use for polling patterns:

Wait for build to finish

while True: result = vu_diff(threshold=0.05) if result["changed"]: # Screen changed, check what happened analysis = vu_describe(prompt="Did the build succeed or fail?") break # Wait before checking again

vu_history

Capture History

vu_history(limit=10, capture_type="full_screen")

Gets recent capture metadata.

vu_last

Last Capture

vu_last()

Returns the most recent capture with image data.

Use when: You need to re-analyze without re-capturing.

vu_status

System Status

vu_status()

Returns system info: monitors, providers, safety settings.

Automation Tools

vu_click

Mouse Click

vu_click(x=500, y=300, button="left", count=1)

Clicks at screen coordinates.

Buttons: left , right , middle

Count: 1 (single), 2 (double), 3 (triple)

Safety: Coordinates validated against screen bounds.

vu_move

Move Cursor

vu_move(x=500, y=300, duration_ms=0)

Moves cursor to position. Use duration_ms for animated movement.

vu_drag

Drag Operation

vu_drag(start_x=100, start_y=100, end_x=500, end_y=300, duration_ms=500)

Drags from start to end position.

Safety: Maximum 2000px drag distance.

vu_scroll

Scroll

vu_scroll(direction="down", amount=3, x=None, y=None)

Scrolls at current or specified position.

Directions: up , down , left , right

Amount: Lines to scroll (1-20)

vu_type

Type Text

vu_type(text="Hello World", delay_between_ms=0)

Types text character by character.

Note: Click on target field first!

vu_hotkey

Keyboard Shortcut

vu_hotkey(keys="cmd+s")

Executes keyboard shortcuts.

Format: modifier+key (e.g., cmd+c , ctrl+shift+s , cmd+opt+i ) Modifiers: cmd , ctrl , alt /opt , shift

Blocked hotkeys (for safety):

cmd+q (quit)
cmd+shift+q (logout)
cmd+opt+esc (force quit)

Common Workflows

Debug UI Issue

User: "The submit button doesn't do anything when I click it"

vu_describe(prompt="Describe the form and submit button state")
Analyze: Is button disabled? Is there validation error?
If needed: vu_window(title="Chrome") for focused capture
Check console: vu_describe(prompt="Are there any errors in the developer console?")
Verify Build/Deploy

User: "Run the build and let me know when it's done"

Run: npm run build (via Bash)
Loop: vu_diff(threshold=0.05)
When changed: vu_describe(prompt="What is the build status? Did it succeed or fail?")
Report result
Fill a Form

User: "Fill out the registration form with test data"

vu_describe(prompt="What form fields are visible?")
vu_click(x=field_x, y=field_y) # Click first field
vu_type(text="testuser@example.com")
vu_hotkey(keys="tab") # Move to next field
vu_type(text="Test User")
Continue for each field...
vu_click(x=submit_x, y=submit_y) # Submit
vu_describe(prompt="Was the form submitted successfully?")
Navigate UI

User: "Open the settings panel in VS Code"

vu_list_windows(filter_app="Code")
vu_window(title="Code") # Capture VS Code
vu_hotkey(keys="cmd+,") # Open settings
vu_diff() # Wait for settings to open
vu_describe(prompt="Are the VS Code settings now visible?")
Monitor for Changes

User: "Watch the dashboard and tell me when the status changes"

vu() # Initial capture
Loop every 5 seconds:
- vu_diff(threshold=0.02)
- If changed: vu_describe(prompt="What changed on the dashboard?")
Report changes to user

Best Practices

Capture Before Acting

Always capture and analyze before clicking/typing:

❌ Bad: vu_click(x=500, y=300) # Hope it's the right spot

✅ Good:

vu_describe(prompt="Where is the submit button?")
Parse coordinates from description
vu_click(x=parsed_x, y=parsed_y)
Use Appropriate Capture Scope

Full screen → General context, multi-app workflows Window → Single app focus, cleaner output Region → Specific element, faster/smaller

Specific Vision Prompts

❌ Vague: "What do you see?"

✅ Specific:

"Is there an error message visible? If so, what does it say?"
"What is the status of the test runner? Passing, failing, or running?"
"List all form fields visible and their current values"

Rate Limit Automation

Don't spam clicks. Wait between actions:

vu_click(x=100, y=100)

Wait for UI response

vu_diff(threshold=0.01) vu_click(x=200, y=200)

Verify After Actions

Always verify automation succeeded:

vu_type(text="hello@example.com") vu_describe(prompt="What text is now in the email field?")

Troubleshooting

"Permission denied" or blank captures

→ Grant Screen Recording permission: System Settings > Privacy > Screen Recording

Automation not working

→ Grant Accessibility permission: System Settings > Privacy > Accessibility

Vision analysis failing

→ Check API keys in environment variables → Try different provider: vu_describe(provider="openai")

Coordinates seem off

→ Retina display? Coordinates are in logical points, not pixels → Multi-monitor? Check vu_list_monitors() for display arrangement

Hotkey blocked

→ Some dangerous hotkeys (cmd+q) are blocked for safety → Unblock in config if absolutely necessary

Configuration

Environment variables:

OMNI_VU_ANTHROPIC_API_KEY=sk-ant-... # For Claude vision OMNI_VU_OPENAI_API_KEY=sk-... # For GPT-4 vision OMNI_VU_DEFAULT_PROVIDER=claude # Default AI provider OMNI_VU_SAFETY_LEVEL=medium # low/medium/high

History stored at: ~/.omni.vu/captures/

omni-vu

Safety Notice

Copy this and send it to your AI assistant to learn

Wait for build to finish

Wait for UI response

Source Transparency

Related Skills

chatbot designer

agent workflow builder

git-workflow-designer

llc-ops