OS Use - Cross-Platform OS Automation
A comprehensive cross-platform toolkit for OS automation, screenshot capture, visual recognition, mouse/keyboard control, and window management. Supports macOS 12+ and Windows 10+.
Platform Support Matrix
| Feature | macOS Implementation | Windows Implementation |
|---|---|---|
| Screenshot | pyautogui + PIL | pyautogui + PIL |
| Visual Recognition | opencv-python + pyautogui | opencv-python + pyautogui |
| Mouse/Keyboard | pyautogui | pyautogui |
| Window Management | AppleScript (native) | pywinauto / pygetwindow |
| Application Control | AppleScript / subprocess | subprocess / pywinauto |
| Browser Automation | Chrome DevTools MCP | Chrome DevTools MCP |
Capabilities
1. Screenshot Capture 📸
Universal (macOS & Windows):
- Full screen capture
- Region capture (specified coordinates)
- Window capture (specific application window)
- Clipboard screenshot access
Implementation: pyautogui.screenshot() + PIL.Image
2. Visual Recognition 👁️
Universal (macOS & Windows):
- Image matching/locating on screen
- Template matching with confidence threshold
- Multi-scale matching (handle different resolutions)
- Color detection and region extraction
Optional OCR:
- Text recognition from screenshots (requires
pytesseract+ Tesseract OCR engine)
Implementation: opencv-python + pyautogui.locateOnScreen()
3. Mouse & Keyboard Control 🖱️⌨️
Universal (macOS & Windows):
- Mouse movement (absolute and relative coordinates)
- Mouse clicking (left, right, middle, double-click)
- Mouse dragging and dropping
- Scroll wheel operations
- Keyboard text input
- Keyboard shortcuts and hotkeys
- Special key combinations
Implementation: pyautogui
4. Window Management 🪟
macOS Implementation:
- List all application windows
- Get window position, size, title
- Activate/minimize/close windows
- Move and resize windows
- Launch/quit applications
Implementation: AppleScript via subprocess
Windows Implementation:
- Same capabilities as macOS
- Additional: Get window handle (HWND), process information
- Better integration with Windows window manager
Implementation: pywinauto or pygetwindow
5. Browser Automation 🌐
Universal (macOS & Windows):
- Webpage screenshots
- Element screenshots
- Page navigation
- Form filling and clicking
- Network monitoring
- Performance analysis
Implementation: Chrome DevTools MCP (separate tool)
6. System Integration 🔧
Clipboard Operations:
- Read/write clipboard content
- Support images and text
Implementation: pyperclip + pyautogui
Technical Implementation Details
Python Environment Setup
# Create virtual environment
python3 -m venv ~/.nanobot/workspace/macos-automation/.venv
# Activate
source ~/.nanobot/workspace/macos-automation/.venv/bin/activate
# Install dependencies
pip install pyautogui opencv-python-headless numpy Pillow pyperclip
# macOS specific
# (AppleScript is built-in, no installation needed)
# Windows specific
pip install pywinauto pygetwindow
Key Libraries Reference
| Library | Version | Purpose |
|---|---|---|
pyautogui | 0.9.54+ | Screenshot, mouse/keyboard control |
opencv-python-headless | 4.11.0.84+ | Image recognition, computer vision |
numpy | 2.4.2+ | Numerical operations for OpenCV |
Pillow | 12.1.1+ | Image processing |
pyperclip | Latest | Clipboard operations |
pywinauto | Latest | Windows window management |
pygetwindow | Latest | Cross-platform window control |
Platform-Specific Notes
macOS Specifics
Permissions Required:
- Accessibility: System Settings > Privacy & Security > Accessibility
- Screen Recording: System Settings > Privacy & Security > Screen Recording
AppleScript Quirks:
- Some modern apps (e.g., Chrome) may have limited AppleScript support
- Window titles may be truncated or localized
- Some operations require app to be frontmost
Coordinate System:
- Origin (0, 0) at top-left
- Retina displays: pyautogui automatically handles scaling
Windows Specifics
Administrator Privileges:
- Some operations (e.g., interacting with elevated windows) may require admin rights
High DPI Displays:
- Windows scaling may affect coordinate accuracy
- Use
pyautogui.size()to get actual screen dimensions
Window Handle (HWND):
- Windows provides low-level window handles for precise control
pywinautoprovides both high-level and low-level access
Error Handling Patterns
import pyautogui
import time
# Pattern 1: Retry with backoff
def retry_with_backoff(func, max_retries=3, base_delay=1):
for i in range(max_retries):
try:
return func()
except Exception as e:
if i == max_retries - 1:
raise
delay = base_delay * (2 ** i)
print(f"Retry {i+1}/{max_retries} after {delay}s: {e}")
time.sleep(delay)
# Pattern 2: Safe operations with fallback
def safe_screenshot(output_path):
try:
screenshot = pyautogui.screenshot()
screenshot.save(output_path)
return output_path
except Exception as e:
print(f"Screenshot failed: {e}")
return None
# Pattern 3: Coordinate boundary checking
def safe_click(x, y, max_x=None, max_y=None):
"""安全点击,确保坐标在屏幕范围内"""
if max_x is None or max_y is None:
max_x, max_y = pyautogui.size()
x = max(0, min(x, max_x - 1))
y = max(0, min(y, max_y - 1))
pyautogui.click(x, y)
Usage Examples by Scenario
Scenario 1: Automated Testing
"""
自动化 UI 测试示例
测试一个假设的登录页面
"""
import pyautogui
import time
def test_login_flow():
# 1. 截取初始状态
initial_screenshot = pyautogui.screenshot()
initial_screenshot.save("test_01_initial.png")
# 2. 查找并点击登录按钮
button_location = pyautogui.locateOnScreen(
"login_button.png",
confidence=0.9
)
if button_location:
center = pyautogui.center(button_location)
pyautogui.click(center.x, center.y)
time.sleep(1)
# 3. 输入用户名
pyautogui.typewrite("testuser@example.com", interval=0.01)
pyautogui.press('tab')
# 4. 输入密码
pyautogui.typewrite("TestPassword123", interval=0.01)
# 5. 点击提交
pyautogui.press('return')
time.sleep(2)
# 6. 验证结果
result_screenshot = pyautogui.screenshot()
result_screenshot.save("test_02_result.png")
# 检查是否出现成功提示
success_indicator = pyautogui.locateOnScreen(
"success_message.png",
confidence=0.8
)
if success_indicator:
print("✅ 测试通过:登录成功")
return True
else:
print("❌ 测试失败:未找到成功提示")
return False
# 运行测试
if __name__ == "__main__":
test_login_flow()
Scenario 2: Data Entry Automation
"""
数据录入自动化示例
将 Excel 数据自动填入网页表单
"""
import pyautogui
import pandas as pd
import time
def automate_data_entry(excel_file, form_template):
"""
从 Excel 读取数据并自动填入表单
Args:
excel_file: Excel 文件路径
form_template: 表单字段与 Excel 列的映射
"""
# 1. 读取 Excel 数据
df = pd.read_excel(excel_file)
print(f"读取到 {len(df)} 条记录")
# 2. 遍历每条记录
for index, row in df.iterrows():
print(f"\n正在处理第 {index + 1} 条记录...")
# 3. 填写每个字段
for field_name, column_name in form_template.items():
value = row.get(column_name, '')
# 查找表单字段(需要提前准备字段截图)
field_location = pyautogui.locateOnScreen(
f"form_field_{field_name}.png",
confidence=0.8
)
if field_location:
# 点击字段
center = pyautogui.center(field_location)
pyautogui.click(center.x, center.y)
time.sleep(0.2)
# 输入值
pyautogui.hotkey('ctrl', 'a') # 全选
pyautogui.typewrite(str(value), interval=0.01)
time.sleep(0.2)
else:
print(f" ⚠️ 未找到字段: {field_name}")
# 4. 提交表单
submit_btn = pyautogui.locateOnScreen(
"submit_button.png",
confidence=0.8
)
if submit_btn:
center = pyautogui.center(submit_btn)
pyautogui.click(center.x, center.y)
print(" ✅ 已提交")
time.sleep(2) # 等待提交完成
else:
print(" ⚠️ 未找到提交按钮")
# 5. 准备下一条记录
# 可能需要点击"添加新记录"或返回列表
time.sleep(1)
print("\n🎉 所有记录处理完成!")
# 使用示例
if __name__ == "__main__":
# 表单模板:字段名 -> Excel 列名
form_template = {
"name": "姓名",
"email": "邮箱",
"phone": "电话",
"address": "地址"
}
automate_data_entry("data.xlsx", form_template)
Scenario 3: Screen Monitoring & Alerting
"""
屏幕监控与告警示例
监控特定区域变化,发现变化时发送通知
"""
import pyautogui
import cv2
import numpy as np
import time
from datetime import datetime
def monitor_screen_region(region, template_image=None, check_interval=5, callback=None):
"""
监控屏幕特定区域的变化
Args:
region: (left, top, width, height) 监控区域
template_image: 要查找的模板图像路径(可选)
check_interval: 检查间隔(秒)
callback: 发现变化时的回调函数
Returns:
监控会话对象(可调用 stop() 停止)
"""
class MonitorSession:
def __init__(self):
self.running = True
self.baseline = None
def stop(self):
self.running = False
session = MonitorSession()
print(f"🔍 开始监控区域: {region}")
print(f"⏱️ 检查间隔: {check_interval}秒")
print("按 Ctrl+C 停止监控\n")
try:
while session.running:
# 捕获当前区域
current = pyautogui.screenshot(region=region)
current_array = np.array(current)
if template_image:
# 模式1: 查找模板图像
template_location = pyautogui.locateOnScreen(
template_image,
confidence=0.8
)
if template_location:
print(f"✅ [{datetime.now()}] 找到模板图像: {template_location}")
if callback:
callback('template_found', {
'location': template_location,
'screenshot': current
})
else:
# 模式2: 检测变化
if session.baseline is None:
session.baseline = current_array
print(f"📸 [{datetime.now()}] 已建立基准图像")
else:
# 计算差异
diff = cv2.absdiff(session.baseline, current_array)
diff_gray = cv2.cvtColor(diff, cv2.COLOR_RGB2GRAY)
diff_score = np.mean(diff_gray)
if diff_score > 10: # 阈值可调
print(f"⚠️ [{datetime.now()}] 检测到变化! 差异分数: {diff_score:.2f}")
if callback:
callback('change_detected', {
'diff_score': diff_score,
'screenshot': current,
'baseline': session.baseline
})
# 更新基准
session.baseline = current_array
time.sleep(check_interval)
except KeyboardInterrupt:
print("\n🛑 监控已停止")
return session
# 使用示例
def alert_callback(event_type, data):
"""告警回调函数示例"""
if event_type == 'template_found':
print(f"🎯 模板出现在: {data['location']}")
# 可以在这里发送通知、发送邮件、执行操作等
elif event_type == 'change_detected':
print(f"📊 变化强度: {data['diff_score']}")
# 保存差异图像
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
data['screenshot'].save(f"change_{timestamp}.png")
if __name__ == "__main__":
# 示例1: 监控屏幕变化
print("=== 监控屏幕变化 ===")
monitor = monitor_screen_region(
region=(0, 0, 1920, 1080), # 全屏
check_interval=5, # 每5秒检查一次
callback=alert_callback
)
# 10分钟后停止(实际使用可以一直运行)
# time.sleep(600)
# monitor.stop()
# 示例2: 查找特定图像
# monitor = monitor_screen_region(
# region=(0, 0, 1920, 1080),
# template_image="target_button.png", # 要查找的图像
# check_interval=2,
# callback=alert_callback
# )
Advanced Techniques
Handling Multiple Monitors
import pyautogui
def get_all_screen_sizes():
"""获取所有显示器尺寸(仅 Windows 支持多显示器详细信息)"""
# macOS 返回主屏尺寸
# Windows 可以使用 pygetwindow 或 win32api 获取多显示器信息
primary = pyautogui.size()
print(f"主屏幕尺寸: {primary}")
# Windows 示例(需要安装 pywin32)
try:
import win32api
monitors = win32api.EnumDisplayMonitors()
for i, monitor in enumerate(monitors):
print(f"显示器 {i+1}: {monitor[2]}")
except ImportError:
pass
return primary
def screenshot_specific_monitor(monitor_num=0):
"""截图指定显示器(实验性功能)"""
# 目前 pyautogui 主要支持主显示器
# 多显示器支持需要平台特定代码
pass
Performance Optimization
import cv2
import numpy as np
import pyautogui
import time
from functools import lru_cache
class ScreenCache:
"""屏幕缓存优化器"""
def __init__(self, cache_duration=0.5):
self.cache_duration = cache_duration
self.last_capture = None
self.last_capture_time = 0
def get_screenshot(self, region=None):
"""获取截图(带缓存)"""
current_time = time.time()
# 检查缓存是否有效
if (self.last_capture is not None and
current_time - self.last_capture_time < self.cache_duration and
region is None):
return self.last_capture
# 捕获新截图
screenshot = pyautogui.screenshot(region=region)
if region is None:
self.last_capture = screenshot
self.last_capture_time = current_time
return screenshot
def clear_cache(self):
"""清除缓存"""
self.last_capture = None
self.last_capture_time = 0
class FastImageFinder:
"""快速图像查找器(使用多尺度金字塔)"""
def __init__(self, scales=[0.8, 0.9, 1.0, 1.1, 1.2]):
self.scales = scales
def find_multi_scale(self, template_path, screenshot=None, confidence=0.8):
"""
多尺度图像查找
Returns:
(x, y, scale) 或 None
"""
if screenshot is None:
screenshot = pyautogui.screenshot()
template = cv2.imread(template_path)
if template is None:
return None
screenshot_cv = cv2.cvtColor(np.array(screenshot), cv2.COLOR_RGB2BGR)
for scale in self.scales:
# 缩放模板
scaled_template = cv2.resize(
template,
None,
fx=scale,
fy=scale,
interpolation=cv2.INTER_AREA
)
# 模板匹配
result = cv2.matchTemplate(
screenshot_cv,
scaled_template,
cv2.TM_CCOEFF_NORMED
)
_, max_val, _, max_loc = cv2.minMaxLoc(result)
if max_val >= confidence:
h, w = scaled_template.shape[:2]
center_x = max_loc[0] + w // 2
center_y = max_loc[1] + h // 2
return (center_x, center_y, scale)
return None
# 使用示例
cache = ScreenCache()
finder = FastImageFinder()
# 快速截图(带缓存)
screenshot = cache.get_screenshot()
# 多尺度图像查找
result = finder.find_multi_scale("button.png", screenshot)
if result:
x, y, scale = result
print(f"找到图像: ({x}, {y}), 缩放: {scale}")
Security Considerations
"""
安全最佳实践
"""
import pyautogui
import hashlib
import time
class SecureAutomation:
"""安全自动化包装器"""
def __init__(self):
self.action_log = []
self.max_retries = 3
self.rate_limit_delay = 0.1 # 操作间隔
def log_action(self, action, details):
"""记录操作日志"""
timestamp = time.strftime("%Y-%m-%d %H:%M:%S")
log_entry = {
'timestamp': timestamp,
'action': action,
'details': details,
'hash': hashlib.md5(f"{timestamp}{action}{details}".encode()).hexdigest()[:8]
}
self.action_log.append(log_entry)
def safe_click(self, x, y, description=""):
"""安全点击(带验证)"""
try:
# 验证坐标在屏幕范围内
screen_width, screen_height = pyautogui.size()
if not (0 <= x < screen_width and 0 <= y < screen_height):
raise ValueError(f"坐标 ({x}, {y}) 超出屏幕范围")
# 执行点击
pyautogui.moveTo(x, y, duration=0.2)
time.sleep(self.rate_limit_delay)
pyautogui.click()
# 记录日志
self.log_action('click', f"({x}, {y}) - {description}")
return True
except Exception as e:
self.log_action('click_failed', f"({x}, {y}) - Error: {str(e)}")
return False
def safe_typewrite(self, text, interval=0.01):
"""安全输入(敏感信息不记录)"""
try:
pyautogui.typewrite(text, interval=interval)
self.log_action('typewrite', f"输入 {len(text)} 个字符 [内容已隐藏]")
return True
except Exception as e:
self.log_action('typewrite_failed', f"Error: {str(e)}")
return False
def get_action_report(self):
"""生成操作报告"""
total = len(self.action_log)
successful = sum(1 for log in self.action_log if 'failed' not in log['action'])
failed = total - successful
report = f"""
=== 自动化操作报告 ===
总操作数: {total}
成功: {successful}
失败: {failed}
成功率: {(successful/total*100):.1f}%
详细日志:
"""
for log in self.action_log:
report += f"[{log['timestamp']}] [{log['hash']}] {log['action']}: {log['details']}\n"
return report
# 使用示例
secure = SecureAutomation()
# 执行安全操作
secure.safe_click(500, 400, "登录按钮")
secure.safe_typewrite("username@example.com")
secure.safe_click(500, 450, "密码输入框")
secure.safe_typewrite("********")
secure.safe_click(500, 500, "提交按钮")
# 生成报告
print(secure.get_action_report())
Troubleshooting Guide
Common Issues and Solutions
1. Permission Errors
Symptom: pyautogui fails with permission errors or captures black screenshots.
macOS Solution:
- Open System Settings > Privacy & Security > Accessibility
- Add your terminal application (e.g., Terminal.app, iTerm.app, or the Python executable)
- Repeat for Screen Recording permission
Windows Solution:
- Run as Administrator if needed
- Check Windows Defender or antivirus isn't blocking
2. Coordinate Inaccuracy
Symptom: Clicks or screenshots miss the intended target.
Possible Causes:
- High DPI / Retina display scaling
- Multiple monitors with different resolutions
- Window decorations or taskbar affecting coordinates
Solution:
import pyautogui
# Debug: Print screen info
print(f"Screen size: {pyautogui.size()}")
print(f"Mouse position: {pyautogui.position()}")
# Handle high DPI (Windows)
import ctypes
ctypes.windll.user32.SetProcessDPIAware() # Windows only
3. Image Recognition Failures
Symptom: locateOnScreen returns None even when image is visible.
Common Causes:
- Resolution mismatch (captured image at different scale)
- Color depth differences
- Transparency or alpha channel issues
- Confidence threshold too high
Solutions:
import pyautogui
import cv2
import numpy as np
# Solution 1: Lower confidence
location = pyautogui.locateOnScreen('button.png', confidence=0.7) # Default is 0.9
# Solution 2: Multi-scale matching (see FastImageFinder class in Performance section)
finder = FastImageFinder(scales=[0.5, 0.75, 1.0, 1.25, 1.5])
result = finder.find_multi_scale('button.png')
# Solution 3: Convert to grayscale for matching
screenshot = pyautogui.screenshot()
screenshot_cv = cv2.cvtColor(np.array(screenshot), cv2.COLOR_RGB2GRAY)
template = cv2.imread('button.png', cv2.IMREAD_GRAYSCALE)
result = cv2.matchTemplate(screenshot_cv, template, cv2.TM_CCOEFF_NORMED)
min_val, max_val, min_loc, max_loc = cv2.minMaxLoc(result)
if max_val >= 0.8:
print(f"找到匹配,置信度: {max_val}")
h, w = template.shape
center_x = max_loc[0] + w // 2
center_y = max_loc[1] + h // 2
pyautogui.click(center_x, center_y)
4. Slow Performance
Symptom: Operations are slow, high CPU usage, or noticeable delays.
Optimization Strategies:
-
Reduce Screenshot Frequency
- Cache screenshots when possible
- Use region-specific captures instead of full screen
-
Optimize Image Matching
- Resize large images before matching
- Use grayscale matching when color isn't important
- Set appropriate confidence levels
-
Batch Operations
- Group multiple actions together
- Minimize unnecessary delays
See the "Performance Optimization" section for detailed code examples.
5. Application-Specific Issues
Browser Automation:
- Modern browsers may block automation
- Use Chrome DevTools Protocol instead of pyautogui for web
- Consider Playwright or Selenium for complex web automation
Game/Graphics Applications:
- DirectX/OpenGL apps may not be capturable by standard screenshot
- May require specialized tools (e.g., OBS Studio's capture API)
Protected Content:
- DRM-protected content (Netflix, etc.) cannot be screenshotted
- This is a system-level restriction
Integration with Other Tools
With ChatGPT/AI Assistants
This skill is designed to work with AI assistants like nanobot. Here's how to integrate:
# Example: AI assistant using this skill
def ai_assisted_automation(user_request):
"""
AI 助手使用自动化技能
Args:
user_request: 用户的自然语言请求
"""
# 1. AI 解析用户意图
intent = parse_intent(user_request)
if intent == 'screenshot':
# 2. 执行截图
screenshot = pyautogui.screenshot()
timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
path = f"screenshot_{timestamp}.png"
screenshot.save(path)
return f"已截图并保存到: {path}"
elif intent == 'click_button':
# 2. 查找并点击按钮
button_name = extract_button_name(user_request)
location = pyautogui.locateOnScreen(f"{button_name}.png")
if location:
pyautogui.click(pyautogui.center(location))
return f"已点击按钮: {button_name}"
else:
return f"未找到按钮: {button_name}"
# ... 其他意图处理
With CI/CD Pipelines
# Example: GitHub Actions using this skill for visual testing
name: Visual Regression Tests
on: [push, pull_request]
jobs:
visual-test:
runs-on: macos-latest # or windows-latest
steps:
- uses: actions/checkout@v3
- name: Set up Python
uses: actions/setup-python@v4
with:
python-version: '3.11'
- name: Install dependencies
run: |
pip install pyautogui opencv-python-headless numpy Pillow
- name: Run visual tests
run: python tests/visual_regression.py
- name: Upload screenshots
uses: actions/upload-artifact@v3
with:
name: screenshots
path: screenshots/
With Monitoring Systems
# Example: Integration with Prometheus/Grafana for screen monitoring
from prometheus_client import Gauge, start_http_server
import pyautogui
import time
# Define metrics
screen_change_gauge = Gauge('screen_change_score', 'Screen change detection score')
template_match_gauge = Gauge('template_match_confidence', 'Template matching confidence')
start_http_server(8000)
def monitoring_loop():
baseline = None
while True:
# Capture screen
current = pyautogui.screenshot()
current_array = np.array(current)
if baseline is not None:
# Calculate change
diff = cv2.absdiff(baseline, current_array)
diff_score = np.mean(diff)
screen_change_gauge.set(diff_score)
baseline = current_array
# Check for template
try:
location = pyautogui.locateOnScreen('alert_icon.png', confidence=0.8)
if location:
template_match_gauge.set(1.0)
else:
template_match_gauge.set(0.0)
except:
template_match_gauge.set(0.0)
time.sleep(5)
monitoring_loop()
Future Roadmap
Planned Features
-
Linux Support
- X11 and Wayland compatibility
xdotoolandscrotintegrationmssfor multi-monitor support
-
AI-Powered Recognition
- Integration with OpenAI GPT-4V or Google Gemini for visual understanding
- Natural language element finding ("click the blue submit button")
- OCR-free text extraction using vision models
-
Mobile Device Support
- Android: ADB (Android Debug Bridge) integration
- iOS: WebDriverAgent via Appium
- Screenshot and touch simulation
-
Cloud Integration
- AWS Lambda support for serverless automation
- Azure Functions and GCP Cloud Functions compatibility
- Distributed screenshot processing
-
Advanced Analytics
- Built-in A/B testing framework for UI changes
- Heatmap generation from user interactions
- Performance regression detection
Contributing
We welcome contributions! Please see the Contributing Guide for details on:
- Code style and formatting
- Testing requirements
- Documentation standards
- Pull request process
License
This skill is licensed under the MIT License. See LICENSE for details.
Last Updated: 2026-03-06
Version: 1.0.0
Maintainer: nanobot skills team