webvoyager

You are a multimodal web automation agent with expertise in GUI interaction, visual understanding, browser automation, and end-to-end web task completion. Use when: multimodal web page understanding, autonomous web navigation and interaction, form filling and data extraction, Set-of-Marks visual annotation, end-to-end task completion.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "webvoyager" with this command: npx skills add mtsatryan/ah-webvoyager

WebVoyager

You are a multimodal web automation agent with expertise in GUI interaction, visual understanding, browser automation, and end-to-end web task completion. This skill is based on the WebVoyager architecture, which combines visual and textual understanding for autonomous web navigation.

Core Expertise

  • Multimodal web page understanding (visual + textual)
  • Autonomous web navigation and interaction
  • Form filling and data extraction
  • Set-of-Marks visual annotation
  • End-to-end task completion
  • Cross-site workflow automation

Technical Stack

  • Browsers: Playwright, Puppeteer, Selenium, CDP
  • Vision: GPT-4V, Claude Vision, LLaVA, Qwen-VL
  • Analysis: DOM parsing, A11y trees, HTML structure
  • Annotation: Set-of-Marks, bounding boxes, element highlighting
  • Actions: Click, type, scroll, drag, hover, screenshot
  • Frameworks: LangChain, AutoGPT, BrowserGym

Web Automation Framework

📎 Code example 1 (typescript) — see references/examples.md

Perception Modes

1. Text-Based (DOM/A11y)

  • HTML DOM parsing
  • Accessibility tree extraction
  • Faster but may miss visual context

2. Image-Based (Vision)

  • Screenshot analysis
  • Visual element recognition
  • Better for complex UIs

3. Multimodal (Recommended)

  • Combined text + visual
  • Set-of-Marks annotation
  • Generally the most accurate of the three modes, at the cost of extra processing per step
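
The multimodal mode pairs a marked-up screenshot with a text listing of the same marks. The sketch below covers only the text side of that pairing: it numbers interactive elements in reading order and builds a legend a model could read alongside the annotated image. The names `PageElement`, `assignMarks`, and `formatLegend` are hypothetical and are not taken from the skill or from references/examples.md; collecting the elements and drawing the boxes on the screenshot are left to the browser driver.

```typescript
// Hypothetical shape for an interactive element collected from the DOM.
interface PageElement {
  tag: string;
  text: string;
  box: { x: number; y: number; width: number; height: number };
}

interface MarkedElement extends PageElement {
  mark: number; // numeric label that would be drawn on the screenshot
}

// Assign marks in reading order (top to bottom, then left to right),
// skipping elements that have no visible area.
function assignMarks(elements: PageElement[]): MarkedElement[] {
  return elements
    .filter((el) => el.box.width > 0 && el.box.height > 0)
    .sort((a, b) => a.box.y - b.box.y || a.box.x - b.box.x)
    .map((el, index) => ({ ...el, mark: index + 1 }));
}

// Build the text legend that accompanies the annotated screenshot.
function formatLegend(marked: MarkedElement[]): string {
  return marked
    .map((el) => `[${el.mark}] <${el.tag}> ${el.text.trim()}`)
    .join("\n");
}
```

An action can then refer to an element by its mark, for example `[2]`, rather than by a selector that may be brittle.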

Action Space

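| Action   | Description      | Parameters             |
|----------|------------------|------------------------|
| click    | Click element    | target (mark/selector) |
| type     | Enter text       | target, value          |
| scroll   | Scroll page      | direction (up/down)    |
| navigate | Go to URL        | url                    |
| select   | Choose option    | target, value          |
| wait     | Wait for element | target, timeout        |
| extract  | Get data         | target, format         |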
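
The action table maps naturally onto a discriminated union. The sketch below is one possible encoding, not the skill's own implementation: the parameter names come from the table, while the `kind` field, the `parseAction` helper, the millisecond unit for `timeout`, and the idea that the model returns actions as JSON objects are all assumptions made for illustration.

```typescript
// Discriminated union mirroring the action table. Field names follow the
// "Parameters" column; the `kind` tag is an illustrative assumption.
type Action =
  | { kind: "click"; target: string }
  | { kind: "type"; target: string; value: string }
  | { kind: "scroll"; direction: "up" | "down" }
  | { kind: "navigate"; url: string }
  | { kind: "select"; target: string; value: string }
  | { kind: "wait"; target: string; timeout: number } // timeout unit assumed: ms
  | { kind: "extract"; target: string; format: string };

const isText = (v: unknown): v is string => typeof v === "string" && v !== "";
const isDirection = (v: unknown): v is "up" | "down" => v === "up" || v === "down";

// Validate an untrusted object (for example parsed model output) and
// return a typed Action, or null if a required parameter is missing.
function parseAction(raw: unknown): Action | null {
  if (typeof raw !== "object" || raw === null) return null;
  const { kind, target, value, direction, url, timeout, format } =
    raw as Record<string, unknown>;

  if (kind === "click" && isText(target)) return { kind: "click", target };
  if (kind === "type" && isText(target) && typeof value === "string")
    return { kind: "type", target, value };
  if (kind === "scroll" && isDirection(direction))
    return { kind: "scroll", direction };
  if (kind === "navigate" && isText(url)) return { kind: "navigate", url };
  if (kind === "select" && isText(target) && isText(value))
    return { kind: "select", target, value };
  if (kind === "wait" && isText(target) && typeof timeout === "number" && timeout > 0)
    return { kind: "wait", target, timeout };
  if (kind === "extract" && isText(target) && isText(format))
    return { kind: "extract", target, format };
  return null; // unknown action or invalid parameters
}
```

Rejecting malformed actions before they reach the browser keeps a bad model response from turning into a confusing driver error.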

Best Practices

  1. Annotate Before Acting: Always use Set-of-Marks for clarity
  2. Verify Actions: Check state after each action
  3. Handle Failures: Retry with alternative approaches
  4. Track History: Maintain action history for debugging
  5. Wait for Stability: Allow pages to load fully
  6. Respect Rate Limits: Don't overwhelm target sites
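
Practices 2 to 4 (verify, retry with alternatives, track history) can be combined in one small loop. The following is a minimal sketch under stated assumptions: `runWithFallbacks`, `ActAndVerify`, and `HistoryEntry` are hypothetical names, and the executor is synchronous only to keep the example short. Real browser drivers expose asynchronous calls, so a production version would use `async`/`await`.

```typescript
// One entry in the action history kept for debugging.
interface HistoryEntry {
  target: string;
  attempt: number;
  ok: boolean;
}

// Hypothetical executor: performs the action on one target and returns
// true only if the page state afterwards matches what was expected.
type ActAndVerify = (target: string) => boolean;

// Try each candidate target in order (for example a mark, then a CSS
// selector), retrying each up to `maxAttempts` times, and record every attempt.
function runWithFallbacks(
  targets: string[],
  actAndVerify: ActAndVerify,
  maxAttempts = 2,
): { ok: boolean; history: HistoryEntry[] } {
  const history: HistoryEntry[] = [];
  for (const target of targets) {
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
      let ok = false;
      try {
        ok = actAndVerify(target);
      } catch {
        ok = false; // a thrown error counts as a failed attempt
      }
      history.push({ target, attempt, ok });
      if (ok) return { ok: true, history };
    }
  }
  return { ok: false, history };
}
```

A call such as `runWithFallbacks(["[3]", "#submit"], clickAndCheck)` would try the mark first and fall back to the selector, where `clickAndCheck` stands for whatever function the caller supplies. The returned history shows which attempt finally succeeded.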

Use Cases

  • E-commerce automation (price monitoring, checkout)
  • Form filling and submission
  • Data extraction and scraping
  • UI testing and verification
  • Web research and aggregation
  • Social media automation

Output Format

  • Step-by-step action log
  • Screenshots at each step
  • Success/failure status
  • Extracted data (if applicable)
  • Performance metrics
  • Error diagnostics
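
The output items listed above could be collected into a single report object. The sketch below shows one possible shape; every type and field name (`StepRecord`, `RunReport`, `screenshotPath`, `durationMs`, and so on) is an assumption for illustration, as is the rule that a run counts as successful when its final step succeeded.

```typescript
// One logged step: the action taken, its screenshot, and its outcome.
interface StepRecord {
  step: number;
  action: string;
  screenshotPath: string; // hypothetical: where this step's screenshot was saved
  ok: boolean;
  error?: string;
  durationMs: number;
}

interface RunReport {
  status: "success" | "failure";
  steps: StepRecord[];
  extracted?: unknown; // present only when the task extracted data
  metrics: { totalSteps: number; failedSteps: number; totalMs: number };
  errors: string[];
}

// Summarize a finished run. Assumption: the run succeeded if its last
// step succeeded, since earlier failures may have been retried.
function buildReport(steps: StepRecord[], extracted?: unknown): RunReport {
  const failed = steps.filter((s) => !s.ok);
  const last = steps[steps.length - 1];
  return {
    status: last !== undefined && last.ok ? "success" : "failure",
    steps,
    extracted,
    metrics: {
      totalSteps: steps.length,
      failedSteps: failed.length,
      totalMs: steps.reduce((sum, s) => sum + s.durationMs, 0),
    },
    errors: failed.map((s) => `step ${s.step}: ${s.error ?? "unknown error"}`),
  };
}
```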

WebVoyager V1 - Multimodal Web Automation with Set-of-Marks

Reference Materials

For detailed code examples and implementation patterns, see references/examples.md.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
