Browser Content Capture
Use the agent-browser CLI to capture web content that traditional scrapers cannot access.
Overview
This skill enables content extraction from sources that require browser-level access:
- JavaScript-rendered SPAs (React, Vue, Angular apps)
- Login-protected documentation (private wikis, gated content)
- Dynamic content (infinite scroll, lazy loading, client-side routing)
- Multi-page site crawls (documentation trees, tutorial series)
When to Use
Use when:
- WebFetch returns empty or partial content
- Page requires JavaScript execution to render
- Content is behind authentication
- Need to navigate multi-page structures
- Extracting from client-side routed apps
Do NOT use when:
- Static HTML pages (use WebFetch, which is faster)
- Public API endpoints (use direct HTTP calls)
- Simple RSS/Atom feeds
Quick Start
Basic Capture Pattern
1. Navigate to URL
agent-browser open https://docs.example.com
2. Wait for content to render
agent-browser wait --load networkidle
3. Get interactive snapshot
agent-browser snapshot -i
4. Extract text content
agent-browser get text body
5. Take screenshot
agent-browser screenshot /tmp/capture.png
6. Close when done
agent-browser close
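The six steps above can be wrapped in one function so that the browser is always closed, even when an intermediate step fails. This is a minimal sketch: the function name `capture_page` and the output naming (`PREFIX.txt`, `PREFIX.png`) are assumptions for illustration, and it uses only the subcommands shown above.

```shell
# capture_page URL OUT_PREFIX   (hypothetical helper, not part of agent-browser)
# Opens URL, waits for the network to go idle, then writes the page text to
# OUT_PREFIX.txt and a screenshot to OUT_PREFIX.png.
capture_page() {
  agent-browser open "$1" || return 1
  rc=0
  # Stop at the first failing step and remember that something went wrong.
  agent-browser wait --load networkidle &&
    agent-browser get text body > "$2.txt" &&
    agent-browser screenshot "$2.png" || rc=1
  # Always release the browser, even after a failure.
  agent-browser close
  return "$rc"
}
```

A call such as `capture_page https://docs.example.com /tmp/capture` would then leave `/tmp/capture.txt` and `/tmp/capture.png` behind on success.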
agent-browser Commands Reference
| Command | Purpose | When to Use |
|---|---|---|
| open <url> | Go to URL | First step of any capture |
| snapshot -i | Get interactive element tree | Understanding page structure |
| eval "<script>" | Run custom JS | Extract specific content |
| click @e# | Click elements | Navigate menus, pagination |
| fill @e# "value" | Fill inputs | Authentication flows |
| wait @e# | Wait for element | Dynamic content loading |
| screenshot <path> | Capture image | Visual verification |
| console | Read JS console | Debug extraction issues |
| network requests | Monitor XHR/fetch | Find API endpoints |
Quick reference: See references/agent-browser-commands.md or run agent-browser --help
Capture Patterns
Pattern 1: SPA Content Extraction
For React/Vue/Angular apps where content renders client-side:
Navigate and wait for hydration
agent-browser open https://react-docs.example.com
agent-browser wait --load networkidle
Get snapshot to identify content element
agent-browser snapshot -i
Extract after framework mounts (use ref from snapshot)
agent-browser get text @e5 # Main content area
Or use eval for custom extraction
agent-browser eval "document.querySelector('article').innerText"
Details: See references/spa-extraction.md
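The `eval` one-liner in this pattern throws if the page has no `article` element. One possible workaround is to build a JavaScript expression that tries several selectors in order and falls back to an empty string. The helper name `first_match_js` is hypothetical, and the sketch assumes the selectors contain no single quotes.

```shell
# first_match_js SELECTOR...   (hypothetical helper)
# Prints a JS expression that evaluates to the innerText of the first
# selector that matches, or to "" when none of them match.
first_match_js() (
  js=''
  for sel in "$@"; do
    js="${js}document.querySelector('${sel}') || "
  done
  # Optional chaining keeps the expression from throwing on a null match.
  printf '(%snull)?.innerText ?? ""' "$js"
)
```

It could be combined with the command above as `agent-browser eval "$(first_match_js article main '[role=main]')"`.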
Pattern 2: Authentication Flow
For login-protected content:
Navigate to login
agent-browser open https://docs.example.com/login
agent-browser snapshot -i
Fill credentials (refs from snapshot)
agent-browser fill @e1 "user@example.com" # Email field
agent-browser fill @e2 "password123" # Password field
Click submit and wait for redirect
agent-browser click @e3
agent-browser wait --url "**/dashboard"
Save authenticated state for reuse
agent-browser state save /tmp/auth-state.json
Now navigate to protected content
agent-browser open https://docs.example.com/private-docs
Details: See references/auth-handling.md
Pattern 3: Multi-Page Crawl
For documentation with navigation trees:
Get all page links from sidebar
agent-browser open https://docs.example.com
agent-browser snapshot -i
Extract links via eval
LINKS=$(agent-browser eval "JSON.stringify(Array.from(document.querySelectorAll('nav a')).map(a => a.href))")
Iterate and capture each page
for link in $(echo "$LINKS" | jq -r '.[]'); do
  agent-browser open "$link"
  agent-browser wait --load networkidle
  agent-browser get text body > "/tmp/content-$(basename "$link").txt"
done
Details: See references/multi-page-crawl.md
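Naming output files with `basename` gives `/guide/intro` and `/api/intro` the same file name, so later pages can overwrite earlier ones. The sketch below derives the name from the whole URL and adds a pause between navigations. Both helper names (`url_to_slug`, `crawl_links`) are assumptions for illustration; the agent-browser calls are the same ones used in the loop above.

```shell
# url_to_slug URL   (hypothetical helper)
# Drops the scheme and trailing slashes, then replaces every character
# outside [A-Za-z0-9._-] with "_" to get a filesystem-safe name.
url_to_slug() {
  printf '%s\n' "$1" \
    | sed -e 's|^[a-zA-Z][a-zA-Z0-9+.-]*://||' -e 's|/*$||' \
    | tr -c 'A-Za-z0-9._\n-' '_'
}

# crawl_links DELAY_SECONDS OUT_DIR   (hypothetical helper)
# Reads one URL per line on stdin and saves each page's text into OUT_DIR.
crawl_links() {
  delay=$1
  out_dir=$2
  while IFS= read -r link; do
    [ -n "$link" ] || continue
    # </dev/null keeps the CLI from consuming the URL list on stdin.
    agent-browser open "$link" </dev/null || continue
    agent-browser wait --load networkidle </dev/null
    agent-browser get text body </dev/null > "$out_dir/$(url_to_slug "$link").txt"
    sleep "$delay"   # pause between navigations
  done
}
```

With the `LINKS` variable from above, usage might look like `echo "$LINKS" | jq -r '.[]' | crawl_links 2 /tmp/crawl`, assuming `/tmp/crawl` already exists.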
Session Management
Save and Reuse Authentication
Login once and save state
agent-browser open https://app.example.com/login
agent-browser snapshot -i
agent-browser fill @e1 "$USERNAME"
agent-browser fill @e2 "$PASSWORD"
agent-browser click @e3
agent-browser wait --url "**/dashboard"
agent-browser state save /tmp/app-auth.json
Later: restore state
agent-browser state load /tmp/app-auth.json
agent-browser open https://app.example.com/protected-content
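The two halves above (log in once, restore later) can be combined into a single helper that loads the saved state when it exists and logs in otherwise. This is a sketch under assumptions: the name `ensure_auth` is hypothetical, the refs `@e1` to `@e3` are placeholders that must come from a real snapshot, and the `chmod` assumes the state file holds session data worth protecting.

```shell
# ensure_auth STATE_FILE LOGIN_URL   (hypothetical helper)
# Reuses STATE_FILE when it exists and is non-empty; otherwise logs in with
# $USERNAME and $PASSWORD and saves the resulting state.
ensure_auth() {
  state_file=$1
  login_url=$2
  if [ -s "$state_file" ]; then
    agent-browser state load "$state_file"
    return
  fi
  # Fail early with a clear message if credentials are missing.
  : "${USERNAME:?set USERNAME}" "${PASSWORD:?set PASSWORD}"
  agent-browser open "$login_url" &&
    agent-browser snapshot -i >/dev/null &&
    agent-browser fill @e1 "$USERNAME" &&   # placeholder ref: email field
    agent-browser fill @e2 "$PASSWORD" &&   # placeholder ref: password field
    agent-browser click @e3 &&              # placeholder ref: submit button
    agent-browser wait --url "**/dashboard" &&
    agent-browser state save "$state_file" &&
    chmod 600 "$state_file"                 # keep the saved session private
}
```

A saved session can expire; deleting the state file forces a fresh login on the next call.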
Parallel Sessions
Run isolated sessions for different tasks
agent-browser --session scrape1 open https://site1.com
agent-browser --session scrape2 open https://site2.com
Extract from each
agent-browser --session scrape1 get text body > site1.txt
agent-browser --session scrape2 get text body > site2.txt
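The sessions above still run one after another unless each is started in the background. A possible way to run them concurrently is sketched below. The function name `capture_parallel` and the `scrapeN` session naming are assumptions, and so is the idea that `--session` can be combined with `wait` and `close` the same way it is combined with `open` and `get text` above.

```shell
# capture_parallel OUT_DIR URL...   (hypothetical helper)
# Starts one background job per URL, each in its own session, and writes the
# page text to OUT_DIR/scrape1.txt, OUT_DIR/scrape2.txt, and so on.
capture_parallel() {
  out_dir=$1
  shift
  n=0
  for url in "$@"; do
    n=$((n + 1))
    (
      s="scrape$n"
      agent-browser --session "$s" open "$url" &&
        agent-browser --session "$s" wait --load networkidle &&
        agent-browser --session "$s" get text body > "$out_dir/$s.txt"
      # Assumed to close only this session's browser.
      agent-browser --session "$s" close
    ) &
  done
  wait   # block until every background session has finished
}
```

For the two sites above this would be `capture_parallel . https://site1.com https://site2.com`.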
Fallback Strategy
Use this decision tree for content capture:
User requests content from URL
         │
         ▼
   ┌─────────────┐
   │ Try WebFetch│ ← Fast, no browser needed
   └─────────────┘
         │
   Content OK? ──Yes──► Done
         │
         No (empty/partial)
         │
         ▼
┌──────────────────┐
│ Use agent-browser│
└──────────────────┘
         │
         ├─ Known SPA (react, vue, angular) ──► wait --load networkidle
         ├─ Requires login ──► Authentication flow with state save
         └─ Dynamic content ──► wait @element or wait --text
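The "Content OK?" branch in the tree needs some concrete test. One rough heuristic is to count the non-whitespace characters in the text returned by the first attempt and fall back to the browser when there are too few. The function name `needs_browser` and the default threshold of 200 characters are arbitrary assumptions, not values from this skill.

```shell
# needs_browser FILE [MIN_CHARS]   (hypothetical helper)
# Succeeds (exit 0) when FILE holds fewer than MIN_CHARS non-whitespace
# characters, which suggests the first fetch came back empty or partial.
needs_browser() {
  min_chars=${2:-200}   # assumed default; tune per site
  # $(( ... )) strips the padding some wc implementations add.
  chars=$(( $(tr -d '[:space:]' < "$1" | wc -c) ))
  [ "$chars" -lt "$min_chars" ]
}
```

If the first attempt's text were saved to `/tmp/fetch.txt`, the fallback could read `if needs_browser /tmp/fetch.txt; then agent-browser open "$URL"; fi`.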
Best Practices
- Minimize Browser Usage
  - Always try WebFetch first (10x faster, no browser overhead)
  - Cache extracted content to avoid re-scraping
  - Use get text @e# to extract only needed content
- Handle Dynamic Content
  - Always use wait after navigation
  - Use wait --load networkidle for heavy SPAs
  - Use wait --text "Expected" for specific content
- Respect Rate Limits
  - Add delays between page navigations
  - Don't crawl faster than a human would browse
  - Honor robots.txt and terms of service
- Clean Extracted Content
  - Use targeted refs from snapshot to extract main content
  - Use eval to remove noise elements before extraction
  - Convert to clean markdown for downstream processing
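As a small illustration of the "Clean Extracted Content" practice, text from `get text` often carries trailing whitespace and long runs of blank lines. The filter below normalizes only whitespace; it does not convert anything to markdown. The name `clean_text` is hypothetical.

```shell
# clean_text   (hypothetical helper)
# Reads text on stdin, strips trailing whitespace from every line, drops
# leading blank lines, and squeezes each run of blank lines down to one.
clean_text() {
  sed -e 's/[[:space:]]*$//' \
    | awk 'BEGIN { blank = 1 } NF { blank = 0; print; next } !blank++ { print }'
}
```

It can sit directly in a capture pipeline, for example `agent-browser get text @e5 | clean_text > /tmp/content.txt`.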
Troubleshooting
| Issue | Solution |
|---|---|
| Empty content | Add wait --load networkidle after navigation |
| Partial render | Use wait --text "Expected content" |
| Login required | Use authentication flow with state save/load |
| CAPTCHA blocking | Manual intervention required |
| Content in iframe | Use frame @e# then extract |
Related Skills
- browser-automation: agent-browser CLI quick start and integration
- webapp-testing: Playwright test automation patterns
- streaming-api-patterns: Handle SSE progress updates
Version: 2.0.0 (January )
Browser Tool: agent-browser CLI (replaces Playwright MCP)
Capability Details
spa-extraction
Keywords: react, vue, angular, spa, javascript, client-side, hydration, ssr
Solves:
- WebFetch returns empty content
- Page requires JavaScript to render
- React/Vue app content extraction
auth-handling
Keywords: login, authentication, session, cookie, protected, private, gated
Solves:
- Content behind login wall
- Need to authenticate first
- Private documentation access
multi-page-crawl
Keywords: crawl, sitemap, navigation, multiple pages, documentation, tutorial series
Solves:
- Capture entire documentation site
- Extract multiple pages
- Follow navigation links
agent-browser-commands
Keywords: agent-browser, open, snapshot, click, fill, eval, get text
Solves:
- Which command to use
- Browser automation reference
- agent-browser CLI guide