actionbook-scraper

Actionbook Scraper Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "actionbook-scraper" with this command: npx skills add actionbook/actionbook/actionbook-actionbook-actionbook-scraper


⚠️ CRITICAL: Two-Part Verification

Every generated script MUST pass BOTH checks:

| Check | What to Verify | Failure Example |
|-------|----------------|-----------------|
| Part 1: Script Runs | No errors, no timeouts | Selector not found |
| Part 2: Data Correct | Content matches expected | Extracted "Click to expand" instead of name |

```
1. Generate Script
        ↓
2. Execute Script
        ↓
3. Check Part 1: Script runs without errors?
        ↓
4. Check Part 2: Data content is correct?
   - Not empty
   - Not placeholder text ("Loading...")
   - Not UI text ("Click to expand")
   - Fields mapped correctly
        ↓
   BOTH Pass ──────→ Output Script
        │
   Either Fails
        ↓
   Is it an Actionbook data issue?
     Yes → Log to .actionbook-issues.log
     No  → Fix script
        ↓
   Retry (max 3x) → back to step 2
```

Default Output Format

/actionbook-scraper:generate <url>

DEFAULT = agent-browser script (bash commands)

```bash
agent-browser open "https://example.com"
agent-browser scroll down 2000
agent-browser get text ".selector"
agent-browser close
```

With --standalone Flag

/actionbook-scraper:generate <url> --standalone

Output = Playwright JavaScript code

Verification Requirements

Two-Part Verification

Every generated script must pass BOTH checks:

| Check | What to Verify | Failure Action |
|-------|----------------|----------------|
| 1. Script Runs | No errors, no timeouts | Fix syntax/selector errors |
| 2. Data Correct | Content matches expected fields | Fix extraction logic |

Part 1: Script Execution Check

  • No runtime errors

  • No timeout errors

  • Browser closes properly

Part 2: Data Content Check (CRITICAL)

Verify extracted data matches the expected structure:

Expected: Company name, description, website, year founded
Actual: "Click to expand", "Loading...", empty strings

→ FAIL: Data content is incorrect; fix the extraction logic.

Data validation rules:

| Rule | Example Failure | Fix |
|------|-----------------|-----|
| Fields not empty | name: "" | Check selector targets the correct element |
| No placeholder text | name: "Loading..." | Add wait for dynamic content |
| No UI text | name: "Click to expand" | Extract content after expanding, not the button text |
| Correct data type | year: "View Details" | Wrong selector; fix field mapping |
| Reasonable count | Expected ~100, got 3 | Add scroll/pagination handling |
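For illustration, the rules above can be expressed as a small validation helper. This is a sketch, not part of the skill: the placeholder strings and the 50% count threshold are assumptions.

```javascript
// Sketch of the data validation rules. The placeholder list and the
// 50% count threshold are illustrative assumptions, not skill behavior.
const PLACEHOLDERS = ['Loading...', 'Click to expand', 'View Details'];

function validateRecord(record) {
  const problems = [];
  for (const [field, value] of Object.entries(record)) {
    if (typeof value === 'string' && value.trim() === '') {
      problems.push(`${field} is empty: check the selector targets the correct element`);
    } else if (PLACEHOLDERS.includes(value)) {
      problems.push(`${field} looks like UI/placeholder text ("${value}")`);
    }
  }
  return problems;
}

function validateBatch(records, expectedCount) {
  const problems = records.flatMap(validateRecord);
  // "Reasonable count": flag if we got far fewer items than expected.
  if (records.length < expectedCount * 0.5) {
    problems.push(`only ${records.length} of ~${expectedCount} items: add scroll/pagination handling`);
  }
  return problems;
}
```

A batch that passes both field and count checks returns an empty problem list.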

For agent-browser Scripts

  1. Execute the generated commands.
  2. Check the script runs without errors.
  3. Check the data content is correct:
     • Fields match the expected structure
     • Values are actual data, not UI text
     • Count is reasonable
  4. If failed:
     • Analyze what's wrong (script error vs. data error)
     • Fix the selector, wait logic, or extraction
     • Re-execute
  5. If success:
     • Output the verified script
     • Show a data preview with field validation

For Playwright Scripts (--standalone)

  1. Write the script to a temp file.
  2. Run it with node script.js.
  3. Check the script runs without errors.
  4. Check the output data is correct:
     • JSON structure matches the expected fields
     • Values contain actual data
     • Count matches the expected range
  5. If failed:
     • Analyze the error type
     • Fix the script
     • Re-run
  6. If success:
     • Output the verified script

Architecture Overview

```
/generate <url>              → OUTPUT: agent-browser bash commands
/generate <url> --standalone → OUTPUT: Playwright .js file
```

```
/generate <url>

1. Search Actionbook → get selectors
2. Generate OUTPUT:

   WITHOUT --standalone     │  WITH --standalone
   ─────────────────────    │  ──────────────────
   agent-browser commands   │  Playwright .js code
   (bash)                   │  (javascript)
   agent-browser open ...   │  const { chromium } = ...
   agent-browser get ...    │  await page.goto(...)
   agent-browser close      │
```

Tool Priority

| Operation | Primary Tool | Fallback | Notes |
|-----------|--------------|----------|-------|
| Find selectors for URL | search_actions | None | Search by domain/keywords |
| Get full selector details | get_action_by_id | None | Use action_id from search |
| List available sources | list_sources | search_sources | Browse all indexed sites |
| Generate agent-browser script | Agent (sonnet) | | Default mode for /generate |
| Generate Playwright script | Agent (sonnet) | | Use --standalone flag |
| Structure analysis | Agent (haiku) | | Parse Actionbook response |
| Request new website | agent-browser | Manual | Submit to actionbook.dev (ONLY command that executes agent-browser) |

Workflow Rules

CRITICAL: Generate → Verify → Fix

Every generated script MUST be verified by executing it.

| Step | Action |
|------|--------|
| 1 | Generate script with Actionbook selectors |
| 2 | Execute script to verify it works |
| 3 | If failed: analyze error, fix script, go to step 2 |
| 4 | If success: output verified script + data preview |
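The four-step loop with its 3-attempt cap can be sketched as follows. `generateScript`, `executeScript`, and `fixScript` are hypothetical stand-ins for the agent's behavior, not real skill APIs.

```javascript
// Sketch of the Generate → Verify → Fix loop with the 3-attempt cap.
// The three injected functions are hypothetical stand-ins, not skill APIs.
function generateVerified(url, { generateScript, executeScript, fixScript }) {
  let script = generateScript(url);                 // Step 1: generate
  for (let attempt = 1; attempt <= 3; attempt++) {  // retry at most 3x
    const result = executeScript(script);           // Step 2: execute
    if (result.ok && result.dataValid) {            // both checks pass
      return { script, preview: result.data.slice(0, 3) }; // Step 4
    }
    script = fixScript(script, result);             // Step 3: fix, retry
  }
  throw new Error('Still failing after 3 attempts: report the specific issue');
}
```

On success it returns the verified script plus the first three items as a data preview; after three failed attempts it reports the issue instead of outputting a broken script.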

Verification Process

For agent-browser scripts:

  1. Execute each command:

```bash
agent-browser open "https://example.com"
agent-browser wait --load networkidle
agent-browser get text ".selector"
```

  2. Check if data is returned.
  3. If error → fix and retry.
  4. Close the browser with agent-browser close.

For Playwright scripts (--standalone):

  1. Write to a temp file and execute:

```bash
node /tmp/scraper.js
```

  2. Check if the output file has data.
  3. If error → fix and retry.

Critical Rules

  • ALWAYS verify generated scripts - Execute and check BOTH parts

  • Part 1: Script must run - No errors, no timeouts

  • Part 2: Data must be correct - Not empty, not UI text, fields mapped correctly

  • Fix errors automatically - Don't output broken scripts or wrong data

  • Use Actionbook MCP tools first - Never guess selectors

  • Include scroll handling for lazy-loaded pages

  • Include expand/collapse logic for card-based layouts

  • Always close browser - Include agent-browser close

  • Retry up to 3 times - If still failing, report the specific issue

Common Data Errors to Catch

| Error | Example | Fix |
|-------|---------|-----|
| Extracted button text | name: "Click to expand" | Extract content after expanding |
| Extracted placeholder | desc: "Loading..." | Add wait for dynamic content |
| Empty fields | name: "" | Fix selector |
| Wrong field mapping | year: "San Francisco" | Fix selector for each field |
| Too few items | Expected 100, got 3 | Add scroll/pagination |

Record Actionbook Data Issues

If Actionbook selectors are wrong or outdated, record the issue to a local file:

.actionbook-issues.log

When to record:

  • Selector doesn't exist on page

  • Selector returns wrong element

  • Page structure has changed

  • Missing selectors for key elements

Log format:

```
[YYYY-MM-DD HH:MM]
URL: {url}
Action ID: {action_id}
Issue Type: {selector_error | outdated | missing}
Details: {description}
Selector: {selector}
Expected: {what it should select}
Actual: {what it actually selects or error}
```
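A small formatter following the log template above might look like this. The helper name and the shape of its argument are illustrative assumptions.

```javascript
// Sketch: format one .actionbook-issues.log entry per the template above.
// The function name and argument shape are illustrative assumptions.
function formatIssueEntry(issue) {
  // "[YYYY-MM-DD HH:MM]" from an ISO timestamp.
  const ts = new Date(issue.when).toISOString().slice(0, 16).replace('T', ' ');
  return [
    `[${ts}]`,
    `URL: ${issue.url}`,
    `Action ID: ${issue.actionId}`,
    `Issue Type: ${issue.type}`, // selector_error | outdated | missing
    `Details: ${issue.details}`,
    `Selector: ${issue.selector}`,
    `Expected: ${issue.expected}`,
    `Actual: ${issue.actual}`,
  ].join('\n');
}
```

The resulting string can be appended to `.actionbook-issues.log` with one entry per verification failure.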

Selector Priority

When Actionbook provides multiple selectors, prefer them in this order:

  1. data-testid - Most stable, designed for automation
  2. aria-label - Accessibility-based, semantic
  3. css - Class-based selectors
  4. xpath - Last resort, most fragile
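The priority order above can be sketched as a small picker; the input shape (a map from selector type to selector string) is an assumption for illustration.

```javascript
// Sketch: choose the most stable selector from the types Actionbook
// provides, following the priority order above. The input shape is assumed.
const SELECTOR_PRIORITY = ['data-testid', 'aria-label', 'css', 'xpath'];

function pickSelector(selectors) {
  // selectors: e.g. { css: '.card__title', xpath: '//div[1]/h2' }
  for (const type of SELECTOR_PRIORITY) {
    if (selectors[type]) return { type, selector: selectors[type] };
  }
  return null; // no usable selector available
}
```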

Commands

| Command | Description | Agent |
|---------|-------------|-------|
| /actionbook-scraper:analyze <url> | Analyze page structure and show available selectors | structure-analyzer |
| /actionbook-scraper:generate <url> | Generate agent-browser scraper script | code-generator |
| /actionbook-scraper:generate <url> --standalone | Generate Playwright/Puppeteer script | code-generator |
| /actionbook-scraper:list-sources | List websites with Actionbook data | |
| /actionbook-scraper:request-website <url> | Request new website to be indexed (uses agent-browser) | website-requester |

Data Flow

Analyze Command

  1. User: /actionbook-scraper:analyze https://example.com/page
  2. Extract domain from URL → "example.com"
  3. search_actions("example page") → [action_ids]
  4. For best match: get_action_by_id(action_id) → full selector data
  5. Structure-analyzer agent formats and presents findings

Generate Command (Default: agent-browser script)

User: /actionbook-scraper:generate https://example.com/page

Step 1: Search Actionbook
  search_actions("example.com page") → action_ids

Step 2: Get selectors
  get_action_by_id(best_match) → selectors

Step 3: Generate agent-browser script

```bash
agent-browser open "https://example.com/page"
agent-browser wait --load networkidle
agent-browser scroll down 2000
agent-browser get text ".item-container"
agent-browser close
```

Step 4: VERIFY script (REQUIRED)
Execute the commands and check if data is extracted
If failed → analyze error → fix script → retry (max 3x)

Step 5: Return verified script + data preview

**Example Output:**
````markdown
## Verified Scraper (agent-browser)

**Status**: ✅ Verified (extracted 50 items)

Run these commands to scrape:

```bash
agent-browser open "https://example.com/page"
agent-browser wait --load networkidle
agent-browser scroll down 2000
agent-browser get text ".item-container"
agent-browser close
```

### Data Preview

```json
[
  {"name": "Item 1", "description": "..."},
  {"name": "Item 2", "description": "..."},
  // ... showing first 3 items
]
```
````

### Generate Command (--standalone: Playwright script)

User: /actionbook-scraper:generate https://example.com/page --standalone

Step 1: Search Actionbook for selectors
Step 2: Get full selector data
Step 3: Generate Playwright/Puppeteer script
Step 4: VERIFY script (REQUIRED)
        Write to temp file → node /tmp/scraper.js → check output
        If failed → analyze error → fix script → retry (max 3x)
Step 5: Return verified script + data preview


**Example Output:**
````markdown
## Verified Scraper (Playwright)

**Status**: ✅ Verified (extracted 50 items)

```javascript
const { chromium } = require('playwright');
// ... generated code with Actionbook selectors
```

**Usage**:

```bash
npm install playwright
node scraper.js
```

### Data Preview

```json
[
  {"name": "Item 1", "description": "..."}
  // ... first 3 items
]
```
````

Request Website Command

  1. User: /actionbook-scraper:request-website https://newsite.com/page
  2. Launch website-requester agent (uses agent-browser)
  3. Agent workflow:
     a. agent-browser open "https://actionbook.dev/request-website"
     b. agent-browser snapshot -i (discover form selectors)
     c. agent-browser type <url-field> "https://newsite.com/page"
     d. agent-browser type <email-field> (optional)
     e. agent-browser type <usecase-field> (optional)
     f. agent-browser click <submit-button>
     g. agent-browser snapshot -i (verify submission)
     h. agent-browser close
  4. Output: Confirmation of submission

Selector Data Structure

Actionbook returns selector data in this format:

```json
{
  "url": "https://example.com/page",
  "title": "Page Title",
  "content": "## Selector Reference\n\n| Element | CSS | XPath | Type |\n..."
}
```
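As a sketch, selector rows could be pulled out of the markdown table embedded in the `content` field like this. The column layout follows the example above; real responses may differ.

```javascript
// Sketch: extract rows from the "Selector Reference" markdown table inside
// the `content` field. Column order (Element|CSS|XPath|Type) is assumed
// from the example above.
function parseSelectorTable(content) {
  return content
    .split('\n')
    // keep pipe rows, drop the |---|---| separator line
    .filter(line => line.startsWith('|') && !/^\|[\s|-]+\|$/.test(line.trim()))
    .slice(1) // drop the header row
    .map(line => {
      const [element, css, xpath, type] = line
        .split('|')
        .slice(1, -1) // drop empty cells around the outer pipes
        .map(cell => cell.trim());
      return { element, css, xpath, type };
    });
}
```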

Common Selector Patterns

Card-based layouts:

  • Container: .card-list, .grid-container
  • Card item: .card, .list-item
  • Card name: .card__title, .card-name
  • Card description: .card__description
  • Expand button: .card__expand, button.expand

Detail extraction (dt/dd pattern):

```javascript
// Common pattern for key-value pairs
const items = container.querySelectorAll('.info-item');
items.forEach(item => {
  const label = item.querySelector('dt').textContent;
  const value = item.querySelector('dd').textContent;
});
```

Table layouts:

  • Table: table, .data-table
  • Header: thead th, .table-header
  • Row: tbody tr, .table-row
  • Cell: td, .table-cell

Page Type Detection

| Indicator | Page Type | Template |
|-----------|-----------|----------|
| Scroll to load more | Dynamic/Infinite | playwright-js (with scroll) |
| Click to expand | Card-based | playwright-js (with click) |
| Pagination links | Paginated | playwright-js (with pagination) |
| Static content | Static | puppeteer or playwright |
| SPA framework detected | SPA | playwright-js (network idle) |
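For illustration, the indicator-to-template mapping above could be expressed as a lookup. The normalized indicator keys and the network-idle fallback are assumptions, not skill behavior.

```javascript
// Sketch: map a detected page-type indicator to the suggested template.
// The key names and the fallback default are illustrative assumptions.
const TEMPLATE_BY_INDICATOR = {
  'infinite-scroll': 'playwright-js (with scroll)',
  'click-to-expand': 'playwright-js (with click)',
  'pagination': 'playwright-js (with pagination)',
  'static': 'puppeteer or playwright',
  'spa': 'playwright-js (network idle)',
};

function suggestTemplate(indicator) {
  // Unknown indicators fall back to the safest dynamic option.
  return TEMPLATE_BY_INDICATOR[indicator] ?? 'playwright-js (network idle)';
}
```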

Output Formats

Analysis Output

Page Analysis: {url}

Matched Action

  • Action ID: {action_id}
  • Confidence: HIGH | MEDIUM | LOW

Available Selectors

| Element | Selector | Type | Methods |
|---------|----------|------|---------|
| {name} | {selector} | {type} | {methods} |

Page Structure

  • Type: {static|dynamic|spa}
  • Data Pattern: {cards|table|list}
  • Lazy Loading: {yes|no}
  • Expand/Collapse: {yes|no}

Recommendations

  • Suggested template: {template}
  • Special handling needed: {notes}

Generated Code Output

Generated Scraper

  • Target URL: {url}
  • Template: {template}
  • Expected Output: {description}

Dependencies

npm install playwright

Code

{generated_code}

Usage

node scraper.js

Output

Results saved to {output_file}

## Templates Reference

| Template | Flag | Output | Run With |
|----------|------|--------|----------|
| **agent-browser** | (default) | CLI commands | `agent-browser` CLI |
| playwright-js | --standalone | .js file | `node scraper.js` |
| playwright-python | --standalone --template playwright-python | .py file | `python scraper.py` |
| puppeteer | --standalone --template puppeteer | .js file | `node scraper.js` |

## Error Handling

| Error | Cause | Solution |
|-------|-------|----------|
| No actions found | URL not indexed | Use `/actionbook-scraper:request-website` to request indexing |
| Selectors not working | Page updated | Report to Actionbook, try alternative selectors |
| Timeout | Slow page load | Increase timeout, add retry logic |
| Empty data | Dynamic content | Add scroll/wait handling |
| Form submission failed | Network/page issue | Retry or submit manually at actionbook.dev |

## agent-browser Usage

For the `request-website` command, the plugin uses **agent-browser CLI** to automate form submission.

### agent-browser Commands

```bash
# Open a URL
agent-browser open "https://actionbook.dev/request-website"

# Get page snapshot (discover selectors)
agent-browser snapshot -i

# Type into form field
agent-browser type "input[name='url']" "https://example.com"

# Click button
agent-browser click "button[type='submit']"

# Close browser (ALWAYS do this)
agent-browser close
```

Selector Discovery

If form selectors are unknown, use snapshot to discover them:

```bash
agent-browser open "https://actionbook.dev/request-website"
agent-browser snapshot -i  # Returns page structure with selectors
```

Always Close Browser

**Critical**: Always run `agent-browser close` at the end of any agent-browser session, even if errors occur.

Rate Limiting

- Actionbook MCP: No rate limit for local usage

- Target websites: Respect robots.txt and add delays between requests

- Recommended: 1-2 second delay between page requests
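The recommended delay can be sketched as a small helper with jitter. The function names are illustrative; agent-browser itself has no built-in delay flag that this relies on.

```javascript
// Sketch: pick a polite delay in the recommended 1-2 s range, with jitter.
// Helper names are illustrative, not part of the skill or agent-browser.
function politeDelayMs(minMs = 1000, maxMs = 2000) {
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

// Usage between page requests:
//   await sleep(politeDelayMs());
```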

Examples

Example 1: Generate agent-browser Script (Default)

/actionbook-scraper:generate https://firstround.com/companies

Output: agent-browser commands
```bash
agent-browser open "https://firstround.com/companies"
agent-browser scroll down 2000
agent-browser get text ".company-list-card-small"
agent-browser close
```

User runs these commands to scrape.

### Example 2: Generate Playwright Script

/actionbook-scraper:generate https://firstround.com/companies --standalone

Output: Playwright JavaScript code

```javascript
const { chromium } = require('playwright');
// ... full script
```

User runs: `node scraper.js`

### Example 3: Analyze Page Structure

/actionbook-scraper:analyze https://example.com/products

Output: Analysis showing:

- Available selectors

- Page structure

- Recommended approach

### Example 4: Request New Website

/actionbook-scraper:request-website https://newsite.com/data

Action: Submits form to actionbook.dev (this command DOES execute agent-browser)

## Best Practices

1. **Always analyze before generating** - Understand the page structure first
2. **Check list-sources** - Verify the site is indexed before attempting
3. **Review generated code** - Verify selectors match expected elements
4. **Add appropriate delays** - Be respectful to target servers
5. **Handle edge cases** - Empty states, loading states, errors
6. **Test incrementally** - Run on small subset before full scrape

