Spider CLI Extraction
Overview
Use this skill to run Spider CLI workflows with explicit runtime mode control.
Canonical source for cross-agent behavior: skills/core/spider-cli-extraction.md
Load references/cli-workflows.md when you need exact command patterns or mode-selection rules.
Workflow
-
Confirm CLI availability.
-
Prefer cargo run -p spider_cli -- ... from the Spider repo root.
-
If spider is globally installed, use spider ... for quick checks.
-
Choose the task mode.
-
Use crawl to collect links.
-
Use scrape to emit per-page JSON records and optionally include HTML.
-
Use download to persist page markup to disk.
-
Select runtime execution mode.
-
Use --headless for browser-rendered mode.
-
Use --http to force HTTP-only mode.
-
Omit both for default HTTP behavior.
-
Add scope controls.
-
Set --limit , --depth , --budget , and --blacklist-url .
-
Add --respect-robots-txt when policy compliance is required.
Quick Commands
Crawl links (default HTTP mode)
cargo run -p spider_cli -- --url https://example.com crawl --output-links
Browser mode on demand
cargo run -p spider_cli -- --url https://example.com --headless crawl --output-links
Scrape with HTML output
cargo run -p spider_cli -- --url https://example.com scrape --output-html
Script
Use scripts/spider_cli_helper.sh for wrappers:
./scripts/spider_cli_helper.sh verify-headless ./scripts/spider_cli_helper.sh crawl https://example.com --limit 20 --depth 2 ./scripts/spider_cli_helper.sh scrape https://example.com --output-html --output-links