smart-web-scraper

Extract structured data from any web page. Supports CSS selectors, auto-detection of tables and lists, JSON/CSV output formats. Use when asked to scrape a website, extract data from a page, pull product info, gather contact details, or collect listings from a URL.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "smart-web-scraper" with this command: npx skills add mariusfit/smart-web-scraper

Smart Web Scraper

Extract structured data from web pages into clean JSON or CSV.

Quick Start

# Scrape a page, extract all text content
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com"

# Extract specific elements with CSS selector
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com/products" -s ".product-card"

# Auto-detect and extract tables
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py tables "https://example.com/pricing"

# Extract all links from a page
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py links "https://example.com"

# Extract structured data (title, meta, headings, links)
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py structure "https://example.com"

# Output as JSON
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s ".item" -f json

# Output as CSV
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s "table tr" -f csv

# Save to file
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://example.com" -s ".product" -f json -o products.json

# Multi-page scrape (follow pagination)
uv run --with beautifulsoup4 --with lxml python scripts/scraper.py crawl "https://example.com/page/1" --pages 5 -s ".article"

Commands

CommandArgsDescription
extract<url> [-s selector] [-f format] [-o file]Extract content, optionally filtered by CSS selector
tables<url> [-f format] [-o file]Auto-detect and extract all HTML tables
links<url> [--external] [--internal]Extract all links (href + text)
structure<url>Extract page structure: title, meta, headings, images, links
crawl<url> --pages N [-s selector] [-f format] [-o file]Follow pagination links, extract from multiple pages

Output Formats

FormatFlagDescription
Text-f textPlain text (default)
JSON-f jsonStructured JSON array
CSV-f csvComma-separated values
Markdown-f mdMarkdown-formatted

Examples

Extract product listings

uv run --with beautifulsoup4 --with lxml python scripts/scraper.py extract "https://shop.example.com" -s ".product" -f json

Output:

[
  {"text": "Widget Pro - $29.99", "tag": "div", "class": "product"},
  {"text": "Widget Max - $49.99", "tag": "div", "class": "product"}
]

Extract pricing table

uv run --with beautifulsoup4 --with lxml python scripts/scraper.py tables "https://example.com/pricing" -f csv

Get all external links

uv run --with beautifulsoup4 --with lxml python scripts/scraper.py links "https://example.com" --external

Rate Limiting

  • Default: 1 request per second (respectful crawling)
  • Override with --delay 0.5 (seconds between requests)
  • Respects robots.txt by default (override with --ignore-robots)

Notes

  • Requires beautifulsoup4 and lxml (auto-installed by uv run --with)
  • Uses a standard browser User-Agent to avoid blocks
  • Handles redirects, encoding detection, and error pages gracefully
  • No JavaScript rendering (use for static HTML pages)

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

Web Scraping & Data Extraction Engine

Complete web scraping methodology — legal compliance, architecture design, anti-detection, data pipelines, and production operations. Use when building scrap...

Registry SourceRecently Updated
0605
Profile unavailable
Automation

AutoClaw Browser Automation

Complete browser automation skill with MCP protocol support and Chrome extension

Registry SourceRecently Updated
0353
Profile unavailable
Automation

Metal Price

全球铁合金网价格查询与导出技能。自动登录www.qqthj.com网站,查询指定金属(如锰铁、钒铁等)的当日价格数据,抓取价格表格并导出为Excel文件。

Registry SourceRecently Updated
0297
Profile unavailable