web-scraper

Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JavaScript-heavy sites, and structured output (JSON/CSV). Use when Codex needs to extract data from websites, handle pagination, bypass simple anti-bot measures, scrape JavaScript-rendered content, or process scraped data into usable formats.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "web-scraper" with this command: npx skills add ericlooi504/python-web-scraper

Web Scraper

Overview

Python web scraping toolkit for data extraction, pagination handling, anti-blocking techniques, Selenium for JS-heavy sites, and structured output. Covers ethical scraping practices. Use when Codex needs to extract data from websites, handle pagination, bypass simple anti-bot measures, or scrape JavaScript-rendered content.

Quick Start

Prerequisites

pip install requests beautifulsoup4 lxml
# For JS-heavy sites:
pip install selenium webdriver-manager

Basic scrape

# Extract all links from a page
python3 scripts/scrape-basic.py https://example.com \
  --selector "a[href]" --attr href --output links.json --pretty

# Extract text from articles
python3 scripts/scrape-basic.py https://news.ycombinator.com \
  --selector ".titleline a" --output hn.txt

Paginated scrape

# URL parameter pagination (?page=1, ?page=2)
python3 scripts/scrape-pagination.py https://books.toscrape.com/catalogue/page-1.html \
  --selector "h3 a" --attr title --max-pages 5

# Next-link detection
python3 scripts/scrape-pagination.py https://quotes.toscrape.com \
  --selector "span.text" --max-pages 3

JavaScript-rendered pages (Selenium)

python3 scripts/scrape-with-selenium.py https://example.com \
  --selector ".dynamic-content" --wait 5 --output data.json

Common Scenarios

Anti-blocking techniques

Rotate User-Agents and add random delays to avoid 429 responses and blocking:

import random
import time

# A small pool of real browser User-Agent strings to rotate through
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
headers = {
    "User-Agent": random.choice(USER_AGENTS),
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}
time.sleep(random.uniform(1.0, 3.0))  # random delay between requests

For more aggressive blocking, set cookies, reuse a session, or route requests through a proxy:
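
A minimal sketch combining these, reusing the headers dict built above; the proxy endpoint is a placeholder:

import requests

session = requests.Session()      # persists cookies and reuses connections
session.headers.update(headers)   # the rotated headers from the snippet above
session.proxies = {               # placeholder proxy; substitute a real endpoint
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}
resp = session.get("https://example.com", timeout=10)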

Handle JavaScript sites without Selenium

First check: is the data embedded in the page source?

import re, json
import requests

html = requests.get("https://example.com", timeout=10).text
# Look for JSON state embedded in a <script> tag
match = re.search(r'window\.__INITIAL_STATE__\s*=\s*({.*?});', html)
if match:
    data = json.loads(match.group(1))

Many SPAs (React/Vue) embed data in script tags — Selenium may be unnecessary.

Handle login-protected pages

# Option 1: Export cookies from the browser
# In the browser console: document.cookie, or use the EditThisCookie extension
# Option 2: Log in with a requests Session and save the cookies as JSON
python3 -c "
import json, requests
s = requests.Session()
s.post('https://example.com/login', data={'user': '...', 'pass': '...'})
with open('cookies.json', 'w') as f:
    json.dump(s.cookies.get_dict(), f)
"

Output formatting

Scripts output JSON by default. Convert to CSV:

# JSON → CSV with a short Python snippet
python3 scripts/scrape-basic.py https://example.com -s "tr" -o data.json --pretty
python3 -c "
import json, csv
with open('data.json') as f:
    data = json.load(f)
with open('data.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['item'])
    for d in data:
        w.writerow([d])
"

Ethics & Legal

  • Always check robots.txt first: https://example.com/robots.txt (see the sketch after this list)
  • Respect the Crawl-delay directive
  • Identify yourself in the User-Agent with contact info
  • Never scrape personal data, copyrighted material, or login-protected content you aren't authorized to access
  • Add delays (1-3s minimum) between requests; don't hammer servers
  • Check the ToS; some sites explicitly ban scraping
  • Public data (news, blogs, directories) is generally fine to scrape with proper rate limiting
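
The stdlib can run the robots.txt check programmatically (the sketch referenced above; the user-agent string is a placeholder):

from urllib.robotparser import RobotFileParser

UA = "my-scraper/1.0 (+contact@example.com)"  # placeholder UA with contact info
rp = RobotFileParser("https://example.com/robots.txt")
rp.read()                                     # fetch and parse robots.txt
if rp.can_fetch(UA, "https://example.com/some/page"):
    print("allowed")                          # safe to request
print(rp.crawl_delay(UA))                     # Crawl-delay in seconds, or None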

Resources

  • scripts/scrape-basic.py — Single page scrape with CSS selectors, JSON/CSV/text output
  • scripts/scrape-pagination.py — Paginated scrape (URL params + next-link detection)
  • scripts/scrape-with-selenium.py — Selenium-based scrape for JS-heavy sites with scroll
  • references/anti-blocking.md — Detailed anti-blocking and proxy strategies

