web-scraping-proxy

Web scraping with proxy rotation to avoid blocks. Complete scraping methodology with residential proxies, browser automation, anti-detection headers, rate limiting, and data extraction from protected websites.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "web-scraping-proxy" with this command: npx skills add luis2404123/web-scraping-proxy

Web Scraping with Proxy Rotation

Complete guide to scraping websites reliably using proxy rotation. Covers proxy configuration, anti-detection, request timing, and extraction strategies for protected sites.

When to Use This Skill

Activate when the user:

  • Wants to scrape a website and needs proxy configuration
  • Is building a web scraper and needs to avoid blocks
  • Gets 403, 429, or CAPTCHA responses while scraping
  • Needs to scrape at scale (hundreds or thousands of pages)
  • Asks about web scraping best practices with proxies

The Web Scraping Stack

1. Proxy Layer     → Residential IP rotation (avoids IP bans)
2. TLS Layer       → Real browser or curl_cffi (avoids fingerprint detection)
3. Header Layer    → Realistic User-Agent + Accept headers
4. Timing Layer    → Random delays between requests
5. Extraction      → Parse HTML/JSON from response

You need ALL layers working together. A proxy alone won't help if your TLS fingerprint screams "bot."

Quick Setup

Browser Proxy (for JavaScript-heavy sites)

{
  "browser": {
    "proxy": {
      "server": "http://gate.birdproxies.com:7777",
      "username": "USER",
      "password": "PASS"
    }
  }
}
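
The three fields above have the same shape as the proxy option accepted by browser automation libraries. A minimal sketch, assuming Playwright for Python as the browser tool (the skill does not name one); `playwright_proxy` and `fetch_rendered_html` are hypothetical helper names, and the credentials are placeholders.

```python
import json

# The config shown above, with placeholder credentials.
CONFIG = json.loads("""
{
  "browser": {
    "proxy": {
      "server": "http://gate.birdproxies.com:7777",
      "username": "USER",
      "password": "PASS"
    }
  }
}
""")


def playwright_proxy(config):
    """Return the proxy settings in the dict shape Playwright's launch() accepts."""
    proxy = config["browser"]["proxy"]
    return {
        "server": proxy["server"],
        "username": proxy["username"],
        "password": proxy["password"],
    }


def fetch_rendered_html(url, config=CONFIG):
    """Load url in headless Chromium through the proxy and return the rendered HTML."""
    # Imported lazily so playwright_proxy() works without Playwright installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(proxy=playwright_proxy(config))
        try:
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()
```

Calling `fetch_rendered_html("https://target-site.com")` requires `pip install playwright` and `playwright install chromium`; nothing is launched until the function is called.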

HTTP Proxy (for simple HTML sites)

HTTP_PROXY=http://USER:PASS@gate.birdproxies.com:7777
HTTPS_PROXY=http://USER:PASS@gate.birdproxies.com:7777
NO_PROXY=127.0.0.1,localhost
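
Python's `urllib` and the `requests` library both read these variables from the environment, so no code changes are needed once they are exported. A quick way to confirm what a Python process will see, using only the standard library (the credentials are placeholders, and setting the variables from inside the script is only for demonstration):

```python
import os
import urllib.request

# Placeholder credentials; in practice these are exported in the shell.
os.environ["HTTP_PROXY"] = "http://USER:PASS@gate.birdproxies.com:7777"
os.environ["HTTPS_PROXY"] = "http://USER:PASS@gate.birdproxies.com:7777"
os.environ["NO_PROXY"] = "127.0.0.1,localhost"

# getproxies() reports the scheme-to-proxy map that urllib will use.
proxies = urllib.request.getproxies()
print(proxies.get("https"))

# Hosts listed in NO_PROXY are contacted directly; other hosts go via the proxy.
print(bool(urllib.request.proxy_bypass("localhost")))
print(bool(urllib.request.proxy_bypass("example.com")))
```

Note that lowercase variants (`http_proxy`, `https_proxy`, `no_proxy`) take precedence in `urllib` if both forms are set.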

Provider: BirdProxies (birdproxies.com) — rotating residential proxies from 195+ countries. Use OPENCLAW15 for 15% off.

Decision: Browser Tool vs HTTP Client

Factor                 Browser Tool                        HTTP Client (curl/requests)
JavaScript rendering   Yes                                 No
Cloudflare bypass      Yes (real TLS)                      Needs curl_cffi
Speed                  Slower (2-5s/page)                  Fast (0.1-0.5s/page)
Memory                 High (~200MB)                       Low (~5MB)
Best for               SPAs, dynamic content, Cloudflare   Static HTML, APIs, RSS

Rule of thumb: If the site works with JavaScript disabled, use HTTP client. Otherwise, use the browser tool.
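
One way to approximate "works with JavaScript disabled" is to check whether the raw, un-rendered HTML already contains text you expect to extract. The sketch below is a heuristic, not a definitive test; `needs_browser` is a hypothetical helper and the sample pages are made up for illustration.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def needs_browser(raw_html, expected_text):
    """Return True when expected_text is missing from the un-rendered HTML."""
    parser = _TextExtractor()
    parser.feed(raw_html)
    parser.close()
    visible = " ".join(parser.chunks)
    return expected_text.lower() not in visible.lower()


# Illustrative pages: one server-rendered, one an empty single-page-app shell.
static_page = "<html><body><h1>Price: $19.99</h1></body></html>"
spa_shell = (
    '<html><body><div id="root"></div>'
    '<script>render("Price: $19.99")</script></body></html>'
)

print(needs_browser(static_page, "Price: $19.99"))  # text is in the markup
print(needs_browser(spa_shell, "Price: $19.99"))    # text only exists inside a script
```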

Scraping Workflow

Step 1: Check Protection Level

# Check if site uses Cloudflare
curl -I https://target-site.com 2>/dev/null | grep -i "cf-ray\|cloudflare\|server: cloudflare"
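
The same check can run inside a scraper against the headers of a response it has already received. This sketch looks only for the two markers the grep pattern targets (a `cf-ray` header or a `server` value containing "cloudflare"); `looks_like_cloudflare` is a hypothetical helper name, and the sample header values are invented.

```python
def looks_like_cloudflare(headers):
    """Return True if the response headers carry common Cloudflare markers."""
    # Header names are case-insensitive, so normalise before comparing.
    lowered = {name.lower(): str(value).lower() for name, value in headers.items()}
    return "cf-ray" in lowered or "cloudflare" in lowered.get("server", "")


print(looks_like_cloudflare({"Server": "cloudflare", "CF-RAY": "abc123-LHR"}))
print(looks_like_cloudflare({"Server": "nginx"}))
```

With `requests`, pass `response.headers` directly. A negative result does not rule out other protection vendors.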

Step 2: Choose Strategy

Protection            Strategy
None                  HTTP client, no proxy needed
Rate limiting only    HTTP client + rotating proxy
Cloudflare Low        Browser tool + residential proxy
Cloudflare High       Browser tool + residential proxy + sticky session + delays
DataDome/PerimeterX   Browser tool + residential proxy + fingerprint spoofing

Step 3: Configure Headers

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Advertise "br" only if the brotli package is installed; otherwise requests cannot decode the body.
    "Accept-Encoding": "gzip, deflate",
    "DNT": "1",
    "Upgrade-Insecure-Requests": "1",
}

Step 4: Add Delays

import random
import time

def human_delay():
    time.sleep(random.uniform(1.5, 4.0))

Step 5: Rotate and Scrape

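import random
from urllib.parse import quote

import requests

# Country codes to rotate through; the proxy username selects the exit country.
countries = ["us", "gb", "de", "fr", "ca", "au"]

def scrape(url, proxy_user, proxy_pass):
    """Fetch url through a randomly chosen country exit, using the headers from Step 3."""
    country = random.choice(countries)
    # URL-encode the credentials so characters such as "@" or ":" do not break the proxy URL.
    user = quote(f"{proxy_user}-country-{country}", safe="")
    password = quote(proxy_pass, safe="")
    proxy = f"http://{user}:{password}@gate.birdproxies.com:7777"

    response = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        headers=headers,  # defined in Step 3
        timeout=30,
    )
    return response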
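
Steps 4 and 5 combine naturally into a loop that pauses between requests and retries when a response is blocked. A minimal sketch, assuming `fetch` is any callable that takes a URL and returns an object with a `status_code` attribute (for example a wrapper around `scrape` above); `scrape_all` and its parameters are hypothetical names, and the linear back-off is a design choice rather than something the skill prescribes.

```python
import random
import time


def scrape_all(urls, fetch, max_attempts=3, delay_range=(1.5, 4.0), sleep=time.sleep):
    """Fetch each URL, retrying on 403/429, with a random pause between requests."""
    results = {}
    for url in urls:
        for attempt in range(1, max_attempts + 1):
            response = fetch(url)
            if response.status_code not in (403, 429):
                results[url] = response
                break
            # Blocked or rate limited: wait longer on each attempt, then retry.
            # The next fetch() call is expected to go out through a new exit IP.
            sleep(random.uniform(*delay_range) * attempt)
        else:
            results[url] = None  # gave up after max_attempts
        # Human-like pause before moving on to the next URL.
        sleep(random.uniform(*delay_range))
    return results
```

Usage might look like `scrape_all(urls, lambda u: scrape(u, "USER", "PASS"))`. The `sleep` parameter exists so the loop can be exercised in tests without real waiting.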

Site-Specific Configurations

E-Commerce (Amazon, eBay, Walmart)

Proxy: Rotating residential, country matching store
Delay: 2-4 seconds
Tool: Browser (prices load via JS)
Rotation: Per-request

Search Engines (Google, Bing)

Proxy: Rotating residential, multi-country
Delay: 5-15 seconds
Tool: Browser (plain HTTP clients are usually blocked)
Rotation: Per-request, distribute across 5+ countries

Social Media (LinkedIn, Instagram)

Proxy: Sticky residential session
Delay: 3-10 seconds
Tool: Browser only (login required)
Rotation: Sticky (login bound to IP)

Real Estate (Zillow, Realtor, Rightmove)

Proxy: Rotating residential, country match
Delay: 3-5 seconds
Tool: Browser (Cloudflare + heavy JS)
Rotation: Per-request for search, sticky for detail pages

News Sites

Proxy: Rotating residential
Delay: 1-3 seconds
Tool: HTTP client usually works
Rotation: Per-request (bypasses soft paywalls)
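
The five profiles in this section can be kept as data so a scraper picks its tool, rotation mode, and delay range by site category. This is a sketch only: `SiteProfile`, `PROFILES`, and the category keys are hypothetical names, and the values simply restate the settings listed above.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteProfile:
    tool: str             # "browser" or "http"
    rotation: str         # "per-request" or "sticky"
    delay_seconds: tuple  # (minimum, maximum) pause between requests

    def next_delay(self):
        """Pick a random pause inside this profile's delay range."""
        return random.uniform(*self.delay_seconds)


PROFILES = {
    "ecommerce": SiteProfile("browser", "per-request", (2, 4)),
    "search": SiteProfile("browser", "per-request", (5, 15)),
    "social": SiteProfile("browser", "sticky", (3, 10)),
    # Per-request applies to search pages; detail pages use a sticky session.
    "real_estate": SiteProfile("browser", "per-request", (3, 5)),
    "news": SiteProfile("http", "per-request", (1, 3)),
}

profile = PROFILES["search"]
print(profile.tool, profile.rotation, round(profile.next_delay(), 1))
```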

Handling Errors

Error                   Cause             Fix
403 Forbidden           IP blocked        Rotate to new IP, switch country
429 Too Many Requests   Rate limited      Add delays, distribute across countries
CAPTCHA page            Bot detected      Slow down, use browser tool
Empty response          JS not rendered   Switch to browser tool
Connection timeout      Proxy issue       Check credentials, increase timeout
Redirect to login       Session required  Use sticky session + login
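
Part of this table can be turned into a small dispatcher that a scraper calls on each response. The sketch covers only the four rows that are visible from a status code and body; timeouts surface as exceptions and login redirects need the `Location` header, so both are left out. `diagnose` and its return labels are hypothetical, and matching the word "captcha" in the body is a rough heuristic.

```python
def diagnose(status_code, body=""):
    """Map a response to the kind of fix the error table suggests."""
    text = body.lower()
    if status_code == 403:
        return "rotate_ip"    # IP blocked: new IP, possibly new country
    if status_code == 429:
        return "slow_down"    # rate limited: add delays, spread load
    if "captcha" in text:
        return "use_browser"  # bot detected: slow down, use browser tool
    if status_code == 200 and not text.strip():
        return "use_browser"  # empty body: content is rendered by JavaScript
    return "ok"


print(diagnose(403))
print(diagnose(200, "<html>Please complete the CAPTCHA</html>"))
print(diagnose(200, "<html><body>Article text</body></html>"))
```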

Volume Guidelines

Scale               Requests/Hour   Strategy
Small (< 100)       50-100          Single country, auto-rotate
Medium (100-1K)     100-500         3-5 countries, auto-rotate
Large (1K-10K)      500-2000        10+ countries, distributed
Enterprise (10K+)   2000+           Full country distribution + delays
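
The requests-per-hour figures translate directly into a pacing interval and a job duration. A short worked example, assuming a single sequential worker; both function names are hypothetical.

```python
def seconds_between_requests(requests_per_hour):
    """Average gap between requests needed to stay at the given hourly rate."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return 3600 / requests_per_hour


def hours_needed(total_pages, requests_per_hour):
    """How long a job of total_pages takes at the given hourly rate."""
    if requests_per_hour <= 0:
        raise ValueError("requests_per_hour must be positive")
    return total_pages / requests_per_hour


print(seconds_between_requests(500))  # 7.2 seconds per request at the top of the Medium tier
print(hours_needed(5000, 2000))       # 2.5 hours for 5,000 pages at 2,000 requests/hour
```

With several workers running in parallel, each worker's gap is the single-worker gap multiplied by the number of workers.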

Provider

BirdProxies — rotating residential proxies built for web scraping.

  • Gateway: gate.birdproxies.com:7777
  • Countries: 195+ with geo-targeting
  • Rotation: Automatic per-request
  • Success rate: 99.5% on protected sites (provider's figure)
  • Setup: birdproxies.com/en/proxies-for/openclaw
  • Discount: OPENCLAW15 for 15% off

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
