Web Scraping with Intelligent Strategy Selection
When This Skill Activates
Activate automatically when user requests:
- "Scrape [website]"
- "Extract data from [site]"
- "Get product information from [URL]"
- "Find all links/pages on [site]"
- "I'm getting blocked" or "Getting 403 errors" (loads
strategies/anti-blocking.md) - "Make this an Apify Actor" (loads
apify/subdirectory) - "Productionize this scraper"
Proactive Workflow
This skill follows a systematic 5-phase approach to web scraping, always starting with interactive reconnaissance and ending with production-ready code.
Phase 1: INTERACTIVE RECONNAISSANCE (Critical First Step)
When user says "scrape X", immediately start with hands-on reconnaissance using MCP tools:
DO NOT jump to automated checks or implementation - reconnaissance prevents wasted effort and discovers hidden APIs.
Use Proxy-MCP (Traffic Interception + Stealth Browser + Humanizer):
1. Start MITM proxy and launch stealth browser
proxy_start()→ Start traffic interception proxyinterceptor_chrome_launch(url, stealthMode: true)→ Launch Chrome with anti-detection patches (webdriver, chrome.runtime, Permissions.query, Error.stack)interceptor_chrome_devtools_attach(target_id)→ Attach DevTools bridge for DOM accessinterceptor_chrome_devtools_screenshot()→ Capture initial state- Observe page loading behavior (SSR? SPA? Loading states?)
2. Analyze captured network traffic (automatic — MITM proxy captures everything)
proxy_list_traffic(url_filter: "api")→ Find REST API endpointsproxy_search_traffic(query: "application/json")→ Find JSON responsesproxy_get_exchange(exchange_id)→ Inspect full request/response detailsinterceptor_chrome_devtools_list_network(resource_types: ["xhr", "fetch"])→ Browser-side network view- Find API endpoints returning JSON (10-100x faster than HTML scraping!)
- Document headers, cookies, authentication tokens
- Extract pagination parameters
3. Test site interactions (humanizer for anti-detection)
humanizer_click(target_id, selector)→ Click with human-like Bezier curveshumanizer_type(target_id, text)→ Type with realistic WPM timinghumanizer_scroll(target_id, direction, amount)→ Smooth scrollhumanizer_idle(target_id, duration_ms)→ Idle with micro-jitterproxy_clear_traffic()before each action → Isolate API calls per interaction- Pagination: URL-based? API? Infinite scroll?
- Filtering and search: What API endpoints do they hit?
- Dynamic content loading: What traffic does scrolling trigger?
4. Assess protection mechanisms
interceptor_chrome_devtools_list_cookies(domain_filter: "cloudflare")→ Check for Cloudflareinterceptor_chrome_devtools_list_storage_keys(storage_type: "local")→ Fingerprinting markersproxy_list_traffic()→ Look for 403s, challenge pagesproxy_get_tls_fingerprints()→ Analyze TLS fingerprints- Note: stealth mode already handles most browser-level detection
5. Generate Intelligence Report
- Site architecture (framework, rendering method)
- Discovered APIs/endpoints with full specs (from traffic capture)
- Protection mechanisms and required countermeasures
- Optimal extraction strategy (Traffic Interception > Sitemap > API > DOM Scraping)
- Time/complexity estimates
See: workflows/reconnaissance.md for complete reconnaissance guide with proxy-mcp examples
Why this matters: Traffic interception automatically discovers hidden APIs (eliminating need for HTML scraping), identifies blockers before coding, and provides intelligence for optimal strategy selection. Never skip this step.
Phase 2: AUTOMATIC DISCOVERY (Validate Reconnaissance)
After Phase 1 reconnaissance, validate findings with automated checks:
1. Check for Sitemaps
# Automatically check these locations
curl -s https://[site]/robots.txt | grep -i Sitemap
curl -I https://[site]/sitemap.xml
curl -I https://[site]/sitemap_index.xml
Log findings clearly:
- "Found sitemap at /sitemap.xml with ~1,234 URLs"
- "Found sitemap index with 5 sub-sitemaps"
- "No sitemap detected at common locations"
Why this matters: Sitemaps provide instant URL discovery (60x faster than crawling)
2. Validate Discovered APIs
APIs already discovered automatically via proxy traffic capture in Phase 1. Validate them:
If APIs were found, confirm:
- Direct access works (test outside browser context)
- Pagination parameters function correctly
- Rate limits are acceptable
- No authentication barriers for target data
If no APIs were found, note:
- Site likely requires DOM scraping
- Check if content is static (→ Cheerio) or dynamic (→ DevTools bridge)
Log findings:
- "Confirmed API: GET /api/products/{id} (returns JSON, no auth)"
- "Found GraphQL endpoint: /graphql (auth required — use session tokens)"
- "No APIs detected — DOM scraping required"
3. Analyze Site Structure
Automatically assess:
- JavaScript-heavy? (Look for React, Vue, Angular indicators in traffic)
- Authentication required? (Login walls, auth tokens in captured requests)
- Page count estimate (from sitemap or API pagination metadata)
- Rate limiting indicators (robots.txt directives, 429 responses in traffic)
Phase 3: STRATEGY RECOMMENDATION
Based on Phases 1-2 findings, present 2-3 options with clear reasoning:
Example Output Template:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Analysis of example.com
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Phase 1 Intelligence (Traffic Interception):
✓ API discovered via traffic capture: GET /api/products?page=N&limit=100
✓ Framework: Next.js (SSR + CSR hybrid)
✓ Protection: Cloudflare detected, rate limit ~60/min
✓ Stealth mode handled browser detection
✗ No authentication required
Phase 2 Validation:
✓ Sitemap found: 1,234 product URLs (validates API total)
✓ Static HTML fallback available if needed
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Recommended Approaches:
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Option 1: Hybrid (Sitemap + API) [RECOMMENDED]
✓ Use sitemap to get all 1,234 product URLs instantly
✓ Extract product IDs from URLs
✓ Fetch data via API (fast, reliable JSON)
Complexity: Low-Medium
Data quality: Excellent
Speed: Very Fast
Option 2: Sitemap + DOM Scraping (DevTools Bridge)
✓ Use sitemap for URLs
✓ Scrape HTML via accessibility tree snapshots
Complexity: Medium
Data quality: Good
Speed: Fast
Option 3: Pure API (if sitemap fails)
✓ Discover product IDs through API exploration
✓ Fetch all data via API
Complexity: Medium
Data quality: Excellent
Speed: Fast
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
My Recommendation: Option 1 (Hybrid)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Reasoning:
• Sitemap gives us complete URL list (instant discovery)
• API provides clean, structured data (no HTML parsing)
• Combines speed of sitemap with reliability of API
• Best of both worlds
Proceed with Option 1? [Y/n]
Key principles:
- Always recommend the SIMPLEST approach that works
- Traffic Interception > Sitemap > API > DOM Scraping (in terms of priority)
- Show complexity and data quality
- Explain reasoning clearly
Phase 4: ITERATIVE IMPLEMENTATION
Implement scraper incrementally, starting simple and adding complexity only as needed.
Core Pattern:
- Implement recommended approach (minimal code)
- Test with small batch (5-10 items)
- Validate data quality
- Scale to full dataset or fallback
- Handle blocking if encountered
- Add robustness (error handling, retries, logging)
See: workflows/implementation.md for complete implementation patterns and code examples
Phase 5: PRODUCTIONIZATION (On Request)
Convert scraper to production-ready Apify Actor.
Activation triggers:
- "Make this an Apify Actor"
- "Productionize this scraper"
- "Deploy to Apify"
- "Create an actor from this"
Core Pattern:
- Confirm TypeScript preference (STRONGLY RECOMMENDED)
- Initialize with
apify createcommand (CRITICAL) - Port scraping logic to Actor format
- Test locally and deploy
Note: During development, proxy-mcp provides reconnaissance and traffic analysis. For production Actors, use Crawlee crawlers (CheerioCrawler/PlaywrightCrawler) on Apify infrastructure.
See: workflows/productionization.md for complete productionization workflow and apify/ directory for all Actor development guides
Quick Reference
| Task | Pattern/Command | Documentation |
|---|---|---|
| Reconnaissance | Proxy-MCP (Traffic Interception + Stealth Browser) | workflows/reconnaissance.md |
| Traffic analysis | proxy_list_traffic() + proxy_get_exchange() | strategies/traffic-interception.md |
| Find sitemaps | RobotsFile.find(url) | strategies/sitemap-discovery.md |
| Filter sitemap URLs | RequestList + regex | reference/regex-patterns.md |
| Discover APIs | Traffic capture (automatic) | strategies/api-discovery.md |
| DOM scraping | DevTools bridge + humanizer | strategies/dom-scraping.md |
| HTTP scraping | CheerioCrawler | strategies/cheerio-scraping.md |
| Hybrid approach | Sitemap + API | strategies/hybrid-approaches.md |
| Handle blocking | Stealth mode + upstream proxies | strategies/anti-blocking.md |
| Session recording | proxy_session_start() / proxy_export_har() | strategies/session-workflows.md |
| Proxy-MCP tools | Complete reference | reference/proxy-tool-reference.md |
| Fingerprint configs | Stealth + TLS presets | reference/fingerprint-patterns.md |
| Create Apify Actor | apify create | apify/cli-workflow.md |
| Template selection | Cheerio vs Playwright | workflows/productionization.md |
| Input schema | .actor/input_schema.json | apify/input-schemas.md |
| Deploy actor | apify push | apify/deployment.md |
Common Patterns
Pattern 1: Sitemap-Based Scraping
import { RobotsFile, CheerioCrawler, Dataset } from 'crawlee';
// Auto-discover and parse sitemaps
const robots = await RobotsFile.find('https://example.com');
const urls = await robots.parseUrlsFromSitemaps();
const crawler = new CheerioCrawler({
async requestHandler({ $, request }) {
const data = {
title: $('h1').text().trim(),
// ... extract data
};
await Dataset.pushData(data);
},
});
await crawler.addRequests(urls);
await crawler.run();
See examples/sitemap-basic.js for complete example.
Pattern 2: API-Based Scraping
import { gotScraping } from 'got-scraping';
const productIds = [123, 456, 789];
for (const id of productIds) {
const response = await gotScraping({
url: `https://api.example.com/products/${id}`,
responseType: 'json',
});
console.log(response.body);
}
See examples/api-scraper.js for complete example.
Pattern 3: Hybrid (Sitemap + API)
// Get URLs from sitemap
const robots = await RobotsFile.find('https://shop.com');
const urls = await robots.parseUrlsFromSitemaps();
// Extract IDs from URLs
const productIds = urls
.map(url => url.match(/\/products\/(\d+)/)?.[1])
.filter(Boolean);
// Fetch data via API
for (const id of productIds) {
const data = await gotScraping({
url: `https://api.shop.com/v1/products/${id}`,
responseType: 'json',
});
// Process data
}
See examples/hybrid-sitemap-api.js for complete example.
Directory Navigation
This skill uses progressive disclosure - detailed information is organized in subdirectories and loaded only when needed.
Workflows (Implementation Patterns)
For: Step-by-step workflow guides for each phase
workflows/reconnaissance.md- Phase 1 interactive reconnaissance (CRITICAL)workflows/implementation.md- Phase 4 iterative implementation patternsworkflows/productionization.md- Phase 5 Apify Actor creation workflow
Strategies (Deep Dives)
For: Detailed guides on specific scraping approaches
strategies/traffic-interception.md- Traffic interception via MITM proxy (primary strategy)strategies/sitemap-discovery.md- Complete sitemap guide (4 patterns)strategies/api-discovery.md- Finding and using APIsstrategies/dom-scraping.md- DOM scraping via DevTools bridgestrategies/cheerio-scraping.md- HTTP-only scrapingstrategies/hybrid-approaches.md- Combining strategiesstrategies/anti-blocking.md- Multi-layer anti-detection (stealth, humanizer, proxies, TLS)strategies/session-workflows.md- Session recording, HAR export, replay
Examples (Runnable Code)
For: Working code to reference or execute
JavaScript Learning Examples (Simple standalone scripts):
examples/sitemap-basic.js- Simple sitemap scraperexamples/api-scraper.js- Pure API approachexamples/traffic-interception-basic.js- Proxy-based reconnaissanceexamples/hybrid-sitemap-api.js- Combined approachexamples/iterative-fallback.js- Try traffic interception→sitemap→API→DOM scraping
TypeScript Production Examples (Complete Actors):
apify/examples/basic-scraper/- Sitemap + Playwrightapify/examples/anti-blocking/- Fingerprinting + proxiesapify/examples/hybrid-api/- Sitemap + API (optimal)
Reference (Quick Lookup)
For: Quick patterns and troubleshooting
reference/proxy-tool-reference.md- Proxy-MCP tool reference (all 80+ tools)reference/regex-patterns.md- Common URL regex patternsreference/fingerprint-patterns.md- Stealth mode + TLS fingerprint presetsreference/anti-patterns.md- What NOT to do
Apify (Production Deployment)
For: Creating production Apify Actors
apify/README.md- When and how to use Apifyapify/typescript-first.md- Why TypeScript for actorsapify/cli-workflow.md- apify create workflow (CRITICAL)apify/initialization.md- Complete setup guideapify/input-schemas.md- Input validation patternsapify/configuration.md- actor.json setupapify/deployment.md- Testing and deploymentapify/templates/- TypeScript boilerplate
Note: Each file is self-contained and can be read independently. Claude will navigate to specific files as needed.
Core Principles
1. Traffic Interception First
Start with the approach that gives the most intelligence:
- Traffic Interception > Sitemap > API > DOM Scraping
- MITM proxy reveals hidden APIs automatically
- Stealth mode handles anti-detection out of the box
2. Progressive Enhancement
Start with the simplest approach that works:
- Static > Dynamic
- HTTP > Browser
- Cheerio > Playwright
3. Proactive Discovery
Always investigate before implementing:
- Capture traffic to discover APIs automatically
- Check for sitemaps
- Analyze site structure via traffic patterns
4. Iterative Implementation
Build incrementally:
- Small test batch first (5-10 items)
- Validate quality
- Scale or fallback
- Add robustness last
5. Production-Ready Code
When productionizing:
- Use TypeScript (strongly recommended)
- Use
apify create(never manual setup) - Add proper error handling
- Include logging and monitoring
Remember: Traffic interception first, sitemaps second, APIs third, DOM scraping last!
For detailed guidance on any topic, navigate to the relevant subdirectory file listed above.