Scrapling CLI

Web scraping CLI with browser impersonation, anti-bot bypass, and CSS extraction.

Prerequisites

Install with all extras (CLI needs click, fetchers need playwright/camoufox)

uv tool install 'scrapling[all]'

Install fetcher browser engines (one-time)

scrapling install

Verify: scrapling --help
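
A preflight check can confirm the CLI is on PATH before any extraction is attempted. The sketch below is only an illustration: the helper name require_cmd is hypothetical, and it relies on the POSIX built-in command -v rather than on anything Scrapling provides.

```shell
# Hypothetical helper: succeed only if the named command is on PATH.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        return 0
    fi
    echo "missing command: $1" >&2
    return 1
}

# Usage: stop early when the CLI is not installed.
# require_cmd scrapling || exit 1
```

If the check fails for scrapling, rerunning the install command above is the likely fix.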

Fetcher Selection

| Tier | Command | Engine | Speed | Stealth | JS | Use When |
|---|---|---|---|---|---|---|
| HTTP | extract get/post/put/delete | httpx + TLS impersonation | Fast | Medium | No | Static pages, APIs, most sites |
| Dynamic | extract fetch | Playwright (headless browser) | Medium | Low | Yes | JS-rendered SPAs, wait-for-element |
| Stealthy | extract stealthy-fetch | Camoufox (patched Firefox) | Slow | High | Yes | Cloudflare, aggressive anti-bot |

Default to HTTP tier — only escalate when the page requires JS rendering or blocks HTTP requests.

Output Format

Determined by output file extension:

| Extension | Output | Best For |
|---|---|---|
| .html | Raw HTML | Parsing, further processing |
| .md | HTML converted to Markdown | Reading, LLM context |
| .txt | Text content only | Clean text extraction |

Always use /tmp/scrapling-*.{md,txt,html} for output files. Read the file after extraction.
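
The naming rule above can be wrapped in a small helper so that every call site produces a conforming path. This is a minimal sketch: the function name scrapling_out_path is hypothetical, and the PID-plus-timestamp scheme is an assumption chosen only to make collisions between separate runs unlikely.

```shell
# Hypothetical helper: print a /tmp/scrapling-*.EXT path for a given format.
# Accepts only the three extensions listed in the table above.
scrapling_out_path() {
    case "$1" in
        md|txt|html) ;;
        *) echo "unsupported extension: $1" >&2; return 1 ;;
    esac
    # Shell PID plus a Unix timestamp; not guaranteed unique within one second.
    printf '/tmp/scrapling-%s-%s.%s\n' "$$" "$(date +%s)" "$1"
}

# Usage:
# out=$(scrapling_out_path md) && scrapling extract get "https://example.com" "$out"
```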

Core Commands

HTTP Tier: GET

scrapling extract get URL OUTPUT_FILE [OPTIONS]

| Flag | Purpose | Example |
|---|---|---|
| -s, --css-selector | Extract matching elements only | -s ".article-body" |
| --impersonate | Force specific browser | --impersonate firefox |
| -H, --headers | Custom headers (repeatable) | -H "Authorization: Bearer tok" |
| --cookies | Cookie string | --cookies "session=abc123" |
| --proxy | Proxy URL | --proxy "http://user:pass@host:port" |
| -p, --params | Query params (repeatable) | -p "page=2" -p "limit=50" |
| --timeout | Seconds (default: 30) | --timeout 60 |
| --no-verify | Skip SSL verification | For self-signed certs |
| --no-follow-redirects | Don't follow redirects | For redirect inspection |
| --no-stealthy-headers | Disable stealth headers | For debugging |

Examples:

Basic page fetch as markdown

scrapling extract get "https://example.com" /tmp/scrapling-out.md

Extract only article content

scrapling extract get "https://news.site.com/article" /tmp/scrapling-out.txt -s "article"

Child-combinator selector (all matching elements are returned)

scrapling extract get "https://news.ycombinator.com" /tmp/scrapling-out.txt -s ".titleline > a"

With auth header

scrapling extract get "https://api.example.com/data" /tmp/scrapling-out.txt -H "Authorization: Bearer TOKEN"

Impersonate Firefox

scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate firefox

Random browser impersonation from list

scrapling extract get "https://example.com" /tmp/scrapling-out.md --impersonate "chrome,firefox,safari"

With proxy

scrapling extract get "https://example.com" /tmp/scrapling-out.md --proxy "http://proxy:8080"
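
When several pages need the same treatment, the GET form above can be looped. The sketch below is an assumption-laden illustration, not part of Scrapling: the function name get_many and the /tmp/scrapling-batch-N.md naming are hypothetical, and only flags from the table above (-s, --timeout) are used.

```shell
# Hypothetical batch helper: fetch each URL with the HTTP tier into its own
# /tmp/scrapling-batch-N.md file and print the path of each output file.
# First argument is a CSS selector (pass "" for none); the rest are URLs.
get_many() {
    selector=$1
    shift
    n=0
    for url in "$@"; do
        n=$((n + 1))
        out="/tmp/scrapling-batch-$n.md"
        if [ -n "$selector" ]; then
            scrapling extract get "$url" "$out" -s "$selector" --timeout 60 \
                || echo "failed: $url" >&2
        else
            scrapling extract get "$url" "$out" --timeout 60 \
                || echo "failed: $url" >&2
        fi
        echo "$out"
    done
}

# Usage:
# get_many "article" "https://example.com/a" "https://example.com/b"
```

Each output file should still be read and validated after the loop, since a successful exit does not guarantee useful content.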

HTTP Tier: POST

scrapling extract post URL OUTPUT_FILE [OPTIONS]

Additional options over GET:

| Flag | Purpose | Example |
|---|---|---|
| -d, --data | Form data | -d "param1=value1&param2=value2" |
| -j, --json | JSON body | -j '{"key": "value"}' |

POST with form data

scrapling extract post "https://api.example.com/search" /tmp/scrapling-out.txt -d "q=test&page=1"

POST with JSON

scrapling extract post "https://api.example.com/query" /tmp/scrapling-out.txt -j '{"query": "test"}'

PUT takes the same options as POST; DELETE takes the same options as GET.

Dynamic Tier: fetch

For JS-rendered pages. Launches a headless Playwright browser.

scrapling extract fetch URL OUTPUT_FILE [OPTIONS]

| Flag | Purpose | Default |
|---|---|---|
| --headless/--no-headless | Headless mode | True |
| --disable-resources | Drop images/CSS/fonts for speed | False |
| --network-idle | Wait for network idle | False |
| --timeout | Milliseconds | 30000 |
| --wait | Extra wait after load (ms) | 0 |
| -s, --css-selector | CSS selector extraction | (none) |
| --wait-selector | Wait for element before proceeding | (none) |
| --real-chrome | Use installed Chrome instead of bundled | False |
| --proxy | Proxy URL | (none) |
| -H, --extra-headers | Extra headers (repeatable) | (none) |

Fetch JS-rendered SPA

scrapling extract fetch "https://spa-app.com" /tmp/scrapling-out.md

Wait for specific element to load

scrapling extract fetch "https://dashboard.com" /tmp/scrapling-out.md --wait-selector ".data-table"

Fast mode: skip images/CSS, wait for network idle

scrapling extract fetch "https://app.com" /tmp/scrapling-out.md --disable-resources --network-idle

Extra wait for slow-loading content

scrapling extract fetch "https://lazy-site.com" /tmp/scrapling-out.md --wait 5000

Stealthy Tier: stealthy-fetch

Maximum anti-detection. Uses Camoufox (patched Firefox).

scrapling extract stealthy-fetch URL OUTPUT_FILE [OPTIONS]

Additional options over fetch:

| Flag | Purpose | Default |
|---|---|---|
| --solve-cloudflare | Solve Cloudflare challenges | False |
| --block-webrtc | Block WebRTC (prevents IP leak) | False |
| --hide-canvas | Add noise to canvas fingerprinting | False |
| --block-webgl | Block WebGL fingerprinting | False (allowed) |

Bypass Cloudflare

scrapling extract stealthy-fetch "https://cf-protected.com" /tmp/scrapling-out.md --solve-cloudflare

Maximum stealth

scrapling extract stealthy-fetch "https://aggressive-antibot.com" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl

Stealthy with CSS selector

scrapling extract stealthy-fetch "https://protected.com" /tmp/scrapling-out.txt \
  --solve-cloudflare -s ".content"

Auto-Escalation Protocol

ALL scrapling usage must follow this protocol. Never use extract get alone; always validate content and escalate if needed. Consumer skills (res-deep, res-price-compare, doc-daily-digest) MUST use this pattern, not a bare extract get.

Step 1: HTTP Tier

scrapling extract get "URL" /tmp/scrapling-out.md

Read /tmp/scrapling-out.md and validate content before proceeding.

Step 2: Validate Content

Check the scraped output for thin content indicators — signs that the site requires JS rendering:

| Indicator | Pattern | Example |
|---|---|---|
| JS disabled warning | "JavaScript", "enable JavaScript", "JS wyłączony" | iSpot.pl, many SPAs |
| No product/price data | Output has navigation and footer but no prices, specs, or product names | E-commerce SPAs |
| Mostly nav links | 80%+ of content is menu items, category links, cookie banners | React/Angular/Vue apps |
| Very short content | Less than ~20 meaningful lines after stripping nav/footer | Hydration-dependent pages |

If ANY indicator is present → escalate to Dynamic tier. Do NOT treat HTTP 200 with thin content as success.
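
Two of the indicators above (JS warnings and very short content) can be approximated mechanically. The sketch below is a rough heuristic under stated assumptions: the function name is_thin_content is hypothetical, the warning patterns are narrower than a bare "JavaScript" match to avoid flagging articles about JavaScript, and the line count does not strip nav or footer. The "no product data" and "mostly nav links" indicators still need a manual read.

```shell
# Hypothetical validator: exit 0 when the file looks thin, 1 when it looks usable.
is_thin_content() {
    file=$1
    # Missing or empty output always counts as thin.
    [ -s "$file" ] || return 0
    # JS-disabled warnings (case-insensitive), including the Polish variant.
    if grep -qiE 'enable javascript|javascript is (disabled|required)|JS wyłączony' "$file"; then
        return 0
    fi
    # Count lines containing at least one non-space character.
    lines=$(grep -c '[^[:space:]]' "$file" || true)
    # Thin when fewer than 20 non-blank lines remain.
    [ "$lines" -lt 20 ]
}

# Usage:
# if is_thin_content /tmp/scrapling-out.md; then echo "escalate to Dynamic tier"; fi
```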

Step 3: Dynamic Tier (if content validation fails)

scrapling extract fetch "URL" /tmp/scrapling-out.md --network-idle --disable-resources

Read and validate again. If content is now rich → done. If still blocked (403, Cloudflare challenge, empty) → escalate.

Step 4: Stealthy Tier (if Dynamic tier fails)

scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md --solve-cloudflare

If still blocked, add maximum stealth flags:

scrapling extract stealthy-fetch "URL" /tmp/scrapling-out.md \
  --solve-cloudflare --block-webrtc --hide-canvas --block-webgl

Consumer Skill Integration

When a consumer skill says "retry with scrapling" or "scrapling fallback", it means: follow the full auto-escalation protocol above, not just the HTTP tier. The pattern:

extract get → Read → Validate content
Content thin? → extract fetch --network-idle --disable-resources → Read → Validate
Still blocked? → extract stealthy-fetch --solve-cloudflare → Read
All tiers fail? → Skip and label "scrapling blocked"
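
The four-line pattern above can be sketched as a single shell function. Everything here is illustrative: the names looks_thin and scrape_with_escalation are hypothetical, the validator is a simplified stand-in for the content checks in Step 2, and only the commands and flags already shown in the protocol are used.

```shell
# Simplified stand-in validator (assumption): thin means empty, a JS warning,
# or fewer than 20 non-blank lines.
looks_thin() {
    [ -s "$1" ] || return 0
    grep -qi 'enable javascript' "$1" && return 0
    [ "$(grep -c '[^[:space:]]' "$1" || true)" -lt 20 ]
}

# Hypothetical wrapper around the three tiers. Prints the tier that produced
# usable content, or "scrapling blocked" (exit status 1) when all tiers fail.
scrape_with_escalation() {
    url=$1
    out=$2
    rm -f "$out"   # avoid validating a stale file from an earlier run

    # A failing command is tolerated; the content check decides what happens next.
    scrapling extract get "$url" "$out" || true
    if ! looks_thin "$out"; then echo "http"; return 0; fi

    scrapling extract fetch "$url" "$out" --network-idle --disable-resources || true
    if ! looks_thin "$out"; then echo "dynamic"; return 0; fi

    scrapling extract stealthy-fetch "$url" "$out" --solve-cloudflare || true
    if ! looks_thin "$out"; then echo "stealthy"; return 0; fi

    echo "scrapling blocked"
    return 1
}

# Usage:
# tier=$(scrape_with_escalation "https://example.com" /tmp/scrapling-out.md)
```

Returning the tier name lets a consumer skill record which tier succeeded, which may help decide where to start on the next request to the same site.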

Known JS-rendered sites (always start at Dynamic tier):

iSpot.pl — React SPA, HTTP tier returns only nav shell
Single-page apps with client-side routing (hash or history API URLs)
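
Choosing the starting tier from the URL could look like the sketch below. The function name initial_tier and the host patterns are assumptions for illustration; only hash-based routing is detectable from the URL alone, so History API apps would have to be added to the host list by hand.

```shell
# Hypothetical lookup: print the extract subcommand to start with for a URL.
initial_tier() {
    case "$1" in
        *://ispot.pl|*://ispot.pl/*|*://www.ispot.pl|*://www.ispot.pl/*)
            echo "fetch" ;;   # known React SPA
        *'#/'*|*'#!'*)
            echo "fetch" ;;   # hash-based client-side routing
        *)
            echo "get" ;;     # default: HTTP tier
    esac
}

# Usage:
# scrapling extract "$(initial_tier "$url")" "$url" /tmp/scrapling-out.md
```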

Interactive Shell

Launch REPL

scrapling shell

One-liner evaluation

scrapling shell -c 'Fetcher().get("https://example.com").css("title::text")'

Troubleshooting

| Issue | Fix |
|---|---|
| ModuleNotFoundError: click | Reinstall: uv tool install --force 'scrapling[all]' |
| fetch/stealthy-fetch fails | Run scrapling install to install browser engines |
| Cloudflare still blocks | Add --block-webrtc --hide-canvas to stealthy-fetch |
| Timeout | Increase --timeout (seconds for HTTP, milliseconds for fetch/stealthy) |
| SSL error | Add --no-verify (HTTP tier only) |
| Empty output with selector | Try without -s first to verify page loads, then refine selector |

Constraints

Output file path is required — scrapling writes to file, not stdout
CSS selectors return ALL matches concatenated
HTTP tier timeout is in seconds, fetch/stealthy-fetch timeout is in milliseconds
--impersonate only available on HTTP tier (fetch/stealthy handle it internally)
--solve-cloudflare only on stealthy-fetch tier
Stealth headers enabled by default on HTTP tier — disable with --no-stealthy-headers for debugging
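
The timeout unit mismatch noted above is easy to get wrong when switching tiers. A small conversion helper, sketched here under the assumption that callers think in seconds, keeps the unit handling in one place; the name tier_timeout is hypothetical.

```shell
# Hypothetical helper: convert a timeout in seconds to the unit each
# extract subcommand expects.
tier_timeout() {
    case "$1" in
        get|post|put|delete)  echo "$2" ;;                # HTTP tier: seconds
        fetch|stealthy-fetch) echo "$(( $2 * 1000 ))" ;;  # browser tiers: milliseconds
        *) echo "unknown subcommand: $1" >&2; return 1 ;;
    esac
}

# Usage:
# scrapling extract fetch "URL" /tmp/scrapling-out.md --timeout "$(tier_timeout fetch 60)"
```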
