Web Markdown Scraper
Use this skill when the user wants to:
- Scrape one or more public webpages (static or JavaScript-rendered)
- Convert HTML pages into clean Markdown
- Extract article/body text for summarization, analysis, or indexing
- Bypass anti-bot protections (Cloudflare, Datadome, etc.) via stealth mode
- Scrape many URLs concurrently (async mode)
- Track page elements reliably across website redesigns (automatch)
- Save the extracted results as .md files
Fetcher Mode Selection Guide
| Mode | Fetcher Class | Best For |
|---|---|---|
| http (default) | Fetcher | Fast static pages, RSS, APIs |
| async | AsyncFetcher | Batch of 5+ static URLs in parallel |
| stealth | StealthyFetcher | Anti-bot sites, Cloudflare, fingerprint checks |
| dynamic | PlayWrightFetcher | Heavy SPAs, React/Vue/Angular apps |
Decision rule: Start with http. If you get a 403 / CAPTCHA / empty body, switch
to stealth. If the content is rendered client-side (empty on first load), use dynamic.
Use async when scraping many static URLs at once to save time.
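The decision rule above can be read as a small function. This is only a sketch: choose_mode and its parameters are hypothetical names, not part of the script, and the precedence (client-side rendering first, then blocking, then batch size) is an assumption the text does not spell out.

```python
def choose_mode(url_count: int, blocked: bool = False,
                client_rendered: bool = False) -> str:
    """Pick a value for --mode following the decision rule (hypothetical helper).

    blocked: an earlier http attempt returned a 403, a CAPTCHA, or an empty body.
    client_rendered: the content only appears after JavaScript runs.
    """
    if client_rendered:
        return "dynamic"
    if blocked:
        return "stealth"
    if url_count >= 5:
        # The guide suggests async for batches of 5+ static URLs.
        return "async"
    return "http"


if __name__ == "__main__":
    print(choose_mode(1))
    print(choose_mode(12))
    print(choose_mode(1, blocked=True))
```

In practice the blocked and client_rendered signals would come from inspecting the result of a first http run.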
Inputs
URL sources
- --url URL: one target URL (repeat the flag for multiple: --url A --url B)
- --url-file FILE: plain text file with one URL per line
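As an illustration of the one-URL-per-line format, a file for --url-file could be pre-checked like this. The helper name parse_url_lines is hypothetical, and skipping blank or non-http(s) lines is an assumption about sensible input hygiene, not a description of what the script itself does.

```python
from urllib.parse import urlparse


def parse_url_lines(text: str) -> list:
    """Return the http(s) URLs found in text, one candidate per line."""
    urls = []
    for line in text.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # ignore blank lines
        parsed = urlparse(candidate)
        # Keep only absolute http:// or https:// URLs with a host part.
        if parsed.scheme in ("http", "https") and parsed.netloc:
            urls.append(candidate)
    return urls


if __name__ == "__main__":
    sample = "https://example.com/a\n\nftp://example.com/b\nhttp://example.org/c\n"
    print(parse_url_lines(sample))
```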
Fetcher
- --mode http|async|stealth|dynamic: fetcher backend (default: http)
Content extraction
- --selector CSS: CSS selector for the main content area (omit for the full page)
- --preserve-links: keep hyperlinks in the Markdown output
- --output-dir DIR: save per-page .md files and a master index.json here
AutoMatch — production resilience
- --auto-save: fingerprint and persist the selected elements to the local DB on the first run
- --auto-match: on subsequent runs, find elements by fingerprint even if the site layout has changed (the CSS selector does not need to be updated)
Browser options (stealth / dynamic only)
- --headless true|false|virtual: headless mode; virtual uses Xvfb (default: true)
- --network-idle: wait until there has been no network activity for at least 500 ms before capturing
- --block-images: block image loading (saves bandwidth and proxy quota)
- --disable-resources: drop fonts, images, media, and stylesheets for roughly 25% faster loads
- --wait-selector CSS: pause until this element appears in the DOM
- --wait-selector-state attached|visible|detached|hidden: element state to wait for (default: attached)
- --timeout MS: global timeout in ms (default: 30000)
- --wait MS: extra idle wait after page load, in ms
StealthyFetcher extras (stealth mode only)
- --humanize SECONDS: simulate human-like cursor movement (maximum duration in seconds)
- --geoip: spoof browser timezone, locale, language, and WebRTC IP from the proxy's geolocation
- --block-webrtc: prevent real-IP leaks via WebRTC
- --disable-ads: install uBlock Origin in the browser session
- --proxy URL: HTTP/SOCKS proxy as a URL string, or as JSON such as '{"server":"host:port","username":"u","password":"p"}'
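Since --proxy accepts either a URL string or a JSON object, a caller may want to normalise the value before passing it on. The sketch below shows one way to tell the two forms apart; parse_proxy is a hypothetical helper, and the "starts with a brace" test and the required server key are assumptions, not documented behaviour of the script.

```python
import json


def parse_proxy(value: str):
    """Return a dict for the JSON form, or the stripped string for the URL form."""
    text = value.strip()
    if text.startswith("{"):
        config = json.loads(text)  # raises ValueError on malformed JSON
        if "server" not in config:
            raise ValueError("proxy JSON needs a 'server' key")
        return config
    return text


if __name__ == "__main__":
    print(parse_proxy("http://user:pass@host:8080"))
    print(parse_proxy('{"server":"host:port","username":"u","password":"p"}'))
```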
Reliability
- --retry N: retry failed requests up to N times with exponential backoff (max 30 s)
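To make "exponential backoff (max 30 s)" concrete, here is a minimal sketch of a capped delay schedule. The 1-second base, the doubling factor, and reading "max 30 s" as a cap on each individual delay are all assumptions; the script's real schedule may differ.

```python
def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Delay in seconds before each retry: base * 2**attempt, capped at cap."""
    return [min(cap, base * 2 ** attempt) for attempt in range(retries)]


if __name__ == "__main__":
    # With --retry 3 this schedule would wait 1 s, 2 s, then 4 s.
    print(backoff_delays(3))
```

Under these assumptions the cap only matters from the sixth retry onward, where the uncapped delay would be 32 s.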
Rules
- Only process public http:// or https:// pages.
- Never bypass login walls, CAPTCHAs, paywalls, or access controls.
- Prefer the main article or body content; avoid polluting the output with navigation, headers, footers, or cookie banners. Use --selector to target the content area.
- When --auto-save is used, always also pass --selector so Scrapling knows which element fingerprint to record.
- On subsequent runs for layout-changed pages, use --auto-match instead of --auto-save. Do not use both flags at once.
- Use --mode async for batch jobs with 5+ static URLs for parallel execution.
- Combine --disable-resources with --block-images in stealth/dynamic mode when you only need text content; this can cut load times by up to 40%.
- Always inspect the top-level ok field and the per-result ok fields before using content.
- If ok is false, report the exact error string; do not invent or guess content.
- When --network-idle is insufficient, use --wait-selector for a specific DOM element to guarantee the content has loaded before capture.
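Several of these rules can be checked mechanically before the script is ever invoked. The following is a hedged sketch of a wrapper that assembles the argument list and rejects combinations the rules forbid; build_command and its parameters are hypothetical, and only flags listed in this document are used.

```python
from urllib.parse import urlparse

MODES = ("http", "async", "stealth", "dynamic")


def build_command(script, urls, mode="http", selector=None,
                  auto_save=False, auto_match=False, output_dir=None):
    """Return an argv list for the scraper, enforcing the rules above."""
    if not urls:
        raise ValueError("at least one URL is required")
    for url in urls:
        if urlparse(url).scheme not in ("http", "https"):
            raise ValueError(f"only http:// or https:// URLs are allowed: {url}")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if auto_save and auto_match:
        raise ValueError("use --auto-save or --auto-match, not both")
    if auto_save and not selector:
        raise ValueError("--auto-save needs --selector")

    cmd = ["python3", script]
    for url in urls:
        cmd += ["--url", url]
    cmd += ["--mode", mode]
    if selector:
        cmd += ["--selector", selector]
    if auto_save:
        cmd.append("--auto-save")
    if auto_match:
        cmd.append("--auto-match")
    if output_dir:
        cmd += ["--output-dir", output_dir]
    return cmd


if __name__ == "__main__":
    print(build_command("scrape_to_markdown.py", ["https://example.com"],
                        selector=".article-body", auto_save=True))
```

The returned list can be handed to subprocess.run as-is, which avoids shell-quoting problems with selectors and proxy strings.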
Command Patterns
Basic static page
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>"
Static page — target specific content area
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --selector "article.main-content"
Stealth mode — bypass anti-bot protection
python3 "{baseDir}/scrape_to_markdown.py" --url "<URL>" --mode stealth --network-idle
Stealth + proxy + human fingerprint (maximum stealth)
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--mode stealth \
--proxy "http://user:pass@host:port" \
--humanize 2.0 \
--geoip \
--block-webrtc \
--network-idle
Dynamic SPA page (Playwright Chromium)
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--mode dynamic \
--wait-selector ".product-list" \
--network-idle \
--disable-resources
Async concurrent batch (multiple URLs)
python3 "{baseDir}/scrape_to_markdown.py" \
--mode async \
--url "<URL1>" --url "<URL2>" --url "<URL3>"
Batch from file + stealth + save to disk
python3 "{baseDir}/scrape_to_markdown.py" \
--url-file urls.txt \
--mode stealth \
--disable-resources \
--output-dir outputs
First-run automatch setup (save fingerprint)
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--selector ".article-body" \
--auto-save \
--output-dir outputs
Subsequent run after site layout change (adaptive match)
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--selector ".article-body" \
--auto-match \
--output-dir outputs
Full production scrape
python3 "{baseDir}/scrape_to_markdown.py" \
--url "<URL>" \
--mode stealth \
--selector "main article" \
--auto-match \
--preserve-links \
--network-idle \
--disable-resources \
--timeout 60000 \
--retry 3 \
--output-dir outputs
Output Handling
JSON is printed to stdout. Always check ok before using content.
Top-level fields:
- ok: true only if every URL succeeded
- total / succeeded / failed: count summary
- results: array of per-URL result objects
- output_index_file: path to the saved index.json (if --output-dir is used)
Per-URL result fields (when ok: true):
- url: the requested URL
- status: HTTP status code (e.g. 200)
- title: page <title> text
- markdown: extracted content as Markdown (use this as the main content)
- markdown_length: character count (useful for quality checks)
- output_markdown_file: path to the saved .md file (if --output-dir is used)
On failure (ok: false in a result):
- error: exact error message; report this verbatim and do not invent content
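A consumer of this JSON might separate successes from failures along the following lines. This is a sketch under assumptions: collect_results is a hypothetical helper, the sample payload is invented for illustration, and whether failed results also carry a url field is not stated above, so it is read defensively.

```python
import json


def collect_results(stdout_text: str):
    """Split the scraper's JSON output into (all_ok, pages, errors)."""
    payload = json.loads(stdout_text)
    pages, errors = [], []
    for result in payload.get("results", []):
        if result.get("ok"):
            pages.append((result["url"], result["markdown"]))
        else:
            # Keep the error string exactly as reported; never substitute content.
            errors.append((result.get("url"), result.get("error")))
    return bool(payload.get("ok")), pages, errors


if __name__ == "__main__":
    # Made-up payload shaped like the fields described above.
    sample = json.dumps({
        "ok": False, "total": 2, "succeeded": 1, "failed": 1,
        "results": [
            {"ok": True, "url": "https://example.com", "status": 200,
             "title": "Example", "markdown": "# Example", "markdown_length": 9},
            {"ok": False, "url": "https://example.org", "error": "timeout"},
        ],
    })
    all_ok, pages, errors = collect_results(sample)
    print(all_ok, pages, errors)
```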