
Install: npx skills add gn00678465/crawler-skill/gn00678465-crawler-skill-crawler

Crawler Skill

Converts any URL into clean markdown using a robust 3-tier fallback chain.

Quick start

uv run scripts/crawl.py --url https://example.com --output reports/example.md

Markdown is saved to the file specified by --output. Progress and error messages go to stderr. Exit code is 0 on success, 1 if all scrapers fail.

How it works

The script tries each tier in order and returns the first success:

Tier  Module                              Requires
1     Firecrawl (firecrawl_scraper.py)    FIRECRAWL_API_KEY env var (optional; falls back if missing)
2     Jina Reader (jina_reader.py)        Nothing (free, no key needed)
3     Scrapling (scrapling_scraper.py)    Local headless browser (auto-installs via pip)
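The tiered fallback described above can be sketched as a simple loop. This is a hypothetical illustration, not the actual code in crawl.py; the (name, callable) scraper list and the helper name are assumptions.

```python
def crawl_with_fallback(url, scrapers):
    # Try each scraper tier in order and return the first result that
    # passes a minimum-content check; a failing tier escalates to the next.
    errors = []
    for name, scrape in scrapers:
        try:
            markdown = scrape(url)
            if markdown and len(markdown) >= 100:  # minimum-content check
                return markdown
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all scrapers failed: " + "; ".join(errors))
```

A tier that raises or returns too little content is simply skipped, which matches the exit-code contract: success as soon as any tier produces usable markdown, failure only when every tier has been exhausted.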

File layout

crawler-skill/
├── SKILL.md                      ← this file
├── scripts/
│   ├── crawl.py                  ← main CLI entry point (PEP 723 inline deps)
│   └── src/
│       ├── domain_router.py      ← URL-to-tier routing rules
│       ├── firecrawl_scraper.py  ← Tier 1: Firecrawl API
│       ├── jina_reader.py        ← Tier 2: Jina r.jina.ai proxy
│       └── scrapling_scraper.py  ← Tier 3: local headless scraper
└── tests/
    └── test_crawl.py             ← 70 pytest tests (all passing)

Usage examples

Basic fetch (tries Firecrawl, falls back to Jina, then Scrapling). Always prefer --output to avoid terminal encoding issues:

uv run scripts/crawl.py --url https://docs.python.org/3/ --output reports/python_docs.md

If no --output is provided, markdown goes to stdout (not recommended on Windows)

uv run scripts/crawl.py --url https://example.com

With a Firecrawl API key for best results

FIRECRAWL_API_KEY=fc-... uv run scripts/crawl.py --url https://example.com --output reports/example.md
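Tier 1 runs only when an API key is present, so a key-presence check like the following decides whether Firecrawl is attempted at all. This is a hedged sketch; the helper name is hypothetical and the real check lives in scripts/src/firecrawl_scraper.py.

```python
import os

def firecrawl_enabled() -> bool:
    # Tier 1 is attempted only when FIRECRAWL_API_KEY is set; otherwise the
    # chain falls through directly to the Jina Reader tier.
    return bool(os.environ.get("FIRECRAWL_API_KEY"))
```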

URL requirements

Only http:// and https:// URLs are accepted. Passing any other scheme (ftp://, file://, javascript:, a bare path, etc.) exits with code 1 and prints a clear error; no scraping is attempted.
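The scheme check above amounts to a one-line rule. A minimal sketch, assuming a hypothetical helper name (the real validation is inside crawl.py):

```python
from urllib.parse import urlparse

def is_crawlable_url(url: str) -> bool:
    # Accept only http/https; ftp://, file://, javascript:, and bare
    # paths (which parse with an empty scheme) are all rejected.
    return urlparse(url).scheme.lower() in ("http", "https")
```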

Saving Reports

When the user asks to save the crawled content or a summary to a file, ALWAYS use the --output argument and save the file into the reports/ directory at the project root (for example, {project_root}/reports). If the directory does not exist, the script will create it.

Example: If asked to "save to result.md", you should run: uv run scripts/crawl.py --url <URL> --output reports/result.md

Point at a self-hosted Firecrawl instance

FIRECRAWL_API_URL=http://localhost:3002 uv run scripts/crawl.py --url https://example.com

Content validation

Each scraper validates its output before returning success:

  • Minimum 100 characters of content (rejects empty/error pages)

  • Detection of CAPTCHA / bot-verification pages (Firecrawl)

  • Detection of Cloudflare interstitial pages (Scrapling — escalates to StealthyFetcher)

  • Detection of Jina error page indicators (Error: , Access Denied , etc.)
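The length and error-marker checks listed above can be condensed into a small predicate. A hedged sketch, with a hypothetical helper name; the real scrapers apply tier-specific logic (e.g. Scrapling escalating to StealthyFetcher on Cloudflare pages) that is not reproduced here.

```python
ERROR_MARKERS = ("Error:", "Access Denied")  # error-page indicators named above

def passes_validation(markdown: str) -> bool:
    # Reject pages that are too short to be real content, then scan for
    # known error-page markers before declaring the scrape a success.
    if len(markdown) < 100:
        return False
    return not any(marker in markdown for marker in ERROR_MARKERS)
```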

Domain routing

Certain hostnames bypass one or more scraper tiers to avoid known compatibility issues. The logic lives in scripts/src/domain_router.py.

Domain                        Skipped tiers       Active chain
medium.com (and subdomains)   firecrawl           jina → scrapling
mp.weixin.qq.com              firecrawl + jina    scrapling only
everything else               (none)              firecrawl → jina → scrapling

Sub-domain matching follows a suffix rule: blog.medium.com matches the medium.com rule because its hostname ends with .medium.com. A sibling sub-domain such as other.weixin.qq.com does not match the mp.weixin.qq.com rule.
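The suffix rule above can be sketched in a couple of lines. This is an illustration with a hypothetical function name; the authoritative logic is in scripts/src/domain_router.py.

```python
def domain_matches(hostname: str, rule: str) -> bool:
    # Exact match, or any subdomain of the rule (hostname ends with ".rule").
    # Siblings like other.weixin.qq.com do not end with ".mp.weixin.qq.com".
    hostname = hostname.lower()
    return hostname == rule or hostname.endswith("." + rule)
```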

Running tests

uv run pytest tests/ -v

All 70 tests use mocking — no network calls, no API keys required.
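A mocked-tier test might look like the following. This is a sketch, not one of the 70 actual tests: fetch_via_jina is a simplified stand-in for the real jina_reader.py module, whose API may differ.

```python
from unittest import mock

def fetch_via_jina(url, client):
    # Simplified stand-in for the Jina Reader tier: prepend the r.jina.ai
    # proxy prefix and return the response body as markdown.
    response = client.get("https://r.jina.ai/" + url)
    response.raise_for_status()
    return response.text

def test_jina_tier_without_network():
    # A mocked client means no HTTP request is ever sent and no key is needed.
    client = mock.Mock()
    client.get.return_value = mock.Mock(text="# Example Domain\n" + "body " * 40)
    assert fetch_via_jina("https://example.com", client).startswith("# Example")
```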

Dependencies (auto-installed by uv run)

  • firecrawl-py>=2.0 — Firecrawl Python SDK

  • httpx>=0.27 — HTTP client for Jina Reader

  • scrapling>=0.2 — Headless scraping with stealth support

  • html2text>=2024.2.26 — HTML-to-markdown conversion

When to invoke this skill

Invoke crawl.py whenever you need the text content of a web page:

import subprocess

result = subprocess.run(
    ["uv", "run", "scripts/crawl.py", "--url", url],
    capture_output=True,
    text=True,
)
if result.returncode == 0:
    markdown = result.stdout

Or simply run it directly from the terminal as shown in Quick start above.
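Since the document recommends --output over stdout, a wrapper can invoke the script with --output and read the saved file back. A hedged sketch: the helper name is hypothetical, and the injectable runner parameter exists only to make the sketch testable without actually running uv.

```python
import pathlib
import subprocess

def crawl_to_file(url, out="reports/page.md", runner=subprocess.run):
    # Run crawl.py with --output, then read the saved markdown from disk.
    # Avoids stdout entirely, sidestepping terminal encoding issues.
    result = runner(
        ["uv", "run", "scripts/crawl.py", "--url", url, "--output", out],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr)
    return pathlib.Path(out).read_text(encoding="utf-8")
```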

