# Firecrawl & Jina Web Scraping
## Firecrawl vs WebFetch

Prefer `firecrawl scrape URL --only-main-content` over the WebFetch tool—it produces cleaner markdown, handles JavaScript-heavy pages, and avoids content truncation (>80% benchmark coverage). WebFetch is acceptable as a fallback when Firecrawl is unavailable.

Preferred approach:

```bash
firecrawl scrape https://docs.example.com/api --only-main-content
```
## Token-Efficient Scraping

Inspired by Anthropic's dynamic filtering: always filter before reasoning. In Anthropic's benchmarks, this reduced input tokens by ~24% and improved accuracy by ~11%.
### The Principle: Search → Filter → Scrape → Filter → Reason

**DO:** Search (titles/URLs only) → Evaluate relevance → Scrape top hits → Filter by section → Reason

**DON'T:** Search → Scrape everything → Reason over all of it
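The DO pipeline above can be sketched in plain Python. The `pick_urls_to_scrape` helper and its naive keyword scoring are hypothetical, standing in for the model's own relevance judgment before anything gets scraped:

```python
def pick_urls_to_scrape(hits, query, k=3):
    """Rank search hits by naive query/title keyword overlap and keep the
    top k. Approximates the 'Evaluate relevance' step: only these k URLs
    get scraped, instead of everything the search returned."""
    terms = set(query.lower().split())

    def score(hit):
        return len(terms & set(hit["title"].lower().split()))

    ranked = sorted(hits, key=score, reverse=True)
    return [hit["url"] for hit in ranked[:k]]

# Example hits, as a search step might return them (titles/URLs only)
hits = [
    {"title": "Firecrawl API authentication guide", "url": "https://docs.example.com/auth"},
    {"title": "Company blog: our new office", "url": "https://example.com/blog/office"},
    {"title": "API reference", "url": "https://docs.example.com/api"},
]
print(pick_urls_to_scrape(hits, "API authentication", k=2))
# → ['https://docs.example.com/auth', 'https://docs.example.com/api']
```

Only the shortlisted URLs would then be passed to `firecrawl scrape`.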
### Step-by-Step Efficient Workflow

Step 1: Search — get titles/URLs only (cheap):

```bash
firecrawl search "query" --limit 20
```

Step 2: Evaluate the results and pick the 3–5 best URLs.

Step 3: Scrape only those, filtering to the relevant sections:

```bash
firecrawl scrape URL1 --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py \
  --sections "API,Authentication" --max-chars 5000
```
## Post-Processing with filter_web_results.py

Pipe any Firecrawl or Exa output through this script to reduce context before reasoning:

```bash
# Extract only matching sections from a scraped page
firecrawl scrape URL --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "Pricing,Plans"

# Keep only paragraphs with keywords
firecrawl search "query" --scrape --pretty | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --keywords "pricing,cost" --max-chars 5000

# Extract specific JSON fields from API output
python3 ~/.claude/skills/exa-search/scripts/exa_search.py "query" --json | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --fields "title,url,text" --max-chars 3000

# Combine filters with stats
firecrawl scrape URL --only-main-content | \
  python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py --sections "API" --keywords "endpoint" --compact --stats
```
Full path: `python3 ~/.claude/skills/firecrawl/scripts/filter_web_results.py`

Flags: `--sections`, `--keywords`, `--max-chars`, `--max-lines`, `--fields` (JSON), `--strip-links`, `--strip-images`, `--compact`, `--stats`
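Conceptually, a `--sections`-style filter keeps only matching headings plus their body text, then caps the total length. A minimal sketch of that idea (this is an illustration, not the bundled `filter_web_results.py`):

```python
def keep_sections(markdown: str, sections: list[str], max_chars: int = 5000) -> str:
    """Keep only markdown sections whose heading matches one of `sections`
    (case-insensitive substring match), then truncate to max_chars.
    Illustrative only; the real script supports many more flags."""
    wanted = [s.lower() for s in sections]
    out, keep = [], False
    for line in markdown.splitlines():
        if line.lstrip().startswith("#"):
            title = line.lstrip("#").strip().lower()
            keep = any(w in title for w in wanted)  # toggle at each heading
        if keep:
            out.append(line)
    return "\n".join(out)[:max_chars]

page = "# Intro\nwelcome\n# Pricing\n$10/mo\n# Contact\nemail us"
print(keep_sections(page, ["Pricing"]))
# → # Pricing
#   $10/mo
```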
## Other Token-Saving Patterns

- Use `--only-main-content` to strip navigation and footer boilerplate, reducing token consumption. Omit it only when nav/footer content is specifically needed.
- Use `firecrawl map URL --search "topic"` first to find relevant subpages before scraping.
- Use `--format links` first to get a URL list, evaluate it, then scrape selectively.
- Use `--max-chars` with `exa_contents.py` to cap extraction length.
- Use `--formats summary` (Python API script) over full text when you need the gist, not raw content.
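The links-first pattern amounts to filtering a URL list before scraping. A minimal sketch, with the hypothetical `shortlist_links` helper standing in for the evaluation step:

```python
def shortlist_links(links, include_terms, limit=5):
    """Given the URL list from a links-format pass, keep only URLs whose
    path mentions a term of interest, capped at `limit`, so the follow-up
    scrape touches a handful of pages instead of the whole list."""
    terms = [t.lower() for t in include_terms]
    picked = [u for u in links if any(t in u.lower() for t in terms)]
    return picked[:limit]

links = [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/changelog",
    "https://docs.example.com/api/errors",
]
print(shortlist_links(links, ["api"]))
# → ['https://docs.example.com/api/auth', 'https://docs.example.com/api/errors']
```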
## Claude API Native Tools (for API Agent Builders)
Anthropic's API now offers built-in dynamic-filtering tools:

`web_search_20260209` / `web_fetch_20260209`, enabled with the header `anthropic-beta: code-execution-web-tools-2026-02-09`.

These have built-in dynamic filtering via code execution. Use them when building Claude API agents directly. Use Firecrawl/Exa when you need autonomous agents, batch scraping, structured extraction, or domain-specific crawling, or when you are not on the Claude API.
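For orientation, a sketch of what a Messages API request using these tools might look like. The beta header value comes from above; the model id, key placeholder, and tool `name` fields are assumptions, so check the current Anthropic docs before relying on them:

```python
# Sketch only: builds the request shape, does not send it.
headers = {
    "x-api-key": "YOUR_ANTHROPIC_KEY",                          # placeholder
    "anthropic-version": "2023-06-01",
    "anthropic-beta": "code-execution-web-tools-2026-02-09",    # from this doc
}
body = {
    "model": "claude-sonnet-4-5",                               # placeholder model id
    "max_tokens": 1024,
    "tools": [
        {"type": "web_search_20260209", "name": "web_search"},  # name field assumed
        {"type": "web_fetch_20260209", "name": "web_fetch"},    # name field assumed
    ],
    "messages": [{"role": "user", "content": "Summarize https://example.com"}],
}
print(headers["anthropic-beta"])
# → code-execution-web-tools-2026-02-09
```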
## Available Tools

### Official Firecrawl CLI (`firecrawl`) — Primary

Setup: `npm install -g firecrawl-cli && firecrawl login --api-key $FIRECRAWL_API_KEY`
| Command | Purpose | Quick Example |
| --- | --- | --- |
| `scrape` | Single page → markdown | `firecrawl scrape URL --only-main-content` |
| `crawl` | Entire site with progress | `firecrawl crawl URL --wait --progress --limit 50` |
| `map` | Discover all URLs on a site | `firecrawl map URL --search "API"` |
| `search` | Web search (+ optional scrape) | `firecrawl search "query" --limit 10` |
Full CLI reference: references/cli-reference.md
### Auto-Save Alias (`fc-save`) — Shell Alias

Requires shell alias setup (not bundled with this skill).

```bash
fc-save URL
# → Saves to ~/Desktop/Screencaps & Chats/Web-Scrapes/docs-example-com-api.md
```
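The saved filename follows a URL-to-slug scheme. A hypothetical Python reconstruction of it (the real alias is user-defined shell, so treat this as a guess at its behavior):

```python
import re

def fc_save_filename(url: str) -> str:
    """Guess at the fc-save naming scheme: drop the scheme, collapse every
    non-alphanumeric run into a hyphen, lowercase, append .md."""
    bare = re.sub(r"^https?://", "", url).strip("/")
    slug = re.sub(r"[^A-Za-z0-9]+", "-", bare).strip("-").lower()
    return f"{slug}.md"

print(fc_save_filename("https://docs.example.com/api"))
# → docs-example-com-api.md
```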
### Python API Script (`firecrawl_api.py`) — Advanced Features

Command: `python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py <command>`

Requires: `FIRECRAWL_API_KEY` env var, `pip install firecrawl-py requests`
| Command | Purpose | Quick Example |
| --- | --- | --- |
| `search` | Web search with scraping | `firecrawl_api.py search "query" -n 10` |
| `scrape` | Single URL with page actions | `firecrawl_api.py scrape URL --formats markdown summary` |
| `batch-scrape` | Multiple URLs concurrently | `firecrawl_api.py batch-scrape URL1 URL2 URL3` |
| `crawl` | Website crawling | `firecrawl_api.py crawl URL --limit 20` |
| `map` | URL discovery | `firecrawl_api.py map URL --search "query"` |
| `extract` | LLM-powered structured extraction | `firecrawl_api.py extract URL --prompt "Find pricing"` |
| `agent` | Autonomous extraction (no URLs needed) | `firecrawl_api.py agent "Find YC W24 AI startups"` |
| `parallel-agent` | Bulk agent queries (v2.8.0+) | `firecrawl_api.py parallel-agent "Q1" "Q2" "Q3"` |
Agent models: `spark-1-fast` (10 credits, simple), `spark-1-mini` (default), `spark-1-pro` (thorough)
Full Python API reference: references/python-api-reference.md
### DeepWiki — GitHub Repo Documentation

```bash
~/.claude/skills/firecrawl/scripts/deepwiki.sh <owner/repo> [section] [options]
```

AI-generated wiki for any public GitHub repo. No API key required.

```bash
# Overview
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat

# Browse sections
~/.claude/skills/firecrawl/scripts/deepwiki.sh langchain-ai/langchain --toc

# Specific section
~/.claude/skills/firecrawl/scripts/deepwiki.sh karpathy/nanochat 4.1-gpt-transformer-implementation

# Full dump for RAG
~/.claude/skills/firecrawl/scripts/deepwiki.sh openai/openai-python --all --save
```
### Jina Reader (`jina`) — Fallback

Use when Firecrawl fails, or for Twitter/X URLs (Firecrawl blocks Twitter; Jina works).

```bash
jina https://x.com/username/status/123456
```
## Firecrawl vs Exa vs Native Claude Tools

| Need | Best Tool | Why |
| --- | --- | --- |
| Single page → markdown | `firecrawl scrape --only-main-content` | Cleanest output |
| Search + scrape in one shot | `firecrawl search --scrape` | Combined operation |
| Crawl entire site | `firecrawl crawl --wait --progress` | Link following + progress |
| Autonomous data finding | `firecrawl_api.py agent` | No URLs needed |
| Semantic/neural search | Exa `exa_search.py` | AI-powered relevance |
| Find research papers | Exa `--category "research paper"` | Academic index |
| Quick research answer | Exa `exa_research.py` | Citations + synthesis |
| Find similar pages | Exa `exa_similar.py` | Competitive analysis |
| Claude API agent building | Native `web_search_20260209` | Built-in dynamic filtering |
| Twitter/X content | `jina URL` | Only tool that works |
| GitHub repo docs | `deepwiki.sh owner/repo` | AI-generated wiki |
| Anti-bot / Cloudflare bypass | `scrapling stealth fetch` | Local Turnstile solver |
| Element-level extraction | `scrapling` + CSS selectors | Precision targeting, adaptive tracking |
| No-API-key scraping | `scrapling` HTTP fetch | 100% local, no credentials |
| Site redesign resilience | `scrapling` adaptive mode | SQLite similarity matching |
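The routing logic of this comparison can be summarized as a toy keyword lookup. Purely illustrative; the keys and default below are assumptions distilled from the rows above, not a real decision engine:

```python
def pick_tool(need: str) -> str:
    """Match the stated need against a few keywords and return the tool
    the comparison recommends; first match wins, single-page scrape is
    the default."""
    rules = [
        ("twitter", "jina"),
        ("x.com", "jina"),
        ("github", "deepwiki.sh"),
        ("semantic", "exa_search.py"),
        ("similar", "exa_similar.py"),
        ("crawl", "firecrawl crawl"),
        ("cloudflare", "scrapling stealth fetch"),
    ]
    n = need.lower()
    for key, tool in rules:
        if key in n:
            return tool
    return "firecrawl scrape --only-main-content"  # default: single page

print(pick_tool("scrape a Twitter thread"))
# → jina
```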
## Common Workflows

### Single Page Scraping

```bash
firecrawl scrape https://example.com/page --only-main-content

# Or auto-save:
fc-save URL

# Or save to a file:
firecrawl scrape URL --only-main-content -o page.md
```
### Documentation Crawling

```bash
# Map first, then crawl relevant paths
firecrawl map https://docs.example.com --search "API"
firecrawl crawl https://docs.example.com --include-paths /api,/guides --wait --progress
```
### Research Workflow

```bash
firecrawl search "machine learning best practices 2026" --scrape --scrape-formats markdown
```
### Agent-Powered Research (No URLs Needed)

```bash
python3 ~/.claude/skills/firecrawl/scripts/firecrawl_api.py agent \
  "Compare pricing tiers for Firecrawl, Apify, and ScrapingBee"
```
## Troubleshooting

```bash
# Check status and credits
firecrawl --status && firecrawl credit-usage

# Re-authenticate
firecrawl logout && firecrawl login --api-key $FIRECRAWL_API_KEY

# Check API key
echo $FIRECRAWL_API_KEY
```
- Scrape fails: try `jina URL`, or add `--wait-for 3000` for JS-heavy sites.
- Async job stuck: check with `crawl-status`/`batch-status`; cancel with `crawl-cancel`/`batch-cancel`.
- Disable telemetry: `export FIRECRAWL_NO_TELEMETRY=1`
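The escalation above (plain scrape, then add a JS wait, then fall back to Jina) can be written down as an ordered list of commands. A sketch that only builds the argv lists, without executing anything:

```python
def scrape_with_fallbacks(url: str):
    """Return the troubleshooting ladder as argv lists, in the order they
    should be tried: plain scrape → scrape with a JS wait → jina."""
    return [
        ["firecrawl", "scrape", url, "--only-main-content"],
        ["firecrawl", "scrape", url, "--only-main-content", "--wait-for", "3000"],
        ["jina", url],
    ]

for cmd in scrape_with_fallbacks("https://example.com"):
    print(" ".join(cmd))
```

A caller would run each with `subprocess.run` and stop at the first success.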
## Reference Documentation

| File | Contents |
| --- | --- |
| `references/cli-reference.md` | Full CLI parameter reference (scrape, crawl, map, search, fc-save, jina, deepwiki) |
| `references/python-api-reference.md` | Full Python API script reference (all commands, SDK examples) |
| `references/firecrawl-api.md` | Firecrawl Search API reference |
| `references/firecrawl-agent-api.md` | Agent API (spark models, parallel agents, webhooks) |
| `references/actions-reference.md` | Page actions for dynamic content (click, write, wait, scroll) |
| `references/branding-format.md` | Brand identity extraction (colors, fonts, UI) |
## Test Suite

```bash
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --quick         # Quick validation
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py                 # Full suite
python3 ~/.claude/skills/firecrawl/scripts/test_firecrawl.py --test scrape   # Specific test
```