# Rust Web Crawler (rcrawler)
High-performance web crawler built in pure Rust with production-grade features for fast, reliable site crawling.
## When to Use This Skill
Use this skill when the user requests:
- Web crawling or site mapping
- Sitemap discovery and analysis
- Link extraction and validation
- Site structure visualization
- robots.txt compliance checking
- Performance-critical web scraping
- Generating interactive web reports with graph visualization
## Core Capabilities

### 🚀 Performance
- 60+ pages/sec throughput with the async Tokio runtime
- <50ms startup time: near-instant initialization
- ~50MB memory usage: efficient resource consumption
- 5.4 MB binary: single executable, no dependencies
### 🤖 Intelligence
- Sitemap discovery: Automatically finds and parses sitemap.xml (3 standard locations; see the sketch after this list)
- robots.txt compliance: Respects crawling rules with per-domain caching
- Smart filtering: Auto-excludes images, CSS, JS, PDFs by default
- Domain auto-detection: Extracts and restricts to the base domain automatically
### 🔒 Safety
- Rate limiting: Token bucket algorithm, default 2 req/s (see the sketch after this list)
- Configurable timeout: 30-second default
- Memory safe: Rust's ownership system prevents memory-safety crashes
- Graceful shutdown: 2-second grace period for pending requests
### 📊 Output
- Multiple formats: JSON, Markdown, HTML, CSV, Links, Text
- LLM-ready Markdown: Clean content with YAML frontmatter
- Interactive HTML report: Dashboard with graph visualization
- Stealth mode: User-agent rotation and realistic headers
- Content filtering: Remove nav, ads, scripts for clean data
- Real-time progress: Updates every 5 seconds during the crawl
### 📝 Monitoring
- Structured logging: `tracing` with timestamps and log levels
- Progress tracking: `[Progress] Pages: X/Y | Active jobs: Z | Errors: N`
- Detailed statistics: pages found, pages crawled, external links, errors, duration
## Installation & Setup

### Binary Location

```
~/.claude/skills/web-crawler/bin/rcrawler
```
### Build from Source

```bash
# Clone the repository
git clone https://github.com/leobrival/rcrawler.git
cd rcrawler

# Build release binary
cargo build --release

# Copy to skill directory
cp target/release/rcrawler ~/.claude/skills/web-crawler/bin/
```
Build time: ~2 minutes. Binary size: 5.4 MB.
## Command Line Interface

### Basic Syntax

```bash
~/.claude/skills/web-crawler/bin/rcrawler <URL> [OPTIONS]
```
### Options

**Core Options:**

- `-w, --workers <N>`: Number of concurrent workers (default: 20, range: 1-50)
- `-d, --depth <N>`: Maximum crawl depth (default: 2)
- `-r, --rate <N>`: Rate limit in requests/second (default: 2.0)
**Configuration:**

- `-p, --profile <NAME>`: Use a predefined profile (fast/deep/gentle)
- `--domain <DOMAIN>`: Restrict to a specific domain (auto-detected from URL)
- `-o, --output <PATH>`: Custom output directory (default: ./output)
**Features:**

- `-s, --sitemap`: Enable/disable sitemap discovery (default: true)
- `--stealth`: Enable stealth mode with user-agent rotation
- `--markdown`: Convert HTML to LLM-ready Markdown with frontmatter
- `--filter-content`: Enable content filtering (remove nav, ads, scripts)
- `--debug`: Enable debug logging with detailed trace information
- `--resume`: Resume from checkpoint if available
**Output:**

- `-f, --formats <LIST>`: Output formats (json,markdown,html,csv,links,text)
## Profiles

### Fast Profile (Quick Mapping)

```bash
~/.claude/skills/web-crawler/bin/rcrawler <URL> -p fast
```

- Workers: 50
- Depth: 3
- Rate: 10 req/s
- Use case: Quick site structure overview
### Deep Profile (Comprehensive Crawl)

```bash
~/.claude/skills/web-crawler/bin/rcrawler <URL> -p deep
```

- Workers: 20
- Depth: 10
- Rate: 3 req/s
- Use case: Complete site analysis
### Gentle Profile (Server-Friendly)

```bash
~/.claude/skills/web-crawler/bin/rcrawler <URL> -p gentle
```

- Workers: 5
- Depth: 5
- Rate: 1 req/s
- Use case: Respecting server resources
## Usage Examples

### Example 1: Basic Crawl

```bash
~/.claude/skills/web-crawler/bin/rcrawler https://example.com
```

Output:

```
[2026-01-10T01:17:27Z] INFO Starting crawl of: https://example.com
[2026-01-10T01:17:27Z] INFO Config: 20 workers, depth 2
Fetching sitemap URLs...
[Progress] Pages: 50/120 | Active jobs: 15 | Errors: 0
[Progress] Pages: 100/180 | Active jobs: 8 | Errors: 0

Crawl complete!
Pages crawled: 150
Duration: 8542ms
Results saved to: ./output/results.json
HTML report: ./output/index.html
```
### Example 2: Stealth Mode with Markdown Export

```bash
~/.claude/skills/web-crawler/bin/rcrawler https://docs.example.com \
  --stealth --markdown -f markdown -d 3
```

**Use case:** Content extraction for LLM/RAG pipelines
**Expected:** Clean Markdown with frontmatter, anti-detection headers
### Example 3: Fast Scan

```bash
~/.claude/skills/web-crawler/bin/rcrawler https://blog.example.com -p fast
```

**Use case:** Quick blog mapping
**Expected:** 50 workers, depth 3, ~3-5 seconds for 100 pages
### Example 4: Multi-Format Export

```bash
~/.claude/skills/web-crawler/bin/rcrawler https://example.com \
  -f json,markdown,csv,links -o ./export
```

**Use case:** Export data in multiple formats simultaneously
**Expected:** Generates results.json, results.md, results.csv, results.txt
### Example 5: Debug Mode

```bash
~/.claude/skills/web-crawler/bin/rcrawler https://example.com --debug
```

**Output:** Detailed trace logs for troubleshooting
## Output Format

### Directory Structure

```
./output/
├── results.json      # Structured crawl data
├── results.md        # LLM-ready Markdown (with --markdown)
├── results.html      # Interactive report
├── results.csv       # Spreadsheet format (with -f csv)
├── results.txt       # URL list (with -f links)
└── checkpoint.json   # Auto-saved state (every 30s)
```
### JSON Structure (results.json)

```json
{
  "stats": {
    "pages_found": 450,
    "pages_crawled": 450,
    "external_links": 23,
    "excluded_links": 89,
    "errors": 0,
    "start_time": "2026-01-10T01:00:00Z",
    "end_time": "2026-01-10T01:00:07Z",
    "duration": 7512
  },
  "results": [
    {
      "url": "https://example.com",
      "title": "Example Domain",
      "status_code": 200,
      "depth": 0,
      "links": ["https://example.com/page1", "..."],
      "crawled_at": "2026-01-10T01:00:01Z",
      "content_type": "text/html"
    }
  ]
}
```
### HTML Report Features

- Interactive dashboard with key statistics
- Graph visualization using the force-graph library
- Node sizing based on link count (logarithmic scale)
- Status color coding: green (success), red (errors)
- Hover tooltips: in-degree and out-degree information
- Click to navigate: opens the page URL in a new tab
- Light/dark mode: auto-detection via CSS
- Collapsible sections: reduce scrolling for large crawls
- Mobile responsive: works on all devices
## Implementation Workflow
When a user requests a crawl, follow these steps:
### 1. Parse Request

Extract from the user message:

- URL (required): Target website
- Workers (optional): Number of concurrent workers
- Depth (optional): Maximum crawl depth
- Rate (optional): Requests per second
- Profile (optional): fast/deep/gentle
### 2. Validate Input

- Check URL format (add https:// if missing)
- Validate the workers range (1-50)
- Validate depth (1-10 recommended)
- Validate rate (0.1-20.0 recommended)
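A minimal sketch of these checks, for illustration only; the `validate` helper is hypothetical, and rcrawler performs its own argument validation:

```rust
/// Normalize and sanity-check crawl parameters before building the command.
/// Hypothetical helper; not part of rcrawler.
fn validate(url: &str, workers: u32, depth: u32, rate: f64) -> Result<String, String> {
    let url = if url.starts_with("http://") || url.starts_with("https://") {
        url.to_string()
    } else {
        format!("https://{url}") // add https:// if the scheme is missing
    };
    if !(1..=50).contains(&workers) {
        return Err(format!("workers must be 1-50, got {workers}"));
    }
    if !(1..=10).contains(&depth) {
        return Err(format!("depth of 1-10 recommended, got {depth}"));
    }
    if !(0.1..=20.0).contains(&rate) {
        return Err(format!("rate of 0.1-20.0 recommended, got {rate}"));
    }
    Ok(url)
}
```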
### 3. Build Command

```bash
~/.claude/skills/web-crawler/bin/rcrawler <URL> \
  -w <workers> \
  -d <depth> \
  -r <rate> \
  [--debug] \
  [-o <output>]
```
### 4. Execute Crawl

Use the Bash tool to run the command:

```bash
~/.claude/skills/web-crawler/bin/rcrawler https://example.com -w 20 -d 2
```
### 5. Monitor Progress

Watch for progress updates in the output:

- `[Progress] Pages: X/Y | Active jobs: Z | Errors: N`
- Updates appear every 5 seconds
- Shows real-time crawl status
### 6. Report Results

When the crawl completes, inform the user of:

- Number of pages crawled
- Duration in seconds or minutes
- Path to results: `./output/results.json`
- Path to HTML report: `./output/index.html`

Then offer to open the HTML report: `open ./output/index.html`
## Natural Language Parsing

### Example User Requests
Request: "Crawl docs.example.com" Parse: URL = https://docs.example.com, use defaults Command: rcrawler https://docs.example.com
Request: "Quick scan of blog.example.com" Parse: URL = blog.example.com, profile = fast Command: rcrawler https://blog.example.com -p fast
Request: "Deep crawl of api-docs.example.com with 40 workers" Parse: URL = api-docs.example.com, workers = 40, depth = deep Command: rcrawler https://api-docs.example.com -w 40 -d 5
Request: "Crawl example.com carefully, don't overload their server" Parse: URL = example.com, profile = gentle Command: rcrawler https://example.com -p gentle
Request: "Map the structure of help.example.com" Parse: URL = help.example.com, depth = moderate Command: rcrawler https://help.example.com -d 3
## Error Handling

### Binary Not Found

```bash
# Check if the binary exists
ls ~/.claude/skills/web-crawler/bin/rcrawler

# If missing, build it
cd ~/.claude/skills/web-crawler/scripts && cargo build --release
```
### Crawl Failures

**Network errors:**

- Verify the URL is accessible: `curl -I <URL>`
- Check whether the site is down or blocking crawlers
- Try a lower rate: `-r 1`
**robots.txt blocking:**

- The crawler respects robots.txt by default
- Check the rules: `curl <URL>/robots.txt`
- Inform the user of the restrictions
**Timeout errors:**

- Increase the timeout in code (default 30s)
- Reduce workers: `-w 10`
- Lower the rate limit: `-r 1`
**Too many errors:**

- Enable debug mode: `--debug`
- Check the specific failing URLs
- You may need to exclude certain patterns
## Performance Benchmarks

### Test: adonisjs.com

- Pages: 450
- Duration: 7.5 seconds
- Throughput: 60 pages/sec
- Workers: 20
- Depth: 2
### Test: rust-lang.org

- Pages: 16
- Duration: 3.9 seconds
- Workers: 10
- Depth: 1
### Test: example.com

- Pages: 2
- Duration: 2.7 seconds
- Workers: 5
- Depth: 1
## Technical Architecture

### Core Components

**CrawlEngine** (`src/crawler/engine.rs`)

- Worker pool management
- Job queue coordination
- Shutdown signaling
- Statistics tracking
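One plausible shape for shutdown signaling with a bounded grace period, sketched with tokio primitives; this is our illustration, not the engine's actual code:

```rust
use std::time::Duration;

use tokio::sync::broadcast;
use tokio::time::timeout;

#[tokio::main]
async fn main() {
    // A broadcast channel lets every worker observe a single shutdown signal.
    let (shutdown_tx, _) = broadcast::channel::<()>(1);

    let mut worker_rx = shutdown_tx.subscribe();
    let worker = tokio::spawn(async move {
        loop {
            tokio::select! {
                _ = worker_rx.recv() => break, // stop pulling new jobs
                _ = tokio::time::sleep(Duration::from_millis(100)) => {
                    // ... fetch and parse the next page ...
                }
            }
        }
    });

    // Signal shutdown, then give pending work a 2-second grace period.
    let _ = shutdown_tx.send(());
    if timeout(Duration::from_secs(2), worker).await.is_err() {
        eprintln!("grace period elapsed; abandoning pending requests");
    }
}
```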
**RobotsChecker** (`src/crawler/robots.rs`)

- Per-domain caching
- Rule validation
- Fallback on errors
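Combining two dependencies listed below (robotstxt and dashmap), per-domain caching with rule validation might look roughly like this. The real RobotsChecker's internals are not documented here, so treat the types and the allow-on-miss fallback as assumptions:

```rust
use dashmap::DashMap;
use robotstxt::DefaultMatcher;

/// Per-domain cache of robots.txt bodies. Illustrative sketch only.
struct RobotsCache {
    bodies: DashMap<String, String>, // domain -> raw robots.txt body
}

impl RobotsCache {
    fn is_allowed(&self, domain: &str, user_agent: &str, url: &str) -> bool {
        match self.bodies.get(domain) {
            Some(body) => {
                let mut matcher = DefaultMatcher::default();
                matcher.one_agent_allowed_by_robots(body.as_str(), user_agent, url)
            }
            // Fallback on errors: with no cached rules (e.g. the fetch failed),
            // this sketch allows the URL.
            None => true,
        }
    }
}
```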
**RateLimiter** (`src/crawler/rate_limiter.rs`)

- Token bucket algorithm
- Configurable rate
- Shared across workers
**UrlFilter** (`src/utils/filters.rs`)

- Regex-based filtering
- Include/exclude patterns
- Default exclusions
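A small sketch of regex-based exclusion using the regex crate; the exact default pattern is an assumption, since the documentation only says images, CSS, JS, and PDFs are excluded:

```rust
use regex::Regex;

fn main() {
    // Assumed default exclusions: asset extensions, optionally with a query string.
    let exclude = Regex::new(r"(?i)\.(png|jpe?g|gif|svg|css|js|pdf)(\?.*)?$").unwrap();

    for url in [
        "https://example.com/page1",
        "https://example.com/logo.png",
        "https://example.com/app.js?v=3",
    ] {
        let verdict = if exclude.is_match(url) { "skip" } else { "crawl" };
        println!("{url} -> {verdict}");
    }
}
```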
**HtmlParser** (`src/parser/html.rs`)

- CSS selector queries
- Title extraction
- Link discovery
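With the scraper crate, title extraction and link discovery reduce to two CSS selector queries. A self-contained sketch, not rcrawler's actual parser:

```rust
use scraper::{Html, Selector};

fn main() {
    let html = r#"<html><head><title>Example</title></head>
                  <body><a href="/page1">One</a><a href="/page2">Two</a></body></html>"#;
    let doc = Html::parse_document(html);

    // Title extraction via a CSS selector query.
    let title_sel = Selector::parse("title").unwrap();
    let title: Option<String> = doc.select(&title_sel).next().map(|t| t.text().collect());
    println!("title: {title:?}");

    // Link discovery: every anchor with an href attribute.
    let link_sel = Selector::parse("a[href]").unwrap();
    for a in doc.select(&link_sel) {
        if let Some(href) = a.value().attr("href") {
            println!("link: {href}");
        }
    }
}
```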
**SitemapParser** (`src/parser/sitemap.rs`)

- XML parsing
- Index traversal
- URL extraction
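URL extraction with quick-xml amounts to collecting the text inside `<loc>` elements. A rough sketch; index traversal (recursing into child sitemaps listed by a sitemap index) is omitted:

```rust
use quick_xml::events::Event;
use quick_xml::Reader;

/// Collect the <loc> URLs from a sitemap document. Minimal sketch only.
fn extract_sitemap_urls(xml: &str) -> Vec<String> {
    let mut reader = Reader::from_str(xml);
    let mut urls = Vec::new();
    let mut in_loc = false;
    loop {
        match reader.read_event() {
            Ok(Event::Start(e)) if e.name().as_ref() == b"loc" => in_loc = true,
            Ok(Event::Text(t)) if in_loc => {
                if let Ok(text) = t.unescape() {
                    urls.push(text.into_owned());
                }
            }
            Ok(Event::End(e)) if e.name().as_ref() == b"loc" => in_loc = false,
            Ok(Event::Eof) | Err(_) => break,
            _ => {}
        }
    }
    urls
}
```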
### Key Dependencies

- `tokio`: Async runtime (multi-threaded)
- `reqwest`: HTTP client (connection pooling)
- `scraper`: HTML parsing (CSS selectors)
- `quick-xml`: Sitemap parsing
- `governor`: Rate limiting (token bucket)
- `tracing`: Structured logging
- `dashmap`: Concurrent HashMap
- `robotstxt`: robots.txt compliance
- `clap`: CLI argument parsing
- `serde`/`serde_json`: Serialization
## Tips & Best Practices

### 1. Start with Default Settings

The first crawl should use defaults so you can understand the site structure.

### 2. Use Profiles for Common Scenarios

- fast: Quick overviews
- deep: Comprehensive analysis
- gentle: Respectful crawling

### 3. Monitor Progress

Watch the `[Progress]` lines to ensure the crawl is progressing.

### 4. Check the HTML Report

The interactive visualization conveys site structure better than raw JSON.

### 5. Respect Rate Limits

The default of 2 req/s is safe for most sites. Increase it cautiously.

### 6. Enable Debug for Issues

The `--debug` flag provides detailed logs for troubleshooting.

### 7. Review robots.txt

Check `<URL>/robots.txt` to understand crawling restrictions.

### 8. Use a Custom Output Directory for Multiple Crawls

Avoid overwriting results by passing the `-o` flag.
## Future Enhancements (V2.0)

- Checkpoint resume: Full integration of the checkpoint system
- Per-domain rate limiting: Different rates for different domains
- JavaScript rendering: chromiumoxide for dynamic sites
- Distributed crawling: Redis-based job queue
- Advanced analytics: SEO analysis, link quality scoring
## Support & Resources

- GitHub Repository: leobrival/rcrawler
- Binary: `~/.claude/skills/web-crawler/bin/rcrawler`
- Skill Documentation: This file (SKILL.md)
- Quick Start: README.md (this repository)
- Development Guide: DEVELOPMENT.md
**Version:** 1.0.0
**Status:** Production Ready