Crawl4AI Web Crawler
Crawl4AI is an open-source, LLM-friendly web crawler on GitHub that converts web pages into clean Markdown or structured JSON, ideal for RAG, AI Agents, and data pipelines.
For detailed API parameters, see references/api-reference.md.
Trigger Words
"scrape," "crawl," "crawl," "extract webpage," "convert webpage to markdown," "structured extraction," etc.
Installation
pip install -U crawl4ai
crawl4ai-setup # Automatically installs the Playwright browser
crawl4ai-doctor # Verifies the installation
If the browser installation fails, run manually:
python -m playwright install --with-deps chromium
Core Architecture
Three core classes:
| Class | Purpose |
|---|---|
| AsyncWebCrawler | Main async crawler class; manages the browser lifecycle |
| BrowserConfig | Browser settings (headless, UA, proxy, viewport, etc.) |
| CrawlerRunConfig | Per-crawl settings (cache, extraction strategy, JS, screenshots, etc.) |
Basic Usage
Simplest Crawl
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # LLM-ready Markdown

asyncio.run(main())
Crawl with Configuration
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_cfg = BrowserConfig(headless=True, verbose=True)
run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,   # BYPASS = skip cache; also ENABLED, WRITE_ONLY, READ_ONLY
    css_selector="main.article",   # Only extract the specified area
    word_count_threshold=10,       # Filter out short text blocks
    screenshot=True,               # Take a screenshot
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun(url="https://example.com", config=run_cfg)
    print(result.markdown)
    if result.screenshot:
        print(f"Screenshot: {len(result.screenshot)} base64 chars")
Command Line Tool
# Basic crawl
crwl https://example.com -o markdown
# Deep crawl (BFS, up to 10 pages)
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10
# LLM extraction
crwl https://example.com/products -q "Extract all product prices"
Markdown Generation
Using Content Filters
Raw Markdown is generated by default. Use DefaultMarkdownGenerator + content filters to get cleaner output:
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Method 1: PruningContentFilter (density-based pruning)
md_gen = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(
        threshold=0.48,           # 0-1; higher values prune more aggressively
        threshold_type="fixed",   # "fixed" or "dynamic"
        min_word_threshold=0
    )
)

# Method 2: BM25ContentFilter (query-relevance-based filtering)
md_gen = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(
        user_query="machine learning",   # Keywords to focus on
        bm25_threshold=1.0
    )
)

run_cfg = CrawlerRunConfig(markdown_generator=md_gen)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="...", config=run_cfg)
    print(len(result.markdown.raw_markdown))   # Raw MD
    print(len(result.markdown.fit_markdown))   # Filtered MD
Structured Data Extraction
CSS/XPath Extraction (No LLM Required, Fast and Free)
from crawl4ai import JsonCssExtractionStrategy
import json

schema = {
    "name": "Articles",
    "baseSelector": "article.post",   # Container for repeating elements
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ]
}

run_cfg = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema)
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com/blog", config=run_cfg)
    data = json.loads(result.extracted_content)
    print(data)  # [{"title": "...", "url": "...", "image": "..."}, ...]
Auto-Generate Schema (one-time LLM cost, then reuse for free):
from crawl4ai import LLMConfig

schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'>...",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-key")
    # Or use a local model: LLMConfig(provider="ollama/llama3.3", api_token=None)
)
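A common follow-up is to persist the generated schema and feed it back into JsonCssExtractionStrategy, so later crawls skip the LLM entirely. A sketch of that pattern; sample_html and llm_config stand in for the HTML snippet and LLMConfig shown above, and the cache filename is arbitrary:

import json
from pathlib import Path
from crawl4ai import CrawlerRunConfig, JsonCssExtractionStrategy

SCHEMA_PATH = Path("article_schema.json")   # hypothetical cache file

if SCHEMA_PATH.exists():
    schema = json.loads(SCHEMA_PATH.read_text())           # Free: reuse the cached schema
else:
    schema = JsonCssExtractionStrategy.generate_schema(    # One-time LLM cost
        html=sample_html, llm_config=llm_config            # Placeholders, defined as above
    )
    SCHEMA_PATH.write_text(json.dumps(schema, indent=2))

run_cfg = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))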
LLM Extraction (Suitable for Unstructured Content)
import json
from pydantic import BaseModel, Field
from crawl4ai import LLMExtractionStrategy, LLMConfig

class Product(BaseModel):
    name: str = Field(..., description="Product name")
    price: str = Field(..., description="Price as string")
    description: str = Field(..., description="Short description")

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",   # Also supports ollama/llama3, anthropic/claude-3, etc.
        api_token="your-api-key"
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",        # "schema" or "block"
    instruction="Extract all product objects with name, price, and description.",
    chunk_token_threshold=1000,      # Auto-chunk when exceeding this token count
    overlap_rate=0.1,                # 10% overlap between chunks
    apply_chunking=True,
    input_format="markdown",         # "markdown" | "html" | "fit_markdown"
    extra_args={"temperature": 0.0, "max_tokens": 800}
)

run_cfg = CrawlerRunConfig(extraction_strategy=llm_strategy)
async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com/products", config=run_cfg)
    data = json.loads(result.extracted_content)
    llm_strategy.show_usage()  # Print token usage statistics
Extraction Strategy Selection Guide
| Scenario | Strategy |
|---|---|
| Repeating lists (products, articles, search results) | JsonCssExtractionStrategy |
| Unstructured text requiring AI understanding | LLMExtractionStrategy |
| High-frequency crawling of the same site | Generate Schema with LLM first, then extract via CSS |
Dynamic Page Handling
run_cfg = CrawlerRunConfig(
    js_code=[  # JS executed on the page
        "window.scrollTo(0, document.body.scrollHeight)",
        "await new Promise(r => setTimeout(r, 2000))",
    ],
    wait_for="css:.content-loaded",    # Wait for a specific element to appear
    delay_before_return_html=2.0,      # Additional wait in seconds before returning
)
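Besides css: selectors, wait_for also accepts a js: prefix with a JavaScript expression that must return true. A sketch of a full run using such a condition; the URL and the .item selector are placeholders:

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

run_cfg = CrawlerRunConfig(
    js_code=["window.scrollTo(0, document.body.scrollHeight)"],
    # Wait until the page has rendered at least 30 items (placeholder selector)
    wait_for="js:() => document.querySelectorAll('.item').length >= 30",
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com/infinite-scroll", config=run_cfg)
    print(len(result.markdown))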
Batch Crawling
urls = ["https://example.com/page1", "https://example.com/page2", ...]
async with AsyncWebCrawler() as crawler:
results = await crawler.arun_many(urls=urls, config=run_cfg)
for result in results:
if result.success:
print(result.markdown[:200])
arun_many() automatically handles rate limiting, memory monitoring, and concurrency control.
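If the defaults need tuning, arun_many() also accepts an explicit dispatcher. A sketch assuming the MemoryAdaptiveDispatcher and RateLimiter classes documented for recent releases; the import path and parameter names may vary by version:

from crawl4ai import AsyncWebCrawler
from crawl4ai.async_dispatcher import MemoryAdaptiveDispatcher, RateLimiter

dispatcher = MemoryAdaptiveDispatcher(
    memory_threshold_percent=70.0,                    # Hold new tasks above this memory usage
    max_session_permit=10,                            # Cap on concurrent crawls
    rate_limiter=RateLimiter(base_delay=(1.0, 2.0)),  # Random per-request delay range
)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(urls=urls, config=run_cfg, dispatcher=dispatcher)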
Browser Management
browser_cfg = BrowserConfig(
    browser_type="chromium",          # "chromium" | "firefox" | "webkit"
    headless=True,
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 ...",
    proxy="http://user:pass@proxy:8080",
    use_managed_browser=True,         # Launch a persistent, managed browser (pairs with user_data_dir)
    user_data_dir="/path/to/profile", # Persistent profile (to retain login state)
)
Deep Crawl (Site-Level Crawling)
from crawl4ai import BFSDeepCrawlStrategy

deep_crawl = BFSDeepCrawlStrategy(
    max_depth=3,                   # Maximum depth
    max_pages=50,                  # Maximum number of pages
    include_paths=["/docs/*"],     # Only crawl specified paths
    exclude_paths=["/blog/*"],     # Exclude specified paths
)

run_cfg = CrawlerRunConfig(deep_crawl_strategy=deep_crawl)
async with AsyncWebCrawler() as crawler:
    results = await crawler.arun(url="https://example.com", config=run_cfg)
    for r in results:
        print(f"{r.url} → {len(r.markdown)} chars")
Docker Deployment
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest
# Dashboard: http://localhost:11235/dashboard
# Playground: http://localhost:11235/playground
Python Client:
import requests

resp = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"], "priority": 10}
)
task_id = resp.json()["task_id"]
result = requests.get(f"http://localhost:11235/task/{task_id}")
print(result.json())
CrawlResult Key Fields
result.url # Final URL (after any redirects)
result.html # Raw HTML
result.cleaned_html # Cleaned HTML
result.markdown # Markdown formatted output (contains raw_markdown and fit_markdown)
result.extracted_content # JSON string returned by the extraction strategy
result.screenshot # Base64 screenshot
result.media # Image/video information
result.links # Internal and external link information
result.success # Whether the crawl was successful
result.error_message # Error message
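A typical way to consume these fields after a crawl (a sketch; it assumes links and media are dicts of lists, with each link entry carrying at least an href key):

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
    if not result.success:
        print(f"Crawl failed: {result.error_message}")
    else:
        for link in result.links.get("internal", []):
            print(link["href"])          # Internal link URL
        for img in result.media.get("images", []):
            print(img.get("src"))        # Image URL (may be missing)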
FAQ
Playwright browser not installed:
python -m playwright install --with-deps chromium
Cache issues causing stale data to be returned:
Set cache_mode=CacheMode.BYPASS to skip the cache.
Dynamic content not loading:
Use wait_for="css:selector" to wait for the target element, or js_code to execute scrolling.
Out of memory (batch crawling):
Reduce the concurrency level; arun_many() automatically monitors memory and adapts.
Anti-bot / detection:
Enable use_managed_browser=True in BrowserConfig or configure a proxy.