crawl4ai-web-crawler

Use Crawl4AI for web scraping and content extraction. Use when users need to scrape web content, extract structured data, convert web pages to Markdown, perform batch crawling, or use AI-driven web data collection.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install the skill "crawl4ai-web-crawler" with this command:

npx skills add openlark/crawl4ai-web-crawler

Crawl4AI Web Crawler

Crawl4AI is an open-source, LLM-friendly web crawler on GitHub that converts web pages into clean Markdown or structured JSON, ideal for RAG, AI Agents, and data pipelines.

For detailed API parameters, see references/api-reference.md.

Trigger Words

"scrape," "crawl," "crawl," "extract webpage," "convert webpage to markdown," "structured extraction," etc.

Installation

pip install -U crawl4ai
crawl4ai-setup          # Automatically installs the Playwright browser
crawl4ai-doctor         # Verifies the installation

If the browser installation fails, run manually:

python -m playwright install --with-deps chromium

Core Architecture

Three core classes:

Class | Purpose
AsyncWebCrawler | Main async crawler class; manages the browser lifecycle
BrowserConfig | Browser settings (headless, UA, proxy, viewport, etc.)
CrawlerRunConfig | Per-crawl settings (cache, extraction strategy, JS, screenshots, etc.)

Basic Usage

Simplest Crawl

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # LLM-ready Markdown

asyncio.run(main())

Crawl with Configuration

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_cfg = BrowserConfig(headless=True, verbose=True)
run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,     # BYPASS=no cache, ENABLED=enable, WRITE_ONLY, READ_ONLY
    css_selector="main.article",     # Only extract the specified area
    word_count_threshold=10,         # Filter out short text blocks
    screenshot=True,                 # Take a screenshot
)

async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun(url="https://example.com", config=run_cfg)
    print(result.markdown)
    if result.screenshot:
        print(f"Screenshot: {len(result.screenshot)} bytes base64")

Command Line Tool

# Basic crawl
crwl https://example.com -o markdown

# Deep crawl (BFS, up to 10 pages)
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# LLM extraction
crwl https://example.com/products -q "Extract all product prices"

Markdown Generation

Using Content Filters

Raw Markdown is generated by default. Use DefaultMarkdownGenerator + content filters to get cleaner output:

from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Method 1: PruningContentFilter — density-based pruning
md_gen = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(
        threshold=0.48,           # 0-1; higher values prune more aggressively
        threshold_type="fixed",   # "fixed" or "dynamic"
        min_word_threshold=0
    )
)

# Method 2: BM25ContentFilter — query relevance-based filtering
md_gen = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(
        user_query="machine learning",  # Keywords to focus on
        bm25_threshold=1.0
    )
)

run_cfg = CrawlerRunConfig(markdown_generator=md_gen)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="...", config=run_cfg)
    print(len(result.markdown.raw_markdown))   # Raw MD
    print(len(result.markdown.fit_markdown))   # Filtered MD

Structured Data Extraction

CSS/XPath Extraction (No LLM Required, Fast and Free)

from crawl4ai import JsonCssExtractionStrategy
import json

schema = {
    "name": "Articles",
    "baseSelector": "article.post",     # Container for repeating elements
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ]
}

run_cfg = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema)
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com/blog", config=run_cfg)
    data = json.loads(result.extracted_content)
    print(data)  # [{"title": "...", "url": "...", "image": "..."}, ...]

Auto-Generate Schema (one-time LLM cost, then reuse for free):

from crawl4ai import LLMConfig

schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'>...",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-key")
    # Or use a local model: LLMConfig(provider="ollama/llama3.3", api_token=None)
)

LLM Extraction (Suitable for Unstructured Content)

from pydantic import BaseModel, Field
from crawl4ai import LLMExtractionStrategy, LLMConfig

class Product(BaseModel):
    name: str = Field(..., description="Product name")
    price: str = Field(..., description="Price as string")
    description: str = Field(..., description="Short description")

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",     # Also supports ollama/llama3, anthropic/claude-3, etc.
        api_token="your-api-key"
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",              # "schema" or "block"
    instruction="Extract all product objects with name, price, and description.",
    chunk_token_threshold=1000,            # Auto-chunk when exceeding this token count
    overlap_rate=0.1,                      # 10% overlap between chunks
    apply_chunking=True,
    input_format="markdown",               # "markdown" | "html" | "fit_markdown"
    extra_args={"temperature": 0.0, "max_tokens": 800}
)

run_cfg = CrawlerRunConfig(extraction_strategy=llm_strategy)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com/products", config=run_cfg)
    data = json.loads(result.extracted_content)
    llm_strategy.show_usage()  # Print token usage statistics

Extraction Strategy Selection Guide

Scenario | Strategy
Repeating lists (products, articles, search results) | JsonCssExtractionStrategy
Unstructured text requiring AI understanding | LLMExtractionStrategy
High-frequency crawling of the same site | Generate the schema with an LLM once, then extract via CSS (see the sketch below)
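
The third row describes a hybrid workflow: pay for one LLM call to generate a schema, cache it, then reuse it with the free CSS strategy on every later crawl. A minimal sketch of that pattern using the APIs shown above (the cache file name and function are illustrative):

import json
from pathlib import Path
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy, LLMConfig

SCHEMA_FILE = Path("article_schema.json")          # hypothetical local cache for the schema

async def extract_articles(sample_html: str, urls: list[str]):
    if SCHEMA_FILE.exists():
        schema = json.loads(SCHEMA_FILE.read_text())   # reuse the cached schema (no LLM cost)
    else:
        # One-time LLM call to infer the CSS schema from a sample of the page HTML
        schema = JsonCssExtractionStrategy.generate_schema(
            html=sample_html,
            llm_config=LLMConfig(provider="openai/gpt-4o-mini", api_token="your-key"),
        )
        SCHEMA_FILE.write_text(json.dumps(schema))

    run_cfg = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(urls=urls, config=run_cfg)
        return [json.loads(r.extracted_content) for r in results if r.success]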

Dynamic Page Handling

run_cfg = CrawlerRunConfig(
    js_code=[                          # JS executed on the page
        "window.scrollTo(0, document.body.scrollHeight)",
        "await new Promise(r => setTimeout(r, 2000))",
    ],
    wait_for="css:.content-loaded",     # Wait for a specific element to appear
    delay_before_return_html=2.0,       # Additional wait in seconds before returning
)

Batch Crawling

urls = ["https://example.com/page1", "https://example.com/page2", ...]

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(urls=urls, config=run_cfg)
    for result in results:
        if result.success:
            print(result.markdown[:200])

arun_many() automatically handles rate limiting, memory monitoring, and concurrency control.
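
When results should be processed as each page finishes rather than after the whole batch completes, recent versions also support streaming from arun_many() via stream=True (a sketch; verify the option against your installed version):

run_cfg = CrawlerRunConfig(cache_mode=CacheMode.BYPASS, stream=True)

async with AsyncWebCrawler() as crawler:
    # With stream=True, results are yielded as each URL completes
    async for result in await crawler.arun_many(urls=urls, config=run_cfg):
        if result.success:
            print(result.url, len(result.markdown))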

Browser Management

browser_cfg = BrowserConfig(
    browser_type="chromium",       # "chromium" | "firefox" | "webkit"
    headless=True,
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 ...",
    proxy="http://user:pass@proxy:8080",
    use_managed_browser=True,      # Managed persistent browser (more human-like; helps keep logins and avoid detection)
    user_data_dir="/path/to/profile",  # Persistent profile (to retain login state)
)

Deep Crawl (Site-Level Crawling)

from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

deep_crawl = BFSDeepCrawlStrategy(
    max_depth=3,                    # Maximum link depth from the start URL
    max_pages=50,                   # Maximum number of pages to crawl
    filter_chain=FilterChain([      # Only follow URLs matching these patterns
        URLPatternFilter(patterns=["*/docs/*"]),
        # Additional filters can be chained to exclude paths or domains
    ]),
)

run_cfg = CrawlerRunConfig(deep_crawl_strategy=deep_crawl)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun(url="https://example.com", config=run_cfg)
    for r in results:
        print(f"{r.url} → {len(r.markdown)} chars")

Docker Deployment

docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# Dashboard: http://localhost:11235/dashboard
# Playground: http://localhost:11235/playground

Python Client:

import requests

resp = requests.post("http://localhost:11235/crawl",
    json={"urls": ["https://example.com"], "priority": 10})

task_id = resp.json()["task_id"]
result = requests.get(f"http://localhost:11235/task/{task_id}")
print(result.json())

CrawlResult Key Fields

result.url              # Final URL (after any redirects)
result.html             # Raw HTML
result.cleaned_html     # Cleaned HTML
result.markdown         # Markdown formatted output (contains raw_markdown and fit_markdown)
result.extracted_content # JSON string returned by the extraction strategy
result.screenshot       # Base64 screenshot
result.media            # Image/video information
result.links            # Internal and external link information
result.success          # Whether the crawl was successful
result.error_message    # Error message
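
For instance, the links and media fields can be inspected after a crawl; the dictionary layout below follows the documentation, though exact keys may differ between versions:

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com")
    if result.success:
        internal = result.links.get("internal", [])    # list of dicts (href, text, ...)
        external = result.links.get("external", [])
        images = result.media.get("images", [])        # list of dicts (src, alt, score, ...)
        print(f"{len(internal)} internal links, {len(external)} external links, {len(images)} images")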

FAQ

Playwright browser not installed:

python -m playwright install --with-deps chromium

Cache issues causing stale data to be returned: Set cache_mode=CacheMode.BYPASS to skip the cache.

Dynamic content not loading: Use wait_for="css:selector" to wait for the target element, or js_code to execute scrolling.

Out of memory (batch crawling): Reduce the concurrency level; arun_many() automatically monitors memory and adapts.

Anti-bot / detection: Enable use_managed_browser=True in BrowserConfig or configure a proxy.

