crawl4ai-docker-skill

Dockerized web crawling and scraping service with a REST API. Intelligent content extraction with LLM-optimized output. Deployed with Docker and called over REST.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "crawl4ai-docker-skill" with this command: npx skills add orange-afk/crawl4ai-docker-skill

Crawl4AI Docker Skill - Web Crawler & Scraper Service

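Dockerized Web Crawling Service | REST API Web Scraping | LLM Smart Extraction

A Crawl4AI web crawling service deployed with Docker. It exposes a full REST API and supports intelligent content extraction with LLM-optimized output.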

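🚀 Core Features

  • 🐳 Docker deployment - containerized service on port 11235
  • 🔌 REST API - full HTTP interface
  • 🤖 LLM smart extraction - supports multiple LLM providers
  • 📊 Real-time monitoring - built-in monitoring dashboard and API
  • High performance - asynchronous processing with support for concurrent requests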

📋 Quick Start

Prerequisites

Make sure the Docker Compose service is running:

# Check service status
docker compose ps

# Health check
curl http://localhost:11235/health

# Open the monitoring dashboard (open is macOS-only; use xdg-open on Linux)
open http://localhost:11235/dashboard
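
A container can take a few seconds to start answering after `docker compose up`. The following is a minimal sketch of a readiness loop built on the health check above. `BASE_URL`, `wait_for_crawl4ai`, and the retry defaults are illustrative names and values, not part of the service.

```shell
# Base URL of the service; the port comes from the feature list above.
BASE_URL="${BASE_URL:-http://localhost:11235}"

# Poll /health until it answers or the attempts run out.
# $1 = max attempts (default 30), $2 = seconds between attempts (default 2).
wait_for_crawl4ai() {
  attempts="${1:-30}"
  delay="${2:-2}"
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP errors; -s hides progress output.
    if curl -fs "$BASE_URL/health" > /dev/null; then
      echo "service is up after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "service did not respond at $BASE_URL/health" >&2
  return 1
}
```

A typical use would be `wait_for_crawl4ai 30 2 && echo ready` at the top of a script that goes on to call the API.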

🔌 REST API Usage Guide

Basic Web Crawling

Simple Markdown Extraction

curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "extraction_strategy": "markdown"
  }'

With Browser Configuration

curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "extraction_strategy": "markdown",
    "browser_config": {
      "headless": true,
      "viewport_width": 1280,
      "viewport_height": 720
    }
  }'
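
Since `urls` is an array, a script may want to build the request body from a list of arguments rather than hand-writing JSON. The sketch below assumes the request shape shown in the examples above and assumes the service accepts more than one URL per request, which this document does not confirm. `build_crawl_payload` and `crawl_markdown` are hypothetical helper names.

```shell
# Build the JSON body for POST /crawl from a list of URLs.
# No JSON escaping is done, so URLs must not contain double quotes or backslashes.
build_crawl_payload() {
  urls_json=""
  for url in "$@"; do
    # Append each URL as a quoted string, comma-separated after the first.
    urls_json="${urls_json:+$urls_json, }\"$url\""
  done
  printf '{"urls": [%s], "extraction_strategy": "markdown"}\n' "$urls_json"
}

# Send the payload to the service; -d @- makes curl read the body from stdin.
crawl_markdown() {
  build_crawl_payload "$@" |
    curl -fsS -X POST "${BASE_URL:-http://localhost:11235}/crawl" \
      -H "Content-Type: application/json" \
      -d @-
}
```

For example, `crawl_markdown https://example.com https://example.org` would post both URLs in a single request.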

LLM Smart Extraction

Content Summarization

curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "extraction_strategy": {
      "type": "llm",
      "provider": "openrouter/free",
      "instruction": "Summarize the main content of the page",
      "max_tokens": 1000
    }
  }'

Structured Data Extraction

curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/products"],
    "extraction_strategy": {
      "type": "llm",
      "provider": "openrouter/free",
      "instruction": "Extract all product names, prices, and descriptions, and return them as JSON",
      "max_tokens": 1500,
      "temperature": 0.1
    }
  }'
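
The two LLM requests above differ only in URL, instruction, and token budget, so a small payload builder can keep prompts out of hand-written JSON. This is a sketch under the assumption that the request shape is exactly as shown above. `json_escape` and `build_llm_payload` are hypothetical helpers, and the escaping covers only backslashes and double quotes (not newlines or control characters).

```shell
# Escape backslashes and double quotes so a string can sit inside JSON quotes.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# Build an LLM extraction request body.
# $1 = URL, $2 = instruction, $3 = max_tokens (default 1000).
# The provider falls back to the value used in the examples above.
build_llm_payload() {
  printf '{"urls": ["%s"], "extraction_strategy": {"type": "llm", "provider": "%s", "instruction": "%s", "max_tokens": %s}}\n' \
    "$(json_escape "$1")" \
    "${LLM_PROVIDER:-openrouter/free}" \
    "$(json_escape "$2")" \
    "${3:-1000}"
}
```

The output can be piped to `curl -X POST "http://localhost:11235/crawl" -H "Content-Type: application/json" -d @-` in place of an inline body.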

Advanced Features

Web Page Screenshot

curl -X POST "http://localhost:11235/screenshot" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "options": {
      "full_page": true,
      "quality": 80
    }
  }'

PDF Generation

curl -X POST "http://localhost:11235/pdf" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com"
  }'

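📊 API Endpoints Reference

Core Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| POST /crawl | POST | Web crawling and content extraction |
| GET /health | GET | Service health check |
| GET /dashboard | GET | Monitoring dashboard |

Monitoring Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| GET /monitor/health | GET | System health status |
| GET /monitor/browsers | GET | Browser pool status |
| GET /monitor/requests | GET | Request statistics |

Utility Endpoints

| Endpoint | Method | Purpose |
| --- | --- | --- |
| POST /screenshot | POST | Web page screenshot |
| POST /pdf | POST | PDF generation |
| POST /execute_js | POST | JavaScript execution |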

🎯 Use Cases

Use Case 1: Documentation Site Crawling

curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://docs.openclaw.ai/zh-CN"],
    "extraction_strategy": "markdown"
  }'

Use Case 2: News Article Extraction

curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://news-site.com/article"],
    "extraction_strategy": {
      "type": "llm",
      "provider": "openrouter/free",
      "instruction": "Extract the article title, author, publication time, and main content",
      "max_tokens": 1500
    }
  }'

Use Case 3: Product Information Scraping

curl -X POST "http://localhost:11235/crawl" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://ecommerce-site.com/products"],
    "extraction_strategy": {
      "type": "llm",
      "provider": "openrouter/free",
      "instruction": "Extract the name, price, description, and image link of every product",
      "max_tokens": 2000
    }
  }'

⚙️ Configuration

LLM Provider Configuration

Create a .llm.env file:

# OpenRouter configuration
OPENROUTER_API_KEY=your-api-key
LLM_PROVIDER=openrouter/free
LLM_MAX_TOKENS=2000
LLM_TEMPERATURE=0.7

# Or use another provider
# OPENAI_API_KEY=sk-your-key
# OPENAI_BASE_URL=https://your-custom-api.com/v1
# LLM_PROVIDER=openai/gpt-4o-mini

Browser Configuration

{
  "browser_config": {
    "headless": true,
    "viewport_width": 1280,
    "viewport_height": 720,
    "user_agent": "Mozilla/5.0..."
  }
}

📈 Response Format

Success Response

{
  "success": true,
  "results": [
    {
      "url": "https://example.com",
      "markdown": "# Extracted Markdown content...",
      "metadata": {
        "title": "Page title",
        "description": "Page description",
        "url": "https://example.com"
      },
      "extracted_content": {
        "summary": "Content extracted by the LLM..."
      }
    }
  ]
}

Error Response

{
  "success": false,
  "error": "Error message",
  "code": "ERROR_CODE"
}
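
Because both response shapes carry a `success` flag, a script can branch on it before touching `results`. The sketch below assumes the response fields are exactly as shown above and that `jq` is installed on the host. `print_crawl_result` is a hypothetical helper name.

```shell
# Read a /crawl response from stdin. Print each result's Markdown on success;
# print the error message and code to stderr and return 1 otherwise.
print_crawl_result() {
  response="$(cat)"
  # jq -e sets a non-zero exit status when the expression yields false or null.
  if printf '%s' "$response" | jq -e '.success == true' > /dev/null; then
    printf '%s' "$response" | jq -r '.results[].markdown'
  else
    printf '%s' "$response" |
      jq -r '"crawl failed: \(.error // "unknown") [\(.code // "n/a")]"' >&2
    return 1
  fi
}
```

Piping any of the `curl` calls above into `print_crawl_result` prints the extracted Markdown, or fails the pipeline step when the service reports an error.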

🔧 Troubleshooting

Common Issues

1. Service not started

# Check container status
docker compose ps

# View logs
docker compose logs crawl4ai

# Restart the service
docker compose restart crawl4ai

2. LLM extraction fails

  • Check the .llm.env configuration
  • Verify the API key
  • Try a different LLM provider
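
The first two checks can be partly automated. The following is a rough sketch that only confirms that `.llm.env` exists and contains an uncommented `*_API_KEY` line whose value is not the `your-api-key` placeholder. It does not verify the key with the provider. `check_llm_env` is a hypothetical helper name.

```shell
# Check that an env file holds a usable-looking API key.
# $1 = path to the env file (default .llm.env).
check_llm_env() {
  env_file="${1:-.llm.env}"
  if [ ! -f "$env_file" ]; then
    echo "missing $env_file" >&2
    return 1
  fi
  # Keep uncommented *_API_KEY lines, then require one that is not the placeholder.
  if grep -E '^[A-Z_]+_API_KEY=.+' "$env_file" | grep -qv 'your-api-key$'; then
    echo "API key found in $env_file"
  else
    echo "no usable API key in $env_file" >&2
    return 1
  fi
}
```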

3. Network connectivity issues

# Test network connectivity
curl -I https://example.com

# Check proxy configuration
env | grep -i proxy

Monitoring & Debugging

# Open the monitoring dashboard (open is macOS-only; use xdg-open on Linux)
open http://localhost:11235/dashboard

# View system health
curl http://localhost:11235/monitor/health

# View browser pool status
curl http://localhost:11235/monitor/browsers
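
For a quick pass over all the GET endpoints listed in the reference tables, a loop that prints only the HTTP status code can be easier to scan than full responses. This is a sketch that treats any status other than 200 as a failure, which is an assumption about how the service reports health. `check_endpoints` is a hypothetical helper name.

```shell
# Print the HTTP status of each GET endpoint; return 1 if any is not 200.
check_endpoints() {
  base="${BASE_URL:-http://localhost:11235}"
  failed=0
  for path in /health /monitor/health /monitor/browsers /monitor/requests; do
    # -o /dev/null discards the body; -w prints only the status code
    # (curl reports 000 when the connection itself fails).
    code="$(curl -s -o /dev/null -w '%{http_code}' "$base$path")"
    echo "$path -> $code"
    [ "$code" = "200" ] || failed=1
  done
  return "$failed"
}
```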


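🎉 Why Choose the Docker Version?

  • Containerized deployment - one-command startup with an isolated environment
  • REST API - standard HTTP interface that is easy to integrate
  • Real-time monitoring - built-in monitoring dashboard and API
  • Resource management - automatic browser pool management
  • Production ready - enterprise-grade stability and performance

Get started with the Dockerized Crawl4AI service now! 🚀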
