smart-crawler

智能爬虫工具 - 企业级数据采集与反爬虫处理 | Smart Web Crawler - Enterprise data collection with anti-detection

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "smart-crawler" with this command: npx skills add kaiyuelv/smart-crawler

Smart Crawler - 智能爬虫工具

企业级数据采集解决方案，支持智能反爬虫处理、分布式爬取和数据清洗。

核心功能

功能模块	说明
智能爬虫引擎	基于 Playwright/Selenium 的动态渲染爬取
反爬虫处理	自动切换 User-Agent、代理池、请求频率控制
数据提取	XPath/CSS Selector/Regex 多模式数据提取
分布式支持	Redis 队列支持的分布式爬取
数据清洗	自动去重、格式标准化、敏感信息过滤

快速开始

from scripts.crawler_engine import CrawlerEngine

# 创建爬虫引擎
crawler = CrawlerEngine(use_proxy=True, headless=True)

# 爬取网页
result = crawler.crawl('https://example.com', 
                       extract_rules={'title': '//h1/text()',
                                     'content': '//div[@class="content"]//p/text()'})
print(result)

安装

pip install -r requirements.txt
playwright install

项目结构

smart-crawler/
├── SKILL.md                 # Skill说明文档
├── README.md                # 完整文档
├── requirements.txt         # 依赖列表
├── scripts/                 # 核心模块
│   ├── crawler_engine.py    # 爬虫引擎
│   ├── proxy_manager.py     # 代理管理器
│   ├── data_extractor.py    # 数据提取器
│   └── anti_detection.py    # 反检测模块
├── examples/                # 使用示例
│   └── basic_usage.py
└── tests/                   # 单元测试
    └── test_crawler.py

运行测试

cd tests
python test_crawler.py

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Open Registry Record Open in ClawHub

Related Skills

Related by shared tags or category signals.

Automation

AutoClaw Browser Automation

Complete browser automation skill with MCP protocol support and Chrome extension

Registry SourceRecently Updated

1K0Profile unavailable

Coding

Dexter Browser Automation

Browser automation via Playwright. Use when pages are JS-rendered, require user interaction (click/fill), or need screenshots. Part of the 3-layer web tool h...

Registry SourceRecently Updated

770Profile unavailable

Automation

AgentGo Cloud Browser

Automates browser interactions using AgentGo's distributed cloud browser cluster via playwright@1.51.0. Use when the user needs to navigate websites, interac...

Registry Source

3640Profile unavailable

Coding

Scrapling

Web scraping and data extraction using the Python Scrapling library. Use to scrape static HTML pages, JavaScript-rendered pages (Playwright), and anti-bot or...

Registry Source

5921Profile unavailable