Web Content Extractor

Version: 2.0
Author: OpenClaw Team
Updated: 2026-03-15
License: MIT

📦 Skill Metadata

name: web-content-extractor
version: 2.0.0
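description: Extract clean content from WeChat articles, blogs, and news pages, removing ads and sidebars
category: Content processing
tags: [web extraction, content cleaning, WeChat articles, Markdown]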
author: OpenClaw Team
license: MIT

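🎯 Overview

A web content extraction tool built on three engines (Readability, Firecrawl, and Defuddle) and optimized for Chinese-language content. It handles WeChat articles, news sites, blogs, and other sources, automatically strips ads, navigation, and sidebars, and outputs clean Markdown.

Core capabilities:

✅ WeChat article extraction (mp.weixin.qq.com)
✅ News page cleanup
✅ Blog post parsing
✅ Metadata extraction (title/author/date)
✅ Multiple output formats (Markdown/JSON/plain text)
✅ Batch processing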

🚀 Quick Start

Basic call

# OpenClaw tool call
result = web_fetch(
    url="https://mp.weixin.qq.com/s/xxx",
    extractMode="markdown",
    maxChars=8000
)

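Full parameter list

Parameter	Type	Required	Default	Description
url	str	✅	-	Page URL
extractMode	str	❌	"markdown"	Output format (markdown/text/json)
maxChars	int	❌	8000	Maximum number of characters returned
includeMetadata	bool	❌	true	Whether to include metadata
timeout	int	❌	30	Timeout (seconds)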
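
The defaults in the table above can be captured in a small request builder that rejects obviously invalid values before a call is made. This is a minimal sketch: `FetchRequest` and `ALLOWED_MODES` are hypothetical names introduced here for illustration, and the validation rules (positive `maxChars` and `timeout`, http(s) URLs only) are assumptions rather than documented behavior of the tool.

```python
from dataclasses import asdict, dataclass

# Output formats listed in the parameter table.
ALLOWED_MODES = {"markdown", "text", "json"}


@dataclass
class FetchRequest:
    """Hypothetical container mirroring the documented parameters and defaults."""

    url: str
    extractMode: str = "markdown"
    maxChars: int = 8000
    includeMetadata: bool = True
    timeout: int = 30

    def __post_init__(self) -> None:
        # Assumed validation rules; the real tool may accept or reject other values.
        if not self.url.startswith(("http://", "https://")):
            raise ValueError(f"url must be an http(s) URL, got {self.url!r}")
        if self.extractMode not in ALLOWED_MODES:
            raise ValueError(f"extractMode must be one of {sorted(ALLOWED_MODES)}")
        if self.maxChars <= 0:
            raise ValueError("maxChars must be positive")
        if self.timeout <= 0:
            raise ValueError("timeout must be positive")


# Only url is required; everything else falls back to the documented default.
request = asdict(FetchRequest(url="https://mp.weixin.qq.com/s/xxx"))
print(request)
```

The resulting dictionary could then be passed to the tool as keyword arguments, for example `web_fetch(**request)`, assuming the tool accepts exactly these parameter names.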

📤 Input and Output

Input example

{
  "url": "https://mp.weixin.qq.com/s/abcdefg",
  "extractMode": "markdown",
  "maxChars": 8000,
  "includeMetadata": true
}

Output example

{
  "success": true,
  "url": "https://mp.weixin.qq.com/s/abcdefg",
  "title": "Article title",
  "author": "Author name",
  "publishDate": "2026-03-15",
  "content": "Body content in Markdown format...",
  "wordCount": 2500,
  "readTime": "10 min",
  "images": ["https://..."],
  "extractTime": 0.8
}
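
A caller will usually check `success` before touching the other fields. The sketch below parses a response shaped like the output example and builds a one-line summary. The `summarize` helper is hypothetical, and the handling of a failed response is an assumption, since the shape of an error response is not documented here.

```python
import json

# Response shaped like the output example above.
SAMPLE = """
{
  "success": true,
  "url": "https://mp.weixin.qq.com/s/abcdefg",
  "title": "Article title",
  "author": "Author name",
  "publishDate": "2026-03-15",
  "content": "Body content in Markdown format...",
  "wordCount": 2500,
  "readTime": "10 min",
  "images": ["https://..."],
  "extractTime": 0.8
}
"""


def summarize(raw: str) -> str:
    """Return a one-line summary of a successful extraction result."""
    data = json.loads(raw)
    # Assumption: a failed extraction sets "success" to false or omits it.
    if not data.get("success"):
        raise RuntimeError(f"extraction failed for {data.get('url', '<unknown>')}")
    image_count = len(data.get("images", []))
    return (
        f"{data['title']} by {data['author']} ({data['publishDate']}), "
        f"{data['wordCount']} words, {image_count} image(s)"
    )


print(summarize(SAMPLE))
```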

🔧 Architecture

Three-engine design

                    User request
                         ↓
               ┌────────────────────┐
               │   Routing layer    │
               └────────────────────┘
                         ↓
       ┌─────────────────┼─────────────────┐
       ↓                 ↓                 ↓
┌──────────────┐  ┌──────────────┐  ┌──────────────┐
│  web_fetch   │  │   defuddle   │  │   browser    │
│    (fast)    │  │ (specialist) │  │  (fallback)  │
└──────────────┘  └──────────────┘  └──────────────┘
       ↓                 ↓                 ↓
               ┌────────────────────┐
               │ Result aggregation │
               └────────────────────┘
                         ↓
                   Return to user

Engine comparison

Engine	Speed	Success rate	Best for
web_fetch	<1s	70%	WeChat articles / general web pages
defuddle	<1s	75%	Blogs / news sites
browser	5-10s	90%	Complex SPAs / dynamic pages
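
One way to read the comparison above is as a fallback chain: try the fast engines first and only pay the cost of the browser engine when they return nothing usable. The sketch below illustrates that idea with stub engines. `extract_with_fallback` and the three stubs are hypothetical, and the real routing layer may select engines differently (for example by domain, or in parallel).

```python
from typing import Callable, List, Optional, Tuple

# An engine takes a URL and returns extracted text, or None/"" when it finds nothing.
Engine = Callable[[str], Optional[str]]


def extract_with_fallback(url: str, engines: List[Tuple[str, Engine]]) -> Tuple[str, str]:
    """Try each engine in order; return (engine_name, content) for the first usable result."""
    errors = []
    for name, engine in engines:
        try:
            content = engine(url)
        except Exception as exc:  # an engine failure should not stop the chain
            errors.append(f"{name}: {exc}")
            continue
        if content and content.strip():
            return name, content
        errors.append(f"{name}: empty content")
    raise RuntimeError("all engines failed: " + "; ".join(errors))


# Stub engines standing in for the real ones (assumptions for illustration only).
def fast_stub(url: str) -> Optional[str]:
    return ""  # simulates a page that needs JavaScript rendering


def specialist_stub(url: str) -> Optional[str]:
    raise TimeoutError("timed out")


def browser_stub(url: str) -> Optional[str]:
    return "# Title\n\nBody"


chain = [("web_fetch", fast_stub), ("defuddle", specialist_stub), ("browser", browser_stub)]
result = extract_with_fallback("https://example.com/article", chain)
print(result[0])
```

Ordering the chain from cheapest to most expensive keeps the common case fast while still reaching the engine with the highest listed success rate when needed.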

📋 Usage Scenarios

Scenario 1: WeChat article extraction

result = web_fetch(
    url="https://mp.weixin.qq.com/s/xxx",
    extractMode="markdown"
)
print(result["content"])

Scenario 2: Batch processing

urls = ["url1", "url2", "url3"]
results = [web_fetch(url=u) for u in urls]

Scenario 3: Extraction with metadata

result = web_fetch(
    url="https://example.com/article",
    includeMetadata=True
)
print(f"Title: {result['title']}")
print(f"Author: {result['author']}")
print(f"Word count: {result['wordCount']}")

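⚠️ Limitations and Notes

Unsupported scenarios

❌ Pages that require login
❌ Paywalled content
❌ CAPTCHA-protected pages
❌ SPAs rendered entirely in JavaScript (use the browser engine instead)

Rate limits

Domain type	Request interval	Concurrency limit
WeChat articles	2 s	1
News sites	1 s	3
Blogs	1 s	5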
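
The request intervals above can be enforced client-side with a small per-host throttle. The sketch below is one possible approach: `Throttle`, `domain_type`, and `INTERVALS` are hypothetical names, only the WeChat host is recognized explicitly (news sites and blogs both fall under the 1-second default, since they cannot be told apart from the URL alone), and the concurrency limits are not handled.

```python
import time
from urllib.parse import urlparse

# Minimum seconds between requests to the same host, taken from the table above.
INTERVALS = {"wechat": 2.0, "default": 1.0}


def domain_type(url: str) -> str:
    """Classify a URL; anything that is not a WeChat article uses the default interval."""
    host = urlparse(url).hostname or ""
    return "wechat" if host == "mp.weixin.qq.com" else "default"


class Throttle:
    """Sleeps as needed so requests to one host respect the minimum interval."""

    def __init__(self, clock=time.monotonic, sleep=time.sleep):
        self._clock = clock
        self._sleep = sleep
        self._last = {}  # host -> time of the most recent request

    def wait(self, url: str) -> float:
        """Block until a request to url is allowed; return the seconds slept."""
        host = urlparse(url).hostname or ""
        interval = INTERVALS[domain_type(url)]
        last = self._last.get(host)
        delay = 0.0 if last is None else max(0.0, last + interval - self._clock())
        if delay > 0:
            self._sleep(delay)
        self._last[host] = self._clock()
        return delay


throttle = Throttle()
for u in ["https://mp.weixin.qq.com/s/xxx", "https://example.com/article"]:
    throttle.wait(u)  # first request to each host goes through without sleeping
    # result = web_fetch(url=u)  # the actual tool call would go here
```

The clock and sleep functions are injectable so the throttle can be tested without real delays.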

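Compliance requirements

Extract only publicly accessible content
Respect robots.txt
Do not use for commercial purposes without authorization
Preserve attribution to the original author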

🎛️ Advanced Configuration

Custom User-Agent

result = web_fetch(
    url="https://example.com",
    userAgent="Mozilla/5.0 ..."
)

Proxy configuration

result = web_fetch(
    url="https://example.com",
    proxy="http://proxy:port"
)

Cache control

# Enable caching (1 hour)
result = web_fetch(url, cache=True, ttl=3600)

# Force a refresh
result = web_fetch(url, cache=False)
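
To make the `cache` and `ttl` semantics concrete, here is a minimal in-memory sketch of a time-to-live cache wrapped around a fetch function. `TTLCache` and `get_or_fetch` are hypothetical names, and the choice that `cache=False` bypasses the lookup but still stores the fresh result is an assumption, not documented behavior.

```python
import time


class TTLCache:
    """Keeps fetched results per URL until their time-to-live expires."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._store = {}  # url -> (expiry_time, value)

    def get_or_fetch(self, url, fetch, cache=True, ttl=3600):
        now = self._clock()
        if cache:
            hit = self._store.get(url)
            if hit is not None and now < hit[0]:
                return hit[1]  # still fresh: serve from cache
        # Cache miss, expired entry, or forced refresh: fetch again.
        value = fetch(url)
        # Assumption: a forced refresh also replaces the stored entry.
        self._store[url] = (now + ttl, value)
        return value


calls = []


def fake_fetch(url):
    """Stand-in for the real tool call; records how often it is invoked."""
    calls.append(url)
    return f"content #{len(calls)}"


store = TTLCache()
first = store.get_or_fetch("https://example.com", fake_fetch)
second = store.get_or_fetch("https://example.com", fake_fetch)  # served from cache
forced = store.get_or_fetch("https://example.com", fake_fetch, cache=False)
print(first, second, forced, len(calls))
```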

📊 Performance Metrics

Metric	Value
Average response time	0.8 s
P95 response time	2.5 s
Success rate	85%
Cache hit rate	60%

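🔍 Troubleshooting

Problem 1: Extracted content is empty

Cause: The page requires JavaScript rendering
Fix: Switch to the browser engine

Problem 2: WeChat article extraction fails

Cause: The link has expired or anti-scraping measures are in place
Fix:

Check that the link is still valid
Try the browser engine
Copy the content manually

Problem 3: Extracted content is incomplete

Cause: The content was truncated by the maxChars limit
Fix: Increase maxChars or process the content in pages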
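
One reading of "process the content in pages" is to fetch with a larger `maxChars` and then split the Markdown into pieces that each fit a downstream limit. The sketch below splits at paragraph boundaries where possible. `split_markdown` is a hypothetical helper, not part of the tool, and blank paragraphs are dropped for simplicity.

```python
from typing import List


def split_markdown(text: str, max_chars: int = 8000) -> List[str]:
    """Split text into chunks of at most max_chars, preferring paragraph breaks."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    chunks: List[str] = []
    current = ""
    for para in text.split("\n\n"):
        # A single paragraph longer than the limit has to be cut mid-paragraph.
        while len(para) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        if not para:
            continue
        candidate = para if not current else current + "\n\n" + para
        if len(candidate) <= max_chars:
            current = candidate  # paragraph still fits in the current chunk
        else:
            chunks.append(current)
            current = para  # start a new chunk with this paragraph
    if current:
        chunks.append(current)
    return chunks


pages = split_markdown("aaaa\n\nbbbb\n\ncccc", max_chars=10)
print(pages)  # → ['aaaa\n\nbbbb', 'cccc']
```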

📚 Dependencies

{
  "readability": "^0.4.4",
  "firecrawl": "^1.0.0",
  "defuddle": "^3.0.0"
}

🤝 Contributing

Fork this repository
Create a feature branch (git checkout -b feature/AmazingFeature)
Commit your changes (git commit -m 'Add some AmazingFeature')
Push to the branch (git push origin feature/AmazingFeature)
Open a Pull Request

📄 License

MIT License. See LICENSE for details.

📞 Support

Documentation: https://docs.openclaw.ai/skills/web-content-extractor
Issue tracker: https://github.com/openclaw/openclaw/issues
Community: https://discord.com/invite/clawd

Last updated: 2026-03-15
Maintenance status: ✅ Actively maintained