Deep Web Fetcher

# Skill: Deep Web Fetcher

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "Deep Web Fetcher" with this command: npx skills add xueylee-dotcom/deep-web-fetcher

Version: 1.0.0
Description: Free web page fetching, content extraction, and structured output, with no paid API required


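Core features

  • Page fetching: supports JavaScript rendering and waits automatically for the page to load
  • Body extraction: identifies the main article content and filters out ads and navigation
  • Metadata extraction: automatically extracts the title, author, and publication date
  • Metric extraction: pulls key figures from the body text (sample size, AUC, cost, and so on)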
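
As an illustration of the metadata extraction feature, here is a minimal sketch that reads the title, author, and publication date from common `<meta>` tags. It uses only the standard library so it runs anywhere. The `META_FIELDS` mapping, the `MetaParser` class, and `extract_metadata` are hypothetical names, and the skill's actual script may use different tags or a different parser.

```python
from html.parser import HTMLParser

# Hypothetical mapping from common <meta> keys to the output field names.
META_FIELDS = {
    "og:title": "title",
    "citation_title": "title",
    "author": "author",
    "article:author": "author",
    "citation_author": "author",
    "article:published_time": "published_date",
    "citation_publication_date": "published_date",
}


class MetaParser(HTMLParser):
    """Collect <meta> values and the text of the <title> element."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.page_title = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
            return
        if tag != "meta":
            return
        attr = dict(attrs)
        key = (attr.get("property") or attr.get("name") or "").lower()
        content = (attr.get("content") or "").strip()
        field = META_FIELDS.get(key)
        if field and content:
            # Keep the first value seen for each field.
            self.meta.setdefault(field, content)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title and data.strip() and self.page_title is None:
            self.page_title = data.strip()


def extract_metadata(html):
    """Return title, author, and published_date; missing values are None."""
    parser = MetaParser()
    parser.feed(html)
    return {
        # Prefer a <meta> title; fall back to the <title> element.
        "title": parser.meta.get("title") or parser.page_title,
        "author": parser.meta.get("author"),
        "published_date": parser.meta.get("published_date"),
    }
```

Returning `None` for missing fields keeps the result shape stable, so downstream code does not need to check whether a key exists.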

Trigger command

/web-fetcher <url> [--domain <domain>]

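Parameters

| Parameter | Default | Description |
|---|---|---|
| url | required | Target page URL |
| --domain | general | Research domain; determines which metric extraction rules apply |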
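
The command-line interface described here could be parsed roughly as follows. This is a sketch, not the skill's actual argument handling: `build_parser` and `parse_args` are hypothetical helpers, and the accepted `--domain` values are the five domain names this document lists.

```python
import argparse

# Domain names as listed in this document.
DOMAINS = ["general", "healthcare", "medical", "insurance", "machine_learning"]


def build_parser():
    """Build a parser for: web-fetcher <url> [--domain <domain>]."""
    parser = argparse.ArgumentParser(
        prog="web-fetcher",
        description="Fetch a page and print structured JSON.",
    )
    parser.add_argument("url", help="Target page URL (required)")
    parser.add_argument(
        "--domain",
        default="general",
        choices=DOMAINS,
        help="Research domain; selects the metric extraction rules",
    )
    return parser


def parse_args(argv=None):
    """Parse argv (defaults to sys.argv[1:]) into a namespace."""
    return build_parser().parse_args(argv)
```

Restricting `--domain` with `choices` means a misspelled value such as `"machine learning"` (with a space) is rejected up front instead of silently falling back to general rules.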

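Domain options

  • general: general-purpose extraction
  • healthcare: healthcare and health services
  • medical: medical research
  • insurance: insurance cost control
  • machine_learning: machine learning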
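
One way to picture domain-dependent metric extraction is a table of regular expressions keyed by domain. The rules below are purely illustrative assumptions: the skill's real patterns are not shown in this document, and `COMMON_RULES`, `DOMAIN_RULES`, and `extract_metrics` are hypothetical names. The value types follow the sample output, where the sample size is a string and AUC and accuracy are numbers.

```python
import re

# Hypothetical rules: metric name -> (regex with one capture group, converter).
COMMON_RULES = {
    "sample_size": (re.compile(r"\b[nN]\s*=\s*(\d+(?:,\d{3})*)"), str),
}

AUC_RULE = (re.compile(r"\bAUC\b[^0-9]{0,20}(0?\.\d+)"), float)

DOMAIN_RULES = {
    "general": {},
    "healthcare": {"auc": AUC_RULE},
    "medical": {"auc": AUC_RULE},
    "insurance": {
        "cost": (re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d+)?)"), str),
    },
    "machine_learning": {
        "auc": AUC_RULE,
        "accuracy": (
            re.compile(r"\baccuracy\b[^0-9]{0,20}(\d+(?:\.\d+)?)\s*%", re.I),
            float,
        ),
    },
}


def extract_metrics(text, domain="general"):
    """Apply the common rules plus the rules for one domain to the body text."""
    if domain not in DOMAIN_RULES:
        raise ValueError(f"unknown domain: {domain}")
    rules = {**COMMON_RULES, **DOMAIN_RULES[domain]}
    metrics = {}
    for name, (pattern, convert) in rules.items():
        match = pattern.search(text)
        if match:
            metrics[name] = convert(match.group(1))
    return metrics
```

A short usage example under these assumed rules:

```python
text = "We enrolled n = 9,080 patients; the model reached an AUC of 0.85."
print(extract_metrics(text, "medical"))
```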

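Execution flow

1. Launch a Playwright browser
2. Open the target URL and wait for JavaScript rendering to finish
3. Extract the main content with Readability
4. Extract metadata (title, author, publication date)
5. Extract key metrics according to the domain rules
6. Output the result as JSON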
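
The steps above can be sketched as a small pipeline. This is an approximation under stated assumptions, not the contents of `web-fetcher.py`: the function names are hypothetical, waiting for `networkidle` is only one possible way to wait for rendering, and author and date extraction (step 4) is reduced to the title that readability-lxml provides. Third-party imports are done inside the functions so the orchestration can be exercised without a browser.

```python
import json


def render_page(url, timeout_ms=30000):
    """Steps 1-2: launch Chromium and return the rendered HTML."""
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # Assumption: treat a quiet network as "rendering finished".
            page.goto(url, wait_until="networkidle", timeout=timeout_ms)
            return page.content()
        finally:
            browser.close()


def extract_main(html):
    """Steps 3-4 (partial): isolate the article body and its title."""
    from bs4 import BeautifulSoup
    from readability import Document

    doc = Document(html)
    content_html = doc.summary()
    content_text = BeautifulSoup(content_html, "lxml").get_text(" ", strip=True)
    return {
        "title": doc.title(),
        "content_html": content_html,
        "content_text": content_text,
    }


def no_metrics(text, domain):
    """Placeholder for step 5; a real rule set would go here."""
    return {}


def fetch(url, domain="general", render=render_page,
          extract=extract_main, metrics=no_metrics):
    """Run the steps in order; failures become a result with success=False."""
    try:
        article = extract(render(url))
        return {
            "url": url,
            "success": True,
            **article,
            "extracted_metrics": metrics(article["content_text"], domain),
            "error": None,
        }
    except Exception as exc:
        return {"url": url, "success": False, "error": str(exc)}


def to_json(result):
    """Step 6: serialize without escaping non-ASCII text."""
    return json.dumps(result, ensure_ascii=False, indent=2)
```

Passing `render` and `extract` as parameters is a design choice of this sketch: it lets the flow be tested with stub functions before Playwright and its browser download are installed.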

Output format

{
  "url": "https://example.com/article",
  "success": true,
  "title": "Article title",
  "author": "Author name",
  "published_date": "2024-01-15",
  "content_text": "Body text...",
  "content_html": "<html>...</html>",
  "word_count": 1500,
  "extracted_metrics": {
    "sample_size": "9,080",
    "auc": 0.85,
    "accuracy": 92.5
  },
  "error": null
}
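
For code that consumes this output, a typed wrapper can make the fields explicit. The sketch below mirrors the keys of the sample JSON. The `FetchResult` class is a hypothetical helper, and the defaults for missing keys are an assumption, since the document does not say which fields are present when a fetch fails.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class FetchResult:
    """Mirror of the JSON output; defaults are assumptions for absent keys."""

    url: str
    success: bool
    title: Optional[str] = None
    author: Optional[str] = None
    published_date: Optional[str] = None
    content_text: str = ""
    content_html: str = ""
    word_count: int = 0
    extracted_metrics: dict = field(default_factory=dict)
    error: Optional[str] = None

    @classmethod
    def from_json(cls, text):
        """Parse fetcher output, ignoring keys this class does not know."""
        data = json.loads(text)
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in data.items() if k in known})

    def to_json(self):
        """Serialize back to JSON without escaping non-ASCII text."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)
```

A caller would typically check `result.success` first and read `result.error` only when it is false.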

Usage examples

Fetch an arXiv paper

/web-fetcher "https://arxiv.org/abs/2301.12345" --domain "machine_learning"

Fetch a PubMed abstract

/web-fetcher "https://pubmed.ncbi.nlm.nih.gov/38134648/" --domain "medical"

Fetch a government report

/web-fetcher "https://www.gov.cn/zhengce/zhengceku/2024-01/15/content_6923456.htm" --domain "insurance"

Installing dependencies

# Install Python dependencies
pip install playwright readability-lxml lxml beautifulsoup4

# Install the browser driver (the first run downloads about 100 MB)
playwright install chromium

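Notes

Anti-scraping measures

Some sites use anti-scraping mechanisms. If a fetch fails, you can:

  1. Add delay: adjust time.sleep() in the script
  2. Use a proxy: pass a proxy to browser.new_context()
  3. Rotate the User-Agent: change the user_agent parameter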
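
The three adjustments above might look like this in Playwright's sync API. Treat it as a sketch: `context_options` and `fetch_html` are hypothetical names, the User-Agent strings are examples only, and the proxy address in the usage line is a placeholder. `browser.new_context()` does accept `proxy` and `user_agent` arguments.

```python
import random
import time

# Example User-Agent strings; replace with ones appropriate for your use.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]


def context_options(proxy_server=None, user_agents=USER_AGENTS):
    """Build keyword arguments for browser.new_context()."""
    options = {"user_agent": random.choice(user_agents)}  # rotate the UA
    if proxy_server:
        options["proxy"] = {"server": proxy_server}
    return options


def fetch_html(url, proxy_server=None, delay_range=(1.0, 3.0)):
    """Sleep a random interval, then load the page in a fresh context."""
    from playwright.sync_api import sync_playwright

    time.sleep(random.uniform(*delay_range))  # add delay between requests
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            context = browser.new_context(**context_options(proxy_server))
            page = context.new_page()
            page.goto(url, wait_until="networkidle")
            return page.content()
        finally:
            browser.close()
```

Example call, assuming a local proxy is listening on the placeholder address: `fetch_html("https://example.com", proxy_server="http://127.0.0.1:8080")`.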

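Extraction accuracy

  • Standard pages (articles, blogs): ✅ works very well
  • Complex layouts (multi-column, dynamically loaded): ⚠️ may need manual review
  • PDF pages: ❌ not supported; use a dedicated PDF tool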

Speed

  • Single page: 5-15 seconds (including browser startup)
  • Batch fetching: a concurrency of 3-5 is recommended
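
A bounded thread pool is one simple way to stay within the recommended concurrency. In this sketch `fetch_many` is a hypothetical helper and `fetch_one` stands for whatever single-page function you use. Note that Playwright's API is not thread-safe, so each call to `fetch_one` should start its own Playwright instance rather than share a browser across threads.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_many(urls, fetch_one, max_workers=3):
    """Fetch URLs with at most max_workers in flight; keep input order."""

    def safe(url):
        # One failing page should not abort the whole batch.
        try:
            return fetch_one(url)
        except Exception as exc:
            return {"url": url, "success": False, "error": str(exc)}

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(safe, urls))
```

Because `pool.map` yields results in input order, the returned list lines up with `urls` even when pages finish at different times.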

Integration with Deep Research v6.0

# Generate a card
/web-fetcher <url> --domain "insurance" > sources/card-xxx.json

# Convert the card format
python3 scripts/convert-to-card.py sources/card-xxx.json

File structure

skills/web-fetcher/
├── SKILL.md
└── scripts/
    └── web-fetcher.py

Version history

| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-19 | Initial release |

Completely free, runs locally, and your data never leaves your machine

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
