web-scraping

Extract structured information from websites using web_fetch for simple pages and browser automation for dynamic sites, login-gated flows, pagination, infinite scroll, or pages that require clicking/searching/filtering. Use when the user asks to scrape a site, collect listings, gather fields from many pages, monitor website changes, or turn webpage content into structured summaries/JSON/CSV.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "web-scraping" with this command: npx skills add zhangqixin9527/web-scraping

Web Scraping

Extract data with the lightest reliable method first.

Choose the approach

  1. Use web_fetch for simple public pages when the needed content is already in HTML.
  2. Use browser when the site is dynamic, needs clicking, infinite scroll, filters, tabs, or login/session state.
  3. Use web_search only to discover candidate pages when the target URL is unknown.

Default workflow

  1. Identify the target site and exact fields to collect.
  2. Test one page first.
  3. Decide the extraction method:
    • web_fetch for readable article/listing text
    • browser snapshot for dynamic DOM inspection
  4. Normalize the output into a stable schema.
  5. If scraping multiple pages, avoid tight loops and serialize requests.
  6. Deduplicate by URL or stable item id.
  7. Save results in the workspace when the task is larger than a quick one-off.

Browser scraping pattern

  1. Open the page.
  2. Take a snapshot.
  3. Interact only as needed: search, click filters, pagination, expand sections.
  4. Re-snapshot after each meaningful state change.
  5. Extract only the fields the user asked for.
  6. Close tabs when finished.

Output guidance

Prefer one of these formats:

  • concise bullet summary
  • JSON array of objects
  • CSV/TSV when the user wants exportable rows

Use explicit keys, for example:

[
  {
    "title": "...",
    "url": "...",
    "source": "...",
    "date": "...",
    "summary": "..."
  }
]

Reliability rules

  • Do not invent missing fields.
  • If a site blocks access, say so and switch sources when appropriate.
  • For news/results pages, capture source + title + link at minimum.
  • For large jobs, checkpoint partial results to a workspace file.
  • Prefer fewer larger writes over many tiny writes.

Cleanup

  • Close browser tabs opened for scraping.
  • If you create state/output files, store them under the workspace and name them clearly.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

Harbor Skills

Harbor 镜像仓库综合管理技能。用于 Harbor 日常运维、项目与镜像管理、安全扫描、清理策略、CI/CD 集成、GitOps、复制规则、存储管理、备份恢复、webhook 联动等所有 Harbor 相关操作。当用户提到 Harbor、镜像仓库管理、Docker 镜像、镜像安全扫描、CI/CD 镜像推送/拉...

Registry SourceRecently Updated
Automation

Dynamics Crm

Microsoft Dynamics 365 integration. Manage crm and sales data, records, and workflows. Use when the user wants to interact with Microsoft Dynamics 365 data.

Registry SourceRecently Updated
Automation

Jira

Jira integration. Manage project management and ticketing data, records, and workflows. Use when the user wants to interact with Jira data.

Registry SourceRecently Updated
Automation

Generate Education Ad Creative Brief

Plan campaign visuals and hooks for education promotions. Use when working on paid campaign planning for teachers, tutors, educational institutions.

Registry SourceRecently Updated