datacrawl-debug

Use when user needs to process web data, debug data collection code, clean processed data, or iterate on data processing strategies. Use when generating data processing code from URL and field descriptions. Use when diagnosing data processing errors like 403, timeout, selector failures, encoding issues. Use when cleaning, deduplicating, normalizing, and formatting processed data. Use when optimizing data processing strategies based on run history analysis. Use when user mentions "数据处理", "数据整理", "数据清洗", "数据代码", "数据调试", "data processing", "data extraction", "debug data".

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.


Install skill "datacrawl-debug" with this command: npx skills add wangm-a3/datacrawl-debug

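DataProcess Debug: End-to-End Data Processing Toolkit

Handles it · Fixes it · Cleans it · Keeps it running

Core Positioning

The "emergency room + gym" of data processing: come to the ER when something breaks (DebugRunner), come to the gym for routine training (IterateOptimizer), and have a nutritionist on hand throughout (DataCleaner).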

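5 Core Modules

1. ProcessEngine: processing config generation + result parsing

scripts/process-engine.py config --url URL --fields FIELD1 FIELD2 --mode static|dynamic|api
scripts/process-engine.py extract --html "HTML_CONTENT" --fields FIELD1 FIELD2
  • Automatic site-type detection (e-commerce / B2B / social media / content / government / developer)
  • Tool recommendations for each of the 3 modes + CSS/XPath selector suggestions
  • Structured HTML extraction (text / links / images / tables / lists)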
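
The structured extraction step can be pictured with a minimal sketch. This is not the code in process-engine.py (which the listing does not show); it assumes only the Python standard library, and the class and function names are hypothetical. It covers links, images, and visible text, and leaves tables and lists out.

```python
from html.parser import HTMLParser


class LinkAndTextExtractor(HTMLParser):
    """Collect link targets, image sources, and visible text from HTML."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a" and attr_map.get("href"):
            self.links.append(attr_map["href"])
        elif tag == "img" and attr_map.get("src"):
            self.images.append(attr_map["src"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        stripped = data.strip()
        # Ignore whitespace-only chunks and anything inside script/style.
        if stripped and self._skip_depth == 0:
            self.text_parts.append(stripped)


def extract_structure(html_text):
    """Hypothetical helper: return links, images, and text as a dict."""
    parser = LinkAndTextExtractor()
    parser.feed(html_text)
    parser.close()
    return {
        "links": parser.links,
        "images": parser.images,
        "text": parser.text_parts,
    }


sample = (
    '<div><h1>Title</h1><a href="/a">Go</a>'
    '<img src="x.png"><script>var x=1;</script></div>'
)
result = extract_structure(sample)
print(result)
```

A real extractor would likely map each requested field to a CSS or XPath selector; the standard-library parser is used here only to keep the sketch dependency-free.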

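2. CodeGenerator: automatic processing-code generation

scripts/code-generator.py --name PROJECT_NAME --url URL --fields FIELD1 FIELD2 --mode requests_bs4|playwright|api_client
  • Automatic selection among 3 templates: static page / dynamically rendered page / API endpoint
  • Generates complete, runnable code + dependency installation + usage steps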

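3. DebugRunner: code debugging and repair

scripts/debug-runner.py --error "ERROR_MESSAGE"
  • Library of 8 error-pattern categories: connection/http_error/timeout/selector_error/encoding/json_parse/selenium_playwright/rate_limit
  • Precise diagnosis by HTTP subtype (403 forbidden / 429 rate limiting / 503 service unavailable, each with its own remedy)
  • Code snippet scanning (automatically detects missing exception handling, timeouts, delays, and User-Agent headers)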
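
As a rough illustration of pattern-based diagnosis, the sketch below matches an error message against a small regex table and attaches a hint for three HTTP status codes. The pattern table, the hint wording, and the function name `diagnose` are all assumptions made for this example; the actual rules in debug-runner.py are not shown in the listing, and only seven of the eight categories are sketched.

```python
import re

# Hypothetical pattern table. Order matters: the first match wins,
# so the more specific rate_limit check comes before http_error.
ERROR_PATTERNS = [
    ("rate_limit", re.compile(r"\b429\b|too many requests|rate limit", re.I)),
    ("http_error", re.compile(
        r"\b(403|404|500|502|503)\b|forbidden|service unavailable", re.I)),
    ("timeout", re.compile(r"timed? ?out", re.I)),
    ("connection", re.compile(
        r"connection (refused|reset|aborted)|name or service not known", re.I)),
    ("encoding", re.compile(r"unicodedecodeerror|codec can't decode", re.I)),
    ("json_parse", re.compile(r"jsondecodeerror|expecting value", re.I)),
    ("selector_error", re.compile(
        r"nonetype.*has no attribute|no such element", re.I)),
]

# Hypothetical per-status hints, phrased as generic advice.
HTTP_HINTS = {
    "403": "Check headers, cookies, and whether the page requires login.",
    "429": "Slow down and honor the Retry-After header if one is sent.",
    "503": "Retry later with exponential backoff.",
}


def diagnose(message):
    """Return (category, hint) for an error message; hint may be None."""
    category = "unknown"
    for name, pattern in ERROR_PATTERNS:
        if pattern.search(message):
            category = name
            break
    status = re.search(r"\b(403|429|503)\b", message)
    hint = HTTP_HINTS.get(status.group(1)) if status else None
    return category, hint


print(diagnose("403 Client Error: Forbidden for url: https://example.com"))
print(diagnose("Read timed out. (read timeout=10)"))
```

Keeping the table ordered from most to least specific is one simple way to avoid a 429 response being reported as a generic HTTP error.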

4. DataCleaner: data cleaning and formatting

scripts/data-cleaner.py clean --input DATA --remove-html --remove-duplicates
scripts/data-cleaner.py normalize --input DATA --schema TYPE_DEFINITIONS
scripts/data-cleaner.py format --input DATA --format json|csv|jsonl --fields FIELD_LIST
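
The clean and format steps might look roughly like the following. This is a sketch using only the standard library, not the implementation in data-cleaner.py; the helper names (`strip_html`, `clean_records`, `to_jsonl`, `to_csv`) and the sample records are hypothetical, and the regex-based tag stripping is a simplification that assumes reasonably well-formed markup.

```python
import csv
import html
import io
import json
import re

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(value):
    """Remove tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", value)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def clean_records(records):
    """Strip HTML from string fields, then drop exact duplicate rows."""
    seen = set()
    cleaned = []
    for record in records:
        row = {
            key: strip_html(val) if isinstance(val, str) else val
            for key, val in record.items()
        }
        fingerprint = json.dumps(row, sort_keys=True, ensure_ascii=False)
        if fingerprint not in seen:
            seen.add(fingerprint)
            cleaned.append(row)
    return cleaned


def to_jsonl(records):
    """One JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)


def to_csv(records, fields):
    """CSV with a header row, limited to the requested fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


raw = [
    {"name": "<b>Tools</b> &amp; Parts", "price": "10"},
    {"name": "Tools & Parts", "price": "10"},
    {"name": "Widget B", "price": "12"},
]
rows = clean_records(raw)
print(to_jsonl(rows))
print(to_csv(rows, ["name", "price"]))
```

Deduplicating after HTML removal, as done here, catches rows that differ only in markup.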

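5. IterateOptimizer: self-iterating optimization

scripts/iterate-optimizer.py analyze --input run_history.json
scripts/iterate-optimizer.py improve --config CURRENT_CONFIG --analysis ANALYSIS_RESULT
  • Success-rate trends / error clustering / field coverage / optimization suggestions
  • Automatic adjustment of delay, timeout, and retries, plus mode switching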
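
One way to picture the analyze and improve loop is sketched below. The run-history schema (`ok`, `error`), the config keys, and the adjustment rules (double the delay, add 10 seconds of timeout, add one retry below a 50% success rate) are all assumptions chosen for illustration; the listing does not document the real format or thresholds used by iterate-optimizer.py.

```python
from collections import Counter


def analyze_runs(runs):
    """Summarize success rate and the most frequent error categories."""
    total = len(runs)
    successes = sum(1 for run in runs if run.get("ok"))
    errors = Counter(
        run["error"] for run in runs if not run.get("ok") and run.get("error")
    )
    return {
        "success_rate": successes / total if total else 0.0,
        "top_errors": errors.most_common(3),
    }


def improve_config(config, analysis):
    """Return an adjusted copy of config based on the analysis (hypothetical rules)."""
    new_config = dict(config)
    top = dict(analysis["top_errors"])
    if top.get("rate_limit"):
        new_config["delay_seconds"] = config["delay_seconds"] * 2
    if top.get("timeout"):
        new_config["timeout_seconds"] = config["timeout_seconds"] + 10
    if analysis["success_rate"] < 0.5:
        new_config["max_retries"] = config["max_retries"] + 1
    return new_config


history = [
    {"ok": True},
    {"ok": False, "error": "rate_limit"},
    {"ok": False, "error": "rate_limit"},
    {"ok": False, "error": "timeout"},
]
analysis = analyze_runs(history)
config = improve_config(
    {"delay_seconds": 1.0, "timeout_seconds": 10, "max_retries": 2}, analysis
)
print(analysis)
print(config)
```

Returning a new dict rather than mutating the input keeps each iteration's config available for later comparison.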

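Practical Case: Processing Data on Foreign-Trade Bloggers

Built in: scripts/trade-contact-scorer.py

  • 5-dimension follower quality score (engagement rate / save ratio / comment activity / follower count / foreign-trade relevance)
  • 5 tiers: S/A/B/C/D
  • Follower persona inference (factory owner / cross-border seller / SOHO / company operator / beginner)
  • Batch data processing (deduplication + foreign-trade filtering + scoring + persona inference)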
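
The listing does not describe how trade-contact-scorer.py combines the five dimensions, so the sketch below only illustrates the overall shape: five sub-scores in, one tier out. It assumes each sub-score is already normalized to the range 0 to 1, uses a plain average as a stand-in for the real model, and the tier thresholds and key names are hypothetical.

```python
# Hypothetical dimension keys mirroring the five dimensions listed above.
DIMENSIONS = (
    "engagement",
    "save_ratio",
    "comment_activity",
    "follower_scale",
    "trade_relevance",
)

# Hypothetical tier floors, checked from highest to lowest.
TIER_FLOORS = [("S", 0.85), ("A", 0.70), ("B", 0.50), ("C", 0.30)]


def score_contact(sub_scores):
    """Return (total, tier) from sub-scores assumed to lie in [0, 1]."""
    clamped = [
        min(max(sub_scores.get(name, 0.0), 0.0), 1.0) for name in DIMENSIONS
    ]
    total = sum(clamped) / len(DIMENSIONS)
    for tier, floor in TIER_FLOORS:
        if total >= floor:
            return total, tier
    return total, "D"


strong = {name: 0.9 for name in DIMENSIONS}
middling = {name: 0.6 for name in DIMENSIONS}
print(score_contact(strong)[1])
print(score_contact(middling)[1])
print(score_contact({"engagement": 0.5})[1])
```

Missing dimensions count as zero here, which penalizes incomplete profiles; a real scorer might instead skip them or mark the result as low-confidence.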

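Diagnosing Common Processing Problems

Requesting the API directly → you will almost certainly hit restrictions. A workable approach:

  1. Open the web version with Playwright
  2. Log in manually, then save the cookies
  3. Extract data through the search page
  4. Use this skill's scoring model instead of simple weighting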
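
Steps 1 to 3 above can be sketched with Playwright's synchronous API, assuming the `playwright` package and a Chromium build are installed. The state file name, the function names, and the URLs passed in are hypothetical; the import is done lazily inside each function so the module loads even where Playwright is absent.

```python
from pathlib import Path

STATE_FILE = Path("auth_state.json")  # hypothetical file name


def has_saved_session(path=STATE_FILE):
    """True if a saved login state file already exists."""
    return Path(path).is_file()


def save_login_state(login_url, path=STATE_FILE):
    """Open a visible browser, wait for a manual login, then save cookies."""
    from playwright.sync_api import sync_playwright  # lazy import

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=False)
        context = browser.new_context()
        page = context.new_page()
        page.goto(login_url)
        input("Log in in the browser window, then press Enter here...")
        # storage_state writes cookies and local storage to a JSON file.
        context.storage_state(path=str(path))
        browser.close()


def fetch_search_html(search_url, path=STATE_FILE):
    """Reuse the saved login state and return the rendered search page HTML."""
    from playwright.sync_api import sync_playwright  # lazy import

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(storage_state=str(path))
        page = context.new_page()
        page.goto(search_url)
        page_html = page.content()
        browser.close()
        return page_html
```

A typical flow would call `save_login_state(...)` once when `has_saved_session()` is false, then call `fetch_search_html(...)` for each search URL and pass the HTML on to the extraction step.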

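Workflow

  1. Configure: process-engine.py config → understand the target site + get a recommended approach
  2. Generate code: code-generator.py → get a starter code template
  3. Debug: on error → debug-runner.py → diagnosis within seconds
  4. Clean: data-cleaner.py → deduplicate + normalize + format
  5. Iterate: iterate-optimizer.py → keep improving based on run data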

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
