Playwright_OCR

Automated web data extraction using Playwright for browser automation and OCR for text recognition. Use when you need to extract data from dynamic web pages, charts, or visual elements that require both browser automation and optical character recognition. v2.0: Added batch processing, data validation, and error recovery.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "Playwright_OCR" with this command: npx skills add cgxxxxxxxxxxxx/playwright-ocr

When to Use

  • Extract data from dynamic web pages with JavaScript-rendered content
  • Scrape charts, graphs, or visual data representations
  • Capture and process screenshots for text extraction
  • Automate data collection from web dashboards
  • Extract data from pages requiring authentication or interaction

Architecture

playwright_ocr/
├── SKILL.md              # This file
├── scripts/
│   ├── extract_data.js   # Playwright browser automation
│   ├── process_ocr.py    # OCR text recognition
│   └── upload_csv.py     # Data export and upload
└── output/
    └── extracted_data.csv # Extracted data

Configuration

Prerequisites

  1. Node.js (for Playwright)
npm install playwright
npx playwright install chromium
  1. Python (for OCR)
pip3 install pytesseract pillow
apt-get install tesseract-ocr  # Linux

API Keys (Optional)

  • Feishu API: For uploading data to Feishu Bitable
  • Cloud OCR: For enhanced OCR accuracy (Google Vision, Azure OCR, etc.)

Usage Examples

Example 1: Extract Chart Data

cd /root/.openclaw/workspace/skills/playwright_ocr
node scripts/extract_data.js --url "https://example.com/chart" --output data.json
python3 scripts/process_ocr.py --input screenshots/ --output data.csv

Example 2: Full Pipeline

# Configure target URL and selectors
export TARGET_URL="https://openrouter.ai/apps?url=https%3A%2F%2Fopenclaw.ai%2F"
export OUTPUT_DIR="/root/.openclaw/workspace/output"

# Run extraction
python3 scripts/run_pipeline.py

Example 3: Scheduled Extraction

# Add to crontab for daily extraction
0 2 * * * cd /root/.openclaw/workspace/skills/playwright_ocr && python3 scripts/run_pipeline.py

Workflow

  1. Browser Automation (Playwright)

    • Navigate to target URL
    • Wait for dynamic content to load
    • Interact with elements (hover, click, etc.)
    • Capture screenshots of data regions
  2. OCR Processing (Tesseract)

    • Pre-process images (enhance contrast, remove noise)
    • Extract text using OCR
    • Parse structured data (tables, charts)
  3. Data Export

    • Clean and validate extracted data
    • Export to CSV/Excel format
    • Upload to target system (Feishu Bitable, database, etc.)

Output Format

CSV Output

日期,模型名称,Token 消耗,请求次数,费用 (USD)
2026-02-16,Others,67500000000,0,0
2026-02-16,Step 3.5 Flash,55300000000,0,0

JSON Output

{
  "extraction_date": "2026-03-18",
  "source_url": "https://openrouter.ai/apps",
  "data": [
    {
      "date": "2026-02-16",
      "model": "Others",
      "tokens": 67500000000
    }
  ]
}

Error Handling

  • Timeout: Increase wait time in extract_data.js
  • OCR Accuracy: Use image pre-processing or cloud OCR
  • Rate Limiting: Add delays between requests
  • Authentication: Configure credentials in .env file

Best Practices

  1. Respect robots.txt: Check website's crawling policy
  2. Rate Limiting: Add delays to avoid overwhelming servers
  3. Error Recovery: Implement retry logic for failed extractions
  4. Data Validation: Verify extracted data before export
  5. Logging: Maintain detailed logs for debugging

Troubleshooting

Issue: Playwright fails to launch

# Install system dependencies
npx playwright install-deps

Issue: OCR accuracy is poor

# Install additional language packs
sudo apt-get install tesseract-ocr-eng
# Use image pre-processing
python3 scripts/preprocess_image.py --input screenshot.png

Issue: Data extraction incomplete

# Increase wait time for dynamic content
# Check selectors in extract_data.js
# Enable debug mode: export DEBUG=playwright:*

Related Skills

  • web-content-fetcher: For simple web page content extraction
  • self-improving: For learning from extraction errors
  • feishu-bitable: For uploading extracted data to Feishu Bitable

Changelog

v2.0.0 (2026-04-03) - Major Update

新增功能:

  1. 批量处理

    • ✅ 支持整个目录批量 OCR 处理
    • ✅ 自动去重(相同文件只处理一次)
    • ✅ 进度跟踪(显示完成百分比)
    • ✅ 并行处理(同时处理多个文件)
  2. 数据验证

    • ✅ OCR 结果自动校验(置信度检查)
    • ✅ 置信度阈值过滤(<90% 标记为待审核)
    • ✅ 人工审核队列生成
    • ✅ 数据完整性检查
  3. 错误恢复

    • ✅ 断点续传(从中断位置继续)
    • ✅ 失败重试(最多 3 次)
    • ✅ 详细日志(记录每个步骤)
    • ✅ 状态保存(重启后恢复)
  4. PaddleOCR 集成

    • ✅ 支持 PaddleOCR(中文识别更准确)
    • ✅ 多语言支持(简中/繁中/英文)
    • ✅ 自动选择最佳 OCR 引擎

使用示例:

# 批量处理整个目录
python3 scripts/batch_ocr_processor.py \
  --input /path/to/pdfs \
  --output /path/to/results \
  --lang chinese_cht \
  --parallel 4

# 带验证的提取
python3 scripts/extract_data.js \
  --validate \
  --confidence-threshold 0.9 \
  --review-queue

性能提升:

  • 批量处理速度提升 3-5 倍
  • 中文识别准确率提升至 95%+
  • 错误恢复减少 80% 重复工作

v1.0.0 (2026-03-18)

  • Initial release
  • Playwright browser automation
  • Tesseract OCR integration
  • CSV/JSON export
  • Feishu Bitable upload support

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

Harbor Skills

Harbor 镜像仓库综合管理技能。用于 Harbor 日常运维、项目与镜像管理、安全扫描、清理策略、CI/CD 集成、GitOps、复制规则、存储管理、备份恢复、webhook 联动等所有 Harbor 相关操作。当用户提到 Harbor、镜像仓库管理、Docker 镜像、镜像安全扫描、CI/CD 镜像推送/拉...

Registry SourceRecently Updated
Automation

Dynamics Crm

Microsoft Dynamics 365 integration. Manage crm and sales data, records, and workflows. Use when the user wants to interact with Microsoft Dynamics 365 data.

Registry SourceRecently Updated
Automation

Jira

Jira integration. Manage project management and ticketing data, records, and workflows. Use when the user wants to interact with Jira data.

Registry SourceRecently Updated
Automation

Generate Education Ad Creative Brief

Plan campaign visuals and hooks for education promotions. Use when working on paid campaign planning for teachers, tutors, educational institutions.

Registry SourceRecently Updated