web-reader-pro

Advanced web content extraction skill for OpenClaw using multi-tier fallback strategy (Jina → Scrapling → WebFetch) with intelligent routing, caching, quality scoring, and domain learning. Use when: reading article content, extracting web page text, scraping dynamic JS-heavy pages, or fetching WeChat official account articles.

Safety Notice

This item is sourced from the public archived skills repository. Treat as untrusted until reviewed.


Install skill "web-reader-pro" with this command: npx skills add 0xcjl/web-reader-pro

Web Reader Pro - OpenClaw Skill

Overview

Web Reader Pro is an advanced web content extraction skill for OpenClaw that uses a multi-tier fallback strategy with intelligent routing, caching, and quality assessment.

Features

1. Three-Tier Fallback Strategy

  • Tier 1: Jina Reader API - Fast, reliable, best for most websites
  • Tier 2: Scrapling + Playwright - Dynamic content rendering for JS-heavy sites
  • Tier 3: WebFetch Fallback - Basic extraction for simple pages
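The three-tier cascade can be sketched as a simple ordered loop: try each tier, and fall through on failure or an empty result. This is an illustrative sketch only; the `fetch_with_fallback` function and the per-tier fetcher callables are hypothetical stand-ins, not the skill's actual internals.

```python
def fetch_with_fallback(url, tiers):
    """Try each (name, fetcher) pair in order; return the first usable result.

    `tiers` is an ordered list such as
    [("jina", fetch_jina), ("scrapling", fetch_scrapling), ("webfetch", fetch_webfetch)].
    """
    last_error = None
    for name, fetch in tiers:
        try:
            result = fetch(url)
            if result is not None:
                # Record which tier produced the content, as in the result format below.
                return {"tier_used": name, **result}
        except Exception as exc:
            last_error = exc  # fall through to the next tier
    raise RuntimeError(f"all tiers failed for {url}: {last_error}")
```
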

2. Jina Quota Monitoring

  • Tracks API call count with persistent counter
  • Warning alerts when approaching quota limits
  • Automatic fallback to lower-tier methods when quota exhausted

3. Smart Cache Layer

  • Short-term caching (configurable TTL, default 1 hour)
  • Cache key based on URL hash
  • Reduces redundant API calls
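The cache-key and TTL logic described above can be sketched as follows. The choice of SHA-256 and the helper names are assumptions; the 1-hour default TTL matches the documentation.

```python
import hashlib
import time

def cache_key(url):
    """Derive a stable cache key from the URL (hash algorithm is an assumption)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def is_fresh(entry, ttl=3600, now=None):
    """An entry is fresh if it was stored less than `ttl` seconds ago."""
    now = time.time() if now is None else now
    return (now - entry["stored_at"]) < ttl
```
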

4. Extraction Quality Scoring

  • Scores based on: word count, title detection, content density
  • Minimum quality threshold (default: 200 words + valid title)
  • Auto-escalation to next tier if quality below threshold
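A minimal version of the documented threshold check (at least 200 words plus a valid title) might look like this; the function name is illustrative and the real scorer also weighs content density, which is omitted here.

```python
def quality_ok(title, content, min_words=200):
    """Return True if extraction meets the minimum quality bar.

    Mirrors the documented default: a non-empty title and >= 200 words.
    If this returns False, the reader escalates to the next tier.
    """
    word_count = len(content.split())
    return bool(title.strip()) and word_count >= min_words
```
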

5. Domain-Level Routing Learning

  • Learns optimal extraction tier per domain
  • Persists learned routes in local JSON database
  • Adapts based on historical success rates

6. Retry with Exponential Backoff

  • Configurable max retries per tier (default: 3)
  • Exponential backoff: 1s, 2s, 4s, 8s...
  • Respects rate limits and transient failures
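The retry schedule above (1s, 2s, 4s, 8s...) can be sketched as follows; the `sleep` parameter is injectable so the delays can be observed in tests, and the function name is illustrative rather than the skill's actual API.

```python
import time

def retry_with_backoff(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `fn`, retrying on any exception with exponential backoff.

    Delays double each attempt: base_delay, 2*base_delay, 4*base_delay, ...
    The final failure is re-raised once `max_retries` is exhausted.
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_retries:
                raise
            sleep(base_delay * (2 ** attempt))
```
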

Installation

# Install dependencies
pip install -r requirements.txt

# Install Scrapling (requires Node.js)
./scripts/install_scrapling.sh

# Or install Scrapling manually
npm install -g @scrapinghub/scrapling

Usage

Basic Usage

from scripts.web_reader_pro import WebReaderPro

reader = WebReaderPro()
result = reader.fetch("https://example.com")
print(result['title'])
print(result['content'])

Advanced Configuration

reader = WebReaderPro(
    jina_api_key="your-jina-key",      # Optional: set via env JINA_API_KEY
    cache_ttl=3600,                      # Cache TTL in seconds (default: 3600)
    quality_threshold=200,               # Min word count for quality (default: 200)
    max_retries=3,                       # Max retries per tier (default: 3)
    enable_learning=True,                # Enable domain learning (default: True)
    scrapling_path="/usr/local/bin/scrapling"  # Path to scrapling binary
)

Result Format

{
    "title": "Page Title",
    "content": "Extracted content in markdown...",
    "url": "https://example.com",
    "tier_used": "jina|scrapling|webfetch",
    "quality_score": 85,
    "cached": False,
    "domain_learned_tier": "jina",
    "extracted_at": "2024-01-01T00:00:00Z"
}

Environment Variables

| Variable | Description | Default |
|---|---|---|
| `JINA_API_KEY` | Jina Reader API key | Required for Tier 1 |
| `WEB_READER_CACHE_DIR` | Cache directory path | `~/.openclaw/cache/web-reader-pro/` |
| `WEB_READER_LEARNING_DB` | Learning database path | `~/.openclaw/data/web-reader-pro/routes.json` |
| `WEB_READER_JINA_QUOTA` | Jina quota limit | 100000 |
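These variables might be read with their documented defaults along these lines. The variable names and defaults come from the table above; the `load_config` helper itself is an illustrative sketch, not the skill's actual loader.

```python
import os
from pathlib import Path

def load_config(env=os.environ):
    """Resolve configuration from environment variables with documented defaults."""
    return {
        "jina_api_key": env.get("JINA_API_KEY"),  # required for Tier 1
        "cache_dir": env.get(
            "WEB_READER_CACHE_DIR",
            str(Path.home() / ".openclaw/cache/web-reader-pro"),
        ),
        "learning_db": env.get(
            "WEB_READER_LEARNING_DB",
            str(Path.home() / ".openclaw/data/web-reader-pro/routes.json"),
        ),
        "jina_quota": int(env.get("WEB_READER_JINA_QUOTA", "100000")),
    }
```
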

API Reference

WebReaderPro.fetch(url, force_refresh=False)

Fetch and extract content from a URL.

Parameters:

  • url (str): Target URL
  • force_refresh (bool): Bypass cache if True

Returns: Dict with title, content, metadata

WebReaderPro.fetch_with_tier(url, preferred_tier)

Fetch using a specific tier (bypassing automatic selection).

Parameters:

  • url (str): Target URL
  • preferred_tier (str): "jina", "scrapling", or "webfetch"

WebReaderPro.get_jina_status()

Get current Jina API quota usage.

Returns: Dict with count, limit, percentage, warnings

WebReaderPro.clear_cache(url=None)

Clear cache for specific URL or all URLs.

Parameters:

  • url (str, optional): Specific URL to clear, or None for all

WebReaderPro.get_domain_routes()

Get learned domain-to-tier mappings.

Returns: Dict of domain -> preferred tier

Tier Comparison

| Tier | Speed | JS Rendering | Best For | Cost |
|---|---|---|---|---|
| Jina | Fast | No | Static pages, articles | API calls |
| Scrapling | Medium | Yes | SPAs, dynamic content | CPU |
| WebFetch | Fastest | No | Simple pages, fallbacks | Free |

License

MIT

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

axure-prototype-generator

Axure prototype code generator - outputs HTML code in JavaScript format, with support for loading interactive prototypes directly in inline frames.

Archived Source · Recently Updated
General

data-analysis-partner

Intelligent data analysis skill: takes a CSV/Excel file and an analysis request as input, and outputs a self-contained HTML analysis report with interactive ECharts charts.

Archived Source · Recently Updated
General

cs

Smart Summarizer (智能摘要助手): extracts the key points from long text. It triggers on the keywords `总结`, `摘要`, `summarize`, or `brief` (regex `/总结|摘要|summarize|brief/i`) and also on plain text longer than 100 characters; it outputs structured bullet-point summaries, matches the input language (Chinese/English, with mixed input handled automatically), filters out pleasantries and repetition, and uses a temperature of 0.3 for consistent output. Covered scenarios include email summaries (core items plus action points), meeting notes (topics, decisions, to-dos), article digests (key points, data, conclusions), and chat-log recaps (disputes, consensus, next steps). Install with `openclaw skills install smart-summarizer`.

Archived Source · Recently Updated
General

错敏信息检测 (Error & Sensitive Information Detection)

A skill for detecting errors and sensitive content in text.

Archived SourceRecently Updated