web-reader-pro

Advanced web content extraction skill for OpenClaw using multi-tier fallback strategy (Jina → Scrapling → WebFetch) with intelligent routing, caching, quality scoring, and domain learning. Use when: reading article content, extracting web page text, scraping dynamic JS-heavy pages, or fetching WeChat official account articles.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.


Install the skill with: npx skills add 0xcjl/web-reader-pro

Web Reader Pro - OpenClaw Skill

Overview

Web Reader Pro is an advanced web content extraction skill for OpenClaw that uses a multi-tier fallback strategy with intelligent routing, caching, and quality assessment.

Features

1. Three-Tier Fallback Strategy

  • Tier 1: Jina Reader API - Fast, reliable, best for most websites
  • Tier 2: Scrapling + Playwright - Dynamic content rendering for JS-heavy sites
  • Tier 3: WebFetch Fallback - Basic extraction for simple pages
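
The three tiers above can be sketched as a simple escalation loop (an illustrative sketch only, not the skill's actual implementation; the fetcher functions are hypothetical stand-ins, with Tier 1 simulated as failing):

```python
# Hypothetical tier fetchers; real ones would call Jina, Scrapling, WebFetch.
def fetch_jina(url):
    raise RuntimeError("quota exhausted")  # simulate a Tier 1 failure

def fetch_scrapling(url):
    return {"title": "Example", "content": "rendered page text"}

def fetch_webfetch(url):
    return {"title": "Example", "content": "raw page text"}

TIERS = [("jina", fetch_jina), ("scrapling", fetch_scrapling), ("webfetch", fetch_webfetch)]

def fetch_with_fallback(url):
    """Try each tier in order; a failure escalates to the next tier."""
    last_error = None
    for name, fetch in TIERS:
        try:
            result = fetch(url)
            result["tier_used"] = name  # record which tier produced the result
            return result
        except Exception as exc:
            last_error = exc
    raise RuntimeError(f"all tiers failed for {url}") from last_error
```

Here Tier 1 fails, so the loop escalates and Tier 2 produces the result.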

2. Jina Quota Monitoring

  • Tracks API call count with persistent counter
  • Warning alerts when approaching quota limits
  • Automatic fallback to lower-tier methods when quota exhausted
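
A persistent counter of this kind could look like the following (a minimal sketch; the function name and file format are assumptions, not the skill's actual code):

```python
import json
from pathlib import Path

QUOTA_LIMIT = 100_000  # matches the WEB_READER_JINA_QUOTA default below

def record_jina_call(counter_path: Path, limit: int = QUOTA_LIMIT) -> dict:
    """Increment a persistent call counter and report quota status."""
    count = 0
    if counter_path.exists():
        count = json.loads(counter_path.read_text()).get("count", 0)
    count += 1
    counter_path.write_text(json.dumps({"count": count}))
    pct = 100 * count / limit
    # Warn when approaching the limit so callers can fall back early.
    return {"count": count, "limit": limit, "percentage": pct, "warning": pct >= 90}
```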

3. Smart Cache Layer

  • Short-term caching (configurable TTL, default 1 hour)
  • Cache key based on URL hash
  • Reduces redundant API calls
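
A URL-hash cache with TTL expiry might be implemented roughly as follows (an illustrative sketch; the helper names and on-disk layout are assumptions):

```python
import hashlib
import json
import time
from pathlib import Path

def cache_key(url: str) -> str:
    # Hash the full URL, so any change in scheme, path, or query misses.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()

def cache_put(cache_dir: Path, url: str, result: dict) -> None:
    cache_dir.mkdir(parents=True, exist_ok=True)
    entry = {"stored_at": time.time(), "result": result}
    (cache_dir / f"{cache_key(url)}.json").write_text(json.dumps(entry))

def cache_get(cache_dir: Path, url: str, ttl: int = 3600):
    """Return the cached result, or None on a miss or expired entry."""
    path = cache_dir / f"{cache_key(url)}.json"
    if not path.exists():
        return None
    entry = json.loads(path.read_text())
    if time.time() - entry["stored_at"] > ttl:
        return None
    return entry["result"]
```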

4. Extraction Quality Scoring

  • Scores based on: word count, title detection, content density
  • Minimum quality threshold (default: 200 words + valid title)
  • Auto-escalation to next tier if quality below threshold
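
The scoring criteria above (word count plus a valid title) could be checked like this (a toy sketch under those assumptions, not the skill's real scoring function):

```python
def quality_score(title: str, content: str, min_words: int = 200) -> dict:
    """Score extraction quality from word count and title presence."""
    words = len(content.split())
    has_title = bool(title and title.strip())
    # Scale word count against the threshold into a 0-100 score.
    score = min(100, int(100 * words / max(min_words, 1)))
    passed = has_title and words >= min_words
    return {"score": score, "passed": passed, "words": words}
```

A result that fails this check would trigger escalation to the next tier.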

5. Domain-Level Routing Learning

  • Learns optimal extraction tier per domain
  • Persists learned routes in local JSON database
  • Adapts based on historical success rates
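
Persisting per-domain routes in a JSON file might look like this (a minimal sketch assuming a simple domain-to-tier map; the real database likely also tracks success rates):

```python
import json
from pathlib import Path
from urllib.parse import urlparse

def record_success(db_path: Path, url: str, tier: str) -> None:
    """Remember the tier that last succeeded for this URL's domain."""
    routes = json.loads(db_path.read_text()) if db_path.exists() else {}
    routes[urlparse(url).netloc] = tier
    db_path.write_text(json.dumps(routes))

def preferred_tier(db_path: Path, url: str, default: str = "jina") -> str:
    """Look up the learned tier for a domain, falling back to the default."""
    routes = json.loads(db_path.read_text()) if db_path.exists() else {}
    return routes.get(urlparse(url).netloc, default)
```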

6. Retry with Exponential Backoff

  • Configurable max retries per tier (default: 3)
  • Exponential backoff: 1s, 2s, 4s, 8s...
  • Respects rate limits and transient failures
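
The retry schedule above amounts to a standard exponential-backoff loop, sketched here for illustration (the function name is hypothetical):

```python
import time

def retry_with_backoff(fn, max_retries: int = 3, base_delay: float = 1.0):
    """Call fn, sleeping base_delay * 2**attempt between failures."""
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```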

Installation

# Install dependencies
pip install -r requirements.txt

# Install Scrapling and its browser dependencies
./scripts/install_scrapling.sh

# Or install Scrapling manually (it is a Python package)
pip install scrapling

Usage

Basic Usage

from scripts.web_reader_pro import WebReaderPro

reader = WebReaderPro()
result = reader.fetch("https://example.com")
print(result['title'])
print(result['content'])

Advanced Configuration

reader = WebReaderPro(
    jina_api_key="your-jina-key",      # Optional: set via env JINA_API_KEY
    cache_ttl=3600,                      # Cache TTL in seconds (default: 3600)
    quality_threshold=200,               # Min word count for quality (default: 200)
    max_retries=3,                       # Max retries per tier (default: 3)
    enable_learning=True,                # Enable domain learning (default: True)
    scrapling_path="/usr/local/bin/scrapling"  # Path to scrapling binary
)

Result Format

{
    "title": "Page Title",
    "content": "Extracted content in markdown...",
    "url": "https://example.com",
    "tier_used": "jina|scrapling|webfetch",
    "quality_score": 85,
    "cached": False,
    "domain_learned_tier": "jina",
    "extracted_at": "2024-01-01T00:00:00Z"
}

Environment Variables

| Variable | Description | Default |
|---|---|---|
| JINA_API_KEY | Jina Reader API key | Required for Tier 1 |
| WEB_READER_CACHE_DIR | Cache directory path | ~/.openclaw/cache/web-reader-pro/ |
| WEB_READER_LEARNING_DB | Learning database path | ~/.openclaw/data/web-reader-pro/routes.json |
| WEB_READER_JINA_QUOTA | Jina quota limit | 100000 |
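
For example, in a shell profile (values are placeholders; the variable names come from the table above):

```shell
export JINA_API_KEY="your-jina-key"
export WEB_READER_CACHE_DIR="$HOME/.openclaw/cache/web-reader-pro/"
export WEB_READER_JINA_QUOTA=100000
```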

API Reference

WebReaderPro.fetch(url, force_refresh=False)

Fetch and extract content from a URL.

Parameters:

  • url (str): Target URL
  • force_refresh (bool): Bypass cache if True

Returns: Dict with title, content, metadata

WebReaderPro.fetch_with_tier(url, preferred_tier)

Fetch using a specific tier (bypassing automatic selection).

Parameters:

  • url (str): Target URL
  • preferred_tier (str): "jina", "scrapling", or "webfetch"

WebReaderPro.get_jina_status()

Get current Jina API quota usage.

Returns: Dict with count, limit, percentage, warnings

WebReaderPro.clear_cache(url=None)

Clear cache for specific URL or all URLs.

Parameters:

  • url (str, optional): Specific URL to clear, or None for all

WebReaderPro.get_domain_routes()

Get learned domain-to-tier mappings.

Returns: Dict of domain -> preferred tier

Tier Comparison

| Tier | Speed | JS Rendering | Best For | Cost |
|---|---|---|---|---|
| Jina | Fast | No | Static pages, articles | API calls |
| Scrapling | Medium | Yes | SPAs, dynamic content | CPU |
| WebFetch | Fastest | No | Simple pages, fallbacks | Free |

License

MIT
