# Web Content Fetcher

Expert knowledge for fetching and parsing web content, handling size limitations, JavaScript-rendered sites, and extracting clean article content from HTML.
## Overview

This skill provides battle-tested strategies for fetching web content in Claude Code, addressing critical challenges:

- **WebFetch tool limitation**: ~50KB max content size
- **Read tool limitation**: 256KB max file size per call
- **JavaScript-rendered content**: Twitter/X, React SPAs require special handling
Four-tier approach:

- **Tier 1**: WebFetch for small content (< 50KB)
- **Tier 2**: curl + Task agent for any size static content ⭐ DEFAULT
- **Tier 3**: Scripts for edge cases (EUC-JP encoding, complex HTML)
- **Tier 4**: Playwright for JavaScript-rendered sites (Twitter/X, React SPAs)
## Quick Tool Limitations

| Tool | Max Size | Best For | Limitation |
|------|----------|----------|------------|
| WebFetch | ~50KB | Small articles, first attempt | Prompt length constraints |
| Read | 256KB | Reading fetched files | Large files need pagination |
| curl | Unlimited | Any size content, raw downloads | No parsing |
| Task agent | Unlimited | Extraction from any size HTML | Handles pagination automatically |
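The Read limit in the table above can be checked before choosing a path. A minimal sketch (the `how_to_read` helper and its labels are illustrative assumptions, not part of the skill's scripts):

```shell
#!/bin/sh
# Hypothetical helper: decide how to read an already-fetched HTML file.
# The 262144-byte threshold mirrors the 256KB Read tool limit above.
how_to_read() {
  size=$(wc -c < "$1" | tr -d ' ')   # byte count; tr strips BSD wc padding
  if [ "$size" -le 262144 ]; then
    echo "read-directly"             # fits in a single Read call
  else
    echo "task-agent"                # too large; delegate to a Task agent
  fi
}
```

For example, `how_to_read raw_html.html` would print `task-agent` for a 1MB page.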
## Tiered Fetching Strategy

### Tier 1: Small Content - WebFetch Direct

**Use when**: Simple articles, API responses, first attempt on unknown content

**Quick example**: WebFetch tool with url and extraction prompt

**If it fails**: "Prompt too long" error → Switch to Tier 2
### Tier 2: Any Size Content - curl + Task Agent ⭐ RECOMMENDED

**Use when**: News articles, blog posts, WebFetch fails, content of any size

**Quick workflow**:

1. Create directory:

   ```bash
   mkdir -p output/tasks/YYYYMMDD_taskname/original
   ```

2. Fetch HTML:

   ```bash
   curl -s "URL" > output/tasks/YYYYMMDD_taskname/original/raw_html.html
   ```

3. Extract with a Task agent prompt: "Read HTML at [path], extract article content, save to fetched_content.md"

**Why default**: Handles any size, AI-powered extraction, no script setup needed.

For detailed workflow: See references/workflows.md
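Steps 1-2 of the workflow above can be wrapped in one helper. This is a sketch, not one of the skill's scripts; the `fetch_raw` name, the extra curl flags, and the error message are assumptions:

```shell
#!/bin/sh
# Hypothetical wrapper: create the task directory, then fetch the raw HTML.
# -sS: silent but report errors; -f: fail on HTTP 4xx/5xx; -L: follow redirects.
fetch_raw() {
  url="$1"
  task="$2"
  dir="output/tasks/$(date +%Y%m%d)_${task}/original"
  mkdir -p "$dir" || return 1
  curl -sSfL "$url" -o "$dir/raw_html.html" || {
    echo "fetch failed for $url (try Tier 4 if the site needs JavaScript)" >&2
    return 1
  }
  echo "$dir/raw_html.html"   # print the saved path for the extraction step
}
```

Usage would look like `fetch_raw "https://example.com/article" article`, after which the printed path is handed to the Task agent prompt in step 3.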
### Tier 3: Script-Based Extraction - Edge Cases Only

**Use when**: Task agent struggles with complex HTML or special encoding requirements (EUC-JP)

**Available scripts**:

- `scripts/extract_article.js` - Standard HTML with Mozilla Readability
- `scripts/extract_eucjp.js` - Japanese EUC-JP encoded sites (4gamer.net)

**Quick example**:

```bash
node .claude/skills/web-content-fetcher/scripts/extract_article.js raw_html.html > fetched_content.md
```

**Note**: Try the Task agent (Tier 2) first; use scripts only when the Task agent explicitly fails.

For script details: See references/advanced-fetching.md
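One way to spot when the EUC-JP script is likely needed is to check the raw HTML for a declared charset. This is a heuristic sketch (the `needs_eucjp` name and pattern are assumptions, and a charset meta tag is not guaranteed to be present):

```shell
#!/bin/sh
# Hypothetical heuristic: look for an EUC-JP charset declaration near the top
# of the raw HTML to decide whether scripts/extract_eucjp.js is the better fit.
needs_eucjp() {
  head -c 2048 "$1" | grep -qi 'charset=[^>]*euc-jp'
}
```

Usage: `needs_eucjp raw_html.html && echo "use extract_eucjp.js"`.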
### Tier 4: JavaScript-Rendered Content - Playwright

**Use when**: Twitter/X, React SPAs, dynamic sites, or when a static fetch returns empty/error

**Quick example**:

```bash
node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \
  "https://x.com/user/status/123456789" \
  --output fetched_content.md
```

**Common sites**: Twitter/X, Threads, Instagram, React/Vue/Angular SPAs, AJAX-loaded content

**Image extraction**:

- Automatically extracts images if they exist (Twitter/X, Threads, blogs, etc.)
- Captures image URLs with alt text from the main content area
- Filters out small images (icons/logos) automatically
- Output includes a structured Media section with URLs for downloading

**Setup required (one-time)**:

```bash
cd .claude/skills/web-content-fetcher
pnpm add -D playwright --save-catalog-name=dev
pnpm exec playwright install chromium
```

For Playwright details: See references/advanced-fetching.md
## Decision Tree

Use this flowchart to choose the right approach:

```
Need to fetch web content?
│
├─ JavaScript-rendered site (Twitter/X, React SPA, dynamic content)?
│  └─ Use Playwright script (Tier 4)
│     └─ See: references/advanced-fetching.md
│
├─ Size unknown or small expected content?
│  └─ Try WebFetch (Tier 1)
│     ├─ Success? → Done ✓
│     └─ "Prompt too long" error? → Use Tier 2
│
├─ Any other case (medium/large content, WebFetch failed)?
│  └─ Use curl + Task agent (Tier 2) ⭐ DEFAULT
│     ├─ Success? → Done ✓
│     └─ Task agent struggles? → Use Tier 3
│        └─ See: references/advanced-fetching.md
│
└─ Edge cases (Task agent fails, special encoding)?
   └─ Use curl + script (Tier 3)
      └─ See: references/advanced-fetching.md
```
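The flowchart can also be sketched as a small dispatcher. The hostname list, tier labels, and `choose_tier` name are illustrative assumptions, not an exhaustive rule set:

```shell
#!/bin/sh
# Hypothetical dispatcher mirroring the decision tree: JS-heavy hosts go
# straight to Tier 4; otherwise pick Tier 1 or the Tier 2 default by size hint.
choose_tier() {
  url="$1"
  size_hint="${2:-unknown}"   # pass "small" if small content is expected
  case "$url" in
    *//x.com/*|*//twitter.com/*|*//www.threads.com/*|*//www.instagram.com/*)
      echo "tier4-playwright"
      return ;;
  esac
  if [ "$size_hint" = "small" ]; then
    echo "tier1-webfetch"     # try WebFetch first; fall back to Tier 2 on error
  else
    echo "tier2-curl-task"    # default path
  fi
}
```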
## Standard Workflow Quick Reference

Most common pattern (works for 90% of cases):

1. Setup:

   ```bash
   mkdir -p output/tasks/20260111_taskname/original
   ```

2. Fetch:

   ```bash
   curl -s "URL" > output/tasks/20260111_taskname/original/raw_html.html
   ```

3. Extract (Task agent prompt): "Read HTML file at [path], extract main article content, remove navigation/ads/sidebars/comments/footer, save clean markdown to: [path]/fetched_content.md"

4. Use:

   ```
   Read: output/tasks/20260111_taskname/original/fetched_content.md
   ```

For detailed workflows and patterns: See references/workflows.md
## Common Use Cases

**Translation with images**:

1. Fetch content (Tier 2, or Tier 4 for JS sites)
2. Extract to `original/fetched_content.md` in a readable format (includes Media section with image URLs)
3. Download images using URLs from the Media section
4. Pass to the /translate command

**Analysis**:

1. Fetch content (Tier 2 or Tier 4)
2. Extract to `original/fetched_content.md` in a readable format
3. Read and analyze

**Social media posts (Twitter/X, Threads, Instagram)**:

1. Use Playwright (Tier 4) directly; it automatically extracts text, images, and metadata
2. Output goes to `fetched_content.md` with a Media section in a readable format
3. Download images from extracted URLs
4. Use for translation/analysis

For all patterns: See references/workflows.md
## When Things Go Wrong

Common issues and quick fixes:

| Issue | Quick Fix | Details |
|-------|-----------|---------|
| WebFetch "Prompt too long" | Switch to Tier 2 | troubleshooting.md |
| Read tool file too large | Use Task agent | troubleshooting.md |
| Garbled Japanese text | EUC-JP encoding issue | troubleshooting.md |
| JavaScript required | Use Playwright (Tier 4) | troubleshooting.md |
| Anti-bot protection | Add user agent | troubleshooting.md |
| Authentication required | Use curl with headers/cookies | troubleshooting.md |

For the complete troubleshooting guide: See references/troubleshooting.md
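The anti-bot and authentication rows above usually translate into extra curl flags. A hedged sketch (the `fetch_with_headers` name, the UA string, and the cookie/token hints are placeholders you would substitute for the real site):

```shell
#!/bin/sh
# Hypothetical fetch with a browser-like user agent to reduce anti-bot blocks.
# -A sets the User-Agent header; -H adds an arbitrary request header.
fetch_with_headers() {
  url="$1"
  out="$2"
  curl -sL "$url" -o "$out" \
    -A "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36" \
    -H "Accept-Language: en-US,en;q=0.9"
  # For authenticated pages, also pass e.g.:
  #   -b cookies.txt                        (cookie jar exported from a browser)
  #   -H "Authorization: Bearer <token>"    (API-style auth)
}
```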
## Best Practices

- **Always use task directories**: `output/tasks/YYYYMMDD_taskname/`
- **Default to Tier 2**: curl + Task agent works for nearly all cases
- **Keep raw HTML**: Save to `original/raw_html.html` for reference (except Tier 4)
- **Use Playwright for JS sites**: Twitter/X, React SPAs, dynamic content
- **Clean content format**: Always save extracted content as markdown
- **Descriptive naming**: Use a date prefix (`YYYYMMDD_`) and descriptive task names
- **Try simple first**: WebFetch → curl + Task agent → scripts (only if needed)
- **Task agent for extraction**: Let AI handle pagination and parsing complexity
## Reference Files

This skill uses progressive disclosure for efficiency. Core information is in this file; detailed guides are in reference files:

- `references/workflows.md` - Standard workflows, common patterns, directory structure, content extraction priorities
- `references/advanced-fetching.md` - Tier 3 script-based extraction, Tier 4 Playwright details, setup instructions
- `references/troubleshooting.md` - Common issues, solutions, quick reference commands

Read reference files when you need detailed guidance for specific scenarios.
## Quick Start Examples

**Simple article**:

```bash
mkdir -p output/tasks/20260111_article/original
curl -s "https://example.com/article" > output/tasks/20260111_article/original/raw_html.html
```

Task agent: Extract to fetched_content.md

**Twitter/X post**:

```bash
mkdir -p output/tasks/20260111_twitter/original
node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \
  "https://x.com/user/status/123" \
  --output output/tasks/20260111_twitter/original/fetched_content.md
```

**Threads post**:

```bash
mkdir -p output/tasks/20260111_threads/original
node .claude/skills/web-content-fetcher/scripts/fetch_js_content.js \
  "https://www.threads.com/@user/post/ABC123" \
  --output output/tasks/20260111_threads/original/fetched_content.md
```

**Output format with images (works for all sites)**:

```markdown
# Title

Main content text here...

## Media

- Image description or alt text
- Another image
```
**Note**:

- Images are automatically extracted from the main content area
- Small images (< 100x100px) like icons/logos are filtered out
- Image formats (`.jpg`, `.png`, `.webp`, `.gif`, `.svg`) are all preserved in URLs
- When downloading, respect the original file extension
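Respecting the original extension when downloading can be sketched as follows; the `download_image` name and target directory are illustrative assumptions, and query strings are stripped so the saved filename keeps its extension:

```shell
#!/bin/sh
# Hypothetical downloader for URLs listed in a Media section.
download_image() {
  url="$1"
  dir="$2"
  name="${url##*/}"    # basename of the URL path (e.g. photo.webp?name=large)
  name=${name%%\?*}    # drop any query string so .webp stays the extension
  mkdir -p "$dir"
  curl -sL "$url" -o "$dir/$name"
}
```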
**Japanese site (potential encoding issue)**:

```bash
mkdir -p output/tasks/20260111_japanese/original
curl -s "https://4gamer.net/..." > output/tasks/20260111_japanese/original/raw_html.html
```