Web Fetch

Fetch web content and convert to clean Markdown and PDF formats. Supports general websites and WeChat (微信公众号) articles.

Features

Automatic noise removal (navigation, headers, footers, sidebars)
Image preservation with alt text
WeChat article special handling (lazy-loaded images, metadata extraction)
Clean Markdown output ready for translation or processing
PDF conversion with clean reading style
CJK font support for Chinese content
Markdown output, with PDF conversion on request

Dependencies

Core dependencies

pip install crawl4ai requests beautifulsoup4 markdownify

WeChat article fetching

pip install playwright && playwright install chromium

PDF conversion with CJK font support

pip install reportlab markdown beautifulsoup4

Note: reportlab provides excellent CJK font support and works on Windows/Mac/Linux without system dependencies.

Usage

General Web Pages

For most websites, use the crawl4ai-based fetcher:

python scripts/fetch_web_content.py <url> <output_filename>

Example:

python scripts/fetch_web_content.py https://example.com/article article.md

WeChat Articles (微信公众号)

For WeChat articles, use the Playwright-based fetcher with anti-bot bypass:

python scripts/fetch_weixin.py <url> [output_filename]

Examples:

Auto-generate filename (YYYYMMDD+Title format)

python scripts/fetch_weixin.py "https://mp.weixin.qq.com/s/xxxxx"

Custom filename

python scripts/fetch_weixin.py "https://mp.weixin.qq.com/s/xxxxx" article.md

Features:

Uses real Chromium browser to bypass anti-bot protections
Handles lazy-loaded images automatically
Auto-generates filename from publish date + title (YYYYMMDD format)
Supports both visible browser (for debugging) and headless mode
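
The YYYYMMDD+Title naming scheme can be sketched as follows. This is a minimal illustration, not the code inside fetch_weixin.py: the function name `build_filename` and the character-sanitizing rule are assumptions made for the example.

```python
import re
from datetime import date


def build_filename(publish_date: date, title: str) -> str:
    """Build a 'YYYYMMDD+Title.md' file name (hypothetical helper)."""
    stamp = publish_date.strftime("%Y%m%d")
    # Assumption: collapse whitespace and characters that are invalid in
    # file names on common platforms into a single underscore.
    safe_title = re.sub(r'[\\/:*?"<>|\s]+', "_", title).strip("_")
    return f"{stamp}{safe_title}.md"


if __name__ == "__main__":
    print(build_filename(date(2025, 12, 14), "关于财政政策和货币政策的关系"))
```

CJK characters pass through unchanged, so a title in Chinese yields a file name like the one used in the WeChat workflow example.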

Convert Markdown to PDF

After fetching content to Markdown, convert to PDF:

python scripts/md_to_pdf.py <markdown_file> [--output output.pdf]

Examples:

Convert single file to PDF (auto-generates output name)

python scripts/md_to_pdf.py article.md

Convert with custom output name

python scripts/md_to_pdf.py article.md --output custom_name.pdf

Batch convert entire directory

python scripts/md_to_pdf.py ./articles_folder --concurrency 4

Features:

Excellent Chinese (CJK) font support using Microsoft YaHei
Image rendering support (HTTP/HTTPS URLs and local paths)
Automatic image scaling with aspect ratio preservation
Both single file and batch directory conversion
Clean, readable typography optimized for Chinese content

Response Pattern (Updated)

When user requests web content fetching:

Identify URL type:

WeChat URL (mp.weixin.qq.com) → use fetch_weixin.py
Other URLs → use fetch_web_content.py
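
The routing rule above can be expressed as a small helper. This is a sketch under stated assumptions: the function name `choose_fetcher` is hypothetical, and only an exact match on the mp.weixin.qq.com host is treated as a WeChat article.

```python
from urllib.parse import urlparse

WEIXIN_HOST = "mp.weixin.qq.com"


def choose_fetcher(url: str) -> str:
    """Return the script path to run for the given URL (hypothetical helper)."""
    # hostname is None when the URL has no scheme; fall back to the general fetcher.
    host = (urlparse(url).hostname or "").lower()
    if host == WEIXIN_HOST:
        return "scripts/fetch_weixin.py"
    return "scripts/fetch_web_content.py"


if __name__ == "__main__":
    print(choose_fetcher("https://mp.weixin.qq.com/s/xxxxx"))
    print(choose_fetcher("https://example.com/article"))
```

Matching on the parsed host rather than a substring of the whole URL avoids misrouting pages that merely mention the WeChat domain in their path or query string.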

Determine output format:

User mentions "PDF" explicitly → MD + PDF
User says "only MD"/"no PDF"/"markdown only" → MD only
Ambiguous request → Ask: "Would you like PDF format as well?"
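
The three rules above can be approximated with simple keyword matching. The sketch below is illustrative only: the function name `decide_output`, the return labels, and the phrase list are assumptions, and real requests may need more careful interpretation than substring checks.

```python
# Phrases that mean "Markdown only"; checked first because "no PDF"
# also contains the word "pdf".
MD_ONLY_PHRASES = ("only md", "no pdf", "markdown only", "只要markdown")


def decide_output(request: str) -> str:
    """Classify a request as 'md+pdf', 'md', or 'ask' (hypothetical helper)."""
    text = request.lower()
    if any(phrase in text for phrase in MD_ONLY_PHRASES):
        return "md"
    if "pdf" in text:
        return "md+pdf"
    # No format mentioned: ask the user whether they also want a PDF.
    return "ask"


if __name__ == "__main__":
    for sample in ("Fetch as PDF", "Get markdown only", "Fetch this article"):
        print(sample, "->", decide_output(sample))
```

Checking the Markdown-only phrases before the PDF keyword matters, since a request such as "no PDF" would otherwise be classified as a PDF request.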

Detection examples:

"Fetch as PDF" / "转换为PDF" → MD + PDF
"Save to PDF" → MD + PDF
"Get markdown only" / "只要markdown" → MD only
"Fetch this article" → Ask user
"抓取网页内容" → Ask user

Execute fetching:

python scripts/fetch_web_content.py <url> <output>.md

or

python scripts/fetch_weixin.py <url> [output_filename]

Note: For WeChat articles, the output filename is optional; if omitted, it is auto-generated as YYYYMMDD+Title.

Convert to PDF (if requested):

python scripts/md_to_pdf.py <output>.md

This creates <output>.pdf alongside <output>.md
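
Because the PDF is written next to the Markdown file with the same base name, the expected output paths can be derived before reporting results. A minimal sketch, assuming md_to_pdf.py only swaps the extension as described above; the helper names `pdf_path_for` and `missing_outputs` are hypothetical.

```python
from pathlib import Path


def pdf_path_for(markdown_file: str) -> Path:
    """Return the PDF path expected alongside the given Markdown file."""
    return Path(markdown_file).with_suffix(".pdf")


def missing_outputs(markdown_file: str, want_pdf: bool) -> list[Path]:
    """List expected output files that do not exist on disk."""
    expected = [Path(markdown_file)]
    if want_pdf:
        expected.append(pdf_path_for(markdown_file))
    return [path for path in expected if not path.exists()]


if __name__ == "__main__":
    print(pdf_path_for("article.md"))
    print(missing_outputs("article.md", want_pdf=True))
```

An empty list from `missing_outputs` means every requested file was written, which is a convenient check before confirming the save to the user.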

Report results:

Confirm both files saved (if PDF)
Show statistics for both formats
Suggest next steps

Example Workflows

Workflow 1: Fetch with PDF (Explicit Request)

User: "Fetch this article as PDF: https://example.com/article"

Step 1: Fetch markdown

python scripts/fetch_web_content.py https://example.com/article article.md

Step 2: Convert to PDF

python scripts/md_to_pdf.py article.md

Result:

✓ Saved: article.md (45 KB, 8,234 words)

✓ PDF: article.pdf (with images embedded)

Workflow 2: Fetch Markdown Only

User: "Get the markdown only"

Step 1: Fetch markdown

python scripts/fetch_web_content.py https://example.com/article article.md

Step 2: Skip PDF conversion

Result:

✓ Saved: article.md (45 KB, 8,234 words)

Workflow 3: Ambiguous Request

User: "Fetch this article: https://example.com/article"

Claude asks: "I'll fetch this article. Would you like me to convert it to PDF as well?"

User: "Yes"

Then proceed with Workflow 1

Workflow 4: WeChat Article with PDF

User: "抓取微信文章为PDF" ("Fetch the WeChat article as PDF")

Step 1: Fetch markdown (auto-generates filename as YYYYMMDD+Title)

python scripts/fetch_weixin.py "https://mp.weixin.qq.com/s/xxxxx"

Step 2: Convert to PDF (use the auto-generated filename)

python scripts/md_to_pdf.py 20251214关于财政政策和货币政策的关系.md

Result:

✓ Saved: 20251214关于财政政策和货币政策的关系.md (Chinese content)

✓ PDF: 20251214关于财政政策和货币政策的关系.pdf (Chinese text and images rendered correctly)

Batch Processing

For multiple URLs, loop through and fetch each:

for url in url1 url2 url3; do
  filename="output_$(date +%s)"
  python scripts/fetch_web_content.py "$url" "$filename.md"
  python scripts/md_to_pdf.py "$filename.md"  # Optional: add PDF
done
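
The same loop can be driven from Python. One caveat with the shell version is that names based on `date +%s` can collide when two fetches start within the same second. The sketch below uses index-based names instead, which is an assumption of this example rather than something the scripts require; `plan_batch` and `run_batch` are hypothetical helpers.

```python
import subprocess
import sys


def plan_batch(urls, with_pdf=True, prefix="output"):
    """Return the list of commands to run, one argv list per step."""
    commands = []
    for index, url in enumerate(urls, start=1):
        # Index-based names stay unique even when fetches start in the same second.
        md_name = f"{prefix}_{index:03d}.md"
        commands.append([sys.executable, "scripts/fetch_web_content.py", url, md_name])
        if with_pdf:
            commands.append([sys.executable, "scripts/md_to_pdf.py", md_name])
    return commands


def run_batch(urls, with_pdf=True):
    """Run each planned command in order, stopping at the first failure."""
    for command in plan_batch(urls, with_pdf):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    for command in plan_batch(["https://example.com/a", "https://example.com/b"]):
        print(" ".join(command))
```

Separating planning from execution makes it possible to inspect or log the commands before anything is fetched.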

Troubleshooting

Issue | Solution
Empty content | Try a different CSS selector, or use the WeChat Playwright fetcher
Missing images | Check whether the site blocks external requests
Encoding issues | Content is saved as UTF-8 by default
WeChat blocked | Use the Playwright fetcher; it launches a real browser to bypass anti-bot protections
WeChat timeout | The script has a 60s timeout with retry; it usually succeeds on the second attempt
Playwright not installed | Run: pip install playwright && playwright install chromium
PDF conversion failed | Install dependencies: pip install reportlab markdown beautifulsoup4
Chinese characters in PDF | The Microsoft YaHei font is used automatically for CJK text
Images missing in PDF | Check that image URLs are accessible and local image paths are correct
PDF too large | Images are embedded and scaled; original image size affects PDF size
