web-fetch

Fetch web content and convert it to clean Markdown and PDF. Supports general websites and WeChat Official Account (微信公众号) articles.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "web-fetch" with this command: npx skills add jst-well-dan/skill-box/jst-well-dan-skill-box-web-fetch

Features

  • Automatic noise removal (navigation, headers, footers, sidebars)

  • Image preservation with alt text

  • WeChat article special handling (lazy-loaded images, metadata extraction)

  • Clean Markdown output ready for translation or processing

  • PDF conversion with clean reading style

  • CJK font support for Chinese content

  • Both MD and PDF output by default
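To make the noise-removal idea concrete, here is a minimal sketch that drops text inside navigation, header, footer, and sidebar elements. It uses only the standard library's `html.parser` and is not the crawl4ai-based logic the skill's scripts actually run; the names `NoiseStripper`, `NOISE_TAGS`, and `visible_text` are hypothetical.

```python
from html.parser import HTMLParser

# Assumption: these four elements are treated as "noise" for illustration.
NOISE_TAGS = {"nav", "header", "footer", "aside"}


class NoiseStripper(HTMLParser):
    """Collect text that is not nested inside any noise element."""

    def __init__(self):
        super().__init__()
        self.depth = 0    # number of noise elements currently open
        self.chunks = []  # kept text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth == 0 and text:
            self.chunks.append(text)


def visible_text(html: str) -> str:
    """Return the non-noise text of an HTML string, one fragment per line."""
    parser = NoiseStripper()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A real fetcher would also need to handle scripts, styles, and site-specific containers; this sketch only shows the filtering principle.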

Dependencies

Core dependencies

pip install crawl4ai requests beautifulsoup4 markdownify

WeChat article fetching

pip install playwright
playwright install chromium

PDF conversion with CJK font support

pip install reportlab markdown beautifulsoup4

Note: reportlab provides excellent CJK font support and works on Windows/Mac/Linux without system dependencies.

Usage

General Web Pages

For most websites, use the crawl4ai-based fetcher:

python scripts/fetch_web_content.py <url> <output_filename>

Example:

python scripts/fetch_web_content.py https://example.com/article article.md

WeChat Articles (微信公众号)

For WeChat articles, use the Playwright-based fetcher with anti-bot bypass:

python scripts/fetch_weixin.py <url> [output_filename]

Examples:

Auto-generate filename (YYYYMMDD+Title format)

python scripts/fetch_weixin.py "https://mp.weixin.qq.com/s/xxxxx"

Custom filename

python scripts/fetch_weixin.py "https://mp.weixin.qq.com/s/xxxxx" article.md

Features:

  • Uses real Chromium browser to bypass anti-bot protections

  • Handles lazy-loaded images automatically

  • Auto-generates filename from publish date + title (YYYYMMDD format)

  • Supports both visible browser (for debugging) and headless mode
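The YYYYMMDD+Title naming can be sketched as follows. This is an illustration only, not the code in `fetch_weixin.py`: the function name `wechat_filename` is hypothetical, and replacing characters that are invalid in Windows filenames with `_` is an assumption about how a title might be sanitized.

```python
import re
from datetime import date


def wechat_filename(publish_date: date, title: str) -> str:
    """Build '<YYYYMMDD><title>.md' from a publish date and an article title."""
    # Assumption: characters that Windows forbids in filenames become "_".
    safe_title = re.sub(r'[\\/:*?"<>|]', "_", title).strip()
    return f"{publish_date:%Y%m%d}{safe_title}.md"
```

For example, `wechat_filename(date(2025, 12, 14), "关于财政政策和货币政策的关系")` yields the filename used in the WeChat workflow of this document.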

Convert Markdown to PDF

After fetching content to Markdown, convert to PDF:

python scripts/md_to_pdf.py <markdown_file> [--output output.pdf]

Examples:

Convert single file to PDF (auto-generates output name)

python scripts/md_to_pdf.py article.md

Convert with custom output name

python scripts/md_to_pdf.py article.md --output custom_name.pdf

Batch convert entire directory

python scripts/md_to_pdf.py ./articles_folder --concurrency 4

Features:

  • Excellent Chinese (CJK) font support using Microsoft YaHei

  • Image rendering support (HTTP/HTTPS URLs and local paths)

  • Automatic image scaling with aspect ratio preservation

  • Both single file and batch directory conversion

  • Clean, readable typography optimized for Chinese content
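Scaling an image while preserving its aspect ratio comes down to applying one scale factor to both dimensions. The sketch below shows that calculation under stated assumptions: `fit_image` is a hypothetical helper, not a function from `md_to_pdf.py`, and the frame size in the usage note is made up for illustration.

```python
def fit_image(width: float, height: float,
              max_width: float, max_height: float) -> tuple:
    """Shrink (width, height) to fit the frame, keeping the aspect ratio.

    Images that already fit are returned unchanged (never scaled up).
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # The tighter of the two constraints decides the scale factor.
    scale = min(max_width / width, max_height / height, 1.0)
    return (width * scale, height * scale)
```

With a hypothetical 450 x 650 point frame, a 1200 x 800 image is limited by width and becomes 450 x 300, so the 3:2 ratio is kept.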

Response Pattern (Updated)

When user requests web content fetching:

Identify URL type:

  • WeChat URL (mp.weixin.qq.com) → use fetch_weixin.py

  • Other URLs → use fetch_web_content.py
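The URL-type check above can be sketched as a small routing function. The script paths come from this document; the function name `pick_fetcher` is hypothetical, and comparing the parsed hostname (rather than searching the whole URL string) is a design assumption.

```python
from urllib.parse import urlparse

WECHAT_HOST = "mp.weixin.qq.com"


def pick_fetcher(url: str) -> str:
    """Return the fetch script to use for the given URL."""
    # Compare the hostname only, so a WeChat domain appearing in a
    # query string of another site does not trigger the WeChat fetcher.
    host = (urlparse(url).hostname or "").lower()
    if host == WECHAT_HOST:
        return "scripts/fetch_weixin.py"
    return "scripts/fetch_web_content.py"
```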

Determine output format:

  • User mentions "PDF" explicitly → MD + PDF

  • User says "only MD"/"no PDF"/"markdown only" → MD only

  • Ambiguous request → Ask: "Would you like PDF format as well?"

Detection examples:

  • "Fetch as PDF" / "转换为PDF" ("convert to PDF") → MD + PDF

  • "Save to PDF" → MD + PDF

  • "Get markdown only" / "只要markdown" ("markdown only") → MD only

  • "Fetch this article" → Ask user

  • "抓取网页内容" ("fetch web page content") → Ask user
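The detection rules can be approximated with simple keyword matching. This is a rough sketch, not how an assistant is required to decide: the function name `detect_output_format`, the return labels, and the exact phrase list are assumptions drawn from the examples above.

```python
def detect_output_format(request: str) -> str:
    """Classify a user request as 'md', 'md+pdf', or 'ask'."""
    text = request.lower()
    # Check the opt-out phrases first: "no PDF" also contains "pdf".
    md_only_phrases = ("only md", "no pdf", "markdown only", "只要markdown")
    if any(phrase in text for phrase in md_only_phrases):
        return "md"
    if "pdf" in text:
        return "md+pdf"
    # Nothing decisive was said, so the user should be asked.
    return "ask"
```

Plain substring matching is deliberately naive: a URL ending in `.pdf` would count as a PDF request, so the URL should be removed from the text before it is classified.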

Execute fetching:

python scripts/fetch_web_content.py <url> <output>.md

or

python scripts/fetch_weixin.py <url> [output].md

Note: For WeChat articles, the output filename is optional; if omitted, it is auto-generated as YYYYMMDD+Title.

Convert to PDF (if requested):

python scripts/md_to_pdf.py <output>.md

This creates <output>.pdf alongside <output>.md

Report results:

  • Confirm both files saved (if PDF)

  • Show statistics for both formats

  • Suggest next steps
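Put together, the fetch and convert steps for a single URL can be sketched as a list of commands to run in order. The script names and argument order come from this document; `build_commands` is a hypothetical helper, and it always passes an explicit output name, so it does not exercise the WeChat auto-naming path.

```python
from urllib.parse import urlparse


def build_commands(url: str, stem: str, want_pdf: bool) -> list:
    """Return the argv lists to run, in order, for one URL."""
    host = (urlparse(url).hostname or "").lower()
    is_wechat = host == "mp.weixin.qq.com"
    fetcher = "scripts/fetch_weixin.py" if is_wechat else "scripts/fetch_web_content.py"
    md_path = f"{stem}.md"
    commands = [["python", fetcher, url, md_path]]
    if want_pdf:
        # md_to_pdf.py writes <stem>.pdf alongside <stem>.md
        commands.append(["python", "scripts/md_to_pdf.py", md_path])
    return commands
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` so that a failed fetch stops the sequence before the PDF step.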

Example Workflows

Workflow 1: Fetch with PDF (Explicit Request)

User: "Fetch this article as PDF: https://example.com/article"

Step 1: Fetch markdown

python scripts/fetch_web_content.py https://example.com/article article.md

Step 2: Convert to PDF

python scripts/md_to_pdf.py article.md

Result:

✓ Saved: article.md (45 KB, 8,234 words)

✓ PDF: article.pdf (with images embedded)

Workflow 2: Fetch Markdown Only

User: "Get the markdown only"

Step 1: Fetch markdown

python scripts/fetch_web_content.py https://example.com/article article.md

Step 2: Skip PDF conversion

Result:

✓ Saved: article.md (45 KB, 8,234 words)

Workflow 3: Ambiguous Request

User: "Fetch this article: https://example.com/article"

Claude asks: "I'll fetch this article. Would you like me to convert it to PDF as well?"

User: "Yes"

Then proceed with Workflow 1

Workflow 4: WeChat Article with PDF

User: "抓取微信文章为PDF" ("Fetch the WeChat article as PDF")

Step 1: Fetch markdown (auto-generates filename as YYYYMMDD+Title)

python scripts/fetch_weixin.py "https://mp.weixin.qq.com/s/xxxxx"

Step 2: Convert to PDF (use the auto-generated filename)

python scripts/md_to_pdf.py 20251214关于财政政策和货币政策的关系.md

Result:

✓ Saved: 20251214关于财政政策和货币政策的关系.md (Chinese content)

✓ PDF: 20251214关于财政政策和货币政策的关系.pdf (Chinese text and images fully supported)

Batch Processing

For multiple URLs, loop through and fetch each:

for url in url1 url2 url3; do
  filename="output_$(date +%s)"
  python scripts/fetch_web_content.py "$url" "$filename.md"
  python scripts/md_to_pdf.py "$filename.md"  # Optional: add PDF
done
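The same loop can be written in Python. One design point: `date +%s` has one-second resolution, so two iterations that finish within the same second would get the same filename. The sketch below numbers the outputs instead. The names `plan_batch` and `run_batch` and the `output_001.md` naming scheme are hypothetical.

```python
import subprocess


def plan_batch(urls, prefix="output"):
    """Pair each URL with a unique, zero-padded Markdown filename."""
    width = max(3, len(str(len(urls))))
    return [
        (url, f"{prefix}_{index:0{width}d}.md")
        for index, url in enumerate(urls, start=1)
    ]


def run_batch(urls, want_pdf=True):
    """Fetch every URL in turn, then optionally convert each result to PDF."""
    for url, md_name in plan_batch(urls):
        subprocess.run(
            ["python", "scripts/fetch_web_content.py", url, md_name], check=True
        )
        if want_pdf:
            subprocess.run(["python", "scripts/md_to_pdf.py", md_name], check=True)
```

Because `check=True` raises on the first failing command, a caller that prefers to continue past bad URLs would wrap each iteration in its own `try`/`except subprocess.CalledProcessError`.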

Troubleshooting

Common issues and their solutions:

  • Empty content: Try a different CSS selector, or use the WeChat Playwright fetcher

  • Missing images: Check whether the site blocks external requests

  • Encoding issues: Content is saved as UTF-8 by default

  • WeChat blocked: Use the Playwright fetcher; it launches a real browser to bypass anti-bot protections

  • WeChat timeout: The script has a 60s timeout with retry and usually succeeds on the second attempt

  • Playwright not installed: Run pip install playwright && playwright install chromium

  • PDF conversion failed: Install dependencies with pip install reportlab markdown beautifulsoup4

  • Chinese characters in PDF: Microsoft YaHei font is automatically used (excellent CJK support)

  • Images missing in PDF: Check that image URLs are accessible or local image paths are correct

  • PDF too large: Images are embedded and scaled; original image size affects PDF size
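The timeout-with-retry behaviour mentioned for WeChat fetching follows a common pattern, sketched below. This is not the code inside `fetch_weixin.py`: the helper name `with_retry`, the default of two attempts, and retrying only on the built-in `TimeoutError` are assumptions. Playwright and `subprocess` raise their own timeout exception types, which would need to be passed in via `retry_on`.

```python
import time


def with_retry(action, attempts=2, delay_seconds=0.0, retry_on=(TimeoutError,)):
    """Call action(); on a timeout, wait and try again up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on as exc:
            last_error = exc
            # Pause between attempts, but not after the final one.
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error
```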
