arcfetch
Guide for using the arcfetch CLI to fetch web content, extract articles as clean markdown, and manage cached references.
Overview
arcfetch converts web pages to clean markdown with:
- Automatic Playwright fallback when simple HTTP fetch produces low-quality results
- Quality scoring (0-100) with boilerplate/error page/login wall/paywall detection
- Content-to-source ratio analysis to catch JS-rendered or gated content
- Anti-bot detection measures (stealth plugin, viewport/timezone rotation, realistic headers)
Installation
# Use via bunx (no install needed)
bunx arcfetch <command>
# Or install globally
bun install -g arcfetch
Commands
fetch
Fetch a URL and save extracted markdown to the temp folder.
arcfetch fetch <url> [options]
Options:
-q, --query <text>- Search query (saved as metadata)-o, --output <format>- Output format:text- Plain text, LLM-friendly (default)json- Structured JSONpath- Just the filepathsummary- slug|filepath format
--pretty- Human-friendly output with emojis-v, --verbose- Show detailed output (quality scores, pipeline decisions)--min-quality <n>- Minimum quality score 0-100 (default: 60)--temp-dir <path>- Temp folder (default: .tmp/arcfetch)--docs-dir <path>- Docs folder (default: docs/ai/references)--wait-strategy <mode>- Playwright wait: networkidle (default), domcontentloaded, load--force-playwright- Skip simple fetch, use Playwright directly--refetch- Re-fetch even if URL already cached
Examples:
# Basic fetch (plain text output for LLMs)
arcfetch fetch https://example.com/article
# Get just the filepath
arcfetch fetch https://example.com -o path
# Human-friendly output
arcfetch fetch https://example.com --pretty
# JSON output for scripting
arcfetch fetch https://example.com -o json
# With search query metadata
arcfetch fetch https://example.com -q "search term"
# Verbose mode to see pipeline decisions
arcfetch fetch https://example.com -v
# Force Playwright for JS-heavy sites
arcfetch fetch https://example.com --force-playwright
Quality Pipeline:
- Simple HTTP fetch with browser-like User-Agent
- Extract content with Readability + Turndown
- Quality score (0-100) with boilerplate detection
- Score >= 85: accept as-is
- Score 60-84: try Playwright, use whichever scores higher
- Score < 60: require Playwright, fail if still below threshold
list
List all cached references.
arcfetch list [options]
Options:
-o, --output <format>- Output format: text, json--pretty- Human-friendly output
arcfetch list --pretty
arcfetch list -o json
links
Extract all links from a cached reference.
arcfetch links <ref-id> [options]
Options:
-o, --output <format>- Output format: text, json--pretty- Human-friendly output
arcfetch links my-article --pretty
arcfetch links my-article -o json
fetch-links
Fetch all links from a cached reference (parallel, max 5 concurrent).
arcfetch fetch-links <ref-id> [options]
Options:
--refetch- Force re-fetch even if already cached-o, --output <format>- Output format: text, json--pretty- Human-friendly output
arcfetch fetch-links my-article --pretty
promote
Move reference from temp to permanent docs folder.
arcfetch promote <ref-id> [options]
arcfetch promote my-article --pretty
arcfetch promote my-article -o json
delete
Delete a cached reference.
arcfetch delete <ref-id> [options]
arcfetch delete my-article --pretty
config
Show current configuration.
arcfetch config
mcp
Start the MCP server (for Claude Code integration).
arcfetch mcp
Workflow Patterns
Single Article
arcfetch fetch https://example.com/guide --pretty
cat .tmp/arcfetch/example-guide.md # Review content
arcfetch promote example-guide # Move to docs if good
Batch Fetch
for url in "url1" "url2" "url3"; do
arcfetch fetch "$url" --pretty
done
arcfetch list --pretty # Review all
arcfetch promote my-article # Promote desired ones
Fetch All Links from a Page
arcfetch fetch https://example.com/resources --pretty
arcfetch links resources --pretty # See what links exist
arcfetch fetch-links resources --pretty # Fetch them all
Scripting with JSON Output
RESULT=$(arcfetch fetch https://example.com -o json)
REF_ID=$(echo "$RESULT" | jq -r '.refId')
QUALITY=$(echo "$RESULT" | jq -r '.quality')
if (( QUALITY >= 85 )); then
arcfetch promote "$REF_ID"
fi
Cleanup
arcfetch list --pretty
arcfetch delete unwanted-ref
Configuration
Priority Order
- CLI arguments
- Environment variables
arcfetch.config.json.arcfetchrc/.arcfetchrc.json- Built-in defaults
Config File (arcfetch.config.json)
{
"quality": {
"minScore": 60,
"jsRetryThreshold": 85
},
"paths": {
"tempDir": ".tmp/arcfetch",
"docsDir": "docs/ai/references"
},
"playwright": {
"timeout": 30000,
"waitStrategy": "networkidle"
}
}
Environment Variables
export SOFETCH_MIN_SCORE=60
export SOFETCH_TEMP_DIR=".tmp/arcfetch"
export SOFETCH_DOCS_DIR="docs/ai/references"
Quality Scoring
Score starts at 100 and deductions apply:
| Check | Deduction | Severity |
|---|---|---|
| Blank content | Score = 0 | Issue |
| Content < 50 chars | -50 | Issue |
| Content < 300 chars | -15 | Warning |
| HTML tags > 100 | -40 | Issue |
| HTML tags > 50 | -20 | Warning |
| HTML tags > 10 | -5 | Warning |
| HTML ratio > 30% | -25 | Issue |
| HTML ratio > 15% | -10 | Warning |
| Table tags > 50 | -30 | Issue |
| Script tags present | -15 | Warning |
| Style tags present | -10 | Warning |
| Extraction ratio < 0.5% (from large page) | -35 | Issue |
| Extraction ratio < 2% (from large page) | -20 | Warning |
| Boilerplate detected (error/login/paywall) | -40 | Issue |
| Excessive newlines | -5 | Warning |
Boilerplate patterns detected (on short content < 2000 chars):
- Error pages: "something went wrong", "an error occurred", "unexpected error"
- 404 pages: "page not found", "404 not found"
- Login walls: "log in to continue", "please log in", "sign in to continue"
- Paywalls: "subscribe to continue reading"
- Bot detection: "are you a robot", "complete the captcha", "verify you are human"
- Access denied, JS-required, unsupported browser pages
Score thresholds:
- >= 90: Excellent
- >= 75: Good
- >= 60: Acceptable (minimum to pass)
- < 60: Poor (rejected)
MCP Server
The MCP server exposes 6 tools for Claude Code integration:
| Tool | Description |
|---|---|
fetch_url | Fetch URL, extract markdown, save to temp |
list_cached | List all cached references |
promote_reference | Move temp reference to docs folder |
delete_cached | Delete a cached reference |
extract_links | Extract links from a cached reference |
fetch_links | Fetch all links from a cached reference |
Start via CLI: arcfetch mcp
Or configure in Claude Code MCP settings to run bunx arcfetch as stdio server.
File Format
Cached files use markdown with YAML frontmatter:
---
title: "Article Title"
source_url: https://example.com/article
fetched_date: 2026-02-06
type: web
status: temporary
query: "optional search query"
---
# Article Title
Extracted markdown content...
- Ref IDs are slugified titles (e.g.,
how-to-build-react-apps) - Temp storage:
.tmp/arcfetch/<slug>.md(status: temporary) - Permanent storage:
docs/ai/references/<slug>.md(status: permanent, after promote) - Duplicate detection: re-fetching same URL returns existing ref unless
--refetch
References
- Package: https://www.npmjs.com/package/arcfetch
- Repository: https://github.com/briansunter/arcfetch
- MCP Protocol: https://modelcontextprotocol.io