Jina Web Fetch
Use this skill to capture content from hard-to-fetch pages while keeping a deterministic workflow:
- Try direct fetch first.
- If direct fetch fails or looks blocked, retry via jina.ai.
Quick Start
bash "<path-to-skill>/scripts/fetch_with_jina_fallback.sh" "<url>" "<output_file>"
Example:
bash "<path-to-skill>/scripts/fetch_with_jina_fallback.sh" \
"https://x.com/trq212/status/2027463795355095314" \
"raw/x-status.txt"
The script prints source=direct or source=jina to stderr so you can see which path was used.
If output_file is omitted, content is printed to stdout.
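To act on the source marker in a wrapper script, one option is to capture stderr and pull the marker out of it. The sketch below assumes the marker appears literally as source=direct or source=jina somewhere on a stderr line; fetch_source is a hypothetical helper name, not part of the skill.

```shell
# Hypothetical helper: read captured stderr on stdin and print "direct" or "jina".
# Assumption: the marker appears literally as source=direct or source=jina.
fetch_source() {
  sed -n \
    -e 's/.*source=direct.*/direct/p' \
    -e 's/.*source=jina.*/jina/p' \
    | tail -n 1   # keep only the last marker if several lines matched
}
```

For example, redirect the script's stderr to a file with `2> fetch.log`, then run `fetch_source < fetch.log` to see which path produced the capture.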
Default Workflow
- Run the script with the target URL.
- Save raw output under a traceable path like raw/<slug>.txt.
- Parse extracted text/markdown for:
  - main body
  - media links (images/videos)
  - referenced URLs
- Keep original URL + raw capture together for auditability.
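The path and auditability steps above can be sketched as follows. Both function names, the slug rule (lowercase, non-alphanumerics collapsed to dashes), and the .url sidecar file are assumptions made for illustration; the skill itself does not prescribe them.

```shell
# Hypothetical helper: turn a URL into a filesystem-friendly slug.
slug_for_url() {
  # Drop the scheme, lowercase, and squeeze every non-alphanumeric run into one dash.
  slug=$(printf '%s' "${1#*://}" | tr '[:upper:]' '[:lower:]' | tr -cs 'a-z0-9' '-')
  slug="${slug#-}"   # trim one leading dash, if any
  slug="${slug%-}"   # trim one trailing dash, if any
  printf '%s\n' "$slug"
}

# Hypothetical helper: store the original URL next to where the capture will go,
# then print the capture path (raw/<slug>.txt by default).
record_source() {
  url="$1"
  dir="${2:-raw}"
  slug=$(slug_for_url "$url")
  mkdir -p "$dir"
  printf '%s\n' "$url" > "$dir/$slug.url"
  printf '%s\n' "$dir/$slug.txt"
}
```

The path printed by record_source can be passed as the output_file argument of the fetch script, so the capture and its originating URL end up side by side.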
Blocking Heuristics
The script auto-falls back to jina.ai when direct content looks like:
- login wall / sign-up prompt
- JS-required page
- anti-bot / captcha / access denied page
- very small shell-like HTML page (default threshold < 800 bytes)
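A check along these lines could look like the sketch below. It is an illustrative re-creation, not the script's actual logic: looks_blocked is a hypothetical name and the phrase list is a guess at what such pages commonly contain. Only the size threshold and the FETCH_MIN_BYTES variable come from this document.

```shell
# Hypothetical check: succeed (exit 0) when a saved response looks blocked.
looks_blocked() {
  file="$1"
  min_bytes="${FETCH_MIN_BYTES:-800}"

  # Very small responses are treated as an empty shell page.
  size=$(wc -c < "$file" | tr -d ' ')
  if [ "$size" -lt "$min_bytes" ]; then
    return 0
  fi

  # Case-insensitive scan for wall and anti-bot phrases (illustrative list only).
  if grep -Eiq 'sign up|log in|enable javascript|captcha|access denied' "$file"; then
    return 0
  fi

  return 1
}
```

Phrase matching like this is deliberately coarse: a legitimate page that merely mentions "log in" would also trigger the fallback, which costs an extra request but does not lose content.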
Environment Knobs
- FETCH_TIMEOUT (default 25)
- FETCH_CONNECT_TIMEOUT (default 10)
- FETCH_MIN_BYTES (default 800)
- JINA_FORCE=1 to skip direct fetch and always use jina.ai
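As a rough illustration of how these knobs resolve, the sketch below applies the documented defaults when a variable is unset and reports which path would be tried first. fetch_settings is a hypothetical helper written for this example; the real script may read the variables differently.

```shell
# Hypothetical helper: print the effective settings after applying defaults.
fetch_settings() {
  timeout="${FETCH_TIMEOUT:-25}"
  connect_timeout="${FETCH_CONNECT_TIMEOUT:-10}"
  min_bytes="${FETCH_MIN_BYTES:-800}"

  # JINA_FORCE=1 skips the direct attempt entirely.
  if [ "${JINA_FORCE:-0}" = "1" ]; then
    first="jina"
  else
    first="direct"
  fi

  printf 'first=%s timeout=%s connect_timeout=%s min_bytes=%s\n' \
    "$first" "$timeout" "$connect_timeout" "$min_bytes"
}

# Effective settings with whatever is currently set in the environment.
fetch_settings

# Overrides scoped to a subshell so they do not leak into the caller.
(JINA_FORCE=1; FETCH_TIMEOUT=60; fetch_settings)
```

The same variables can be set as a prefix on the script invocation, for example JINA_FORCE=1 before the bash command shown in Quick Start.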
URL Format Note
Fallback URL is built as:
https://r.jina.ai/http://<original-host-and-path>
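A minimal sketch of that construction, assuming the original scheme is stripped and replaced with http:// exactly as the format above shows. jina_fallback_url is a hypothetical helper name, and the example URL is a placeholder.

```shell
# Hypothetical helper: build the jina.ai fallback URL for a given page URL.
jina_fallback_url() {
  rest="${1#*://}"   # drop "https://" or "http://" if present
  printf 'https://r.jina.ai/http://%s\n' "$rest"
}

jina_fallback_url "https://example.com/some/page"
```

Query strings are passed through unchanged here; whether they need extra encoding for the fallback service is not covered by this document.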