llms.txt Support Skill
Purpose
Single responsibility: Detect, fetch, and use llms.txt files, which provide LLM-optimized documentation and can make documentation ingestion up to 10x faster than full-site scraping. (BP-4)
Background
The llms.txt standard (https://llmstxt.org/) provides a convention for websites to expose LLM-friendly documentation. Instead of scraping entire sites, check for llms.txt first.
File hierarchy (check in order):
- llms-full.txt: Complete documentation (largest)
- llms.txt: Standard documentation
- llms-small.txt: Condensed documentation (smallest)
Grounding Checkpoint (Archetype 1 Mitigation)
Before executing, VERIFY:
- Base URL is accessible
- Check all three llms.txt variants in order
- Validate file content is actual documentation (not error page)
- Confirm file size is reasonable for the documentation scope
DO NOT assume llms.txt exists. Always probe first.
Uncertainty Escalation (Archetype 2 Mitigation)
ASK USER instead of guessing when:
- Multiple llms.txt variants are found (which size should be used?)
- llms.txt content appears partial or outdated
- A file is returned but its content looks like an error page
- The site has llms.txt but its content doesn't match the expected documentation
NEVER assume llms.txt quality without verification.
Context Scope (Archetype 3 Mitigation)
| Context Type | Included | Excluded |
|--------------|----------|----------|
| RELEVANT | Target base URL, llms.txt content | Full site scraping |
| PERIPHERAL | llms.txt spec reference | Other sites' llms.txt |
| DISTRACTOR | Previous scraping attempts | Unrelated documentation |
Workflow Steps
Step 1: Detect llms.txt (Grounding)
Check for llms.txt variants (in order of preference)
curl -I https://example.com/llms-full.txt
curl -I https://example.com/llms.txt
curl -I https://example.com/llms-small.txt
Check common alternate locations
curl -I https://example.com/.well-known/llms.txt
curl -I https://docs.example.com/llms.txt
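The probes above can be wrapped in a few shell functions. This is a minimal sketch, not part of skill-seekers: the function names (`llms_candidates`, `probe_status`, `detect_llms_txt`) and the 10-second timeout are assumptions chosen for illustration.

```shell
#!/bin/sh
# Variant file names, in the order this skill checks them.
LLMS_VARIANTS="llms-full.txt llms.txt llms-small.txt"

# Print one candidate URL per line for a base URL (trailing slash tolerated).
llms_candidates() {
    base=${1%/}
    for name in $LLMS_VARIANTS; do
        printf '%s/%s\n' "$base" "$name"
    done
}

# Print the HTTP status code of a HEAD request. curl prints 000 when no
# response is received (DNS failure, timeout, refused connection).
probe_status() {
    curl -sIL -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# Print every candidate URL that answers 200 (hypothetical helper).
detect_llms_txt() {
    llms_candidates "$1" | while IFS= read -r url; do
        [ "$(probe_status "$url")" = "200" ] && printf '%s\n' "$url"
    done
    return 0
}
```

Calling `detect_llms_txt https://example.com` would list the variants that respond. A 200 status alone does not prove the file is documentation, so the content still needs to be validated.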
Step 2: Validate Content
Fetch and inspect first 100 lines
curl -s https://example.com/llms.txt | head -100
Check file size
curl -sI https://example.com/llms.txt | grep -i content-length
Verify it's not an error page
curl -s https://example.com/llms.txt | grep -Ei "not found|error|404" && echo "WARNING: May be error page"
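The same checks can be run against an already downloaded file, which avoids fetching the URL several times. The sketch below is one possible heuristic; the function name `looks_like_docs` and its three rules (non-empty, not HTML, at least one markdown heading) are assumptions rather than part of the llms.txt standard.

```shell
#!/bin/sh
# Return 0 when a downloaded file plausibly contains documentation.
# The rules are illustrative assumptions, not part of the llms.txt standard.
looks_like_docs() {
    file=$1
    # Reject missing or empty files.
    [ -s "$file" ] || return 1
    # Reject HTML responses: some sites answer 200 with an HTML error page.
    if head -n 5 "$file" | grep -Eiq '<!doctype html|<html'; then
        return 1
    fi
    # Expect at least one markdown heading somewhere in the file.
    grep -Eq '^#{1,6} ' "$file" || return 1
    return 0
}
```

For example, `looks_like_docs docs/llms.txt || echo "WARNING: May be error page"` gives a warning that does not fire just because the documentation itself mentions the word "error".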
Step 3: Choose Variant
| Variant | Size | Use Case |
|---------|------|----------|
| llms-full.txt | Large (1 MB+) | Complete documentation, full API reference |
| llms.txt | Medium | Standard use, balanced coverage |
| llms-small.txt | Small (<100 KB) | Quick reference, limited context windows |
Decision tree:
- If the context window is limited → llms-small.txt
- If complete coverage is needed → llms-full.txt
- Default → llms.txt
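The decision tree can be expressed as a small lookup. In this sketch the function name `choose_variant` and the argument labels `limited` and `complete` are hypothetical; any other value falls through to the default.

```shell
#!/bin/sh
# Map a need to a variant, following the decision tree.
# The labels "limited" and "complete" are hypothetical argument values.
choose_variant() {
    case "$1" in
        limited)  echo "llms-small.txt" ;;
        complete) echo "llms-full.txt" ;;
        *)        echo "llms.txt" ;;
    esac
}
```

The function only picks a preferred name. If that variant was not detected on the site, the choice should go back to the user as described under Uncertainty Escalation.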
Step 4: Fetch and Process
Download llms.txt
curl -fsSL --create-dirs -o docs/llms.txt https://example.com/llms.txt
Convert to skill format (if using skill-seekers)
skill-seekers scrape --llms-txt docs/llms.txt --name myskill
Or process manually
llms.txt is already LLM-optimized markdown
mkdir -p output/myskill/references
cp docs/llms.txt output/myskill/references/complete.md
Step 5: Validate Output
Check content structure
head -50 output/myskill/references/complete.md
Verify sections
grep "^#" output/myskill/references/complete.md | head -20
Check for code examples
grep -c '```' output/myskill/references/complete.md
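The structure checks can be condensed into one summary line. This is a sketch under stated assumptions: the function name `summarize_markdown` is hypothetical, and the fence marker is built with `printf` octal escapes only so that the example does not contain a literal fence.

```shell
#!/bin/sh
# Print "headings=<n> code_blocks=<n>" for a markdown file.
# Lines starting with "#" inside code blocks are counted as headings too,
# so treat the heading count as an upper bound.
summarize_markdown() {
    file=$1
    # Three backticks, written as octal escapes (\140 is a backtick).
    fence=$(printf '\140\140\140')
    headings=$(grep -c '^#' "$file")
    fences=$(grep -c "^$fence" "$file")
    # Each code block has an opening and a closing fence line.
    printf 'headings=%s code_blocks=%s\n' "$headings" "$((fences / 2))"
}
```

Running `summarize_markdown output/myskill/references/complete.md` on a file with no headings or no code blocks is a quick signal that the fetch returned something other than documentation.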
Recovery Protocol (Archetype 4 Mitigation)
On error:
- PAUSE: Note which variant failed
- DIAGNOSE: Check error type:
  - 404 Not Found → Try next variant or alternate location
  - 403 Forbidden → May need authentication or user-agent
  - Timeout → Retry with longer timeout
  - Invalid content → Fall back to traditional scraping
- ADAPT: Try alternate approach
- RETRY: Next variant (max 3 attempts per variant)
- ESCALATE: Inform user llms.txt unavailable, suggest scraping
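The diagnosis and retry steps could be sketched as follows. The function names and action labels are hypothetical, and treating curl's `000` status as a timeout is an assumption, since curl also reports `000` for DNS and connection failures.

```shell
#!/bin/sh
# Map an HTTP status code (as printed by curl -w '%{http_code}') to a
# recovery action. The action names are hypothetical labels.
recovery_action() {
    case "$1" in
        200) echo "validate-content" ;;
        404) echo "try-next-variant" ;;
        403) echo "retry-with-auth-or-user-agent" ;;
        000) echo "retry-with-longer-timeout" ;;  # no response received
        *)   echo "escalate-to-user" ;;
    esac
}

# Try one URL up to three times, doubling the timeout on each attempt.
fetch_with_retry() {
    url=$1; out=$2; timeout=10; attempt=1
    while [ "$attempt" -le 3 ]; do
        curl -fsSL --max-time "$timeout" -o "$out" "$url" && return 0
        timeout=$((timeout * 2))
        attempt=$((attempt + 1))
    done
    return 1
}
```

A caller would run `fetch_with_retry` for each variant in turn and escalate to the user only after every variant has failed.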
Checkpoint Support
State saved to: .aiwg/working/checkpoints/llms-txt-support/
checkpoints/llms-txt-support/
├── detection_results.json   # Which variants found
├── selected_variant.txt     # Which was chosen
└── content_hash.txt         # For cache validation
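One way to write and check two of these files is sketched below. The function names are hypothetical, and the hash choice (`sha256sum` when installed, POSIX `cksum` otherwise) is an assumption, since the source does not say which hash is used. Writing `detection_results.json` is left out.

```shell
#!/bin/sh
# Print a content hash for a file: sha256sum when available, else cksum.
content_hash() {
    if command -v sha256sum >/dev/null 2>&1; then
        sha256sum "$1" | cut -d ' ' -f 1
    else
        cksum "$1" | cut -d ' ' -f 1
    fi
}

# Record the chosen variant and the hash of the downloaded file.
save_checkpoint() {
    dir=$1; variant=$2; file=$3
    mkdir -p "$dir" || return 1
    printf '%s\n' "$variant" > "$dir/selected_variant.txt"
    content_hash "$file" > "$dir/content_hash.txt"
}

# Return 0 when the file still matches the stored hash.
checkpoint_is_fresh() {
    [ -f "$1/content_hash.txt" ] || return 1
    [ "$(cat "$1/content_hash.txt")" = "$(content_hash "$2")" ]
}
```

For example, `save_checkpoint .aiwg/working/checkpoints/llms-txt-support llms.txt docs/llms.txt` stores the state, and a later `checkpoint_is_fresh` call with the same directory and file tells whether the cached copy can be reused.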
llms.txt Format Reference
Standard llms.txt structure:

    # Project Name
    > Brief description of the project
    ## Overview
    [High-level explanation]
    ## Installation
    [Setup instructions]
    ## Quick Start
    [Getting started guide]
    ## API Reference
    [Detailed API documentation]
    ## Examples
    [Code examples]
    ## FAQ
    [Common questions]
Detection Results Output
{
  "base_url": "https://example.com",
  "detected": {
    "llms-full.txt": {
      "found": true,
      "url": "https://example.com/llms-full.txt",
      "size": 1523456,
      "last_modified": "2025-01-15T10:30:00Z"
    },
    "llms.txt": {
      "found": true,
      "url": "https://example.com/llms.txt",
      "size": 245678,
      "last_modified": "2025-01-15T10:30:00Z"
    },
    "llms-small.txt": {
      "found": false
    }
  },
  "recommended": "llms.txt",
  "reason": "Standard size, good for most use cases"
}
Known Sites with llms.txt
Sites known to support llms.txt (verify before use):
- Anthropic documentation
- Many modern API documentation sites
- Framework documentation following the standard

Always verify - this list may be outdated.
Troubleshooting
| Issue | Diagnosis | Solution |
|-------|-----------|----------|
| No llms.txt found | Site doesn't publish llms.txt | Fall back to doc-scraper |
| Content seems wrong | Error page or redirect | Check actual content, verify URL |
| File too large | llms-full.txt is overwhelming | Use llms.txt or llms-small.txt |
| Outdated content | llms.txt not maintained | Consider scraping + llms.txt merge |
Integration with doc-scraper
If llms.txt is incomplete or outdated, combine approaches:
1. Fetch llms.txt as base
curl -o base.md https://example.com/llms.txt
2. Scrape for additional/updated content
skill-seekers scrape --config config.json --skip-covered-by base.md
3. Merge results
llms.txt provides structure, scraping fills gaps
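The merge step has no command of its own, so here is one minimal way it might look. The function name `merge_docs`, the separator heading, and the idea that the scraper output is a single markdown file are all assumptions, and no de-duplication is attempted.

```shell
#!/bin/sh
# Concatenate the llms.txt base with scraped additions into one file.
# The output file must differ from both inputs.
merge_docs() {
    base=$1; extra=$2; out=$3
    {
        cat "$base"
        # Hypothetical separator heading marking where scraped content starts.
        printf '\n## Additional scraped content\n\n'
        cat "$extra"
    } > "$out"
}
```

With a hypothetical scraper output named `scraped.md`, `merge_docs base.md scraped.md merged.md` keeps the llms.txt structure first and appends the gap-filling content after it.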
References
- llms.txt Standard: https://llmstxt.org/
- Skill Seekers llms.txt Detection: https://github.com/jmagly/Skill_Seekers/blob/main/docs/LLMS_TXT_SUPPORT.md
- REF-001: Production-Grade Agentic Workflows (BP-4, BP-9)
- REF-002: LLM Failure Modes (Archetype 1-4 mitigations)