# Crawl Skill

Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.
## Authentication
The script uses OAuth via the Tavily MCP server. No manual setup is required; on first run, it will:

- Check for existing tokens in `~/.mcp-auth/`
- If none are found, automatically open your browser for OAuth authentication
**Note:** You must have an existing Tavily account. The OAuth flow only supports login, not account creation. Sign up at tavily.com first if you don't have one.
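If you want to check the cached-credential state before running, here is a minimal sketch. It assumes only that the MCP auth client stores its tokens somewhere under `~/.mcp-auth/`; the exact file layout is an implementation detail:

```bash
# If the directory is missing, the next run will trigger the browser OAuth flow.
ls -la ~/.mcp-auth/ 2>/dev/null || echo "No cached tokens; expect a browser login on first run"
```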
### Alternative: API Key
If you prefer using an API key, get one at https://tavily.com and add it to `~/.claude/settings.json`:

```json
{
  "env": {
    "TAVILY_API_KEY": "tvly-your-api-key-here"
  }
}
```
## Quick Start

### Using the Script

```bash
./scripts/crawl.sh '<json>' [output_dir]
```
Examples:

```bash
# Basic crawl
./scripts/crawl.sh '{"url": "https://docs.example.com"}'

# Deeper crawl with limits
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'

# Save to files
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2}' ./docs

# Focused crawl with path filters
./scripts/crawl.sh '{"url": "https://example.com", "max_depth": 2, "select_paths": ["/docs/.*", "/api/.*"], "exclude_paths": ["/blog/.*"]}'

# With semantic instructions (for agentic use)
./scripts/crawl.sh '{"url": "https://docs.example.com", "instructions": "Find API documentation", "chunks_per_source": 3}'
```
When `output_dir` is provided, each crawled page is saved as a separate markdown file.
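If you call the API directly rather than through the script, a rough equivalent of the `output_dir` behavior looks like this. This is a sketch, assuming `jq` is available; the URL-slug filename scheme is illustrative, not necessarily what `crawl.sh` does:

```bash
mkdir -p ./docs
curl --silent --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{"url": "https://docs.example.com", "max_depth": 2}' \
  | jq -c '.results[]' \
  | while read -r page; do
      url=$(echo "$page" | jq -r '.url')
      # Derive a filesystem-safe name from the URL (illustrative slug scheme).
      slug=$(echo "$url" | sed -E 's|https?://||; s|[/?&=]|_|g')
      echo "$page" | jq -r '.raw_content' > "./docs/${slug}.md"
    done
```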
### Basic Crawl

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 1,
    "limit": 20
  }'
```
### Focused Crawl with Instructions

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and code examples",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```
## API Reference

### Endpoint

```
POST https://api.tavily.com/crawl
```

### Headers

| Header | Value |
| --- | --- |
| `Authorization` | `Bearer <TAVILY_API_KEY>` |
| `Content-Type` | `application/json` |
### Request Body

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| `url` | string | Required | Root URL to begin crawling |
| `max_depth` | integer | 1 | Levels deep to crawl (1-5) |
| `max_breadth` | integer | 20 | Links followed per page |
| `limit` | integer | 50 | Cap on total pages crawled |
| `instructions` | string | null | Natural-language guidance to focus the crawl |
| `chunks_per_source` | integer | 3 | Chunks returned per page (1-5; requires `instructions`) |
| `extract_depth` | string | `"basic"` | `basic` or `advanced` |
| `format` | string | `"markdown"` | `markdown` or `text` |
| `select_paths` | array | null | Regex patterns for URL paths to include |
| `exclude_paths` | array | null | Regex patterns for URL paths to exclude |
| `allow_external` | boolean | true | Include links to external domains |
| `timeout` | float | 150 | Max wait in seconds (10-150) |
### Response Format

```json
{
  "base_url": "https://docs.example.com",
  "results": [
    {
      "url": "https://docs.example.com/page",
      "raw_content": "# Page Title\n\nContent..."
    }
  ],
  "response_time": 12.5
}
```
## Depth vs Performance

| Depth | Typical Pages | Time |
| --- | --- | --- |
| 1 | 10-50 | Seconds |
| 2 | 50-500 | Minutes |
| 3 | 500-5000 | Many minutes |

Start with `max_depth=1` and increase only if needed.
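One conservative pattern is to escalate depth only when a shallow pass hits its page cap, which suggests there is more to fetch. A sketch, assuming `crawl.sh` prints the JSON response to stdout when no output directory is given:

```bash
# First pass: shallow and capped.
resp=$(./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 1, "limit": 20}')
count=$(echo "$resp" | jq '.results | length')

# Only go deeper if the cap was hit, i.e. the site likely has more pages.
if [ "$count" -ge 20 ]; then
  ./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "limit": 50}'
fi
```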
## Crawl for Context vs Data Collection

For agentic use (feeding results into context): always use `instructions` together with `chunks_per_source`. This returns only relevant chunks instead of full pages, preventing context-window explosion.

For data collection (saving to files): omit `chunks_per_source` to get full page content.
## Examples

### For Context: Agentic Research (Recommended)

Use when feeding crawl results into an LLM context:
```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find API documentation and authentication guides",
    "chunks_per_source": 3
  }'
```
Returns only the most relevant chunks (max 500 characters each) per page, so results fit in context without overwhelming it.
### For Context: Targeted Technical Docs

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com",
    "max_depth": 2,
    "instructions": "Find all documentation about authentication and security",
    "chunks_per_source": 3,
    "select_paths": ["/docs/.*", "/api/.*"]
  }'
```
### For Data Collection: Full Page Archive

Use when saving content to files for later processing:

```bash
curl --request POST \
  --url https://api.tavily.com/crawl \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/blog",
    "max_depth": 2,
    "max_breadth": 50,
    "select_paths": ["/blog/.*"],
    "exclude_paths": ["/blog/tag/.*", "/blog/category/.*"]
  }'
```
Returns full page content. Use the script with `output_dir` to save each page as a markdown file.
## Map API (URL Discovery)

Use map instead of crawl when you only need URLs, not content:

```bash
curl --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://docs.example.com",
    "max_depth": 2,
    "instructions": "Find all API docs and guides"
  }'
```

Returns URLs only (faster than crawl):

```json
{
  "base_url": "https://docs.example.com",
  "results": [
    "https://docs.example.com/api/auth",
    "https://docs.example.com/guides/quickstart"
  ]
}
```
## Tips

- Always use `chunks_per_source` for agentic workflows; it prevents context explosion when feeding results to LLMs
- Omit `chunks_per_source` only for data collection, when saving full pages to files
- Start conservative (`max_depth=1`, `limit=20`) and scale up
- Use path patterns to focus on relevant sections
- Use Map first to understand site structure before a full crawl (see the sketch after this list)
- Always set a `limit` to prevent runaway crawls
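Putting the last two tips together, a map-then-crawl workflow might look like the following sketch: map the site first to see its structure, then run a focused, capped crawl on the one section that matters (the `/api/.*` filter is an example, not a requirement):

```bash
# 1. Map the site (fast, URLs only) to see what is there.
curl --silent --request POST \
  --url https://api.tavily.com/map \
  --header "Authorization: Bearer $TAVILY_API_KEY" \
  --header 'Content-Type: application/json' \
  --data '{"url": "https://docs.example.com", "max_depth": 2}' \
  | jq -r '.results[]'

# 2. Crawl only the section that looked relevant, with a hard page cap.
./scripts/crawl.sh '{"url": "https://docs.example.com", "max_depth": 2, "select_paths": ["/api/.*"], "limit": 30}' ./api-docs
```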