Anakin - Web Data Extraction
Convert websites into clean data at scale using the anakin-cli. Supports single URL scraping, batch scraping, AI-powered search, and autonomous deep research.
Installation & Authentication
Check status and authentication:
anakin status
Output when ready:
✓ Authenticated
Endpoint: https://api.anakin.io
Account: user@example.com
If not installed: pip install anakin-cli
If the user is not logged in, always refer to the installation rules in rules/install.md for more information.
If not authenticated, run:
anakin login --api-key "ak-your-key-here"
Get your API key from anakin.io/dashboard.
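If the key is stored in an environment variable rather than pasted inline, login can reference it instead of hardcoding the key (ANAKIN_API_KEY is an illustrative name, not a variable the CLI defines):
# Assumes ANAKIN_API_KEY was set elsewhere (e.g. a secrets manager or shell profile)
anakin login --api-key "$ANAKIN_API_KEY"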
Organization
Store results in a .anakin/ folder in the working directory, creating it if it does not already exist. Add .anakin/ to the .gitignore file if it is not already listed. Always use -o to write output directly to a file (this avoids flooding the context):
mkdir -p .anakin
# Append .anakin/ to .gitignore only if it is not already listed
grep -qxF ".anakin/" .gitignore 2>/dev/null || echo ".anakin/" >> .gitignore
anakin scrape "https://example.com" -o .anakin/output.md
Capabilities
1. Scrape a Single URL
Extract content from a single web page in multiple formats.
When to use:
- Extracting content from a single web page
- Converting a webpage to clean markdown
- Extracting structured data from one URL
- Getting full raw API response with metadata
Basic usage:
# Clean readable text (default markdown format)
anakin scrape "https://example.com" -o output.md
# Structured data (JSON)
anakin scrape "https://example.com" --format json -o output.json
# Full API response with HTML and metadata
anakin scrape "https://example.com" --format raw -o output.json
Advanced options:
# JavaScript-heavy or single-page app sites
anakin scrape "https://example.com" --browser -o output.md
# Geo-targeted scraping (country code)
anakin scrape "https://example.com" --country gb -o output.md
# Custom timeout for slow pages (in seconds)
anakin scrape "https://example.com" --timeout 300 -o output.md
2. Batch Scrape Multiple URLs
Scrape up to 10 URLs at once for efficient parallel processing.
When to use:
- Scraping multiple web pages simultaneously
- Comparing products across different sites
- Collecting multiple articles or pages
- Gathering data from several sources at once
Basic usage:
# Batch scrape multiple URLs (up to 10)
anakin scrape-batch "https://example.com/page1" "https://example.com/page2" "https://example.com/page3" -o batch-results.json
For large lists (>10 URLs):
# First batch (URLs 1-10)
anakin scrape-batch "https://url1.com" ... "https://url10.com" -o batch-1.json
# Second batch (URLs 11-20)
anakin scrape-batch "https://url11.com" ... "https://url20.com" -o batch-2.json
Output format: JSON file with combined results, each URL's status (success/failure), content, metadata, and any errors.
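The manual chunking above can also be scripted. A sketch assuming urls.txt holds one URL per line and that the CLI accepts -o before the URL arguments (both are assumptions; file names are illustrative):
# Split urls.txt into chunks of 10, then run one scrape-batch per chunk
split -l 10 urls.txt .anakin/chunk-
i=1
for chunk in .anakin/chunk-*; do
  # xargs appends the chunk's URLs as arguments to scrape-batch
  xargs anakin scrape-batch -o ".anakin/batch-$i.json" < "$chunk"
  i=$((i + 1))
done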
3. AI-Powered Web Search
Run intelligent web searches to find pages, answer questions, and discover sources.
When to use:
- Finding pages on a specific topic
- Answering questions with web sources
- Discovering relevant sources for research
- Gathering links before scraping specific pages
- Quick factual lookups
Basic usage:
# AI-powered web search
anakin search "your search query here" -o search-results.json
Follow-up workflow:
# 1. Search for relevant pages
anakin search "machine learning tutorials" -o search-results.json
# 2. Scrape specific results for full content
anakin scrape "https://result-url-from-search.com" -o page.md
Output format: JSON file with search results including titles, URLs, snippets, relevance scores, and metadata.
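Because the search output is JSON, the follow-up scrape can be driven directly from it. A sketch using jq, assuming a top-level results array whose items carry a url field (the exact schema is an assumption):
# Pull the top 3 result URLs from the search output and batch-scrape them
jq -r '.results[].url' search-results.json | head -3 |
  xargs anakin scrape-batch -o .anakin/top-results.json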
4. Deep Agentic Research
Run comprehensive autonomous research that explores the web and returns detailed reports.
When to use:
- Comprehensive research on complex topics
- Market analysis requiring multiple sources
- Technical deep-dives across documentation and articles
- Comparison research (products, technologies, approaches)
- Questions requiring synthesis from many sources
Basic usage:
# Deep agentic research (takes 1-5 minutes)
anakin research "your research topic or question" -o research-report.json
# With extended timeout for complex topics
anakin research "comprehensive analysis of quantum computing" --timeout 600 -o research-report.json
⏱️ Important: Deep research takes 1-5 minutes and runs autonomously. Always inform the user about this duration before starting.
What it does:
- Autonomously searches for relevant sources
- Scrapes and analyzes multiple pages
- Synthesizes information across sources
- Generates comprehensive reports with citations
- Provides key insights and conclusions
Output format: JSON file with executive summary, detailed report by subtopics, key insights, citations with URLs, confidence scores, and related topics.
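Individual sections can then be pulled from the report without loading the whole file into context. A jq sketch, assuming the summary sits under an executive_summary key (the key name is an assumption about the schema):
# Print just the executive summary from the finished report
jq -r '.executive_summary' research-report.json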
Decision Guide
Use anakin scrape when:
- You have a single specific URL to extract
- You need content in markdown, JSON, or raw format
- The page is static or JavaScript-heavy (use --browser)
Use anakin scrape-batch when:
- You have 2-10 URLs to scrape simultaneously
- You need efficient parallel processing
- You want combined results in one file
Use anakin search when:
- You need to find relevant URLs first
- You want quick factual lookups
- You need results in under 30 seconds
- You know what you're looking for
Use anakin research when:
- You need comprehensive analysis across 5+ sources
- The topic is complex and requires deep exploration
- You want a synthesized report with insights
- You can wait 1-5 minutes for autonomous research
- The question requires comparing multiple perspectives
Guardrails
URL Handling
- Always quote URLs to prevent shell interpretation of ?, &, and # characters
- Example: anakin scrape "https://example.com?param=value", not anakin scrape https://example.com?param=value
Output Management
- Always use -o <file> to save output to a file rather than flooding the terminal
- Choose appropriate output filenames based on content type
Format Selection
- Default to markdown for readability unless the user explicitly asks for JSON or raw
- Use --format json for structured data processing (see the sketch below)
- Use --format raw for the full API response with HTML
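A sketch of processing a JSON scrape with jq; the title key is an assumption about the output schema:
# Scrape as JSON, then pull out the page title for quick inspection
anakin scrape "https://example.com" --format json -o .anakin/page.json
jq -r '.title' .anakin/page.json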
Special Cases
- Use --browser only when a standard scrape returns empty or incomplete content
- For batch scraping: maximum 10 URLs per command; split larger lists
- For research: always warn about the 1-5 minute duration before starting
Rate Limiting
- On HTTP 429 errors (rate limit), wait before retrying (a backoff sketch follows below)
- Do not loop immediately on rate limit errors
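A minimal backoff sketch, assuming the CLI exits non-zero on a 429 (the exit behavior and the delay values are assumptions):
# Retry with increasing delays instead of looping immediately
for delay in 5 15 60; do
  anakin scrape "https://example.com" -o .anakin/output.md && break
  echo "Request failed; waiting ${delay}s before retrying..." >&2
  sleep "$delay"
done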
Authentication
- On HTTP 401 errors, re-run anakin login rather than retrying the same command
Error Handling
| Error | Solution |
|---|---|
| HTTP 401 (Unauthorized) | Re-run anakin login --api-key "your-key" |
| HTTP 429 (Rate Limited) | Wait before retrying, do not loop immediately |
| Empty content | Try adding --browser flag for JavaScript-heavy sites |
| Timeout | Increase with --timeout <seconds> for slow pages |
| Batch partial failure | Check output JSON for individual statuses, retry failed URLs with --browser |
| Research fails | Fall back to search + multiple scrape calls manually |
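The batch partial-failure recovery in the table can be scripted. A sketch assuming the batch output holds a results array with status and url fields (the schema is an assumption):
# Collect failed URLs from a batch result and retry them with --browser
jq -r '.results[] | select(.status != "success") | .url' batch-results.json |
  xargs -r anakin scrape-batch --browser -o .anakin/batch-retry.json
# -r (GNU xargs) skips the retry entirely when nothing failed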
Output Formats
Markdown (default for scrape)
- Clean, readable text stripped of navigation and ads
- Best for human reading and summarization
- File extension: .md
JSON (structured)
- Structured data with title, content, metadata
- Best for processing and parsing
- File extension: .json
Raw (full response)
- Full API response including HTML, links, images, metadata
- Best for debugging or accessing all available data
- File extension: .json
Examples
Example 1: Article extraction
anakin scrape "https://blog.example.com/article" -o article.md
Example 2: Product comparison
anakin scrape-batch "https://store1.com/product" "https://store2.com/product" "https://store3.com/product" -o products.json
Example 3: Find and scrape
# Step 1: Find relevant URLs
anakin search "best coffee shops in Seattle" -o coffee-search.json
# Step 2: Scrape the top results
anakin scrape-batch "url1" "url2" "url3" -o coffee-details.json
Example 4: Market research
anakin research "market trends in electric vehicle adoption 2024-2026" -o ev-research.json
Example 5: JavaScript-heavy site
anakin scrape "https://spa-application.com" --browser -o spa-content.md
Example 6: Geo-targeted content
anakin scrape "https://news-site.com" --country us -o us-news.md
anakin scrape "https://news-site.com" --country gb -o gb-news.md
Best Practices
- Start simple: Try basic scrape first, add flags only if needed
- Be specific: Use clear, specific search queries and research topics
- Quote URLs: Always wrap URLs in quotes
- Save output: Always use the -o flag to save results to files
- Check status: Run anakin status before starting work
- Batch wisely: Group similar URLs together, max 10 per batch
- Wait on rate limits: Don't retry immediately on 429 errors
- Choose the right tool:
  - Single page → scrape
  - Multiple pages → scrape-batch
  - Don't have URLs → search first
  - Need deep analysis → research
Troubleshooting
Authentication issues
# Check status
anakin status
# Re-authenticate
anakin login --api-key "ak-your-key-here"
Empty or incomplete content
- Add the --browser flag for JavaScript-heavy sites
- Increase the timeout with --timeout 300
- Check if the site requires a specific geo-location with --country <code>
Rate limiting
- Wait before retrying (don't loop immediately)
- Consider spacing out requests for large batch operations (see the sketch below)
- Check your API plan limits at anakin.io/dashboard
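A sketch of spacing out sequential requests; the 2-second delay is illustrative, tune it to your plan's limits:
# Scrape a handful of URLs with a small pause between requests
for url in "https://example.com/a" "https://example.com/b"; do
  anakin scrape "$url" -o ".anakin/$(basename "$url").md"
  sleep 2
done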
Resources
- Anakin Website
- Anakin Dashboard - Get API keys and check usage
- anakin-cli on PyPI
- Support