Anysite CLI
Command-line tool for web data extraction, dataset pipelines, and database operations.
Agent Planning Workflow
BEFORE planning any data collection task, follow this sequence:
1. Discover available endpoints
   anysite describe --search "<keyword>"    # Search by domain (linkedin, company, user, etc.)
2. Select endpoints needed for the task — identify which endpoints will provide the required data
3. Inspect each selected endpoint
   anysite describe /api/linkedin/company   # View input params and output fields
4. Only then plan — now you know the exact parameters, field names, and data structure to build your config or API calls
This prevents errors from wrong endpoint paths, missing required parameters, or incorrect field names in dependencies.
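For example, planning an "employees of target companies" task would start like this (illustrative search keyword; the endpoint path appears in the examples below — always confirm with describe):
anysite describe --search "employees"              # Find candidate endpoints
anysite describe /api/linkedin/company/employees   # Inspect params before writing the config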
Best Practices
- Use dataset pipelines for multi-step tasks
  - If a task requires sequential API calls, LLM enrichment, or chained data processing — create a dataset.yaml config instead of running multiple ad-hoc commands
  - Dataset pipelines handle dependencies, incremental collection, and error recovery automatically
  - Even for "simple" tasks that grow in scope, a dataset config is easier to maintain
  - Benefits: run history, incremental sync, scheduling, notifications, DB loading
- Save data in Parquet format by default — unless the user requests another format or CSV/JSON fits better
- Prefer datasets over ad-hoc scripts — one dataset.yaml replaces dozens of shell commands
Quick Start Checklist
Before any data collection task:
# 1. Check CLI is available and see latest changes
anysite --version
anysite changelog --last 1 --json # Check what's new in this version
# If not found: source .venv/bin/activate or pip install anysite-cli
# 2. Update schema cache (required for endpoint discovery)
anysite schema update
# 3. Verify API key
anysite config get api_key
# If not set: anysite config set api_key sk-xxxxx
After upgrading, run anysite changelog --since <old_version> --json to discover new features.
Endpoint Discovery
ALWAYS discover endpoints before writing API calls or dataset configs:
anysite describe # List all endpoints
anysite describe --search "company" # Search with dependency context
anysite describe /api/linkedin/company # Full details: params, output, connections
Search returns matched endpoints plus upstream providers (who can supply input IDs) and downstream consumers (who can use output IDs). Use this to plan endpoint chains for dataset pipelines.
Input params show type, description, examples, and defaults. Array params show item structure:
Input parameters:
  * urn        string                      User URN; only fsd_profile URN type is allowed
               example: "urn:li:fsd_profile:ACoAABXy1234"
    count      integer                     Number of posts to return
               default: 20
    companies  array[object{type,value}]   Company URNs
               example: [{"type": "company", "value": "14064608"}]
Prerequisites
pip install "anysite-cli[data]" # DuckDB + PyArrow for dataset commands
pip install "anysite-cli[llm]" # LLM analysis (openai/anthropic)
pip install "anysite-cli[postgres]" # PostgreSQL adapter
pip install "anysite-cli[clickhouse]" # ClickHouse adapter
anysite config set api_key sk-xxxxx # Configure API key
anysite schema update # Update schema cache
anysite llm setup # Interactive setup (human)
anysite llm setup --provider openai --api-key sk-xxx --no-test # Non-interactive (agent)
anysite llm setup --provider anthropic --api-key-env ANTHROPIC_API_KEY --no-test
anysite db add pg --type postgres --host localhost --database mydb --user app --password secret
# Or via env var: anysite db add pg ... --password-env PGPASS
anysite db add ch --type clickhouse --host ch.example.com --port 8443 --database analytics --user app --password secret --ssl
Authentication
anysite auth login # Interactive browser-based OAuth2 (human)
anysite auth login --force --no-browser # Re-authenticate without confirmation (agent)
anysite auth status # Check current auth status
anysite auth status --json # Machine-readable auth status
anysite auth logout # Interactive logout (human)
anysite auth logout --force # Logout without confirmation (agent)
Single API Call
anysite api /api/linkedin/user user=satyanadella
anysite api /api/linkedin/company company=anthropic --format table
anysite api /api/linkedin/search/users keywords="CTO" count=50 --format csv --output ctos.csv
anysite api /api/linkedin/user user=satyanadella --fields "name,headline,urn.value" -q | jq
URN/Name Parameter Formats
Parameters like location, current_companies, and industry accept two formats:
# Single name (text search) — resolves to URNs automatically
location="London"
current_companies="Microsoft"
# Multiple URNs (direct) — use JSON array in single quotes
'location=["urn:li:geo:101165590", "urn:li:geo:101282230"]'
'current_companies=["urn:li:company:1035", "urn:li:company:1441"]'
Note: a list of names like ["Microsoft", "Google"] is NOT supported — use either a single name OR multiple URNs.
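A full call combining both formats, as a sketch — assuming the user-search endpoint accepts these filter params (confirm with anysite describe first):
anysite api /api/linkedin/search/users keywords="CTO" location="London" count=50
anysite api /api/linkedin/search/users keywords="CTO" 'location=["urn:li:geo:101165590", "urn:li:geo:101282230"]' count=50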
Batch Processing
anysite api /api/linkedin/user --from-file users.txt --input-key user \
--parallel 5 --rate-limit "10/s" --on-error skip --progress --stats
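Input files follow the same formats as from_file sources (see below): .txt is one value per line, .csv uses a named column, .jsonl one object per line. An illustrative users.txt (values are examples):
# users.txt — one input value per line
satyanadella
another-user-slug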
Dataset Pipeline (Multi-Source Collection)
For complex data collection with dependencies, LLM enrichment, scheduling — use dataset pipelines.
Initialize
anysite dataset init my-dataset
# Creates my-dataset/dataset.yaml with template config
Six Source Types
- Independent — single API call with static input
- from_file — batch calls iterating over input file values
- Dependent — batch calls using values extracted from a parent source
- Union (type: union) — combine records from multiple parent sources into one
- LLM (type: llm) — process parent data through LLM without API calls
- SQL (type: sql) — query a database connection or dataset Parquet files via DuckDB
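Before the full reference, a minimal two-source sketch — one independent search feeding one dependent source (endpoints and field paths taken from the examples in this document; verify with anysite describe):
name: fintech-ctos
sources:
  - id: search_results
    endpoint: /api/linkedin/search/users
    input: { keywords: "CTO fintech", count: 50 }
  - id: profiles
    endpoint: /api/linkedin/user
    dependency:
      from_source: search_results
      field: urn.value
    input_key: user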
Comprehensive Dataset YAML Reference
name: my-dataset # Dataset name (required)
description: Optional description # Human-readable description
sources:
# === TYPE 1: Independent source (single API call) ===
- id: search_results # Unique identifier (required)
endpoint: /api/linkedin/search/users # API endpoint (required for type: api)
input: # Static API parameters
keywords: "software engineer"
count: 50
parallel: 1 # Concurrent requests: 1-10 (default: 1)
rate_limit: "10/s" # Rate limit: "N/s", "N/m", "N/h"
on_error: stop # Error handling: stop | skip (default: stop)
- id: search_extra # Another search (can be combined with union)
endpoint: /api/linkedin/search/users
input: { keywords: "data engineer", count: 50 }
# === TYPE 2: from_file source (batch from file) ===
- id: companies
endpoint: /api/linkedin/company
from_file: companies.txt # Input file: .txt (line per value), .csv, .jsonl
file_field: company_slug # CSV column name (for CSV files only)
input_key: company # API parameter to fill with each value
parallel: 3
# === TYPE 3: Dependent source (values from parent) ===
- id: employees
endpoint: /api/linkedin/company/employees
dependency:
from_source: companies # Parent source ID (required)
field: urn.value # Dot-notation path to extract from parent records
match_by: name # Alternative: fuzzy match instead of exact field
dedupe: true # Remove duplicate values (default: false)
input_key: companies # API parameter for extracted values
input_template: # Transform values before API call
companies:
- type: company
value: "{value}" # {value} = extracted value placeholder
count: 5
refresh: auto # Incremental behavior: auto (default) | always | never
# never = skip if data exists
# === Shorthand for dependent sources ===
# ${source.field} auto-expands to dependency + input_key:
- id: profiles
endpoint: /api/linkedin/user
input:
user: ${companies.urn.value} # Equivalent to dependency + input_key above
# === TYPE 4: Union source (combine multiple sources) ===
- id: all_search_results
type: union # Source type: api (default) | union | llm | sql
sources: [search_results, search_extra] # Parent source IDs to combine (required)
dedupe_by: urn.value # Optional: field path for deduplication (dot-notation)
# NOTE: type: union cannot have endpoint, dependency, from_file, input_key, input
# NOTE: all sources in the list must have the same endpoint (same data structure)
# Records are annotated with _union_source = parent source ID
# === TYPE 5: LLM source (process parent data without API) ===
- id: employees_analyzed
type: llm # Source type: api (default) | union | llm | sql
dependency:
from_source: employees
field: name # Required by schema (not used for LLM sources)
llm: # LLM enrichment steps (required for type: llm)
- type: classify # Step types: classify | enrich | summarize | generate
categories: "developer,recruiter,executive" # Comma-separated (omit for auto-detect)
output_column: role_type # Output column name (default: category)
fields: [headline] # Record fields to include in LLM prompt
- type: enrich
add: # Field specs (required for enrich)
- "sentiment:positive/negative/neutral" # Enum: value1/value2/value3
- "language:string" # Types: string | number | integer | boolean
- "quality_score:1-10" # Range: min-max
fields: [headline, summary]
temperature: 0.0 # LLM temperature: 0.0-1.0 (default: 0.0)
provider: openai # Provider override: openai | anthropic
model: gpt-4o-mini # Model override
- type: summarize
max_length: 50 # Max words (default: 100)
output_column: bio
- type: generate
prompt: "Write pitch for {name}" # Template with {field} placeholders (required)
output_column: pitch
temperature: 0.7 # Higher for creative text
export: # Export destinations (runs after Parquet write)
- type: file
path: ./output/{{source}}-{{date}}.csv # Templates: {{date}}, {{source}}, {{dataset}}
format: csv # Format: json | jsonl | csv
# NOTE: type: llm cannot have endpoint, from_file, input_key, input
# === TYPE 6: SQL source (query a database) ===
# Mode A: Named connection (query external database)
- id: billing_users
type: sql
connection: billing # Named connection from 'anysite db add'
query: "SELECT name, email FROM subscriptions WHERE status = 'inactive'"
# query_file: queries/inactive.sql # Alternative: read SQL from file
filter: ".email != null" # All base fields work: filter, llm, transform, export, db_load
# Mode B: Dataset views (no connection — query Parquet via DuckDB)
# Each collected source becomes a view by its id (hyphens → underscores)
- id: enriched_profiles
type: sql
query: |
SELECT u.*, c.description as company_desc
FROM user_profiles u
LEFT JOIN company_profiles c ON u._input_value = c._input_value
# Use for cross-source JOINs within the pipeline — no external DB needed
# NOTE: type: sql cannot have endpoint, from_file, input_key, input, parallel, rate_limit, on_error
# === OPTIONAL BLOCKS (any source type) ===
# THREE-LEVEL FILTERING:
# Level 1: source.filter — before LLM + Parquet (saves tokens, drops records entirely)
# Level 2: transform.filter — before exports only (Parquet keeps all records)
# Level 3: db_load.filter — before DB loading (Parquet keeps all records)
# All use same syntax: .field op value, booleans (true/false), and/or
- id: profiles
endpoint: /api/linkedin/user
dependency: { from_source: employees, field: internal_id.value }
input_key: user
filter: '.follower_count > 100' # Level 1: early filter (before LLM + Parquet)
transform: # Level 2: export filter (Parquet keeps all)
filter: '.location != ""' # Safe expression
fields: # Field selection with dot-notation aliases
- name
- urn.value AS urn_id
- headline
add_columns: # Static columns to inject
batch: "q1-2026"
export:
- type: file
path: ./output/profiles.csv
format: csv
- type: webhook
url: https://example.com/hook
headers: { X-Token: abc }
db_load: # Database loading config
filter: '.active == true' # Level 3: DB filter (only active to DB)
table: people # Custom table name (default: source ID)
key: urn.value # Unique key for diff-based incremental sync
sync: full # Sync mode: full (default) | append (no DELETE)
fields: # Fields to load (default: all except _input_value)
- name
- urn.value AS urn_id
- headline
exclude: [_input_value, raw_html] # Fields to skip
storage:
format: parquet # Storage format (only parquet supported)
path: ./data/ # Storage directory (relative to dataset.yaml)
schedule:
cron: "0 9 * * *" # Cron expression for scheduling
notifications:
on_complete:
- url: "https://hooks.slack.com/xxx"
headers: { Authorization: "Bearer token" }
on_failure:
- url: "https://alerts.example.com/fail"
type: union Source Details
The type: union source combines records from multiple parent sources:
- Requires: a non-empty sources list (parent source IDs)
- Optional: dedupe_by field path for removing duplicates (supports dot-notation)
- Cannot have: endpoint, dependency, from_file, input_key, input
- Validation: all parent sources must have the same endpoint (same data structure)
- Use case: merge multiple search results before a single dependent source processes them
sources:
- id: search_cto
endpoint: /api/linkedin/search/users
input: { keywords: "CTO fintech", count: 50 }
- id: search_vp
endpoint: /api/linkedin/search/users
input: { keywords: "VP Engineering", count: 50 }
# Union combines all search results
- id: all_candidates
type: union
sources: [search_cto, search_vp]
dedupe_by: urn.value # Remove duplicates by URN
# Single dependent source processes all candidates
- id: profiles
endpoint: /api/linkedin/user
dependency:
from_source: all_candidates
field: urn.value
input_key: user
Records from union sources are annotated with _union_source (the parent source ID they came from).
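Since each collected source becomes a DuckDB view (see Query with DuckDB below), you can inspect a union's composition — a sketch assuming the all_candidates union source from the example above:
anysite dataset query dataset.yaml --sql "SELECT _union_source, COUNT(*) AS n FROM all_candidates GROUP BY _union_source"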
type: llm Source Details
The type: llm source processes existing parent data without making API calls:
- Requires: dependency (parent source) and a non-empty llm list
- Cannot have: endpoint, from_file, input_key, input
- Use case: run LLM enrichment on already-collected data, or re-analyze with different prompts
# Collect only the LLM source (reads parent Parquet, applies LLM steps)
anysite dataset collect dataset.yaml --source employees_analyzed
Validate & Collect Commands
anysite dataset validate dataset.yaml # Validate config (catches errors early)
anysite dataset collect dataset.yaml --dry-run # Preview plan
anysite dataset collect dataset.yaml # Run collection
anysite dataset collect dataset.yaml --load-db pg # Collect + auto-load to DB
anysite dataset collect dataset.yaml --incremental # Skip already-collected inputs
anysite dataset collect dataset.yaml --source employees # Single source + dependencies
anysite dataset collect dataset.yaml --no-llm # Skip LLM enrichment steps
anysite dataset collect dataset.yaml --limit 100 # Pilot run: max 100 inputs per source
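A typical pre-flight sequence chains the commands above — validate, preview the plan, then a small pilot before the full run:
anysite dataset validate dataset.yaml \
  && anysite dataset collect dataset.yaml --dry-run \
  && anysite dataset collect dataset.yaml --limit 10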
Incremental Collection (Resume Where You Left Off)
For from_file and dependency sources, anysite tracks collected input values in metadata.json. This enables resuming collection without re-fetching existing data.
Workflow:
# First run — collects all inputs
anysite dataset collect dataset.yaml
# Later: add new items to input file, run with --incremental
anysite dataset collect dataset.yaml --incremental
# → Only new items are collected, existing ones skipped
# Force full re-collection
anysite dataset reset-cursor dataset.yaml
anysite dataset collect dataset.yaml
Per-source refresh option:
sources:
- id: profiles
refresh: auto # (default) respects --incremental
- id: posts
refresh: always # always re-collects, ignores --incremental
# use for time-sensitive data (feeds, activity)
- id: companies
refresh: never # skip if data exists (any snapshot)
Reset cursor:
anysite dataset reset-cursor dataset.yaml # all sources
anysite dataset reset-cursor dataset.yaml --source posts # specific source
Query with DuckDB
anysite dataset query dataset.yaml --sql "SELECT * FROM companies LIMIT 10"
anysite dataset query dataset.yaml --source profiles --fields "name, urn.value AS id"
anysite dataset query dataset.yaml --source profiles --exclude "_input_value,_parent_source"
anysite dataset query dataset.yaml --interactive # SQL shell
anysite dataset stats dataset.yaml --source companies
anysite dataset profile dataset.yaml
Load into Database
anysite dataset load-db dataset.yaml -c pg --drop-existing # Full load
anysite dataset load-db dataset.yaml -c pg # Incremental sync (when db_load.key set)
anysite dataset load-db dataset.yaml -c pg --snapshot 2026-01-15
Compare Snapshots
anysite dataset diff dataset.yaml --source profiles --key urn.value
anysite dataset diff dataset.yaml --source profiles --key urn.value --from 2026-01-30 --to 2026-02-01
Scheduling & History
anysite dataset schedule dataset.yaml --incremental --load-db pg # Generate cron entry
anysite dataset history my-dataset
anysite dataset logs my-dataset --run 42
anysite dataset reset-cursor dataset.yaml # Clear incremental state
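The schedule command emits a standard crontab line; roughly like the following (illustrative output — the actual generated entry may differ), pairing the schedule.cron expression from dataset.yaml with the flags you passed:
# For schedule.cron "0 9 * * *":
0 9 * * * anysite dataset collect /path/to/dataset.yaml --incremental --load-db pg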
Database Operations
# Connection management
anysite db add pg --type postgres --host localhost --database mydb --user app --password secret
anysite db add pg --type postgres --host localhost --database mydb --user app --password-env PGPASS
anysite db add ch --type clickhouse --host ch.example.com --port 8443 --database analytics --user app --password secret --ssl
anysite db add local --type sqlite --path ./data.db
anysite db add replica --type postgres --host replica.example.com --read-only
anysite db list
anysite db test pg
# Data operations
cat data.jsonl | anysite db insert pg --table users --stdin --auto-create
anysite db query pg --sql "SELECT * FROM users" --format table
anysite db query pg --sql "SELECT * FROM users" --format parquet --output users.parquet
anysite db query pg --sql "SELECT * FROM users" --format csv --output "reports/{{date}}/users.csv"
# Pipe API output to database
anysite api /api/linkedin/user user=satyanadella -q --format jsonl \
| anysite db insert pg --table profiles --stdin --auto-create
# Database discovery (schema introspection, sample data, LLM descriptions)
anysite db discover mydb # Discover schema
anysite db discover mydb --with-llm # Add LLM table/column descriptions
anysite db discover mydb --tables users,posts # Filter tables
anysite db discover mydb --exclude-tables _migrations
# View saved catalogs
anysite db catalog # List all catalogs
anysite db catalog mydb # Show full catalog
anysite db catalog mydb --table users # Show specific table
anysite db catalog mydb --json # JSON output for agents
Credentials: --password saves the value directly in ~/.anysite/connections.yaml; --password-env references an environment variable. A direct value takes priority. LLM API keys follow the same pattern via anysite llm setup.
Supported databases: SQLite, PostgreSQL, ClickHouse. ClickHouse uses clickhouse-connect driver (HTTP protocol, port 8123 default, 8443 for HTTPS/SSL).
LLM Analysis Commands
Analyze collected dataset records using LLM. Requires anysite llm setup first.
Non-interactive setup (for agents):
anysite llm setup --provider openai --api-key <key> --no-test
anysite llm setup --provider anthropic --api-key-env ANTHROPIC_API_KEY --model claude-sonnet-4-5-20250514 --no-test
anysite llm summarize dataset.yaml --source profiles --fields "name,headline" --format table
anysite llm classify dataset.yaml --source posts --categories "positive,negative,neutral"
anysite llm enrich dataset.yaml --source profiles \
--add "seniority:junior/mid/senior" --add "is_technical:boolean"
anysite llm generate dataset.yaml --source profiles \
--prompt "Write intro for {name} who works as {headline}" --temperature 0.7
anysite llm match dataset.yaml --source-a profiles --source-b companies --top-k 3
anysite llm deduplicate dataset.yaml --source profiles --key name --threshold 0.8
anysite llm cache-stats
anysite llm cache-clear
Use anysite describe --search <keyword> for more endpoints.
Key Patterns
- Output formats: --format json|jsonl|csv|table|parquet (parquet requires --output)
- Array params: keywords=a,b,c auto-wraps as ["a","b","c"]
- Field selection: --fields "name,headline,urn.value" (dot-notation for nested)
- Error handling: --on-error stop|skip|retry
- Config priority: CLI args > ENV vars > ~/.anysite/config.yaml > defaults
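Several of these patterns combined in one call (illustrative keyword values):
anysite api /api/linkedin/search/users keywords=CTO,CIO count=50 \
  --fields "name,headline,urn.value" --on-error skip --format csv --output leaders.csv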
References
For detailed option tables and advanced configuration:
- api-reference.md — all CLI options
- llm-reference.md — LLM provider config, cache, prompts
- dataset-guide.md — full YAML schema, advanced features