# Initialize Warehouse Schema

Generate a comprehensive, user-editable schema reference file for the data warehouse.
## What This Does

- Discovers all databases, schemas, tables, and columns from the warehouse
- Enriches them with codebase context (dbt models, gusty SQL, schema docs)
- Records row counts and identifies large tables
- Generates `.astro/warehouse.md`, a version-controllable, team-shareable reference
- Enables instant concept→table lookups without warehouse queries
## Process

### Step 1: Read Warehouse Configuration

Read `~/.astro/ai/config/warehouse.yml` to get the configured databases. An example config has `databases: [HQ, ANALYTICS, RAW]`.

Use `list_schemas()` with no database argument to see all configured databases.
### Step 2: Search Codebase for Context (Parallel)

Launch a subagent to find business context in code:

```python
Task(
    subagent_type="Explore",
    prompt="""
    Search for data model documentation in the codebase:

    1. dbt models: **/models/**/*.yml, **/schema.yml
       - Extract table descriptions, column descriptions
       - Note primary keys and tests
    2. Gusty/declarative SQL: **/dags/**/*.sql with YAML frontmatter
       - Parse frontmatter for: description, primary_key, tests
       - Note schema mappings
    3. AGENTS.md or CLAUDE.md files with data layer documentation

    Return a mapping of:
    table_name -> {description, primary_key, important_columns, layer}
    """,
)
```
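The frontmatter extraction in item 2 only needs a few lines of code. A minimal sketch, assuming the frontmatter sits between two `---` lines at the top of each `.sql` file; the delimiter and field names are assumptions, not a documented gusty contract:

```python
# Hypothetical sketch: pull YAML frontmatter out of a gusty-style .sql file.
# Assumes the frontmatter is delimited by "---" lines at the top of the file.
import yaml  # assumes PyYAML is available


def parse_sql_frontmatter(path: str) -> dict:
    text = open(path).read()
    if not text.startswith("---"):
        return {}
    parts = text.split("---", 2)  # ["", frontmatter, remaining SQL]
    if len(parts) < 3:
        return {}
    meta = yaml.safe_load(parts[1]) or {}
    # Keep only the fields this command cares about
    return {k: meta[k] for k in ("description", "primary_key", "tests") if k in meta}
```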
### Step 3: Parallel Warehouse Discovery

Launch one subagent per database using the Task tool. For each database in the configured databases:

```python
Task(
    subagent_type="general-purpose",
    prompt="""
    Discover all metadata for database {DATABASE}:

    1. Call list_schemas(database="{DATABASE}")
    2. For each schema returned, call list_tables(database="{DATABASE}", schema=X)
    3. For tables with interesting names or high row counts,
       call get_tables_info(database="{DATABASE}", schema=X, tables=[...])

    Return a structured summary:
    - Database name
    - List of schemas with table counts
    - For each table: name, row_count, columns (if fetched)
    - Flag any tables with >100M rows as "large"

    Focus on MODEL_*, METRICS_*, MART_* schemas first as these are most useful.
    """,
)
```

Run all subagents in parallel (a single message with multiple Task calls).
### Step 4: Discover Categorical Value Families

For key categorical columns (like OPERATOR, STATUS, TYPE, FEATURE), discover value families to help with filtering:

```sql
-- Find distinct values and group into families
SELECT {COLUMN}, COUNT(*) AS occurrences
FROM {TABLE}
WHERE {COLUMN} IS NOT NULL
GROUP BY {COLUMN}
ORDER BY occurrences DESC
LIMIT 50
```

Group related values into families by a common prefix or suffix (e.g., Export* for ExportCSV, ExportJSON, ExportParquet).
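A minimal sketch of that grouping step, assuming the distinct values have already been fetched; the fixed prefix length is an illustrative heuristic, not a prescribed rule:

```python
# Hypothetical sketch: group distinct values into families by shared prefix.
from collections import defaultdict


def group_value_families(values: list[str], prefix_len: int = 6) -> dict[str, list[str]]:
    families: dict[str, list[str]] = defaultdict(list)
    for value in values:
        families[value[:prefix_len]].append(value)
    # Only keep prefixes shared by more than one value; the rest stand alone
    return {f"{prefix}*": members for prefix, members in families.items() if len(members) > 1}


# group_value_families(["ExportCSV", "ExportJSON", "ExportParquet", "Login"])
# -> {"Export*": ["ExportCSV", "ExportJSON", "ExportParquet"]}
```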
### Step 5: Merge Results

Combine warehouse metadata with codebase context into these sections (see the sketch after this list):

- Quick Reference table - concept → table mappings (pre-populated from code if found)
- Categorical Columns - value families for key filter columns
- Database sections - one per database
- Schema subsections - tables grouped by schema
- Table details - columns, row counts, descriptions from code, warnings
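A minimal sketch of the merge, assuming the Step 3 subagents return per-table metadata and Step 2 returns the `table_name -> {...}` mapping; the field names are illustrative:

```python
# Hypothetical sketch: merge warehouse metadata with codebase context.
# The shapes of `warehouse_tables` and `code_context` are assumptions based on Steps 2-3.
def merge_metadata(warehouse_tables: dict, code_context: dict) -> dict:
    merged = {}
    for table_name, meta in warehouse_tables.items():
        code = code_context.get(table_name, {})
        merged[table_name] = {
            "row_count": meta.get("row_count"),
            "columns": meta.get("columns", []),
            "description": code.get("description"),   # from dbt/gusty docs if found
            "primary_key": code.get("primary_key"),
            "layer": code.get("layer"),
            "large": (meta.get("row_count") or 0) > 100_000_000,
        }
    return merged
```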
### Step 6: Generate warehouse.md

Write the file to:

- `.astro/warehouse.md` (default: project-specific, version-controllable)
- `~/.astro/ai/config/warehouse.md` (with the `--global` flag)
## Output Format

```markdown
# Warehouse Schema

Generated by /data:init on {DATE}. Edit freely to add business context.

## Quick Reference

| Concept | Table | Key Column | Date Column |
|---|---|---|---|
| customers | HQ.MODEL_ASTRO.ORGANIZATIONS | ORG_ID | CREATED_AT |
| <!-- Add your concept mappings here --> |

## Categorical Columns

When filtering on these columns, explore value families first (values often have variants):

| Table | Column | Value Families |
|---|---|---|
| {TABLE} | {COLUMN} | {PREFIX}* ({VALUE1}, {VALUE2}, ...) |
| <!-- Populated by /data:init from actual warehouse data --> |

## Data Layer Hierarchy

Query downstream first: reporting > mart_* > metric_* > model_* > IN_*

| Layer | Prefix | Purpose |
|---|---|---|
| Reporting | reporting.* | Dashboard-optimized |
| Mart | mart_* | Combined analytics |
| Metric | metric_* | KPIs at various grains |
| Model | model_* | Cleansed sources of truth |
| Raw | IN_* | Source data - avoid |

## {DATABASE} Database

### {SCHEMA} Schema

#### {TABLE_NAME}

{DESCRIPTION from code if found}

| Column | Type | Description |
|---|---|---|
| COL1 | VARCHAR | {from code or inferred} |

- Rows: {ROW_COUNT}
- Key column: {PRIMARY_KEY from code or inferred}
{IF ROW_COUNT > 100M: - ⚠️ WARNING: Large table - always add date filters}

## Relationships

{Inferred relationships based on column names like *_ID}
```
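The Relationships section is derived from foreign-key-style column names. A minimal sketch of that inference, assuming each table's column list has already been collected; the matching rule is an assumption:

```python
# Hypothetical sketch: infer relationships from *_ID column names.
# A column like ORG_ID is linked to a table whose name starts with ORG.
def infer_relationships(columns_by_table: dict[str, list[str]]) -> list[tuple[str, str, str]]:
    links = []
    for table, columns in columns_by_table.items():
        for col in columns:
            if not col.endswith("_ID"):
                continue
            entity = col[:-3]  # "ORG_ID" -> "ORG"
            for candidate in columns_by_table:
                if candidate != table and candidate.startswith(entity):
                    links.append((table, col, candidate))
    return links


# infer_relationships({"DEPLOYMENTS": ["ID", "ORG_ID"], "ORGANIZATIONS": ["ORG_ID"]})
# -> [("DEPLOYMENTS", "ORG_ID", "ORGANIZATIONS")]
```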
## Command Options

| Option | Effect |
|---|---|
| `/data:init` | Generate `.astro/warehouse.md` |
| `/data:init --refresh` | Regenerate, preserving user edits |
| `/data:init --database HQ` | Only discover the specified database |
| `/data:init --warehouse prod` | Use a specific warehouse from the config |
| `/data:init --global` | Write to `~/.astro/ai/config/` instead |
| `/data:init --no-code` | Skip the codebase search |
## Multi-Warehouse Support

When warehouse.yml defines multiple warehouses:

```yaml
prod:
  type: snowflake
  databases: [HQ, ANALYTICS]
staging:
  type: snowflake
  databases: [HQ_STAGING]
```

Default behavior: discover the first (default) warehouse. Use `--warehouse NAME` to specify which one.

For separate files per warehouse: `--warehouse prod --output warehouse-prod.md`
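A minimal sketch of how the warehouse section might be selected, assuming the config layout shown above; the loading logic is illustrative, not a documented API:

```python
# Hypothetical sketch: pick one warehouse section out of warehouse.yml.
import os
import yaml  # assumes PyYAML is available


def select_warehouse(name=None) -> dict:
    path = os.path.expanduser("~/.astro/ai/config/warehouse.yml")
    with open(path) as f:
        warehouses = yaml.safe_load(f)   # e.g. {"prod": {...}, "staging": {...}}
    if name is None:
        name = next(iter(warehouses))    # default: the first warehouse listed
    return warehouses[name]


# select_warehouse("prod") -> {"type": "snowflake", "databases": ["HQ", "ANALYTICS"]}
```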
### Step 7: Pre-populate Cache

After generating warehouse.md, automatically populate the runtime cache with all Quick Reference entries:

```python
# For each row in the Quick Reference table:
learn_concept(
    concept=row.concept,
    table=row.table,
    key_column=row.key_column,
    date_column=row.date_column,
)
```

This enables instant `lookup_concept()` results without reading warehouse.md.
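A minimal sketch of that loop, assuming the Quick Reference table keeps the four-column layout from the Output Format section; `learn_concept` is the cache tool named above, while the table parsing is illustrative:

```python
# Hypothetical sketch: read Quick Reference rows out of warehouse.md and cache them.
def prepopulate_cache(warehouse_md: str) -> None:
    for line in warehouse_md.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Expect: Concept | Table | Key Column | Date Column; skip header/separator rows
        if len(cells) != 4 or cells[0] in ("Concept", "") or set(cells[0]) <= {"-"}:
            continue
        learn_concept(            # runtime-cache tool referenced in this step
            concept=cells[0],
            table=cells[1],
            key_column=cells[2],
            date_column=cells[3],
        )
```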
### Step 8: Offer CLAUDE.md Integration (Ask User)

Ask the user:

Would you like to add the Quick Reference table to your CLAUDE.md file? This ensures the schema mappings are always in context for data queries, improving accuracy from ~25% to ~100% for complex queries.

Options:

- Yes, add to CLAUDE.md (Recommended) - append the Quick Reference section
- No, skip - use warehouse.md and the cache only

If the user chooses Yes:

- Check whether `.claude/CLAUDE.md` or `CLAUDE.md` exists
- If it exists, append the Quick Reference section without duplicating it (see the sketch below)
- If it does not exist, create `.claude/CLAUDE.md` with just the Quick Reference
Quick Reference section to add:

```markdown
## Data Warehouse Quick Reference

When querying the warehouse, use these table mappings:

| Concept | Table | Key Column | Date Column |
|---|---|---|---|
| {rows from warehouse.md Quick Reference} |

Large tables (always filter by date): {list tables with >100M rows}

Auto-generated by /data:init. Run /data:init --refresh to update.
```
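A minimal sketch of the append logic, assuming the section heading above doubles as the duplicate check; the file-handling details are assumptions:

```python
# Hypothetical sketch: append the Quick Reference section to CLAUDE.md without duplicating it.
import os

HEADER = "## Data Warehouse Quick Reference"


def add_to_claude_md(section: str) -> None:
    # Prefer .claude/CLAUDE.md, fall back to CLAUDE.md, create .claude/CLAUDE.md otherwise
    path = next((p for p in (".claude/CLAUDE.md", "CLAUDE.md") if os.path.exists(p)), None)
    if path is None:
        os.makedirs(".claude", exist_ok=True)
        path = ".claude/CLAUDE.md"
        existing = ""
    else:
        existing = open(path).read()
    if HEADER in existing:
        return  # section already present; avoid duplicates
    with open(path, "a") as f:
        f.write("\n" + section + "\n")
```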
## After Generation

Tell the user:

```text
Generated .astro/warehouse.md

Summary:
- {N} databases
- {N} schemas
- {N} tables
- {N} columns
- {N} tables enriched with code descriptions
- {N} concepts cached for instant lookup

You can now:
- Edit .astro/warehouse.md to add business context
- Fill in the Quick Reference table with concept mappings
- Commit it to your repo for team sharing
- Run /data:init --refresh when the schema changes
```
## Refresh Behavior

When `--refresh` is specified (see the sketch after this list):

- Read the existing warehouse.md
- Preserve all HTML comments (`<!-- ... -->`)
- Preserve Quick Reference table entries (user-added)
- Preserve user-added descriptions
- Update row counts and add new tables
- Mark removed tables with a `<!-- REMOVED -->` comment
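A minimal sketch of the preservation pass, assuming user edits can be recognized as HTML comments and table rows missing from the regenerated file; the matching rules here are illustrative assumptions, not the command's actual diff logic:

```python
# Hypothetical sketch: carry user-added comments and table rows from the existing
# warehouse.md into a freshly regenerated version.
def preserve_user_edits(old_md: str, new_md: str) -> str:
    new_lines = set(new_md.splitlines())
    carried = [
        line for line in old_md.splitlines()
        if ("<!--" in line or line.startswith("| "))  # comments and table rows
        and line not in new_lines                     # only content the regeneration lost
    ]
    if not carried:
        return new_md
    # Simplest possible placement: append under a marker for the user to re-home
    return new_md + "\n\n<!-- Preserved from previous version -->\n" + "\n".join(carried) + "\n"
```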
## Cache Staleness & Schema Drift

The runtime cache has a 7-day TTL by default. After 7 days, cached entries expire and will be re-discovered on next use.

### When to Refresh

Run `/data:init --refresh` when:

- Schema changes: tables are added, renamed, or removed
- Column changes: new columns are added or types change
- After deployments: your data pipeline deploys schema migrations
- Weekly: as a good practice, even if there are no known changes
### Signs of Stale Cache

Watch for these indicators:

- Queries fail with "table not found" errors
- Results seem wrong or outdated
- New tables aren't being discovered
### Manual Cache Reset

If you suspect cache issues:

```python
# Check cache status
cache_status()

# Clear stale entries (older than 7 days)
clear_cache(cache_type="all", purge_stale_only=True)

# Full reset
clear_cache(cache_type="all")
```

Then run `/data:init --refresh` to repopulate.
## Codebase Patterns Recognized

| Pattern | Source | What We Extract |
|---|---|---|
| `**/models/**/*.yml` | dbt | Table/column descriptions, tests |
| `**/schema.yml` | dbt | Table relationships |
| `**/dags/**/*.sql` | gusty | YAML frontmatter (description, primary_key) |
| `AGENTS.md`, `CLAUDE.md` | docs | Data layer hierarchy, conventions |
| `**/docs/**/*.md` | docs | Business context |
## Example Session

```text
User: /data:init

Agent:
→ Reading warehouse configuration...
→ Found 1 warehouse with databases: HQ, PRODUCT

→ Searching codebase for data documentation...
  Found: AGENTS.md with data layer hierarchy
  Found: 45 SQL files with YAML frontmatter in dags/declarative/

→ Launching parallel warehouse discovery...
  [Database: HQ] Discovering schemas...
  [Database: PRODUCT] Discovering schemas...

→ HQ: Found 29 schemas, 401 tables
→ PRODUCT: Found 1 schema, 0 tables

→ Merging warehouse metadata with code context...
  Enriched 45 tables with descriptions from code

→ Generated .astro/warehouse.md

Summary:
- 2 databases
- 30 schemas
- 401 tables
- 45 tables enriched with code descriptions
- 8 large tables flagged (>100M rows)

Next steps:
- Review .astro/warehouse.md
- Add concept mappings to Quick Reference
- Commit to version control
- Run /data:init --refresh when schema changes
```