Feature Store Designer
Design and audit feature store configurations for production ML systems. Reviews entity definitions, feature views, data sources, materialization pipelines, online/offline serving, freshness policies, and training-serving skew risks. Works with Feast, Tecton, or custom architectures. Acts as a senior ML platform engineer designing your feature infrastructure.
Usage
Basic: Design a feature store for our recommendation system
Focused: Check entity definitions for consistency | Analyze materialization efficiency | Review online serving latency | Detect training-serving skew
How It Works
Step 1: Discover Feature Store Configuration
find /path/to/project -name "feature_store.yaml" -type f
find /path/to/project -name "*.py" | xargs grep -l "FeatureView\|Entity\|FeatureService"
find /path/to/project -name "*.yaml" -path "*/features/*"
Parses entities, feature views, feature services, data sources, on-demand views, materialization config, online/offline store setup, and registry settings.
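The discovery step can be sketched as a small repository scanner. This is an illustrative helper, not part of the Feast API; `FEAST_CLASSES` and both function names are assumptions:

```python
import re
from pathlib import Path

# Feast definition classes the audit looks for (assumed set; extend as needed).
FEAST_CLASSES = ("Entity", "FeatureView", "FeatureService", "OnDemandFeatureView")

def find_feast_definitions(source: str) -> dict:
    """Count constructor/decorator uses of Feast definition classes in source text."""
    counts = {}
    for cls in FEAST_CLASSES:
        # Match the class name immediately followed by an opening parenthesis.
        hits = re.findall(rf"\b{cls}\s*\(", source)
        if hits:
            counts[cls] = len(hits)
    return counts

def scan_repo(root: str) -> dict:
    """Map each .py file under the repo root to the Feast definitions it contains."""
    results = {}
    for path in Path(root).rglob("*.py"):
        counts = find_feast_definitions(path.read_text(errors="ignore"))
        if counts:
            results[str(path)] = counts
    return results
```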
Step 2: Audit Entity Definitions
5 entities defined
PASS: user_id (INT64) — consistent type across 8 feature views
FAIL: merchant_id — mixed types (STRING in features, INT in risk scoring)
RISK: Type mismatch causes lookup failures at serving time
FIX: Standardize on STRING, remove cast in merchant_risk.py
FAIL: transaction_id — 500M+ records as entity
Online store must serve 500M+ keys — memory/cost explosion
FIX: Pre-aggregate to user_id level for online serving
FAIL: Entity naming inconsistency: "user_id" vs "merchantId"
FIX: Standardize on snake_case for all entities
Cardinality Summary:
user_id: 10M (manageable) | merchant_id: 500K (low)
transaction_id: 500M (CRITICAL) | session_id: 50M/day (HIGH, TTL required)
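The checks above can be sketched as a single entity auditor; the thresholds and helper name are illustrative, not Feast limits:

```python
import re

def audit_entity(name: str, cardinality: int, online: bool = True) -> list:
    """Flag naming-convention and cardinality problems for one entity.
    Thresholds are illustrative defaults."""
    findings = []
    if not re.fullmatch(r"[a-z][a-z0-9_]*", name):
        findings.append(f"FAIL: '{name}' is not snake_case")
    if online and cardinality > 100_000_000:
        findings.append(f"FAIL: {cardinality:,} online keys — pre-aggregate or drop from online store")
    elif online and cardinality > 10_000_000:
        findings.append(f"WARN: {cardinality:,} online keys — set a TTL and watch memory")
    return findings
```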
Step 3: Review Feature Views
12 feature views + 3 on-demand + 1 stream
user_transaction_features (8 features, user_id, BigQuery source, TTL=24h)
PASS: Reasonable count, TTL set, tags assigned
WARN: "is_high_value_user" is derived business logic
FIX: Move to on-demand feature view (threshold changes without re-materialization)
user_profile_features (15 features, user_id, PostgreSQL source)
FAIL: Too many features — materialization is all-or-nothing
FIX: Split into user_demographics (5), user_preferences (5), user_account (5)
FAIL: "raw_address" is free-text — anti-pattern in feature stores
FIX: Extract address_country, address_state, address_zip_prefix
merchant_risk_scores (4 features, merchant_id)
FAIL: No data source configured — cannot materialize
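The on-demand recommendation for derived business logic looks like this in spirit. In Feast the transform would live in an on-demand feature view; here it is shown as a plain function, and the threshold value is an assumed example, not taken from the audited repo:

```python
# Request-time derivation of a business-logic flag: the threshold can change
# without re-running any materialization job.
HIGH_VALUE_THRESHOLD = 1_000.0  # assumed example value

def is_high_value_user(total_spend_30d: float,
                       threshold: float = HIGH_VALUE_THRESHOLD) -> bool:
    """Derive the flag at serving time from a stored aggregate feature."""
    return total_spend_30d >= threshold
```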
Step 4: Analyze Feature Services
fraud_detection_v2: 21 features from 4 views (batch + on-demand)
PASS: Reasonable count, good mix of feature types
recommendation_engine: 24 features including 512-dim embedding vector
FAIL: 512 floats per lookup at 10K QPS = ~20 MB/s bandwidth
FIX: Pre-compute dot products or use ANN index instead
FAIL: 4 feature views not in any service — wasted materialization
orphan_view_1, legacy_features, test_features...
FIX: Remove or document purpose
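The bandwidth figure and the orphan check above can be reproduced with two small helpers (names are illustrative):

```python
def vector_bandwidth_mb_s(dims: int, bytes_per_value: int, qps: int) -> float:
    """Bandwidth needed to serve a dense vector feature at a given QPS."""
    return dims * bytes_per_value * qps / 1_000_000

def orphan_views(all_views: set, services: dict) -> set:
    """Feature views not referenced by any feature service."""
    used = set()
    for views in services.values():
        used |= set(views)
    return set(all_views) - used

# 512-dim float32 embedding at 10K QPS:
bw = vector_bandwidth_mb_s(512, 4, 10_000)  # 20.48 MB/s
```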
Step 5: Review Data Sources and Freshness
FAIL: PostgreSQL source points at production DB
Materialization queries will degrade user-facing app
FIX: Use read replica or data warehouse copy
WARN: Kafka source has no dead letter queue
Malformed events will crash stream processor
Freshness Analysis:
| Feature View | TTL | Materialization | Effective Lag |
| --- | --- | --- | --- |
| user_transaction_features | 24h | Daily 2 AM | Up to 26 hours |
| user_profile_features | 72h | Weekly | Up to 7+ days |
| merchant_risk_scores | 12h | Every 6 hours | Up to 18 hours |
FAIL: user_profile_features — 7-day effective lag
User updates profile, models see old data for a week
FIX: Daily materialization or CDC stream
WARN: Materialization at 2 AM but upstream loads at 3 AM
Features always 1 day behind. FIX: Schedule after upstream (4 AM)
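The effective-lag figures follow a simple worst-case model: one full schedule interval plus the gap between the data cutoff and the job run. A sketch, assuming that model:

```python
def worst_case_lag_hours(schedule_interval_h: float, cutoff_offset_h: float) -> float:
    """Worst-case staleness: a record landing just after a data cutoff waits one
    full interval to be materialized, plus the cutoff-to-run offset."""
    return schedule_interval_h + cutoff_offset_h

# Daily job at 2 AM over data cut at midnight: 24 + 2 = 26 hours, as in the table.
```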
Step 6: Audit Online/Offline Stores
Offline: BigQuery (PASS — good PIT join support)
Online: Redis (single instance)
FAIL: No replication — SPOF for all ML models
FIX: Redis Sentinel or Redis Cluster for HA
FAIL: No maxmemory configured
10M users * 21 features = ~1.7 GB minimum, growing unbounded
FIX: Set maxmemory 8gb, policy allkeys-lru
WARN: No read-through cache — add a 5-min app-level cache to reduce Redis load by ~80%
Registry: SQL (PostgreSQL) — PASS
WARN: No access controls — any team can modify any definition
WARN: No staging environment for testing new feature views
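The ~1.7 GB floor in the Redis finding is a value-bytes-only estimate; real Redis per-key overhead is substantially higher, so treat it as a lower bound. A sketch (helper name is illustrative):

```python
def online_store_floor_gb(entities: int, features: int, bytes_per_value: int = 8) -> float:
    """Lower-bound online-store memory: raw value bytes only, ignoring key
    names, Redis data-structure overhead, and replication."""
    return entities * features * bytes_per_value / 1e9

# 10M users x 21 features x 8 bytes ≈ 1.68 GB
```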
Step 7: Detect Training-Serving Skew
FAIL: Feature computation differs between training and serving
Training: SQL in notebook with different timestamp handling
Serving: Feast materialization job with different SQL
FIX: Use Feast get_historical_features() for training data
FAIL: On-demand "time_since_last_transaction"
Training: computed as days (float). Serving: seconds (int)
86400x scale difference — predictions will be wrong
FIX: Standardize unit in transform
WARN: "weekend_transaction_ratio" — training uses UTC, serving uses local TZ
WARN: 2 features have >5% NULL rate online but model trained on complete data
FIX: Add default values via feature_view.with_default_values()
Skew parity score: 40/100
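Unit mismatches like the days-vs-seconds bug above can be caught by comparing the same feature computed through both paths for a sample of entities. A minimal sketch (function name is an assumption):

```python
import statistics

def unit_skew_ratio(training: list, serving: list) -> float:
    """Median serving/training ratio for the same entities; a ratio far from 1
    suggests a unit mismatch (e.g. 86400 for days vs seconds)."""
    ratios = [s / t for t, s in zip(training, serving) if t]
    return statistics.median(ratios)

# "time_since_last_transaction" in days (training) vs seconds (serving):
r = unit_skew_ratio([1.0, 2.5, 10.0], [86400.0, 216000.0, 864000.0])  # 86400.0
```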
Step 8: Review Materialization
FAIL: Full materialization every cycle
10M users * 12 views = 120M computations/day
FIX: feast materialize-incremental — process only new/changed entities
FAIL: No materialization monitoring
Stale features served silently until TTL expires
FIX: Alert on job failure, >2x duration, freshness exceeding threshold
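The three alert conditions in the FIX line can be encoded as a check run after each materialization job (thresholds and names are illustrative):

```python
def materialization_alerts(duration_s: float, baseline_s: float,
                           hours_since_success: float, freshness_sla_h: float,
                           failed: bool = False) -> list:
    """Evaluate the alert conditions after each materialization run."""
    alerts = []
    if failed:
        alerts.append("job failed")
    if duration_s > 2 * baseline_s:
        alerts.append("duration exceeds 2x baseline")
    if hours_since_success > freshness_sla_h:
        alerts.append("freshness threshold exceeded")
    return alerts
```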
Cost breakdown:
user_transaction_features: 45 min, $2.30/run
session_features: 2h 10m, $8.50/run (72% of total cost)
Daily total: $11.85 | Monthly: ~$355
FIX: Stream processing for session_features to reduce cost
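The monthly projection is straight multiplication of per-view daily costs; a one-liner for tracking it (names assumed, figures from the breakdown above):

```python
def projected_monthly_cost(daily_costs: dict, days: int = 30) -> float:
    """Project monthly spend from per-view daily materialization costs."""
    return sum(daily_costs.values()) * days

# $11.85/day across all views -> ~$355/month, matching the estimate above.
monthly = projected_monthly_cost({"all_views": 11.85})
```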
Step 9: Final Report
# Feature Store Design Report
## Overall Health Score: 52/100
- Entity design: 6/10
- Feature views: 5/10
- Feature services: 6/10
- Data sources: 4/10
- Freshness: 5/10
- Online store: 3/10
- Training-serving parity: 4/10
- Materialization: 4/10
## Critical Issues
1. Training-serving skew — different computation paths
2. Single Redis instance — SPOF for all online serving
3. Materialization hitting production database
4. No materialization failure monitoring
5. Feature unit mismatch (days vs seconds)
## High Priority
6. 7-day staleness for user profiles
7. 500M entity cardinality for transactions
8. No incremental materialization
9. Embedding vectors in feature store (wrong abstraction)
10. 4 orphan feature views wasting compute
Output
- Entity audit with cardinality analysis and type consistency checks
- Feature view analysis covering size, composition, anti-patterns
- Service review for coverage, orphans, serving efficiency
- Freshness analysis with TTL, schedule alignment, effective lag
- Skew detection between training and serving computation
- Infrastructure review for online/offline store configuration and HA
- Cost analysis for materialization compute and storage
- Health score 0-100 with per-category breakdown
Tips for Best Results
- Point the agent at your Feast feature repository root
- Share model training notebooks to detect training-serving skew
- Provide online serving latency requirements for capacity analysis
- Run when designing a new feature store or onboarding a new model
- Combine with mlops-experiment-tracker for full ML platform audit