cm-feature-store-designer

Design and audit feature store configurations for ML systems. Reviews entity definitions, feature views, materialization pipelines, online/offline serving setup, feature freshness, data sources, and registry organization. Works with Feast, Tecton, or custom feature store architectures. Use when asked to design a feature store, audit feature definitions, review materialization strategy, check feature serving, optimize feature pipelines, or plan feature store architecture. Triggers on "feature store", "feast", "tecton", "feature view", "feature engineering", "online features", "offline features", "materialization", "feature serving", "entity", "feature registry", "ml features", "feature pipeline".

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy the command below and send it to your AI assistant to install this skill:

Install skill "cm-feature-store-designer" with this command: npx skills add charlie-morrison/feature-store-designer

Feature Store Designer

Design and audit feature store configurations for production ML systems. Reviews entity definitions, feature views, data sources, materialization pipelines, online/offline serving, freshness policies, and training-serving skew risks. Works with Feast, Tecton, or custom architectures. Acts as a senior ML platform engineer designing your feature infrastructure.

Usage

Basic: Design a feature store for our recommendation system
Focused: Check entity definitions for consistency | Analyze materialization efficiency | Review online serving latency | Detect training-serving skew

How It Works

Step 1: Discover Feature Store Configuration

find /path/to/project -name "feature_store.yaml" -type f
find /path/to/project -name "*.py" | xargs grep -l "FeatureView\|Entity\|FeatureService"
find /path/to/project -name "*.yaml" -path "*/features/*"

Parses entities, feature views, feature services, data sources, on-demand views, materialization config, online/offline store setup, and registry settings.

Step 2: Audit Entity Definitions

  5 entities defined

  PASS: user_id (INT64) — consistent type across 8 feature views
  FAIL: merchant_id — mixed types (STRING in features, INT in risk scoring)
    RISK: Type mismatch causes lookup failures at serving time
    FIX: Standardize on STRING, remove cast in merchant_risk.py

  FAIL: transaction_id — 500M+ records as entity
    Online store must serve 500M+ keys — memory/cost explosion
    FIX: Pre-aggregate to user_id level for online serving

  FAIL: Entity naming inconsistency: "user_id" vs "merchantId"
    FIX: Standardize on snake_case for all entities

  Cardinality Summary:
    user_id: 10M (manageable) | merchant_id: 500K (low)
    transaction_id: 500M (CRITICAL) | session_id: 50M/day (HIGH, TTL required)
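The type-consistency check above can be sketched as a plain-Python pass over parsed definitions. The view and entity names below are illustrative stand-ins, not read from any real repo:

```python
from collections import defaultdict

# Illustrative parsed definitions: each feature view maps its join key
# (entity) to the dtype it declares for that entity.
feature_views = {
    "user_transaction_features": {"user_id": "INT64"},
    "user_profile_features": {"user_id": "INT64"},
    "merchant_features": {"merchant_id": "STRING"},
    "merchant_risk_scores": {"merchant_id": "INT64"},  # dtype mismatch
}

def audit_entity_types(views):
    """Return entities whose declared dtype differs across feature views."""
    seen = defaultdict(set)
    for view, keys in views.items():
        for entity, dtype in keys.items():
            seen[entity].add(dtype)
    return {e: sorted(t) for e, t in seen.items() if len(t) > 1}

print(audit_entity_types(feature_views))
# {'merchant_id': ['INT64', 'STRING']}
```

Any entity that appears in the result is a candidate for the kind of lookup failure flagged above.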

Step 3: Review Feature Views

  12 feature views + 3 on-demand + 1 stream

  user_transaction_features (8 features, user_id, BigQuery source, TTL=24h)
    PASS: Reasonable count, TTL set, tags assigned
    WARN: "is_high_value_user" is derived business logic
      FIX: Move to on-demand feature view (threshold changes without re-materialization)

  user_profile_features (15 features, user_id, PostgreSQL source)
    FAIL: Too many features — materialization is all-or-nothing
      FIX: Split into user_demographics (5), user_preferences (5), user_account (5)
    FAIL: "raw_address" is free-text — anti-pattern in feature stores
      FIX: Extract address_country, address_state, address_zip_prefix

  merchant_risk_scores (4 features, merchant_id)
    FAIL: No data source configured — cannot materialize
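Moving derived business logic like "is_high_value_user" into a request-time transform means the threshold lives in code rather than in materialized rows, so it can change without re-materialization. A minimal sketch, with a hypothetical threshold (in Feast this logic would sit inside an on-demand feature view transform):

```python
# Hypothetical business threshold; changing it requires only a code
# deploy, not re-materializing the backing feature view.
HIGH_VALUE_THRESHOLD = 1_000.0

def is_high_value_user(total_spend_90d: float) -> bool:
    """Derived flag computed at serving time from a materialized feature."""
    return total_spend_90d >= HIGH_VALUE_THRESHOLD

print(is_high_value_user(1500.0))  # True
print(is_high_value_user(250.0))   # False
```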

Step 4: Analyze Feature Services

  fraud_detection_v2: 21 features from 4 views (batch + on-demand)
    PASS: Reasonable count, good mix of feature types

  recommendation_engine: 24 features including 512-dim embedding vector
    FAIL: 512 floats per lookup at 10K QPS = ~20 MB/s bandwidth
      FIX: Pre-compute dot products or use ANN index instead

  FAIL: 4 feature views not in any service — wasted materialization
    orphan_view_1, legacy_features, test_features...
    FIX: Remove or document purpose
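Orphan detection reduces to set arithmetic over the registry: any view not referenced by some feature service is materialized for nothing. A sketch with illustrative names:

```python
# Illustrative registry contents.
all_views = {"user_transaction_features", "user_profile_features",
             "merchant_features", "orphan_view_1", "legacy_features"}
services = {
    "fraud_detection_v2": {"user_transaction_features", "merchant_features"},
    "recommendation_engine": {"user_profile_features"},
}

def find_orphan_views(views, services):
    """Feature views that are materialized but never served by any service."""
    referenced = set().union(*services.values())
    return sorted(views - referenced)

print(find_orphan_views(all_views, services))
# ['legacy_features', 'orphan_view_1']
```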

Step 5: Review Data Sources and Freshness

  FAIL: PostgreSQL source points at production DB
    Materialization queries will degrade user-facing app
    FIX: Use read replica or data warehouse copy

  WARN: Kafka source has no dead letter queue
    Malformed events will crash stream processor

  Freshness Analysis:
  Feature View              | TTL | Materialization | Effective Lag
    user_transaction_features | 24h | Daily 2 AM      | Up to 26 hours
    user_profile_features     | 72h | Weekly          | Up to 7+ days
    merchant_risk_scores      | 12h | Every 6 hours   | Up to 18 hours

  FAIL: user_profile_features — 7-day effective lag
    User updates profile, models see old data for a week
    FIX: Daily materialization or CDC stream

  WARN: Materialization at 2 AM but upstream loads at 3 AM
    Features always 1 day behind. FIX: Schedule after upstream (4 AM)
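The effective-lag column above follows a simple rule of thumb: just before the next run, the newest row in the online store is one full materialization interval plus the gap between the job's data cutoff and its start time old. A sketch of that arithmetic:

```python
from datetime import timedelta

def worst_case_lag(interval: timedelta, cutoff_gap: timedelta) -> timedelta:
    """Worst-case staleness of online features: one full materialization
    interval plus the lag between the job's data cutoff and its start."""
    return interval + cutoff_gap

# Daily job at 2 AM over data through midnight -> up to 26 hours,
# matching the user_transaction_features row above.
lag = worst_case_lag(timedelta(hours=24), timedelta(hours=2))
print(lag)  # 1 day, 2:00:00
```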

Step 6: Audit Online/Offline Stores

  Offline: BigQuery (PASS — good PIT join support)

  Online: Redis (single instance)
    FAIL: No replication — SPOF for all ML models
      FIX: Redis Sentinel or Redis Cluster for HA
    FAIL: No maxmemory configured
      10M users * 21 features = ~1.7 GB minimum, growing unbounded
      FIX: Set maxmemory 8gb, policy allkeys-lru
    WARN: No read-through cache — add 5-min app cache to reduce load 80%

  Registry: SQL (PostgreSQL) — PASS
    WARN: No access controls — any team can modify any definition
    WARN: No staging environment for testing new feature views
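The ~1.7 GB Redis floor cited above comes from a back-of-envelope sizing: entities times features times bytes per value, before any key or protocol overhead. A sketch:

```python
def redis_memory_estimate(entities: int, features: int,
                          bytes_per_value: int = 8) -> float:
    """Lower-bound online store footprint in GB. Ignores key names and
    Redis per-entry overhead, so real usage will be higher."""
    return entities * features * bytes_per_value / 1e9

# 10M users x 21 features -> ~1.7 GB floor, matching the audit above.
print(round(redis_memory_estimate(10_000_000, 21), 2))  # 1.68
```

This floor is why an unbounded maxmemory is dangerous: growth in entities or features scales the footprint linearly with no backstop.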

Step 7: Detect Training-Serving Skew

  FAIL: Feature computation differs between training and serving
    Training: SQL in notebook with different timestamp handling
    Serving: Feast materialization job with different SQL
    FIX: Use Feast get_historical_features() for training data

  FAIL: On-demand "time_since_last_transaction"
    Training: computed as days (float). Serving: seconds (int)
    86400x scale difference — predictions will be wrong
    FIX: Standardize unit in transform

  WARN: "weekend_transaction_ratio" — training uses UTC, serving uses local TZ
  WARN: 2 features have >5% NULL rate online but model trained on complete data
    FIX: Add default values via feature_view.with_default_values()

  Skew parity score: 40/100
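A cheap parity probe for unit mismatches like the days-vs-seconds bug is to compare the scale of the same feature sampled from the training set and from online serving; the ratio of medians should be near 1.0. A sketch with made-up sample values:

```python
import statistics

def unit_scale_ratio(training_values, serving_values):
    """Rough parity check: ratio of medians should be ~1.0 when training
    and serving compute a feature in the same unit."""
    return statistics.median(serving_values) / statistics.median(training_values)

# time_since_last_transaction: training in days (float), serving in seconds (int).
training = [0.5, 1.0, 2.0]
serving = [43200, 86400, 172800]
print(unit_scale_ratio(training, serving))
# 86400.0 -> flags the days-vs-seconds mismatch
```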

Step 8: Review Materialization

  FAIL: Full materialization every cycle
    10M users * 12 views = 120M computations/day
    FIX: feast materialize-incremental — process only new/changed entities

  FAIL: No materialization monitoring
    Stale features served silently until TTL expires
    FIX: Alert on job failure, >2x duration, freshness exceeding threshold

  Cost breakdown:
    user_transaction_features: 45 min, $2.30/run
    session_features: 2h 10m, $8.50/run (72% of total cost)
    Daily total: $11.85 | Monthly: ~$355
    FIX: Stream processing for session_features to reduce cost
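Incremental materialization boils down to keeping a per-view watermark and processing only rows newer than it, which is loosely what `feast materialize-incremental` does under the hood. A dependency-free sketch with illustrative data:

```python
from datetime import datetime, timezone

# Illustrative watermark store: last successful materialization per view.
watermarks = {
    "user_transaction_features": datetime(2024, 1, 1, tzinfo=timezone.utc),
}

def rows_to_materialize(rows, view):
    """Incremental pass: keep only rows newer than the view's watermark."""
    since = watermarks.get(view, datetime.min.replace(tzinfo=timezone.utc))
    return [r for r in rows if r["event_ts"] > since]

rows = [
    {"user_id": 1, "event_ts": datetime(2023, 12, 31, tzinfo=timezone.utc)},
    {"user_id": 2, "event_ts": datetime(2024, 1, 2, tzinfo=timezone.utc)},
]
print(rows_to_materialize(rows, "user_transaction_features"))
# only user_id 2 survives -- the pre-watermark row is skipped
```

Advancing the watermark only after a successful run is what keeps failures from silently dropping data.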

Step 9: Final Report

# Feature Store Design Report

## Overall Health Score: 52/100
  Entity design: 6/10        Feature views: 5/10
  Feature services: 6/10     Data sources: 4/10
  Freshness: 5/10            Online store: 3/10
  Training-serving parity: 4/10  Materialization: 4/10

## Critical Issues
  1. Training-serving skew — different computation paths
  2. Single Redis instance — SPOF for all online serving
  3. Materialization hitting production database
  4. No materialization failure monitoring
  5. Feature unit mismatch (days vs seconds)

## High Priority
  6. 7-day staleness for user profiles
  7. 500M entity cardinality for transactions
  8. No incremental materialization
  9. Embedding vectors in feature store (wrong abstraction)
  10. 4 orphan feature views wasting compute

Output

  • Entity audit with cardinality analysis and type consistency checks
  • Feature view analysis covering size, composition, anti-patterns
  • Service review for coverage, orphans, serving efficiency
  • Freshness analysis with TTL, schedule alignment, effective lag
  • Skew detection between training and serving computation
  • Infrastructure review for online/offline store configuration and HA
  • Cost analysis for materialization compute and storage
  • Health score 0-100 with per-category breakdown

Tips for Best Results

  • Point the agent at your Feast feature repository root
  • Share model training notebooks to detect training-serving skew
  • Provide online serving latency requirements for capacity analysis
  • Run when designing a new feature store or onboarding a new model
  • Combine with mlops-experiment-tracker for full ML platform audit

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
