MLOps Patterns
Operationalize machine learning models from experimentation to production deployment and monitoring.
Purpose
Provide strategic guidance for ML engineers and platform teams to build production-grade ML infrastructure. Cover the complete lifecycle: experiment tracking, model registry, feature stores, deployment patterns, pipeline orchestration, and monitoring.
When to Use This Skill
Use this skill when:
- Designing MLOps infrastructure for production ML systems
- Selecting experiment tracking platforms (MLflow, Weights & Biases, Neptune)
- Implementing feature stores for online/offline feature serving
- Choosing model serving solutions (Seldon Core, KServe, BentoML, TorchServe)
- Building ML pipelines for training, evaluation, and deployment
- Setting up model monitoring and drift detection
- Establishing model governance and compliance frameworks
- Optimizing ML inference costs and performance
- Migrating from notebooks to production ML systems
- Implementing continuous training and automated retraining
Core Concepts
1. Experiment Tracking
Track experiments systematically to ensure reproducibility and collaboration.
Key Components:
- Parameters: Hyperparameters logged for each training run
- Metrics: Performance measures tracked over time (accuracy, loss, F1)
- Artifacts: Model weights, plots, datasets, configuration files
- Metadata: Tags, descriptions, Git commit SHA, environment details
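For example, with MLflow these components map onto a few logging calls. A minimal sketch; the experiment name, values, and file paths are illustrative:

```python
import mlflow

mlflow.set_experiment("churn-model")  # illustrative experiment name

with mlflow.start_run():
    # Parameters: hyperparameters for this run
    mlflow.log_param("learning_rate", 0.01)
    mlflow.log_param("max_depth", 6)
    # Metrics: performance measures (can be logged per step/epoch)
    mlflow.log_metric("f1", 0.87)
    # Artifacts: any file -- plots, configs, serialized models
    mlflow.log_artifact("confusion_matrix.png")
    # Metadata: tags such as the Git commit SHA
    mlflow.set_tag("git_sha", "abc1234")
```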
Platform Comparison:
MLflow (Open-source standard):
- Framework-agnostic (PyTorch, TensorFlow, scikit-learn, XGBoost)
- Self-hosted or cloud-agnostic deployment
- Integrated model registry
- Basic UI, adequate for most use cases
- Free, requires infrastructure management
Weights & Biases (SaaS, collaboration-focused):
- Advanced visualization and dashboards
- Integrated hyperparameter optimization (Sweeps)
- Excellent team collaboration features
- SaaS pricing scales with usage
- Best-in-class UI
Neptune.ai (Enterprise-grade):
- Enterprise features (RBAC, audit logs, compliance)
- Integrated production monitoring
- Higher cost than W&B
- Good for regulated industries
Selection Criteria:
- Open-source requirement → MLflow
- Team collaboration critical → Weights & Biases
- Enterprise compliance (RBAC, audits) → Neptune.ai
- Hyperparameter optimization primary → Weights & Biases (Sweeps)
For detailed comparison and decision framework, see references/experiment-tracking.md.
2. Model Registry and Versioning
Centralize model artifacts with version control and stage management.
Model Registry Components:
- Model artifacts (weights, serialized models)
- Training metrics (accuracy, F1, AUC)
- Hyperparameters used during training
- Training dataset version
- Feature schema (input/output signatures)
- Model cards (documentation, use cases, limitations)
Stage Management:
- None: Newly registered model
- Staging: Testing in pre-production environment
- Production: Serving live traffic
- Archived: Deprecated, retained for compliance
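A sketch of stage management with the MLflow registry; the run ID and model name are placeholders, and newer MLflow releases favor aliases over stages, though the stage API remains widely used:

```python
import mlflow
from mlflow.tracking import MlflowClient

run_id = "..."  # placeholder: ID of the training run that logged the model
version = mlflow.register_model(f"runs:/{run_id}/model", "churn-model")

# Newly registered versions start in stage "None"; promote after testing
MlflowClient().transition_model_version_stage(
    name="churn-model", version=version.version, stage="Staging"
)
```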
Versioning Strategies:
Semantic Versioning for Models:
- Major version (v2.0.0): Breaking change in input/output schema
- Minor version (v1.1.0): New feature, backward-compatible
- Patch version (v1.0.1): Bug fix, model retrained on new data
Git-Based Versioning:
- Model code in Git (training scripts, configuration)
- Model weights in DVC (Data Version Control) or Git-LFS
- Reproducibility via commit SHA + data version hash
For model lineage tracking and registry patterns, see references/model-registry.md.
3. Feature Stores
Centralize feature engineering to ensure consistency between training and inference.
Problem Addressed: Training/serving skew
- Training: Features computed offline in batch, often with accidental access to future data (leakage)
- Inference: Features computed online through a different code path, with only past data available
- Result: Model performs well in training and evaluation but degrades in production
Feature Store Solution:
Online Feature Store:
- Purpose: Low-latency feature retrieval for real-time inference
- Storage: Redis, DynamoDB, Cassandra (key-value stores)
- Latency: Sub-10ms for feature lookup
- Use Case: Real-time predictions (fraud detection, recommendations)
Offline Feature Store:
- Purpose: Historical feature data for training and batch inference
- Storage: Parquet files (S3/GCS), data warehouses (Snowflake, BigQuery)
- Latency: Seconds to minutes (batch retrieval)
- Use Case: Model training, backtesting, batch predictions
Point-in-Time Correctness:
- Ensures no future data leakage during training
- Feature values at time T only use data available before time T
- Critical for avoiding overly optimistic training metrics
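A minimal Feast sketch of both retrieval paths; the feature view and entity names are hypothetical, and an initialized feature repository is assumed:

```python
from datetime import datetime

import pandas as pd
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # assumes an initialized Feast repo

# Offline / training: point-in-time join -- each row only sees feature
# values that existed at its event_timestamp (no future leakage).
# "user_stats" is a hypothetical feature view.
entity_df = pd.DataFrame({
    "user_id": [1001, 1002],
    "event_timestamp": [datetime(2024, 1, 15), datetime(2024, 2, 1)],
})
training_df = store.get_historical_features(
    entity_df=entity_df,
    features=["user_stats:txn_count_7d", "user_stats:avg_txn_amount"],
).to_df()

# Online / inference: low-latency lookup of the latest feature values
online_features = store.get_online_features(
    features=["user_stats:txn_count_7d", "user_stats:avg_txn_amount"],
    entity_rows=[{"user_id": 1001}],
).to_dict()
```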
Platform Comparison:
Feast (Open-source, cloud-agnostic):
- Most popular open-source feature store
- Supports Redis, DynamoDB, Datastore (online) and Parquet, BigQuery, Snowflake (offline)
- Cloud-agnostic, no vendor lock-in
- Active community, growing adoption
Tecton (Managed, production-grade):
- Feast-compatible API
- Fully managed service
- Integrated monitoring and governance
- Higher cost, enterprise-focused
SageMaker Feature Store (AWS):
- Integrated with AWS ecosystem
- Managed online/offline stores
- AWS lock-in
Databricks Feature Store (Databricks):
- Unity Catalog integration
- Delta Lake for offline storage
- Databricks ecosystem lock-in
Selection Criteria:
- Open-source, cloud-agnostic → Feast
- Managed solution, production-grade → Tecton
- AWS ecosystem → SageMaker Feature Store
- Databricks users → Databricks Feature Store
For feature engineering patterns and implementation, see references/feature-stores.md.
4. Model Serving Patterns
Deploy models for synchronous, asynchronous, batch, or streaming inference.
Serving Patterns:
REST API Deployment:
- Pattern: HTTP endpoint for synchronous predictions
- Latency: <100ms acceptable
- Use Case: Request-response applications
- Tools: Flask, FastAPI, BentoML, Seldon Core
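A minimal sketch of the REST pattern with FastAPI; the model path, feature shape, and scikit-learn-style `predict` interface are assumptions:

```python
import pickle

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

with open("model.pkl", "rb") as f:  # hypothetical serialized model
    model = pickle.load(f)

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # Synchronous request-response inference
    prediction = model.predict([req.features])[0]
    return {"prediction": float(prediction)}
```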
gRPC Deployment:
- Pattern: High-performance RPC for low-latency inference
- Latency: <10ms target
- Use Case: Microservices, latency-critical applications
- Tools: TensorFlow Serving, TorchServe, Seldon Core
Batch Inference:
- Pattern: Process large datasets offline
- Latency: Minutes to hours acceptable
- Use Case: Daily/hourly predictions for millions of records
- Tools: Spark, Dask, Ray
Streaming Inference:
- Pattern: Real-time predictions on streaming data
- Latency: Milliseconds
- Use Case: Fraud detection, anomaly detection, real-time recommendations
- Tools: Kafka + Flink/Spark Streaming
Platform Comparison:
Seldon Core (Kubernetes-native, advanced):
- Advanced deployment strategies (canary, A/B testing, multi-armed bandits)
- Multi-framework support
- Integrated explainability (Alibi)
- High complexity, steep learning curve
KServe (CNCF standard):
- Standardized InferenceService API
- Serverless scaling (scale-to-zero with Knative)
- Kubernetes-native
- Growing adoption, CNCF backing
BentoML (Python-first, simplicity):
- Easiest to get started
- Excellent developer experience
- Local testing → cloud deployment
- Lower complexity than Seldon/KServe
TorchServe (PyTorch official):
- PyTorch-specific serving
- Production-grade, optimized for PyTorch models
- Less flexible for multi-framework use
TensorFlow Serving (TensorFlow official):
- TensorFlow-specific serving
- Production-grade, optimized for TensorFlow models
- Less flexible for multi-framework use
Selection Criteria:
- Kubernetes, advanced deployments → Seldon Core or KServe
- Python-first, simplicity → BentoML
- PyTorch-specific → TorchServe
- TensorFlow-specific → TensorFlow Serving
- Managed solution → SageMaker/Vertex AI/Azure ML
For model optimization and serving infrastructure, see references/model-serving.md.
5. Deployment Strategies
Deploy models safely with rollback capabilities.
Blue-Green Deployment:
- Two identical environments (Blue: current, Green: new)
- Deploy to Green, test, switch 100% traffic instantly
- Instant rollback (switch back to Blue)
- Trade-off: Requires 2x infrastructure, all-or-nothing switch
Canary Deployment:
- Gradual rollout to subset of traffic
- Route 5% → 10% → 25% → 50% → 100% over time
- Monitor metrics at each stage, rollback if degradation
- Trade-off: Complex routing logic, longer deployment time
Shadow Deployment:
- New model receives traffic but predictions not used
- Compare new model vs old model offline
- Zero risk to production
- Trade-off: Requires 2x compute, delayed feedback
A/B Testing:
- Split traffic between model versions
- Measure business metrics (conversion rate, revenue)
- Statistical significance testing
- Use Case: Optimize for business outcomes, not just ML metrics
Multi-Armed Bandit (MAB):
- Epsilon-greedy: Explore (try new models) vs Exploit (use best model)
- Thompson Sampling: Bayesian approach to exploration
- Use Case: Continuous optimization, faster convergence than A/B
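An epsilon-greedy router sketched in plain Python; the variant names and reward signal are illustrative, and production systems usually delegate this routing to the serving layer (e.g. Seldon Core):

```python
import random

class EpsilonGreedyRouter:
    def __init__(self, variants, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = {v: 0 for v in variants}
        self.values = {v: 0.0 for v in variants}  # running mean reward per variant

    def choose(self):
        if random.random() < self.epsilon:            # explore
            return random.choice(list(self.counts))
        return max(self.values, key=self.values.get)  # exploit

    def update(self, variant, reward):
        self.counts[variant] += 1
        n = self.counts[variant]
        self.values[variant] += (reward - self.values[variant]) / n

router = EpsilonGreedyRouter(["model_v1", "model_v2"])  # illustrative variants
variant = router.choose()
# ... serve the prediction with `variant`, observe the outcome, then:
router.update(variant, reward=1.0)
```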
Selection Criteria:
- Low-risk model → Blue-green (instant cutover)
- Medium-risk model → Canary (gradual rollout)
- High-risk model → Shadow (test in production, no impact)
- Business optimization → A/B testing or MAB
For deployment architecture and examples, see references/deployment-strategies.md.
6. ML Pipeline Orchestration
Automate training, evaluation, and deployment workflows.
Training Pipeline Stages:
- Data Validation (Great Expectations, schema checks)
- Feature Engineering (transform raw data)
- Data Splitting (train/validation/test)
- Model Training (hyperparameter tuning)
- Model Evaluation (accuracy, fairness, explainability)
- Model Registration (push to registry if metrics pass thresholds)
- Deployment (promote to staging/production)
Continuous Training Pattern:
- Monitor production data for drift
- Detect data distribution changes (KS test, PSI)
- Trigger automated retraining when drift detected
- Validate new model before deployment
- Deploy via canary or shadow strategy
Platform Comparison:
Kubeflow Pipelines (ML-native, Kubernetes):
- ML-specific pipeline orchestration
- Kubernetes-native (scales with K8s)
- Component-based (reusable pipeline steps)
- Integrated with Katib (hyperparameter tuning)
Apache Airflow (Mature, general-purpose):
- Most mature orchestration platform
- Large ecosystem, extensive integrations
- Python-based DAGs
- Not ML-specific but widely used for ML workflows
Metaflow (Netflix, data science-friendly):
- Human-centric design, easy for data scientists
- Excellent local development experience
- Versioning built-in
- Simpler than Kubeflow/Airflow
Prefect (Modern, Python-native):
- Dynamic workflows, not static DAGs
- Better error handling than Airflow
- Modern UI and developer experience
- Growing community
Dagster (Asset-based, testing-focused):
- Asset-based thinking (not just task dependencies)
- Strong testing and data quality features
- Modern approach, good for data teams
- Smaller community than Airflow
Selection Criteria:
- ML-specific, Kubernetes → Kubeflow Pipelines
- Mature, battle-tested → Apache Airflow
- Data scientists, ease of use → Metaflow
- Software engineers, testing → Dagster
- Modern, simpler than Airflow → Prefect
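As one illustration of Python-native orchestration, a Prefect flow wiring together the training stages; task bodies, the returned score, and the data path are placeholders:

```python
from prefect import flow, task

@task
def validate_data(path: str) -> str:
    # e.g. run Great Expectations checks against the raw data
    return path

@task
def train_model(data_path: str) -> float:
    # train the model and return a validation metric
    return 0.91  # placeholder score

@task
def register_if_good(score: float, threshold: float = 0.90) -> None:
    if score >= threshold:
        ...  # push the model to the registry

@flow
def training_pipeline(path: str = "s3://bucket/train.parquet"):  # placeholder path
    data_path = validate_data(path)
    score = train_model(data_path)
    register_if_good(score)

if __name__ == "__main__":
    training_pipeline()
```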
For pipeline architecture and examples, see references/ml-pipelines.md.
7. Model Monitoring and Observability
Monitor production models for drift, performance, and quality.
Data Drift Detection:
- Definition: Input feature distributions change over time
- Impact: Model trained on old distribution, predictions degrade
- Detection Methods:
- Kolmogorov-Smirnov (KS) Test: Compare distributions
- Population Stability Index (PSI): Measure distribution shift
- Chi-Square Test: For categorical features
- Action: Trigger automated retraining when drift detected
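A sketch of both checks with SciPy and NumPy; the p < 0.05 and PSI > 0.2 thresholds are common rules of thumb, not universal constants, and the samples here are synthetic:

```python
import numpy as np
from scipy.stats import ks_2samp

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a baseline and a current sample."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    e_pct = np.clip(e_pct, 1e-6, None)  # avoid log(0) / division by zero
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

baseline = np.random.normal(0.0, 1.0, 10_000)  # synthetic training-time sample
current = np.random.normal(0.3, 1.0, 10_000)   # synthetic production sample

statistic, p_value = ks_2samp(baseline, current)
drift_detected = p_value < 0.05 or psi(baseline, current) > 0.2
```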
Model Drift Detection:
- Definition: Model prediction quality degrades over time
- Impact: Accuracy, precision, recall decrease
- Detection Methods:
- Ground truth accuracy (delayed labels)
- Prediction distribution changes
- Calibration drift (predicted probabilities vs actual outcomes)
- Action: Alert team, trigger retraining
Performance Monitoring:
- Metrics:
- Latency: P50, P95, P99 inference time
- Throughput: Predictions per second
- Error Rate: Failed predictions / total predictions
- Resource Utilization: CPU, memory, GPU usage
- Alerting Thresholds:
- P95 latency > 100ms → Alert
- Error rate > 1% → Alert
- Accuracy drop > 5% → Trigger retraining
Business Metrics Monitoring:
- Downstream impact: Conversion rate, revenue, user satisfaction
- Model predictions → business outcomes correlation
- Use Case: Optimize models for business value, not just ML metrics
Tools:
- Evidently AI: Data drift, model drift, data quality reports
- Prometheus + Grafana: Performance metrics, custom dashboards
- Arize AI: ML observability platform
- Fiddler: Model monitoring and explainability
For monitoring architecture and implementation, see references/model-monitoring.md.
8. Model Optimization Techniques
Reduce model size and inference latency.
Quantization:
- Convert model weights from float32 to int8
- Model size reduction: 4x smaller
- Inference speed: 2-3x faster
- Accuracy impact: Minimal (<1% degradation typically)
- Tools: PyTorch quantization, TensorFlow Lite, ONNX Runtime
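A post-training dynamic quantization sketch with PyTorch; the model here is a placeholder, and real pipelines quantize the trained network:

```python
import torch
import torch.nn as nn

# Placeholder float32 model; any network with nn.Linear layers works
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2)).eval()

# Dynamic quantization: weights stored as int8, activations quantized
# on the fly at inference time
quantized_model = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```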
Model Distillation:
- Train small student model to mimic large teacher model
- Transfer knowledge from a large teacher (e.g., BERT) to a compact student (e.g., DistilBERT)
- Size reduction: 2-10x smaller
- Speed improvement: 2-10x faster
- Use Case: Deploy small model on edge devices, reduce inference cost
ONNX Conversion:
- Convert models to Open Neural Network Exchange (ONNX) format
- Cross-framework compatibility (PyTorch → ONNX → TensorFlow)
- Optimized inference with ONNX Runtime
- Speed improvement: 1.5-3x faster than native framework
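An export-and-run sketch with torch.onnx and ONNX Runtime; the model and file names are placeholders:

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# Placeholder model; in practice export the trained network
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2)).eval()
dummy_input = torch.randn(1, 128)  # example input used to trace the graph

torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["features"], output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}},  # allow variable batch size
)

# Optimized inference through ONNX Runtime
session = ort.InferenceSession("model.onnx")
logits = session.run(None, {"features": np.random.randn(4, 128).astype(np.float32)})
```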
Model Pruning:
- Remove less important weights from neural networks
- Sparsity: 30-90% of weights set to zero
- Size reduction: 2-10x smaller
- Accuracy impact: Minimal with structured pruning
For optimization techniques and examples, see references/model-serving.md.
9. LLMOps Patterns
Operationalize Large Language Models with specialized patterns.
LLM Fine-Tuning Pipelines:
- LoRA (Low-Rank Adaptation): Parameter-efficient fine-tuning
- QLoRA: Quantized LoRA (4-bit quantization)
- Pipeline: Base model → Fine-tuning dataset → LoRA adapters → Merged model
- Tools: Hugging Face PEFT, Axolotl
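A LoRA configuration sketch with Hugging Face PEFT; the base model ID and target modules are illustrative and depend on the architecture being fine-tuned:

```python
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Example base model; substitute the checkpoint you are fine-tuning
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Freeze the base weights and train small low-rank adapter matrices instead
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # architecture-dependent
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of base parameters
```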
Prompt Versioning:
- Version control for prompts (Git, prompt management platforms)
- A/B testing prompts for quality and cost optimization
- Monitoring prompt effectiveness over time
RAG System Monitoring:
- Retrieval quality: Relevance of retrieved documents
- Generation quality: Answer accuracy, hallucination detection
- End-to-end latency: Retrieval + generation time
- Tools: LangSmith, Arize Phoenix
LLM Inference Optimization:
- vLLM: High-throughput LLM serving
- TensorRT-LLM: NVIDIA-optimized LLM inference
- Text Generation Inference (TGI): Hugging Face serving
- Batching: Dynamic batching for throughput
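A minimal vLLM offline-inference sketch; the model ID and prompt are illustrative, and the engine handles continuous batching internally:

```python
from vllm import LLM, SamplingParams

llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")  # example model ID
params = SamplingParams(temperature=0.2, max_tokens=256)

outputs = llm.generate(["Summarize our refund policy in one sentence."], params)
print(outputs[0].outputs[0].text)
```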
Embedding Model Management:
- Version embeddings alongside models
- Monitor embedding drift (distribution changes)
- Update embeddings when underlying model changes
For LLMOps patterns and implementation, see references/llmops-patterns.md.
10. Model Governance and Compliance
Establish governance for model risk management and regulatory compliance.
Model Cards:
- Documentation: Model purpose, training data, performance metrics
- Limitations: Known biases, failure modes, out-of-scope use cases
- Ethical considerations: Fairness, privacy, societal impact
- Template: Model Card Toolkit (Google)
Bias and Fairness Detection:
- Measure disparate impact across demographic groups
- Tools: Fairlearn, AI Fairness 360 (IBM)
- Metrics: Demographic parity, equalized odds, calibration
- Mitigation: Reweighting, adversarial debiasing, threshold optimization
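A Fairlearn sketch of group-wise metrics and demographic parity; the labels, predictions, and sensitive feature below are toy data:

```python
from fairlearn.metrics import MetricFrame, demographic_parity_difference
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]             # toy labels
y_pred = [1, 0, 0, 1, 0, 1]             # toy predictions
group = ["a", "a", "b", "b", "a", "b"]  # toy sensitive attribute

# Accuracy broken down by demographic group
frame = MetricFrame(metrics=accuracy_score, y_true=y_true, y_pred=y_pred,
                    sensitive_features=group)
print(frame.by_group)

# Difference in positive-prediction rates across groups (0 = parity)
print(demographic_parity_difference(y_true, y_pred, sensitive_features=group))
```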
Regulatory Compliance:
- EU AI Act: High-risk AI systems require documentation, monitoring
- Model Risk Management (SR 11-7): Banking industry requirements
- GDPR: Right to explanation for automated decisions
- HIPAA: Healthcare data privacy
Audit Trails:
- Log all model versions, training runs, deployments
- Track who approved model transitions (staging → production)
- Retain historical predictions for compliance audits
- Tools: MLflow, Neptune.ai (audit logs)
For governance frameworks and compliance, see references/governance.md.
Decision Frameworks
Framework 1: Experiment Tracking Platform Selection
Decision Tree:
Start with primary requirement:
- Open-source, self-hosted requirement → MLflow
- Team collaboration, advanced visualization (budget available) → Weights & Biases
- Team collaboration, advanced visualization (no budget) → MLflow
- Enterprise compliance (audit logs, RBAC) → Neptune.ai
- Hyperparameter optimization primary use case → Weights & Biases (Sweeps)
Detailed Criteria:
| Criteria | MLflow | Weights & Biases | Neptune.ai |
|---|---|---|---|
| Cost | Free | $200/user/month | $300/user/month |
| Collaboration | Basic | Excellent | Good |
| Visualization | Basic | Excellent | Good |
| Hyperparameter Tuning | External (Optuna) | Integrated (Sweeps) | Basic |
| Model Registry | Included | Add-on | Included |
| Self-Hosted | Yes | No (paid only) | Limited |
| Enterprise Features | No | Limited | Excellent |
Recommendation by Organization:
- Startup (<50 people): MLflow (free, adequate) or W&B (if budget)
- Growth (50-500 people): Weights & Biases (team collaboration)
- Enterprise (>500 people): Neptune.ai (compliance) or MLflow (cost)
For detailed decision framework, see references/decision-frameworks.md.
Framework 2: Feature Store Selection
Decision Matrix:
Primary requirement:
- Open-source, cloud-agnostic → Feast
- Managed solution, production-grade, multi-cloud → Tecton
- AWS ecosystem → SageMaker Feature Store
- GCP ecosystem → Vertex AI Feature Store
- Azure ecosystem → Azure ML Feature Store
- Databricks users → Databricks Feature Store
- Self-hosted with UI → Hopsworks
Criteria Comparison:
| Factor | Feast | Tecton | Hopsworks | SageMaker FS |
|---|---|---|---|---|
| Cost | Free | $$$$ | Free (self-host) | $$$ |
| Online Serving | Redis, DynamoDB | Managed | RonDB | Managed |
| Offline Store | Parquet, BigQuery, Snowflake | Managed | Hive, S3 | S3 |
| Point-in-Time | Yes | Yes | Yes | Yes |
| Monitoring | External | Integrated | Basic | External |
| Cloud Lock-in | No | No | No | AWS |
Recommendation:
- Open-source, self-managed → Feast
- Managed, production-grade → Tecton
- AWS ecosystem → SageMaker Feature Store
- Databricks users → Databricks Feature Store
For detailed decision framework, see references/decision-frameworks.md.
Framework 3: Model Serving Platform Selection
Decision Tree:
Infrastructure:
- Kubernetes-based → Advanced deployment patterns needed?
- Yes → Seldon Core (most features) or KServe (CNCF standard)
- No → BentoML (simpler, Python-first)
- Cloud-native (managed) → Cloud provider?
- AWS → SageMaker Endpoints
- GCP → Vertex AI Endpoints
- Azure → Azure ML Endpoints
- Framework-specific → Framework?
- PyTorch → TorchServe
- TensorFlow → TensorFlow Serving
- Serverless / minimal infrastructure → BentoML or Cloud Functions
Detailed Criteria:
| Feature | Seldon Core | KServe | BentoML | TorchServe |
|---|---|---|---|---|
| Kubernetes-Native | Yes | Yes | Optional | No |
| Multi-Framework | Yes | Yes | Yes | PyTorch-only |
| Deployment Strategies | Excellent | Good | Basic | Basic |
| Explainability | Integrated | Integrated | External | No |
| Complexity | High | Medium | Low | Low |
| Learning Curve | Steep | Medium | Gentle | Gentle |
Recommendation:
- Kubernetes, advanced deployments → Seldon Core or KServe
- Python-first, simplicity → BentoML
- PyTorch-specific → TorchServe
- TensorFlow-specific → TensorFlow Serving
- Managed solution → SageMaker/Vertex AI/Azure ML
For detailed decision framework, see references/decision-frameworks.md.
Framework 4: ML Pipeline Orchestration Selection
Decision Matrix:
Primary use case:
- ML-specific pipelines, Kubernetes-native → Kubeflow Pipelines
- General-purpose orchestration, mature ecosystem → Apache Airflow
- Data science workflows, ease of use → Metaflow
- Modern approach, asset-based thinking → Dagster
- Dynamic workflows, Python-native → Prefect
Criteria Comparison:
| Factor | Kubeflow | Airflow | Metaflow | Dagster | Prefect |
|---|---|---|---|---|---|
| ML-Specific | Excellent | Good | Excellent | Good | Good |
| Kubernetes | Native | Compatible | Optional | Compatible | Compatible |
| Learning Curve | Steep | Steep | Gentle | Medium | Medium |
| Maturity | High | Very High | Medium | Medium | Medium |
| Community | Large | Very Large | Growing | Growing | Growing |
Recommendation:
- ML-specific, Kubernetes → Kubeflow Pipelines
- Mature, battle-tested → Apache Airflow
- Data scientists → Metaflow
- Software engineers → Dagster
- Modern, simpler than Airflow → Prefect
For detailed decision framework, see references/decision-frameworks.md.
Implementation Patterns
Pattern 1: End-to-End ML Pipeline
Automate the complete ML workflow from data to deployment.
Pipeline Stages:
- Data Validation (Great Expectations)
- Feature Engineering (transform raw data)
- Data Splitting (train/validation/test)
- Model Training (with hyperparameter tuning)
- Model Evaluation (accuracy, fairness, explainability)
- Model Registration (push to MLflow registry)
- Deployment (promote to staging/production)
Architecture:
Data Lake → Data Validation → Feature Engineering → Training → Evaluation
↓
Model Registry (staging) → Testing → Production Deployment
For implementation details and code examples, see references/ml-pipelines.md.
Pattern 2: Continuous Training
Automate model retraining based on drift detection.
Workflow:
- Monitor production data for distribution changes
- Detect data drift (KS test, PSI)
- Trigger automated retraining pipeline
- Validate new model (accuracy, fairness)
- Deploy via canary strategy (5% → 100%)
- Monitor new model performance
- Rollback if metrics degrade
Trigger Conditions:
- Scheduled: Daily/weekly retraining
- Data drift: KS test p-value < 0.05
- Model drift: Accuracy drop > 5%
- Data volume: New training data exceeds threshold (10K samples)
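A sketch of the trigger logic combining these conditions; the thresholds mirror the values above and are configurable defaults, not fixed rules:

```python
def should_retrain(ks_p_value: float, accuracy_drop: float, new_samples: int,
                   p_threshold: float = 0.05, acc_threshold: float = 0.05,
                   min_samples: int = 10_000) -> bool:
    """Return True if any retraining trigger condition is met."""
    return (
        ks_p_value < p_threshold          # data drift
        or accuracy_drop > acc_threshold  # model drift
        or new_samples >= min_samples     # enough new training data
    )

if should_retrain(ks_p_value=0.01, accuracy_drop=0.02, new_samples=4_000):
    ...  # kick off the training pipeline (e.g. a Prefect or Kubeflow run)
```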
For implementation details, see references/ml-pipelines.md.
Pattern 3: Feature Store Integration
Ensure consistent features between training and inference.
Architecture:
Offline Store (Training):
Parquet/BigQuery → Point-in-Time Join → Training Dataset
Online Store (Inference):
Redis/DynamoDB → Low-Latency Lookup → Real-Time Prediction
Point-in-Time Correctness:
- Training: Fetch features as of specific timestamps (no future data)
- Inference: Fetch latest features (only past data)
- Guarantee: Same feature logic in training and inference
For implementation details and code examples, see references/feature-stores.md.
Pattern 4: Shadow Deployment Testing
Test new models in production without risk.
Workflow:
- Deploy new model (v2) in shadow mode
- v2 receives copy of production traffic
- v1 predictions used for responses (no user impact)
- Compare v1 and v2 predictions offline
- Analyze differences, measure v2 accuracy
- Promote v2 to production if performance acceptable
Use Cases:
- High-risk models (financial, healthcare, safety-critical)
- Need extensive testing before cutover
- Compare model behavior on real production data
For deployment architecture, see references/deployment-strategies.md.
Tool Recommendations
Production-Ready Tools (High Adoption)
MLflow - Experiment Tracking & Model Registry
- GitHub Stars: 20,000+
- Trust Score: 95/100
- Use Cases: Experiment tracking, model registry, model serving
- Strengths: Open-source, framework-agnostic, self-hosted option
- Getting Started:
pip install mlflow && mlflow server
Feast - Feature Store
- GitHub Stars: 5,000+
- Trust Score: 85/100
- Use Cases: Online/offline feature serving, point-in-time correctness
- Strengths: Cloud-agnostic, most popular open-source feature store
- Getting Started:
pip install feast && feast init
Seldon Core - Model Serving (Advanced)
- GitHub Stars: 4,000+
- Trust Score: 85/100
- Use Cases: Kubernetes-native serving, advanced deployment patterns
- Strengths: Canary, A/B testing, MAB, explainability
- Limitation: High complexity, steep learning curve
KServe - Model Serving (CNCF Standard)
- GitHub Stars: 3,500+
- Trust Score: 85/100
- Use Cases: Standardized serving API, serverless scaling
- Strengths: CNCF project, Knative integration, growing adoption
- Limitation: Kubernetes required
BentoML - Model Serving (Simplicity)
- GitHub Stars: 6,000+
- Trust Score: 80/100
- Use Cases: Easy packaging, Python-first deployment
- Strengths: Lowest learning curve, excellent developer experience
- Limitation: Fewer advanced features than Seldon/KServe
Kubeflow Pipelines - ML Orchestration
- GitHub Stars: 14,000+ (Kubeflow project)
- Trust Score: 90/100
- Use Cases: ML-specific pipelines, Kubernetes-native workflows
- Strengths: ML-native, component reusability, Katib integration
- Limitation: Kubernetes required, steep learning curve
Weights & Biases - Experiment Tracking (SaaS)
- Trust Score: 90/100
- Use Cases: Team collaboration, advanced visualization, hyperparameter tuning
- Strengths: Best-in-class UI, integrated Sweeps, strong community
- Limitation: SaaS pricing, no self-hosted free tier
For detailed tool comparisons, see references/tool-recommendations.md.
Tool Stack Recommendations by Organization
Startup (Cost-Optimized, Simple):
- Experiment Tracking: MLflow (free, self-hosted)
- Feature Store: None initially → Feast when needed
- Model Serving: BentoML (easy) or cloud functions
- Orchestration: Prefect or cron jobs
- Monitoring: Basic logging + Prometheus
Growth Company (Balanced):
- Experiment Tracking: Weights & Biases or MLflow
- Feature Store: Feast (open-source, production-ready)
- Model Serving: BentoML or KServe (Kubernetes-based)
- Orchestration: Kubeflow Pipelines or Airflow
- Monitoring: Evidently + Prometheus + Grafana
Enterprise (Full Stack):
- Experiment Tracking: MLflow (self-hosted) or Neptune.ai (compliance)
- Feature Store: Tecton (managed) or Feast (self-hosted)
- Model Serving: Seldon Core (advanced) or KServe
- Orchestration: Kubeflow Pipelines or Airflow
- Monitoring: Evidently + Prometheus + Grafana + PagerDuty
Cloud-Native (Managed Services):
- AWS: SageMaker (end-to-end platform)
- GCP: Vertex AI (end-to-end platform)
- Azure: Azure ML (end-to-end platform)
For scenario-specific recommendations, see references/scenarios.md.
Common Scenarios
Scenario 1: Startup MLOps Stack
Context: 20-person startup, 5 data scientists, 3 models (fraud detection, recommendation, churn), limited budget.
Recommendation:
- Experiment Tracking: MLflow (free, self-hosted)
- Model Serving: BentoML (simple, fast iteration)
- Orchestration: Prefect (simpler than Airflow)
- Monitoring: Prometheus + basic drift detection
- Feature Store: Skip initially, use database tables
Rationale:
- Minimize cost (all open-source, self-hosted)
- Fast iteration (BentoML easy to deploy)
- Don't over-engineer (no Kubeflow for 3 models)
- Add feature store (Feast) when scaling to 10+ models
For detailed scenario, see references/scenarios.md.
Scenario 2: Enterprise ML Platform
Context: 500-person company, 50 data scientists, 100+ models, regulatory compliance, multi-cloud.
Recommendation:
- Experiment Tracking: Neptune.ai (compliance, audit logs) or MLflow (cost)
- Feature Store: Feast (self-hosted, cloud-agnostic)
- Model Serving: Seldon Core (advanced deployment patterns)
- Orchestration: Kubeflow Pipelines (ML-native, Kubernetes)
- Monitoring: Evidently + Prometheus + Grafana + PagerDuty
Rationale:
- Compliance required (Neptune audit logs, RBAC)
- Multi-cloud (Feast cloud-agnostic)
- Advanced deployments (Seldon canary, A/B testing)
- Scale (Kubernetes for 100+ models)
For detailed scenario, see references/scenarios.md.
Scenario 3: LLM Fine-Tuning Pipeline
Context: Fine-tune LLM for domain-specific use case, deploy for production serving.
Recommendation:
- Experiment Tracking: MLflow (track fine-tuning runs)
- Pipeline Orchestration: Kubeflow Pipelines (GPU scheduling)
- Model Serving: vLLM (high-throughput LLM serving)
- Prompt Versioning: Git + LangSmith
- Monitoring: Arize Phoenix (RAG monitoring)
Rationale:
- Track fine-tuning experiments (LoRA adapters, hyperparameters)
- GPU orchestration (Kubeflow on Kubernetes)
- Efficient LLM serving (vLLM optimized for throughput)
- Monitor RAG systems (retrieval + generation quality)
For detailed scenario, see references/scenarios.md.
Integration with Other Skills
Direct Dependencies:
- ai-data-engineering: Feature engineering, ML algorithms, data preparation
- kubernetes-operations: K8s cluster management, GPU scheduling for ML workloads
- observability: Monitoring, alerting, distributed tracing for ML systems
Complementary Skills:
- data-architecture: Data pipelines, data lakes feeding ML models
- data-transformation: dbt for feature transformation pipelines
- streaming-data: Kafka, Flink for real-time ML inference
- designing-distributed-systems: Scalability patterns for ML workloads
- api-design-principles: ML model APIs, REST/gRPC serving patterns
Downstream Skills:
- building-ai-chat: LLM-powered applications consuming ML models
- visualizing-data: Dashboards for ML metrics and monitoring
Best Practices
Version Everything:
- Code: Git commit SHA for reproducibility
- Data: DVC or data version hash
- Models: Semantic versioning (v1.2.3)
- Features: Feature store versioning
Automate Testing:
- Unit tests: Model loads, accepts input, produces output
- Integration tests: End-to-end pipeline execution
- Model validation: Accuracy thresholds, fairness checks
Monitor Continuously:
- Data drift: Distribution changes over time
- Model drift: Accuracy degradation
- Performance: Latency, throughput, error rates
Start Simple:
- Begin with MLflow + basic serving (BentoML)
- Add complexity as needed (feature store, Kubeflow)
- Avoid over-engineering (don't build Kubeflow for 2 models)
Point-in-Time Correctness:
- Use feature stores to avoid training/serving skew
- Ensure no future data leakage in training
- Consistent feature logic in training and inference
Deployment Strategies:
- Use canary for medium-risk models (gradual rollout)
- Use shadow for high-risk models (zero production impact)
- Always have rollback plan (instant switch to previous version)
Governance:
- Model cards: Document model purpose, limitations, biases
- Audit trails: Track all model versions, deployments, approvals
- Compliance: EU AI Act, model risk management (SR 11-7)
Cost Optimization:
- Quantization: Reduce model size 4x, inference speed 2-3x
- Spot instances: Train on preemptible VMs (60-90% cost reduction)
- Autoscaling: Scale inference endpoints based on load
Anti-Patterns
❌ Notebooks in Production:
- Never deploy Jupyter notebooks to production
- Use notebooks for experimentation only
- Production: Use scripts, Docker containers, CI/CD pipelines
❌ Manual Model Deployment:
- Automate deployment with CI/CD pipelines
- Use model registry stage transitions (staging → production)
- Eliminate human error, ensure reproducibility
❌ No Monitoring:
- Production models without monitoring will degrade silently
- Implement drift detection (data drift, model drift)
- Set up alerting for accuracy drops, latency spikes
❌ Training/Serving Skew:
- Different feature logic in training vs inference
- Use feature stores to ensure consistency
- Test feature parity before production deployment
❌ Ignoring Data Quality:
- Garbage in, garbage out (GIGO)
- Validate data schema, ranges, distributions
- Use Great Expectations for data validation
❌ Over-Engineering:
- Don't build Kubeflow for 2 models
- Start simple (MLflow + BentoML)
- Add complexity only when necessary (10+ models)
❌ No Rollback Plan:
- Always have ability to rollback to previous model version
- Blue-green, canary, shadow deployments enable instant rollback
- Test rollback procedure before production deployment
Further Reading
Reference Files:
- Experiment Tracking - MLflow, W&B, Neptune deep dive
- Model Registry - Versioning, lineage, stage transitions
- Feature Stores - Feast, Tecton, online/offline patterns
- Model Serving - Seldon, KServe, BentoML, optimization
- Deployment Strategies - Blue-green, canary, shadow, A/B
- ML Pipelines - Kubeflow, Airflow, training pipelines
- Model Monitoring - Drift detection, observability
- LLMOps Patterns - LLM fine-tuning, RAG, prompts
- Decision Frameworks - Tool selection frameworks
- Tool Recommendations - Detailed comparisons
- Scenarios - Startup, enterprise, LLMOps use cases
- Governance - Model cards, compliance, fairness
Example Projects:
- examples/mlflow-experiment/ - Complete MLflow setup
- examples/feast-feature-store/ - Feast online/offline
- examples/seldon-deployment/ - Canary, A/B testing
- examples/kubeflow-pipeline/ - End-to-end pipeline
- examples/monitoring-dashboard/ - Evidently + Prometheus
Scripts:
- scripts/setup_mlflow_server.sh - MLflow with PostgreSQL + S3
- scripts/feast_feature_definition_generator.py - Generate Feast features
- scripts/model_validation_suite.py - Automated model tests
- scripts/drift_detection_monitor.py - Scheduled drift detection
- scripts/kubernetes_model_deploy.py - Deploy to Seldon/KServe