# ML Pipeline Orchestrator

## Overview
This skill transforms ML development into a SpecWeave increment-based workflow, ensuring every ML project follows the same disciplined approach: spec → plan → tasks → implement → validate. It orchestrates the complete ML lifecycle from data exploration to model deployment, with full traceability and living documentation.
## Core Philosophy

**SpecWeave + ML = Disciplined Data Science**
Traditional ML development often lacks structure:

- ❌ Jupyter notebooks with no version control
- ❌ Experiments without documentation
- ❌ Models deployed with no reproducibility
- ❌ Team knowledge trapped in individual notebooks
SpecWeave brings discipline:

- ✅ Every ML feature is an increment (with spec, plan, tasks)
- ✅ Experiments tracked and documented automatically
- ✅ Model versions tied to increments
- ✅ Living docs capture learnings and decisions
## How It Works

### Phase 1: ML Increment Planning
When you request "build a recommendation model", the skill:

- **Creates the ML increment structure:**

```
.specweave/increments/0042-recommendation-model/
├── spec.md              # ML requirements, success metrics
├── plan.md              # Pipeline architecture
├── tasks.md             # Implementation tasks
├── tests.md             # Evaluation criteria
├── experiments/         # Experiment tracking
│   ├── exp-001-baseline/
│   ├── exp-002-xgboost/
│   └── exp-003-neural-net/
├── data/                # Data samples, schemas
│   ├── schema.yaml
│   └── sample.csv
├── models/              # Trained models
│   ├── model-v1.pkl
│   └── model-v2.pkl
└── notebooks/           # Exploratory notebooks
    ├── 01-eda.ipynb
    └── 02-feature-engineering.ipynb
```
- **Generates an ML-specific spec (spec.md):**

```markdown
## ML Problem Definition
- Problem type: Recommendation (collaborative filtering)
- Input: User behavior history
- Output: Top-N product recommendations
- Success metrics: Precision@10 > 0.25, Recall@10 > 0.15

## Data Requirements
- Training data: 6 months of user interactions
- Validation: Last month
- Features: User profile, product attributes, interaction history

## Model Requirements
- Latency: <100ms inference
- Throughput: 1000 req/sec
- Accuracy: Better than random baseline by 3x
- Explainability: Must explain top-3 recommendations
```
- **Creates ML-specific tasks (tasks.md):**
  - T-001: Data exploration and quality analysis
  - T-002: Feature engineering pipeline
  - T-003: Train baseline model (random/popularity)
  - T-004: Train candidate models (3 algorithms)
  - T-005: Hyperparameter tuning (best model)
  - T-006: Model evaluation (all metrics)
  - T-007: Model explainability (SHAP/LIME)
  - T-008: Production deployment preparation
  - T-009: A/B test plan
### Phase 2: Pipeline Execution

The skill guides you through each task with best practices:
#### Task 1: Data Exploration

```python
# Generated template with SpecWeave integration
import pandas as pd
import mlflow
from specweave import track_experiment

# Auto-logs to .specweave/increments/0042.../experiments/
with track_experiment("exp-001-eda") as exp:
    df = pd.read_csv("data/interactions.csv")

    # EDA
    exp.log_param("dataset_size", len(df))
    exp.log_metric("missing_values", df.isnull().sum().sum())

    # Auto-generates report in increment folder
    exp.save_report("eda-summary.md")
```
#### Task 3: Train Baseline

```python
from sklearn.dummy import DummyClassifier
from specweave import track_model

with track_model("baseline-random", increment="0042") as model:
    clf = DummyClassifier(strategy="uniform")
    clf.fit(X_train, y_train)

    # Automatically logged to increment
    model.log_metrics({
        "accuracy": 0.12,
        "precision@10": 0.08
    })
    model.save_artifact(clf, "baseline.pkl")
```
#### Task 4: Train Candidate Models

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from specweave import ModelExperiment, run_experiments

# Parallel experiments with auto-tracking
# (KerasModel is the project's own wrapper, defined elsewhere)
experiments = [
    ModelExperiment("xgboost", XGBClassifier, params_xgb),
    ModelExperiment("lightgbm", LGBMClassifier, params_lgbm),
    ModelExperiment("neural-net", KerasModel, params_nn),
]

results = run_experiments(
    experiments,
    increment="0042",
    save_to="experiments/"
)

# Auto-generates comparison table in increment docs
```
### Phase 3: Increment Completion

When `/sw:done 0042` runs, the skill:
- **Validates ML-specific criteria:**
  - ✅ All experiments logged
  - ✅ Best model saved
  - ✅ Evaluation metrics documented
  - ✅ Model explainability artifacts present
- **Generates a completion summary:**

```markdown
## Recommendation Model - COMPLETE

### Experiments Run: 7
- exp-001-baseline (random): precision@10=0.08
- exp-002-popularity: precision@10=0.18
- exp-003-xgboost: precision@10=0.26 ✅ BEST
- exp-004-lightgbm: precision@10=0.24
- exp-005-neural-net: precision@10=0.22
...

### Best Model
- Algorithm: XGBoost
- Version: model-v3.pkl
- Metrics: precision@10=0.26, recall@10=0.16
- Training time: 45 min
- Model size: 12 MB

### Deployment Ready
- ✅ Inference latency: 35ms (target: <100ms)
- ✅ Explainability: SHAP values computed
- ✅ A/B test plan documented
```
- **Syncs living docs (via `/sw:sync-docs`):**
  - Updates architecture docs with model design
  - Adds ADR for algorithm selection
  - Documents learnings in runbooks
## When to Use This Skill

Activate this skill when you need to:

- **Build ML features end-to-end** - From idea to deployed model
- **Ensure reproducibility** - Every experiment tracked and documented
- **Follow ML best practices** - Baseline comparison, proper validation, explainability
- **Integrate ML with software engineering** - ML as increments, not isolated notebooks
- **Maintain team knowledge** - Living docs capture why decisions were made
## ML Pipeline Stages

### Data Stage

- Data exploration (EDA)
- Data quality assessment
- Schema validation (see the sketch below)
- Sample data documentation
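As a small illustration of schema validation, the sketch below checks a data sample against the increment's schema.yaml. The schema layout (a top-level `columns` map of name → dtype) is an assumption for illustration, not a SpecWeave-defined format:

```python
import pandas as pd
import yaml

# Assumed schema.yaml layout (hypothetical):
#   columns:
#     user_id: int64
#     product_id: int64
#     rating: float64
with open("data/schema.yaml") as f:
    schema = yaml.safe_load(f)

df = pd.read_csv("data/sample.csv")
for column, dtype in schema["columns"].items():
    assert column in df.columns, f"missing column: {column}"
    assert str(df[column].dtype) == dtype, f"unexpected dtype for {column}"
```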
### Feature Stage

- Feature engineering
- Feature selection
- Feature importance analysis
- Feature store integration (optional)
### Training Stage

- Baseline model (random, rule-based)
- Candidate models (3+ algorithms)
- Hyperparameter tuning
- Cross-validation
### Evaluation Stage

- Comprehensive metrics (accuracy, precision, recall, F1, AUC)
- Business metrics (latency, throughput)
- Model comparison (vs baseline, vs previous version; see the sketch below)
- Error analysis
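To make the baseline comparison concrete, here is a minimal sketch computing a candidate's lift over the baseline; the metric values are taken from the example experiments earlier in this document:

```python
# Metric values from the example experiments above
baseline = {"precision@10": 0.08}   # exp-001-baseline (random)
candidate = {"precision@10": 0.26}  # exp-003-xgboost

lift = candidate["precision@10"] / baseline["precision@10"]
print(f"Lift over baseline: {lift:.2f}x")  # 3.25x - clears the spec's 3x requirement
```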
### Explainability Stage

- Feature importance
- SHAP values (see the sketch below)
- LIME explanations
- Example predictions with rationale
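For tree-based models, SHAP values come from the standard `shap` package, roughly as follows (the trained `model` and `X_test` are assumed from earlier stages):

```python
import shap

# TreeExplainer covers XGBoost, LightGBM, and scikit-learn tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global importance summary; save the figure with the increment's artifacts
shap.summary_plot(shap_values, X_test, show=False)
```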
### Deployment Stage

- Model packaging
- Inference pipeline (see the sketch below)
- A/B test plan
- Monitoring setup
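A rough sketch of a minimal inference pipeline, assuming the best model was serialized with joblib and is a classifier that scores candidate products; the path and the `recommend` helper are illustrative, not a SpecWeave API:

```python
import joblib
import numpy as np

# Illustrative path into the increment's models/ folder
MODEL_PATH = ".specweave/increments/0042-recommendation-model/models/model-v3.pkl"
model = joblib.load(MODEL_PATH)

def recommend(candidate_features: np.ndarray, top_n: int = 10) -> np.ndarray:
    """Score candidate products and return the indices of the top-N."""
    scores = model.predict_proba(candidate_features)[:, 1]
    return np.argsort(scores)[::-1][:top_n]
```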
## Integration with SpecWeave Workflow

### With Experiment Tracking

```bash
# Start ML increment
/sw:inc "0042-recommendation-model"

# Automatically integrates experiment tracking
# All MLflow/W&B logs saved to increment folder
```
### With Living Docs

```bash
# After training best model
/sw:sync-docs update

# Automatically:
# - Updates architecture/ml-models.md
# - Adds ADR for algorithm choice
# - Documents hyperparameters in runbooks
```
### With GitHub Sync

```bash
# Create GitHub issue for model retraining
/sw:github:create-issue "Retrain recommendation model with new data"

# Linked to increment 0042
# Issue tracks model performance over time
```
## Best Practices

### Always Start with Baseline

```python
# Before training complex models, establish a baseline
baseline_results = train_baseline_model(
    strategies=["random", "popularity", "rule-based"]
)

# Requirement: New model must beat best baseline by 20%+
```
### Use Cross-Validation

```python
# Never trust a single train/test split
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model, X, y, cv=5)
exp.log_metric("cv_mean", cv_scores.mean())
exp.log_metric("cv_std", cv_scores.std())
```
### Track Everything

```python
# Hyperparameters, metrics, artifacts, environment
exp.log_params(model.get_params())
exp.log_metrics({"accuracy": acc, "f1": f1})
exp.log_artifact("model.pkl")
exp.log_artifact("requirements.txt")  # Reproducibility
```
### Document Failures

```python
# Failed experiments are valuable learnings
with track_experiment("exp-006-failed-lstm") as exp:
    # ... training fails ...
    exp.log_note("FAILED: LSTM overfits badly, needs regularization")
    exp.set_status("failed")

# This documents why LSTM wasn't chosen
```
### Model Versioning

```python
# Tie model versions to increments
import mlflow

model_version = f"0042-v{iteration}"
mlflow.register_model(
    f"runs:/{run_id}/model",
    f"recommendation-model-{model_version}"
)
```
## Examples

### Example 1: Classification Pipeline

**User:** "Build a fraud detection model for transactions"

Skill creates increment 0051-fraud-detection with:

- spec.md: Binary classification, 99% precision target
- plan.md: Imbalanced data handling, threshold tuning
- tasks.md: 9 tasks from EDA to deployment
- experiments/: exp-001-baseline, exp-002-xgboost, etc.
Guides through:
- EDA → identify class imbalance (0.1% fraud)
- Baseline → random/majority (terrible results)
- Candidates → XGBoost, LightGBM, Neural Net
- Threshold tuning → optimize for precision
- SHAP → explain high-risk predictions
- Deploy → model + threshold + explainer
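The threshold-tuning step above can be sketched with scikit-learn's precision-recall curve, picking the lowest threshold that satisfies the 99% precision target (`y_test` and `y_scores` are assumed from the trained candidate):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

# thresholds aligns with precision[:-1]; take the lowest threshold whose
# precision meets the 99% target from spec.md
meets_target = np.where(precision[:-1] >= 0.99)[0]
best_threshold = thresholds[meets_target[0]] if len(meets_target) else None
```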
### Example 2: Regression Pipeline

**User:** "Predict customer lifetime value"

Skill creates increment 0063-ltv-prediction with:

- spec.md: Regression, RMSE < $50 target
- plan.md: Time-based validation, feature engineering
- tasks.md: Customer cohort analysis, feature importance

Key difference: Regression-specific evaluation (RMSE, MAE, R²), as sketched below.
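A minimal sketch of that regression-specific evaluation, assuming `y_test` and `y_pred` from the trained LTV model:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # spec target: RMSE < $50
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```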
### Example 3: Time Series Forecasting

**User:** "Forecast weekly sales for next 12 weeks"

Skill creates increment 0072-sales-forecasting with:

- spec.md: Time series, MAPE < 10% target
- plan.md: Seasonal decomposition, ARIMA vs Prophet
- tasks.md: Stationarity tests, residual analysis

Key difference: Time series validation (no random split), as sketched below.
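A minimal sketch of time-series validation using scikit-learn's `TimeSeriesSplit`, which always trains on the past and validates on the future; the weekly sales arrays `X` and `y` are assumed:

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # Training folds always precede validation folds in time - no shuffling
    model.fit(X[train_idx], y[train_idx])
    score = model.score(X[val_idx], y[val_idx])
```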
## Framework Support

This skill works with all major ML frameworks:

### Scikit-Learn

```python
from sklearn.ensemble import RandomForestClassifier
from specweave import track_sklearn_model

model = RandomForestClassifier(n_estimators=100)
with track_sklearn_model(model, increment="0042") as tracked:
    tracked.fit(X_train, y_train)
    tracked.evaluate(X_test, y_test)
```
### PyTorch

```python
import torch
from specweave import track_pytorch_model

model = NeuralNet()
with track_pytorch_model(model, increment="0042") as tracked:
    for epoch in range(epochs):
        tracked.train_epoch(train_loader)
        tracked.log_metric(f"loss_epoch_{epoch}", loss)
```
### TensorFlow/Keras

```python
from tensorflow import keras
from specweave import KerasCallback

model = keras.Sequential([...])
model.fit(
    X_train, y_train,
    callbacks=[KerasCallback(increment="0042")]
)
```
### XGBoost/LightGBM

```python
import xgboost as xgb
from specweave import track_boosting_model

dtrain = xgb.DMatrix(X_train, label=y_train)
with track_boosting_model("xgboost", increment="0042") as tracked:
    model = xgb.train(params, dtrain, callbacks=[tracked.callback])
```
## Integration Points

### With experiment-tracker skill

- Auto-detects MLflow/W&B in the project
- Configures the tracking URI to the increment folder (see the sketch below)
- Syncs experiment metadata to increment docs
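For the MLflow case, pointing the tracking URI at the increment folder is a one-liner. A minimal sketch; the local `file:` path is illustrative, not a SpecWeave-mandated location:

```python
import mlflow

# Illustrative path - store MLflow runs inside the increment folder
mlflow.set_tracking_uri("file:.specweave/increments/0042-recommendation-model/experiments")
mlflow.set_experiment("0042-recommendation-model")
```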
### With model-evaluator skill

- Generates comprehensive evaluation reports
- Compares models across experiments
- Highlights the best model with confidence intervals
### With feature-engineer skill

- Generates the feature engineering pipeline
- Documents feature importance
- Creates feature store schemas
### With ml-engineer agent

- Delegates complex ML decisions to a specialized agent
- Reviews model architecture
- Suggests improvements based on results
## Skill Outputs

After running `/sw:do` on an ML increment, you get:

```
.specweave/increments/0042-recommendation-model/
├── spec.md ✅
├── plan.md ✅
├── tasks.md ✅ (all completed)
├── COMPLETION-SUMMARY.md ✅
├── experiments/
│   ├── exp-001-baseline/
│   │   ├── metrics.json
│   │   ├── params.json
│   │   └── logs/
│   ├── exp-002-xgboost/ ✅ BEST
│   │   ├── metrics.json
│   │   ├── params.json
│   │   ├── model.pkl
│   │   └── shap_values.pkl
│   └── comparison.md
├── models/
│   ├── model-v3.pkl (best)
│   └── model-v3.metadata.json
├── data/
│   ├── schema.yaml
│   └── sample.parquet
└── notebooks/
    ├── 01-eda.ipynb
    ├── 02-feature-engineering.ipynb
    └── 03-model-analysis.ipynb
```
## Commands

This skill integrates with SpecWeave commands:

```bash
# Create ML increment
/sw:inc "build recommendation model"
# → Activates ml-pipeline-orchestrator
# → Creates ML-specific increment structure

# Execute ML tasks
/sw:do
# → Guides through the data → train → eval workflow
# → Auto-tracks experiments

# Validate ML increment
/sw:validate 0042
# → Checks: experiments logged, model saved, metrics documented
# → Validates: model meets success criteria

# Complete ML increment
/sw:done 0042
# → Generates ML completion summary
# → Syncs model metadata to living docs
```
## Tips

- **Start simple** - Always begin with a baseline, then iterate
- **Track failures** - Document why approaches didn't work
- **Version data** - Use DVC or similar for data versioning
- **Reproducibility** - Log the environment (requirements.txt, conda env)
- **Incremental improvement** - Each increment improves on the previous model
- **Team collaboration** - Living docs make ML decisions visible to all
## Advanced: Multi-Increment ML Projects

For complex ML systems (e.g., a recommendation system with multiple models):

```
0042-recommendation-data-pipeline
0043-recommendation-candidate-generation
0044-recommendation-ranking-model
0045-recommendation-reranking
0046-recommendation-ab-test
```
Each increment:

- Has its own spec, plan, tasks
- Builds on previous increments
- Documents model interactions
- Maintains system-level living docs
Maintains system-level living docs