# ML Pipeline Orchestrator

## Overview
This skill transforms ML development into a SpecWeave increment-based workflow, ensuring every ML project follows the same disciplined approach: spec → plan → tasks → implement → validate. It orchestrates the complete ML lifecycle from data exploration to model deployment, with full traceability and living documentation.
## Core Philosophy

**SpecWeave + ML = Disciplined Data Science**
Traditional ML development often lacks structure:

- ❌ Jupyter notebooks with no version control
- ❌ Experiments without documentation
- ❌ Models deployed with no reproducibility
- ❌ Team knowledge trapped in individual notebooks
SpecWeave brings discipline:

- ✅ Every ML feature is an increment (with spec, plan, tasks)
- ✅ Experiments tracked and documented automatically
- ✅ Model versions tied to increments
- ✅ Living docs capture learnings and decisions
## How It Works

### Phase 1: ML Increment Planning
When you request "build a recommendation model", the skill:

- **Creates the ML increment structure:**

```
.specweave/increments/0042-recommendation-model/
├── spec.md              # ML requirements, success metrics
├── plan.md              # Pipeline architecture
├── tasks.md             # Implementation tasks
├── tests.md             # Evaluation criteria
├── experiments/         # Experiment tracking
│   ├── exp-001-baseline/
│   ├── exp-002-xgboost/
│   └── exp-003-neural-net/
├── data/                # Data samples, schemas
│   ├── schema.yaml
│   └── sample.csv
├── models/              # Trained models
│   ├── model-v1.pkl
│   └── model-v2.pkl
└── notebooks/           # Exploratory notebooks
    ├── 01-eda.ipynb
    └── 02-feature-engineering.ipynb
```
- **Generates an ML-specific spec (spec.md):**

```markdown
## ML Problem Definition
- Problem type: Recommendation (collaborative filtering)
- Input: User behavior history
- Output: Top-N product recommendations
- Success metrics: Precision@10 > 0.25, Recall@10 > 0.15

## Data Requirements
- Training data: 6 months of user interactions
- Validation: Last month
- Features: User profile, product attributes, interaction history

## Model Requirements
- Latency: <100ms inference
- Throughput: 1000 req/sec
- Accuracy: Better than random baseline by 3x
- Explainability: Must explain top-3 recommendations
```
- **Creates ML-specific tasks (tasks.md):**
  - T-001: Data exploration and quality analysis
  - T-002: Feature engineering pipeline
  - T-003: Train baseline model (random/popularity)
  - T-004: Train candidate models (3 algorithms)
  - T-005: Hyperparameter tuning (best model)
  - T-006: Model evaluation (all metrics)
  - T-007: Model explainability (SHAP/LIME)
  - T-008: Production deployment preparation
  - T-009: A/B test plan
### Phase 2: Pipeline Execution

The skill guides you through each task with best practices:
#### Task 1: Data Exploration

```python
# Generated template with SpecWeave integration
import pandas as pd
import mlflow
from specweave import track_experiment

# Auto-logs to .specweave/increments/0042.../experiments/
with track_experiment("exp-001-eda") as exp:
    df = pd.read_csv("data/interactions.csv")

    # EDA
    exp.log_param("dataset_size", len(df))
    exp.log_metric("missing_values", df.isnull().sum().sum())

    # Auto-generates report in increment folder
    exp.save_report("eda-summary.md")
```
#### Task 3: Train Baseline

```python
from sklearn.dummy import DummyClassifier
from specweave import track_model

with track_model("baseline-random", increment="0042") as model:
    clf = DummyClassifier(strategy="uniform")
    clf.fit(X_train, y_train)

    # Automatically logged to increment
    model.log_metrics({
        "accuracy": 0.12,
        "precision@10": 0.08
    })
    model.save_artifact(clf, "baseline.pkl")
```
#### Task 4: Train Candidate Models

```python
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from specweave import ModelExperiment, run_experiments

# Parallel experiments with auto-tracking
# (KerasModel is the project's own wrapper, defined elsewhere)
experiments = [
    ModelExperiment("xgboost", XGBClassifier, params_xgb),
    ModelExperiment("lightgbm", LGBMClassifier, params_lgbm),
    ModelExperiment("neural-net", KerasModel, params_nn),
]

results = run_experiments(
    experiments,
    increment="0042",
    save_to="experiments/"
)

# Auto-generates comparison table in increment docs
```
### Phase 3: Increment Completion

When `/sw:done 0042` runs, the skill:
- **Validates ML-specific criteria:**
  - ✅ All experiments logged
  - ✅ Best model saved
  - ✅ Evaluation metrics documented
  - ✅ Model explainability artifacts present
- **Generates a completion summary:**

```markdown
## Recommendation Model - COMPLETE

### Experiments Run: 7
- exp-001-baseline (random): precision@10=0.08
- exp-002-popularity: precision@10=0.18
- exp-003-xgboost: precision@10=0.26 ✅ BEST
- exp-004-lightgbm: precision@10=0.24
- exp-005-neural-net: precision@10=0.22
...

### Best Model
- Algorithm: XGBoost
- Version: model-v3.pkl
- Metrics: precision@10=0.26, recall@10=0.16
- Training time: 45 min
- Model size: 12 MB

### Deployment Ready
- ✅ Inference latency: 35ms (target: <100ms)
- ✅ Explainability: SHAP values computed
- ✅ A/B test plan documented
```
- **Syncs living docs (via `/sw:sync-docs`):**
  - Updates architecture docs with model design
  - Adds ADR for algorithm selection
  - Documents learnings in runbooks
## When to Use This Skill

Activate this skill when you need to:

- **Build ML features end-to-end** - From idea to deployed model
- **Ensure reproducibility** - Every experiment tracked and documented
- **Follow ML best practices** - Baseline comparison, proper validation, explainability
- **Integrate ML with software engineering** - ML as increments, not isolated notebooks
- **Maintain team knowledge** - Living docs capture why decisions were made
## ML Pipeline Stages

### Data Stage

- Data exploration (EDA)
- Data quality assessment
- Schema validation (see the sketch below)
- Sample data documentation
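As a small illustration of schema validation, the sketch below checks a data sample against the increment's schema.yaml. The schema layout (a top-level `columns` map of name → dtype) is an assumption for illustration, not a SpecWeave-defined format:

```python
import pandas as pd
import yaml

# Assumed schema.yaml layout (hypothetical):
#   columns:
#     user_id: int64
#     product_id: int64
#     rating: float64
with open("data/schema.yaml") as f:
    schema = yaml.safe_load(f)

df = pd.read_csv("data/sample.csv")
for column, dtype in schema["columns"].items():
    assert column in df.columns, f"missing column: {column}"
    assert str(df[column].dtype) == dtype, f"unexpected dtype for {column}"
```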
### Feature Stage

- Feature engineering
- Feature selection
- Feature importance analysis
- Feature store integration (optional)
### Training Stage

- Baseline model (random, rule-based)
- Candidate models (3+ algorithms)
- Hyperparameter tuning
- Cross-validation
### Evaluation Stage

- Comprehensive metrics (accuracy, precision, recall, F1, AUC)
- Business metrics (latency, throughput)
- Model comparison (vs baseline, vs previous version; see the sketch below)
- Error analysis
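To make the baseline comparison concrete, here is a minimal sketch computing a candidate's lift over the baseline; the metric values are taken from the example experiments earlier in this document:

```python
# Metric values from the example experiments above
baseline = {"precision@10": 0.08}   # exp-001-baseline (random)
candidate = {"precision@10": 0.26}  # exp-003-xgboost

lift = candidate["precision@10"] / baseline["precision@10"]
print(f"Lift over baseline: {lift:.2f}x")  # 3.25x - clears the spec's 3x requirement
```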
### Explainability Stage

- Feature importance
- SHAP values (see the sketch below)
- LIME explanations
- Example predictions with rationale
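For tree-based models, SHAP values come from the standard `shap` package, roughly as follows (the trained `model` and `X_test` are assumed from earlier stages):

```python
import shap

# TreeExplainer covers XGBoost, LightGBM, and scikit-learn tree ensembles
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global importance summary; save the figure with the increment's artifacts
shap.summary_plot(shap_values, X_test, show=False)
```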
### Deployment Stage

- Model packaging
- Inference pipeline (see the sketch below)
- A/B test plan
- Monitoring setup
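A rough sketch of a minimal inference pipeline, assuming the best model was serialized with joblib and is a classifier that scores candidate products; the path and the `recommend` helper are illustrative, not a SpecWeave API:

```python
import joblib
import numpy as np

# Illustrative path into the increment's models/ folder
MODEL_PATH = ".specweave/increments/0042-recommendation-model/models/model-v3.pkl"
model = joblib.load(MODEL_PATH)

def recommend(candidate_features: np.ndarray, top_n: int = 10) -> np.ndarray:
    """Score candidate products and return the indices of the top-N."""
    scores = model.predict_proba(candidate_features)[:, 1]
    return np.argsort(scores)[::-1][:top_n]
```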
## Integration with SpecWeave Workflow

### With Experiment Tracking

```bash
# Start ML increment
/sw:inc "0042-recommendation-model"

# Automatically integrates experiment tracking
# All MLflow/W&B logs saved to increment folder
```
### With Living Docs

```bash
# After training best model
/sw:sync-docs update

# Automatically:
# - Updates architecture/ml-models.md
# - Adds ADR for algorithm choice
# - Documents hyperparameters in runbooks
```
### With GitHub Sync

```bash
# Create GitHub issue for model retraining
/sw:github:create-issue "Retrain recommendation model with new data"

# Linked to increment 0042
# Issue tracks model performance over time
```
## Best Practices

### Always Start with Baseline

```python
# Before training complex models, establish a baseline
baseline_results = train_baseline_model(
    strategies=["random", "popularity", "rule-based"]
)

# Requirement: New model must beat best baseline by 20%+
```
### Use Cross-Validation

```python
# Never trust a single train/test split
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model, X, y, cv=5)
exp.log_metric("cv_mean", cv_scores.mean())
exp.log_metric("cv_std", cv_scores.std())
```
### Track Everything

```python
# Hyperparameters, metrics, artifacts, environment
exp.log_params(model.get_params())
exp.log_metrics({"accuracy": acc, "f1": f1})
exp.log_artifact("model.pkl")
exp.log_artifact("requirements.txt")  # Reproducibility
```
### Document Failures

```python
# Failed experiments are valuable learnings
with track_experiment("exp-006-failed-lstm") as exp:
    # ... training fails ...
    exp.log_note("FAILED: LSTM overfits badly, needs regularization")
    exp.set_status("failed")

# This documents why LSTM wasn't chosen
```
### Model Versioning

```python
# Tie model versions to increments
import mlflow

model_version = f"0042-v{iteration}"
mlflow.register_model(
    f"runs:/{run_id}/model",
    f"recommendation-model-{model_version}"
)
```
## Examples

### Example 1: Classification Pipeline

**User:** "Build a fraud detection model for transactions"

Skill creates increment 0051-fraud-detection with:

- spec.md: Binary classification, 99% precision target
- plan.md: Imbalanced data handling, threshold tuning
- tasks.md: 9 tasks from EDA to deployment
- experiments/: exp-001-baseline, exp-002-xgboost, etc.
Guides through:
- EDA → identify class imbalance (0.1% fraud)
- Baseline → random/majority (terrible results)
- Candidates → XGBoost, LightGBM, Neural Net
- Threshold tuning → optimize for precision
- SHAP → explain high-risk predictions
- Deploy → model + threshold + explainer
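The threshold-tuning step above can be sketched with scikit-learn's precision-recall curve, picking the lowest threshold that satisfies the 99% precision target (`y_test` and `y_scores` are assumed from the trained candidate):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_test, y_scores)

# thresholds aligns with precision[:-1]; take the lowest threshold whose
# precision meets the 99% target from spec.md
meets_target = np.where(precision[:-1] >= 0.99)[0]
best_threshold = thresholds[meets_target[0]] if len(meets_target) else None
```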
### Example 2: Regression Pipeline

**User:** "Predict customer lifetime value"

Skill creates increment 0063-ltv-prediction with:

- spec.md: Regression, RMSE < $50 target
- plan.md: Time-based validation, feature engineering
- tasks.md: Customer cohort analysis, feature importance

Key difference: Regression-specific evaluation (RMSE, MAE, R²), as sketched below.
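A minimal sketch of that regression-specific evaluation, assuming `y_test` and `y_pred` from the trained LTV model:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # spec target: RMSE < $50
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
```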
### Example 3: Time Series Forecasting

**User:** "Forecast weekly sales for next 12 weeks"

Skill creates increment 0072-sales-forecasting with:

- spec.md: Time series, MAPE < 10% target
- plan.md: Seasonal decomposition, ARIMA vs Prophet
- tasks.md: Stationarity tests, residual analysis

Key difference: Time series validation (no random split), as sketched below.
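A minimal sketch of time-series validation using scikit-learn's `TimeSeriesSplit`, which always trains on the past and validates on the future; the weekly sales arrays `X` and `y` are assumed:

```python
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # Training folds always precede validation folds in time - no shuffling
    model.fit(X[train_idx], y[train_idx])
    score = model.score(X[val_idx], y[val_idx])
```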
## Framework Support

This skill works with all major ML frameworks:

### Scikit-Learn

```python
from sklearn.ensemble import RandomForestClassifier
from specweave import track_sklearn_model

model = RandomForestClassifier(n_estimators=100)
with track_sklearn_model(model, increment="0042") as tracked:
    tracked.fit(X_train, y_train)
    tracked.evaluate(X_test, y_test)
```
### PyTorch

```python
import torch
from specweave import track_pytorch_model

model = NeuralNet()
with track_pytorch_model(model, increment="0042") as tracked:
    for epoch in range(epochs):
        tracked.train_epoch(train_loader)
        tracked.log_metric(f"loss_epoch_{epoch}", loss)
```
### TensorFlow/Keras

```python
from tensorflow import keras
from specweave import KerasCallback

model = keras.Sequential([...])
model.fit(
    X_train, y_train,
    callbacks=[KerasCallback(increment="0042")]
)
```
### XGBoost/LightGBM

```python
import xgboost as xgb
from specweave import track_boosting_model

dtrain = xgb.DMatrix(X_train, label=y_train)
with track_boosting_model("xgboost", increment="0042") as tracked:
    model = xgb.train(params, dtrain, callbacks=[tracked.callback])
```
## Integration Points

### With experiment-tracker skill

- Auto-detects MLflow/W&B in the project
- Configures the tracking URI to the increment folder (see the sketch below)
- Syncs experiment metadata to increment docs
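For the MLflow case, pointing the tracking URI at the increment folder is a one-liner. A minimal sketch; the local `file:` path is illustrative, not a SpecWeave-mandated location:

```python
import mlflow

# Illustrative path - store MLflow runs inside the increment folder
mlflow.set_tracking_uri("file:.specweave/increments/0042-recommendation-model/experiments")
mlflow.set_experiment("0042-recommendation-model")
```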
### With model-evaluator skill

- Generates comprehensive evaluation reports
- Compares models across experiments
- Highlights the best model with confidence intervals
### With feature-engineer skill

- Generates the feature engineering pipeline
- Documents feature importance
- Creates feature store schemas
### With ml-engineer agent

- Delegates complex ML decisions to a specialized agent
- Reviews model architecture
- Suggests improvements based on results
## Skill Outputs

After running `/sw:do` on an ML increment, you get:

```
.specweave/increments/0042-recommendation-model/
├── spec.md ✅
├── plan.md ✅
├── tasks.md ✅ (all completed)
├── COMPLETION-SUMMARY.md ✅
├── experiments/
│   ├── exp-001-baseline/
│   │   ├── metrics.json
│   │   ├── params.json
│   │   └── logs/
│   ├── exp-002-xgboost/ ✅ BEST
│   │   ├── metrics.json
│   │   ├── params.json
│   │   ├── model.pkl
│   │   └── shap_values.pkl
│   └── comparison.md
├── models/
│   ├── model-v3.pkl (best)
│   └── model-v3.metadata.json
├── data/
│   ├── schema.yaml
│   └── sample.parquet
└── notebooks/
    ├── 01-eda.ipynb
    ├── 02-feature-engineering.ipynb
    └── 03-model-analysis.ipynb
```
## Commands

This skill integrates with SpecWeave commands:

```bash
# Create ML increment
/sw:inc "build recommendation model"
# → Activates ml-pipeline-orchestrator
# → Creates ML-specific increment structure

# Execute ML tasks
/sw:do
# → Guides through the data → train → eval workflow
# → Auto-tracks experiments

# Validate ML increment
/sw:validate 0042
# → Checks: experiments logged, model saved, metrics documented
# → Validates: model meets success criteria

# Complete ML increment
/sw:done 0042
# → Generates ML completion summary
# → Syncs model metadata to living docs
```
## Tips

- **Start simple** - Always begin with a baseline, then iterate
- **Track failures** - Document why approaches didn't work
- **Version data** - Use DVC or similar for data versioning
- **Reproducibility** - Log the environment (requirements.txt, conda env)
- **Incremental improvement** - Each increment improves on the previous model
- **Team collaboration** - Living docs make ML decisions visible to all
## Advanced: Multi-Increment ML Projects

For complex ML systems (e.g., a recommendation system with multiple models):

```
0042-recommendation-data-pipeline
0043-recommendation-candidate-generation
0044-recommendation-ranking-model
0045-recommendation-reranking
0046-recommendation-ab-test
```
Each increment:

- Has its own spec, plan, tasks
- Builds on previous increments
- Documents model interactions
- Maintains system-level living docs
Maintains system-level living docs