
ML Pipeline Orchestrator

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running anything.


Install skill "ml-pipeline-orchestrator" with this command: npx skills add anton-abyzov/specweave/anton-abyzov-specweave-ml-pipeline-orchestrator

ML Pipeline Orchestrator

Overview

This skill transforms ML development into a SpecWeave increment-based workflow, ensuring every ML project follows the same disciplined approach: spec → plan → tasks → implement → validate. It orchestrates the complete ML lifecycle from data exploration to model deployment, with full traceability and living documentation.

Core Philosophy

SpecWeave + ML = Disciplined Data Science

Traditional ML development often lacks structure:

  • ❌ Jupyter notebooks with no version control

  • ❌ Experiments without documentation

  • ❌ Models deployed with no reproducibility

  • ❌ Team knowledge trapped in individual notebooks

SpecWeave brings discipline:

  • ✅ Every ML feature is an increment (with spec, plan, tasks)

  • ✅ Experiments tracked and documented automatically

  • ✅ Model versions tied to increments

  • ✅ Living docs capture learnings and decisions

How It Works

Phase 1: ML Increment Planning

When you request "build a recommendation model", the skill:

  • Creates ML increment structure:

.specweave/increments/0042-recommendation-model/
├── spec.md                  # ML requirements, success metrics
├── plan.md                  # Pipeline architecture
├── tasks.md                 # Implementation tasks
├── tests.md                 # Evaluation criteria
├── experiments/             # Experiment tracking
│   ├── exp-001-baseline/
│   ├── exp-002-xgboost/
│   └── exp-003-neural-net/
├── data/                    # Data samples, schemas
│   ├── schema.yaml
│   └── sample.csv
├── models/                  # Trained models
│   ├── model-v1.pkl
│   └── model-v2.pkl
└── notebooks/               # Exploratory notebooks
    ├── 01-eda.ipynb
    └── 02-feature-engineering.ipynb

  • Generates ML-specific spec (spec.md):

ML Problem Definition

  • Problem type: Recommendation (collaborative filtering)
  • Input: User behavior history
  • Output: Top-N product recommendations
  • Success metrics: Precision@10 > 0.25, Recall@10 > 0.15

Data Requirements

  • Training data: 6 months user interactions
  • Validation: Last month
  • Features: User profile, product attributes, interaction history

Model Requirements

  • Latency: <100ms inference

  • Throughput: 1000 req/sec

  • Accuracy: Better than random baseline by 3x

  • Explainability: Must explain top-3 recommendations

  • Creates ML-specific tasks (tasks.md):

  • T-001: Data exploration and quality analysis

  • T-002: Feature engineering pipeline

  • T-003: Train baseline model (random/popularity)

  • T-004: Train candidate models (3 algorithms)

  • T-005: Hyperparameter tuning (best model)

  • T-006: Model evaluation (all metrics; see the metric sketch after this task list)

  • T-007: Model explainability (SHAP/LIME)

  • T-008: Production deployment preparation

  • T-009: A/B test plan
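
The spec's success metrics (Precision@10 > 0.25, Recall@10 > 0.15) and task T-006 both hinge on ranking metrics. As a point of reference, here is a minimal, framework-free sketch of how precision@k and recall@k are typically computed; the recommended and relevant values below are purely illustrative:

def precision_at_k(recommended, relevant, k=10):
    """Fraction of the top-k recommendations the user actually interacted with."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def recall_at_k(recommended, relevant, k=10):
    """Fraction of the user's relevant items that appear in the top-k."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant) if relevant else 0.0

recommended = ["p9", "p3", "p7", "p1", "p5", "p2", "p8", "p4", "p6", "p0"]
relevant = {"p3", "p5", "p11"}
assert precision_at_k(recommended, relevant) == 0.2  # 2 hits in the top 10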

Phase 2: Pipeline Execution

The skill guides through each task with best practices:

Task 1: Data Exploration

# Generated template with SpecWeave integration
import pandas as pd
import mlflow
from specweave import track_experiment

# Auto-logs to .specweave/increments/0042.../experiments/
with track_experiment("exp-001-eda") as exp:
    df = pd.read_csv("data/interactions.csv")

    # EDA
    exp.log_param("dataset_size", len(df))
    exp.log_metric("missing_values", df.isnull().sum().sum())

    # Auto-generates report in increment folder
    exp.save_report("eda-summary.md")

Task 3: Train Baseline

from sklearn.dummy import DummyClassifier
from specweave import track_model

with track_model("baseline-random", increment="0042") as model:
    clf = DummyClassifier(strategy="uniform")
    clf.fit(X_train, y_train)

    # Automatically logged to increment
    model.log_metrics({
        "accuracy": 0.12,
        "precision@10": 0.08
    })
    model.save_artifact(clf, "baseline.pkl")

Task 4: Train Candidate Models

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from specweave import ModelExperiment, run_experiments

# Parallel experiments with auto-tracking
experiments = [
    ModelExperiment("xgboost", XGBClassifier, params_xgb),
    ModelExperiment("lightgbm", LGBMClassifier, params_lgbm),
    ModelExperiment("neural-net", KerasModel, params_nn)  # KerasModel: project-defined wrapper
]

results = run_experiments(
    experiments,
    increment="0042",
    save_to="experiments/"
)

# Auto-generates comparison table in increment docs

Phase 3: Increment Completion

When /sw:done 0042 runs:

Validates ML-specific criteria (a hypothetical check is sketched after this list):

  • ✅ All experiments logged

  • ✅ Best model saved

  • ✅ Evaluation metrics documented

  • ✅ Model explainability artifacts present
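
The actual checks live inside SpecWeave; purely as an illustration, a hypothetical validator over the increment layout shown earlier might look like the following. The function name and the file-presence heuristics are assumptions, not the skill's real implementation:

from pathlib import Path

def validate_ml_increment(increment_dir):
    """Hypothetical sketch of the completion checks listed above."""
    root = Path(increment_dir)
    problems = []
    experiments = sorted((root / "experiments").glob("exp-*"))
    if not experiments:
        problems.append("no experiments logged")
    if not list((root / "models").glob("*.pkl")):
        problems.append("no model artifact saved")
    if any(not (exp / "metrics.json").exists() for exp in experiments):
        problems.append("an experiment is missing metrics.json")
    if not list(root.rglob("shap_values*")):
        problems.append("no explainability artifacts found")
    return problems

issues = validate_ml_increment(".specweave/increments/0042-recommendation-model")
print("PASS" if not issues else "FAIL: " + ", ".join(issues))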

Generates completion summary:

Recommendation Model - COMPLETE

Experiments Run: 7

  1. exp-001-baseline (random): precision@10=0.08
  2. exp-002-popularity: precision@10=0.18
  3. exp-003-xgboost: precision@10=0.26 ✅ BEST
  4. exp-004-lightgbm: precision@10=0.24
  5. exp-005-neural-net: precision@10=0.22 ...

Best Model

  • Algorithm: XGBoost
  • Version: model-v3.pkl
  • Metrics: precision@10=0.26, recall@10=0.16
  • Training time: 45 min
  • Model size: 12 MB

Deployment Ready

  • ✅ Inference latency: 35ms (target: <100ms)

  • ✅ Explainability: SHAP values computed

  • ✅ A/B test plan documented

  • Syncs living docs (via /sw:sync-docs):

  • Updates architecture docs with model design

  • Adds ADR for algorithm selection

  • Documents learnings in runbooks

When to Use This Skill

Activate this skill when you need to:

  • Build ML features end-to-end - From idea to deployed model

  • Ensure reproducibility - Every experiment tracked and documented

  • Follow ML best practices - Baseline comparison, proper validation, explainability

  • Integrate ML with software engineering - ML as increments, not isolated notebooks

  • Maintain team knowledge - Living docs capture why decisions were made

ML Pipeline Stages

  1. Data Stage
     • Data exploration (EDA)
     • Data quality assessment
     • Schema validation
     • Sample data documentation

  2. Feature Stage
     • Feature engineering
     • Feature selection
     • Feature importance analysis
     • Feature store integration (optional)

  3. Training Stage
     • Baseline model (random, rule-based)
     • Candidate models (3+ algorithms)
     • Hyperparameter tuning
     • Cross-validation

  4. Evaluation Stage
     • Comprehensive metrics (accuracy, precision, recall, F1, AUC; see the sketch after this list)
     • Business metrics (latency, throughput)
     • Model comparison (vs baseline, vs previous version)
     • Error analysis

  5. Explainability Stage
     • Feature importance
     • SHAP values
     • LIME explanations
     • Example predictions with rationale

  6. Deployment Stage
     • Model packaging
     • Inference pipeline
     • A/B test plan
     • Monitoring setup
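
For the Evaluation Stage, a minimal scikit-learn sketch of logging the comprehensive metrics, assuming y_test, y_pred, and y_proba come out of the Training Stage and exp is a tracking handle as in the Phase 2 examples:

from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred),
    "recall": recall_score(y_test, y_pred),
    "f1": f1_score(y_test, y_pred),
    "auc": roc_auc_score(y_test, y_proba),  # needs scores/probabilities, not labels
}
exp.log_metrics(metrics)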

Integration with SpecWeave Workflow

With Experiment Tracking

# Start ML increment
/sw:inc "0042-recommendation-model"

# Automatically integrates experiment tracking
# All MLflow/W&B logs saved to increment folder

With Living Docs

# After training the best model
/sw:sync-docs update

# Automatically:
#   - Updates architecture/ml-models.md
#   - Adds ADR for algorithm choice
#   - Documents hyperparameters in runbooks

With GitHub Sync

# Create GitHub issue for model retraining
/sw:github:create-issue "Retrain recommendation model with new data"

# Linked to increment 0042
# Issue tracks model performance over time

Best Practices

  1. Always Start with Baseline

# Before training complex models, establish a baseline
baseline_results = train_baseline_model(
    strategies=["random", "popularity", "rule-based"]
)

Requirement: New model must beat best baseline by 20%+
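
A minimal sketch of enforcing that gate, assuming baseline_results is a {strategy: score} dict from the snippet above and candidate_score is the new model's score on the same metric:

# Gate: the candidate must beat the strongest baseline by at least 20%
best_baseline = max(baseline_results.values())
assert candidate_score >= 1.2 * best_baseline, "candidate does not clear the 20% bar"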

  2. Use Cross-Validation

# Never trust a single train/test split
from sklearn.model_selection import cross_val_score

cv_scores = cross_val_score(model, X, y, cv=5)
exp.log_metric("cv_mean", cv_scores.mean())
exp.log_metric("cv_std", cv_scores.std())

  3. Track Everything

# Hyperparameters, metrics, artifacts, environment
exp.log_params(model.get_params())
exp.log_metrics({"accuracy": acc, "f1": f1})
exp.log_artifact("model.pkl")
exp.log_artifact("requirements.txt")  # Reproducibility

  4. Document Failures

# Failed experiments are valuable learnings
with track_experiment("exp-006-failed-lstm") as exp:
    # ... training fails ...
    exp.log_note("FAILED: LSTM overfits badly, needs regularization")
    exp.set_status("failed")

This documents why LSTM wasn't chosen

  5. Model Versioning

# Tie model versions to increments
model_version = f"0042-v{iteration}"
mlflow.register_model(
    f"runs:/{run_id}/model",
    f"recommendation-model-{model_version}"
)

Examples

Example 1: Classification Pipeline

User: "Build a fraud detection model for transactions"

Skill creates increment 0051-fraud-detection with:

  • spec.md: Binary classification, 99% precision target
  • plan.md: Imbalanced data handling, threshold tuning
  • tasks.md: 9 tasks from EDA to deployment
  • experiments/: exp-001-baseline, exp-002-xgboost, etc.

Guides through:

  1. EDA → identify class imbalance (0.1% fraud)
  2. Baseline → random/majority (terrible results)
  3. Candidates → XGBoost, LightGBM, Neural Net
  4. Threshold tuning → optimize for precision (sketched after this list)
  5. SHAP → explain high-risk predictions
  6. Deploy → model + threshold + explainer
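
Step 4's threshold tuning could look like this scikit-learn sketch, assuming validation labels y_val and model scores y_scores:

from sklearn.metrics import precision_recall_curve

# Lowest threshold that still meets the 99% precision target maximizes recall
precision, recall, thresholds = precision_recall_curve(y_val, y_scores)
meets_target = precision[:-1] >= 0.99  # precision has one extra trailing element
candidates = thresholds[meets_target]
threshold = candidates.min() if candidates.size else None  # None: target unreachable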

Example 2: Regression Pipeline

User: "Predict customer lifetime value"

Skill creates increment 0063-ltv-prediction with:

  • spec.md: Regression, RMSE < $50 target
  • plan.md: Time-based validation, feature engineering
  • tasks.md: Customer cohort analysis, feature importance

Key difference: Regression-specific evaluation (RMSE, MAE, R²)
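
A minimal sketch of that evaluation, assuming y_test and y_pred arrays:

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # spec target: RMSE < $50
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)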

Example 3: Time Series Forecasting

User: "Forecast weekly sales for next 12 weeks"

Skill creates increment 0072-sales-forecasting with:

  • spec.md: Time series, MAPE < 10% target
  • plan.md: Seasonal decomposition, ARIMA vs Prophet
  • tasks.md: Stationarity tests, residual analysis

Key difference: Time series validation (no random split)
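
A minimal sketch of leakage-free time series validation and the MAPE target, assuming weekly_sales is a chronologically ordered array:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

def mape(y_true, y_pred):
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100  # spec target: < 10%

# Expanding-window validation: training data always precedes the test window
tscv = TimeSeriesSplit(n_splits=5, test_size=12)  # 12-week holdout per fold
for train_idx, test_idx in tscv.split(weekly_sales):
    assert train_idx.max() < test_idx.min()  # no leakage from the future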

Framework Support

This skill works with all major ML frameworks:

Scikit-Learn

from sklearn.ensemble import RandomForestClassifier
from specweave import track_sklearn_model

model = RandomForestClassifier(n_estimators=100)
with track_sklearn_model(model, increment="0042") as tracked:
    tracked.fit(X_train, y_train)
    tracked.evaluate(X_test, y_test)

PyTorch

import torch
from specweave import track_pytorch_model

model = NeuralNet()
with track_pytorch_model(model, increment="0042") as tracked:
    for epoch in range(epochs):
        tracked.train_epoch(train_loader)
        tracked.log_metric(f"loss_epoch_{epoch}", loss)

TensorFlow/Keras

from tensorflow import keras
from specweave import KerasCallback

model = keras.Sequential([...])
model.fit(
    X_train, y_train,
    callbacks=[KerasCallback(increment="0042")]
)

XGBoost/LightGBM

import xgboost as xgb
from specweave import track_boosting_model

dtrain = xgb.DMatrix(X_train, label=y_train)
with track_boosting_model("xgboost", increment="0042") as tracked:
    model = xgb.train(params, dtrain, callbacks=[tracked.callback])

Integration Points

With experiment-tracker skill

  • Auto-detects MLflow/W&B in project

  • Configures tracking URI to increment folder

  • Syncs experiment metadata to increment docs

With model-evaluator skill

  • Generates comprehensive evaluation reports

  • Compares models across experiments

  • Highlights best model with confidence intervals
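
The skill's exact method isn't specified here; one common way to produce such confidence intervals is a percentile bootstrap, sketched below under the assumption that y_true and y_pred are NumPy arrays:

import numpy as np

def bootstrap_ci(y_true, y_pred, metric_fn, n_boot=1000, alpha=0.05, seed=42):
    """Percentile-bootstrap confidence interval for any metric."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample with replacement
        scores.append(metric_fn(y_true[idx], y_pred[idx]))
    return tuple(np.quantile(scores, [alpha / 2, 1 - alpha / 2]))

For example, bootstrap_ci(y_true, y_pred, f1_score) would give a 95% interval on F1.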

With feature-engineer skill

  • Generates feature engineering pipeline

  • Documents feature importance

  • Creates feature store schemas

With ml-engineer agent

  • Delegates complex ML decisions to specialized agent

  • Reviews model architecture

  • Suggests improvements based on results

Skill Outputs

After running /sw:do on an ML increment, you get:

.specweave/increments/0042-recommendation-model/
├── spec.md ✅
├── plan.md ✅
├── tasks.md ✅ (all completed)
├── COMPLETION-SUMMARY.md ✅
├── experiments/
│   ├── exp-001-baseline/
│   │   ├── metrics.json
│   │   ├── params.json
│   │   └── logs/
│   ├── exp-002-xgboost/ ✅ BEST
│   │   ├── metrics.json
│   │   ├── params.json
│   │   ├── model.pkl
│   │   └── shap_values.pkl
│   └── comparison.md
├── models/
│   ├── model-v3.pkl (best)
│   └── model-v3.metadata.json
├── data/
│   ├── schema.yaml
│   └── sample.parquet
└── notebooks/
    ├── 01-eda.ipynb
    ├── 02-feature-engineering.ipynb
    └── 03-model-analysis.ipynb

Commands

This skill integrates with SpecWeave commands:

Create ML increment

/sw:inc "build recommendation model" → Activates ml-pipeline-orchestrator → Creates ML-specific increment structure

Execute ML tasks

/sw:do
→ Guides through data → train → eval workflow
→ Auto-tracks experiments

Validate ML increment

/sw:validate 0042
→ Checks: experiments logged, model saved, metrics documented
→ Validates: model meets success criteria

Complete ML increment

/sw:done 0042
→ Generates ML completion summary
→ Syncs model metadata to living docs

Tips

  • Start simple - Always begin with baseline, then iterate

  • Track failures - Document why approaches didn't work

  • Version data - Use DVC or similar for data versioning

  • Reproducibility - Log environment (requirements.txt, conda env)

  • Incremental improvement - Each increment improves on previous model

  • Team collaboration - Living docs make ML decisions visible to all

Advanced: Multi-Increment ML Projects

For complex ML systems (e.g., recommendation system with multiple models):

0042-recommendation-data-pipeline
0043-recommendation-candidate-generation
0044-recommendation-ranking-model
0045-recommendation-reranking
0046-recommendation-ab-test

Each increment:

  • Has its own spec, plan, tasks

  • Builds on previous increments

  • Documents model interactions

  • Maintains system-level living docs

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • technical-writing
  • spec-driven-brainstorming
  • kafka-architecture
  • docusaurus

(All four are in the General category, repository-sourced, and flagged "Needs Review"; no summaries are provided by the upstream source.)