experiment-tracker


Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "experiment-tracker" with this command: npx skills add anton-abyzov/specweave/anton-abyzov-specweave-experiment-tracker

Experiment Tracker

Overview

Transforms chaotic ML experimentation into organized, reproducible research. Every experiment is logged, versioned, and tied to a SpecWeave increment, ensuring team knowledge is preserved and experiments are reproducible.

Problem This Solves

Without structured tracking:

  • ❌ "Which hyperparameters did we use for model v2?"

  • ❌ "Why did we choose XGBoost over LightGBM?"

  • ❌ "Can't reproduce results from 3 months ago"

  • ❌ "Team member left, all knowledge in their notebooks"

With experiment tracking:

  • ✅ All experiments logged with params, metrics, artifacts

  • ✅ Decisions documented ("XGBoost: 5% better precision, chose it")

  • ✅ Reproducible (environment, data version, code hash)

  • ✅ Team knowledge in living docs, not individual notebooks

How It Works

Auto-Configuration

When you create an ML increment, the skill detects tracking tools:

```python
# No configuration needed - automatically detects and configures
from specweave import track_experiment

# Automatically logs to:
# .specweave/increments/0042.../experiments/exp-001/
with track_experiment("baseline-model") as exp:
    model.fit(X_train, y_train)
    exp.log_metric("accuracy", accuracy)
```

Tracking Backends

Option 1: SpecWeave Built-in (default, zero-config)

```python
from specweave import track_experiment

# Logs to increment folder automatically
with track_experiment("xgboost-v1") as exp:
    exp.log_param("n_estimators", 100)
    exp.log_metric("auc", 0.87)
    exp.save_model(model, "model.pkl")
```

Creates:

```
.specweave/increments/0042.../experiments/xgboost-v1/
├── params.json
├── metrics.json
├── model.pkl
└── metadata.yaml
```
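The folder layout above needs nothing beyond the standard library to reproduce. A minimal sketch of the idea, where `write_experiment` is a hypothetical helper and not part of the SpecWeave API:

```python
import json
import tempfile
from pathlib import Path

def write_experiment(root, name, params, metrics):
    """Write a minimal experiment folder: params.json + metrics.json."""
    exp_dir = Path(root) / "experiments" / name
    exp_dir.mkdir(parents=True, exist_ok=True)
    (exp_dir / "params.json").write_text(json.dumps(params, indent=2))
    (exp_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
    return exp_dir

root = Path(tempfile.mkdtemp())
exp_dir = write_experiment(
    root, "xgboost-v1",
    params={"n_estimators": 100},
    metrics={"auc": 0.87},
)
print(sorted(p.name for p in exp_dir.iterdir()))  # ['metrics.json', 'params.json']
```

Because everything is plain JSON on disk, experiments stay greppable and diffable even without the skill installed.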

Option 2: MLflow (if detected in project)

```python
import mlflow
from specweave import configure_mlflow

# Auto-configures MLflow to log to increment
configure_mlflow(increment="0042")

with mlflow.start_run(run_name="xgboost-v1"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("auc", 0.87)
    mlflow.sklearn.log_model(model, "model")
```

Still logs to increment folder, just uses MLflow as backend

Option 3: Weights & Biases

```python
import wandb
from specweave import configure_wandb

# Auto-configures W&B project = increment ID
configure_wandb(increment="0042")

run = wandb.init(name="xgboost-v1")
run.log({"auc": 0.87})
run.log_model("model.pkl")
```

W&B dashboard + local logs in increment folder

Experiment Comparison

```python
from specweave import compare_experiments

# Compare all experiments in increment
comparison = compare_experiments(increment="0042")

# Generates:
# .specweave/increments/0042.../experiments/comparison.md
```

Output:

| Experiment | Accuracy | Precision | Recall | F1 | Training Time |
|---|---|---|---|---|---|
| exp-001-baseline | 0.65 | 0.60 | 0.55 | 0.57 | 2s |
| exp-002-xgboost | 0.87 | 0.85 | 0.83 | 0.84 | 45s |
| exp-003-lightgbm | 0.86 | 0.84 | 0.82 | 0.83 | 32s |
| exp-004-neural-net | 0.85 | 0.83 | 0.81 | 0.82 | 320s |

Best Model: exp-002-xgboost

  • Highest accuracy (0.87)
  • Good precision/recall balance
  • Reasonable training time (45s)
  • Selected for deployment
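A comparison table like the one above can be assembled from logged metrics with plain Python. This is an illustrative sketch of the idea, not the actual `compare_experiments` implementation; the `comparison_table` helper is hypothetical and assumes every experiment logs the same metric keys:

```python
def comparison_table(experiments):
    """Render logged metrics as a Markdown comparison table.

    `experiments` maps experiment name -> metrics dict; all dicts
    are assumed to share the same metric keys, in the same order.
    """
    names = list(experiments)
    columns = list(experiments[names[0]])
    header = "| Experiment | " + " | ".join(columns) + " |"
    divider = "|" + "---|" * (len(columns) + 1)
    rows = [
        "| " + name + " | "
        + " | ".join(str(experiments[name][c]) for c in columns) + " |"
        for name in names
    ]
    return "\n".join([header, divider] + rows)

table = comparison_table({
    "exp-001-baseline": {"accuracy": 0.65, "f1": 0.57},
    "exp-002-xgboost": {"accuracy": 0.87, "f1": 0.84},
})
print(table)
```

Writing the result to `comparison.md` inside the increment folder keeps the decision record next to the raw metrics.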

Living Docs Integration

After completing increment:

/sw:sync-docs update

Automatically updates:

<!-- .specweave/docs/internal/architecture/ml-experiments.md -->

Recommendation Model (Increment 0042)

Experiments Conducted: 7

  • exp-001-baseline: Random classifier (acc=0.12)
  • exp-002-popularity: Popularity baseline (acc=0.18)
  • exp-003-xgboost: XGBoost classifier (acc=0.26) ✅ SELECTED
  • ...

Selection Rationale

XGBoost chosen for:

  • Best accuracy (0.26 vs baseline 0.18, +44% improvement)
  • Fast inference (<50ms)
  • Good explainability (SHAP values)
  • Stable across cross-validation (std=0.02)

Hyperparameters (exp-003)

  • n_estimators: 200
  • max_depth: 6
  • learning_rate: 0.1
  • subsample: 0.8

When to Use This Skill

Activate when you need to:

  • Track ML experiments systematically

  • Compare multiple models objectively

  • Document experiment decisions for team

  • Reproduce past results exactly

  • Maintain experiment history across increments

Key Features

  1. Automatic Logging

```python
# Logs everything automatically
from specweave import AutoTracker

tracker = AutoTracker(increment="0042")

# Just wrap your training code
@tracker.track(name="xgboost-auto")
def train_model():
    model = XGBClassifier(**params)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    return model, score

# Automatically logs: params, metrics, model, environment, git hash
model, score = train_model()
```
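The decorator pattern behind an auto-tracker can be sketched in a few lines of plain Python. This `MiniTracker` is a toy stand-in, not SpecWeave's implementation; it assumes the wrapped function returns a `(model, score)` pair and records keyword arguments as params:

```python
import functools

class MiniTracker:
    """Toy auto-tracker: records kwargs as params and the result as a metric."""

    def __init__(self):
        self.runs = {}

    def track(self, name):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                result = fn(*args, **kwargs)
                # Convention (assumed): fn returns (model, score)
                _, score = result
                self.runs[name] = {"params": kwargs, "score": score}
                return result
            return wrapper
        return decorator

tracker = MiniTracker()

@tracker.track(name="demo")
def train(n_estimators=100):
    return "model", 0.9  # stand-in for a fitted model and its score

model, score = train(n_estimators=200)
print(tracker.runs["demo"])  # {'params': {'n_estimators': 200}, 'score': 0.9}
```

A real tracker would additionally persist the run to disk and capture the git hash and environment, but the interception point is the same.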

2. Hyperparameter Tracking

```python
from specweave import track_hyperparameters

params_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.3],
}

# Tracks all parameter combinations
results = track_hyperparameters(
    model=XGBClassifier,
    param_grid=params_grid,
    X_train=X_train,
    y_train=y_train,
    increment="0042",
)

# Generates parameter importance analysis
```

3. Cross-Validation Tracking

```python
from specweave import track_cross_validation

# Tracks each fold separately
cv_results = track_cross_validation(
    model=model,
    X=X,
    y=y,
    cv=5,
    increment="0042",
)

# Logs: mean, std, per-fold scores, fold distribution
```
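The per-fold summary described here (mean, std, per-fold scores) is straightforward to compute with the standard `statistics` module. An illustrative helper, not the tracker's actual code:

```python
import statistics

def summarize_folds(fold_scores):
    """Aggregate per-fold scores into the summary a tracker would log."""
    return {
        "mean": statistics.mean(fold_scores),
        "std": statistics.stdev(fold_scores),
        "per_fold": list(fold_scores),
    }

summary = summarize_folds([0.85, 0.87, 0.86, 0.88, 0.84])
print(summary["mean"], summary["std"])
```

Logging the per-fold list alongside the aggregates matters: a high mean with one collapsed fold often signals data leakage or an unlucky split.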

4. Artifact Management

```python
from specweave import track_experiment

with track_experiment("xgboost-v1") as exp:
    # Training artifacts
    exp.save_artifact("preprocessor.pkl", preprocessor)
    exp.save_artifact("model.pkl", model)

    # Evaluation artifacts
    exp.save_artifact("confusion_matrix.png", cm_plot)
    exp.save_artifact("roc_curve.png", roc_plot)

    # Data artifacts
    exp.save_artifact("feature_importance.csv", importance_df)

    # Environment artifacts
    exp.save_artifact("requirements.txt", requirements)
    exp.save_artifact("conda_env.yaml", conda_env)
```

5. Experiment Metadata

```python
from specweave import ExperimentMetadata

metadata = ExperimentMetadata(
    name="xgboost-v3",
    description="XGBoost with feature engineering v2",
    tags=["production-candidate", "feature-eng-v2"],
    git_commit="a3b8c9d",
    data_version="v2024-01",
    author="[email protected]",
)

with track_experiment(metadata) as exp:
    # ... training ...
    pass
```

Best Practices

  1. Name Experiments Clearly

```python
# ❌ Bad: Generic names
with track_experiment("exp1"):
    ...

# ✅ Good: Descriptive names
with track_experiment("xgboost-tuned-depth6-lr0.1"):
    ...
```

2. Log Everything

```python
import sys

import sklearn

# Log more than you think you need
exp.log_param("random_seed", 42)
exp.log_param("data_version", "2024-01")
exp.log_param("python_version", sys.version)
exp.log_param("sklearn_version", sklearn.__version__)
```

Future you will thank present you
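Gathering those environment facts can be bundled into one call using only the standard library. A hedged sketch, where `environment_params` is a hypothetical helper rather than a SpecWeave API:

```python
import platform
import sys

def environment_params():
    """Environment facts worth attaching to every experiment."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": sys.platform,
    }

env = environment_params()
print(env)
```

Feeding this dict into `exp.log_param` calls (or logging it wholesale) makes every run traceable to the interpreter that produced it.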

3. Document Failures

```python
with track_experiment("neural-net-attempt") as exp:
    try:
        model.fit(X_train, y_train)
    except Exception as e:
        exp.log_note(f"FAILED: {e}")
        exp.log_note("Reason: Out of memory, need smaller batch size")
        exp.set_status("failed")
```

Failure documentation prevents repeating mistakes

4. Use Experiment Series

```python
# Related experiments in series
experiments = [
    "xgboost-baseline",
    "xgboost-tuned-v1",
    "xgboost-tuned-v2",
    "xgboost-tuned-v3-final",
]
```

Track progression and improvements
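One way to make a series useful is to compute the metric delta between successive runs. A small illustrative helper (not a SpecWeave API), under the assumption that each run in the series logs a single accuracy value:

```python
def series_progression(results):
    """Pair successive runs in a series with their accuracy deltas."""
    names = list(results)
    return [
        (prev, curr, round(results[curr] - results[prev], 3))
        for prev, curr in zip(names, names[1:])
    ]

deltas = series_progression({
    "xgboost-baseline": 0.79,
    "xgboost-tuned-v1": 0.83,
    "xgboost-tuned-v2": 0.85,
    "xgboost-tuned-v3-final": 0.87,
})
print(deltas)
```

A run whose delta is near zero (or negative) is a natural stopping signal for the tuning series.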

5. Link to Data Versions

```python
with track_experiment("xgboost-v1") as exp:
    exp.log_param("data_commit", "dvc:a3b8c9d")
    exp.log_param("data_url", "s3://bucket/data/v2024-01")
```

Enables exact reproduction
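Beyond a DVC commit, a content hash of the training data pins the version exactly. A minimal sketch using `hashlib`; the `data_fingerprint` helper is illustrative (for a file, pass the file's bytes, e.g. `Path(p).read_bytes()`):

```python
import hashlib

def data_fingerprint(data: bytes) -> str:
    """Short SHA-256 fingerprint of a dataset, suitable for exp.log_param."""
    return hashlib.sha256(data).hexdigest()[:12]

fp = data_fingerprint(b"user_id,item_id,rating\n1,42,5\n")
print(fp)
```

The fingerprint is deterministic, so re-running an experiment on supposedly identical data immediately reveals any silent drift.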

Integration with SpecWeave

With Increments

Experiments automatically tied to increment

/sw:inc "0042-recommendation-model"

All experiments logged to: .specweave/increments/0042.../experiments/

With Living Docs

Sync experiment findings to docs

/sw:sync-docs update

Updates: architecture/ml-models.md, runbooks/model-training.md

With GitHub

Create issue for model retraining

/sw:github:create-issue "Retrain model with Q1 2024 data"

Links to previous experiments in increment

Examples

Example 1: Baseline Experiments

```python
from sklearn.dummy import DummyClassifier
from specweave import track_experiment

# Valid DummyClassifier strategies for simple baselines
baselines = ["uniform", "most_frequent", "stratified"]

for strategy in baselines:
    with track_experiment(f"baseline-{strategy}") as exp:
        model = DummyClassifier(strategy=strategy)
        model.fit(X_train, y_train)

        accuracy = model.score(X_test, y_test)
        exp.log_metric("accuracy", accuracy)
        exp.log_note(f"Baseline: {strategy}")
```

Generates baseline comparison report

Example 2: Hyperparameter Grid Search

```python
from xgboost import XGBClassifier
from specweave import track_grid_search

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9],
}

# Automatically logs all combinations
best_model, results = track_grid_search(
    XGBClassifier(),
    param_grid,
    X_train,
    y_train,
    increment="0042",
)

# Creates visualization of parameter importance
```

Example 3: Model Comparison

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from specweave import compare_models

models = {
    "xgboost": XGBClassifier(),
    "lightgbm": LGBMClassifier(),
    "random-forest": RandomForestClassifier(),
}

# Trains and compares all models
comparison = compare_models(
    models, X_train, y_train, X_test, y_test, increment="0042"
)

# Generates markdown comparison table
```

Tool Compatibility

MLflow

```python
# Option 1: Pure MLflow (auto-configured)
import mlflow
mlflow.set_tracking_uri(".specweave/increments/0042.../experiments")

# Option 2: SpecWeave wrapper (recommended)
from specweave import mlflow as sw_mlflow

with sw_mlflow.start_run("xgboost"):
    # Logs to both MLflow and increment docs
    pass
```

Weights & Biases

```python
# Option 1: Pure wandb
import wandb
wandb.init(project="0042-recommendation-model")

# Option 2: SpecWeave wrapper (recommended)
from specweave import wandb as sw_wandb
run = sw_wandb.init(increment="0042", name="xgboost")
```

Syncs to increment folder + W&B dashboard

TensorBoard

```python
from specweave import TensorBoardCallback

# Keras callback
model.fit(
    X_train,
    y_train,
    callbacks=[
        TensorBoardCallback(
            increment="0042",
            log_dir=".specweave/increments/0042.../tensorboard",
        )
    ],
)
```

Commands

```
# List all experiments in increment
/ml:list-experiments 0042

# Compare experiments
/ml:compare-experiments 0042

# Load experiment details
/ml:show-experiment exp-003-xgboost

# Export experiment data
/ml:export-experiments 0042 --format csv
```

Tips

  • Start tracking early - Track from first experiment, not after 20 failed attempts

  • Tag production models - exp.add_tag("production") for deployed models

  • Version everything - Data, code, environment, dependencies

  • Document decisions - Why model A over model B (not just metrics)

  • Prune old experiments - Archive experiments >6 months old
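The pruning tip can be automated with the standard library: move experiment folders whose modification time is older than a cutoff into an archive directory. A sketch under the assumption that folder mtime approximates last activity; `archive_old_experiments` is an illustrative helper, not a SpecWeave command:

```python
import os
import shutil
import tempfile
import time
from pathlib import Path

def archive_old_experiments(experiments_dir, archive_dir, max_age_days=180):
    """Move experiment folders untouched for max_age_days into an archive."""
    experiments_dir = Path(experiments_dir)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - max_age_days * 86400
    moved = []
    for exp in experiments_dir.iterdir():
        if exp.is_dir() and exp.stat().st_mtime < cutoff:
            shutil.move(str(exp), str(archive_dir / exp.name))
            moved.append(exp.name)
    return sorted(moved)

# Demo with one backdated and one fresh experiment folder
root = Path(tempfile.mkdtemp())
old = root / "experiments" / "exp-001-old"
new = root / "experiments" / "exp-002-new"
old.mkdir(parents=True)
new.mkdir(parents=True)
os.utime(old, (time.time() - 200 * 86400,) * 2)  # backdate 200 days

moved = archive_old_experiments(root / "experiments", root / "archive")
print(moved)  # ['exp-001-old']
```

Archiving (rather than deleting) preserves the reproducibility guarantee while keeping the active increment folder small.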

Advanced: Multi-Stage Experiments

For complex pipelines with multiple stages:

```python
from specweave import ExperimentPipeline

pipeline = ExperimentPipeline("recommendation-full-pipeline")

# Stage 1: Data preprocessing
with pipeline.stage("preprocessing") as stage:
    stage.log_metric("rows_before", len(df))
    df_clean = preprocess(df)
    stage.log_metric("rows_after", len(df_clean))

# Stage 2: Feature engineering
with pipeline.stage("features") as stage:
    features = engineer_features(df_clean)
    stage.log_metric("num_features", features.shape[1])

# Stage 3: Model training
with pipeline.stage("training") as stage:
    model = train_model(features)
    stage.log_metric("accuracy", accuracy)
```

Logs entire pipeline with stage dependencies

Integration Points

  • ml-pipeline-orchestrator: Auto-tracks experiments during pipeline execution

  • model-evaluator: Uses experiment data for model comparison

  • ml-engineer agent: Reviews experiment results and suggests improvements

  • Living docs: Syncs experiment findings to architecture docs

This skill ensures ML experimentation is never lost, always reproducible, and well-documented.

