experiment-tracker


Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "experiment-tracker" with this command: npx skills add anton-abyzov/specweave/anton-abyzov-specweave-experiment-tracker

Experiment Tracker

Overview

Transforms chaotic ML experimentation into organized, reproducible research. Every experiment is logged, versioned, and tied to a SpecWeave increment, ensuring team knowledge is preserved and experiments are reproducible.

Problem This Solves

Without structured tracking:

  • ❌ "Which hyperparameters did we use for model v2?"

  • ❌ "Why did we choose XGBoost over LightGBM?"

  • ❌ "Can't reproduce results from 3 months ago"

  • ❌ "Team member left, all knowledge in their notebooks"

With experiment tracking:

  • ✅ All experiments logged with params, metrics, artifacts

  • ✅ Decisions documented ("XGBoost: 5% better precision, chose it")

  • ✅ Reproducible (environment, data version, code hash)

  • ✅ Team knowledge in living docs, not individual notebooks

How It Works

Auto-Configuration

When you create an ML increment, the skill detects tracking tools:

```python
# No configuration needed - automatically detects and configures
from specweave import track_experiment

# Automatically logs to:
# .specweave/increments/0042.../experiments/exp-001/
with track_experiment("baseline-model") as exp:
    model.fit(X_train, y_train)
    exp.log_metric("accuracy", accuracy)
```

Tracking Backends

Option 1: SpecWeave Built-in (default, zero-config)

```python
from specweave import track_experiment

# Logs to increment folder automatically
with track_experiment("xgboost-v1") as exp:
    exp.log_param("n_estimators", 100)
    exp.log_metric("auc", 0.87)
    exp.save_model(model, "model.pkl")
```

Creates:

```
.specweave/increments/0042.../experiments/xgboost-v1/
├── params.json
├── metrics.json
├── model.pkl
└── metadata.yaml
```
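The folder layout above needs nothing beyond the standard library to reproduce. A minimal sketch of the idea, where `write_experiment` is a hypothetical helper and not part of the SpecWeave API:

```python
import json
import tempfile
from pathlib import Path

def write_experiment(root, name, params, metrics):
    """Write a minimal experiment folder: params.json + metrics.json."""
    exp_dir = Path(root) / "experiments" / name
    exp_dir.mkdir(parents=True, exist_ok=True)
    (exp_dir / "params.json").write_text(json.dumps(params, indent=2))
    (exp_dir / "metrics.json").write_text(json.dumps(metrics, indent=2))
    return exp_dir

root = Path(tempfile.mkdtemp())
exp_dir = write_experiment(
    root, "xgboost-v1",
    params={"n_estimators": 100},
    metrics={"auc": 0.87},
)
print(sorted(p.name for p in exp_dir.iterdir()))  # ['metrics.json', 'params.json']
```

Because everything is plain JSON on disk, experiments stay greppable and diffable even without the skill installed.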

Option 2: MLflow (if detected in project)

```python
import mlflow
from specweave import configure_mlflow

# Auto-configures MLflow to log to increment
configure_mlflow(increment="0042")

with mlflow.start_run(run_name="xgboost-v1"):
    mlflow.log_param("n_estimators", 100)
    mlflow.log_metric("auc", 0.87)
    mlflow.sklearn.log_model(model, "model")
```

Still logs to increment folder, just uses MLflow as backend

Option 3: Weights & Biases

```python
import wandb
from specweave import configure_wandb

# Auto-configures W&B project = increment ID
configure_wandb(increment="0042")

run = wandb.init(name="xgboost-v1")
run.log({"auc": 0.87})
run.log_model("model.pkl")
```

W&B dashboard + local logs in increment folder

Experiment Comparison

```python
from specweave import compare_experiments

# Compare all experiments in increment
comparison = compare_experiments(increment="0042")

# Generates:
# .specweave/increments/0042.../experiments/comparison.md
```

Output:

| Experiment | Accuracy | Precision | Recall | F1 | Training Time |
|---|---|---|---|---|---|
| exp-001-baseline | 0.65 | 0.60 | 0.55 | 0.57 | 2s |
| exp-002-xgboost | 0.87 | 0.85 | 0.83 | 0.84 | 45s |
| exp-003-lightgbm | 0.86 | 0.84 | 0.82 | 0.83 | 32s |
| exp-004-neural-net | 0.85 | 0.83 | 0.81 | 0.82 | 320s |

Best Model: exp-002-xgboost

  • Highest accuracy (0.87)
  • Good precision/recall balance
  • Reasonable training time (45s)
  • Selected for deployment
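A comparison table like the one above can be assembled from logged metrics with plain Python. This is an illustrative sketch of the idea, not the actual `compare_experiments` implementation; the `comparison_table` helper is hypothetical and assumes every experiment logs the same metric keys:

```python
def comparison_table(experiments):
    """Render logged metrics as a Markdown comparison table.

    `experiments` maps experiment name -> metrics dict; all dicts
    are assumed to share the same metric keys, in the same order.
    """
    names = list(experiments)
    columns = list(experiments[names[0]])
    header = "| Experiment | " + " | ".join(columns) + " |"
    divider = "|" + "---|" * (len(columns) + 1)
    rows = [
        "| " + name + " | "
        + " | ".join(str(experiments[name][c]) for c in columns) + " |"
        for name in names
    ]
    return "\n".join([header, divider] + rows)

table = comparison_table({
    "exp-001-baseline": {"accuracy": 0.65, "f1": 0.57},
    "exp-002-xgboost": {"accuracy": 0.87, "f1": 0.84},
})
print(table)
```

Writing the result to `comparison.md` inside the increment folder keeps the decision record next to the raw metrics.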

Living Docs Integration

After completing increment:

/sw:sync-docs update

Automatically updates:

<!-- .specweave/docs/internal/architecture/ml-experiments.md -->

Recommendation Model (Increment 0042)

Experiments Conducted: 7

  • exp-001-baseline: Random classifier (acc=0.12)
  • exp-002-popularity: Popularity baseline (acc=0.18)
  • exp-003-xgboost: XGBoost classifier (acc=0.26) ✅ SELECTED
  • ...

Selection Rationale

XGBoost chosen for:

  • Best accuracy (0.26 vs baseline 0.18, +44% improvement)
  • Fast inference (<50ms)
  • Good explainability (SHAP values)
  • Stable across cross-validation (std=0.02)

Hyperparameters (exp-003)

  • n_estimators: 200
  • max_depth: 6
  • learning_rate: 0.1
  • subsample: 0.8

When to Use This Skill

Activate when you need to:

  • Track ML experiments systematically

  • Compare multiple models objectively

  • Document experiment decisions for team

  • Reproduce past results exactly

  • Maintain experiment history across increments

Key Features

  1. Automatic Logging

```python
# Logs everything automatically
from specweave import AutoTracker

tracker = AutoTracker(increment="0042")

# Just wrap your training code
@tracker.track(name="xgboost-auto")
def train_model():
    model = XGBClassifier(**params)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    return model, score

# Automatically logs: params, metrics, model, environment, git hash
model, score = train_model()
```
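The decorator pattern behind an auto-tracker can be sketched in a few lines of plain Python. This `MiniTracker` is a toy stand-in, not SpecWeave's implementation; it assumes the wrapped function returns a `(model, score)` pair and records keyword arguments as params:

```python
import functools

class MiniTracker:
    """Toy auto-tracker: records kwargs as params and the result as a metric."""

    def __init__(self):
        self.runs = {}

    def track(self, name):
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                result = fn(*args, **kwargs)
                # Convention (assumed): fn returns (model, score)
                _, score = result
                self.runs[name] = {"params": kwargs, "score": score}
                return result
            return wrapper
        return decorator

tracker = MiniTracker()

@tracker.track(name="demo")
def train(n_estimators=100):
    return "model", 0.9  # stand-in for a fitted model and its score

model, score = train(n_estimators=200)
print(tracker.runs["demo"])  # {'params': {'n_estimators': 200}, 'score': 0.9}
```

A real tracker would additionally persist the run to disk and capture the git hash and environment, but the interception point is the same.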

2. Hyperparameter Tracking

```python
from specweave import track_hyperparameters

params_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9],
    "learning_rate": [0.01, 0.1, 0.3],
}

# Tracks all parameter combinations
results = track_hyperparameters(
    model=XGBClassifier,
    param_grid=params_grid,
    X_train=X_train,
    y_train=y_train,
    increment="0042",
)

# Generates parameter importance analysis
```

3. Cross-Validation Tracking

```python
from specweave import track_cross_validation

# Tracks each fold separately
cv_results = track_cross_validation(
    model=model,
    X=X,
    y=y,
    cv=5,
    increment="0042",
)

# Logs: mean, std, per-fold scores, fold distribution
```
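The per-fold summary described here (mean, std, per-fold scores) is straightforward to compute with the standard `statistics` module. An illustrative helper, not the tracker's actual code:

```python
import statistics

def summarize_folds(fold_scores):
    """Aggregate per-fold scores into the summary a tracker would log."""
    return {
        "mean": statistics.mean(fold_scores),
        "std": statistics.stdev(fold_scores),
        "per_fold": list(fold_scores),
    }

summary = summarize_folds([0.85, 0.87, 0.86, 0.88, 0.84])
print(summary["mean"], summary["std"])
```

Logging the per-fold list alongside the aggregates matters: a high mean with one collapsed fold often signals data leakage or an unlucky split.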

4. Artifact Management

```python
from specweave import track_experiment

with track_experiment("xgboost-v1") as exp:
    # Training artifacts
    exp.save_artifact("preprocessor.pkl", preprocessor)
    exp.save_artifact("model.pkl", model)

    # Evaluation artifacts
    exp.save_artifact("confusion_matrix.png", cm_plot)
    exp.save_artifact("roc_curve.png", roc_plot)

    # Data artifacts
    exp.save_artifact("feature_importance.csv", importance_df)

    # Environment artifacts
    exp.save_artifact("requirements.txt", requirements)
    exp.save_artifact("conda_env.yaml", conda_env)
```

5. Experiment Metadata

```python
from specweave import ExperimentMetadata

metadata = ExperimentMetadata(
    name="xgboost-v3",
    description="XGBoost with feature engineering v2",
    tags=["production-candidate", "feature-eng-v2"],
    git_commit="a3b8c9d",
    data_version="v2024-01",
    author="[email protected]",
)

with track_experiment(metadata) as exp:
    # ... training ...
    pass
```

Best Practices

  1. Name Experiments Clearly

```python
# ❌ Bad: Generic names
with track_experiment("exp1"):
    ...

# ✅ Good: Descriptive names
with track_experiment("xgboost-tuned-depth6-lr0.1"):
    ...
```

2. Log Everything

```python
import sys

import sklearn

# Log more than you think you need
exp.log_param("random_seed", 42)
exp.log_param("data_version", "2024-01")
exp.log_param("python_version", sys.version)
exp.log_param("sklearn_version", sklearn.__version__)
```

Future you will thank present you
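Gathering those environment facts can be bundled into one call using only the standard library. A hedged sketch, where `environment_params` is a hypothetical helper rather than a SpecWeave API:

```python
import platform
import sys

def environment_params():
    """Environment facts worth attaching to every experiment."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": sys.platform,
    }

env = environment_params()
print(env)
```

Feeding this dict into `exp.log_param` calls (or logging it wholesale) makes every run traceable to the interpreter that produced it.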

3. Document Failures

```python
with track_experiment("neural-net-attempt") as exp:
    try:
        model.fit(X_train, y_train)
    except Exception as e:
        exp.log_note(f"FAILED: {e}")
        exp.log_note("Reason: Out of memory, need smaller batch size")
        exp.set_status("failed")
```

Failure documentation prevents repeating mistakes

4. Use Experiment Series

```python
# Related experiments in series
experiments = [
    "xgboost-baseline",
    "xgboost-tuned-v1",
    "xgboost-tuned-v2",
    "xgboost-tuned-v3-final",
]
```

Track progression and improvements
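One way to make a series useful is to compute the metric delta between successive runs. A small illustrative helper (not a SpecWeave API), under the assumption that each run in the series logs a single accuracy value:

```python
def series_progression(results):
    """Pair successive runs in a series with their accuracy deltas."""
    names = list(results)
    return [
        (prev, curr, round(results[curr] - results[prev], 3))
        for prev, curr in zip(names, names[1:])
    ]

deltas = series_progression({
    "xgboost-baseline": 0.79,
    "xgboost-tuned-v1": 0.83,
    "xgboost-tuned-v2": 0.85,
    "xgboost-tuned-v3-final": 0.87,
})
print(deltas)
```

A run whose delta is near zero (or negative) is a natural stopping signal for the tuning series.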

5. Link to Data Versions

```python
with track_experiment("xgboost-v1") as exp:
    exp.log_param("data_commit", "dvc:a3b8c9d")
    exp.log_param("data_url", "s3://bucket/data/v2024-01")
```

Enables exact reproduction
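Beyond a DVC commit, a content hash of the training data pins the version exactly. A minimal sketch using `hashlib`; the `data_fingerprint` helper is illustrative (for a file, pass the file's bytes, e.g. `Path(p).read_bytes()`):

```python
import hashlib

def data_fingerprint(data: bytes) -> str:
    """Short SHA-256 fingerprint of a dataset, suitable for exp.log_param."""
    return hashlib.sha256(data).hexdigest()[:12]

fp = data_fingerprint(b"user_id,item_id,rating\n1,42,5\n")
print(fp)
```

The fingerprint is deterministic, so re-running an experiment on supposedly identical data immediately reveals any silent drift.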

Integration with SpecWeave

With Increments

Experiments automatically tied to increment

/sw:inc "0042-recommendation-model"

All experiments logged to: .specweave/increments/0042.../experiments/

With Living Docs

Sync experiment findings to docs

/sw:sync-docs update

Updates: architecture/ml-models.md, runbooks/model-training.md

With GitHub

Create issue for model retraining

/sw:github:create-issue "Retrain model with Q1 2024 data"

Links to previous experiments in increment

Examples

Example 1: Baseline Experiments

```python
from sklearn.dummy import DummyClassifier
from specweave import track_experiment

# Valid DummyClassifier strategies for simple baselines
baselines = ["uniform", "most_frequent", "stratified"]

for strategy in baselines:
    with track_experiment(f"baseline-{strategy}") as exp:
        model = DummyClassifier(strategy=strategy)
        model.fit(X_train, y_train)

        accuracy = model.score(X_test, y_test)
        exp.log_metric("accuracy", accuracy)
        exp.log_note(f"Baseline: {strategy}")
```

Generates baseline comparison report

Example 2: Hyperparameter Grid Search

```python
from xgboost import XGBClassifier
from specweave import track_grid_search

param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 6, 9],
}

# Automatically logs all combinations
best_model, results = track_grid_search(
    XGBClassifier(),
    param_grid,
    X_train,
    y_train,
    increment="0042",
)

# Creates visualization of parameter importance
```

Example 3: Model Comparison

```python
from lightgbm import LGBMClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from specweave import compare_models

models = {
    "xgboost": XGBClassifier(),
    "lightgbm": LGBMClassifier(),
    "random-forest": RandomForestClassifier(),
}

# Trains and compares all models
comparison = compare_models(
    models, X_train, y_train, X_test, y_test, increment="0042"
)

# Generates markdown comparison table
```

Tool Compatibility

MLflow

```python
# Option 1: Pure MLflow (auto-configured)
import mlflow
mlflow.set_tracking_uri(".specweave/increments/0042.../experiments")

# Option 2: SpecWeave wrapper (recommended)
from specweave import mlflow as sw_mlflow

with sw_mlflow.start_run("xgboost"):
    # Logs to both MLflow and increment docs
    pass
```

Weights & Biases

```python
# Option 1: Pure wandb
import wandb
wandb.init(project="0042-recommendation-model")

# Option 2: SpecWeave wrapper (recommended)
from specweave import wandb as sw_wandb
run = sw_wandb.init(increment="0042", name="xgboost")
```

Syncs to increment folder + W&B dashboard

TensorBoard

```python
from specweave import TensorBoardCallback

# Keras callback
model.fit(
    X_train,
    y_train,
    callbacks=[
        TensorBoardCallback(
            increment="0042",
            log_dir=".specweave/increments/0042.../tensorboard",
        )
    ],
)
```

Commands

```
# List all experiments in increment
/ml:list-experiments 0042

# Compare experiments
/ml:compare-experiments 0042

# Load experiment details
/ml:show-experiment exp-003-xgboost

# Export experiment data
/ml:export-experiments 0042 --format csv
```

Tips

  • Start tracking early - Track from first experiment, not after 20 failed attempts

  • Tag production models - exp.add_tag("production") for deployed models

  • Version everything - Data, code, environment, dependencies

  • Document decisions - Why model A over model B (not just metrics)

  • Prune old experiments - Archive experiments >6 months old
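The pruning tip can be automated with the standard library: move experiment folders whose modification time is older than a cutoff into an archive directory. A sketch under the assumption that folder mtime approximates last activity; `archive_old_experiments` is an illustrative helper, not a SpecWeave command:

```python
import os
import shutil
import tempfile
import time
from pathlib import Path

def archive_old_experiments(experiments_dir, archive_dir, max_age_days=180):
    """Move experiment folders untouched for max_age_days into an archive."""
    experiments_dir = Path(experiments_dir)
    archive_dir = Path(archive_dir)
    archive_dir.mkdir(parents=True, exist_ok=True)
    cutoff = time.time() - max_age_days * 86400
    moved = []
    for exp in experiments_dir.iterdir():
        if exp.is_dir() and exp.stat().st_mtime < cutoff:
            shutil.move(str(exp), str(archive_dir / exp.name))
            moved.append(exp.name)
    return sorted(moved)

# Demo with one backdated and one fresh experiment folder
root = Path(tempfile.mkdtemp())
old = root / "experiments" / "exp-001-old"
new = root / "experiments" / "exp-002-new"
old.mkdir(parents=True)
new.mkdir(parents=True)
os.utime(old, (time.time() - 200 * 86400,) * 2)  # backdate 200 days

moved = archive_old_experiments(root / "experiments", root / "archive")
print(moved)  # ['exp-001-old']
```

Archiving (rather than deleting) preserves the reproducibility guarantee while keeping the active increment folder small.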

Advanced: Multi-Stage Experiments

For complex pipelines with multiple stages:

```python
from specweave import ExperimentPipeline

pipeline = ExperimentPipeline("recommendation-full-pipeline")

# Stage 1: Data preprocessing
with pipeline.stage("preprocessing") as stage:
    stage.log_metric("rows_before", len(df))
    df_clean = preprocess(df)
    stage.log_metric("rows_after", len(df_clean))

# Stage 2: Feature engineering
with pipeline.stage("features") as stage:
    features = engineer_features(df_clean)
    stage.log_metric("num_features", features.shape[1])

# Stage 3: Model training
with pipeline.stage("training") as stage:
    model = train_model(features)
    stage.log_metric("accuracy", accuracy)
```

Logs entire pipeline with stage dependencies

Integration Points

  • ml-pipeline-orchestrator: Auto-tracks experiments during pipeline execution

  • model-evaluator: Uses experiment data for model comparison

  • ml-engineer agent: Reviews experiment results and suggests improvements

  • Living docs: Syncs experiment findings to architecture docs

This skill ensures ML experimentation is never lost, always reproducible, and well-documented.

