# MLOps Experiment Tracker
Track, compare, and analyze ML experiments across any tracking backend. Reviews experiment organization, compares run metrics, identifies best-performing models, detects training anomalies, and recommends hyperparameter strategies. Acts as a senior ML engineer auditing your experiment workflow.
## Usage

Basic: `Analyze the ML experiments in /path/to/mlruns/`

Focused: `Compare hyperparameters across the top 5 runs` | `Find training runs with anomalous loss curves` | `Review model registry for promotion readiness`
## How It Works

### Step 1: Discover Tracking Backend
```bash
# Detect MLflow, W&B, or local logs
find /path/to/project -name "mlruns" -type d
grep -r "import mlflow\|import wandb" /path/to/project --include="*.py"
find /path/to/project -name "metrics.json" -o -name "params.json"
```
Adapts to MLflow (file store, SQLite, PostgreSQL), W&B (local/cloud), local CSV/JSON logs, or TensorBoard event files.
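The same detection can be scripted. A minimal Python sketch, where the marker files and return labels are illustrative conventions rather than a fixed API:

```python
# Guess the tracking backend from well-known files in the project tree.
# Marker patterns and labels are illustrative, not exhaustive.
from pathlib import Path

def detect_backend(project: Path) -> str:
    if any(project.rglob("mlruns")):
        return "mlflow-file-store"
    if any(project.rglob("wandb")):
        return "wandb-local"
    if any(project.rglob("events.out.tfevents.*")):
        return "tensorboard"
    if any(project.rglob("metrics.json")):
        return "local-json"
    return "unknown"

print(detect_backend(Path("/path/to/project")))
```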
### Step 2: Parse Experiment Runs
Extracts structured data per run — parameters, metrics, artifacts, tags, duration, status:
Experiment: "fraud-detection-v2" (ID: 3) | 47 runs | 5 failed
Run: run_abc123 | FINISHED | 2h 14m
Params: model=xgboost, lr=0.01, depth=8, estimators=500
Metrics: auc_roc=0.9423, f1=0.8832, precision=0.8712, recall=0.8956
Artifacts: model/, confusion_matrix.png, feature_importance.json
Tags: engineer=jsmith, dataset_version=v3.2, purpose=hyperparameter_sweep
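For an MLflow file store this extraction maps directly onto the MLflow Python client. A minimal sketch, assuming a local `mlruns/` directory and the experiment name from the example output:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="file:./mlruns")
exp = client.get_experiment_by_name("fraud-detection-v2")  # name from the example

for run in client.search_runs(experiment_ids=[exp.experiment_id]):
    # end_time is None while a run is still RUNNING; timestamps are in ms
    mins = (run.info.end_time - run.info.start_time) / 60_000 if run.info.end_time else None
    print(run.info.run_id, run.info.status, f"{mins:.0f} min" if mins else "running")
    print("  params: ", run.data.params)   # str -> str,   e.g. {'lr': '0.01'}
    print("  metrics:", run.data.metrics)  # str -> float, e.g. {'auc_roc': 0.9423}
    print("  tags:   ", {k: v for k, v in run.data.tags.items()
                         if not k.startswith("mlflow.")})
```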
### Step 3: Compare Runs and Rank Models
```
Top Runs by AUC-ROC:

Rank | Run ID     | AUC-ROC | F1     | Duration
-----|------------|---------|--------|---------
1    | run_abc123 | 0.9423  | 0.8832 | 2h 14m
2    | run_def456 | 0.9401  | 0.8801 | 2h 31m
3    | run_ghi789 | 0.9387  | 0.8790 | 1h 58m

FINDING: Top 3 within 0.004 AUC — statistically equivalent
RECOMMEND: run_abc123 for peak accuracy; run_ghi789 matches it within noise at ~12% less training time

Pareto-Optimal Runs (accuracy vs cost):
run_ghi789: AUC=0.9387, cost=1h 58m  <-- best efficiency
run_abc123: AUC=0.9423, cost=2h 14m  <-- best accuracy
run_xyz999: AUC=0.9290, cost=0h 42m  <-- best budget option
```
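The Pareto set is simply the runs that no other run beats on both axes. A self-contained sketch using the numbers from the table above, with durations converted to minutes:

```python
# Pareto filter: keep runs not dominated by any other run, where "dominates"
# means at least as good on both axes and strictly better on one.
def pareto_front(runs):
    front = []
    for rid, auc, cost in runs:
        dominated = any(
            a >= auc and c <= cost and (a > auc or c < cost)
            for _, a, c in runs
        )
        if not dominated:
            front.append((rid, auc, cost))
    return sorted(front, key=lambda r: r[2])  # cheapest first

runs = [("run_abc123", 0.9423, 134), ("run_def456", 0.9401, 151),
        ("run_ghi789", 0.9387, 118), ("run_xyz999", 0.9290, 42)]
print(pareto_front(runs))
# -> run_xyz999, run_ghi789, run_abc123; run_def456 is dominated by run_abc123
```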
### Step 4: Analyze Hyperparameter Impact
```
Hyperparameter Importance:

learning_rate | -0.72 correlation | [0.001, 0.1] tested
max_depth     | +0.58 correlation | [3, 12] tested
n_estimators  | +0.45 correlation | [100, 1000] tested
subsample     | +0.12 correlation | [0.6, 1.0] tested

KEY: learning_rate dominates. Best range: [0.005, 0.02]
WARN: max_depth > 10 shows overfitting (train_loss << val_loss)
NEXT SWEEP: lr=[0.005, 0.02], depth=[7, 9], fix subsample=0.8
HINT: Use Bayesian optimization — typically ~60% fewer runs for the same coverage
```
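One way to produce the importance column is a rank correlation between each swept hyperparameter and the primary metric across runs. A sketch assuming pandas and SciPy; Spearman is an assumed estimator (the exact method is not fixed above) and the run data is made up:

```python
import pandas as pd
from scipy.stats import spearmanr

df = pd.DataFrame({            # one row per finished run, from Step 2
    "learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1],
    "max_depth":     [3, 6, 8, 10, 12],
    "auc_roc":       [0.912, 0.938, 0.942, 0.931, 0.905],
})

for param in ("learning_rate", "max_depth"):
    rho, p = spearmanr(df[param], df["auc_roc"])
    print(f"{param:15s} rho={rho:+.2f}  p={p:.3f}")
```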
### Step 5: Detect Training Anomalies
```
FAIL: run_pqr678 — Severe overfitting
  train_loss=0.012, val_loss=0.389 (32x gap). Curves diverged at epoch 18.
  FIX: Add regularization or early stopping (patience=10)

FAIL: run_stu901 — Training diverged
  Loss exploded at epoch 12: 0.45 -> 847.32. Cause: lr=0.1 too high.
  FIX: Reduce lr by 10x, add gradient clipping

WARN: run_vwx234 — Loss plateaued for 20 epochs
  Wasted ~1h 20m of compute. FIX: early stopping (min_delta=0.001)

WARN: 5 runs show oscillating loss — batch_size=16 likely too small
```
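The three curve checks reduce to simple rules over per-epoch loss histories. A hedged sketch where the 10x gap, 100x explosion, and patience thresholds are illustrative defaults, not calibrated values:

```python
def check_run(train_loss, val_loss, min_delta=0.001, patience=20):
    issues = []
    # Overfitting: final validation loss far above final training loss
    if train_loss[-1] > 0 and val_loss[-1] / train_loss[-1] > 10:
        issues.append(f"overfitting: {val_loss[-1] / train_loss[-1]:.0f}x val/train gap")
    # Divergence: loss exploded relative to the best value seen
    best = min(train_loss)
    if best > 0 and train_loss[-1] / best > 100:
        issues.append(f"divergence: loss exploded after epoch {train_loss.index(best)}")
    # Plateau: best recent loss barely improves on the best earlier loss
    if len(val_loss) > patience:
        if min(val_loss[:-patience]) - min(val_loss[-patience:]) < min_delta:
            issues.append(f"plateau: <{min_delta} improvement over last {patience} epochs")
    return issues

print(check_run(train_loss=[0.80, 0.30, 0.10, 0.012],
                val_loss=[0.70, 0.45, 0.40, 0.389]))  # -> overfitting flag
```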
### Step 6: Audit Experiment Organization
```
FAIL: 12 runs have no tags — cannot determine purpose or engineer
FAIL: Inconsistent param naming: "lr" (15 runs), "learning_rate" (28), "LR" (4)
FAIL: No git commit linked to 31/47 runs — cannot reproduce
WARN: 3 experiments named "test", "debug", "experiment1"
WARN: 18 runs missing artifact logging (no saved model/config)
```
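Two of these checks (untagged runs, learning-rate spellings), sketched against the same local file store as Step 2; the alias set is illustrative:

```python
from collections import Counter
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="file:./mlruns")
exp = client.get_experiment_by_name("fraud-detection-v2")
runs = client.search_runs(experiment_ids=[exp.experiment_id])

# Untagged: runs whose only tags are MLflow-internal ones
untagged = [r.info.run_id for r in runs
            if not any(not k.startswith("mlflow.") for k in r.data.tags)]

# Param-naming drift: count each spelling used for the learning rate
aliases = {"lr", "learning_rate", "LR"}
spellings = Counter(k for r in runs for k in r.data.params if k in aliases)

print(f"{len(untagged)} untagged runs; learning-rate spellings: {dict(spellings)}")
```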
### Step 7: Model Registry Readiness
```
Candidate: run_abc123 (AUC-ROC: 0.9423)

[x] Model artifact saved      [x] Hyperparameters logged
[x] Data version recorded     [x] Holdout metrics present
[ ] No model signature — input/output schema unknown
[ ] No pip requirements — dependency versions unknown
[ ] No edge-case testing      [ ] No fairness/bias metrics

Verdict: NOT READY — add a model signature and pin dependencies first
```
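In MLflow, the two failing checks are usually closed by logging a signature and pinned requirements at log time. A sketch in MLflow 2.x style, with stand-in data, a stand-in model, and an illustrative pinned version:

```python
import mlflow
import numpy as np
from mlflow.models import infer_signature
from sklearn.ensemble import GradientBoostingClassifier

# Stand-ins for the real training data and model
X_train = np.random.rand(200, 4)
y_train = np.random.randint(0, 2, 200)
model = GradientBoostingClassifier().fit(X_train, y_train)

with mlflow.start_run():
    # The signature records the input/output schema the registry check wants
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(
        model, "model",
        signature=signature,
        pip_requirements=["scikit-learn==1.4.2"],  # pin versions (illustrative)
    )
```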
### Step 8: Final Report
```markdown
# ML Experiment Analysis Report

## Experiment Health Score: 62/100

Run organization:   5/10    Hyperparameter search: 7/10
Training quality:   6/10    Reproducibility:       4/10
Model registry:     5/10    Resource efficiency:   6/10

## Critical Actions
1. Standardize parameter naming across training scripts
2. Enable early stopping — save ~80h of compute on future sweeps
3. Add git commit tracking for reproducibility
4. Log model signatures before production promotion
5. Narrow the hyperparameter search based on importance analysis
```
## Output
- Run comparison table ranked by primary metric with Pareto analysis
- Hyperparameter importance with correlation scores and next-sweep guidance
- Anomaly detection for overfitting, divergence, plateaus, oscillation
- Organization audit covering tags, naming, reproducibility gaps
- Registry readiness checklist for production promotion
- Health score 0-100 with per-category breakdown
## Tips for Best Results
- Point the agent at your MLflow tracking directory or W&B project
- Specify the primary metric to optimize (e.g., auc_roc, accuracy, f1)
- Include training scripts so the agent can correlate code with results
- Run after every hyperparameter sweep to guide the next one
- Use before model promotion to catch registry readiness gaps