# MLOps Experiment Tracker
Track, compare, and analyze ML experiments across any tracking backend. Reviews experiment organization, compares run metrics, identifies best-performing models, detects training anomalies, and recommends hyperparameter strategies. Acts as a senior ML engineer auditing your experiment workflow.
## Usage

Basic: `Analyze the ML experiments in /path/to/mlruns/`

Focused: `Compare hyperparameters across the top 5 runs` | `Find training runs with anomalous loss curves` | `Review model registry for promotion readiness`
## How It Works

### Step 1: Discover Tracking Backend
```bash
# Detect MLflow, W&B, or local logs
find /path/to/project -name "mlruns" -type d
grep -r "import mlflow\|import wandb" /path/to/project --include="*.py"
find /path/to/project -name "metrics.json" -o -name "params.json"
```
Adapts to MLflow (file store, SQLite, PostgreSQL), W&B (local/cloud), local CSV/JSON logs, or TensorBoard event files.
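The same detection can be scripted. A minimal Python sketch, where the marker files and return labels are illustrative conventions rather than a fixed API:

```python
# Guess the tracking backend from well-known files in the project tree.
# Marker patterns and labels are illustrative, not exhaustive.
from pathlib import Path

def detect_backend(project: Path) -> str:
    if any(project.rglob("mlruns")):
        return "mlflow-file-store"
    if any(project.rglob("wandb")):
        return "wandb-local"
    if any(project.rglob("events.out.tfevents.*")):
        return "tensorboard"
    if any(project.rglob("metrics.json")):
        return "local-json"
    return "unknown"

print(detect_backend(Path("/path/to/project")))
```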
### Step 2: Parse Experiment Runs
Extracts structured data per run — parameters, metrics, artifacts, tags, duration, status:
Experiment: "fraud-detection-v2" (ID: 3) | 47 runs | 5 failed
Run: run_abc123 | FINISHED | 2h 14m
Params: model=xgboost, lr=0.01, depth=8, estimators=500
Metrics: auc_roc=0.9423, f1=0.8832, precision=0.8712, recall=0.8956
Artifacts: model/, confusion_matrix.png, feature_importance.json
Tags: engineer=jsmith, dataset_version=v3.2, purpose=hyperparameter_sweep
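For an MLflow file store this extraction maps directly onto the MLflow Python client. A minimal sketch, assuming a local `mlruns/` directory and the experiment name from the example output:

```python
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="file:./mlruns")
exp = client.get_experiment_by_name("fraud-detection-v2")  # name from the example

for run in client.search_runs(experiment_ids=[exp.experiment_id]):
    # end_time is None while a run is still RUNNING; timestamps are in ms
    mins = (run.info.end_time - run.info.start_time) / 60_000 if run.info.end_time else None
    print(run.info.run_id, run.info.status, f"{mins:.0f} min" if mins else "running")
    print("  params: ", run.data.params)   # str -> str,   e.g. {'lr': '0.01'}
    print("  metrics:", run.data.metrics)  # str -> float, e.g. {'auc_roc': 0.9423}
    print("  tags:   ", {k: v for k, v in run.data.tags.items()
                         if not k.startswith("mlflow.")})
```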
### Step 3: Compare Runs and Rank Models
```
Top Runs by AUC-ROC:

Rank | Run ID     | AUC-ROC | F1     | Duration
-----|------------|---------|--------|---------
1    | run_abc123 | 0.9423  | 0.8832 | 2h 14m
2    | run_def456 | 0.9401  | 0.8801 | 2h 31m
3    | run_ghi789 | 0.9387  | 0.8790 | 1h 58m

FINDING: Top 3 within 0.004 AUC — statistically equivalent
RECOMMEND: run_abc123 for peak accuracy; run_ghi789 matches it within noise at ~12% less training time

Pareto-Optimal Runs (accuracy vs cost):
run_ghi789: AUC=0.9387, cost=1h 58m  <-- best efficiency
run_abc123: AUC=0.9423, cost=2h 14m  <-- best accuracy
run_xyz999: AUC=0.9290, cost=0h 42m  <-- best budget option
```
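The Pareto set is simply the runs that no other run beats on both axes. A self-contained sketch using the numbers from the table above, with durations converted to minutes:

```python
# Pareto filter: keep runs not dominated by any other run, where "dominates"
# means at least as good on both axes and strictly better on one.
def pareto_front(runs):
    front = []
    for rid, auc, cost in runs:
        dominated = any(
            a >= auc and c <= cost and (a > auc or c < cost)
            for _, a, c in runs
        )
        if not dominated:
            front.append((rid, auc, cost))
    return sorted(front, key=lambda r: r[2])  # cheapest first

runs = [("run_abc123", 0.9423, 134), ("run_def456", 0.9401, 151),
        ("run_ghi789", 0.9387, 118), ("run_xyz999", 0.9290, 42)]
print(pareto_front(runs))
# -> run_xyz999, run_ghi789, run_abc123; run_def456 is dominated by run_abc123
```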
### Step 4: Analyze Hyperparameter Impact
```
Hyperparameter Importance:

learning_rate | -0.72 correlation | [0.001, 0.1] tested
max_depth     | +0.58 correlation | [3, 12] tested
n_estimators  | +0.45 correlation | [100, 1000] tested
subsample     | +0.12 correlation | [0.6, 1.0] tested

KEY: learning_rate dominates. Best range: [0.005, 0.02]
WARN: max_depth > 10 shows overfitting (train_loss << val_loss)
NEXT SWEEP: lr=[0.005, 0.02], depth=[7, 9], fix subsample=0.8
HINT: Use Bayesian optimization — typically ~60% fewer runs for the same coverage
```
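One way to produce the importance column is a rank correlation between each swept hyperparameter and the primary metric across runs. A sketch assuming pandas and SciPy; Spearman is an assumed estimator (the exact method is not fixed above) and the run data is made up:

```python
import pandas as pd
from scipy.stats import spearmanr

df = pd.DataFrame({            # one row per finished run, from Step 2
    "learning_rate": [0.001, 0.005, 0.01, 0.05, 0.1],
    "max_depth":     [3, 6, 8, 10, 12],
    "auc_roc":       [0.912, 0.938, 0.942, 0.931, 0.905],
})

for param in ("learning_rate", "max_depth"):
    rho, p = spearmanr(df[param], df["auc_roc"])
    print(f"{param:15s} rho={rho:+.2f}  p={p:.3f}")
```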
### Step 5: Detect Training Anomalies
```
FAIL: run_pqr678 — Severe overfitting
  train_loss=0.012, val_loss=0.389 (32x gap). Curves diverged at epoch 18.
  FIX: Add regularization or early stopping (patience=10)

FAIL: run_stu901 — Training diverged
  Loss exploded at epoch 12: 0.45 -> 847.32. Cause: lr=0.1 too high.
  FIX: Reduce lr by 10x, add gradient clipping

WARN: run_vwx234 — Loss plateaued for 20 epochs
  Wasted ~1h 20m of compute. FIX: early stopping (min_delta=0.001)

WARN: 5 runs show oscillating loss — batch_size=16 likely too small
```
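The three curve checks reduce to simple rules over per-epoch loss histories. A hedged sketch where the 10x gap, 100x explosion, and patience thresholds are illustrative defaults, not calibrated values:

```python
def check_run(train_loss, val_loss, min_delta=0.001, patience=20):
    issues = []
    # Overfitting: final validation loss far above final training loss
    if train_loss[-1] > 0 and val_loss[-1] / train_loss[-1] > 10:
        issues.append(f"overfitting: {val_loss[-1] / train_loss[-1]:.0f}x val/train gap")
    # Divergence: loss exploded relative to the best value seen
    best = min(train_loss)
    if best > 0 and train_loss[-1] / best > 100:
        issues.append(f"divergence: loss exploded after epoch {train_loss.index(best)}")
    # Plateau: best recent loss barely improves on the best earlier loss
    if len(val_loss) > patience:
        if min(val_loss[:-patience]) - min(val_loss[-patience:]) < min_delta:
            issues.append(f"plateau: <{min_delta} improvement over last {patience} epochs")
    return issues

print(check_run(train_loss=[0.80, 0.30, 0.10, 0.012],
                val_loss=[0.70, 0.45, 0.40, 0.389]))  # -> overfitting flag
```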
### Step 6: Audit Experiment Organization
```
FAIL: 12 runs have no tags — cannot determine purpose or engineer
FAIL: Inconsistent param naming: "lr" (15 runs), "learning_rate" (28), "LR" (4)
FAIL: No git commit linked to 31/47 runs — cannot reproduce
WARN: 3 experiments named "test", "debug", "experiment1"
WARN: 18 runs missing artifact logging (no saved model/config)
```
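Two of these checks (untagged runs, learning-rate spellings), sketched against the same local file store as Step 2; the alias set is illustrative:

```python
from collections import Counter
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="file:./mlruns")
exp = client.get_experiment_by_name("fraud-detection-v2")
runs = client.search_runs(experiment_ids=[exp.experiment_id])

# Untagged: runs whose only tags are MLflow-internal ones
untagged = [r.info.run_id for r in runs
            if not any(not k.startswith("mlflow.") for k in r.data.tags)]

# Param-naming drift: count each spelling used for the learning rate
aliases = {"lr", "learning_rate", "LR"}
spellings = Counter(k for r in runs for k in r.data.params if k in aliases)

print(f"{len(untagged)} untagged runs; learning-rate spellings: {dict(spellings)}")
```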
### Step 7: Model Registry Readiness
```
Candidate: run_abc123 (AUC-ROC: 0.9423)

[x] Model artifact saved      [x] Hyperparameters logged
[x] Data version recorded     [x] Holdout metrics present
[ ] No model signature — input/output schema unknown
[ ] No pip requirements — dependency versions unknown
[ ] No edge-case testing      [ ] No fairness/bias metrics

Verdict: NOT READY — add a model signature and pin dependencies first
```
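In MLflow, the two failing checks are usually closed by logging a signature and pinned requirements at log time. A sketch in MLflow 2.x style, with stand-in data, a stand-in model, and an illustrative pinned version:

```python
import mlflow
import numpy as np
from mlflow.models import infer_signature
from sklearn.ensemble import GradientBoostingClassifier

# Stand-ins for the real training data and model
X_train = np.random.rand(200, 4)
y_train = np.random.randint(0, 2, 200)
model = GradientBoostingClassifier().fit(X_train, y_train)

with mlflow.start_run():
    # The signature records the input/output schema the registry check wants
    signature = infer_signature(X_train, model.predict(X_train))
    mlflow.sklearn.log_model(
        model, "model",
        signature=signature,
        pip_requirements=["scikit-learn==1.4.2"],  # pin versions (illustrative)
    )
```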
### Step 8: Final Report
```markdown
# ML Experiment Analysis Report

## Experiment Health Score: 62/100

Run organization:   5/10    Hyperparameter search: 7/10
Training quality:   6/10    Reproducibility:       4/10
Model registry:     5/10    Resource efficiency:   6/10

## Critical Actions
1. Standardize parameter naming across training scripts
2. Enable early stopping — save ~80h of compute on future sweeps
3. Add git commit tracking for reproducibility
4. Log model signatures before production promotion
5. Narrow the hyperparameter search based on importance analysis
```
## Output
- Run comparison table ranked by primary metric with Pareto analysis
- Hyperparameter importance with correlation scores and next-sweep guidance
- Anomaly detection for overfitting, divergence, plateaus, oscillation
- Organization audit covering tags, naming, reproducibility gaps
- Registry readiness checklist for production promotion
- Health score 0-100 with per-category breakdown
## Tips for Best Results
- Point the agent at your MLflow tracking directory or W&B project
- Specify the primary metric to optimize (e.g., auc_roc, accuracy, f1)
- Include training scripts so the agent can correlate code with results
- Run after every hyperparameter sweep to guide the next one
- Use before model promotion to catch registry readiness gaps