cm-mlops-experiment-tracker

Track and compare ML experiments across MLflow, Weights & Biases, or local logs. Organizes runs, compares metrics, identifies best models, detects training anomalies, and recommends hyperparameter configurations. Use when asked to review ML experiments, compare model runs, analyze training metrics, find best hyperparameters, audit experiment tracking setup, or troubleshoot training regressions. Triggers on "mlflow", "wandb", "weights and biases", "experiment tracking", "ml experiments", "model comparison", "hyperparameter", "training metrics", "model registry", "experiment runs".

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Installation

Copy this command and send it to your AI assistant to install the "cm-mlops-experiment-tracker" skill: npx skills add charlie-morrison/mlops-experiment-tracker

MLOps Experiment Tracker

Track, compare, and analyze ML experiments across any tracking backend. Reviews experiment organization, compares run metrics, identifies best-performing models, detects training anomalies, and recommends hyperparameter strategies. Acts as a senior ML engineer auditing your experiment workflow.

Usage

Basic: Analyze the ML experiments in /path/to/mlruns/

Focused: Compare hyperparameters across the top 5 runs | Find training runs with anomalous loss curves | Review model registry for promotion readiness

How It Works

Step 1: Discover Tracking Backend

# Detect MLflow, W&B, or local logs
find /path/to/project -name "mlruns" -type d
grep -r "import mlflow\|import wandb" /path/to/project --include="*.py"
find /path/to/project -name "metrics.json" -o -name "params.json"

Adapts to MLflow (file store, SQLite, PostgreSQL), W&B (local/cloud), local CSV/JSON logs, or TensorBoard event files.
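If an MLflow file store turns up, the later steps can talk to it directly. A minimal connection sketch, assuming MLflow 2.x; the URI is the example path from above, not a fixed location:

import mlflow

# Point the client at the detected file store
mlflow.set_tracking_uri("file:///path/to/project/mlruns")

# Enumerate what the store contains before any deeper analysis
for exp in mlflow.search_experiments():
    print(exp.experiment_id, exp.name)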

Step 2: Parse Experiment Runs

Extracts structured data per run — parameters, metrics, artifacts, tags, duration, status:

Experiment: "fraud-detection-v2" (ID: 3) | 47 runs | 5 failed

  Run: run_abc123 | FINISHED | 2h 14m
    Params: model=xgboost, lr=0.01, depth=8, estimators=500
    Metrics: auc_roc=0.9423, f1=0.8832, precision=0.8712, recall=0.8956
    Artifacts: model/, confusion_matrix.png, feature_importance.json
    Tags: engineer=jsmith, dataset_version=v3.2, purpose=hyperparameter_sweep
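With an MLflow backend this extraction is a single call. A sketch assuming MLflow 2.x, where params, metrics, and tags arrive as params.*, metrics.*, and tags.* columns of a pandas DataFrame:

import mlflow

# One row per run; the experiment name is the example above
runs = mlflow.search_runs(experiment_names=["fraud-detection-v2"])

finished = runs[runs["status"] == "FINISHED"]
print(finished[["run_id", "params.lr", "metrics.auc_roc", "metrics.f1"]])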

Step 3: Compare Runs and Rank Models

Top 5 Runs by AUC-ROC:
  Rank | Run ID      | AUC-ROC | F1     | Duration
  1    | run_abc123  | 0.9423  | 0.8832 | 2h 14m
  2    | run_def456  | 0.9401  | 0.8801 | 2h 31m
  3    | run_ghi789  | 0.9387  | 0.8790 | 1h 58m

  FINDING: Top 3 within 0.004 AUC — statistically equivalent
  RECOMMEND: Choose run_abc123 — highest AUC at comparable training time

Pareto-Optimal Runs (accuracy vs cost):
  run_ghi789: AUC=0.9387, cost=1h 58m  <-- Best efficiency
  run_abc123: AUC=0.9423, cost=2h 14m  <-- Best accuracy
  run_xyz999: AUC=0.9290, cost=0h 42m  <-- Best budget option
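Pareto filtering itself needs no library support. A sketch over hypothetical (run_id, auc, minutes) tuples matching the example above:

# A run is Pareto-optimal if no other run is at least as accurate AND
# at least as fast, with a strict improvement in one of the two.
def pareto_front(candidates):
    front = []
    for rid, auc, cost in candidates:
        dominated = any(
            a >= auc and c <= cost and (a > auc or c < cost)
            for _, a, c in candidates
        )
        if not dominated:
            front.append((rid, auc, cost))
    return sorted(front, key=lambda r: r[2])

candidates = [("run_abc123", 0.9423, 134), ("run_def456", 0.9401, 151),
              ("run_ghi789", 0.9387, 118), ("run_xyz999", 0.9290, 42)]
print(pareto_front(candidates))  # run_def456 is dominated and drops out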

Step 4: Analyze Hyperparameter Impact

Hyperparameter Importance:
  learning_rate    | -0.72 correlation | [0.001, 0.1] tested
  max_depth        | +0.58 correlation | [3, 12] tested
  n_estimators     | +0.45 correlation | [100, 1000] tested
  subsample        | +0.12 correlation | [0.6, 1.0] tested

  KEY: learning_rate dominates. Best range: [0.005, 0.02]
  WARN: max_depth > 10 shows overfitting (train_loss << val_loss)

  NEXT SWEEP: lr=[0.005, 0.02], depth=[7, 9], fix subsample=0.8
  Use Bayesian optimization — 60% fewer runs for same coverage
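Importance scores like these can be reproduced from the Step 2 DataFrame. A sketch assuming the parameter names from the example runs; rank correlation is one reasonable estimator, and since MLflow stores params as strings they are coerced to numbers first:

import pandas as pd

param_cols = ["params.lr", "params.depth",
              "params.estimators", "params.subsample"]
df = runs.copy()
for col in param_cols:
    df[col] = pd.to_numeric(df[col], errors="coerce")

# Rank correlation between each hyperparameter and the target metric
importance = (
    df[param_cols + ["metrics.auc_roc"]]
    .corr(method="spearman")["metrics.auc_roc"]
    .drop("metrics.auc_roc")
    .sort_values(key=abs, ascending=False)
)
print(importance)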

Step 5: Detect Training Anomalies

  FAIL: run_pqr678 — Severe overfitting
    train_loss=0.012, val_loss=0.389 (32x gap). Diverged at epoch 18.
    FIX: Add regularization or early stopping (patience=10)

  FAIL: run_stu901 — Training diverged
    Loss exploded epoch 12: 0.45 -> 847.32. Cause: lr=0.1 too high.
    FIX: Reduce lr by 10x, add gradient clipping

  WARN: run_vwx234 — Loss plateau for 20 epochs
    Wasted ~1h 20m compute. FIX: early stopping (min_delta=0.001)

  WARN: 5 runs show oscillation — batch_size=16 likely too small
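The checks above reduce to a few threshold rules over per-epoch loss curves. A sketch with illustrative thresholds; the gap ratio, plateau window, and min_delta are assumptions, not the skill's fixed values:

def diagnose(train, val, gap_ratio=5.0, plateau_epochs=20, min_delta=1e-3):
    """train, val: per-epoch loss lists for one run."""
    findings = []
    # Overfitting: validation loss far above training loss at the end
    if train[-1] > 0 and val[-1] / train[-1] > gap_ratio:
        findings.append(f"overfitting: val/train gap {val[-1] / train[-1]:.0f}x")
    # Divergence: loss explodes relative to its running minimum
    if min(val) > 0 and val[-1] > 100 * min(val):
        findings.append("divergence: reduce lr, add gradient clipping")
    # Plateau: the last N epochs never beat the earlier best by min_delta
    if len(val) > plateau_epochs and min(val[:-plateau_epochs]) - min(val) < min_delta:
        findings.append(f"plateau: no gain in last {plateau_epochs} epochs")
    return findings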

Step 6: Audit Experiment Organization

  FAIL: 12 runs have no tags — cannot determine purpose or engineer
  FAIL: Inconsistent param naming: "lr" (15), "learning_rate" (28), "LR" (4)
  FAIL: No git commit linked to 31/47 runs — cannot reproduce
  WARN: 3 experiments named "test", "debug", "experiment1"
  WARN: 18 runs missing artifact logging (no saved model/config)
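These audit checks run directly against the Step 2 DataFrame. A sketch; note that MLflow sets the mlflow.source.git.commit tag automatically when training is launched from a git checkout, so its absence is what flags irreproducible runs:

# Runs with no user tags (MLflow system tags excluded)
user_tags = [c for c in runs.columns
             if c.startswith("tags.") and not c.startswith("tags.mlflow.")]
untagged = runs[runs[user_tags].isna().all(axis=1)] if user_tags else runs
print(f"{len(untagged)} runs have no tags")

# Inconsistent learning-rate naming across runs
aliases = [c for c in runs.columns
           if c.lower() in ("params.lr", "params.learning_rate")]
print("learning-rate aliases:", aliases)

# Reproducibility: runs without a linked git commit
commit_col = "tags.mlflow.source.git.commit"
missing = runs[commit_col].isna().sum() if commit_col in runs.columns else len(runs)
print(f"{missing} runs have no linked git commit")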

Step 7: Model Registry Readiness

  Candidate: run_abc123 (AUC-ROC: 0.9423)
    [x] Model artifact saved    [x] Hyperparameters logged
    [x] Data version recorded   [x] Holdout metrics present
    [ ] No model signature — input/output schema unknown
    [ ] No pip requirements — dependency versions unknown
    [ ] No edge case testing    [ ] No fairness/bias metrics

  Verdict: NOT READY — fix signature and dependency pinning first
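Two of the failing checks can be verified programmatically. A sketch assuming MLflow 2.x and that the candidate logged its model under the model/ artifact path shown in Step 2; the run id is the illustrative one from the examples:

from mlflow.models import get_model_info
from mlflow.pyfunc import get_model_dependencies

uri = "runs:/run_abc123/model"

# None here means no input/output schema was logged with the model
print("signature:", get_model_info(uri).signature)

# Path to the pip requirements captured at logging time
print("requirements:", get_model_dependencies(uri))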

Step 8: Final Report

# ML Experiment Analysis Report

## Experiment Health Score: 55/100
  Run organization: 5/10    Hyperparameter search: 7/10
  Training quality: 6/10    Reproducibility: 4/10
  Model registry: 5/10      Resource efficiency: 6/10

## Critical Actions
  1. Standardize parameter naming across training scripts
  2. Enable early stopping — save ~80h compute on future sweeps
  3. Add git commit tracking for reproducibility
  4. Log model signatures before production promotion
  5. Narrow hyperparameter search based on importance analysis
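The headline score reads as an equal-weight roll-up of the six category scores; the weighting here is an assumption, shown only to make the arithmetic explicit:

# Equal-weight roll-up of the category scores (assumed weighting)
categories = {"run_organization": 5, "hyperparameter_search": 7,
              "training_quality": 6, "reproducibility": 4,
              "model_registry": 5, "resource_efficiency": 6}
score = round(100 * sum(categories.values()) / (10 * len(categories)))
print(f"Experiment Health Score: {score}/100")  # 55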

Output

  • Run comparison table ranked by primary metric with Pareto analysis
  • Hyperparameter importance with correlation scores and next-sweep guidance
  • Anomaly detection for overfitting, divergence, plateaus, oscillation
  • Organization audit covering tags, naming, reproducibility gaps
  • Registry readiness checklist for production promotion
  • Health score 0-100 with per-category breakdown

Tips for Best Results

  • Point the agent at your MLflow tracking directory or W&B project
  • Specify the primary metric to optimize (e.g., auc_roc, accuracy, f1)
  • Include training scripts so the agent can correlate code with results
  • Run after every hyperparameter sweep to guide the next one
  • Use before model promotion to catch registry readiness gaps

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
