ML Fail-Fast Validation

POC validation patterns to catch issues before committing to long-running ML experiments.

When to Use This Skill

Use this skill when:

Starting a new ML experiment that will run for hours
Validating model architecture before full training
Checking gradient flow and data pipeline integrity
Implementing POC validation checklists
Debugging prediction collapse or gradient explosion issues

Why Fail-Fast?

Without Fail-Fast With Fail-Fast

Discover crash 4 hours in Catch in 30 seconds

Debug from cryptic error Clear error message

Lose GPU time Validate before commit

Silent data issues Explicit schema checks

Principle: Validate everything that can go wrong BEFORE the expensive computation.

POC Validation Checklist

Minimum Viable POC (5 Checks)

def run_poc_validation(): """Fast validation before full experiment."""

print("=" * 60)
print("FAIL-FAST POC VALIDATION")
print("=" * 60)

# [1/5] Model instantiation
print("\n[1/5] Model instantiation...")
model = create_model(architecture, input_size=n_features)
x = torch.randn(32, seq_len, n_features).to(device)
out = model(x)
assert out.shape == (32, 1), f"Output shape wrong: {out.shape}"
print(f"   Input: (32, {seq_len}, {n_features}) -> Output: {out.shape}")
print("   Status: PASS")

# [2/5] Gradient flow
print("\n[2/5] Gradient flow...")
y = torch.randn(32, 1).to(device)
loss = F.mse_loss(out, y)
loss.backward()
grad_norms = [p.grad.norm().item() for p in model.parameters() if p.grad is not None]
assert len(grad_norms) > 0, "No gradients!"
assert all(np.isfinite(g) for g in grad_norms), "NaN/Inf gradients!"
print(f"   Max grad norm: {max(grad_norms):.4f}")
print("   Status: PASS")

# [3/5] NDJSON artifact validation
print("\n[3/5] NDJSON artifact validation...")
log_path = output_dir / "experiment.jsonl"
with open(log_path, "a") as f:
    f.write(json.dumps({"phase": "poc_start", "timestamp": datetime.now().isoformat()}) + "\n")
assert log_path.exists(), "Log file not created"
print(f"   Log file: {log_path}")
print("   Status: PASS")

# [4/5] Epoch selector variation
print("\n[4/5] Epoch selector variation...")
epochs = []
for seed in [1, 2, 3]:
    selector = create_selector()
    # Simulate different validation results
    for e in range(10, 201, 10):
        selector.record(epoch=e, sortino=np.random.randn() * 0.1, sparsity=np.random.rand())
    epochs.append(selector.select())
print(f"   Selected epochs: {epochs}")
assert len(set(epochs)) > 1 or all(e == epochs[0] for e in epochs), "Selector not varying"
print("   Status: PASS")

# [5/5] Mini training (10 epochs)
print("\n[5/5] Mini training (10 epochs)...")
model = create_model(architecture, input_size=n_features).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.0005)
initial_loss = None
for epoch in range(10):
    loss = train_one_epoch(model, train_loader, optimizer)
    if initial_loss is None:
        initial_loss = loss
print(f"   Initial loss: {initial_loss:.4f}")
print(f"   Final loss: {loss:.4f}")
print("   Status: PASS")

print("\n" + "=" * 60)
print("POC RESULT: ALL 5 CHECKS PASSED")
print("=" * 60)

Extended POC (10 Checks)

Add these for comprehensive validation:

[6/10] Data loading

print("\n[6/10] Data loading...") df = fetch_data(symbol, threshold) assert len(df) > min_required_bars, f"Insufficient data: {len(df)} bars" print(f" Loaded: {len(df):,} bars") print(" Status: PASS")

[7/10] Schema validation

print("\n[7/10] Schema validation...") validate_schema(df, required_columns, "raw_data") print(" Status: PASS")

[8/10] Feature computation

print("\n[8/10] Feature computation...") df = compute_features(df) validate_schema(df, feature_columns, "features") print(f" Features: {len(feature_columns)}") print(" Status: PASS")

[9/10] Prediction sanity

print("\n[9/10] Prediction sanity...") preds = model(X_test).detach().cpu().numpy() pred_std = preds.std() target_std = y_test.std() pred_ratio = pred_std / target_std assert pred_ratio > 0.005, f"Predictions collapsed: ratio={pred_ratio:.4f}" print(f" Pred std ratio: {pred_ratio:.2%}") print(" Status: PASS")

[10/10] Checkpoint save/load

print("\n[10/10] Checkpoint save/load...") torch.save(model.state_dict(), checkpoint_path) model2 = create_model(architecture, input_size=n_features) model2.load_state_dict(torch.load(checkpoint_path)) print(" Status: PASS")

Schema Validation Pattern

The Problem

BAD: Cryptic error 2 hours into experiment

KeyError: 'returns_vs' # Which file? Which function? What columns exist?

The Solution

def validate_schema(df, required: list[str], stage: str) -> None: """Fail-fast schema validation with actionable error messages.""" # Handle both DataFrame columns and DatetimeIndex available = list(df.columns) if hasattr(df.index, 'name') and df.index.name: available.append(df.index.name)

missing = [c for c in required if c not in available]
if missing:
    raise ValueError(
        f"[{stage}] Missing columns: {missing}\n"
        f"Available: {sorted(available)}\n"
        f"DataFrame shape: {df.shape}"
    )
print(f"  Schema validation PASSED ({stage}): {len(required)} columns", flush=True)

Usage at pipeline boundaries

REQUIRED_RAW = ["open", "high", "low", "close", "volume"] REQUIRED_FEATURES = ["returns_vs", "momentum_z", "atr_pct", "volume_z", "rsi_14", "bb_pct_b", "vol_regime", "return_accel", "pv_divergence"]

df = fetch_data(symbol) validate_schema(df, REQUIRED_RAW, "raw_data")

df = compute_features(df) validate_schema(df, REQUIRED_FEATURES, "features")

Gradient Health Checks

Basic Gradient Check

def check_gradient_health(model: nn.Module, sample_input: torch.Tensor) -> dict: """Verify gradients flow correctly through model.""" model.train() out = model(sample_input) loss = out.sum() loss.backward()

stats = {"total_params": 0, "params_with_grad": 0, "grad_norms": []}

for name, param in model.named_parameters():
    stats["total_params"] += 1
    if param.grad is not None:
        stats["params_with_grad"] += 1
        norm = param.grad.norm().item()
        stats["grad_norms"].append(norm)

        # Check for issues
        if not np.isfinite(norm):
            raise ValueError(f"Non-finite gradient in {name}: {norm}")
        if norm > 100:
            print(f"  WARNING: Large gradient in {name}: {norm:.2f}")

stats["max_grad"] = max(stats["grad_norms"]) if stats["grad_norms"] else 0
stats["mean_grad"] = np.mean(stats["grad_norms"]) if stats["grad_norms"] else 0

return stats

Architecture-Specific Checks

def check_lstm_gradients(model: nn.Module) -> dict: """Check LSTM-specific gradient patterns.""" stats = {}

for name, param in model.named_parameters():
    if param.grad is None:
        continue

    # Check forget gate bias (should not be too negative)
    if "bias_hh" in name or "bias_ih" in name:
        # LSTM bias: [i, f, g, o] gates
        hidden_size = param.shape[0] // 4
        forget_bias = param.grad[hidden_size:2*hidden_size]
        stats["forget_bias_grad_mean"] = forget_bias.mean().item()

    # Check hidden-to-hidden weights
    if "weight_hh" in name:
        stats["hh_weight_grad_norm"] = param.grad.norm().item()

return stats

5. Prediction Sanity Checks

Collapse Detection

def check_prediction_sanity(preds: np.ndarray, targets: np.ndarray) -> dict: """Detect prediction collapse or explosion.""" stats = { "pred_mean": preds.mean(), "pred_std": preds.std(), "pred_min": preds.min(), "pred_max": preds.max(), "target_std": targets.std(), }

# Relative threshold (not absolute!)
stats["pred_std_ratio"] = stats["pred_std"] / stats["target_std"]

# Collapse detection
if stats["pred_std_ratio"] &#x3C; 0.005:  # &#x3C; 0.5% of target variance
    raise ValueError(
        f"Predictions collapsed!\n"
        f"  pred_std: {stats['pred_std']:.6f}\n"
        f"  target_std: {stats['target_std']:.6f}\n"
        f"  ratio: {stats['pred_std_ratio']:.4%}"
    )

# Explosion detection
if stats["pred_std_ratio"] > 100:  # > 100x target variance
    raise ValueError(
        f"Predictions exploded!\n"
        f"  pred_std: {stats['pred_std']:.2f}\n"
        f"  target_std: {stats['target_std']:.6f}\n"
        f"  ratio: {stats['pred_std_ratio']:.1f}x"
    )

# Unique value check
stats["unique_values"] = len(np.unique(np.round(preds, 6)))
if stats["unique_values"] &#x3C; 10:
    print(f"  WARNING: Only {stats['unique_values']} unique prediction values")

return stats

Correlation Check

def check_prediction_correlation(preds: np.ndarray, targets: np.ndarray) -> float: """Check if predictions have any correlation with targets.""" corr = np.corrcoef(preds.flatten(), targets.flatten())[0, 1]

if not np.isfinite(corr):
    print("  WARNING: Correlation is NaN (likely collapsed predictions)")
    return 0.0

# Note: negative correlation may still be useful (short signal)
print(f"  Prediction-target correlation: {corr:.4f}")
return corr

6. NDJSON Logging Validation

Required Event Types

REQUIRED_EVENTS = { "experiment_start": ["architecture", "features", "config"], "fold_start": ["fold_id", "train_size", "val_size", "test_size"], "epoch_complete": ["epoch", "train_loss", "val_loss"], "fold_complete": ["fold_id", "test_sharpe", "test_sortino"], "experiment_complete": ["total_folds", "mean_sharpe", "elapsed_seconds"], }

def validate_ndjson_schema(log_path: Path) -> None: """Validate NDJSON log has all required events and fields.""" events = {} with open(log_path) as f: for line in f: event = json.loads(line) phase = event.get("phase", "unknown") if phase not in events: events[phase] = [] events[phase].append(event)

for phase, required_fields in REQUIRED_EVENTS.items():
    if phase not in events:
        raise ValueError(f"Missing event type: {phase}")

    sample = events[phase][0]
    missing = [f for f in required_fields if f not in sample]
    if missing:
        raise ValueError(f"Event '{phase}' missing fields: {missing}")

print(f"  NDJSON schema valid: {len(events)} event types")

7. POC Timing Guide

Check Typical Time Max Time Action if Exceeded

Model instantiation < 1s 5s Check device, reduce model size

Gradient flow < 2s 10s Check batch size

Schema validation < 0.1s 1s Check data loading

Mini training (10 epochs) < 30s 2min Reduce batch, check data loader

Full POC (10 checks) < 2min 5min Something is wrong

Failure Response Guide

Failure Likely Cause Fix

Shape mismatch Wrong input_size or seq_len Check feature count

NaN gradients LR too high, bad init Reduce LR, check init

Zero gradients Dead layers, missing params Check model architecture

Predictions collapsed Normalizer issue, bad loss Check sLSTM normalizer

Predictions exploded Gradient explosion Add/tighten gradient clipping

Schema missing columns Wrong data source Check fetch function

Checkpoint load fails State dict key mismatch Check model architecture match

Integration Example

def main(): # Parse args, setup output dir...

# PHASE 1: Fail-fast POC
print("=" * 60)
print("FAIL-FAST POC VALIDATION")
print("=" * 60)

try:
    run_poc_validation()
except Exception as e:
    print(f"\n{'=' * 60}")
    print(f"POC FAILED: {type(e).__name__}")
    print(f"{'=' * 60}")
    print(f"Error: {e}")
    print("\nFix the issue before running full experiment.")
    sys.exit(1)

# PHASE 2: Full experiment (only if POC passes)
print("\n" + "=" * 60)
print("STARTING FULL EXPERIMENT")
print("=" * 60)

run_full_experiment()

10. Anti-Patterns to Avoid

DON'T: Skip validation to "save time"

BAD: "I'll just run it and see"

run_full_experiment() # 4 hours later: crash

DON'T: Use absolute thresholds for relative quantities

BAD: Absolute threshold

assert pred_std > 1e-4 # Meaningless for returns ~0.001

GOOD: Relative threshold

assert pred_std / target_std > 0.005 # 0.5% of target variance

DON'T: Catch all exceptions silently

BAD: Hides real issues

try: result = risky_operation() except Exception: result = default_value # What went wrong?

GOOD: Catch specific exceptions

try: result = risky_operation() except (ValueError, RuntimeError) as e: logger.error(f"Operation failed: {e}") raise

DON'T: Print without flush

BAD: Output buffered, can't see progress

print(f"Processing fold {i}...")

GOOD: See output immediately

print(f"Processing fold {i}...", flush=True)

References

Schema validation in data pipelines
PyTorch gradient debugging
NDJSON specification

Troubleshooting

Issue Cause Solution

NaN gradients in POC Learning rate too high Reduce LR by 10x, check weight initialization

Zero gradients Dead layers or missing params Check model architecture, verify requires_grad=True

Predictions collapsed Normalizer issue or bad loss Check target normalization, verify loss function

Predictions exploded Gradient explosion Add gradient clipping, reduce learning rate

Schema missing columns Wrong data source or transform Verify fetch function returns expected columns

Checkpoint load fails State dict key mismatch Ensure model architecture matches saved checkpoint

POC timeout (>5 min) Data loading or model too large Reduce batch size, check DataLoader num_workers

Mini training no progress Learning rate too low or frozen Increase LR, verify optimizer updates all parameters

NDJSON validation fails Missing required event types Check all phases emit expected fields

Shape mismatch error Wrong input_size or seq_len Verify feature count matches model input dimension

ml-failfast-validation

Safety Notice

Copy this and send it to your AI assistant to learn