# Model Evaluator

## Overview

Provides comprehensive, unbiased model evaluation following ML best practices. Goes beyond simple accuracy to evaluate models across multiple dimensions, so that deployment decisions can be made with confidence.

## Core Evaluation Framework
- **Classification Metrics** (see the scikit-learn sketch after this list)
  - Accuracy, Precision, Recall, F1-score
  - ROC AUC, PR AUC
  - Confusion matrix
  - Per-class metrics (for multi-class)
  - Class imbalance handling
- **Regression Metrics**
  - RMSE, MAE, MAPE
  - R² score, Adjusted R²
  - Residual analysis
  - Prediction interval coverage
- **Ranking Metrics (Recommendations)**
  - Precision@K, Recall@K
  - NDCG@K, MAP@K
  - MRR (Mean Reciprocal Rank)
  - Coverage, Diversity
- **Statistical Validation**
  - Cross-validation (K-fold, stratified, time-series)
  - Confidence intervals
  - Statistical significance testing
  - Calibration curves
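Most of the classification metrics above map directly onto scikit-learn. A minimal sketch of how they can be computed; the synthetic dataset and `LogisticRegression` model are illustrative stand-ins, not part of the SpecWeave API:

```python
# Sketch: computing the core classification metrics with scikit-learn.
# The synthetic dataset and LogisticRegression model are illustrative
# stand-ins, not part of the SpecWeave API.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, weights=[0.7, 0.3], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]  # positive-class scores for ROC AUC

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("f1       :", f1_score(y_test, y_pred))
print("roc auc  :", roc_auc_score(y_test, y_prob))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))
```

For multi-class problems, `sklearn.metrics.classification_report` produces the per-class precision/recall/F1 table in a single call.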
## Usage

```python
from specweave import ModelEvaluator

evaluator = ModelEvaluator(
    model=trained_model,
    X_test=X_test,
    y_test=y_test,
    increment="0042",
)

# Comprehensive evaluation
report = evaluator.evaluate_all()
```
This generates:

- `.specweave/increments/0042.../evaluation-report.md`
- Visualizations (confusion matrix, ROC curves, etc.)
- Statistical tests
## Evaluation Report Structure

### Model Evaluation Report: XGBoost Classifier

**Overall Performance**

- Accuracy: 0.87 ± 0.02 (95% CI: [0.85, 0.89])
- ROC AUC: 0.92 ± 0.01
- F1 Score: 0.85 ± 0.02

**Per-Class Performance**

| Class | Precision | Recall | F1 | Support |
|---|---|---|---|---|
| Class 0 | 0.88 | 0.85 | 0.86 | 1000 |
| Class 1 | 0.84 | 0.87 | 0.86 | 800 |

**Confusion Matrix**

[Visualization embedded]

**Cross-Validation Results**

- 5-fold CV accuracy: 0.86 ± 0.03
- Fold scores: [0.85, 0.88, 0.84, 0.87, 0.86]
- No overfitting detected (train=0.89, val=0.86, gap=0.03)

**Statistical Tests**

- Comparison vs baseline: p=0.001 (highly significant)
- Comparison vs previous model: p=0.042 (significant)

**Recommendations**

- ✅ Deploy: model meets accuracy threshold (>0.85)
- ✅ Stable: low variance across folds
- ⚠️ Monitor: Class 1 recall slightly lower (0.84)
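Confidence intervals and significance tests of the kind shown above can be illustrated with standard tools. A hedged sketch, assuming per-fold accuracy scores are available for both models and using a t-distribution interval plus a paired t-test; SpecWeave's exact statistical procedure is not specified here and may differ:

```python
# Sketch: confidence interval over CV folds and a paired significance test.
# Assumes per-fold scores are available for both models; the baseline
# scores below are made-up illustrative values.
import numpy as np
from scipy import stats

fold_scores = np.array([0.85, 0.88, 0.84, 0.87, 0.86])      # 5-fold CV accuracy
baseline_scores = np.array([0.64, 0.66, 0.65, 0.63, 0.67])  # hypothetical baseline

mean = fold_scores.mean()
sem = stats.sem(fold_scores)
lo, hi = stats.t.interval(0.95, df=len(fold_scores) - 1, loc=mean, scale=sem)
print(f"accuracy: {mean:.2f} (95% CI: [{lo:.2f}, {hi:.2f}])")

# Paired t-test: are the per-fold improvements over the baseline real?
t_stat, p_value = stats.ttest_rel(fold_scores, baseline_scores)
print(f"vs baseline: t={t_stat:.2f}, p={p_value:.4f}")
```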
## Model Comparison

```python
from specweave import compare_models

models = {
    "baseline": baseline_model,
    "xgboost": xgb_model,
    "lightgbm": lgbm_model,
    "neural-net": nn_model,
}

comparison = compare_models(
    models,
    X_test,
    y_test,
    metrics=["accuracy", "auc", "f1"],
    increment="0042",
)
```
Output:
### Model Comparison Report
| Model | Accuracy | ROC AUC | F1 | Inference Time | Model Size |
|---|---|---|---|---|---|
| baseline | 0.65 | 0.70 | 0.62 | 1ms | 10KB |
| xgboost | 0.87 | 0.92 | 0.85 | 35ms | 12MB |
| lightgbm | 0.86 | 0.91 | 0.84 | 28ms | 8MB |
| neural-net | 0.85 | 0.90 | 0.83 | 120ms | 45MB |
**Recommendation: XGBoost**
- Best accuracy and AUC
- Acceptable inference time (<50ms requirement)
- Good size/performance tradeoff
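The inference-time and model-size columns can be measured generically for any fitted estimator. A rough sketch; `model` and `X_test` are placeholders (for example, from the metrics sketch earlier), and pickled size is only a proxy for deployed size:

```python
# Sketch: measuring per-batch prediction latency and serialized model size.
# `model` and `X_test` stand in for any fitted estimator and test matrix;
# pickled size is only a rough proxy for the size of a deployed artifact.
import pickle
import time

def profile_model(model, X_test, n_runs=100):
    start = time.perf_counter()
    for _ in range(n_runs):
        model.predict(X_test)
    latency_ms = (time.perf_counter() - start) / n_runs * 1000
    size_kb = len(pickle.dumps(model)) / 1024
    return latency_ms, size_kb

latency_ms, size_kb = profile_model(model, X_test)
print(f"inference: {latency_ms:.1f} ms/batch, size: {size_kb:.0f} KB")
```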
## Best Practices

- **Always compare to a baseline** - random, majority-class, or rule-based
- **Use cross-validation** - never trust a single split
- **Check calibration** - are the predicted probabilities meaningful? (see the sketch after this list)
- **Analyze errors** - what types of mistakes does the model make?
- **Test statistical significance** - is the improvement real?
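For the calibration check in particular, scikit-learn's `calibration_curve` compares predicted probabilities against observed frequencies. A minimal sketch, reusing `y_test` and `y_prob` from the earlier metrics example:

```python
# Sketch: reliability check with scikit-learn's calibration_curve.
# Assumes y_test (true labels) and y_prob (positive-class probabilities)
# as defined in the earlier metrics sketch.
from sklearn.calibration import calibration_curve

prob_true, prob_pred = calibration_curve(y_test, y_prob, n_bins=10)
for observed, predicted in zip(prob_true, prob_pred):
    # A well-calibrated model keeps these two columns close together.
    print(f"predicted {predicted:.2f} -> observed {observed:.2f}")
```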
## Integration with SpecWeave

```bash
# Evaluate model in increment
/ml:evaluate-model 0042

# Compare all models in increment
/ml:compare-models 0042

# Generate full evaluation report
/ml:evaluation-report 0042
```
Evaluation results are automatically included in the increment's COMPLETION-SUMMARY.md.