model-evaluator

Provides comprehensive, unbiased model evaluation following ML best practices. Goes beyond simple accuracy to evaluate models across multiple dimensions, ensuring confident deployment decisions.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "model-evaluator" with this command: npx skills add anton-abyzov/specweave/anton-abyzov-specweave-model-evaluator

Model Evaluator

Overview

Provides comprehensive, unbiased model evaluation following ML best practices. Goes beyond simple accuracy to evaluate models across multiple dimensions, ensuring confident deployment decisions.

Core Evaluation Framework

  1. Classification Metrics
  • Accuracy, Precision, Recall, F1-score

  • ROC AUC, PR AUC

  • Confusion matrix

  • Per-class metrics (for multi-class)

  • Class imbalance handling
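
As a minimal sketch, the core classification metrics can be computed with scikit-learn (toy labels for illustration; not tied to this skill's internal implementation):

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_recall_fscore_support,
    roc_auc_score,
)

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]
y_prob = [0.2, 0.6, 0.9, 0.8, 0.4, 0.1]  # predicted P(class 1)

acc = accuracy_score(y_true, y_pred)                 # fraction of correct labels
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
auc = roc_auc_score(y_true, y_prob)                  # threshold-free ranking quality
cm = confusion_matrix(y_true, y_pred)                # rows: true class, cols: predicted
```

For imbalanced data, prefer `average="macro"` or per-class scores over plain accuracy.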

  2. Regression Metrics
  • RMSE, MAE, MAPE

  • R² score, Adjusted R²

  • Residual analysis

  • Prediction interval coverage
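
A corresponding scikit-learn sketch for the regression metrics (toy values; note that MAPE is undefined when `y_true` contains zeros):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

rmse = np.sqrt(mean_squared_error(y_true, y_pred))  # penalizes large errors more
mae = mean_absolute_error(y_true, y_pred)           # robust to outliers
mape = np.mean(np.abs((y_true - y_pred) / y_true))  # breaks if y_true has zeros
r2 = r2_score(y_true, y_pred)                       # variance explained
residuals = y_true - y_pred                         # starting point for residual analysis
```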

  3. Ranking Metrics (Recommendations)
  • Precision@K, Recall@K

  • NDCG@K, MAP@K

  • MRR (Mean Reciprocal Rank)

  • Coverage, Diversity
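
Precision@K and reciprocal rank are simple enough to define directly; a plain-Python sketch (illustrative helpers, not part of the specweave API):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / k

def reciprocal_rank(recommended, relevant):
    """1/rank of the first relevant item, or 0.0 if none appears."""
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

recommended = ["a", "b", "c", "d", "e"]
relevant = {"b", "d"}
p_at_3 = precision_at_k(recommended, relevant, 3)  # "b" is the only hit in the top 3
rr = reciprocal_rank(recommended, relevant)        # first hit at rank 2
```

MRR is then the mean of `reciprocal_rank` over all queries.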

  4. Statistical Validation
  • Cross-validation (K-fold, stratified, time-series)

  • Confidence intervals

  • Statistical significance testing

  • Calibration curves
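
A minimal cross-validation sketch with a normal-approximation confidence interval (synthetic data and a stand-in model via scikit-learn; not the skill's internal code):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# Stratified folds preserve the class ratio in each split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)

mean = scores.mean()
sem = scores.std(ddof=1) / np.sqrt(len(scores))
ci_95 = (mean - 1.96 * sem, mean + 1.96 * sem)  # rough 95% CI across folds
```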

Usage

```python
from specweave import ModelEvaluator

evaluator = ModelEvaluator(
    model=trained_model,
    X_test=X_test,
    y_test=y_test,
    increment="0042",
)

# Comprehensive evaluation
report = evaluator.evaluate_all()
```

Generates:

- .specweave/increments/0042.../evaluation-report.md

- Visualizations (confusion matrix, ROC curves, etc.)

- Statistical tests

Evaluation Report Structure

Model Evaluation Report: XGBoost Classifier

Overall Performance

  • Accuracy: 0.87 ± 0.02 (95% CI: [0.85, 0.89])
  • ROC AUC: 0.92 ± 0.01
  • F1 Score: 0.85 ± 0.02

Per-Class Performance

| Class   | Precision | Recall | F1   | Support |
|---------|-----------|--------|------|---------|
| Class 0 | 0.88      | 0.85   | 0.86 | 1000    |
| Class 1 | 0.84      | 0.87   | 0.86 | 800     |

Confusion Matrix

[Visualization embedded]

Cross-Validation Results

  • 5-fold CV accuracy: 0.86 ± 0.03
  • Fold scores: [0.85, 0.88, 0.84, 0.87, 0.86]
  • No overfitting detected (train=0.89, val=0.86, gap=0.03)

Statistical Tests

  • Comparison vs baseline: p=0.001 (highly significant)
  • Comparison vs previous model: p=0.042 (significant)
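
P-values like these can come from a paired test over matched CV folds; a sketch with scipy's paired t-test (the fold scores below are illustrative, not from this report):

```python
from scipy import stats

# Per-fold accuracy for two models evaluated on the same CV splits
candidate = [0.85, 0.88, 0.84, 0.87, 0.86]
baseline = [0.80, 0.82, 0.79, 0.83, 0.81]

# Paired test: each fold yields one matched pair of scores
t_stat, p_value = stats.ttest_rel(candidate, baseline)
```

For two classifiers scored on a single shared test set, McNemar's test is a common alternative.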

Recommendations

✅ Deploy: Model meets accuracy threshold (>0.85)
✅ Stable: Low variance across folds
⚠️ Monitor: Class 1 recall slightly lower (0.84)

Model Comparison

```python
from specweave import compare_models

models = {
    "baseline": baseline_model,
    "xgboost": xgb_model,
    "lightgbm": lgbm_model,
    "neural-net": nn_model,
}

comparison = compare_models(
    models,
    X_test,
    y_test,
    metrics=["accuracy", "auc", "f1"],
    increment="0042",
)
```

Output:

Model Comparison Report

| Model      | Accuracy | ROC AUC | F1   | Inference Time | Model Size |
|------------|----------|---------|------|----------------|------------|
| baseline   | 0.65     | 0.70    | 0.62 | 1ms            | 10KB       |
| xgboost    | 0.87     | 0.92    | 0.85 | 35ms           | 12MB       |
| lightgbm   | 0.86     | 0.91    | 0.84 | 28ms           | 8MB        |
| neural-net | 0.85     | 0.90    | 0.83 | 120ms          | 45MB       |

Recommendation: XGBoost

  • Best accuracy and AUC
  • Acceptable inference time (<50ms requirement)
  • Good size/performance tradeoff
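
`compare_models` is SpecWeave's helper; as a rough illustration of the same comparison loop with plain scikit-learn (synthetic data, stand-in models):

```python
import time

from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

results = {}
for name, model in {
    "baseline": DummyClassifier(strategy="most_frequent"),
    "gbm": GradientBoostingClassifier(random_state=0),
}.items():
    model.fit(X_tr, y_tr)
    start = time.perf_counter()
    proba = model.predict_proba(X_te)[:, 1]
    elapsed = time.perf_counter() - start  # crude wall-clock inference time
    results[name] = {
        "accuracy": accuracy_score(y_te, model.predict(X_te)),
        "auc": roc_auc_score(y_te, proba),
        "inference_s": elapsed,
    }
```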

Best Practices

  • Always compare to a baseline - random, majority-class, or rule-based

  • Use cross-validation - never trust a single split

  • Check calibration - are the predicted probabilities meaningful?

  • Analyze errors - what types of mistakes does the model make?

  • Test statistical significance - is the improvement real?
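
For the calibration check, scikit-learn's `calibration_curve` compares predicted probabilities with observed positive frequencies per bin (toy labels shown; a well-calibrated model tracks the diagonal):

```python
import numpy as np
from sklearn.calibration import calibration_curve

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.1, 0.2, 0.8, 0.9, 0.7, 0.3, 0.6, 0.4, 0.95, 0.85])

# For each probability bin: observed fraction of positives vs. mean predicted probability
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=2)
```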

Integration with SpecWeave

```
# Evaluate model in increment
/ml:evaluate-model 0042

# Compare all models in increment
/ml:compare-models 0042

# Generate full evaluation report
/ml:evaluation-report 0042
```

Evaluation results automatically included in increment COMPLETION-SUMMARY.md.

