data-science-model-evaluation


Model Evaluation

Use this skill for rigorously assessing model performance, comparing alternatives, and diagnosing issues.

When to use this skill

  • Model training complete — need performance assessment

  • Comparing multiple models/algorithms

  • Diagnosing overfitting/underfitting

  • Hyperparameter tuning

  • Production readiness check

Evaluation workflow

Cross-validation strategy

  • K-fold (default for most cases)

  • Stratified K-fold (classification with imbalance)

  • TimeSeriesSplit (temporal data)

  • GroupKFold (grouped/clustered data); see the splitter sketch after this list
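A minimal sketch of the last two splitters above, assuming model, X, y, and a groups array are already defined:

```python
from sklearn.model_selection import TimeSeriesSplit, GroupKFold, cross_val_score

# Temporal data: each training fold contains only rows earlier than its validation fold
ts_scores = cross_val_score(model, X, y, cv=TimeSeriesSplit(n_splits=5),
                            scoring='neg_mean_absolute_error')

# Grouped data: all rows from one group land in the same fold
group_scores = cross_val_score(model, X, y, groups=groups,
                               cv=GroupKFold(n_splits=5), scoring='roc_auc')
```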

Choose appropriate metrics

  • Classification: accuracy, precision, recall, F1, ROC-AUC, PR-AUC (imbalanced-data sketch after this list)

  • Regression: MAE, RMSE, R², MAPE

  • Ranking: NDCG, MAP

  • Business: custom metrics tied to outcomes
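On the classification metrics above: for imbalanced data, PR-AUC is often more informative than ROC-AUC. A minimal sketch, assuming y_true and positive-class probabilities y_score are available:

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# ROC-AUC can look optimistic when negatives dominate; PR-AUC focuses on the positive class
print(f"ROC-AUC: {roc_auc_score(y_true, y_score):.3f}")
print(f"PR-AUC:  {average_precision_score(y_true, y_score):.3f}")
```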

Analyze performance

  • Cross-validation mean ± std

  • Validation curve (bias-variance tradeoff)

  • Learning curves (data sufficiency), sketched after this list

  • Error analysis by segment
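A minimal learning-curve sketch, assuming model, X, and y are defined: low scores on both curves point to bias or weak features, while a persistent train/validation gap points to variance or too little data.

```python
import numpy as np
from sklearn.model_selection import learning_curve

train_sizes, train_scores, val_scores = learning_curve(
    model, X, y, cv=5, scoring='roc_auc',
    train_sizes=np.linspace(0.1, 1.0, 5),
)
print(train_sizes)                # absolute training-set sizes used
print(train_scores.mean(axis=1))  # mean training score per size
print(val_scores.mean(axis=1))    # mean cross-validated score per size
```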

Model comparison

  • Statistical significance (paired t-test, McNemar), as sketched after this list

  • Calibration (for probability outputs)

  • Speed vs accuracy tradeoffs
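A minimal sketch of a paired comparison plus a calibration check, assuming two candidates model_a and model_b and data X, y; the paired t-test runs on per-fold scores from identical folds, and calibration_curve compares predicted probabilities to observed frequencies:

```python
from scipy.stats import ttest_rel
from sklearn.calibration import calibration_curve
from sklearn.model_selection import cross_val_score, cross_val_predict

# The same integer cv yields identical folds, so the per-fold scores are paired
scores_a = cross_val_score(model_a, X, y, cv=5, scoring='roc_auc')
scores_b = cross_val_score(model_b, X, y, cv=5, scoring='roc_auc')
print(ttest_rel(scores_a, scores_b))  # treat the p-value as a rough guide with few folds

# Calibration of the preferred model's probability outputs
proba = cross_val_predict(model_a, X, y, cv=5, method='predict_proba')[:, 1]
frac_positive, mean_predicted = calibration_curve(y, proba, n_bins=10)
```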

Quick tool selection

| Task | Default choice | Notes |
| --- | --- | --- |
| Cross-validation | sklearn.model_selection | Standard CV, stratified, time series |
| Metrics | sklearn.metrics | Comprehensive metric suite |
| Hyperparameter tuning | Optuna or Ray Tune | Efficient search algorithms |
| Model comparison | scikit-learn + statistical tests | Paired comparisons |
| Experiment tracking | MLflow or Weights & Biases | Track runs, metrics, artifacts |
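A minimal Optuna sketch for the hyperparameter-tuning row, assuming X and y are defined and RandomForestClassifier stands in for whatever model is being tuned:

```python
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def objective(trial):
    # Search space is illustrative; adapt ranges to the actual model
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
    }
    model = RandomForestClassifier(**params, random_state=42, n_jobs=-1)
    return cross_val_score(model, X, y, cv=5, scoring='roc_auc').mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50)
print(study.best_params, study.best_value)
```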

Core implementation rules

  1. Always use proper validation

```python
from sklearn.model_selection import cross_val_score, StratifiedKFold

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
print(f"CV AUC: {scores.mean():.3f} (+/- {scores.std() * 2:.3f})")
```

  2. Match metrics to problem

Classification with imbalance:

```python
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
```

Focus on F1 and precision/recall for the minority class.

Regression:

```python
from sklearn.metrics import mean_absolute_error, root_mean_squared_error

print(f"MAE: {mean_absolute_error(y_true, y_pred):.3f}")
print(f"RMSE: {root_mean_squared_error(y_true, y_pred):.3f}")
```
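MAPE from the metric list above is also available in sklearn.metrics; a small addition, assuming targets are not near zero (the metric blows up otherwise):

```python
from sklearn.metrics import mean_absolute_percentage_error

# Returned as a fraction, e.g. 0.12 for 12%
print(f"MAPE: {mean_absolute_percentage_error(y_true, y_pred):.1%}")
```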

  3. Analyze errors systematically

Error by segment:

```python
# Collect the misclassified rows together with their true and predicted labels
errors = y_pred != y_true
error_df = X_test[errors].copy()
error_df['true'] = y_true[errors]
error_df['pred'] = y_pred[errors]

# Analyze patterns in errors, e.g. error counts per categorical segment
print(error_df.groupby('category').size())
```

  4. Track experiments

```python
import mlflow

with mlflow.start_run():
    mlflow.log_params(params)
    mlflow.log_metrics({'auc': auc, 'f1': f1})
    mlflow.sklearn.log_model(model, 'model')
```

Common anti-patterns

  • ❌ Single train/test split without CV

  • ❌ Optimizing wrong metric (accuracy on imbalanced data)

  • ❌ Data leakage in preprocessing (see the Pipeline sketch after this list)

  • ❌ Not checking calibration for probability outputs

  • ❌ Ignoring inference speed/memory constraints

  • ❌ No error analysis or debugging bad predictions
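On the leakage anti-pattern: a minimal sketch of keeping preprocessing inside cross-validation with an sklearn Pipeline, so transformers are fit only on each training fold (StandardScaler and LogisticRegression are placeholder steps):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# The scaler is re-fit inside every training fold, so no statistics leak from validation data
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
```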

Progressive disclosure

  • ../references/cross-validation.md — CV strategies for different data types

  • ../references/metrics-guide.md — Choosing and interpreting metrics

  • ../references/hyperparameter-tuning.md — Optuna, Ray Tune patterns

  • ../references/experiment-tracking.md — MLflow, W&B setup

Related skills

  • @data-science-feature-engineering — Features to evaluate

  • @data-engineering-orchestration — Production model deployment

  • @data-engineering-observability — Model monitoring in production

References

  • sklearn Model Selection

  • sklearn Metrics

  • Optuna Documentation

  • MLflow Tracking
