model-evaluation-benchmark

Model Evaluation Benchmark Skill

Purpose: Automated reproduction of comprehensive model evaluation benchmarks following the Benchmark Suite V3 reference implementation.

Auto-activates when: User requests model benchmarking, comparison evaluation, or performance testing between AI models in agentic workflows.

Skill Description

This skill orchestrates end-to-end model evaluation benchmarks that measure:

The skill automates the entire benchmark workflow from execution through cleanup, following the v3 reference implementation.

When to Use

✅ Use when:

❌ Don't use when:

Execution Instructions

When this skill is invoked, follow these steps:

Phase 1: Setup

Phase 2: Execute Benchmarks

For each task × model:

cd tests/benchmarks/benchmark_suite_v3
python run_benchmarks.py --model {opus|sonnet} --tasks 1,2,3,4
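The per-model invocation above could be driven from a small script. The following is a minimal sketch, assuming run_benchmarks.py accepts exactly the --model and --tasks flags shown above and that one invocation per model covers all four tasks. The helper names build_command and run_all are hypothetical and not part of the benchmark suite.

```python
import subprocess
import sys
from pathlib import Path

# Directory, script name, models, and task IDs are taken from the command above.
SUITE_DIR = Path("tests/benchmarks/benchmark_suite_v3")
MODELS = ["opus", "sonnet"]
TASKS = [1, 2, 3, 4]


def build_command(model, tasks):
    """Return the argv list for one benchmark run of `model` over `tasks` (hypothetical helper)."""
    task_arg = ",".join(str(t) for t in tasks)
    return [sys.executable, "run_benchmarks.py", "--model", model, "--tasks", task_arg]


def run_all(models=MODELS, tasks=TASKS, dry_run=True):
    """Build one command per model; execute them only when dry_run is False."""
    commands = [build_command(m, tasks) for m in models]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError on the first failing run.
            subprocess.run(cmd, cwd=SUITE_DIR, check=True)
    return commands


if __name__ == "__main__":
    # Dry run by default: print the commands instead of executing them.
    for cmd in run_all():
        print(" ".join(cmd))
```

Keeping dry_run=True as the default lets the command list be inspected before any benchmark is actually launched; pass dry_run=False from the repository root to execute the runs.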

Phase 3: Analyze Results

Read all result files: ~/.amplihack/.claude/runtime/benchmarks/suite_v3/*/result.json
Launch parallel Task tool calls with subagent_type="reviewer" to:
    Analyze trace logs for tool/agent/skill usage
    Score code quality (1-5 scale)
    Synthesize findings
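Collecting the result files before handing them to reviewers might look like the sketch below. It assumes only that each run writes a result.json containing valid JSON into its own subdirectory of the path given above. The schema of result.json is not described here, so the sketch returns the parsed data unchanged; load_results is a hypothetical helper name.

```python
import json
from pathlib import Path

# Results location as given above; "~" is expanded to the current user's home.
RESULTS_ROOT = Path.home() / ".amplihack/.claude/runtime/benchmarks/suite_v3"


def load_results(root=RESULTS_ROOT):
    """Map each run directory name to its parsed result.json (hypothetical helper)."""
    results = {}
    for path in sorted(Path(root).glob("*/result.json")):
        with path.open(encoding="utf-8") as fh:
            results[path.parent.name] = json.load(fh)
    return results


if __name__ == "__main__":
    for run_name, data in load_results().items():
        # Only the top-level type is printed because the schema is not assumed.
        print(run_name, type(data).__name__)
```

Keying the results by run directory name keeps each reviewer's input traceable to the run that produced it.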

Phase 4: Generate Report

Phase 5: Cleanup (MANDATORY)

See tests/benchmarks/benchmark_suite_v3/CLEANUP_PROCESS.md for detailed cleanup instructions.

Example Usage

User: "Run model evaluation benchmark"
Assistant: I'll run the complete benchmark suite following the v3 reference implementation.

[Executes phases 1-5 above]

References

Last Updated: 2025-11-26
Reference Implementation: Benchmark Suite V3
GitHub Issue Example: #1698
