# /evaluate:improve

Analyze evaluation results and suggest concrete improvements to a skill.
## When to Use This Skill

| Use this skill when... | Use alternative when... |
|---|---|
| You have eval results and want to improve the skill | Need to run evals first -> `/evaluate:skill` |
| You want to improve the skill description for better triggering | Want to view raw results -> `/evaluate:report` |
| Iterating on a skill to increase pass rate | Want to file a bug -> `/feedback:session` |
| Optimizing skill instructions after benchmarking | Need structural fixes -> `plugin-compliance-check.sh` |
## Parameters

Parse these from `$ARGUMENTS`:

| Parameter | Default | Description |
|---|---|---|
| `<plugin/skill-name>` | required | Path as `plugin-name/skill-name` |
| `--apply` | false | Apply approved changes to SKILL.md |
| `--description-only` | false | Focus on description improvements only |
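The parsing above can be sketched as a small shell helper. This is illustrative only: `parse_args` and the variable names are assumptions, not part of the real command runtime; only the flag names come from the table.

```shell
# Hypothetical parser for the flags in the table above.
# parse_args sets TARGET, APPLY, and DESCRIPTION_ONLY from its arguments.
parse_args() {
  APPLY=false
  DESCRIPTION_ONLY=false
  TARGET=""
  for arg in "$@"; do
    case "$arg" in
      --apply) APPLY=true ;;
      --description-only) DESCRIPTION_ONLY=true ;;
      *) TARGET="$arg" ;;   # the positional plugin-name/skill-name
    esac
  done
}
```

A caller would invoke it as `parse_args $ARGUMENTS` and then branch on `$APPLY` and `$DESCRIPTION_ONLY`.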
## Execution

### Step 1: Load eval results

Read the most recent benchmark from `<plugin-name>/skills/<skill-name>/eval-results/benchmark.json`.

If no results exist, suggest running `/evaluate:skill` first and stop.

Also read the current SKILL.md to understand the skill.
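The load-and-check step can be sketched as a shell helper. The `load_benchmark` name is illustrative; the path layout follows the convention described above.

```shell
# Print the benchmark summary for a skill directory, or fail with a hint
# if no eval results exist yet.
load_benchmark() {
  local skill_dir="$1"
  local benchmark="$skill_dir/eval-results/benchmark.json"
  if [ ! -f "$benchmark" ]; then
    echo "No eval results found. Run /evaluate:skill first." >&2
    return 1
  fi
  jq .summary "$benchmark"
}
```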
### Step 2: Analyze results

Delegate analysis to the eval-analyzer agent via Task:

    Task subagent_type: eval-analyzer
    Prompt: Analyze these evaluation results and identify improvement opportunities.
      Skill: <path to SKILL.md>
      Benchmark: <benchmark.json contents>
      Mode: comparison (if baseline data exists) or benchmark (otherwise)
The analyzer produces categorized suggestions:

- `instructions`: Execution flow improvements
- `description`: Better intent-matching text
- `examples`: Missing or insufficient examples
- `error_handling`: Missing edge cases
- `tools`: Better tool configurations
- `structure`: Organizational improvements
### Step 3: Filter suggestions

If `--description-only` is set, keep only suggestions in the `description` category.

Sort the remaining suggestions by priority (high > medium > low).
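The filter-and-sort step can be sketched with jq. The suggestion fields shown here are assumptions about the analyzer's output, not a documented schema.

```shell
# Hypothetical analyzer output; field names are illustrative.
suggestions='[
  {"category":"structure",    "priority":"low",    "text":"Move flag reference"},
  {"category":"description",  "priority":"high",   "text":"Add trigger phrase"},
  {"category":"instructions", "priority":"medium", "text":"Check git user.name"}
]'

# Keep only description suggestions (--description-only), then sort high > medium > low.
echo "$suggestions" | jq '
  map(select(.category == "description"))
  | sort_by(if .priority == "high" then 0
            elif .priority == "medium" then 1
            else 2 end)'
```

Without the `--description-only` filter, drop the `map(select(...))` stage and sort the full list the same way.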
### Step 4: Present suggestions

Present the categorized suggestions to the user:

    Improvement Suggestions: <plugin/skill-name>

    Current pass rate: 72%

    High Priority

    - [instructions] Add explicit error handling for missing git config
      Evidence: eval-003 fails because the skill doesn't check for git user.name
    - [description] Add "conventional commit" as a trigger phrase
      Evidence: Skill not selected when user says "make a conventional commit"

    Medium Priority

    - [examples] Add breaking change example to execution steps
      Evidence: eval-004 inconsistently handles breaking changes

    Low Priority

    - [structure] Move flag reference to Quick Reference table
      Evidence: Flags scattered across multiple sections

If `--apply` is NOT set, stop here.
### Step 5: Apply changes (if --apply)

Use AskUserQuestion to let the user select which suggestions to apply:

    Which improvements should I apply?
    [x] Add error handling for missing git config
    [x] Add trigger phrases to description
    [ ] Add breaking change example
    [ ] Restructure flag reference

For each approved suggestion:

1. Read the current SKILL.md
2. Apply the change using Edit
3. Update the modified date in the frontmatter
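The frontmatter date bump can be sketched with sed. This assumes the frontmatter contains a `modified: YYYY-MM-DD` line; the sample SKILL.md below is created only for illustration.

```shell
# Create a sample SKILL.md; in practice, operate on the real skill file.
cat > SKILL.md <<'EOF'
---
name: commit-helper
modified: 2024-01-01
---
# Commit Helper
EOF

# Replace the modified date with today's date, keeping a .bak backup.
today=$(date +%F)
sed -i.bak "s/^modified: .*/modified: $today/" SKILL.md
grep '^modified:' SKILL.md
```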
After applying changes, update (or create) the history file at
`<plugin-name>/skills/<skill-name>/eval-results/history.json`.

Add a new iteration entry recording:

- Version number (incremented from the previous entry)
- Timestamp
- Pass rate from the current benchmark
- Summary of the changes made
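The history update above can be sketched with jq. The `{"iterations": [...]}` layout and field names are assumptions about the schema, inferred from the `jq '.iterations[-1]'` command later in this document.

```shell
# Placeholder path; in practice this lives under the skill's eval-results dir.
HISTORY=eval-results/history.json
mkdir -p eval-results
[ -f "$HISTORY" ] || echo '{"iterations":[]}' > "$HISTORY"

# Append an iteration entry, incrementing the version from the list length.
jq --arg ts "$(date -u +%FT%TZ)" \
   --argjson rate 0.72 \
   --arg summary "Added error handling; expanded description triggers" \
   '.iterations += [{
      version: ((.iterations | length) + 1),
      timestamp: $ts,
      pass_rate: $rate,
      changes: $summary
    }]' "$HISTORY" > "$HISTORY.tmp" && mv "$HISTORY.tmp" "$HISTORY"
```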
### Step 6: Suggest re-evaluation

After applying changes, suggest:

    Changes applied. Run /evaluate:skill <plugin/skill-name> to measure improvement.
## Agentic Optimizations

| Context | Command |
|---|---|
| Read benchmark | `cat <plugin>/skills/<skill>/eval-results/benchmark.json \| jq .summary` |
| Read skill | `cat <plugin>/skills/<skill>/SKILL.md` |
| Read history | `cat <plugin>/skills/<skill>/eval-results/history.json \| jq '.iterations[-1]'` |
| Check pass rate | `cat <plugin>/skills/<skill>/eval-results/benchmark.json \| jq '.summary.with_skill.mean_pass_rate'` |
## Quick Reference

| Flag | Description |
|---|---|
| `--apply` | Apply approved changes to SKILL.md |
| `--description-only` | Focus on description improvements only |