evaluate-plugin-batch

/evaluate:plugin

Batch evaluate all skills in a plugin. Runs /evaluate:skill for each skill, then produces a plugin-level quality report.

When to Use This Skill

Use this skill when... Use alternative when...

Auditing all skills in a plugin before release Evaluating a single skill -> /evaluate:skill

Establishing quality baselines across a plugin Viewing past results -> /evaluate:report

Checking overall plugin quality after refactoring Need structural compliance -> plugin-compliance-check.sh

Context

Plugin skills: !find $1/skills -name "SKILL.md" -maxdepth 3
Existing evals: !find $1/skills -name "evals.json" -maxdepth 3

Parameters

Parse these from $ARGUMENTS :

Parameter Default Description

<plugin-name>

required Name of the plugin to evaluate

--create-missing-evals

false Generate evals for skills that lack them

--parallel N

1 Max concurrent skill evaluations

Execution

Step 1: Discover skills

Find all skills in the plugin:

<plugin-name>/skills/*/SKILL.md

List them and count the total.

Step 2: Filter and prepare

For each skill, check if evals.json exists:

Has evals: include in evaluation
No evals + --create-missing-evals : include, will create evals during evaluation
No evals, no flag: skip with a note

Report the breakdown:

Found N skills in <plugin-name>:

M with eval cases
K without eval cases (skipped | will create)

Step 3: Run evaluations

For each included skill, invoke /evaluate:skill via the SlashCommand tool:

SlashCommand: /evaluate:skill <plugin-name>/<skill-name> [--create-evals]

If --parallel N is set and N > 1, batch evaluations into groups of N. Otherwise, run sequentially.

Track progress with TodoWrite — mark each skill as it completes.

Step 4: Aggregate plugin report

After all skill evaluations complete, read each skill's benchmark.json and aggregate:

bash evaluate-plugin/scripts/aggregate_benchmark.sh <plugin-name>

Write aggregated results to <plugin-name>/eval-results/plugin-benchmark.json .

Step 5: Report

Print a plugin-level summary table:

Plugin Evaluation: <plugin-name>

Skill	Evals	Pass Rate	Status
skill-a	4	100%	PASS
skill-b	3	67%	PARTIAL
skill-c	5	80%	PASS

Overall: 82% pass rate across N eval cases

Rank skills by pass rate. Flag any below 50% as needing attention.

Agentic Optimizations

Context Command

List plugin skills ls -d <plugin>/skills/*/SKILL.md

Check for evals find <plugin>/skills -name evals.json

Count skills ls -d <plugin>/skills/*/SKILL.md | wc -l

Aggregate results bash evaluate-plugin/scripts/aggregate_benchmark.sh <plugin>

Quick Reference

Flag Description

--create-missing-evals

Generate eval cases for skills without them

--parallel N

Max concurrent evaluations (default: 1)

evaluate-plugin-batch

Safety Notice

Copy this and send it to your AI assistant to learn

Plugin Evaluation: <plugin-name>

Source Transparency

Related Skills

ruff-linting

imagemagick-conversion

jq json processing

api-testing