evaluate-skill

Evaluate a skill's effectiveness by running behavioral test cases and grading the results against assertions.


Install the "evaluate-skill" skill with: npx skills add laurigates/claude-plugins/laurigates-claude-plugins-evaluate-skill

/evaluate:skill


When to Use This Skill

| Use this skill when... | Use alternative when... |
| --- | --- |
| You want to test if a skill produces correct results | You need structural validation -> scripts/plugin-compliance-check.sh |
| You are validating skill improvements before merging | You want to file feedback about a session -> /feedback:session |
| You are benchmarking a skill against a baseline | You need to check skill freshness -> /health:audit |
| You are creating eval cases for a new skill | You want to review code quality -> /code-review |

Context

  • Skill file: !find $1/skills -name "SKILL.md" -maxdepth 3

  • Eval cases: !find $1/skills -name "evals.json" -maxdepth 3

Parameters

Parse these from $ARGUMENTS:

| Parameter | Default | Description |
| --- | --- | --- |
| <plugin/skill-name> | required | Path as plugin-name/skill-name |
| --create-evals | false | Generate eval cases if none exist |
| --runs N | 1 | Number of runs per eval case |
| --baseline | false | Also run without the skill for comparison |
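Putting the flags together, an invocation might look like this (the plugin and skill names below are illustrative, not real):

```
/evaluate:skill my-plugin/my-skill --create-evals --runs 3 --baseline
```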

Execution

Step 1: Resolve skill path

Parse $ARGUMENTS to extract <plugin-name> and <skill-name>. The skill file lives at:

<plugin-name>/skills/<skill-name>/SKILL.md

Read the SKILL.md to confirm it exists and understand what the skill does.
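A minimal shell sketch of the path resolution, assuming the argument is a plain plugin-name/skill-name string (the names below are illustrative):

```shell
# Split "plugin-name/skill-name" on the first slash (illustrative input)
arg="my-plugin/my-skill"
plugin="${arg%%/*}"   # text before the first "/"
skill="${arg#*/}"     # text after the first "/"
echo "${plugin}/skills/${skill}/SKILL.md"
# -> my-plugin/skills/my-skill/SKILL.md
```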

Step 2: Run structural pre-check

Run the compliance check to confirm the skill passes basic structural validation:

bash scripts/plugin-compliance-check.sh <plugin-name>

If structural issues are found, report them and stop. Behavioral evaluation on a structurally broken skill is wasted effort.

Step 3: Load or create eval cases

Look for <plugin-name>/skills/<skill-name>/evals.json.

If the file exists: read and validate it against the evals.json schema (see evaluate-plugin/references/schemas.md).

If the file does not exist AND --create-evals is set: Analyze the SKILL.md and generate eval cases:

  • Read the skill thoroughly — understand its purpose, parameters, execution steps, and expected behaviors.

  • Generate 3-5 eval cases covering:

      • Happy path: Standard usage that should work correctly

      • Edge case: Unusual but valid inputs

      • Boundary: Inputs that test the limits of the skill's scope

  • For each eval case, write:

      • id: Unique identifier (e.g., eval-001)

      • description: What this test validates

      • prompt: The user prompt to simulate

      • expectations: List of assertion strings the output should satisfy

      • tags: Categorization tags

  • Write the generated cases to <plugin-name>/skills/<skill-name>/evals.json.

If the file does not exist AND --create-evals is NOT set: Report that no eval cases exist and suggest running with --create-evals.
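The per-case fields above might look like this in evals.json. Note this is only a sketch of the fields listed in this step; the authoritative top-level shape is whatever evaluate-plugin/references/schemas.md defines, and the prompt and expectations shown are invented examples:

```json
[
  {
    "id": "eval-001",
    "description": "Happy path: standard usage succeeds",
    "prompt": "Evaluate the example skill on a typical input",
    "expectations": [
      "Output identifies the skill's SKILL.md path",
      "No structural errors are reported"
    ],
    "tags": ["happy-path"]
  }
]
```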

Step 4: Run evaluations

For each eval case, for each run (up to --runs N):

  • Create a results directory: <plugin-name>/skills/<skill-name>/eval-results/runs/<eval-id>-run-<N>/

  • Record the start time.

  • Spawn a Task subagent (subagent_type: general-purpose) that:

      • Receives the skill content as context

      • Executes the eval prompt

      • Works in the repository as if it were a real user request

  • Capture the subagent output.

  • Record timing data (duration) and write it to timing.json.

  • Write the transcript to transcript.md.
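The bookkeeping around each run can be sketched in shell; the paths and IDs below are illustrative, and the subagent step is elided:

```shell
# Sketch of per-run bookkeeping (paths and IDs are illustrative)
eval_id="eval-001"
run=1
dir="my-plugin/skills/my-skill/eval-results/runs/${eval_id}-run-${run}"
mkdir -p "$dir"

start=$(date +%s)
# ... the Task subagent executes the eval prompt here ...
end=$(date +%s)

# Record the duration as a small JSON document
printf '{"duration_seconds": %d}\n' "$((end - start))" > "$dir/timing.json"
```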

Step 5: Run baseline (if --baseline)

If --baseline is set, repeat Step 4 but without loading the skill content. This creates a comparison point to measure skill effectiveness.

Use the same eval prompts and record results in a parallel baseline/ subdirectory.

Step 6: Grade results

For each run, delegate grading to the eval-grader agent via Task:

Task
  subagent_type: eval-grader
  Prompt:
    Grade this eval run against the assertions.
    Eval case: <eval case from evals.json>
    Transcript: <path to transcript.md>
    Output artifacts: <list of created/modified files>

The grader produces grading.json for each run.
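The grading.json schema is not specified here; one plausible shape, shown purely as an assumption, records a verdict per assertion plus an overall pass rate:

```json
{
  "eval_id": "eval-001",
  "run": 1,
  "assertions": [
    {"expectation": "Output identifies the skill's SKILL.md path", "passed": true}
  ],
  "pass_rate": 1.0
}
```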

Step 7: Aggregate and report

Compute aggregate statistics across all runs:

  • Mean pass rate (assertions passed / total assertions)

  • Standard deviation of pass rate

  • Mean duration

If --baseline was used, also compute:

  • Baseline mean pass rate

  • Delta (improvement from skill)

Write aggregated results to <plugin-name>/skills/<skill-name>/eval-results/benchmark.json .
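The mean and standard deviation can be computed with standard tools; here is a small awk sketch over illustrative per-run pass rates:

```shell
# Mean and population standard deviation of per-run pass rates
# (the three rates below are illustrative values, not real results)
echo "0.8 0.9 0.85" | awk '{
  n = NF
  for (i = 1; i <= n; i++) sum += $i
  mean = sum / n
  for (i = 1; i <= n; i++) ss += ($i - mean) ^ 2
  printf "mean=%.3f stddev=%.3f\n", mean, sqrt(ss / n)
}'
# -> mean=0.850 stddev=0.041
```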

Print a summary table:

Evaluation Results: <plugin/skill-name>

| Metric | With Skill | Baseline | Delta |
| --- | --- | --- | --- |
| Pass Rate | 85% | 42% | +43% |
| Duration | 14s | 12s | +2s |
| Runs | 3 | 3 | |

Per-Eval Breakdown

| Eval | Description | Pass Rate | Status |
| --- | --- | --- | --- |
| eval-001 | Basic usage | 100% | PASS |
| eval-002 | Edge case | 67% | PARTIAL |
| eval-003 | Boundary | 100% | PASS |

Agentic Optimizations

| Context | Command |
| --- | --- |
| Check skill exists | ls <plugin>/skills/<skill>/SKILL.md |
| Check evals exist | ls <plugin>/skills/<skill>/evals.json |
| Read evals | cat <plugin>/skills/<skill>/evals.json \| jq . |
| Create results dir | mkdir -p <plugin>/skills/<skill>/eval-results/runs |
| Write JSON | jq -n '<expression>' > file.json |
| Aggregate results | bash evaluate-plugin/scripts/aggregate_benchmark.sh <plugin> |

Quick Reference

| Flag | Description |
| --- | --- |
| --create-evals | Generate eval cases from SKILL.md analysis |
| --runs N | Number of runs per eval case (default: 1) |
| --baseline | Run without the skill for comparison |
