Skill Eval
Use this skill as an authoring-side preflight for OpenClaw skills.
It is not a full runtime evaluator. It helps a skill author move from "this skill exists" to "this skill is structured well enough for first-pass evaluation and later regression work."
Good Requests
This skill is a good fit for requests like:
- "Set up eval files for this skill before I publish it."
- "Check whether this skill is ready for a first eval pass."
- "Review the description and tell me whether trigger coverage is clear enough."
- "Generate with-skill and without-skill static comparison artifacts for this skill."
Not A Good Fit
Do not rely on this skill alone for requests like:
- large-scale live runtime benchmarking
- scoring response quality across many real conversations
- tool-call correctness or factuality audits
- end-to-end production regression testing
Use a deeper evaluator after this step when you need those capabilities.
Best Fit
Use this skill when you need to:
- initialize evals/ files for a new or existing skill
- confirm a skill is ready for a first eval pass
- make positive and negative trigger coverage explicit
- catch placeholder content before sharing a skill
- write static run summaries and with-skill/without-skill comparison artifacts
Use a deeper evaluator after this step when you need live runtime experiments, tool-call quality checks, or richer output scoring.
Position In The Flow
Recommended sequence:
skill-vetter -> install/review -> skill-eval -> deeper runtime eval
- skill-vetter answers: "Is this skill safe enough to inspect or install?"
- skill-eval answers: "Is this skill structured well enough to evaluate seriously?"
- a deeper evaluator answers: "How well does the skill perform in practice?"
Workflow
- Confirm the target folder is a skill directory with SKILL.md.
- If the skill came from another repo or another person, do a safety review first.
- If evals/ does not exist, initialize it with:
  - evals/evals.json
  - evals/triggers.json
  - evals/README.md
- Replace placeholder prompts with realistic authoring examples.
- Run the readiness check before any deeper benchmarking.
- If readiness fails, fix the missing pieces first instead of forcing a run.
- Generate static run artifacts only after the inputs are usable.
Scripts
Initialize eval files:
python3 scripts/init_eval.py /path/to/skill
Check readiness:
python3 scripts/check_eval_readiness.py /path/to/skill
Run static eval checks:
python3 scripts/run_eval.py /path/to/skill
python3 scripts/run_eval.py /path/to/skill --check readiness
python3 scripts/run_eval.py /path/to/skill --check triggers
python3 scripts/run_eval.py /path/to/skill --check artifacts
python3 scripts/run_eval.py /path/to/skill --check files
python3 scripts/run_eval.py /path/to/skill --mode with-skill
python3 scripts/run_eval.py /path/to/skill --mode without-skill --run-group demo-baseline
python3 scripts/compare_runs.py /path/to/skill --run-group demo-baseline
Readiness Rules
A skill is ready for first-pass evaluation only when:
- SKILL.md exists
- the frontmatter description is real and not a placeholder
- evals/evals.json has at least one non-placeholder eval case
- evals/triggers.json has at least one positive and one negative non-placeholder trigger case
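The four rules above can be sketched as a small standalone check. This is an illustration only, not the logic of check_eval_readiness.py: the JSON keys (evals, triggers, prompt, should_trigger), the placeholder markers, and the simple frontmatter parsing are all assumptions made for the example. The actual formats are described in references/eval_format.md.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical placeholder markers; the real scripts may use different ones.
# Substring matching is deliberately crude for this sketch.
PLACEHOLDER_MARKERS = ("placeholder", "replace me", "todo")


def is_placeholder(text):
    """Treat non-strings, empty text, or text containing a marker as placeholder."""
    if not isinstance(text, str) or not text.strip():
        return True
    lowered = text.lower()
    return any(marker in lowered for marker in PLACEHOLDER_MARKERS)


def read_description(skill_md):
    """Pull `description:` out of a simple `---` delimited frontmatter block."""
    lines = skill_md.read_text(encoding="utf-8").splitlines()
    if not lines or lines[0].strip() != "---":
        return ""
    for line in lines[1:]:
        if line.strip() == "---":
            break
        if line.startswith("description:"):
            return line.split(":", 1)[1].strip().strip("\"'")
    return ""


def load_cases(path, key):
    """Return the list of dict cases under `key`, or [] if missing or malformed."""
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        return []
    cases = data.get(key, []) if isinstance(data, dict) else []
    if not isinstance(cases, list):
        return []
    return [case for case in cases if isinstance(case, dict)]


def check_readiness(skill_dir):
    """Return a list of problems; an empty list means the skill looks ready."""
    skill_dir = Path(skill_dir)
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.is_file():
        return ["SKILL.md is missing"]

    problems = []
    if is_placeholder(read_description(skill_md)):
        problems.append("frontmatter description is missing or a placeholder")

    # Assumed layout: {"evals": [{"prompt": ...}, ...]}
    evals = load_cases(skill_dir / "evals" / "evals.json", "evals")
    if not any(not is_placeholder(case.get("prompt")) for case in evals):
        problems.append("evals.json has no non-placeholder eval case")

    # Assumed layout: {"triggers": [{"prompt": ..., "should_trigger": bool}, ...]}
    triggers = load_cases(skill_dir / "evals" / "triggers.json", "triggers")
    real = [t for t in triggers if not is_placeholder(t.get("prompt"))]
    if not any(t.get("should_trigger") is True for t in real):
        problems.append("triggers.json has no positive trigger case")
    if not any(t.get("should_trigger") is False for t in real):
        problems.append("triggers.json has no negative trigger case")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # An empty directory fails on the first rule.
        print(check_readiness(tmp))
```

Returning a list of problems rather than a single boolean keeps the "fix the missing pieces first" step actionable: each entry names one rule that still fails.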
What This Skill Checks Well
- missing or empty eval scaffolding
- placeholder prompts that would make an eval meaningless
- missing positive/negative trigger coverage
- empty or malformed expected_artifacts
- malformed optional files declarations
- static with-skill/without-skill run artifact organization
Current Limits
run_eval.py does not perform live trigger experiments against the OpenClaw runtime.
It does not score real outputs for quality, factuality, or tool correctness.
Today it performs static validation passes that:
- verify trigger files exist
- verify cases are non-placeholder
- verify positive and negative sets are both populated
- verify eval cases have usable expected_artifacts
- verify declared files entries are well-formed
- write mode-specific run summaries for later comparison
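To make "usable expected_artifacts" and "well-formed files entries" concrete, here is a minimal sketch of a per-case static check. It is not the validation in run_eval.py. It assumes, purely for illustration, that expected_artifacts is a non-empty list of non-empty strings and that files, when present, is a list of non-empty strings; the authoritative field definitions are in references/eval_format.md.

```python
def validate_case(case):
    """Return a list of static validation errors for one eval case (a dict)."""
    errors = []

    # Assumed rule: expected_artifacts is required and must be a non-empty
    # list of non-empty strings.
    artifacts = case.get("expected_artifacts")
    if not isinstance(artifacts, list) or not artifacts:
        errors.append("expected_artifacts must be a non-empty list")
    elif not all(isinstance(a, str) and a.strip() for a in artifacts):
        errors.append("expected_artifacts entries must be non-empty strings")

    # Assumed rule: files is optional, but if declared it must be a list
    # of non-empty strings.
    files = case.get("files")
    if files is not None:
        if not isinstance(files, list):
            errors.append("files must be a list when present")
        elif not all(isinstance(f, str) and f.strip() for f in files):
            errors.append("files entries must be non-empty strings")

    return errors
```

A check like this only inspects the shape of the declarations. It says nothing about whether a real run would produce those artifacts, which is the limit described above.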
Why Publish This Skill
This skill is for authors who do not yet need a full eval lab, but do need a clean starting point. It is most useful as a lightweight preflight and scaffolding step before deeper evaluation.
Release Readiness Checklist
Before calling a skill "ready for release," aim for all of the following:
- the description names concrete trigger scenarios
- positive and negative trigger cases both exist
- placeholder content is gone
- each eval case describes observable expected artifacts
- static run summaries can be generated without errors
Compare Runs
Use compare_runs.py after both a with-skill run and a without-skill run exist in the same run-group.
It compares:
- overall pass/fail
- per-check pass/fail
- mode-specific errors
- mode-specific notes
It writes comparison artifacts under the run-group root.
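The comparison can be pictured as a diff between two run summaries. The sketch below is not the implementation of compare_runs.py, and the summary shape it uses (passed, checks, errors, notes) is a hypothetical one chosen for the example. Only the check names come from the --check options listed under Scripts.

```python
def compare_summaries(with_skill, without_skill):
    """Compare two run-summary dicts and report where the modes differ.

    Assumed summary shape (hypothetical):
        {"passed": bool, "checks": {name: bool}, "errors": [...], "notes": [...]}
    """
    with_checks = with_skill.get("checks", {})
    without_checks = without_skill.get("checks", {})
    names = sorted(set(with_checks) | set(without_checks))

    # A check missing from one mode shows up as None on that side.
    per_check = {
        name: {
            "with-skill": with_checks.get(name),
            "without-skill": without_checks.get(name),
        }
        for name in names
    }

    return {
        "overall": {
            "with-skill": with_skill.get("passed"),
            "without-skill": without_skill.get("passed"),
        },
        "per_check": per_check,
        "differing_checks": [
            name
            for name in names
            if per_check[name]["with-skill"] != per_check[name]["without-skill"]
        ],
        "errors": {
            "with-skill": list(with_skill.get("errors", [])),
            "without-skill": list(without_skill.get("errors", [])),
        },
        "notes": {
            "with-skill": list(with_skill.get("notes", [])),
            "without-skill": list(without_skill.get("notes", [])),
        },
    }
```

Keeping errors and notes separated by mode, rather than merging them, mirrors the mode-specific items in the list above.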
References
Read references/eval_format.md when you need the expected file formats and field meanings.