Skill-Eval v0.4.0: Self-Evolving Skill Evaluation Engine
An autonomous evaluation system for agent skills from ClawHub and other registries. Produces HuggingFace model-card style reports and a ranked leaderboard.
Informed by:
- Anthropic's skill-creator eval framework (multi-agent, blind A/B, trigger optimization)
- OpenAI's systematic skill testing (deterministic + rubric grading, trace analysis)
- Hamel Husain's evals-skills (error taxonomy, eval-audit pattern, judge calibration)
- OpenHands' monitoring loop (log -> evaluate -> dashboard -> aggregate feedback -> improve)
Architecture
skill_eval/
  VERSION -- engine semver
  SKILL-EVAL.md -- this file (the brain)
  knowledge/
    lessons.md -- accumulated eval wisdom
    eval-patterns.md -- reusable test/assertion templates
    failures.md -- failure mode catalog
    skill-profiles/<slug>.md -- per-skill learned context
    references/ -- source articles and frameworks
    improve/ -- skill-improvement engine knowledge (NEW v0.4.0)
      lessons.md -- improvement-specific lessons learned
      patterns.md -- proven improvement patterns by category
      failures.md -- improvement failure modes
  skill-cards/ -- output: one .md per evaluation
  leaderboard/
    index.html -- interactive HTML leaderboard
  scripts/
    generate_skill_card.py -- skill card generator
    generate_leaderboard.py -- leaderboard builder
  evals/
    skill-registry.json -- skills to evaluate
    <slug>.json -- per-skill eval config
  workspaces/ -- per-skill eval workspaces
Evaluation Philosophy
What makes a skill valuable?
A skill is valuable if and only if it makes the agent produce measurably better results than the agent would produce without it. "Better" means:
- Higher quality output — more correct, more complete, better structured
- More reliable behavior — consistent results across varied prompts
- Appropriate trade-offs — any overhead in time/tokens is justified by quality gain
A skill that produces identical results to baseline but costs 3x more tokens is a net negative. A skill that improves quality dramatically but takes 2x longer is likely worth it.
Two types of skills (Anthropic framework)
- Capability uplift — teaches the agent something it can't do well on its own. These may become unnecessary as models improve. Evals detect when that happens.
- Encoded preference — sequences steps according to a specific process. More durable, but must be evaluated for fidelity to the intended workflow.
Understanding which type a skill is affects how we design assertions.
Multi-Model Support (v0.4.0)
Skills should work across models, not just the one used to test them. The engine supports configuring different models for different roles.
Model Roles
There are three distinct model roles in the evaluation pipeline:
- Execution Model — The model that runs the skill (with-skill) and the baseline (without-skill). This is the model the skill is being tested ON.
- Judge Model — The model that grades rubric-based assertions (Layer 2 quality assessment). Should ideally be different from the execution model to avoid self-grading bias.
- Improvement Model — The model that rewrites low-scoring skills in Phase 10. Can differ from execution model to bring diverse improvement strategies.
Configuration
Model configuration lives in evals/models.json:
{
  "execution_models": [
    "anthropic/claude-opus-4-6",
    "openai/gpt-4.1",
    "google/gemini-2.5-pro"
  ],
  "judge_model": "anthropic/claude-opus-4-6",
  "improvement_model": "anthropic/claude-opus-4-6",
  "default_execution_model": "anthropic/claude-opus-4-6"
}
Execution Modes
- Single-model eval (default): Run with-skill and without-skill on one execution model. Fast, and comparable to the engine's behavior before multi-model support.
- Cross-model eval: Run the same skill across multiple execution models. Produces a per-model score matrix. Use this to answer: "Does this skill help GPT-4.1 as much as it helps Claude?"
- Cross-model judging: Use a different judge model than execution model. Reduces self-grading bias where the same model that produced the output also grades it.
Per-Skill Model Override
Individual eval configs (evals/<slug>.json) can override the global model config:
{
  "skill_slug": "explain-code",
  "models": {
    "execution": ["anthropic/claude-opus-4-6", "openai/gpt-4.1"],
    "judge": "google/gemini-2.5-pro"
  },
  "evals": [...]
}
If models is omitted, the global evals/models.json config is used.
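The override rule above can be sketched as a small merge function. This is an illustration under assumptions, not the engine's actual loader: the function name resolve_models is hypothetical, and the per-key fallback (a skill that overrides only the judge still inherits the global execution models) is an assumed behavior that the text does not spell out.

```python
from typing import Any, Dict


def resolve_models(global_cfg: Dict[str, Any], eval_cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Merge a per-skill "models" override onto the global model config.

    Assumption: missing keys in the override fall back to the global value.
    """
    override = eval_cfg.get("models") or {}
    return {
        "execution": list(override.get("execution") or global_cfg["execution_models"]),
        "judge": override.get("judge") or global_cfg["judge_model"],
        # The source shows no per-skill improvement override, so use the global one.
        "improvement": global_cfg["improvement_model"],
    }


global_cfg = {
    "execution_models": ["anthropic/claude-opus-4-6", "openai/gpt-4.1"],
    "judge_model": "anthropic/claude-opus-4-6",
    "improvement_model": "anthropic/claude-opus-4-6",
    "default_execution_model": "anthropic/claude-opus-4-6",
}
eval_cfg = {"skill_slug": "explain-code", "models": {"judge": "google/gemini-2.5-pro"}}
print(resolve_models(global_cfg, eval_cfg))
```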
Skill Card Multi-Model Output
When a skill is evaluated across multiple models, the skill card includes:
- Per-model score table: Score, pass rate, delta, and overhead for each execution model
- Cross-model consistency: Does the skill help all models equally? Large variance suggests model-specific value.
- Aggregate score: Weighted average across models (default: equal weight)
- Judge model attribution: Which model graded the results
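As a rough sketch of the aggregate score and the consistency check listed above: the weighted average with equal default weights follows the text, while using the population standard deviation as the variance measure is an assumption, as are the function names.

```python
from statistics import pstdev
from typing import Dict, Optional


def aggregate_score(per_model: Dict[str, float],
                    weights: Optional[Dict[str, float]] = None) -> float:
    """Weighted average of per-model scores; equal weights by default."""
    if weights is None:
        weights = {model: 1.0 for model in per_model}
    total_weight = sum(weights[model] for model in per_model)
    return sum(score * weights[model] for model, score in per_model.items()) / total_weight


def consistency_spread(per_model: Dict[str, float]) -> float:
    """Spread of scores across models; a large value hints at model-specific value.

    Assumption: population standard deviation is used as the spread measure.
    """
    scores = list(per_model.values())
    return pstdev(scores) if len(scores) > 1 else 0.0


scores = {"anthropic/claude-opus-4-6": 8.0, "openai/gpt-4.1": 6.0}
print(aggregate_score(scores), consistency_spread(scores))
```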
Leaderboard Multi-Model Display
The leaderboard shows the aggregate score by default, with expandable per-model details. Skills that show consistent value across models rank higher than skills that only help one model.
Model Availability
Not all models may be available in every environment. The engine handles this gracefully:
- If a configured execution model is unavailable, skip it and note in the skill card
- If the judge model is unavailable, fall back to default_execution_model as judge
- Always record which models were actually used vs configured
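The fallback rules above could look roughly like the following. The is_available callable is a hypothetical probe supplied by the environment, and the shape of the returned record is an assumption made for illustration.

```python
from typing import Any, Callable, Dict, List


def resolve_available(execution: List[str], judge: str, default_execution_model: str,
                      is_available: Callable[[str], bool]) -> Dict[str, Any]:
    """Apply the availability rules and record configured vs actually used models."""
    used = [m for m in execution if is_available(m)]
    skipped = [m for m in execution if m not in used]  # to be noted in the skill card
    judge_used = judge
    judge_fallback = False
    if not is_available(judge):
        # Fall back to the default execution model as judge.
        judge_used = default_execution_model
        judge_fallback = True
    return {
        "configured_execution": list(execution),
        "used_execution": used,
        "skipped_execution": skipped,
        "configured_judge": judge,
        "judge_used": judge_used,
        "judge_fallback": judge_fallback,
    }


reachable = {"anthropic/claude-opus-4-6"}
print(resolve_available(
    ["anthropic/claude-opus-4-6", "openai/gpt-4.1"],
    "google/gemini-2.5-pro",
    "anthropic/claude-opus-4-6",
    lambda m: m in reachable,
))
```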
Evaluation Flow
Phase 1: Pre-flight Analysis
Before generating test cases, understand the skill:
- Read SKILL.md — understand claims, dependencies, target use cases
- Classify the skill — capability uplift or encoded preference? What category?
- Dependency check — required CLI tools, API keys, env vars. Log any that are missing.
  - Dependency-gated skills: If a skill requires paid APIs or credentials that aren't available, mark it as dependency-gated in evals/<slug>.json and the benchmark. Do not run the eval -- it will produce environment failures, not skill-quality signals. Re-evaluate after credential provisioning.
  - Dependency matrix: For data-fetch or finance skills, document the full dependency matrix (API key, freshness source, fallback behavior) before proceeding.
- Phantom tooling check: Verify that scripts, CLIs, and binaries referenced by SKILL.md actually exist in the skill package (e.g. a documented scrape_reviews.py with no actual file). If any are missing, mark phantom-tooling: true, flag phantom-tooling in the skill card, and split evaluation into (a) framework/template value and (b) operational tooling value. The framework/template value can still be evaluated, but users should know the tooling is vaporware. (Learned from review-summarizer eval, Batch 3.)
- Marketing claims check: If a skill claims specific metrics ("7.8x faster", "85% reduction") without evidence, note as unsubstantiated-claims in the skill card. Do not use the skill's self-reported numbers in scoring. (Learned from debug-checklist eval, Batch 3.)
- Read knowledge base: check knowledge/lessons.md, eval-patterns.md, failures.md for relevant patterns
- Check for prior evaluations: has this skill been evaluated before? Load knowledge/skill-profiles/<slug>.md
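A minimal sketch of the dependency and phantom-tooling checks described above. The function names and the regular expression that spots script references are assumptions; a real check would likely need a richer notion of what counts as a tool reference.

```python
import os
import re
import shutil
from pathlib import Path
from typing import Dict, List

# Assumption: only .py/.sh/.js file names are treated as script references.
SCRIPT_REF = re.compile(r"[\w./-]+\.(?:py|sh|js)\b")


def missing_dependencies(cli_tools: List[str], env_vars: List[str]) -> Dict[str, List[str]]:
    """Return the CLI tools not on PATH and the env vars that are unset or empty."""
    return {
        "cli": [tool for tool in cli_tools if shutil.which(tool) is None],
        "env": [var for var in env_vars if not os.environ.get(var)],
    }


def find_phantom_tooling(skill_dir: Path) -> List[str]:
    """List scripts mentioned in SKILL.md that do not exist in the skill folder."""
    text = (skill_dir / "SKILL.md").read_text(encoding="utf-8")
    referenced = sorted(set(SCRIPT_REF.findall(text)))
    return [ref for ref in referenced if not (skill_dir / ref).exists()]
```

A non-empty result from find_phantom_tooling would correspond to marking the skill phantom-tooling; a non-empty result from missing_dependencies would be logged and, for required credentials, would make the skill dependency-gated.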
Phase 2: Test Case Design
Design 2-3 test prompts following OpenAI's four-category framework:
Success categories to check:
- Outcome — Did the task complete? Is the output correct?
- Process — Did the agent invoke the skill and follow its intended steps?
- Style — Does output follow skill-claimed conventions?
- Efficiency — Reasonable time/token usage? No thrashing?
Prompt design principles:
- Realistic — what a real user would actually type, with context and detail
- Discriminating — should reveal whether the skill adds value, not just whether the base model can do the task
- Diverse — cover different aspects of what the skill claims
- Include at least one prompt that tests implicit triggering (describes the need without naming the skill)
Assertion design (two layers):
Layer 1: Deterministic checks
- File existence, word counts, keyword presence
- Format compliance (valid JSON, valid SQL, valid markdown)
- Programmatic verification (run tests, check syntax)
- Fast, explainable, reproducible
Layer 2: Rubric-based quality assessment
- Does the output follow claimed conventions?
- Is the quality meaningfully different from baseline?
- Structured scoring (not just pass/fail) for subjective aspects
- Use LLM-as-judge with specific rubric, not vague "is this good?"
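Layer 1 checks are simple enough to sketch directly. The helper names below are hypothetical; each returns a record in the text/passed/evidence shape that the grading phase uses, so the results can be written out without conversion.

```python
import json
from typing import Any, Dict, List


def record(text: str, passed: bool, evidence: str) -> Dict[str, Any]:
    """Build one assertion result in the text/passed/evidence shape."""
    return {"text": text, "passed": passed, "evidence": evidence}


def check_valid_json(raw: str) -> Dict[str, Any]:
    """Format compliance: the output must parse as JSON."""
    try:
        json.loads(raw)
    except json.JSONDecodeError as exc:
        return record("output is valid JSON", False, f"parse error: {exc.msg}")
    return record("output is valid JSON", True, "parsed without error")


def check_word_count(text: str, minimum: int, maximum: int) -> Dict[str, Any]:
    """Word count must fall inside an inclusive range."""
    count = len(text.split())
    return record(f"word count between {minimum} and {maximum}",
                  minimum <= count <= maximum, f"{count} words")


def check_keywords_present(text: str, keywords: List[str]) -> Dict[str, Any]:
    """Every keyword must appear somewhere in the text (case-insensitive)."""
    lowered = text.lower()
    missing = [k for k in keywords if k.lower() not in lowered]
    evidence = f"missing: {missing}" if missing else "all present"
    return record(f"mentions {', '.join(keywords)}", not missing, evidence)
```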
Assertion anti-patterns (from lessons learned):
- Assertions that pass in both with-skill and without-skill are non-discriminating. Always include at least one assertion targeting skill-specific behavior.
- Don't test things the base model always gets right anyway. Baseline models are already strong at generic writing -- career/profile/copy skills need sharper, skill-specific assertions (ATS constraints, section schema, audience tuning) to prove value.
- For technical correctness tasks (SQL optimization, debugging, code explanation), baseline models are often already strong. Prefer assertions on methodology adherence and output structure over raw correctness.
- Don't use vague assertions like "output is high quality" — be specific about what quality means.
- Subjective quality (writing style, design taste) can't be graded with binary assertions. Use rubric-based scoring or flag for human review.
- For multilingual skills, include bilingual keyword variants in deterministic assertions to avoid false negatives (e.g., 索引/index, 前导通配符/leading wildcard).
Output-floor assertions (from failure modes):
- Skills that define required output sections (source, freshness, disclaimer, etc.) must have deterministic assertions on those sections even in error/fallback paths. Template compliance drift under data outages is a known failure mode -- the error path bypasses required formatting.
- Structured workflow skills for business ops should assert on operating cadence (weekly actions, owners, next steps) and metric validity.
Category-specific assertion patterns:
- Capability uplift skills (e.g. explain-code): target structural elements the model CAN produce but doesn't by default (analogies, diagrams, gotchas). These are excellent discriminators -- a 50%+ pass rate delta with moderate overhead is a strong value signal.
- Capability uplift for novel tools (e.g. secure-api-calls/keychains): when a skill teaches a tool the model has zero prior knowledge of, delta will be very high (87%+). This is expected -- the model literally can't produce correct output without the skill. Still needs dependency validation to prove operational value. (Learned from Batch 3.)
- Framework-heavy skills: can justify 50-90% time overhead IF they consistently improve actionability and formatting for stakeholder handoff. Assert on the added structure, not just correctness.
- CLI wrapper skills: assert tool invocation, meaningful output delta from baseline, and graceful dependency handling.
- Dependency-gated paid skills: assert graceful degradation with setup guidance when credentials are missing.
- Style/writing skills with banned-word lists (e.g. article-writer): banned-word assertions are perfect discriminators. The base model uses common filler words freely; a skill with a banned list eliminates them. Always add keyword_absent assertions for each banned word. These are deterministic, easy to verify, and produce maximum delta. (Learned from Batch 3 -- article-writer scored 10/10 with 100% delta, the first perfect score.)
- Technical analysis skills (SQL, debugging, etc.): the base model already excels at correctness for well-known domains. Focus assertions on methodology, output format, and systematic structure -- NOT on whether the answer is correct. The base model will get content right; the skill's value is in process consistency. (Learned from sql-query-optimizer and debug-checklist evals, Batch 3.)
- Chinese-language skills: assertions must include Chinese keyword variants alongside English ones (e.g. 索引/index, 前导通配符/leading wildcard) to avoid false negatives when the model responds in Chinese. (Learned from Batch 3.)
- Style-constrained writing skills: always add deterministic banned-word assertions when the skill defines forbidden vocabulary; these are highly discriminating and low-cost to grade.
- Technical checklist skills: assert category coverage and checklist completeness (including explicit N/A categories), not just bug/query correctness.
- Phantom-tooling framework skills: evaluate template/output structure separately from real data/tool execution; annotate benchmark with phantom-tooling when scripts are missing.
Save test cases to evals/<slug>.json.
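Two of the patterns above, banned-word checks and bilingual keyword variants, can be sketched as follows. The keyword_absent name comes from the text; its signature, the whole-word matching rule, and the any_variant_present helper are assumptions, and the banned words in the example are invented for illustration.

```python
import re
from typing import Any, Dict, List


def keyword_absent(text: str, banned: List[str]) -> Dict[str, Any]:
    """Pass only if none of the banned words appear as whole words (case-insensitive)."""
    hits = [w for w in banned
            if re.search(rf"\b{re.escape(w)}\b", text, re.IGNORECASE)]
    return {
        "text": f"banned words absent: {', '.join(banned)}",
        "passed": not hits,
        "evidence": f"found: {hits}" if hits else "no banned words found",
    }


def any_variant_present(text: str, variants: List[str]) -> bool:
    """Pass if any language variant of a keyword appears, e.g. 索引 or index."""
    lowered = text.lower()
    return any(v.lower() in lowered for v in variants)


# Hypothetical banned list, not taken from any real skill.
print(keyword_absent("We leverage synergy to delve deeper.", ["leverage", "delve", "robust"]))
print(any_variant_present("建议为该列添加索引", ["索引", "index"]))
```

Substring matching is used for the variant check because word boundaries are not meaningful for Chinese text.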
Phase 3: Execution
For each test case, determine the execution model(s) from the eval config or evals/models.json.
Single-model mode (default): Spawn two subagents simultaneously on the same execution model:
With-skill subagent:
[Model: <execution_model>]
Read the skill at <skill-path>/SKILL.md and follow its instructions.
Task: <prompt>
Save all outputs to: <workspace>/iteration-<N>/<test-name>/with_skill/outputs/
Without-skill (baseline) subagent:
[Model: <execution_model>]
Complete this task using only your built-in capabilities. Do NOT read any SKILL.md.
Task: <prompt>
Save all outputs to: <workspace>/iteration-<N>/<test-name>/without_skill/outputs/
Cross-model mode: When multiple execution models are configured, run the full with/without pair for EACH model. Organize outputs by model:
<workspace>/iteration-<N>/<test-name>/<model-slug>/with_skill/outputs/
<workspace>/iteration-<N>/<test-name>/<model-slug>/without_skill/outputs/
Capture timing data (tokens, duration, model used) from completion events into timing.json.
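The output layout for both modes can be captured in one path helper. This is a sketch: the function names are hypothetical, and deriving the model slug by replacing "/" with "-" is an assumption, since the text does not define how <model-slug> is formed.

```python
from pathlib import Path
from typing import Optional


def model_slug(model_id: str) -> str:
    """Assumed slug rule: make the model id safe to use as a directory name."""
    return model_id.replace("/", "-")


def outputs_dir(workspace: str, iteration: int, test_name: str,
                with_skill: bool, model_id: Optional[str] = None) -> Path:
    """Build the outputs directory; pass model_id only in cross-model mode."""
    parts = [f"iteration-{iteration}", test_name]
    if model_id is not None:
        parts.append(model_slug(model_id))
    parts += ["with_skill" if with_skill else "without_skill", "outputs"]
    return Path(workspace).joinpath(*parts)


print(outputs_dir("workspaces/explain-code", 1, "test-1", True).as_posix())
print(outputs_dir("workspaces/explain-code", 1, "test-1", False, "openai/gpt-4.1").as_posix())
```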
Phase 4: Grading
Grade each run against assertions. Two approaches:
Programmatic grading (preferred for deterministic checks):
- Write and run a script that checks file existence, word counts, keyword presence, etc.
- Faster, more reliable, reusable across iterations
LLM-based grading (for qualitative assessments):
- Use the configured judge model (from evals/models.json or per-skill override), NOT the execution model
- This prevents self-grading bias where the model that produced the output also judges it
- Read the output files
- Evaluate each assertion with evidence
- Use structured format: {"text": "...", "passed": bool, "evidence": "..."}
- Record judge_model in grading output for attribution
Save to grading.json:
{
  "expectations": [
    {"text": "assertion text", "passed": true, "evidence": "why this passed/failed"}
  ],
  "summary": {"passed": N, "failed": N, "total": N, "pass_rate": 0.X}
}
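The summary block in grading.json can be derived mechanically from the expectations list. A minimal sketch, with hypothetical function names; rounding pass_rate to two decimals and placing judge_model at the top level are assumptions.

```python
from typing import Any, Dict, List, Optional


def summarize(expectations: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count passed and failed expectations and compute the pass rate."""
    total = len(expectations)
    passed = sum(1 for e in expectations if e["passed"])
    return {
        "passed": passed,
        "failed": total - passed,
        "total": total,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }


def build_grading(expectations: List[Dict[str, Any]],
                  judge_model: Optional[str] = None) -> Dict[str, Any]:
    """Assemble the grading.json payload, recording the judge model when one was used."""
    grading: Dict[str, Any] = {"expectations": expectations,
                               "summary": summarize(expectations)}
    if judge_model is not None:
        grading["judge_model"] = judge_model
    return grading
```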
Phase 5: Benchmark Aggregation
Create benchmark.json with:
- Per-eval results for both configurations
- Aggregate stats: mean, stddev, min, max for pass_rate, time, tokens
- Delta between with-skill and without-skill
- Analyst notes highlighting patterns
- Efficiency flags: Explicitly flag skills where quality delta is near zero but cost delta is >2x ("high-overhead framework inflation"). These should penalize the overall score, not just be noted.
- Dependency-gated annotations: If a skill was skipped or partially failed due to missing credentials, annotate it as dependency-gated so it doesn't pollute rankings with environment failures.
- Phantom-tooling annotations: If referenced scripts/binaries are missing, annotate benchmark as phantom-tooling and report separate judgments for framework quality vs operational readiness.
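The aggregate statistics and the efficiency flag above might be computed along these lines. The >2x cost threshold comes from the text; the 0.05 tolerance for a "near zero" quality delta and the function names are assumptions.

```python
from statistics import mean, stdev
from typing import Dict, List


def stats(values: List[float]) -> Dict[str, float]:
    """Mean, sample standard deviation, min and max for one metric."""
    return {
        "mean": mean(values),
        "stddev": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }


def high_overhead_inflation(with_rate: float, without_rate: float,
                            with_cost: float, without_cost: float,
                            delta_tolerance: float = 0.05) -> bool:
    """Flag a skill whose quality delta is near zero while its cost is more than 2x baseline.

    Assumption: "near zero" means |delta| <= delta_tolerance on the pass rate.
    """
    if without_cost <= 0:
        return False  # no meaningful baseline cost to compare against
    quality_delta = with_rate - without_rate
    return abs(quality_delta) <= delta_tolerance and (with_cost / without_cost) > 2.0
```

Cost can be tokens or duration; the same ratio test applies to either.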
Phase 6: Skill Card Generation
python scripts/generate_skill_card.py \
--workspace workspaces/<slug>/iteration-<N> \
--skill-name "<Name>" \
--skill-slug "<slug>" \
--eval-model "claude-opus-4-6" \
--output skill-cards/<slug>-v<VERSION>.md
Each card includes:
- Metadata: skill name, source, eval date, model, engine version
- Overall score: 0-10 composite (quality 0-5 + delta 0-3 + efficiency 0-2)
- Comparison table: with-skill vs without-skill
- Per-test-case breakdown: assertions, timing, grading details
- Strengths / Weaknesses: auto-derived + analyst observations
- Recommendation: Recommended / Conditional / Marginal / Not Recommended
- Eval metadata JSON block: machine-readable for leaderboard
Phase 7: Leaderboard Update
python scripts/generate_leaderboard.py \
--cards-dir skill-cards \
--output leaderboard/index.html
Phase 8: Learning (Self-Evolution)
After each evaluation batch, update the knowledge base:
- lessons.md — What worked? What didn't? New patterns discovered?
- eval-patterns.md — New assertion templates for this skill category?
- failures.md — New failure modes found?
- skill-profiles/<slug>.md — Skill-specific notes for future re-evaluation
Key questions for the learning step:
- Which assertions discriminated well? (different pass rates with/without skill)
- Which assertions were useless? (always pass or always fail regardless)
- Were test prompts realistic enough?
- Did the grading catch the right things?
- What would we do differently next time?
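The first two questions above can be answered from the grading data itself. A small sketch, assuming each assertion has a list of pass/fail results from the with-skill runs and another from the baseline runs; the label names are made up for this example.

```python
from typing import List


def classify_assertion(with_results: List[bool], without_results: List[bool]) -> str:
    """Label an assertion by how its pass rate differs between with-skill and baseline runs."""
    with_rate = sum(with_results) / len(with_results)
    without_rate = sum(without_results) / len(without_results)
    if with_rate == without_rate == 1.0:
        return "always-pass"          # useless: passes regardless of the skill
    if with_rate == without_rate == 0.0:
        return "always-fail"          # useless: fails regardless of the skill
    if with_rate == without_rate:
        return "non-discriminating"   # same mixed rate on both sides
    return "discriminating"           # pass rates differ with vs without the skill


print(classify_assertion([True, True], [False, False]))
```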
Phase 9: Absorb Knowledge into SKILL-EVAL.md
This is the critical closing step. Without it, the engine documents lessons but doesn't actually evolve.
After updating the knowledge files, review them and fold actionable improvements back into this document:
- failures.md -> Phases 1-5 — Each new failure mode should produce a concrete change to the relevant phase (pre-flight gates, assertion templates, scoring adjustments, benchmark annotations).
- lessons.md -> Phase 2 — Proven discriminating assertion patterns should be added to the assertion design guidance, not just noted in lessons.
- eval-patterns.md -> Phase 2 — New category-specific patterns should be reflected in the assertion guidance for that category.
- Verify absorption — After updating, re-read the knowledge files and confirm every actionable item has a corresponding change in SKILL-EVAL.md. If something was noted but not absorbed, either absorb it or document why it was deferred.
When enough knowledge accumulates, bump VERSION. The version bump signals that the methodology itself has changed, not just the knowledge base.
The loop: eval -> knowledge -> SKILL-EVAL.md -> better evals. If knowledge doesn't flow back up, the engine isn't self-evolving.
Phase 10: Skill Improve (Self-Evolving Improvement Engine)
Trigger: Score < 7 (verdict = "Conditional", "Marginal", or "Not Recommended"), AND the skill is not dependency-gated.
The Skill Improvement Engine is itself a self-evolving system with its own knowledge base, learned patterns, and failure catalog. It gets better at improving skills over time.
Improvement Engine Knowledge Base
Located at knowledge/improve/:
- lessons.md: What improvement strategies worked? What didn't? Which root causes are hardest to fix?
- patterns.md: Proven improvement patterns by skill category (e.g., "for reference-manual skills, delete 70%+ content and add MUST/ALWAYS/NEVER mandates")
- failures.md: Improvement failure modes, i.e. cases where improvement was attempted but didn't produce meaningful score gains, with root cause analysis
Before improving any skill, read all three files. The improvement engine should never repeat a failed strategy or miss a proven pattern.
Improvement Process
1. Read the improvement knowledge base:
   - knowledge/improve/lessons.md: proven strategies, anti-patterns
   - knowledge/improve/patterns.md: category-specific improvement playbooks
   - knowledge/improve/failures.md: what NOT to try, and why
   - Also read eval knowledge: knowledge/lessons.md, eval-patterns.md, failures.md
2. Read the eval data:
   - Failed assertions from benchmark.json (what the skill got wrong)
   - Baseline output comparison (what the model does well without the skill)
   - Skill profile from knowledge/skill-profiles/<slug>.md
   - Category patterns from knowledge/eval-patterns.md
3. Diagnose root causes (check against known patterns):
   - Is the skill too vague? (Doesn't specify enough to change model behavior)
   - Is the skill redundant? (Teaches things the model already knows)
   - Is the skill too heavy? (Adds overhead without proportional quality gain)
   - Is the skill missing structure? (No clear output format, no enforceable conventions)
   - Is there phantom tooling? (References tools that don't exist)
   - Is it a reference manual? (200+ lines of educational content)
   - Is it a library-as-skill? (Contains code instead of instructions)
   - Cross-reference diagnosis against knowledge/improve/patterns.md for category-matched strategies
4. Select improvement strategy from knowledge base:
   - Match the diagnosed root cause to a proven pattern in knowledge/improve/patterns.md
   - If no matching pattern exists, design a new strategy and document rationale
   - If a similar improvement previously failed (per knowledge/improve/failures.md), try a different approach or document why this case is different
5. Rewrite SKILL.md:
   - Apply the selected strategy
   - Default formula: Remove > Add (delete 60-80% first, then add behavioral mandates)
   - Add specific, enforceable conventions (banned words, required sections, output schemas)
   - Remove redundant content the model already handles
   - Add "quick mode" vs "full framework" routing if overhead is the issue
   - Replace phantom tooling references with actual inline instructions
   - Keep what works, fix what doesn't
   - Save as skills-under-test/<slug>/SKILL-improved.md
6. Update assertions to match improved skill:
   - Add new assertions that test the behavioral mandates added in the rewrite
   - Keep existing assertions that test baseline capabilities
   - Save updated assertions alongside original for comparison
   - This prevents the assertion-skill mismatch failure mode
7. Document changes:
   - Write a changelog in skills-under-test/<slug>/IMPROVEMENT-LOG.md
   - List what was changed and why, tied back to specific failed assertions
   - Record which improvement pattern/strategy was applied
What NOT to improve:
- dependency-gated skills (problem is environment, not skill quality)
- Skills scoring >= 7 (already working well enough)
- Skills where the base model is strictly better (some skills are just bad ideas -- document why and move on)
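Taken together, the Phase 10 trigger and the exclusions above amount to a short gate. A sketch with a hypothetical function name; the baseline_strictly_better flag stands in for an analyst judgment that the engine cannot compute on its own.

```python
def should_improve(score: float, dependency_gated: bool,
                   baseline_strictly_better: bool = False) -> bool:
    """Decide whether a skill enters the improvement pass."""
    if dependency_gated:
        return False  # the problem is the environment, not skill quality
    if score >= 7:
        return False  # already working well enough
    if baseline_strictly_better:
        return False  # document why and move on
    return True


print(should_improve(5.5, dependency_gated=False))
```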
Model selection for improvement: Use the configured improvement_model from evals/models.json. Different models may bring different improvement perspectives -- a model that didn't write the original skill may see blind spots the original author (or model) missed.
Phase 11: Re-Eval Improved Skills
Run the exact same eval config (evals/<slug>.json) against the improved SKILL.md, with updated assertions where applicable.
- Execute with SKILL-improved.md instead of original SKILL.md
- Save outputs to workspaces/<slug>/iteration-<N+1>/
- Grade with the same assertions (plus any new assertions added in Phase 10 step 6)
- Generate a comparison card in skill-cards/<slug>-v<VERSION>-improved.md:
  - Original score vs improved score
  - Per-assertion delta (which failures were fixed?)
  - What changed in the SKILL.md and what effect it had
  - Which improvement strategy was used (from knowledge/improve/patterns.md)
- Update leaderboard with improved scores (mark as "improved" variant)
Success criteria:
- Score improved by >= 1.5 points
- At least 50% of previously-failed assertions now pass
- No regression on previously-passing assertions
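The three success criteria can be checked together. This sketch assumes assertion results are keyed by assertion text and that an assertion missing from the re-eval counts as failing; both are assumptions, as is the function name.

```python
from typing import Dict


def improvement_succeeded(original_score: float, improved_score: float,
                          original_results: Dict[str, bool],
                          improved_results: Dict[str, bool]) -> bool:
    """Apply the score-gain, fixed-failures and no-regression criteria."""
    previously_failed = [a for a, ok in original_results.items() if not ok]
    previously_passed = [a for a, ok in original_results.items() if ok]

    gained_enough = (improved_score - original_score) >= 1.5
    fixed = sum(1 for a in previously_failed if improved_results.get(a, False))
    # With no prior failures there is nothing to fix, so this criterion passes.
    fixed_enough = (fixed / len(previously_failed) >= 0.5) if previously_failed else True
    no_regression = all(improved_results.get(a, False) for a in previously_passed)
    return gained_enough and fixed_enough and no_regression
```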
If improvement fails (score doesn't meaningfully improve):
- Document in skill profile why the skill is fundamentally limited
- Mark as improvement-attempted in registry
- Move on -- not every skill can be saved
Phase 12: Improvement Engine Learning (Self-Evolution)
This is the critical step that makes the improvement engine self-evolving.
After each improvement batch (Phase 10-11), update the improvement knowledge base:
1. Update knowledge/improve/lessons.md:
   - What improvement strategies worked? By how much did scores increase?
   - What strategies failed? Why?
   - Any new root cause patterns discovered?
   - Any model-specific insights? (Does GPT-4.1 improve skills differently than Claude?)
2. Update knowledge/improve/patterns.md:
   - For each successful improvement, extract the reusable pattern
   - Structure: Category -> Root Cause -> Strategy -> Expected Gain
   - Example: Reference Manual -> Redundant content -> Delete 70%, add MUST/ALWAYS/NEVER -> +1.5 to +2.0 points
   - Track success rate per pattern (how often does this strategy work?)
3. Update knowledge/improve/failures.md:
   - Document each failed improvement attempt
   - Root cause analysis: why didn't the strategy work?
   - Was the skill fundamentally limited, or was the strategy wrong?
   - Add "do not attempt" markers for known dead ends
4. Absorb into Phase 10:
   - Review the improvement knowledge files
   - Fold proven patterns back into the Phase 10 process guidance
   - Update the diagnosis checklist with new root causes
   - Update strategy selection with new proven patterns
   - This is the improvement engine's equivalent of Phase 9 (absorb into SKILL-EVAL.md)
The improvement loop: improve -> re-eval -> learn -> better improvements. If improvement lessons don't flow back, the improvement engine is static.
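One way to keep the pattern structure and its success rate together is a small record type. This is only a sketch: the class and field names are hypothetical, and patterns.md itself may simply hold the one-line form.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ImprovementPattern:
    """One entry in the Category -> Root Cause -> Strategy -> Expected Gain structure."""
    category: str
    root_cause: str
    strategy: str
    expected_gain: str
    attempts: int = 0
    successes: int = 0

    def record(self, succeeded: bool) -> None:
        """Track one more use of this pattern and whether it worked."""
        self.attempts += 1
        if succeeded:
            self.successes += 1

    @property
    def success_rate(self) -> Optional[float]:
        """Fraction of attempts that succeeded, or None before any attempt."""
        return self.successes / self.attempts if self.attempts else None

    def as_line(self) -> str:
        """Render the one-line form used in patterns.md."""
        return " -> ".join([self.category, self.root_cause, self.strategy, self.expected_gain])


pattern = ImprovementPattern("Reference Manual", "Redundant content",
                             "Delete 70%, add MUST/ALWAYS/NEVER", "+1.5 to +2.0 points")
pattern.record(True)
pattern.record(False)
print(pattern.as_line(), pattern.success_rate)
```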
Absorbed changes (Batch 3, 2026-03-09)
- Added phantom tooling check in pre-flight to catch skills that reference missing scripts/binaries.
- Added guidance to prefer methodology/structure assertions over correctness assertions for technical domains where baseline is already strong.
- Added multilingual assertion guidance (Chinese/English keyword variants) to reduce false negatives.
- Added category pattern for style-constrained writing skills with banned-word deterministic checks.
- Added benchmark annotation for phantom-tooling to separate framework value from operational readiness.
Absorbed changes (Full Re-Eval + Skill Improve, v0.3.0, 2026-03-09)
- Added reference manual anti-pattern detection to Phase 1 pre-flight: if SKILL.md >200 lines of educational content (code templates, API references, framework guides), flag as overhead risk. Skills should be behavioral contracts, not textbooks.
- Added library-as-skill anti-pattern detection: SKILL.md containing Python/JS class definitions or library code should be rewritten as behavioral instructions.
- Updated Phase 10 (Skill Improve) with proven improvement formula: Remove > Add. Start by deleting reference content the model already knows, then add behavioral mandates (MUST/ALWAYS/NEVER).
- Added guidance to Phase 11: when improving a skill, also update assertions to test new behavioral mandates. Otherwise improvement may not register in scores.
- Updated scoring system: zero delta + high overhead now penalizes efficiency score more heavily.
- Added overhead-sensitive skills category pattern for skills that are mostly reference material.
Absorbed changes (v0.4.0, 2026-03-10)
- Multi-model support: Added evals/models.json config with three model roles: execution, judge, improvement. Skills can now be evaluated across multiple models for cross-model consistency. Per-skill model overrides supported in eval configs.
- Cross-model evaluation mode: When multiple execution models are configured, the engine runs the full with/without pair for each model and produces a per-model score matrix.
- Judge model separation: Rubric-based grading (Layer 2) now uses a configurable judge model separate from the execution model to prevent self-grading bias.
- Self-evolving Skill Improvement Engine (Phase 10 redesign): The improvement engine now has its own knowledge base at knowledge/improve/ with lessons.md, patterns.md, and failures.md. Before improving any skill, the engine reads its learned patterns, selects a strategy, and documents results. After each improvement batch, Phase 12 updates the improvement knowledge base -- the improvement engine evolves independently from the eval engine.
- Phase 12 added: Improvement Engine Learning step that closes the improvement self-evolution loop (improve -> re-eval -> learn -> better improvements).
- Improvement pattern library seeded: 5 initial patterns (Reference Manual Slim-Down, Library-to-Instructions, Phantom Tooling Replacement, Overhead Routing, Assertion-Aligned Rewrite) with success rates from v0.3.0 data.
- Batch evaluation updated: Pipeline now includes Phase 12 (improvement learning) and per-model breakdowns in leaderboard.
Scoring System
Overall Score: 0-10
| Component | Points | Criteria |
|---|---|---|
| Quality | 0-5 | Based on with-skill pass rate |
| Value-add | 0-3 | Delta between with-skill and without-skill pass rates |
| Efficiency | 0-2 | Time/token overhead relative to baseline |
| Score | Verdict | Meaning |
|---|---|---|
| 7-10 | Recommended | Clear value over baseline |
| 5-6.9 | Conditional | Some value with trade-offs |
| 3-4.9 | Marginal | Overhead without proportional improvement |
| 0-2.9 | Not Recommended | Baseline is comparable or better |
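The two tables above translate into a small amount of code. The verdict bands are taken directly from the table; how pass rates and overhead map onto component points is not specified here, so this sketch takes the three component scores as inputs and only clamps and sums them. Rounding to one decimal is an assumption.

```python
def overall_score(quality: float, value_add: float, efficiency: float) -> float:
    """Sum the three components after clamping each to its range (0-5, 0-3, 0-2)."""
    q = min(max(quality, 0.0), 5.0)
    v = min(max(value_add, 0.0), 3.0)
    e = min(max(efficiency, 0.0), 2.0)
    return round(q + v + e, 1)  # assumption: scores are reported to one decimal


def verdict(score: float) -> str:
    """Map a 0-10 score to its verdict band."""
    if score >= 7:
        return "Recommended"
    if score >= 5:
        return "Conditional"
    if score >= 3:
        return "Marginal"
    return "Not Recommended"


score = overall_score(4.0, 1.5, 1.0)
print(score, verdict(score))
```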
Versioning
- Engine version in VERSION (semver)
- Each skill card records engine version
- Re-evaluations with new engine version get new cards (old preserved)
- Eval configs in evals/<slug>.json are versioned implicitly through git
Batch Evaluation
- Read evals/skill-registry.json and evals/models.json
- Process skills sequentially (or small batches)
- For each: pre-flight -> test -> execute (per model) -> grade (with judge model) -> card -> leaderboard
- After batch: run eval learning step (Phase 8-9)
- Skill-improve pass: for any skill scoring < 7, run Phase 10-11 (improve + re-eval)
- Improvement learning step: run Phase 12 (improvement engine self-evolution)
- Final absorption: absorb both eval and improvement lessons into SKILL-EVAL.md
- Update leaderboard with both original and improved scores, per-model breakdowns