skill-evaluator

Evaluate Agent Skills against agentskills.io specification with three progressive modes and smart visual reports. (1) Static Analysis - SKILL.md-only review for quality and spec compliance, outputs score /60. (2) Semi-Static Analysis - adds environment and user fit assessment without execution, outputs score /100. (3) Full Analysis - complete evaluation with security scanning, trigger testing, and dynamic verification, outputs score /130. Supports Bento-ready JSON report output for visual dashboards with auto-scaling blocks based on issue severity. Trigger phrases include "evaluate skill", "review skill", "audit skill", "is this skill good", "should I use/install this skill", "skill quality check", "rate this skill", "score this skill", "bento report", "visual report", "技能评估", "评测 skill", "审核 skill", "可视化报告".

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "skill-evaluator" with this command:

npx skills add alterxyz/skills-bar-skills/alterxyz-skills-bar-skills-skill-evaluator

Skill Evaluator

Evaluate Agent Skills against agentskills.io specification and best practices.


Core Philosophy

Good Skill = Expert-only Knowledge − What the LLM Already Knows

| Knowledge Type | Treatment | Example |
|---|---|---|
| Expert (LLM doesn't know) | Keep — this is value | "mediabox not cropbox for PDF size" |
| Activation (LLM may forget) | Keep if brief | "Always validate XML before packing" |
| Redundant (LLM knows) | Delete — token waste | "What is a PDF file" |

Before Evaluating, Ask Yourself

  1. "Does this skill capture knowledge that took someone years to learn?" — If no, low D1 score.
  2. "Would an expert read this and nod, or roll their eyes?" — Eye-roll = redundant content.
  3. "After reading, can the LLM do something it couldn't before?" — If just faster/reminded, marginal value.

Mode Selection

| Mode | When to Use | Time | Input | Output |
|---|---|---|---|---|
| Static | Quick check, bulk screening, first pass | ~2 min | SKILL.md only | /60 |
| Semi-Static | Install decision, fit check | ~5 min | + Environment/User info | /100 |
| Full | Production deploy, security audit | ~15 min | + Complete package | /130 |

Default workflow: Static first → if the static score is above ~60% and installation is being considered → Semi-Static → if deploying to production → Full.
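The progressive workflow above can be sketched as a small decision helper. The maximum scores (60/100/130) come from the mode table; the 60% static-score gate is an assumed reading of the workflow text, not a value fixed by the spec:

```python
# Minimal sketch of the progressive evaluation workflow.
# Score ceilings come from the mode table above.
STATIC_MAX = 60        # D1 (20) + D2 (15) + D3 (10) + D4 (15)
SEMI_STATIC_MAX = 100  # + Environment fit (20) + User fit (20)
FULL_MAX = 130         # + Trigger testing (10) + Functional tests (20)

def next_mode(static_score: int, considering_install: bool,
              deploying_to_production: bool) -> str:
    """Decide which evaluation mode to run next (assumed 60% gate)."""
    if static_score / STATIC_MAX < 0.6 or not considering_install:
        return "stop"          # the static result is enough
    if deploying_to_production:
        return "full"
    return "semi-static"
```

For example, a skill scoring 48/60 that a user is considering installing (but not deploying to production) would proceed to Semi-Static.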


Mode 1: Static Analysis

Input: SKILL.md only
Output: Gate check (Pass/Fail) + Quality Score /60

Gate Check (All Must Pass)

Any gate failure = immediate reject. Do not proceed to scoring.

| Gate | Requirement | Common Failures |
|---|---|---|
| G1: YAML | Valid frontmatter with name + description | Missing `---`, no description |
| G2: Name | 1-64 chars, `[a-z0-9-]` only, no reserved words | Uppercase, `claude-`, `anthropic-` |
| G3: Description | 1-1024 chars, third-person, no placeholders | Uses "I/you/my", contains "TODO" |
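The three gates can be checked mechanically. A hedged sketch (real frontmatter handling would use a YAML parser; the naive `---` split and the pronoun regex here are illustrative only):

```python
import re

RESERVED_PREFIXES = ("claude-", "anthropic-")  # reserved words per G2

def gate_check(skill_md: str) -> list[str]:
    """Return a list of gate failures; an empty list means all gates pass."""
    failures = []
    # G1: document must start with a ----delimited frontmatter block
    parts = skill_md.split("---")
    if not skill_md.startswith("---") or len(parts) < 3:
        return ["G1: missing YAML frontmatter"]
    fields = dict(
        line.split(":", 1)
        for line in parts[1].strip().splitlines() if ":" in line
    )
    name = fields.get("name", "").strip()
    desc = fields.get("description", "").strip()
    # G2: 1-64 chars, [a-z0-9-] only, no reserved prefixes
    if not re.fullmatch(r"[a-z0-9-]{1,64}", name):
        failures.append("G2: invalid name")
    elif name.startswith(RESERVED_PREFIXES):
        failures.append("G2: reserved prefix")
    # G3: 1-1024 chars, no placeholders, third-person (crude pronoun check)
    if not (1 <= len(desc) <= 1024) or "TODO" in desc:
        failures.append("G3: invalid description")
    elif re.search(r"\b(I|you|my)\b", desc):
        failures.append("G3: not third-person")
    return failures
```

Any non-empty result means immediate reject, per the rule above.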

Quality Dimensions (60 points)

D1: Knowledge Delta (20 pts) — MOST IMPORTANT

"If I delete this, would the LLM perform noticeably worse?"

| Score | Indicator |
|---|---|
| 16-20 | Pure expert knowledge: trade-offs, decision trees, non-obvious sequences, "NEVER X because [surprising reason]" |
| 11-15 | Mostly expert: some activation knowledge mixed in |
| 6-10 | Mixed: useful bits buried in tutorials |
| 0-5 | Redundant: docs the LLM already knows, "What is X" sections |

Red flags (subtract points):

  • Library tutorials ("How to use pandas")
  • Generic best practices ("Write clean code")
  • Definitions the LLM knows ("PDF is Portable Document Format")

Green flags (add points):

  • "When X and Y, choose Z because..."
  • "NEVER do X — it causes [non-obvious problem]"
  • Specific numbers/thresholds ("Scale UP not down for text clarity")
  • Domain-specific sequences the LLM would get wrong
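The red/green flags above can be surfaced mechanically as a first pass. A crude heuristic sketch — the phrase patterns are illustrative, and human (or LLM) judgment remains the real D1 scorer:

```python
import re

# Illustrative phrase patterns derived from the flag lists above.
GREEN = [r"\bNEVER\b", r"\bchoose .+ because\b", r"\d+\s*(px|pt|ms|%|lines)"]
RED = [r"\bWhat is\b", r"\bHow to use\b", r"\bclean code\b"]

def d1_signals(text: str) -> dict[str, int]:
    """Count green-flag and red-flag phrase hits in a SKILL.md body."""
    return {
        "green": sum(len(re.findall(p, text, re.I)) for p in GREEN),
        "red": sum(len(re.findall(p, text, re.I)) for p in RED),
    }
```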

D2: Mindset + Procedures (15 pts)

| Score | Indicator |
|---|---|
| 12-15 | Shapes thinking: "Before X, ask yourself...", expert decision frameworks |
| 8-11 | Domain workflows + some thinking patterns |
| 4-7 | Procedures only, no mental models |
| 0-3 | Generic steps ("1. Open file 2. Process 3. Save") |

Look for: Diagnostic questions, priority rules, "The expert's first question is always..."

D3: Anti-Patterns (10 pts)

| Score | Indicator |
|---|---|
| 9-10 | Comprehensive NEVER list with non-obvious reasons |
| 6-8 | Specific warnings, some reasons |
| 3-5 | Vague warnings ("Be careful with...") |
| 0-2 | No anti-patterns mentioned |

What counts: Specific, actionable, with surprising consequences. "NEVER use Inter font — dead giveaway of AI-generated" beats "Choose fonts carefully."

D4: Structure & Economy (15 pts)

| Score | Indicator |
|---|---|
| 12-15 | <300 lines, excellent progressive disclosure, clear {baseDir} references |
| 8-11 | 300-500 lines, uses references/, has loading triggers |
| 4-7 | 500-800 lines, some structure |
| 0-3 | >800 lines, monolithic, no disclosure |

Check for:

  • {baseDir}/references/ paths with clear "when to load" instructions
  • "MANDATORY if [condition]: Read..." triggers
  • "Do NOT preload" markers for large references
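The D4 line-count bands and disclosure checks above lend themselves to a mechanical first pass. A rough sketch (the exact point nudges for disclosure signals are an assumption, capped at each band's top score):

```python
import re

def structure_score(skill_md: str) -> int:
    """Rough D4 band from line count, nudged by disclosure signals."""
    n = len(skill_md.splitlines())
    if n < 300:
        score = 12
    elif n <= 500:
        score = 8
    elif n <= 800:
        score = 4
    else:
        score = 0
    has_refs = "{baseDir}/references/" in skill_md
    has_triggers = bool(re.search(r"MANDATORY if .*: Read", skill_md))
    # Reward disclosure signals but stay within the band (15/11/7/3)
    return min(score + has_refs + 2 * has_triggers, score + 3)
```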

Static report template: See {baseDir}/references/templates.md#static-report


Mode 2: Semi-Static Analysis

Input: SKILL.md + Target Environment + User Level
Output: Static score + Fit scores = /100

Step 1: Complete Static Analysis

Run Mode 1 first. If gates fail, stop.

Step 2: Environment Fit (20 pts)

| Environment | Shell | Files | Network | scripts/ |
|---|---|---|---|---|
| claude.ai/Web | | Upload only | | |
| Claude Desktop | ⚠️ | ⚠️ | ⚠️ | |
| Coding Agent/CLI | ✅ | Workspace | | |
| IDE Extension | ✅ | Workspace | ⚠️ | |
| Enterprise | Policy | Policy | Policy | Policy |

| Score | Criteria |
|---|---|
| 16-20 | Fully compatible, all features work |
| 11-15 | Mostly works, minor limitations |
| 6-10 | Partial, requires workarounds or degraded mode |
| 0-5 | Incompatible, core features won't work |

Step 3: User Fit (20 pts)

| User Level | Signals | Skill Should Provide |
|---|---|---|
| Novice | "I'm new to...", asks basics | Guided workflows, examples, guardrails |
| Intermediate | Uses terms correctly, asks "how to optimize" | Efficiency, concepts, options |
| Expert | Asks about internals, wants customization | Control, extensibility, raw power |

| Score | Criteria |
|---|---|
| 16-20 | Perfect audience fit |
| 11-15 | Good fit, acceptable learning curve |
| 6-10 | Partial fit, friction expected |
| 0-5 | Wrong audience entirely |

Semi-static report template: See {baseDir}/references/templates.md#semi-static-report


Mode 3: Full Analysis

Input: Complete skill package + Test environment
Output: Comprehensive score /130

Step 1: Complete Semi-Static

Run Modes 1 and 2 first.

Step 2: Security Scan (Gate — Must Pass)

MANDATORY: Read {baseDir}/references/security-scan-spec.md before scanning.

Run static scan:

python {baseDir}/scripts/security_scan.py /path/to/skill
# or: node {baseDir}/scripts/security_scan.js /path/to/skill
# or: bash {baseDir}/scripts/security_scan.sh /path/to/skill

Any HIGH severity finding = instant fail. Do not proceed.
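A hedged wrapper for this gate is sketched below. The scanner's output format is an assumption: the code pretends `security_scan.py` prints JSON shaped like `{"findings": [{"severity": "HIGH", ...}]}` — check the real output described in `security-scan-spec.md` before relying on this:

```python
import json
import subprocess

def fails_gate(findings: list[dict]) -> bool:
    """Any HIGH severity finding = instant fail."""
    return any(f.get("severity") == "HIGH" for f in findings)

def security_gate(skill_path: str, base_dir: str) -> bool:
    """Run the static scanner and return True only if the gate passes.
    ASSUMES a JSON report on stdout; adjust to the scanner's real format."""
    result = subprocess.run(
        ["python", f"{base_dir}/scripts/security_scan.py", skill_path],
        capture_output=True, text=True, check=True,
    )
    return not fails_gate(json.loads(result.stdout).get("findings", []))
```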

For semantic analysis (obfuscation, prompt injection, data flow): Read {baseDir}/references/security-scan-llm.md and use the LLM scan prompt.

Step 3: Trigger Testing (10 pts)

Test 5 prompt types:

| Type | Example | Expected |
|---|---|---|
| Direct | "Use [skill-name] to..." | Trigger |
| Keyword | "[feature word] my file" | Trigger |
| Indirect | "[Problem the skill solves]" | Trigger |
| Ambiguous | Vague related request | Maybe trigger |
| Negative | Unrelated task | NOT trigger |

| Score | Criteria |
|---|---|
| 9-10 | Reliable triggers, zero false positives |
| 7-8 | Usually triggers, rare false positives |
| 4-6 | Sometimes triggers, some false positives |
| 0-3 | Unreliable or excessive false positives |
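The five prompt types and the scoring bands above can be sketched as helpers. The concrete prompt wording and the band cutoffs chosen here are illustrative, not prescribed by the skill:

```python
def trigger_prompts(skill_name: str, feature_word: str,
                    problem: str, unrelated: str) -> dict[str, str]:
    """Build one test prompt per type from the table above."""
    return {
        "direct":    f"Use {skill_name} to help me",
        "keyword":   f"{feature_word} my file",
        "indirect":  problem,
        "ambiguous": f"Can you help with something related to {feature_word}?",
        "negative":  unrelated,
    }

def trigger_score(triggered: dict[str, bool]) -> int:
    """Map observed trigger behaviour to the 0-10 bands (rough heuristic)."""
    false_positive = triggered.get("negative", False)
    hits = sum(triggered[k] for k in ("direct", "keyword", "indirect"))
    if hits == 3 and not false_positive:
        return 10
    if hits >= 2 and not false_positive:
        return 7
    if hits >= 1:
        return 4
    return 0
```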

Step 4: Functional Tests (20 pts)

| Category | Points | What to Test |
|---|---|---|
| Happy path | /8 | Core use cases work correctly |
| Edge cases | /6 | Unusual inputs, boundary conditions |
| Error handling | /3 | Graceful failures, helpful messages |
| Output quality | /3 | Results match expert expectations |

Full report template: See {baseDir}/references/templates.md#full-report


NEVER Do (Evaluator Anti-Patterns)

  1. NEVER score D1 high for tutorials — "How to use library X" is not expert knowledge, even if well-written.

  2. NEVER ignore token cost — An 800-line skill that could be 200 lines is wasting 75% of context window.

  3. NEVER pass security for "educational" shell=True — Legitimate purposes don't justify vulnerabilities.

  4. NEVER assume environment — A skill perfect for CLI is worthless in claude.ai web.

  5. NEVER conflate "comprehensive" with "good" — More content ≠ more value. Density matters.

  6. NEVER skip gate checks — A skill with invalid YAML shouldn't get quality scores.


Common Failure Patterns → Fixes

| Pattern | Symptom | Root Cause | Fix |
|---|---|---|---|
| Tutorial | Low D1, "What is X" sections | Author wrote for humans, not LLMs | Delete all content the LLM already knows |
| Dump | >800 lines, no structure | No progressive disclosure | Split to references/, add loading triggers |
| Orphan References | references/ exists but never loaded | Missing "when to read" instructions | Add explicit "MANDATORY if [X]: Read..." |
| Invisible | Never triggers | Bad description | Move ALL trigger info to description, add keywords |
| Wrong Location | "When to Use" in body | Body loads AFTER trigger decision | Description = when, Body = how |
| Vague Warnings | "Be careful with X" | No actionable anti-patterns | Specific NEVER + surprising consequence |

Quick Reference: Scoring Cheatsheet

D1 (Knowledge Delta):     "Would deleting this make LLM worse?"
D2 (Mindset):             "Does it shape HOW to think, not just WHAT to do?"
D3 (Anti-Patterns):       "Specific NEVERs with surprising reasons?"
D4 (Structure):           "<300 lines? Progressive disclosure? Loading triggers?"
Environment:              "Will core features actually work?"
User:                     "Right audience? Right complexity level?"

The Meta-Question

After every evaluation, ask:

"Would an expert in this domain say: 'Yes, this captures knowledge that took me years to learn'?"

  • Yes → The skill has value.
  • No → The skill is compressing what the LLM already knows.

Bento-Ready Report Output

For visual dashboard integration, generate JSON reports where evaluation logic directly determines visual weight.

Core Principles

Amplify anomalies, converge on normal (异常放大,正常收敛) — information density scales with deviation severity.

Speak human, not framework — Users haven't read our evaluation docs. No D1/D2/G1 jargon.

| Status | Display Strategy |
|---|---|
| Normal score (≥80%) | Compact block, headline only |
| Notable issue (<60%) | Expanded block with detail + action |
| Critical issue (<30% or security) | Prominent block with full evidence |

Output Formats

| Request | Output |
|---|---|
| Standard evaluation | Markdown report (see {baseDir}/references/templates.md) |
| "bento report" / "visual report" / "JSON report" | Bento-ready JSON |

Bento Report Structure

{
  "meta": { "skillName": "...", "totalScore": 48, "maxScore": 60 },
  "summary": { "verdict": "good", "oneLiner": "Ready to use with minor improvements." },
  "blocks": [
    {
      "id": "expert-knowledge",
      "type": "score",
      "importance": { "level": "normal", "reason": "Good expert content with minor redundancy" },
      "layout": { "size": "default" },
      "content": { 
        "headline": "Expert Knowledge — Good ✓",
        "detail": "Contains valuable professional insights. Some basic tutorials could be trimmed."
      }
    }
  ]
}

Importance → Layout Mapping

| Importance | Layout | Trigger |
|---|---|---|
| critical | prominent | Gate fail, security issue, score <30% |
| notable | expanded | Score <60%, outlier disparity |
| normal | default | Score 60-80% |
| minor | compact | Score ≥80%, all gates pass |
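The mapping above can be sketched as a pure function. A minimal sketch — the percentage thresholds are taken directly from the table, and the "outlier disparity" trigger is omitted for simplicity:

```python
LAYOUT = {"critical": "prominent", "notable": "expanded",
          "normal": "default", "minor": "compact"}

def importance(score_pct: float, gate_failed: bool = False,
               security_issue: bool = False) -> str:
    """Importance level per the mapping table (outlier check omitted)."""
    if gate_failed or security_issue or score_pct < 30:
        return "critical"
    if score_pct < 60:
        return "notable"
    if score_pct < 80:
        return "normal"
    return "minor"

def block_layout(score_pct: float, **ctx) -> str:
    return LAYOUT[importance(score_pct, **ctx)]
```

So a dimension scoring 48/60 (80%) would render as a compact block, while any security finding forces a prominent one regardless of score.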

Visual Assets

Add visual_asset for critical/notable blocks:

  • Charts: gauge (header), radar (quality overview)
  • Illustrations: warning style for security issues
  • Badges: verdict display

Always include fallback.text for image generation failures.

MANDATORY for bento reports: Read {baseDir}/references/bento-report-schema.md for complete JSON schema.

MANDATORY for bento generation: Read {baseDir}/references/bento-report-instruction.md for generation rules.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

skill-evaluator

No summary provided by upstream source.

Repository Source · Needs Review
Security

Vuln Briefing

Generate daily vulnerability briefings from NIST NVD, CISA KEV, and security advisories. Aggregates, scores, and formats CVE data into actionable reports. No...

Registry Source · Recently Updated