skill-evaluator

Evaluate Claude Code SKILL.md quality across 6 weighted dimensions: frontmatter quality, trigger coverage, structural completeness, content depth, consistency and integrity, and CONTRIBUTING.md compliance. Produces scored audit reports with severity-classified findings and actionable recommendations. Two modes: (1) Quick Audit — single skill, full per-dimension report with weighted scoring; (2) Full Audit — all skills in repo, comparative ranking table plus per-skill summaries. Triggers on: "evaluate skill", "audit skill quality", "score skill", "skill review", "check skill completeness", "rate this skill", "skill quality check", "grade skill", "assess skill", "skill audit", "how good is this skill", "SKILL.md feedback". Use this skill when reviewing Claude Code skills before deployment, comparing skill quality across a repo, or diagnosing why a skill fails to activate on relevant queries. NOT for evaluating or improving LLM prompts — use prompt-lab instead.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "skill-evaluator" with this command: npx skills add mathews-tom/praxis-skills/mathews-tom-praxis-skills-skill-evaluator

Skill Evaluator

Skills that do not activate on relevant queries waste the entire investment in writing them. A skill can have deep, well-structured content and still deliver zero value if its frontmatter description lacks the trigger phrases users actually type. Quality evaluation catches trigger gaps, missing sections, and shallow content before deployment — turning a skill from a static document into a reliable tool.

Reference Files

| File | Contents |
|------|----------|
| references/evaluation-rubric.md | Detailed 1-5 scoring criteria per dimension, weight justifications, worked examples for calibration |

Audit Modes

Two modes, selected by input:

  • Quick Audit: Evaluate a single skill. Produces a full per-dimension scored report with findings, severity classifications, and recommendations.
  • Full Audit: Evaluate all skills in the repository. Produces a comparative ranking table sorted by overall score, plus condensed per-skill summaries.

Mode Selection

| Input | Mode |
|-------|------|
| Path to a specific skill directory or SKILL.md | Quick Audit |
| "all", "every skill", no path specified, or repo-level request | Full Audit |
| Multiple specific paths | Quick Audit for each, then comparative summary |
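
The selection table above can be sketched as a small function. This is a minimal illustration, not part of the skill itself: `select_mode`, its parameters, and the returned mode labels are hypothetical names, and detecting a repo-level request ("all", "every skill") is assumed to happen upstream and arrive as the `wants_all` flag.

```python
from typing import List


def select_mode(paths: List[str], wants_all: bool = False) -> str:
    """Pick an audit mode from the skill paths named in the request.

    wants_all is assumed to be set by the caller when the user said
    "all", "every skill", or made a repo-level request.
    """
    if wants_all or not paths:
        return "full"  # Full Audit: every skill in the repository
    if len(paths) == 1:
        return "quick"  # Quick Audit: one skill, full report
    # Several explicit paths: Quick Audit for each, then a comparative summary
    return "quick-each-then-compare"
```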

Evaluation Dimensions

Six dimensions, each scored 1-5. Weighted sum determines overall percentage.

D1: Frontmatter Quality (20%)

Evaluates the YAML frontmatter block for completeness and discoverability.

Signals:

  • name field present and non-empty
  • description field present and non-empty
  • Description length between 200 and 800 characters (sweet spot for keyword density without bloat)
  • Description contains explicit trigger phrases users would type
  • Description includes a "Use this skill when..." clause or equivalent
  • Description is keyword-dense, not generic filler

Scoring constraints: A description under 100 characters caps this dimension at 2/5. A missing name or description field caps at 1/5.
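
The D1 caps are mechanical enough to express in code. The sketch below assumes the frontmatter has already been parsed into `name` and `description` strings; the function names are hypothetical, and the returned value is only the ceiling the constraints impose, not the rubric score itself.

```python
from typing import Optional


def d1_score_cap(name: Optional[str], description: Optional[str]) -> int:
    """Return the highest D1 score the stated constraints allow (5 = no cap)."""
    if not name or not name.strip():
        return 1  # missing or empty name field
    if not description or not description.strip():
        return 1  # missing or empty description field
    if len(description) < 100:
        return 2  # description under 100 characters
    return 5


def in_sweet_spot(description: str) -> bool:
    """True when the description length falls in the 200-800 character range."""
    return 200 <= len(description) <= 800
```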

D2: Trigger Coverage (18%)

Evaluates whether the skill activates on the queries users actually type.

Signals:

  • Synonym breadth — multiple phrasings for the same intent (e.g., "review", "audit", "critique", "evaluate", "assess", "check")
  • Implied contexts — situations where the skill applies even without explicit keywords (e.g., "user provides a design doc and asks for feedback")
  • Domain-specific terms relevant to the skill's function
  • Explicit trigger phrase list in the description frontmatter
  • Coverage of both imperative ("review this") and interrogative ("is this good?") forms

Scoring constraints: Fewer than 3 distinct trigger phrases caps at 2/5. Zero trigger phrases in the description caps at 1/5.
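
One way to apply the D2 caps is to count distinct trigger phrases in the description. The heuristic here is an assumption, not something the skill prescribes: a trigger phrase is taken to be any double-quoted string in the description, compared case-insensitively. Function names are hypothetical.

```python
import re
from typing import List


def quoted_trigger_phrases(description: str) -> List[str]:
    """Collect distinct double-quoted phrases, lowercased, in first-seen order."""
    distinct: List[str] = []
    for phrase in re.findall(r'"([^"]+)"', description):
        key = phrase.strip().lower()
        if key and key not in distinct:
            distinct.append(key)
    return distinct


def d2_score_cap(description: str) -> int:
    """Return the highest D2 score the stated constraints allow (5 = no cap)."""
    count = len(quoted_trigger_phrases(description))
    if count == 0:
        return 1  # zero trigger phrases
    if count < 3:
        return 2  # fewer than 3 distinct trigger phrases
    return 5
```

A description that lists triggers without quoting them would be undercounted by this heuristic, so the result should be treated as a first pass before manual review.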

D3: Structural Completeness (20%)

Evaluates whether the skill contains the sections needed to function reliably.

Signals:

  • Prerequisites or setup instructions (if applicable)
  • Multi-phase workflow or step-by-step procedure
  • Error handling guidance or edge case documentation
  • Output format specification (template, example, or schema)
  • Limitations or scope boundaries stated
  • Reference file table (if references/ directory exists)
  • Calibration rules or quality gates

Scoring constraints: A skill with no workflow section caps at 2/5. A skill with a workflow but no error handling or output format caps at 3/5.

D4: Content Depth (22%)

Evaluates the substantive quality of the skill's guidance — whether it provides enough detail for an agent to execute well without human intervention.

Signals:

  • Multi-step workflows with decision points, not bare command lists
  • Error cases documented with recovery actions
  • Decision frameworks (when to do X vs Y, mode selection tables)
  • Verbatim output examples or templates
  • Severity classifications or scoring rubrics (where applicable)
  • Cross-cutting analysis or synthesis steps beyond simple checklists

Scoring constraints: A skill consisting only of bare commands with no explanatory context caps at 2/5. Reference files count toward this dimension only if they contain substantive guidance (checklists, rubrics, criteria), not just link collections.

D5: Consistency and Integrity (12%)

Evaluates internal consistency and structural integrity.

Signals:

  • Directory name matches the name field in frontmatter exactly
  • All files referenced in SKILL.md exist on disk (reference files, scripts, assets)
  • Description content aligns with body content (description does not promise features the body does not deliver)
  • Consistent terminology throughout (same concept uses same term)
  • No broken internal links or dangling references
  • Self-containment: No cross-skill references (../other-skill/) in SKILL.md or reference files. Skills must be standalone packages — all referenced files must live within the skill's own directory. Shared content should use the _templates/ sync system to maintain local copies.

Scoring constraints: A name mismatch between directory and frontmatter is a CRITICAL finding and caps at 1/5. Cross-skill ../ references are a CRITICAL finding and cap at 1/5 — they break standalone packaging. Missing referenced files cap at 2/5.
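
The three D5 caps can be checked against the filesystem. This is a sketch under stated assumptions: referenced files are assumed to appear in the text as relative paths beginning with `references/`, `scripts/`, or `assets/`, and both regular expressions are illustrative heuristics rather than the skill's actual detection logic. `d5_score_cap` is a hypothetical name.

```python
import re
from pathlib import Path

# Any ../ path is treated as a cross-skill reference (heuristic).
CROSS_SKILL_REF = re.compile(r"\.\./[\w./-]+")
# Assumed shape of local file references inside SKILL.md.
LOCAL_REF = re.compile(r"\b(?:references|scripts|assets)/[\w./-]+")


def d5_score_cap(skill_dir: Path, frontmatter_name: str, text: str) -> int:
    """Return the highest D5 score the stated constraints allow (5 = no cap)."""
    if skill_dir.name != frontmatter_name:
        return 1  # CRITICAL: directory name and name field differ
    if CROSS_SKILL_REF.search(text):
        return 1  # CRITICAL: reference escapes the skill directory
    # Drop a trailing sentence period before checking the path on disk.
    refs = {ref.rstrip(".") for ref in LOCAL_REF.findall(text)}
    if any(not (skill_dir / ref).is_file() for ref in refs):
        return 2  # a referenced file is missing
    return 5
```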

D6: CONTRIBUTING.md Compliance (8%)

Evaluates adherence to the repository's contribution guidelines.

Signals:

  • Skill name is kebab-case
  • Skill name is 64 characters or fewer
  • Description is 1024 characters or fewer
  • No angle brackets in description
  • No pushy trigger language in description ("always use", "you must", "never do")
  • Valid YAML frontmatter syntax

Scoring constraints: Any single violation caps at 3/5. Multiple violations cap at 2/5. Invalid YAML that prevents parsing caps at 1/5.
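
The D6 signals translate directly into checks. In this sketch the kebab-case pattern (lowercase letters and digits separated by single hyphens) is an assumed definition, the pushy-phrase list contains only the three examples given above, and the function names are hypothetical.

```python
import re
from typing import List

KEBAB_CASE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")  # assumed definition
PUSHY_PHRASES = ("always use", "you must", "never do")


def d6_violations(name: str, description: str) -> List[str]:
    """List the CONTRIBUTING.md signals that the name or description violates."""
    violations: List[str] = []
    if not KEBAB_CASE.fullmatch(name):
        violations.append("name is not kebab-case")
    if len(name) > 64:
        violations.append("name exceeds 64 characters")
    if len(description) > 1024:
        violations.append("description exceeds 1024 characters")
    if "<" in description or ">" in description:
        violations.append("angle brackets in description")
    lowered = description.lower()
    if any(phrase in lowered for phrase in PUSHY_PHRASES):
        violations.append("pushy trigger language in description")
    return violations


def d6_score_cap(violations: List[str], yaml_parsed: bool = True) -> int:
    """Return the highest D6 score the stated constraints allow (5 = no cap)."""
    if not yaml_parsed:
        return 1  # invalid YAML that prevents parsing
    if len(violations) > 1:
        return 2
    if len(violations) == 1:
        return 3
    return 5
```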


Severity Classification

| Severity | Criteria | Score Impact |
|----------|----------|--------------|
| CRITICAL | Skill cannot activate or breaks on load: missing frontmatter, name mismatch, invalid YAML | Caps overall score at 40% |
| HIGH | Significant trigger gap or missing core section: no workflow, no error handling, zero trigger phrases | Caps affected dimension at 3/5 |
| MEDIUM | Weak coverage, shallow content, few trigger synonyms | Dimension needs improvement but functions |
| LOW | Minor polish: formatting inconsistencies, slightly short description, missing calibration rules | Fix when convenient |

Workflow

Phase 1: Input

  1. Determine audit mode from user input (see Mode Selection table above).
  2. For Quick Audit: validate the skill directory exists and contains a SKILL.md file. If the path points to a SKILL.md directly, use its parent directory.
  3. For Full Audit: enumerate all directories under skills/ that contain a SKILL.md.
  4. For each skill to evaluate, note the directory name for D5 consistency checks.
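
Steps 2 and 3 can be sketched with `pathlib`. The helper names are hypothetical, and "directories under skills/" is read here as direct children only; a nested layout would need a recursive search instead.

```python
from pathlib import Path
from typing import List


def resolve_skill_dir(path: Path) -> Path:
    """Quick Audit input: accept a skill directory or a path to its SKILL.md."""
    skill_dir = path.parent if path.name == "SKILL.md" else path
    if not (skill_dir / "SKILL.md").is_file():
        raise FileNotFoundError(f"SKILL.md not found in {skill_dir}")
    return skill_dir


def enumerate_skills(repo_root: Path) -> List[Path]:
    """Full Audit input: direct children of skills/ that contain a SKILL.md."""
    skills_root = repo_root / "skills"
    if not skills_root.is_dir():
        return []
    return sorted(
        child
        for child in skills_root.iterdir()
        if child.is_dir() and (child / "SKILL.md").is_file()
    )
```

The directory name needed for the D5 check in step 4 is then `skill_dir.name`.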

Phase 2: Analysis

For each skill under evaluation:

  1. Read SKILL.md in full.
  2. Parse YAML frontmatter — extract name and description fields. If YAML parsing fails, record a CRITICAL finding and score D1 and D6 as 1/5.
  3. Check the references/ directory for existence and contents. Verify every file referenced in the SKILL.md body exists on disk.
  4. Scan SKILL.md and all reference files for cross-skill path references (../ patterns pointing outside the skill directory). Flag any as CRITICAL D5 findings.
  5. Evaluate each of the 6 dimensions using the criteria above and the detailed rubric in references/evaluation-rubric.md.
  6. Record findings with severity, dimension tag, description, and recommendation.
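
Step 2 depends on separating the frontmatter from the body. The sketch below is deliberately not a YAML parser: it only splits on the `---` delimiters and reads top-level `key: value` scalars, which is enough to pull out `name` and a single-line `description`. A real implementation would likely hand the frontmatter block to a YAML library; `split_frontmatter` is a hypothetical name.

```python
from typing import Dict, Tuple


def split_frontmatter(text: str) -> Tuple[Dict[str, str], str]:
    """Split SKILL.md text into top-level frontmatter scalars and the body.

    Raises ValueError when the block is missing or unclosed, which the
    caller can record as a CRITICAL finding.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("no frontmatter block")
    end = next(
        (i for i in range(1, len(lines)) if lines[i].strip() == "---"), None
    )
    if end is None:
        raise ValueError("frontmatter block is not closed")

    fields: Dict[str, str] = {}
    for line in lines[1:end]:
        # Skip blank lines, comments, and indented continuation lines.
        if line[:1] in (" ", "\t", "#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        # Remove one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        fields[key.strip()] = value
    return fields, "\n".join(lines[end + 1:])
```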

Phase 3: Scoring

  1. Score each dimension 1-5 using references/evaluation-rubric.md.
  2. Compute the weighted score: Overall% = (sum of dimension_score x weight) / 5 x 100.
  3. Apply severity caps: if any CRITICAL finding exists, cap the overall score at 40% regardless of dimension scores.
  4. Determine verdict from the scale below.

| Range | Verdict |
|-------|---------|
| 90-100% | Exemplary |
| 80-89% | Strong |
| 70-79% | Adequate |
| 60-69% | Needs Work |
| Below 60% | Deficient |
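
The scoring arithmetic can be checked with a short sketch. The weights and verdict ranges come from this document; the function names are hypothetical, and treating each range's lower bound as an inclusive threshold (so 89.5% is Strong) is an assumption, since the scale lists whole percentages only.

```python
from typing import Dict

WEIGHTS = {"D1": 0.20, "D2": 0.18, "D3": 0.20, "D4": 0.22, "D5": 0.12, "D6": 0.08}


def overall_percentage(scores: Dict[str, int], has_critical: bool = False) -> float:
    """Weighted sum of 1-5 dimension scores, scaled to a percentage."""
    weighted_sum = sum(scores[dim] * weight for dim, weight in WEIGHTS.items())
    percentage = weighted_sum / 5 * 100
    # Any CRITICAL finding caps the overall score at 40%.
    return min(percentage, 40.0) if has_critical else percentage


def verdict(percentage: float) -> str:
    """Map an overall percentage onto the verdict scale."""
    if percentage >= 90:
        return "Exemplary"
    if percentage >= 80:
        return "Strong"
    if percentage >= 70:
        return "Adequate"
    if percentage >= 60:
        return "Needs Work"
    return "Deficient"
```

For example, scores of D1=4, D2=3, D3=5, D4=4, D5=5, D6=3 give a weighted sum of 0.80 + 0.54 + 1.00 + 0.88 + 0.60 + 0.24 = 4.06, so the overall score is 4.06 / 5 x 100 = 81.2%, which is Strong. The same scores with a CRITICAL finding are capped at 40%, which is Deficient.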

Phase 4: Report

Generate the structured output using the appropriate template below.


Output Format

Quick Audit Template

## Skill Audit: {skill-name}

| Dimension | Score | Weight | Weighted | Key Finding |
|-----------|-------|--------|----------|-------------|
| D1: Frontmatter Quality | X/5 | 20% | X.XXX | ... |
| D2: Trigger Coverage | X/5 | 18% | X.XXX | ... |
| D3: Structural Completeness | X/5 | 20% | X.XXX | ... |
| D4: Content Depth | X/5 | 22% | X.XXX | ... |
| D5: Consistency & Integrity | X/5 | 12% | X.XXX | ... |
| D6: CONTRIBUTING Compliance | X/5 | 8% | X.XXX | ... |

**Overall: XX% — {Verdict}**

### Findings

[Severity-sorted list. Each entry includes dimension tag, severity, description,
evidence, and recommendation.]

- **[CRITICAL] D5:** ...
- **[HIGH] D2:** ...
- **[MEDIUM] D4:** ...
- **[LOW] D3:** ...

### Score Calculation

D1: {score} x 0.20 = {result}
D2: {score} x 0.18 = {result}
D3: {score} x 0.20 = {result}
D4: {score} x 0.22 = {result}
D5: {score} x 0.12 = {result}
D6: {score} x 0.08 = {result}
Sum = {weighted_sum}
Overall = {weighted_sum} / 5 x 100 = {percentage}% — {Verdict}

Full Audit Template

## Skill Repository Audit

| Skill | Overall | Verdict | Worst Dimension | Top Issue |
|-------|---------|---------|-----------------|-----------|
| {name} | XX% | {verdict} | {dimension} | {issue} |
| ... | ... | ... | ... | ... |

### Per-Skill Summaries

[Condensed Quick Audit for each skill: scorecard table, overall score, top 3 findings.
Omit the full Score Calculation section in condensed mode.]
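
The ranking table in the Full Audit template can be produced from per-skill results sorted by overall score. This sketch assumes each result is a dictionary with the hypothetical keys `name`, `overall`, `verdict`, `worst`, and `issue`; rounding the percentage to a whole number is also an assumption.

```python
from typing import Any, Dict, List


def ranking_table(results: List[Dict[str, Any]]) -> str:
    """Render the comparative ranking table, highest overall score first."""
    ranked = sorted(results, key=lambda r: r["overall"], reverse=True)
    lines = [
        "| Skill | Overall | Verdict | Worst Dimension | Top Issue |",
        "|-------|---------|---------|-----------------|-----------|",
    ]
    for r in ranked:
        lines.append(
            f"| {r['name']} | {r['overall']:.0f}% | {r['verdict']} "
            f"| {r['worst']} | {r['issue']} |"
        )
    return "\n".join(lines)
```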

Error Handling

| Problem | Cause | Fix |
|---------|-------|-----|
| SKILL.md not found in directory | Path incorrect or file missing | Report as CRITICAL; do not attempt evaluation; surface the path and stop |
| YAML frontmatter parse failure | Invalid YAML syntax (unclosed quotes, bad indentation) | Report as CRITICAL finding; score D1 and D6 as 1/5; continue evaluating the body content where parseable |
| references/ directory missing | Skill has no reference files | Not an error; score D5 normally and check only that any files referenced in the SKILL.md body actually exist on disk |
| references/ exists but referenced file is absent | File path in SKILL.md body doesn't resolve | Record as a CRITICAL D5 finding; missing referenced files cap D5 at 2/5 |
| Empty SKILL.md (zero bytes or whitespace only) | File created but never populated | Treat as CRITICAL; score all dimensions 1/5; overall verdict: Deficient |
| references/evaluation-rubric.md not found | Skill's own reference file missing | Note the irony; evaluate using the criteria inline in this SKILL.md; flag D5 as a CRITICAL finding |

Calibration Rules

  1. Score what exists, not what could exist — evaluate the skill as-is, not its potential.
  2. Weight trigger coverage heavily for skills targeting broad domains (e.g., a GitHub skill covers issues, PRs, CI, releases, and API — it needs proportionally more trigger synonyms).
  3. A skill with strong triggers but shallow content scores higher than deep content with poor triggers — activation is prerequisite to utility.
  4. Reference files count toward Content Depth only if they contain substantive guidance (checklists, rubrics, criteria), not link lists or stub files.
  5. When evaluating the skill-evaluator itself, apply identical standards — no self-inflation.
  6. Frontmatter description quality is the single highest-leverage improvement for any skill.
