skill-grader

Structured evaluation rubric for Claude Agent Skills. Produces letter grades (A+ through F) on 10 axes plus an overall grade, with specific improvement recommendations for each axis.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "skill-grader" with this command: npx skills add erichowens/some_claude_skills/erichowens-some-claude-skills-skill-grader

Skill Grader

Structured evaluation rubric for Claude Agent Skills. Produces letter grades (A+ through F) on 10 axes plus an overall grade, with specific improvement recommendations for each axis.

Designed for sub-agents and non-expert reviewers who need a mechanical, repeatable process for assessing skill quality without deep domain expertise.

When to Use

✅ Use for:

  • Auditing a single skill's quality

  • Comparing skills against each other

  • Prioritizing which skills to improve first

  • Quality control sweeps across a skill library

  • Generating improvement roadmaps

❌ NOT for:

  • Creating new skills (use skill-architect )

  • Grading code quality or non-skill documents

  • Evaluating agent performance (different from skill quality)

Grading Process

flowchart TD A[Read SKILL.md + all files] --> B[Score each of 10 axes] B --> C[Assign letter grade per axis] C --> D[Compute overall grade] D --> E[Write improvement recommendations] E --> F[Produce grading report]

Step-by-Step

  • Read the entire skill folder — SKILL.md, all references, scripts, CHANGELOG, README

  • Score each axis — Use the rubric below (0-100 per axis)

  • Convert to letter grade — See grade scale

  • Compute overall grade — Weighted average (Description and Scope are 2x weight)

  • Write 1-3 specific improvements per axis scoring below B+

  • Produce the grading report in the output format below

The 10 Evaluation Axes

Axis 1: Description Quality (Weight: 2x)

Does the description follow [What] [When] [Keywords]. NOT for [Exclusions] ?

Grade Criteria

A Specific verb+noun, domain keywords users would type, 2-5 explicit NOT exclusions, 25-50 words

B Has keywords and NOT clause, but slightly vague or missing synonym coverage

C Too generic, missing NOT clause, or >100 words of process detail

D Single vague sentence ("helps with X") or name/description mismatch

F Missing or empty description

Axis 2: Scope Discipline (Weight: 2x)

Is the skill narrowly focused on one expertise type, or a catch-all?

Grade Criteria

A One clear expertise domain, "When to Use" and "NOT for" sections both present and specific

B Mostly focused, minor boundary ambiguity

C Covers 2-3 related but distinct domains, should probably be split

D Catch-all skill ("helps with anything related to X")

F No scope boundaries defined at all

Axis 3: Progressive Disclosure

Does the skill follow the three-layer architecture (metadata → SKILL.md → references)?

Grade Criteria

A SKILL.md <300 lines, heavy content in references, reference index in SKILL.md with 1-line descriptions

B SKILL.md <500 lines, some references used, index present

C SKILL.md >500 lines, or all content inlined with no references

D SKILL.md >800 lines, or references exist but aren't indexed in SKILL.md

F Single massive file with no structure

Axis 4: Anti-Pattern Coverage

Does the skill encode expert knowledge that prevents common mistakes?

Grade Criteria

A 3+ anti-patterns with Novice/Expert/Timeline template, LLM-mistake notes

B 1-2 anti-patterns with clear explanation

C Anti-patterns mentioned but no structured template

D No anti-patterns, just positive instructions

F Contains advice that IS an anti-pattern (outdated, harmful)

Axis 5: Self-Contained Tools

Does the skill include working tools (scripts, MCPs, subagents)?

Grade Criteria

A Working scripts with CLI interface, error handling, dependency docs; OR valid "no tools needed" justification

B Scripts exist and work but lack error handling or docs

C Scripts referenced but are templates/pseudocode

D Phantom tools (SKILL.md references files that don't exist)

F References non-existent tools AND no acknowledgment

Note: Not every skill needs tools. A pure decision-tree skill can score A if tools aren't applicable.

Axis 6: Activation Precision

Would the skill activate correctly on relevant queries and stay silent on irrelevant ones?

Grade Criteria

A Description has specific keywords matching user language, clear NOT clause, no obvious false-positive vectors

B Good keywords, minor false-positive risk

C Generic keywords that overlap with other skills

D No specific keywords, or NOT clause contradicts intended use

F Description would cause constant false activation

Axis 7: Visual Artifacts

Does the skill use Mermaid diagrams, code examples, and tables effectively?

Grade Criteria

A Decision trees as Mermaid flowcharts, tables for comparisons, code examples for concrete patterns

B Some diagrams or tables, but key decision trees still in prose

C Tables used but no Mermaid diagrams for processes

D Prose-only, no visual structure

F Wall of text with no formatting aids

Axis 8: Output Contracts

Does the skill define what it produces in a format consumable by other agents?

Grade Criteria

A Explicit output format (JSON schema, markdown template, or structured sections), subagent-consumable

B Output format implied but not explicitly documented

C No output format, but content is structured enough to infer

D Unstructured prose output expected

F N/A (pure reference skill) — exempt from this axis

Axis 9: Temporal Awareness

Does the skill track when knowledge was current and what has changed?

Grade Criteria

A Timelines in anti-patterns, "as of [date]" markers, CHANGELOG with dates

B Some temporal context, CHANGELOG exists

C No dates on knowledge, but CHANGELOG exists

D No temporal context anywhere, knowledge could be stale

F Contains demonstrably outdated advice without warning

Axis 10: Documentation Quality

README, CHANGELOG, and reference organization.

Grade Criteria

A README with quick start, CHANGELOG with dated versions, references well-organized with clear filenames

B README and CHANGELOG exist, references present

C SKILL.md is the only file, but it's well-structured

D No README, no CHANGELOG, disorganized references

F SKILL.md is the only file and it's poorly structured

Grade Scale

Letter Score Range Meaning

A+ 97-100 Exemplary — sets the standard

A 93-96 Excellent — minor improvements possible

A- 90-92 Very good — a few small gaps

B+ 87-89 Good — notable room for improvement

B 83-86 Solid — several areas need work

B- 80-82 Above average — meaningful gaps

C+ 77-79 Average — significant improvements needed

C 73-76 Below average — major gaps

C- 70-72 Barely adequate

D+ 67-69 Poor — fundamental issues

D 63-66 Very poor — needs major rework

D- 60-62 Near-failing quality

F <60 Failing — start over

Overall Grade Computation

Axes 1 (Description) and 2 (Scope) carry 2x weight. All others carry 1x weight. If Axis 8 (Output Contracts) is marked exempt, remove it from the calculation.

Overall = (2×Axis1 + 2×Axis2 + Axis3 + Axis4 + Axis5 + Axis6 + Axis7 + Axis8 + Axis9 + Axis10) / 12

Convert the numeric average to a letter grade using the scale above.

Output Format

Produce this exact structure:

Skill Grading Report: [skill-name]

Graded: [date] Overall Grade: [letter] ([score]/100)

Axis Grades

#AxisGradeScoreKey Finding
1Description Quality[grade][score][1-line finding]
2Scope Discipline[grade][score][1-line finding]
3Progressive Disclosure[grade][score][1-line finding]
4Anti-Pattern Coverage[grade][score][1-line finding]
5Self-Contained Tools[grade][score][1-line finding]
6Activation Precision[grade][score][1-line finding]
7Visual Artifacts[grade][score][1-line finding]
8Output Contracts[grade][score][1-line finding]
9Temporal Awareness[grade][score][1-line finding]
10Documentation Quality[grade][score][1-line finding]

Top 3 Improvements (Highest Impact)

  1. [Axis]: [Specific action] — [Why this matters, expected grade improvement]
  2. [Axis]: [Specific action] — [Why this matters, expected grade improvement]
  3. [Axis]: [Specific action] — [Why this matters, expected grade improvement]

Detailed Notes

[Axis name] ([grade])

[2-3 sentences of specific feedback with examples from the skill]

[Repeat for each axis scoring below B+]

Quick Grading (Abbreviated)

For rapid triage across many skills, produce only:

SkillOverallDescScopeDiscAntiToolsActivVisualOutputTempDocs
[name][grade]..............................

Anti-Patterns in Grading

Grade Inflation

Wrong: Giving B+ because "it's pretty good" without checking criteria. Right: Match observations to the rubric table literally. If the description lacks a NOT clause, it cannot score above C on Axis 1.

Missing Context

Wrong: Grading a pure decision-tree skill poorly on Axis 5 (tools) because it has no scripts. Right: Mark Axis 5 as "A — tools not applicable for this skill type."

Ignoring Phantoms

Wrong: Scoring Axis 5 as B because scripts are "referenced." Right: Actually check if every referenced file exists. If scripts/validate.py is mentioned but doesn't exist, that's D.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

agent-creator

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

test-automation-expert

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

chatbot-analytics

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

dag-task-scheduler

No summary provided by upstream source.

Repository SourceNeeds Review