# Audit Agents/Skills/Commands (Advanced Skill)
Comprehensive quality audit system for Claude Code agents, skills, and commands. Provides quantitative scoring, comparative analysis, and production readiness grading based on industry best practices.
## Purpose

**Problem:** Manual validation of agents/skills is error-prone and inconsistent. According to the LangChain Agent Report 2026, 29.5% of organizations deploy agents without systematic evaluation, leading to "agent bugs" as the top challenge (18% of teams).

**Solution:** Automated quality scoring across 16 weighted criteria with production readiness thresholds (80% = Grade B minimum for production deployment).
**Key Features:**

- Quantitative scoring (32 points for agents/skills, 20 for commands)
- Weighted criteria (Identity 3x, Prompt 2x, Validation 1x, Design 2x)
- Production readiness grading (A-F scale with 80% threshold)
- Comparative analysis vs reference templates
- JSON/Markdown dual output for programmatic integration
- Fix suggestions for failing criteria
## Modes

| Mode | Usage | Output |
|---|---|---|
| Quick Audit | Top-5 critical criteria only | Fast pass/fail (3-5 min for 20 files) |
| Full Audit | All 16 criteria per file | Detailed scores + recommendations (10-15 min) |
| Comparative | Full + benchmark vs templates | Analysis + gap identification (15-20 min) |

**Default:** Full Audit (recommended for first run)
## Methodology

### Why These Criteria?

The 16-criteria framework is derived from:

- **Claude Code Best Practices** (Ultimate Guide line 4921: Agent Validation Checklist)
- **Industry Data** (LangChain Agent Report 2026: evaluation gaps)
- **Production Failures** (community feedback on hardcoded paths, missing error handling)
- **Composition Patterns** (skills should reference other skills; agents should be modular)
### Scoring Philosophy

**Weight rationale:**

- **Identity (3x):** If users can't find or invoke the agent, quality is irrelevant (discoverability > quality)
- **Prompt (2x):** Determines reliability and accuracy of outputs
- **Validation (1x):** Improves robustness but is secondary to core functionality
- **Design (2x):** Impacts long-term maintainability and scalability
**Grade standards** (a minimal mapping is sketched below):

- **A (90-100%):** Production-ready, minimal risk
- **B (80-89%):** Good, meets production threshold
- **C (70-79%):** Needs improvement before production
- **D (60-69%):** Significant gaps, not production-ready
- **F (<60%):** Critical issues, requires major refactoring
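These thresholds translate directly into a mapping function. A minimal sketch, assuming scores arrive as 0-100 percentages (the function name is illustrative, not part of the skill's spec):

```python
def assign_grade(score_pct: float) -> str:
    """Map a 0-100 percentage score to the A-F scale above."""
    if score_pct >= 90:
        return "A"
    if score_pct >= 80:
        return "B"  # production threshold
    if score_pct >= 70:
        return "C"
    if score_pct >= 60:
        return "D"
    return "F"
```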
**Industry alignment:** The 80% threshold aligns with software engineering best practices for production deployment (e.g., code coverage >80%, security scan pass rates).
## Workflow

### Phase 1: Discovery

1. Scan directories (see the sketch after this list):

   ```
   .claude/agents/
   .claude/skills/
   .claude/commands/
   examples/agents/    (if exists)
   examples/skills/    (if exists)
   examples/commands/  (if exists)
   ```

2. Classify files by type (agent/skill/command)

3. Load reference templates (Comparative mode only):

   ```
   guide/examples/agents/    (benchmark files)
   guide/examples/skills/    (benchmark files)
   guide/examples/commands/  (benchmark files)
   ```
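A minimal discovery pass over these directories might look like the following sketch; the helper name and the suffix-based classification rule are illustrative:

```python
from pathlib import Path

SCAN_DIRS = [
    ".claude/agents", ".claude/skills", ".claude/commands",
    "examples/agents", "examples/skills", "examples/commands",
]

def discover(project_root: str) -> list[tuple[Path, str]]:
    """Return (path, type) pairs for every auditable Markdown file."""
    found = []
    for rel in SCAN_DIRS:
        base = Path(project_root) / rel
        if not base.is_dir():
            continue  # the examples/* directories are optional
        file_type = rel.split("/")[-1].rstrip("s")  # agents -> agent, etc.
        for path in sorted(base.rglob("*.md")):
            found.append((path, file_type))
    return found
```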
### Phase 2: Scoring Engine

Load scoring criteria from `scoring/criteria.yaml`:

```yaml
agents:
  max_points: 32
  categories:
    identity:
      weight: 3
      criteria:
        - id: A1.1
          name: "Clear name"
          points: 3
          detection: "frontmatter.name exists and is descriptive"
        # ... (16 total criteria)
```
For each file:

1. Parse frontmatter (YAML)
2. Extract content sections
3. Run detection patterns (regex, keyword search)
4. Calculate score: `(points / max_points) × 100`
5. Assign grade (A-F), as sketched below
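Putting these steps together, a per-file scoring pass could look like this sketch; `run_detection` is a hypothetical stand-in for the regex/keyword checks, and `assign_grade` is the mapping sketched under Scoring Philosophy:

```python
def score_file(content: str, criteria: list[dict], max_points: int) -> dict:
    """Score one parsed file against its criteria and assign a grade."""
    obtained, failed = 0, []
    for criterion in criteria:
        if run_detection(content, criterion["detection"]):  # hypothetical checker
            obtained += criterion["points"]
        else:
            failed.append(criterion["id"])
    pct = obtained / max_points * 100
    return {
        "points_obtained": obtained,
        "points_max": max_points,
        "score": round(pct, 1),
        "grade": assign_grade(pct),
        "failed_criteria": failed,
    }
```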
### Phase 3: Comparative Analysis (Comparative Mode Only)

For each project file:

1. Find the closest matching template (by description similarity)
2. Compare scores per criterion
3. Identify gaps: `template_score - project_score`
4. Flag significant gaps (>10 points difference); see the sketch after the example
Example:

```
Project file:     .claude/agents/debugging-specialist.md (Score: 78%, Grade C)
Closest template: examples/agents/debugging-specialist.md (Score: 94%, Grade A)

Gaps:
- Anti-hallucination measures: -2 points (template has, project missing)
- Edge cases documented: -1 point (template has 5 examples, project has 1)
- Integration documented: -1 point (template references 3 skills, project none)

Total gap: 16 percentage points (explains the C vs A difference)
```
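A sketch of the matching and gap computation, assuming each audit result carries its description and per-criterion scores keyed by criterion ID, and reusing the `jaccard_similarity` helper defined under Detection Patterns below:

```python
def compare_to_template(project: dict, templates: list[dict]) -> dict:
    """Match a project file to its closest template and list scoring gaps."""
    # Closest template = highest word-level Jaccard similarity of descriptions
    best = max(
        templates,
        key=lambda t: jaccard_similarity(project["description"], t["description"]),
    )
    gaps = {
        crit_id: tmpl_pts - project["criterion_scores"].get(crit_id, 0)
        for crit_id, tmpl_pts in best["criterion_scores"].items()
        if tmpl_pts > project["criterion_scores"].get(crit_id, 0)
    }
    return {
        "template": best["path"],
        "score_gap": best["score"] - project["score"],
        "criterion_gaps": gaps,
    }
```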
### Phase 4: Report Generation

**Markdown report** (`audit-report.md`):

- Summary table (overall + by type)
- Individual scores with top issues
- Detailed breakdown per file (collapsible)
- Prioritized recommendations

**JSON output** (`audit-report.json`):

```json
{
  "metadata": {
    "project_path": "/path/to/project",
    "audit_date": "2026-02-07",
    "mode": "full",
    "version": "1.0.0"
  },
  "summary": {
    "overall_score": 82.5,
    "overall_grade": "B",
    "total_files": 15,
    "production_ready_count": 10,
    "production_ready_percentage": 66.7
  },
  "by_type": {
    "agents": { "count": 5, "avg_score": 85.2, "grade": "B" },
    "skills": { "count": 8, "avg_score": 78.9, "grade": "C" },
    "commands": { "count": 2, "avg_score": 92.0, "grade": "A" }
  },
  "files": [
    {
      "path": ".claude/agents/debugging-specialist.md",
      "type": "agent",
      "score": 78.1,
      "grade": "C",
      "points_obtained": 25,
      "points_max": 32,
      "failed_criteria": [
        {
          "id": "A2.4",
          "name": "Anti-hallucination measures",
          "points_lost": 2,
          "recommendation": "Add section on source verification"
        }
      ]
    }
  ],
  "top_issues": [
    {
      "issue": "Missing error handling",
      "affected_files": 8,
      "impact": "Runtime failures unhandled",
      "priority": "high"
    }
  ]
}
```
### Phase 5: Fix Suggestions (Optional)

For each failing criterion, generate an actionable fix (one possible implementation is sketched after the example):

```
File: .claude/agents/debugging-specialist.md
Issue: Missing anti-hallucination measures (2 points lost)

Fix: Add this section after "Methodology":

  ## Source Verification
  - Always cite sources for technical claims
  - Use phrases: "According to [documentation]...", "Based on [tool output]..."
  - If uncertain, state: "I don't have verified information on..."
  - Never invent: statistics, version numbers, API signatures, stack traces

Detection: Grep for keywords: "verify", "cite", "source", "evidence"
```
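One way to implement this is a lookup from criterion ID to a fix template. A minimal sketch; the template text is illustrative (only criterion A2.4 appears in the JSON example above):

```python
# Hypothetical fix templates keyed by criterion ID
FIX_TEMPLATES = {
    "A2.4": "Add a 'Source Verification' section requiring cited sources "
            "and explicit uncertainty statements.",
}

def suggest_fixes(file_path: str, failed_criteria: list[dict]) -> list[str]:
    """Render one actionable suggestion per failed criterion."""
    return [
        f"{file_path}: {crit['name']} ({crit['points_lost']} pts lost) -> "
        + FIX_TEMPLATES.get(crit["id"], "See scoring/criteria.yaml for details.")
        for crit in failed_criteria
    ]
```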
## Scoring Criteria

See `scoring/criteria.yaml` for complete definitions. Summary:

### Agents (32 points max)

| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Identity | 3x | 4 | 12 |
| Prompt Quality | 2x | 4 | 8 |
| Validation | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |
Key criteria:

- **Clear name** (3 pts): Not generic like "agent1"
- **Description with triggers** (3 pts): Contains "when"/"use"
- **Role defined** (2 pts): "You are..." statement
- **3+ examples** (1 pt): Usage scenarios documented
- **Single responsibility** (2 pts): Focused, not "general purpose"
### Skills (32 points max)

| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Structure | 3x | 4 | 12 |
| Content | 2x | 4 | 8 |
| Technical | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |
Key criteria:

- **Valid SKILL.md** (3 pts): Proper naming
- **Name valid** (3 pts): Lowercase, 1-64 chars, no spaces
- **Methodology described** (2 pts): Workflow section exists
- **No hardcoded paths** (1 pt): No `/Users/`, `/home/`
- **Clear triggers** (2 pts): "When to use" section
### Commands (20 points max)

| Category | Weight | Criteria Count | Max Points |
|---|---|---|---|
| Structure | 3x | 4 | 12 |
| Quality | 2x | 4 | 8 |
Key criteria:

- **Valid frontmatter** (3 pts): name + description
- **Argument hint** (3 pts): If `$ARGUMENTS` is used
- **Step-by-step workflow** (3 pts): Numbered sections
- **Error handling** (2 pts): Mentions failure modes
## Detection Patterns

### Frontmatter Parsing

```python
import re

import yaml

def parse_frontmatter(content):
    """Extract and parse the YAML frontmatter delimited by --- markers."""
    match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
    if match:
        return yaml.safe_load(match.group(1))
    return None
```
### Keyword Detection

```python
def has_keywords(text, keywords):
    """Case-insensitive check for the presence of any keyword."""
    text_lower = text.lower()
    return any(kw in text_lower for kw in keywords)

# Examples
has_trigger = has_keywords(description, ['when', 'use', 'trigger'])
has_error_handling = has_keywords(content, ['error', 'failure', 'fallback'])
```
### Overlap Detection (Duplication Check)

```python
def jaccard_similarity(text1, text2):
    """Jaccard index over word sets: |intersection| / |union|."""
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    intersection = words1 & words2
    union = words1 | words2
    return len(intersection) / len(union) if union else 0

# Flag if similarity > 0.5 (50% word overlap)
if jaccard_similarity(desc1, desc2) > 0.5:
    issues.append("High overlap with another file")
```
### Token Counting (Approximate)

```python
def estimate_tokens(text):
    """Rough estimate: 1 token ≈ 0.75 words, i.e. ~1.3 tokens per word."""
    word_count = len(text.split())
    return int(word_count * 1.3)

# Check the token budget
tokens = estimate_tokens(file_content)
if tokens > 5000:
    issues.append("File too large (>5K tokens)")
```
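### Hardcoded Path Detection

The "No hardcoded paths" criterion has no pattern shown above. A minimal sketch, assuming `/Users/` and `/home/` (the prefixes named in the Skills criteria) are what should be flagged; the regex is illustrative:

```python
import re

# User-specific absolute path prefixes flagged by the Skills criteria
HARDCODED_PATH_RE = re.compile(r'/(Users|home)/\w+')

def has_hardcoded_paths(content):
    """True if the file embeds a user-specific absolute path."""
    return bool(HARDCODED_PATH_RE.search(content))
```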
## Industry Context

Source: LangChain Agent Report 2026 (public report, pages 14-22)

Key findings:

- 29.5% of organizations deploy agents without systematic evaluation
- 18% cite "agent bugs" as their primary challenge
- Only 12% use automated quality checks (88% manual or none)
- 43% report difficulty maintaining agent quality over time
- Top issues: hallucinations (31%), poor error handling (28%), unclear triggers (22%)
Implications:

- **Automation gap:** Most teams rely on manual checklists (error-prone at scale)
- **Quality debt:** Agents deployed without validation accumulate technical debt
- **Maintenance burden:** 43% struggle with quality over time (no tracking system)

This skill addresses:

- **Automation:** Replaces manual checklists with quantitative scoring
- **Tracking:** JSON output enables trend analysis over time
- **Standards:** The 80% threshold provides a clear production gate
## Output Examples

### Quick Audit (Top-5 Criteria)

**Quick Audit: Agents/Skills/Commands**

Files: 15 (5 agents, 8 skills, 2 commands)
Critical issues: 3 files fail top-5 criteria

Top-5 criteria (pass/fail):

| File | Valid Name | Has Triggers | Error Handling | No Hardcoded Paths | Examples |
|---|---|---|---|---|---|
| agent1.md | ✅ | ✅ | ❌ | ✅ | ❌ |
| skill2/ | ✅ | ❌ | ✅ | ❌ | ✅ |

Action required:

- Add error handling: 5 files
- Remove hardcoded paths: 3 files
- Add usage examples: 4 files
### Full Audit

See Phase 4: Report Generation above for the full structure.
### Comparative (Full + Benchmarks)

**Comparative Audit: Project vs Templates**

| File | Project Score | Template Score | Gap | Top Missing |
|---|---|---|---|---|
| debugging-specialist.md | 78% (C) | 94% (A) | -16 pts | Anti-hallucination, edge cases |
| testing-expert/ | 85% (B) | 91% (A) | -6 pts | Integration docs |

**Recommendations** (focus on these gaps to reach template quality):

- **Anti-hallucination measures** (8 files): Add source verification sections
- **Edge case documentation** (5 files): Add failure scenario examples
- **Integration documentation** (4 files): List compatible agents/skills
## Usage

### Basic (Full Audit)

```
# In Claude Code
Use skill: audit-agents-skills

# Specify a path
Use skill: audit-agents-skills for ~/projects/my-app
```

### With Options

```
# Quick audit (fast)
Use skill: audit-agents-skills with mode=quick

# Comparative (benchmark analysis)
Use skill: audit-agents-skills with mode=comparative

# Generate fixes
Use skill: audit-agents-skills with fixes=true

# Custom output path
Use skill: audit-agents-skills with output=~/Desktop/audit.json
```

### JSON Output Only

```
# For programmatic integration
Use skill: audit-agents-skills with format=json output=audit.json
```
## Integration with CI/CD

### Pre-commit Hook

```bash
#!/bin/bash
# .git/hooks/pre-commit
# Run a quick audit on changed agent/skill/command files

changed_files=$(git diff --cached --name-only | grep -E "^\.claude/(agents|skills|commands)/")

if [ -n "$changed_files" ]; then
    echo "Running quick audit on changed files..."
    # Run audit (requires a Claude Code CLI wrapper)
    # Exit with 1 if any file scores <80%
fi
```
GitHub Actions
name: Audit Agents/Skills on: [pull_request] jobs: audit: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Run quality audit run: | # Run audit skill # Parse JSON output # Fail if overall_score < 80
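The "parse JSON output" step can be a small gate script run after the audit. A sketch, assuming the report is written to `audit-report.json` with the structure shown in Phase 4:

```python
import json
import sys

# Fail the CI job if the audit's overall score is below the 80% gate
with open("audit-report.json") as f:
    report = json.load(f)

score = report["summary"]["overall_score"]
if score < 80:
    print(f"Audit gate FAILED: overall score {score} < 80 (Grade B threshold)")
    sys.exit(1)
print(f"Audit gate passed: overall score {score}")
```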
## Comparison: Command vs Skill

| Aspect | Command (`/audit-agents-skills`) | Skill (this file) |
|---|---|---|
| Scope | Current project only | Multi-project, comparative |
| Output | Markdown report | Markdown + JSON |
| Speed | Fast (5-10 min) | Slower (10-20 min with comparative) |
| Depth | Standard 16 criteria | Same + benchmark analysis |
| Fix suggestions | Via `--fix` flag | Built-in with recommendations |
| Programmatic | Terminal output | JSON for CI/CD integration |
| Best for | Quick checks, dev workflow | Deep audits, quality tracking |

**Recommendation:** Use the command for daily checks, the skill for release gates and quality tracking.
## Maintenance

### Updating Criteria

Edit `scoring/criteria.yaml`:

```yaml
agents:
  categories:
    identity:
      criteria:
        - id: A1.5  # New criterion
          name: "API versioning specified"
          points: 3
          detection: "mentions API version or compatibility"
```

**Version bump:** Increment the version in the frontmatter whenever criteria change.
### Adding File Types

To support new file types (e.g., "workflows"):

1. Add to `scoring/criteria.yaml`:

   ```yaml
   workflows:
     max_points: 24
     categories: [...]
   ```

2. Update the detection logic (file path patterns)
3. Update the report templates
## Related

- Command version: `.claude/commands/audit-agents-skills.md`
- Agent Validation Checklist: guide line 4921 (manual 16 criteria)
- Skill Validation: guide line 5491 (spec documentation)
- Reference templates: `examples/agents/`, `examples/skills/`, `examples/commands/`
## Changelog

**v1.0.0 (2026-02-07):**

- Initial release
- 16-criteria framework (agents/skills/commands)
- 3 audit modes (quick/full/comparative)
- JSON + Markdown output
- Fix suggestions
- Industry context (LangChain 2026 report)

Skill ready for use: `audit-agents-skills`