Audit Agents/Skills/Commands (Advanced Skill)

Comprehensive quality audit system for Claude Code agents, skills, and commands. Provides quantitative scoring, comparative analysis, and production readiness grading based on industry best practices.

Purpose

Problem: Manual validation of agents/skills is error-prone and inconsistent. According to the LangChain Agent Report 2026, 29.5% of organizations deploy agents without systematic evaluation, leading to "agent bugs" as the top challenge (18% of teams).

Solution: Automated quality scoring across 16 weighted criteria with production readiness thresholds (80% = Grade B minimum for production deployment).

Key Features:

Quantitative scoring (32 points for agents/skills, 20 for commands)
Weighted criteria (Identity 3x, Prompt 2x, Validation 1x, Design 2x)
Production readiness grading (A-F scale with 80% threshold)
Comparative analysis vs reference templates
JSON/Markdown dual output for programmatic integration
Fix suggestions for failing criteria

Modes

| Mode | Usage | Output |
| --- | --- | --- |
| Quick Audit | Top-5 critical criteria only | Fast pass/fail (3-5 min for 20 files) |
| Full Audit | All 16 criteria per file | Detailed scores + recommendations (10-15 min) |
| Comparative | Full + benchmark vs templates | Analysis + gap identification (15-20 min) |

Default: Full Audit (recommended for first run)

Methodology

Why These Criteria?

The 16-criteria framework is derived from:

Claude Code Best Practices (Ultimate Guide line 4921: Agent Validation Checklist)
Industry Data (LangChain Agent Report 2026: evaluation gaps)
Production Failures (Community feedback on hardcoded paths, missing error handling)
Composition Patterns (Skills should reference other skills, agents should be modular)

Scoring Philosophy

Weight Rationale:

Identity (3x): If users can't find/invoke the agent, quality is irrelevant (discoverability > quality)
Prompt (2x): Determines reliability and accuracy of outputs
Validation (1x): Improves robustness but is secondary to core functionality
Design (2x): Impacts long-term maintainability and scalability

Grade Standards:

A (90-100%): Production-ready, minimal risk
B (80-89%): Good, meets production threshold
C (70-79%): Needs improvement before production
D (60-69%): Significant gaps, not production-ready
F (<60%): Critical issues, requires major refactoring

Industry Alignment: The 80% threshold aligns with software engineering best practices for production deployment (e.g., code coverage >80%, security scan pass rates).
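The grade bands and the 80% gate can be expressed as a small lookup. This is a minimal sketch: the function names `grade_for` and `is_production_ready` are hypothetical, and it assumes fractional scores fall into the band whose lower bound they meet (so 89.9 is a B, not an A).

```python
def grade_for(score: float) -> str:
    """Map a percentage score (0-100) to a letter grade using the bands above."""
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 70:
        return "C"
    if score >= 60:
        return "D"
    return "F"


def is_production_ready(score: float, threshold: float = 80.0) -> bool:
    """A file passes the production gate at or above the threshold (Grade B minimum)."""
    return score >= threshold
```

Under these assumptions, a file scoring 78.1% grades as C and does not pass the gate, while a file at exactly 80% grades as B and passes.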

Workflow

Phase 1: Discovery

Scan directories:

```
.claude/agents/
.claude/skills/
.claude/commands/
examples/agents/     (if exists)
examples/skills/     (if exists)
examples/commands/   (if exists)
```

Classify files by type (agent/skill/command)

Load reference templates (for Comparative mode):

```
guide/examples/agents/     (benchmark files)
guide/examples/skills/     (benchmark files)
guide/examples/commands/   (benchmark files)
```
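Classification by type can be driven purely by the directory layout listed above. The sketch below is one possible approach, not the skill's actual logic: `classify` is a hypothetical helper, and it assumes a file's type is determined by the `agents/`, `skills/`, or `commands/` directory that directly follows a `.claude` or `examples` path component.

```python
from pathlib import PurePosixPath
from typing import Optional

# Directory name -> file type (assumed mapping, derived from the scan list above).
KNOWN_TYPES = {"agents": "agent", "skills": "skill", "commands": "command"}
ROOTS = (".claude", "examples")


def classify(path: str) -> Optional[str]:
    """Return 'agent', 'skill', or 'command' for a path, or None if it matches no scanned directory."""
    parts = PurePosixPath(path).parts
    for i in range(len(parts) - 1):
        if parts[i] in ROOTS and parts[i + 1] in KNOWN_TYPES:
            return KNOWN_TYPES[parts[i + 1]]
    return None
```

Because the check looks at path components rather than string prefixes, it also matches the benchmark files under `guide/examples/`.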

Phase 2: Scoring Engine

Load scoring criteria from scoring/criteria.yaml:

```yaml
agents:
  max_points: 32
  categories:
    identity:
      weight: 3
      criteria:
        - id: A1.1
          name: "Clear name"
          points: 3
          detection: "frontmatter.name exists and is descriptive"
        # ... (16 total criteria)
```

For each file:

Parse frontmatter (YAML)
Extract content sections
Run detection patterns (regex, keyword search)
Calculate score: (points / max_points) × 100
Assign grade (A-F)
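The score calculation in step 4 can be sketched as follows. `CriterionResult` and `score_file` are hypothetical names, and the sketch assumes each criterion's `points` value already includes its category weight, as in the `criteria.yaml` excerpt above.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class CriterionResult:
    criterion_id: str
    points: int   # points the criterion is worth (weight already applied)
    passed: bool  # outcome of the detection pattern


def score_file(results: Iterable[CriterionResult], max_points: int) -> Tuple[int, float]:
    """Return (points_obtained, percentage) using (points / max_points) x 100."""
    if max_points <= 0:
        raise ValueError("max_points must be positive")
    obtained = sum(r.points for r in results if r.passed)
    return obtained, round(obtained / max_points * 100, 1)
```

For an agent that earns 25 of 32 points, this yields 78.1%, which matches the `points_obtained`, `points_max`, and `score` fields shown in the JSON report example.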

Phase 3: Comparative Analysis (Comparative Mode Only)

For each project file:

Find closest matching template (by description similarity)
Compare scores per criterion
Identify gaps: template_score - project_score
Flag significant gaps (more than 10 percentage points between template and project scores)

Example:

Project file: .claude/agents/debugging-specialist.md (Score: 78%, Grade C)

Closest template: examples/agents/debugging-specialist.md (Score: 94%, Grade A)

Gaps:

Anti-hallucination measures: -2 points (template has, project missing)
Edge cases documented: -1 point (template has 5 examples, project has 1)
Integration documented: -1 point (template references 3 skills, project none)

Total gap: 16 percentage points (78% vs 94%), which explains the C vs A difference. The three gaps listed above account for 4 raw points of it.
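The gap computation can be sketched like this. Both function names are hypothetical, the per-criterion inputs are plain dicts of criterion name to points, and the 10-point significance threshold is interpreted here as percentage points on the overall score, which is an assumption.

```python
from typing import Dict


def criterion_gaps(project: Dict[str, int], template: Dict[str, int]) -> Dict[str, int]:
    """Return template_score - project_score for each criterion where the template scores higher."""
    gaps = {}
    for criterion, template_points in template.items():
        diff = template_points - project.get(criterion, 0)
        if diff > 0:
            gaps[criterion] = diff
    return gaps


def is_significant(project_pct: float, template_pct: float, threshold: float = 10.0) -> bool:
    """Flag a gap when the template leads by more than `threshold` percentage points (assumed unit)."""
    return (template_pct - project_pct) > threshold


# Values taken from the debugging-specialist example above.
project = {"anti_hallucination": 0, "edge_cases": 0, "integration": 0}
template = {"anti_hallucination": 2, "edge_cases": 1, "integration": 1}
gaps = criterion_gaps(project, template)
```

With the example data, `gaps` holds the three missing criteria (2, 1, and 1 points), and the 78% vs 94% pair is flagged as significant while an 85% vs 91% pair would not be.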

Phase 4: Report Generation

Markdown Report (audit-report.md):

Summary table (overall + by type)
Individual scores with top issues
Detailed breakdown per file (collapsible)
Prioritized recommendations

JSON Output (audit-report.json):

```json
{
  "metadata": {
    "project_path": "/path/to/project",
    "audit_date": "2026-02-07",
    "mode": "full",
    "version": "1.0.0"
  },
  "summary": {
    "overall_score": 82.5,
    "overall_grade": "B",
    "total_files": 15,
    "production_ready_count": 10,
    "production_ready_percentage": 66.7
  },
  "by_type": {
    "agents": { "count": 5, "avg_score": 85.2, "grade": "B" },
    "skills": { "count": 8, "avg_score": 78.9, "grade": "C" },
    "commands": { "count": 2, "avg_score": 92.0, "grade": "A" }
  },
  "files": [
    {
      "path": ".claude/agents/debugging-specialist.md",
      "type": "agent",
      "score": 78.1,
      "grade": "C",
      "points_obtained": 25,
      "points_max": 32,
      "failed_criteria": [
        {
          "id": "A2.4",
          "name": "Anti-hallucination measures",
          "points_lost": 2,
          "recommendation": "Add section on source verification"
        }
      ]
    }
  ],
  "top_issues": [
    {
      "issue": "Missing error handling",
      "affected_files": 8,
      "impact": "Runtime failures unhandled",
      "priority": "high"
    }
  ]
}
```

Phase 5: Fix Suggestions (Optional)

For each failing criterion, generate actionable fix:

File: .claude/agents/debugging-specialist.md

Issue: Missing anti-hallucination measures (2 points lost)

Fix: Add this section after "Methodology":

Source Verification

Always cite sources for technical claims
Use phrases: "According to [documentation]...", "Based on [tool output]..."
If uncertain, state: "I don't have verified information on..."
Never invent: statistics, version numbers, API signatures, stack traces

Detection: Grep for keywords: "verify", "cite", "source", "evidence"
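The keyword detection for this criterion could look like the sketch below. `find_keywords` is a hypothetical helper. It matches each keyword at the start of a word, which is a design choice beyond the plain grep described above: it lets "cite" match "cites" without letting "source" match "resource".

```python
import re
from typing import List, Sequence

ANTI_HALLUCINATION_KEYWORDS = ("verify", "cite", "source", "evidence")


def find_keywords(text: str, keywords: Sequence[str] = ANTI_HALLUCINATION_KEYWORDS) -> List[str]:
    """Return the keywords that appear in text at the start of a word (case-insensitive)."""
    found = []
    for keyword in keywords:
        if re.search(rf"\b{re.escape(keyword)}", text, re.IGNORECASE):
            found.append(keyword)
    return found
```

A file would pass the criterion when the returned list is non-empty. How many keywords should be required is left open here.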

Scoring Criteria

See scoring/criteria.yaml for complete definitions. Summary:

Agents (32 points max)

| Category | Weight | Criteria Count | Max Points |
| --- | --- | --- | --- |
| Identity | 3x | 4 | 12 |
| Prompt Quality | 2x | 4 | 8 |
| Validation | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |

Key Criteria:

Clear name (3 pts): Not generic like "agent1"
Description with triggers (3 pts): Contains "when"/"use"
Role defined (2 pts): "You are..." statement
3+ examples (1 pt): Usage scenarios documented
Single responsibility (2 pts): Focused, not "general purpose"

Skills (32 points max)

| Category | Weight | Criteria Count | Max Points |
| --- | --- | --- | --- |
| Structure | 3x | 4 | 12 |
| Content | 2x | 4 | 8 |
| Technical | 1x | 4 | 4 |
| Design | 2x | 4 | 8 |

Key Criteria:

Valid SKILL.md (3 pts): Proper naming
Name valid (3 pts): Lowercase, 1-64 chars, no spaces
Methodology described (2 pts): Workflow section exists
No hardcoded paths (1 pt): No /Users/ or /home/ paths
Clear triggers (2 pts): "When to use" section
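Two of these skill checks are mechanical enough to sketch directly. The helper names below are hypothetical. The name check implements only the three stated rules (lowercase, 1-64 characters, no spaces), and the path check is a plain substring search for `/Users/` and `/home/`, so it may need tightening for real projects.

```python
import re
from typing import List, Tuple

HARDCODED_PATH_RE = re.compile(r"/(?:Users|home)/")


def is_valid_skill_name(name: str) -> bool:
    """Check the stated rules: lowercase, 1-64 characters, no whitespace."""
    return (
        1 <= len(name) <= 64
        and name == name.lower()
        and not any(ch.isspace() for ch in name)
    )


def find_hardcoded_paths(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, stripped_line) for every line containing /Users/ or /home/."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if HARDCODED_PATH_RE.search(line):
            hits.append((number, line.strip()))
    return hits
```

Returning line numbers alongside the matches makes it straightforward to turn a failed check into a fix suggestion that points at the offending line.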

Commands (20 points max)

| Category | Weight | Criteria Count | Max Points |
| --- | --- | --- | --- |
| Structure | 3x | 4 | 12 |
| Quality | 2x | 4 | 8 |

Key Criteria:

Valid frontmatter (3 pts): name + description
Argument hint (3 pts): If uses $ARGUMENTS
Step-by-step workflow (3 pts): Numbered sections
Error handling (2 pts): Mentions failure modes
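The conditional "Argument hint" criterion can be sketched as below. This is an illustration under assumptions: `check_argument_hint` is a hypothetical name, the frontmatter key is assumed to be `argument-hint`, and a command that never uses `$ARGUMENTS` is treated as passing because the criterion does not apply to it.

```python
from typing import Any, Dict


def check_argument_hint(frontmatter: Dict[str, Any], body: str) -> bool:
    """Pass if the command does not use $ARGUMENTS, or if it does and declares a hint."""
    if "$ARGUMENTS" not in body:
        return True  # criterion not applicable (assumed to count as a pass)
    # "argument-hint" is the assumed frontmatter key for the hint.
    return bool(frontmatter.get("argument-hint"))
```

Whether a non-applicable criterion should award its points or be excluded from the maximum is a scoring decision this sketch does not settle.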

Detection Patterns

Frontmatter Parsing

```python
import re

import yaml


def parse_frontmatter(content):
    match = re.search(r'^---\n(.*?)\n---', content, re.DOTALL)
    if match:
        return yaml.safe_load(match.group(1))
    return None
```

Keyword Detection

```python
def has_keywords(text, keywords):
    text_lower = text.lower()
    return any(kw in text_lower for kw in keywords)

# Example
has_trigger = has_keywords(description, ['when', 'use', 'trigger'])
has_error_handling = has_keywords(content, ['error', 'failure', 'fallback'])
```

Overlap Detection (Duplication Check)

```python
def jaccard_similarity(text1, text2):
    words1 = set(text1.lower().split())
    words2 = set(text2.lower().split())
    intersection = words1 & words2
    union = words1 | words2
    return len(intersection) / len(union) if union else 0

# Flag if similarity > 0.5 (50% keyword overlap)
if jaccard_similarity(desc1, desc2) > 0.5:
    issues.append("High overlap with another file")
```

Token Counting (Approximate)

```python
def estimate_tokens(text):
    # Rough estimate: 1 token ≈ 0.75 words, so tokens ≈ words × 1.3
    word_count = len(text.split())
    return int(word_count * 1.3)

# Check budget
tokens = estimate_tokens(file_content)
if tokens > 5000:
    issues.append("File too large (>5K tokens)")
```

Industry Context

Source: LangChain Agent Report 2026 (public report, pages 14-22)

Key Findings:

29.5% of organizations deploy agents without systematic evaluation
18% cite "agent bugs" as their primary challenge
Only 12% use automated quality checks (88% manual or none)
43% report difficulty maintaining agent quality over time
Top issues: Hallucinations (31%), poor error handling (28%), unclear triggers (22%)

Implications:

Automation gap: Most teams rely on manual checklists (error-prone at scale)
Quality debt: Agents deployed without validation accumulate technical debt
Maintenance burden: 43% struggle with quality over time (no tracking system)

This skill addresses:

Automation: Replaces manual checklists with quantitative scoring
Tracking: JSON output enables trend analysis over time
Standards: 80% threshold provides clear production gate
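Trend analysis over the JSON output could be as simple as ordering reports by `audit_date` and diffing `overall_score`. The sketch below assumes reports follow the JSON structure shown in Phase 4. `score_trend` is a hypothetical helper, and the earlier report in the example is invented sample data.

```python
from typing import Dict, List, Optional, Tuple


def score_trend(reports: List[Dict]) -> List[Tuple[str, float, Optional[float]]]:
    """Return (audit_date, overall_score, delta_vs_previous) tuples, oldest first."""
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    ordered = sorted(reports, key=lambda r: r["metadata"]["audit_date"])
    trend = []
    previous = None
    for report in ordered:
        score = report["summary"]["overall_score"]
        delta = None if previous is None else round(score - previous, 1)
        trend.append((report["metadata"]["audit_date"], score, delta))
        previous = score
    return trend


# Hypothetical sample: the second entry is invented for illustration.
reports = [
    {"metadata": {"audit_date": "2026-02-07"}, "summary": {"overall_score": 82.5}},
    {"metadata": {"audit_date": "2026-01-07"}, "summary": {"overall_score": 79.0}},
]
trend = score_trend(reports)
```

A negative delta between consecutive audits is the signal to look at which criteria regressed.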

Output Examples

Quick Audit (Top-5 Criteria)

Quick Audit: Agents/Skills/Commands

Files: 15 (5 agents, 8 skills, 2 commands)

Critical Issues: 3 files fail top-5 criteria

Top-5 Criteria (Pass/Fail)

| File | Valid Name | Has Triggers | Error Handling | No Hardcoded Paths | Examples |
| --- | --- | --- | --- | --- | --- |
| agent1.md | ✅ | ✅ | ❌ | ✅ | ❌ |
| skill2/ | ✅ | ❌ | ✅ | ❌ | ✅ |

Action Required

Add error handling: 5 files
Remove hardcoded paths: 3 files
Add usage examples: 4 files

Full Audit

See Phase 4: Report Generation above for full structure.

Comparative (Full + Benchmarks)

Comparative Audit

Project vs Templates

| File | Project Score | Template Score | Gap | Top Missing |
| --- | --- | --- | --- | --- |
| debugging-specialist.md | 78% (C) | 94% (A) | -16 pts | Anti-hallucination, edge cases |
| testing-expert/ | 85% (B) | 91% (A) | -6 pts | Integration docs |

Recommendations

Focus on these gaps to reach template quality:

Anti-hallucination measures (8 files): Add source verification sections
Edge case documentation (5 files): Add failure scenario examples
Integration documentation (4 files): List compatible agents/skills

Usage

Basic (Full Audit)

```
# In Claude Code
Use skill: audit-agents-skills

# Specify path
Use skill: audit-agents-skills for ~/projects/my-app
```

With Options

```
# Quick audit (fast)
Use skill: audit-agents-skills with mode=quick

# Comparative (benchmark analysis)
Use skill: audit-agents-skills with mode=comparative

# Generate fixes
Use skill: audit-agents-skills with fixes=true

# Custom output path
Use skill: audit-agents-skills with output=~/Desktop/audit.json
```

JSON Output Only

```
# For programmatic integration
Use skill: audit-agents-skills with format=json output=audit.json
```

Integration with CI/CD

Pre-commit Hook

```bash
#!/bin/bash
# .git/hooks/pre-commit
# Run quick audit on changed agent/skill/command files

# The dot is escaped so the pattern matches only paths under .claude/
changed_files=$(git diff --cached --name-only | grep -E '^\.claude/(agents|skills|commands)/')

if [ -n "$changed_files" ]; then
  echo "Running quick audit on changed files..."
  # Run audit (requires Claude Code CLI wrapper)
  # Exit with 1 if any file scores <80%
fi
```

GitHub Actions

```yaml
name: Audit Agents/Skills
on: [pull_request]
jobs:
  audit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run quality audit
        run: |
          # Run audit skill
          # Parse JSON output
          # Fail if overall_score < 80
```
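The "parse JSON output, fail if overall_score < 80" step could be handled by a small gate function such as the sketch below. `audit_gate` is a hypothetical name, and the input is assumed to be the parsed `audit-report.json` structure from Phase 4.

```python
from typing import Dict, List, Tuple


def audit_gate(report: Dict, threshold: float = 80.0) -> Tuple[int, List[str]]:
    """Return (exit_code, failing_paths): exit code 1 when overall_score is below the threshold."""
    overall = report["summary"]["overall_score"]
    # Per-file failures are reported for visibility but do not change the exit code here.
    failing = [f["path"] for f in report.get("files", []) if f["score"] < threshold]
    exit_code = 0 if overall >= threshold else 1
    return exit_code, failing


# Trimmed-down version of the report example from Phase 4.
report = {
    "summary": {"overall_score": 82.5},
    "files": [{"path": ".claude/agents/debugging-specialist.md", "score": 78.1}],
}
exit_code, failing = audit_gate(report)
```

In a workflow step, the report would be loaded with `json.load` and the exit code passed to `sys.exit`. Note the design difference from the pre-commit hook: this gate follows the overall score, so it can pass while individual files remain below 80%.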

Comparison: Command vs Skill

| Aspect | Command (/audit-agents-skills) | Skill (this file) |
| --- | --- | --- |
| Scope | Current project only | Multi-project, comparative |
| Output | Markdown report | Markdown + JSON |
| Speed | Fast (5-10 min) | Slower (10-20 min with comparative) |
| Depth | Standard 16 criteria | Same + benchmark analysis |
| Fix suggestions | Via --fix flag | Built-in with recommendations |
| Programmatic | Terminal output | JSON for CI/CD integration |
| Best for | Quick checks, dev workflow | Deep audits, quality tracking |

Recommendation: Use command for daily checks, skill for release gates and quality tracking.

Maintenance

Updating Criteria

Edit scoring/criteria.yaml:

```yaml
agents:
  categories:
    identity:
      criteria:
        - id: A1.5  # New criterion
          name: "API versioning specified"
          points: 3
          detection: "mentions API version or compatibility"
```

Version bump: Increment version in frontmatter when criteria change.
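Adding a criterion changes the point total, so it may be worth checking that the declared `max_points` still equals the sum of all criterion points. The sketch below works on an already-parsed criteria dict. The helper names and the generated criterion IDs are hypothetical.

```python
from typing import Dict, Tuple


def computed_max_points(type_config: Dict) -> int:
    """Sum the points of every criterion across all categories."""
    return sum(
        criterion["points"]
        for category in type_config["categories"].values()
        for criterion in category["criteria"]
    )


def check_max_points(type_config: Dict) -> Tuple[bool, int, int]:
    """Return (is_consistent, computed_total, declared_max_points)."""
    total = computed_max_points(type_config)
    declared = type_config["max_points"]
    return total == declared, total, declared


def make_category(prefix: str, points: int, count: int = 4) -> Dict:
    """Build a category with `count` criteria worth `points` each (illustrative IDs)."""
    return {"criteria": [{"id": f"{prefix}.{i}", "points": points} for i in range(1, count + 1)]}


agents = {
    "max_points": 32,
    "categories": {
        "identity": make_category("A1", 3),
        "prompt": make_category("A2", 2),
        "validation": make_category("A3", 1),
        "design": make_category("A4", 2),
    },
}
before = check_max_points(agents)

# Adding the 3-point A1.5 criterion without updating max_points is detected.
agents["categories"]["identity"]["criteria"].append({"id": "A1.5", "points": 3})
after = check_max_points(agents)
```

With the four-by-four layout from the Agents table the totals agree at 32. After appending the new criterion the computed total becomes 35, which signals that `max_points` needs updating along with the version bump.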

Adding File Types

To support new file types (e.g., "workflows"):

Add to scoring/criteria.yaml:

```yaml
workflows:
  max_points: 24
  categories: [...]
```
Update detection logic (file path patterns)
Update report templates

Related

Command version: .claude/commands/audit-agents-skills.md
Agent Validation Checklist: guide line 4921 (manual 16 criteria)
Skill Validation: guide line 5491 (spec documentation)
Reference templates: examples/agents/, examples/skills/, examples/commands/

Changelog

v1.0.0 (2026-02-07):

Initial release
16-criteria framework (agents/skills/commands)
3 audit modes (quick/full/comparative)
JSON + Markdown output
Fix suggestions
Industry context (LangChain 2026 report)

Skill ready for use: audit-agents-skills
