Evaluate

Comprehensive quality evaluation for any AI-generated artifact. Produces its report as a visualization.

How It Works

┌──────────────────────────────────────────────┐
│                                              │
│  Phase 1: SPEC GENERATION                    │
│  Analyze the artifact type                   │
│  Generate tailored evaluation criteria       │
│  Define scoring dimensions + weights         │
│  Set quality gates                           │
│         │                                    │
│         ▼                                    │
│  Phase 2: EVALUATION                         │
│  Run automated checks (when possible)        │
│  Visual/manual inspection                    │
│  Score each dimension with evidence          │
│  Identify systemic vs local issues           │
│         │                                    │
│         ▼                                    │
│  Phase 3: REPORT (via /visualize)            │
│  Generate a beautiful HTML eval report       │
│  Scores, charts, screenshots, fix list       │
│  Radar chart of dimensions                   │
│  Before/after tracking                       │
│                                              │
└──────────────────────────────────────────────┘

Phase 1: Spec Generation

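For any artifact, generate an evaluation spec in four steps: identify the artifact type, generate dimensions, set quality gates, and write the spec document.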

Identify Artifact Type

HTML Visualization → visual design, interactivity, technical, content, shareability
Code/Project → correctness, readability, architecture, test coverage, performance
Document/Report → clarity, structure, accuracy, completeness, tone
Conversation/Agent → helpfulness, accuracy, tone, efficiency, safety
Slide Deck → all visualization dimensions + narrative flow, persuasion, pacing
Dashboard → data accuracy, information density, scannability, actionability
Custom → derive dimensions from the skill's SKILL.md and stated goals
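The mapping above could be held as a simple lookup table. This is a minimal sketch, not part of the skill: the lowercase type keys (`visualization`, `slides`, and so on) and the helper name `startingDimensions` are assumptions chosen for illustration. Custom artifacts return `null` because their dimensions have to be derived by hand.

```javascript
// Hypothetical lookup: artifact type -> starting dimensions (keys are illustrative).
const VISUALIZATION_DIMS = [
  'visual design', 'interactivity', 'technical', 'content', 'shareability',
];

const DIMENSIONS_BY_TYPE = {
  visualization: VISUALIZATION_DIMS,
  code: ['correctness', 'readability', 'architecture', 'test coverage', 'performance'],
  document: ['clarity', 'structure', 'accuracy', 'completeness', 'tone'],
  conversation: ['helpfulness', 'accuracy', 'tone', 'efficiency', 'safety'],
  // A slide deck reuses every visualization dimension and adds three of its own.
  slides: [...VISUALIZATION_DIMS, 'narrative flow', 'persuasion', 'pacing'],
  dashboard: ['data accuracy', 'information density', 'scannability', 'actionability'],
};

// Returns a copy of the starting dimensions, or null for custom types,
// which have to be derived from the skill's SKILL.md and stated goals.
function startingDimensions(type) {
  const known = Object.prototype.hasOwnProperty.call(DIMENSIONS_BY_TYPE, type);
  return known ? [...DIMENSIONS_BY_TYPE[type]] : null;
}
```

These lists are only starting points; the next step still expands each type to 6-10 weighted dimensions.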

Generate Dimensions

For each artifact type, produce 6-10 evaluation dimensions. Each dimension needs:

Name — short, clear label
Description — what this dimension measures
Weight — percentage (all weights sum to 100%)
Scoring anchors — what does a 10, 8, 6, 4 look like?
Automated checks — any programmatic tests (if applicable)
Deductions — specific issues and their point costs
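The requirements in this list can be checked mechanically before a spec is used. The sketch below assumes each dimension is a plain object with camelCase fields (`name`, `description`, `weight`, `anchors`, `deductions`); those field names and the function `validateSpec` are hypothetical. Automated checks are treated as optional, since the list marks them "if applicable".

```javascript
// Required fields for each dimension, taken from the list above.
// The property names are an assumption made for this sketch.
const REQUIRED_FIELDS = ['name', 'description', 'weight', 'anchors', 'deductions'];

// Returns a list of human-readable problems; an empty list means the spec passes.
function validateSpec(dimensions) {
  const problems = [];
  if (dimensions.length < 6 || dimensions.length > 10) {
    problems.push(`expected 6-10 dimensions, got ${dimensions.length}`);
  }
  for (const dim of dimensions) {
    for (const field of REQUIRED_FIELDS) {
      if (dim[field] === undefined) {
        problems.push(`${dim.name ?? '(unnamed)'} is missing ${field}`);
      }
    }
  }
  // Weights are percentages and must add up to 100 (with a small float tolerance).
  const totalWeight = dimensions.reduce((sum, d) => sum + (d.weight ?? 0), 0);
  if (Math.abs(totalWeight - 100) > 1e-9) {
    problems.push(`weights sum to ${totalWeight}%, expected 100%`);
  }
  return problems;
}
```

Returning a list of problems rather than throwing on the first one lets a spec author fix everything in one pass.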

Set Quality Gates

Define gates based on the artifact's purpose:

| Gate | Criteria | Meaning |
|---|---|---|
| 🚀 EXCEPTIONAL | Overall ≥ 9.5, all ≥ 9 | Best-in-class. Share everywhere. |
| ✅ SHIP | Overall ≥ 9.0, all ≥ 8 | Production-ready. |
| ⚠️ ACCEPTABLE | Overall ≥ 8.0, all ≥ 7 | Usable but not impressive. |
| 🔧 NEEDS WORK | Overall ≥ 7.0 or any < 7 | Fix before releasing. |
| ❌ FAIL | Overall < 7.0 or any < 5 | Major rework. |
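The gate criteria can be read as a small decision function. The NEEDS WORK and FAIL rows overlap as written, so this sketch makes an assumption: gates are checked from the top down, and FAIL takes priority over NEEDS WORK. The function name `classifyGate` is hypothetical.

```javascript
// Maps an overall score and per-dimension scores to a quality gate.
// Assumption: gates are checked top-down and FAIL wins over NEEDS WORK.
function classifyGate(overall, dimensionScores) {
  const lowest = Math.min(...dimensionScores);
  if (overall >= 9.5 && lowest >= 9) return 'EXCEPTIONAL';
  if (overall >= 9.0 && lowest >= 8) return 'SHIP';
  if (overall >= 8.0 && lowest >= 7) return 'ACCEPTABLE';
  if (overall < 7.0 || lowest < 5) return 'FAIL';
  return 'NEEDS WORK';
}
```

Under this reading, a high overall score cannot rescue a weak dimension: an overall of 9.2 with one dimension at 6 lands in NEEDS WORK, not SHIP.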

Output Spec Document

Write the spec to eval-spec-[artifact-name].md for reference and reuse.

Phase 2: Evaluation

For HTML Visualizations

Open in browser at 3 viewports (1280×720, 768×1024, 375×667).

Automated audit (run in browser console):

    (function () {
      const audit = {};
      // Only inline <style> blocks are inspected; linked stylesheets are not.
      const style = [...document.querySelectorAll('style')]
        .map((s) => s.textContent)
        .join(' ');
      const html = document.documentElement.outerHTML;

      // Structure
      // outerHTML never includes the doctype, so check document.doctype directly.
      audit.hasDoctype = document.doctype !== null;
      audit.hasLangAttr = !!document.documentElement.lang;
      audit.hasCharset = !!document.querySelector('meta[charset]');
      audit.hasViewport = !!document.querySelector('meta[name="viewport"]');
      audit.hasTitle = document.title.length > 0;

      // Menu system
      audit.menuExists = !!document.querySelector('.viz-menu');
      audit.menuHasTheme = /cycleTheme|themeLabel/i.test(html);
      audit.menuHasDownload = /htmlToImage|html-to-image/i.test(html);
      audit.menuHasPrint = /window\.print/i.test(html);

      // Theme system
      audit.hasCSSVars = /--bg\s*:/.test(style);
      audit.hasDarkTheme = /(\.theme-dark|:root)[\s\S]*?--bg/.test(style);
      audit.hasLightTheme = /\.theme-light/.test(style);
      audit.themePersistedToStorage = /localStorage.*theme/i.test(html);

      // Typography
      audit.hasInterFont = /fonts\.googleapis.*Inter|font-family.*Inter/i.test(html);
      audit.hasFontFallback = /-apple-system|system-ui/.test(style);
      audit.bodyFontSize = parseFloat(getComputedStyle(document.body).fontSize);
      audit.bodyFontOK = audit.bodyFontSize >= 14;

      // Layout
      audit.usesFlexOrGrid = /display\s*:\s*(flex|grid)/.test(style);
      audit.hasMaxWidth = /max-width/.test(style);
      audit.hasResponsiveBreakpoints =
        /@media.*max-width|@media.*min-width|sm:|md:|lg:/.test(style);

      // Print & Accessibility
      audit.hasPrintStyles = /@media\s*print/.test(style);
      audit.hasPrintColorAdjust = /print-color-adjust/.test(style);
      audit.hasReducedMotion = /prefers-reduced-motion/.test(style);
      audit.hasAriaLabels = /aria-label/.test(html);
      audit.hasSemanticHTML = /<(header|main|nav|section|article|footer)/.test(html);

      // Animations
      audit.hasKeyframes = /@keyframes/.test(style);
      audit.hasTransitions = /transition\s*:/.test(style);

      // Performance
      audit.fileSizeKB = Math.round(new Blob([html]).size / 1024);
      audit.fileSizeOK = audit.fileSizeKB < 200;
      audit.noExternalImages =
        document.querySelectorAll('img[src^="http"]').length === 0;
      audit.htmlToImageLoaded = typeof htmlToImage !== 'undefined';

      // Summary: only boolean entries count as pass/fail checks.
      const bools = Object.entries(audit).filter(([, v]) => typeof v === 'boolean');
      const passed = bools.filter(([, v]) => v).length;
      audit._passed = passed;
      audit._total = bools.length;
      audit._percent = Math.round((passed / bools.length) * 100);
      audit._failures = bools.filter(([, v]) => !v).map(([k]) => k);

      console.table(audit);
      return audit;
    })();

Visual scoring — 8 dimensions for visualizations:

| Dimension | Weight | 10 = | 6 = |
|---|---|---|---|
| D1 First Impression | 15% | Apple keynote quality | Generic template feel |
| D2 Typography | 15% | Perfect hierarchy, Inter font, fluid sizing | All same size, no hierarchy |
| D3 Color & Contrast | 10% | Harmonious, WCAG AA, both themes beautiful | Clashing, low contrast |
| D4 Layout & Spacing | 15% | Consistent rhythm, responsive, generous space | Cramped, broken at mobile |
| D5 Content Quality | 15% | Clear message in 5 seconds, zero filler | Confusing, placeholder text |
| D6 Interactivity | 10% | Menu + theme + download + print all flawless | Missing features, broken |
| D7 Technical | 10% | Zero errors, semantic, accessible, print-ready | Console errors, broken layout |
| D8 Shareability | 10% | Would tweet this unprompted | Worse than Canva |
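The document does not spell out how the eight dimension scores combine into the overall score. A natural reading, assumed here, is a weighted average using the percentages in the table, rounded to one decimal place. The function name `overallScore` and the sample scores are illustrative only.

```javascript
// Weights (percent) copied from the table above; keys D1-D8 follow its row labels.
const WEIGHTS = { D1: 15, D2: 15, D3: 10, D4: 15, D5: 15, D6: 10, D7: 10, D8: 10 };

// Assumption: the overall score is the weight-averaged dimension score,
// rounded to one decimal place.
function overallScore(scores, weights = WEIGHTS) {
  let total = 0;
  let weightSum = 0;
  for (const [id, weight] of Object.entries(weights)) {
    if (typeof scores[id] !== 'number') {
      throw new Error(`missing score for ${id}`);
    }
    total += scores[id] * weight;
    weightSum += weight;
  }
  return Math.round((total * 10) / weightSum) / 10;
}

// Illustrative scores only, not results from a real evaluation.
const sample = { D1: 9, D2: 8, D3: 9, D4: 10, D5: 9, D6: 8, D7: 9, D8: 7 };
console.log(overallScore(sample)); // 8.7
```

Dividing by the summed weights instead of a hard-coded 100 keeps the result correct if a custom spec passes weights in a different unit.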

For Code/Projects

Dimensions: Correctness, Readability, Architecture, Error Handling, Performance, Testing, Documentation, Security

For Documents

Dimensions: Clarity, Structure, Accuracy, Completeness, Tone, Formatting, Actionability, Brevity

For Agent Conversations

Dimensions: Helpfulness, Accuracy, Tone, Efficiency, Safety, Context Awareness, Tool Usage, Follow-through

Phase 3: Visual Report (via /visualize)

After scoring, generate the eval report as a beautiful HTML dashboard using the visualize skill:

Report Structure

Hero — artifact name, overall score (big number), quality gate badge
Radar Chart — all dimensions plotted on a radar/spider chart (Chart.js)
Dimension Cards — each dimension as a card with score, bar, key notes
Automated Audit — pass/fail checklist with percentages
Screenshots — key views embedded (if HTML artifact)
Fix List — prioritized fixes as a kanban-style layout (critical / high / medium / low)
Systemic Issues — patterns that affect all outputs (flagged for SKILL.md fixes)
History — if re-evaluating, show before/after score comparison chart
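For the radar chart, the dimension scores map directly onto a Chart.js configuration object. This sketch assumes Chart.js v3 or later, where the radial scale is configured under `options.scales.r`. The helper `buildRadarConfig` and the `{ name, score }` input shape are hypothetical, not part of the skill.

```javascript
// Builds a Chart.js radar configuration from dimension scores.
// Assumes Chart.js v3+ where the radial scale is configured under `scales.r`.
function buildRadarConfig(artifactName, dimensions) {
  return {
    type: 'radar',
    data: {
      labels: dimensions.map((d) => d.name),
      datasets: [
        {
          label: artifactName,
          data: dimensions.map((d) => d.score),
          fill: true,
        },
      ],
    },
    options: {
      // Pin the scale to 0-10 so charts stay comparable between evaluations.
      scales: { r: { min: 0, max: 10, ticks: { stepSize: 2 } } },
    },
  };
}

// In the report page this would be handed to Chart.js, for example:
//   new Chart(document.getElementById('radar'), buildRadarConfig(name, dims));
```

Fixing the scale at 0-10 matters for the History section: with an auto-scaled axis, a before/after pair of charts could look similar even when scores moved.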

Report Filename

eval-report-[artifact-name]-[date].html
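The filename pattern leaves the date format and the handling of spaces open. A minimal sketch, assuming an ISO calendar date (YYYY-MM-DD, in UTC) and a lowercase hyphenated slug for the artifact name; both conventions and the helper names are assumptions, not part of the skill.

```javascript
// Turns an artifact name into a filesystem-friendly slug (assumed convention).
function slugify(name) {
  return name
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Assumption: [date] is an ISO calendar date (YYYY-MM-DD) in UTC.
function reportFilename(artifactName, date = new Date()) {
  const day = date.toISOString().slice(0, 10);
  return `eval-report-${slugify(artifactName)}-${day}.html`;
}
```

An ISO date keeps reports for the same artifact in chronological order when sorted by name.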

The report itself must score ≥ 9.0 on the visualize eval criteria.

This is the ultimate dogfood test — our evaluation tool produces evaluations using our visualization tool.

The Improvement Loop

Generate artifact (any skill)
        ↓
/evaluate → Spec + Score + Visual Report
        ↓
Review report → identify fixes
        ↓
Fix (systemic → SKILL.md, local → artifact)
        ↓
/evaluate again → compare scores
        ↓
Ship when gate = SHIP or EXCEPTIONAL

Max 3 loops per artifact. If it can't reach SHIP in 3 loops, the problem is in the skill — update the skill's instructions, not the artifact.
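The loop and its three-iteration cap can be sketched as a small driver. The callbacks `evaluate` and `applyFixes` are hypothetical stand-ins for the /evaluate run and the fix step, and the result shape `{ overall, gate, fixes }` is an assumption made for this example.

```javascript
const MAX_LOOPS = 3;
const PASSING_GATES = new Set(['SHIP', 'EXCEPTIONAL']);

// `evaluate` and `applyFixes` are hypothetical callbacks standing in for
// the /evaluate run and the fix step; neither is part of the skill.
function improvementLoop(artifact, evaluate, applyFixes) {
  const history = [];
  let current = artifact;
  for (let loop = 1; loop <= MAX_LOOPS; loop += 1) {
    const result = evaluate(current); // assumed shape: { overall, gate, fixes }
    history.push({ loop, overall: result.overall, gate: result.gate });
    if (PASSING_GATES.has(result.gate)) {
      return { status: 'ship', artifact: current, history };
    }
    if (loop < MAX_LOOPS) {
      current = applyFixes(current, result.fixes);
    }
  }
  // Three loops without reaching SHIP: treat it as a skill-level problem.
  return { status: 'fix-the-skill', artifact: current, history };
}
```

The `history` array records the score from each pass, which is the data a before/after comparison would need.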

Quick Start

Evaluate a visualization

/evaluate path/to/visualization.html

Evaluate with an explicit artifact type

/evaluate path/to/code-project --type code

Re-evaluate after fixes (tracks improvement)

/evaluate path/to/visualization.html --loop 2

Generate specs only (no scoring)

/evaluate --specs-only --type dashboard
