Checklist Purpose: "Unit Tests for English"
CRITICAL CONCEPT: Checklists are UNIT TESTS FOR REQUIREMENTS WRITING - they validate the quality, clarity, and completeness of requirements in a given domain.
NOT for verification/testing:
- ❌ NOT "Verify the button clicks correctly"
- ❌ NOT "Test error handling works"
- ❌ NOT "Confirm the API returns 200"
- ❌ NOT checking if code/implementation matches the spec
FOR requirements quality validation:
- ✅ "Are visual hierarchy requirements defined for all card types?" (completeness)
- ✅ "Is 'prominent display' quantified with specific sizing/positioning?" (clarity)
- ✅ "Are hover state requirements consistent across all interactive elements?" (consistency)
- ✅ "Are accessibility requirements defined for keyboard navigation?" (coverage)
- ✅ "Does the spec define what happens when the logo image fails to load?" (edge cases)
Metaphor: If your spec is code written in English, the checklist is its unit test suite. You're testing whether the requirements are well-written, complete, unambiguous, and ready for implementation - NOT whether the implementation works.
User Input
$ARGUMENTS
You MUST consider the user input before proceeding (if not empty).
Execution Steps
Setup: Run spec-kitty agent feature check-prerequisites --json from repo root and parse JSON for feature_dir and available_docs list.
- All file paths must be absolute.
- Treat target_branch / base_branch in the JSON payload as canonical branch context for this run.
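A minimal sketch of this setup step, assuming the command prints a single JSON object to stdout (any payload details beyond the fields named above are assumptions):

```python
# Illustrative sketch only; run from the repo root.
import json
import subprocess
from pathlib import Path

result = subprocess.run(
    ["spec-kitty", "agent", "feature", "check-prerequisites", "--json"],
    capture_output=True, text=True, check=True,
)
payload = json.loads(result.stdout)

feature_dir = Path(payload["feature_dir"]).resolve()   # keep paths absolute
available_docs = payload.get("available_docs", [])
target_branch = payload.get("target_branch")            # canonical branch context
base_branch = payload.get("base_branch")
```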
Clarify intent (dynamic): Derive up to THREE initial contextual clarifying questions (no pre-baked catalog). They MUST:
- Be generated from the user's phrasing + extracted signals from spec/plan/tasks
- Only ask about information that materially changes checklist content
- Be skipped individually if already unambiguous in $ARGUMENTS
- Prefer precision over breadth
Generation algorithm:
- Extract signals: feature domain keywords (e.g., auth, latency, UX, API), risk indicators ("critical", "must", "compliance"), stakeholder hints ("QA", "review", "security team"), and explicit deliverables ("a11y", "rollback", "contracts").
- Cluster signals into candidate focus areas (max 4) ranked by relevance.
- Identify probable audience & timing (author, reviewer, QA, release) if not explicit.
- Detect missing dimensions: scope breadth, depth/rigor, risk emphasis, exclusion boundaries, measurable acceptance criteria.
- Formulate questions chosen from these archetypes:
  - Scope refinement (e.g., "Should this include integration touchpoints with X and Y or stay limited to local module correctness?")
  - Risk prioritization (e.g., "Which of these potential risk areas should receive mandatory gating checks?")
  - Depth calibration (e.g., "Is this a lightweight pre-commit sanity list or a formal release gate?")
  - Audience framing (e.g., "Will this be used by the author only or by peers during PR review?")
  - Boundary exclusion (e.g., "Should we explicitly exclude performance tuning items this round?")
  - Scenario class gap (e.g., "No recovery flows detected; are rollback / partial failure paths in scope?")
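As a rough illustration of the signal-extraction and clustering step (the keyword lists and scoring below are invented for the example, not prescribed by this command):

```python
# Hypothetical sketch: score candidate focus areas by keyword hits and keep the top few.
from collections import Counter

FOCUS_KEYWORDS = {
    "security": ["auth", "token", "compliance", "encryption"],
    "ux": ["hover", "layout", "a11y", "accessibility", "visual"],
    "api": ["endpoint", "contract", "rate limit", "versioning"],
    "resilience": ["rollback", "retry", "recovery", "failover"],
}

def rank_focus_areas(text: str, max_areas: int = 4) -> list[str]:
    """Return up to max_areas focus areas, ranked by keyword relevance."""
    lowered = text.lower()
    scores = Counter()
    for area, keywords in FOCUS_KEYWORDS.items():
        scores[area] = sum(lowered.count(kw) for kw in keywords)
    return [area for area, hits in scores.most_common(max_areas) if hits > 0]

# Example: rank_focus_areas(user_arguments + spec_excerpt)
```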
Question formatting rules:
- If presenting options, generate a compact table with columns: Option | Candidate | Why It Matters
- Limit to A–E options maximum; omit table if a free-form answer is clearer
- Never ask the user to restate what they already said
- Avoid speculative categories (no hallucination). If uncertain, ask explicitly: "Confirm whether X belongs in scope."
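For instance, a risk-prioritization question might present options like this (the rows are illustrative placeholders, not mandated categories):

| Option | Candidate | Why It Matters |
|--------|-----------|----------------|
| A | Authentication & session handling | Compliance-critical; gaps block release |
| B | Rollback / partial-failure paths | The plan involves state mutation |
| C | Performance under peak load | "Must feel fast" is currently unquantified |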
Defaults when interaction impossible:
- Depth: Standard
- Audience: Reviewer (PR) if code-related; Author otherwise
- Focus: Top 2 relevance clusters
Output the questions (label Q1/Q2/Q3). After answers: if ≥2 scenario classes (Alternate / Exception / Recovery / Non-Functional domain) remain unclear, you MAY ask up to TWO more targeted follow‑ups (Q4/Q5) with a one-line justification each (e.g., "Unresolved recovery path risk"). Do not exceed five total questions. Skip escalation if user explicitly declines more.
Understand user request: Combine $ARGUMENTS with the clarifying answers to:
- Derive the checklist theme (e.g., security, review, deploy, ux)
- Consolidate explicit must-have items mentioned by the user
- Map focus selections to category scaffolding
- Infer any missing context from spec/plan/tasks (do NOT hallucinate)
Load feature context: Read from feature_dir:
- spec.md: Feature requirements and scope
- plan.md (if exists): Technical details, dependencies
- tasks.md (if exists): Implementation tasks
Context Loading Strategy:
- Load only necessary portions relevant to active focus areas (avoid full-file dumping)
- Prefer summarizing long sections into concise scenario/requirement bullets
- Use progressive disclosure: add follow-on retrieval only if gaps detected
- If source docs are large, generate interim summary items instead of embedding raw text
Generate checklist - Create "Unit Tests for Requirements":
- Create feature_dir/checklists/ directory if it doesn't exist
- Generate a unique checklist filename:
  - Use a short, descriptive name based on the domain (e.g., ux.md, api.md, security.md)
  - Format: [domain].md
  - If the file already exists, append to it instead of overwriting
- Number items sequentially starting from CHK001
- Each /spec-kitty.checklist run creates a NEW file for its domain (existing checklists are never overwritten)
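A minimal sketch of the naming and numbering logic, under the assumption that IDs continue from the highest existing CHK number when appending:

```python
# Illustrative sketch: resolve the checklist file path and pick the next CHK id.
import re
from pathlib import Path

def checklist_path(feature_dir: Path, domain: str) -> Path:
    checklists = feature_dir / "checklists"
    checklists.mkdir(parents=True, exist_ok=True)   # create the directory if missing
    return checklists / f"{domain}.md"               # e.g. ux.md, api.md, security.md

def next_chk_id(path: Path) -> int:
    """Start at CHK001 for a new file; continue after the highest existing id."""
    if not path.exists():
        return 1
    existing = re.findall(r"CHK(\d{3,})", path.read_text())
    return (max(map(int, existing)) + 1) if existing else 1

# New items are appended, so an existing checklist is never overwritten.
```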
CORE PRINCIPLE - Test the Requirements, Not the Implementation: Every checklist item MUST evaluate the REQUIREMENTS THEMSELVES for:
- Completeness: Are all necessary requirements present?
- Clarity: Are requirements unambiguous and specific?
- Consistency: Do requirements align with each other?
- Measurability: Can requirements be objectively verified?
- Coverage: Are all scenarios/edge cases addressed?
Category Structure - Group items by requirement quality dimensions:
- Requirement Completeness (Are all necessary requirements documented?)
- Requirement Clarity (Are requirements specific and unambiguous?)
- Requirement Consistency (Do requirements align without conflicts?)
- Acceptance Criteria Quality (Are success criteria measurable?)
- Scenario Coverage (Are all flows/cases addressed?)
- Edge Case Coverage (Are boundary conditions defined?)
- Non-Functional Requirements (Performance, Security, Accessibility, etc. - are they specified?)
- Dependencies & Assumptions (Are they documented and validated?)
- Ambiguities & Conflicts (What needs clarification?)
HOW TO WRITE CHECKLIST ITEMS - "Unit Tests for English":
❌ WRONG (Testing implementation):
- "Verify landing page displays 3 episode cards"
- "Test hover states work on desktop"
- "Confirm logo click navigates home"
✅ CORRECT (Testing requirements quality):
- "Are the exact number and layout of featured episodes specified?" [Completeness]
- "Is 'prominent display' quantified with specific sizing/positioning?" [Clarity]
- "Are hover state requirements consistent across all interactive elements?" [Consistency]
- "Are keyboard navigation requirements defined for all interactive UI?" [Coverage]
- "Is the fallback behavior specified when the logo image fails to load?" [Edge Cases]
- "Are loading states defined for asynchronous episode data?" [Completeness]
- "Does the spec define visual hierarchy for competing UI elements?" [Clarity]
ITEM STRUCTURE: Each item should follow this pattern:
- Question format asking about requirement quality
- Focus on what's WRITTEN (or not written) in the spec/plan
- Include the quality dimension in brackets [Completeness/Clarity/Consistency/etc.]
- Reference the spec section [Spec §X.Y] when checking existing requirements
- Use the [Gap] marker when checking for missing requirements
EXAMPLES BY QUALITY DIMENSION:
Completeness:
- "Are error handling requirements defined for all API failure modes? [Gap]"
- "Are accessibility requirements specified for all interactive elements? [Completeness]"
- "Are mobile breakpoint requirements defined for responsive layouts? [Gap]"
Clarity:
- "Is 'fast loading' quantified with specific timing thresholds? [Clarity, Spec §NFR-2]"
- "Are 'related episodes' selection criteria explicitly defined? [Clarity, Spec §FR-5]"
- "Is 'prominent' defined with measurable visual properties? [Ambiguity, Spec §FR-4]"
Consistency:
- "Do navigation requirements align across all pages? [Consistency, Spec §FR-10]"
- "Are card component requirements consistent between landing and detail pages? [Consistency]"
Coverage:
- "Are requirements defined for zero-state scenarios (no episodes)? [Coverage, Edge Case]"
- "Are concurrent user interaction scenarios addressed? [Coverage, Gap]"
- "Are requirements specified for partial data loading failures? [Coverage, Exception Flow]"
Measurability:
- "Are visual hierarchy requirements measurable/testable? [Acceptance Criteria, Spec §FR-1]"
- "Can 'balanced visual weight' be objectively verified? [Measurability, Spec §FR-2]"
Scenario Classification & Coverage (Requirements Quality Focus):
- Check whether requirements exist for: Primary, Alternate, Exception/Error, Recovery, and Non-Functional scenarios
- For each scenario class, ask: "Are [scenario type] requirements complete, clear, and consistent?"
- If a scenario class is missing: "Are [scenario type] requirements intentionally excluded or missing? [Gap]"
- Include resilience/rollback when state mutation occurs: "Are rollback requirements defined for migration failures? [Gap]"
Traceability Requirements:
- MINIMUM: ≥80% of items MUST include at least one traceability reference
- Each item should reference a spec section [Spec §X.Y] or use one of the markers: [Gap], [Ambiguity], [Conflict], [Assumption]
- If no ID system exists: "Is a requirement & acceptance criteria ID scheme established? [Traceability]"
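As an illustration, the ≥80% threshold can be self-checked mechanically before reporting (a sketch only; it assumes the item format described under Structure Reference below):

```python
# Sketch: fraction of checklist items that carry a traceability reference.
import re

TRACE = re.compile(r"\[[^\]]*(?:Spec §|Gap|Ambiguity|Conflict|Assumption)[^\]]*\]")

def traceability_ratio(markdown: str) -> float:
    items = [line for line in markdown.splitlines()
             if line.lstrip().startswith("- [ ] CHK")]
    if not items:
        return 0.0
    return sum(1 for line in items if TRACE.search(line)) / len(items)

# assert traceability_ratio(checklist_text) >= 0.8
```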
Surface & Resolve Issues (Requirements Quality Problems): Ask questions about the requirements themselves:
- Ambiguities: "Is the term 'fast' quantified with specific metrics? [Ambiguity, Spec §NFR-1]"
- Conflicts: "Do navigation requirements conflict between §FR-10 and §FR-10a? [Conflict]"
- Assumptions: "Is the assumption of an 'always available podcast API' validated? [Assumption]"
- Dependencies: "Are external podcast API requirements documented? [Dependency, Gap]"
- Missing definitions: "Is 'visual hierarchy' defined with measurable criteria? [Gap]"
Content Consolidation:
- Soft cap: If raw candidate items > 40, prioritize by risk/impact
- Merge near-duplicates checking the same requirement aspect
- If >5 low-impact edge cases, create one item: "Are edge cases X, Y, Z addressed in requirements? [Coverage]"
🚫 ABSOLUTELY PROHIBITED - These make it an implementation test, not a requirements test:
- ❌ Any item starting with "Verify", "Test", "Confirm", or "Check" + implementation behavior
- ❌ References to code execution, user actions, or system behavior
- ❌ "Displays correctly", "works properly", "functions as expected"
- ❌ "Click", "navigate", "render", "load", "execute"
- ❌ Test cases, test plans, QA procedures
- ❌ Implementation details (frameworks, APIs, algorithms)
✅ REQUIRED PATTERNS - These test requirements quality:
- ✅ "Are [requirement type] defined/specified/documented for [scenario]?"
- ✅ "Is [vague term] quantified/clarified with specific criteria?"
- ✅ "Are requirements consistent between [section A] and [section B]?"
- ✅ "Can [requirement] be objectively measured/verified?"
- ✅ "Are [edge cases/scenarios] addressed in requirements?"
- ✅ "Does the spec define [missing aspect]?"
Structure Reference: Generate the checklist following the canonical template in .kittify/templates/checklist-template.md for title, meta section, category headings, and ID formatting. If template is unavailable, use: H1 title, purpose/created meta lines, ## category sections containing - [ ] CHK### <requirement item> lines with globally incrementing IDs starting at CHK001.
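An illustrative fallback skeleton (the title, meta lines, section names, and sample items below are placeholders drawn from the examples in this document):

```markdown
# UX Requirements Quality Checklist

Purpose: Validate completeness and clarity of UX requirements before implementation
Created: [date]

## Requirement Completeness

- [ ] CHK001 Are the number and layout of featured episodes explicitly specified? [Completeness, Spec §FR-001]
- [ ] CHK002 Are loading state requirements defined for asynchronous episode data? [Gap]

## Requirement Clarity

- [ ] CHK003 Is "prominent display" quantified with specific sizing/positioning? [Clarity, Spec §FR-004]
```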
Report: Output full path to created checklist, item count, and remind user that each run creates a new file. Summarize:
- Focus areas selected
- Depth level
- Actor/timing
- Any explicit user-specified must-have items incorporated
Important: Each /spec-kitty.checklist invocation creates a checklist file with a short, descriptive name (appending if that file already exists). This allows:
- Multiple checklists of different types (e.g., ux.md, test.md, security.md)
- Simple, memorable filenames that indicate each checklist's purpose
- Easy identification and navigation in the checklists/ folder
To avoid clutter, use descriptive types and clean up obsolete checklists when done.
Example Checklist Types & Sample Items
UX Requirements Quality: ux.md
Sample items (testing the requirements, NOT the implementation):
- "Are visual hierarchy requirements defined with measurable criteria? [Clarity, Spec §FR-1]"
- "Is the number and positioning of UI elements explicitly specified? [Completeness, Spec §FR-1]"
- "Are interaction state requirements (hover, focus, active) consistently defined? [Consistency]"
- "Are accessibility requirements specified for all interactive elements? [Coverage, Gap]"
- "Is fallback behavior defined when images fail to load? [Edge Case, Gap]"
- "Can 'prominent display' be objectively measured? [Measurability, Spec §FR-4]"
API Requirements Quality: api.md
Sample items:
- "Are error response formats specified for all failure scenarios? [Completeness]"
- "Are rate limiting requirements quantified with specific thresholds? [Clarity]"
- "Are authentication requirements consistent across all endpoints? [Consistency]"
- "Are retry/timeout requirements defined for external dependencies? [Coverage, Gap]"
- "Is the versioning strategy documented in requirements? [Gap]"
Performance Requirements Quality: performance.md
Sample items:
- "Are performance requirements quantified with specific metrics? [Clarity]"
- "Are performance targets defined for all critical user journeys? [Coverage]"
- "Are performance requirements under different load conditions specified? [Completeness]"
- "Can performance requirements be objectively measured? [Measurability]"
- "Are degradation requirements defined for high-load scenarios? [Edge Case, Gap]"
Security Requirements Quality: security.md
Sample items:
- "Are authentication requirements specified for all protected resources? [Coverage]"
- "Are data protection requirements defined for sensitive information? [Completeness]"
- "Is the threat model documented and requirements aligned to it? [Traceability]"
- "Are security requirements consistent with compliance obligations? [Consistency]"
- "Are security failure/breach response requirements defined? [Gap, Exception Flow]"
Anti-Examples: What NOT To Do
❌ WRONG - These test implementation, not requirements:
- CHK001 - Verify landing page displays 3 episode cards [Spec §FR-001]
- CHK002 - Test hover states work correctly on desktop [Spec §FR-003]
- CHK003 - Confirm logo click navigates to home page [Spec §FR-010]
- CHK004 - Check that related episodes section shows 3-5 items [Spec §FR-005]
✅ CORRECT - These test requirements quality:
- CHK001 - Are the number and layout of featured episodes explicitly specified? [Completeness, Spec §FR-001]
- CHK002 - Are hover state requirements consistently defined for all interactive elements? [Consistency, Spec §FR-003]
- CHK003 - Are navigation requirements clear for all clickable brand elements? [Clarity, Spec §FR-010]
- CHK004 - Are the selection criteria for related episodes documented? [Gap, Spec §FR-005]
- CHK005 - Are loading state requirements defined for asynchronous episode data? [Gap]
- CHK006 - Can "visual hierarchy" requirements be objectively measured? [Measurability, Spec §FR-001]
Key Differences:
- Wrong: Tests if the system works correctly
- Correct: Tests if the requirements are written correctly
- Wrong: Verification of behavior
- Correct: Validation of requirement quality
- Wrong: "Does it do X?"
- Correct: "Is X clearly specified?"