systematic-debugging

Systematic Debugging

You are investigating a bug using a structured four-phase process. The goal is to find and fix the root cause, not just suppress the symptom.

When to Activate

Bug reports or unexpected behavior
Failing tests that shouldn't be failing
Production issues or error reports
"It works on my machine" type problems
Any situation where the cause isn't immediately obvious

Activation

After activation, print the banner (see _shared/observability.md):

Systematic Debugging activated
Trigger: [describe the symptom in your own words, e.g., "test failure in auth flow" or "unexpected 500 on endpoint"]
Produces: root cause analysis, regression test, fix

Phase 1: Reproduce

Narrate: Phase 1/4: Reproducing the problem...

Before anything else, reliably reproduce the problem.

Steps

Understand the report — What's expected? What's happening instead? When did it start?
Reproduce locally — Run the exact steps that trigger the bug
Document the reproduction:

Reproduction

Steps: [exact steps]
Expected: [what should happen]
Actual: [what happens instead]
Frequency: [always / intermittent / specific conditions]
Environment: [OS, Node version, browser, etc.]

If intermittent: Identify conditions that increase frequency (load, timing, specific data)
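For an intermittent bug, it helps to put a number on "Frequency" before and after changing a condition. The following is a minimal sketch, not a prescribed tool: `measure_failure_rate` and `flaky_step` are hypothetical names, and `flaky_step` only simulates a bug that fails on every third call.

```python
from typing import Callable


def measure_failure_rate(trigger: Callable[[], None], runs: int = 100) -> float:
    """Call `trigger` `runs` times and return the fraction of calls that raised."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    failures = 0
    for _ in range(runs):
        try:
            trigger()
        except Exception:
            failures += 1
    return failures / runs


# Hypothetical stand-in for the real reproduction steps: fails on every third call.
_calls = {"n": 0}


def flaky_step() -> None:
    _calls["n"] += 1
    if _calls["n"] % 3 == 0:
        raise RuntimeError("simulated intermittent failure")


rate = measure_failure_rate(flaky_step, runs=30)
print(f"failed {rate:.0%} of runs")  # failed 33% of runs
```

Re-running the measurement under different load, timing, or data shows which condition moves the rate.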

If You Can't Reproduce

Check environment differences (versions, config, data)
Add logging to narrow down when/where it occurs
Ask the reporter for more details
Do NOT proceed to fixing without reproduction — you'll likely fix the wrong thing

If reproduction still fails after reasonable effort, use error recovery (see _shared/observability.md): call AskUserQuestion with the options "Add more logging and retry", "Get more info from reporter", and "Proceed with best guess (risky)".

Narrate: Phase 1/4: Reproducing the problem... done

Phase 2: Isolate

Narrate: Phase 2/4: Isolating the bug...

Narrow down where the bug lives.

Binary Search Strategy

Identify the boundaries — Which systems/modules are involved in the failing flow?
Add assertions at boundaries — Verify inputs and outputs at each layer
Bisect — Is the problem in the frontend or backend? In the handler or the model? In the query or the transformation?
Narrow until you find the exact location — The specific function, line, or condition
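The boundary-assertion step above can be sketched as follows, assuming the failing flow can be modeled as an ordered list of stages. Every name here (`first_failing_stage`, `parse`, `normalize`, `total`) is hypothetical, and the bug in `normalize` is planted for illustration.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# (stage name, stage function, check that the stage's output must satisfy)
Stage = Tuple[str, Callable[[Any], Any], Callable[[Any], bool]]


def first_failing_stage(stages: Sequence[Stage], value: Any) -> Optional[str]:
    """Run the stages in order; return the first stage whose output fails its check."""
    for name, run, output_ok in stages:
        value = run(value)
        if not output_ok(value):
            return name
    return None


def parse(raw: str) -> List[int]:
    return [int(part) for part in raw.split(",")]


def normalize(values: List[int]) -> List[int]:
    # Planted bug: this should only sort, but set() also drops duplicates.
    return sorted(set(values))


def total(values: List[int]) -> int:
    return sum(values)


pipeline: List[Stage] = [
    ("parse", parse, lambda out: len(out) == 3),
    ("normalize", normalize, lambda out: len(out) == 3),
    ("total", total, lambda out: out == 7),
]

culprit = first_failing_stage(pipeline, "2,2,3")
print(culprit)  # normalize
```

Checking the output at each boundary points at the first layer that breaks an expectation, so the search continues inside that layer only.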

Isolation Techniques

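Git bisect: If the bug is a regression, find the commit that introduced it: git bisect start, git bisect bad, git bisect good [known-good-commit]. Then mark each commit Git checks out as good or bad until it names the first bad commit, and finish with git bisect reset.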
Minimal reproduction — Strip away unrelated code until you have the smallest case that fails
Logging — Add targeted logging (not scattershot console.log everywhere)
Assertion insertion — Add assert statements at key points to catch violations
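Bisection is not limited to commits. The same binary search applies to any ordered history (config versions, data batches, deploys), as long as everything after the first bad item is also bad. The sketch below illustrates the idea only; `first_bad` and the commit labels are hypothetical, and this is not how Git implements it.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def first_bad(history: Sequence[T], is_bad: Callable[[T], bool]) -> T:
    """Return the earliest bad item.

    Assumes history[0] is good, history[-1] is bad, and every item after
    the first bad one is also bad.
    """
    if len(history) < 2:
        raise ValueError("need at least one good and one bad item")
    lo, hi = 0, len(history) - 1  # invariant: history[lo] is good, history[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(history[mid]):
            hi = mid
        else:
            lo = mid
    return history[hi]


# Hypothetical history in which the regression landed in "d4".
commits = ["a1", "b2", "c3", "d4", "e5", "f6"]
culprit = first_bad(commits, lambda commit: commits.index(commit) >= 3)
print(culprit)  # d4
```

Each check halves the remaining range, so even a long history needs only a handful of test runs.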

What to Record

Isolation

Location: [file:line or module]
Narrowed from: [original scope] → [isolated location]
Method: [how you narrowed it down]

Narrate: Phase 2/4: Isolating the bug... done

Phase 3: Analyze Root Cause

Narrate: Phase 3/4: Analyzing root cause...

Understand why the bug exists, not just where.

Root Cause Categories

Logic error — Code does the wrong thing (off-by-one, wrong operator, missing case)
State corruption — Data gets into an invalid state (race condition, missing validation, stale cache)
Integration mismatch — Two components disagree on interface, format, or protocol
Environment issue — Config, version, or platform difference
Missing handling — Edge case, error path, or boundary not covered

Analysis Questions

Why does this code exist? What was the original intent?
What changed recently? (Check git log for the area)
Are there similar patterns elsewhere that work correctly? What's different?
Is this the only place this bug manifests, or is it a systemic issue?

Log the root cause decision (see the Decision Log format in _shared/observability.md):

Decision: Root cause is [category] - [specific cause]
Reason: [evidence from isolation phase]
Alternatives: [other hypotheses that were ruled out]

Defense in Depth

After identifying the root cause, ask:

Why wasn't this caught by tests?
Why wasn't this caught by types?
Why wasn't this caught by code review?
What other similar bugs might exist?

Root Cause

Cause: [precise description]
Category: [logic / state / integration / environment / missing handling]
Why it wasn't caught: [gap in testing, types, or review]
Related risks: [other places with similar patterns]

Narrate: Phase 3/4: Analyzing root cause... done

Phase 4: Fix

Narrate: Phase 4/4: Applying fix...

Fix the root cause and prevent regression.

Fix Process

Write a failing test that reproduces the bug (TDD red phase)
Verify the test fails — It should fail for the same reason as the bug
Apply the minimal fix — Change as little as possible
Verify the test passes — The fix resolves the bug
Run the full test suite — The fix doesn't break anything else
Check for related issues — Fix similar patterns elsewhere if applicable
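As an illustration of the red-to-green steps above, here is a minimal sketch built around a hypothetical off-by-one bug. `page_count` and the test name are invented for the example; in a real project the test would live in the existing test suite rather than being called inline.

```python
def page_count(total_items: int, page_size: int) -> int:
    """Return the number of pages needed to show `total_items`."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Hypothetical original bug: `total_items // page_size` dropped the
    # last partial page. The minimal fix is ceiling division.
    return (total_items + page_size - 1) // page_size


def test_partial_last_page_is_counted() -> None:
    # Regression test named after the behavior the bug broke.
    assert page_count(10, 5) == 2
    assert page_count(11, 5) == 3  # returned 2 before the fix
    assert page_count(0, 5) == 0


test_partial_last_page_is_counted()
print("regression test passed")
```

The test is written against the buggy version first, so it is known to fail for the right reason, and the fix touches only the one expression that was wrong.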

Log defense-in-depth decisions:

Decision: [which defense-in-depth measures to add]
Reason: [what class of bug this prevents]
Alternatives: [other measures considered]

Defense-in-Depth Fixes

Beyond the immediate fix, consider:

Add type constraints that would prevent this class of bug
Add validation at the boundary where bad data enters
Add assertions that would catch this sooner
Improve error messages so the next person gets a clearer signal
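Boundary validation and clearer error messages often go together. The sketch below assumes a hypothetical `PageRequest` input type, and the limit of 100 is an arbitrary value chosen for illustration. Bad data is rejected where it enters, with a message that names the field and the offending value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageRequest:
    """Hypothetical request object validated once, at the boundary."""

    page: int
    page_size: int

    def __post_init__(self) -> None:
        if self.page < 1:
            raise ValueError(f"page must be >= 1, got {self.page}")
        # The upper bound of 100 is an assumption made for this example.
        if not 1 <= self.page_size <= 100:
            raise ValueError(
                f"page_size must be between 1 and 100, got {self.page_size}"
            )


try:
    PageRequest(page=0, page_size=20)
except ValueError as err:
    message = str(err)
    print(message)  # page must be >= 1, got 0
```

Code past the boundary can then rely on a `PageRequest` being valid instead of re-checking it in every function.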

Condition-Based Waiting

If the bug involves timing or async behavior:

Never use arbitrary delays (sleep 5, setTimeout(fn, 5000))
Wait for conditions: poll for the expected state, with a timeout
Example: Instead of sleep 2 && check_result, use i=0; while ! check_result && [ "$i" -lt 100 ]; do sleep 0.1; i=$((i+1)); done (this polls every 0.1 s and gives up after 100 attempts, about 10 s)
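The same polling pattern can be wrapped in a reusable helper. This is a minimal sketch: `wait_for` is a hypothetical name, the default timeout and interval are arbitrary, and the `ready_after_three_polls` condition only simulates state that becomes ready on the third check.

```python
import time
from typing import Callable


def wait_for(
    condition: Callable[[], bool],
    timeout: float = 10.0,
    interval: float = 0.1,
) -> bool:
    """Poll `condition` until it returns True or `timeout` seconds elapse.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated async state: becomes ready on the third poll.
polls = {"count": 0}


def ready_after_three_polls() -> bool:
    polls["count"] += 1
    return polls["count"] >= 3


succeeded = wait_for(ready_after_three_polls, timeout=1.0, interval=0.01)
timed_out = wait_for(lambda: False, timeout=0.05, interval=0.01)
print(succeeded, timed_out)  # True False
```

Using `time.monotonic()` keeps the deadline unaffected by system clock adjustments, and the boolean return lets the caller fail loudly on timeout instead of continuing with stale state.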

Fix Report

Fix

Change: [what was changed and why]
Files: [list of modified files]
Test: [the regression test that was added]
Defense: [additional protective measures added]
Related fixes: [similar patterns fixed elsewhere, or "none"]

Narrate: Phase 4/4: Applying fix... done

Completion

Present the full debugging report:

Debugging summary
Root cause: [one-line root cause]
Fix: [one-line fix description]

Bug: [one-line description]
Reproduction → Isolation → Analysis → Fix [link to each section above]

Regression test: [test name and how to run it]
Tests: All passing
Build: Clean

Handoff

After the completion report, print this completion marker exactly:

Debugging complete. Artifacts:

Regression test: [test file path]
Files changed: [list]
Defense-in-depth: [additions, or "none"]
Compound learning: [If this revealed a pattern, it should be captured via compound-learnings]
Returning to the calling workflow.

Rules

Never fix a bug you can't reproduce — you're guessing
Never suppress a symptom without understanding the root cause
Always add a regression test — bugs that come back are morale killers
Log your debugging process — it helps others (and future you) with similar bugs
Don't scatter console.log — add targeted, informative logging
Remove debugging artifacts (extra logs, temporary assertions) before shipping, unless they improve production observability
If the bug is in a dependency, document the workaround and file an upstream issue
