Systematic Debugging
You are investigating a bug using a structured four-phase process. The goal is to find and fix the root cause, not just suppress the symptom.
When to Activate
-
Bug reports or unexpected behavior
-
Failing tests that shouldn't be failing
-
Production issues or error reports
-
"It works on my machine" type problems
-
Any situation where the cause isn't immediately obvious
Activation
After activation, print the banner (see _shared/observability.md ):
Systematic Debugging activated Trigger: [describe the symptom in your own words — e.g., "test failure in auth flow" or "unexpected 500 on endpoint"] Produces: root cause analysis, regression test, fix
Phase 1: Reproduce
Narrate: Phase 1/4: Reproducing the problem...
Before anything else, reliably reproduce the problem.
Steps
-
Understand the report — What's expected? What's happening instead? When did it start?
-
Reproduce locally — Run the exact steps that trigger the bug
-
Document the reproduction:
Reproduction
Steps: [exact steps] Expected: [what should happen] Actual: [what happens instead] Frequency: [always / intermittent / specific conditions] Environment: [OS, Node version, browser, etc.]
- If intermittent: Identify conditions that increase frequency (load, timing, specific data)
If You Can't Reproduce
-
Check environment differences (versions, config, data)
-
Add logging to narrow down when/where it occurs
-
Ask the reporter for more details
-
Do NOT proceed to fixing without reproduction — you'll likely fix the wrong thing
If reproduction fails after reasonable effort, use error recovery (see _shared/observability.md ). AskUserQuestion with options: "Add more logging and retry / Get more info from reporter / Proceed with best guess (risky)."
Narrate: Phase 1/4: Reproducing the problem... done
Phase 2: Isolate
Narrate: Phase 2/4: Isolating the bug...
Narrow down where the bug lives.
Binary Search Strategy
-
Identify the boundaries — Which systems/modules are involved in the failing flow?
-
Add assertions at boundaries — Verify inputs and outputs at each layer
-
Bisect — Is the problem in the frontend or backend? In the handler or the model? In the query or the transformation?
-
Narrow until you find the exact location — The specific function, line, or condition
Isolation Techniques
-
Git bisect — If the bug is a regression, find the commit that introduced it: git bisect start , git bisect bad , git bisect good [known-good-commit]
-
Minimal reproduction — Strip away unrelated code until you have the smallest case that fails
-
Logging — Add targeted logging (not scattershot console.log everywhere)
-
Assertion insertion — Add assert statements at key points to catch violations
What to Record
Isolation
Location: [file:line or module] Narrowed from: [original scope] → [isolated location] Method: [how you narrowed it down]
Narrate: Phase 2/4: Isolating the bug... done
Phase 3: Analyze Root Cause
Narrate: Phase 3/4: Analyzing root cause...
Understand why the bug exists, not just where.
Root Cause Categories
-
Logic error — Code does the wrong thing (off-by-one, wrong operator, missing case)
-
State corruption — Data gets into an invalid state (race condition, missing validation, stale cache)
-
Integration mismatch — Two components disagree on interface, format, or protocol
-
Environment issue — Config, version, or platform difference
-
Missing handling — Edge case, error path, or boundary not covered
Analysis Questions
-
Why does this code exist? What was the original intent?
-
What changed recently? (Check git log for the area)
-
Are there similar patterns elsewhere that work correctly? What's different?
-
Is this the only place this bug manifests, or is it a systemic issue?
Log the root cause decision (see _shared/observability.md Decision Log format):
Decision: Root cause is [category] — [specific cause] Reason: [evidence from isolation phase] Alternatives: [other hypotheses that were ruled out]
Defense in Depth
After identifying the root cause, ask:
-
Why wasn't this caught by tests?
-
Why wasn't this caught by types?
-
Why wasn't this caught by code review?
-
What other similar bugs might exist?
Root Cause
Cause: [precise description] Category: [logic / state / integration / environment / missing handling] Why it wasn't caught: [gap in testing, types, or review] Related risks: [other places with similar patterns]
Narrate: Phase 3/4: Analyzing root cause... done
Phase 4: Fix
Narrate: Phase 4/4: Applying fix...
Fix the root cause and prevent regression.
Fix Process
-
Write a failing test that reproduces the bug (TDD red phase)
-
Verify the test fails — It should fail for the same reason as the bug
-
Apply the minimal fix — Change as little as possible
-
Verify the test passes — The fix resolves the bug
-
Run the full test suite — The fix doesn't break anything else
-
Check for related issues — Fix similar patterns elsewhere if applicable
Log defense-in-depth decisions:
Decision: [which defense-in-depth measures to add] Reason: [what class of bug this prevents] Alternatives: [other measures considered]
Defense-in-Depth Fixes
Beyond the immediate fix, consider:
-
Add type constraints that would prevent this class of bug
-
Add validation at the boundary where bad data enters
-
Add assertions that would catch this sooner
-
Improve error messages so the next person gets a clearer signal
Condition-Based Waiting
If the bug involves timing or async behavior:
-
Never use arbitrary delays (sleep 5 , setTimeout(5000) )
-
Wait for conditions: poll for the expected state, with a timeout
-
Example: Instead of sleep 2 && check_result , use i=0; while ! check_result && [ $i -lt 100 ]; do sleep 0.1; i=$((i+1)); done
Fix Report
Fix
Change: [what was changed and why] Files: [list of modified files] Test: [the regression test that was added] Defense: [additional protective measures added] Related fixes: [similar patterns fixed elsewhere, or "none"]
Narrate: Phase 4/4: Applying fix... done
Completion
Present the full debugging report:
Debugging summary. Root cause: [one-line root cause] Fix: [one-line fix description]
Bug: [one-line description] Reproduction → Isolation → Analysis → Fix [Link to each section above]
Regression test: [test name and how to run it] Tests: All passing Build: Clean
Handoff
After the completion report, print this completion marker exactly:
Debugging complete. Artifacts:
- Regression test: [test file path]
- Files changed: [list]
- Defense-in-depth: [additions, or "none"] Compound learning: [If this revealed a pattern, it should be captured via compound-learnings] Returning to the calling workflow.
Rules
-
Never fix a bug you can't reproduce — you're guessing
-
Never suppress a symptom without understanding the root cause
-
Always add a regression test — bugs that come back are morale killers
-
Log your debugging process — it helps others (and future you) with similar bugs
-
Don't scatter console.log — add targeted, informative logging
-
Remove debugging artifacts (extra logs, temporary assertions) before shipping, unless they improve production observability
-
If the bug is in a dependency, document the workaround and file an upstream issue