Investigate — Root Cause Analysis Engine
Systematic deep investigation protocol. Finds the REAL cause, not the surface symptom.
Core principle: Never fix what you don't understand. Every fix must trace to a proven root cause with evidence.
Protocol
Process every /investigate invocation through these 8 phases in strict order. Never skip a phase. Never jump to Phase 7 (FIX) without completing Phases 1-6.
Phase 1: OBSERVE — Gather All Symptoms
Collect every observable fact before forming any theory.
-
Parse $ARGUMENTS as the symptom description
-
Ask the user for additional context if the description is vague — use AskUserQuestion:
-
What's the expected behavior vs actual behavior?
-
When did it start? What changed recently?
-
Is it consistent or intermittent?
-
Any error messages, logs, or stack traces?
-
Check memory files for known pitfalls related to this area:
-
Read MEMORY.md and any topic-specific memory files
-
Check CLAUDE.md for documented patterns
-
Gather environmental context:
-
Run git log --oneline -20 to see recent changes
-
Run git diff --stat HEAD~5 to see what files changed recently
-
Check for any failing tests with the project's test runner
Output: A symptom report listing every observable fact, recent changes, and any relevant memory entries.
Gate: Do NOT theorize yet. Only facts.
Phase 2: REPRODUCE — Confirm the Issue
An issue you cannot reproduce is an issue you cannot prove you fixed.
-
Identify the shortest path to trigger the symptom:
-
Run existing tests that cover the affected area
-
If no test exists, attempt manual reproduction via Bash
-
For UI issues, check if a Playwright MCP sequence can reproduce it
-
Document the reproduction steps precisely
-
If the issue is intermittent:
-
Flag it as potentially timing-dependent (race condition, async, state)
-
Look for concurrent access, shared mutable state, missing locks/guards
-
Check for dependency on external state (network, filesystem, database)
-
If the issue cannot be reproduced:
-
Shift to forensic investigation (logs, git history, code review)
-
Do NOT skip remaining phases — proceed with available evidence
Output: Reproduction steps, or explicit documentation of why reproduction failed.
Gate: Issue confirmed (or forensic mode declared). Proceed.
Phase 3: TRACE — Follow the Execution Path
Start from the symptom and trace backward to the origin.
-
Locate the symptom — find the exact file and line where the error occurs:
-
Use Grep for error messages, exception types, log strings
-
Use Explore agent for broad searches if the location is unclear
-
Trace the call chain — read every file in the execution path:
-
From error site → caller → caller's caller → entry point
-
Read each file fully with Read tool — do NOT skim
-
Document the complete flow: input → transform → output
-
Trace the data flow — follow the data that caused the error:
-
What value caused the crash? Where did it come from?
-
Trace the value backward: variable → assignment → source → input
-
Map dependencies — what else touches this code path:
-
Use Grep to find all callers of the failing function
-
Check for shared state, singletons, global variables
-
Look for recent changes in dependencies with git log --oneline -- <file>
-
Check git forensics — when was the problem introduced:
-
git log --oneline -- <affected-files> — who changed these files and when?
-
git blame <file> on the suspicious lines — what commit introduced them?
-
If a clear suspect commit is found, read its full diff
Output: Complete execution trace with file paths and line numbers. Data flow map. Git timeline.
Gate: The full code path from entry point to symptom is mapped and understood.
Phase 4: HYPOTHESIZE — Deep Reasoning with 5 Whys
This phase MUST use the sequential-thinking MCP server for structured multi-step reasoning.
-
Start the sequential-thinking chain with the symptom and all evidence from Phases 1-3
-
Apply the 5 Whys method — for each answer, ask "but why does THAT happen?": Symptom: App crashes when tapping a document Why 1: DocumentDetailView accesses a deleted NSManagedObject Why 2: The object was deleted from Core Data while the view held a reference Why 3: context.delete() was called from a background operation Why 4: The background sync didn't check if the view was still displaying the object Why 5: There's no soft-delete pattern — objects are hard-deleted immediately ROOT CAUSE: Missing soft-delete guard in the sync pipeline
-
Generate at least 2 competing hypotheses — don't lock on the first theory:
-
Categorize each by type: Code Logic | Data State | Timing/Race | Environment | Dependency | Configuration
-
For each hypothesis, define what evidence would prove or disprove it
-
Use branching in sequential-thinking to explore alternative explanations: branchFromThought: 3, branchId: "alternative-cause"
-
Rank hypotheses by likelihood based on available evidence
Output: Ranked list of hypotheses with evidence requirements for each.
Gate: At least 2 hypotheses generated. Each has defined proof criteria.
Phase 5: PROVE — Test Each Hypothesis with Evidence
Systematically confirm or eliminate each hypothesis. No guessing.
For each hypothesis (highest-ranked first):
-
Gather confirming evidence:
-
Read the specific code paths predicted by the hypothesis
-
Check logs/output for patterns the hypothesis predicts
-
Run targeted tests that would pass if the hypothesis is correct
-
Use git blame / git log to check if timing matches
-
Gather disconfirming evidence:
-
Look for code paths that should also fail if the hypothesis is correct but don't
-
Check edge cases that contradict the hypothesis
-
Check external sources:
-
Use WebSearch for known issues in the library/framework version
-
Use library-docs skill (context7 MCP) to verify correct API usage
-
Search GitHub issues for the library: mcp__github__search_issues
-
Verdict per hypothesis:
-
CONFIRMED — evidence supports it, no contradictions
-
ELIMINATED — evidence contradicts it
-
INCONCLUSIVE — need more evidence (define what)
If all hypotheses are eliminated: Return to Phase 4 with new evidence. Generate new hypotheses.
Output: Evidence log per hypothesis. One confirmed root cause (or request for more data).
Gate: Exactly one root cause confirmed with evidence. Or an explicit statement that the cause requires additional data from the user (with specific questions).
Phase 6: ROOT CAUSE — Document the Causal Chain
Write the definitive explanation before touching any code.
-
Document the complete causal chain: ROOT CAUSE: <the deepest systemic issue> → causes: <intermediate effect> → causes: <intermediate effect> → manifests as: <the symptom the user reported>
-
Explain why this is the root cause (not just a proximate cause):
-
If fixed, would it prevent recurrence? (yes = root cause)
-
Is there a deeper cause? (if yes, keep digging)
-
Identify the blast radius — what else is affected:
-
Are there similar patterns elsewhere in the codebase?
-
Use Grep to find analogous code that may have the same bug
-
Present the root cause analysis to the user before proceeding to fix
Output: Root cause statement, causal chain, blast radius assessment.
Gate: User understands and agrees with the diagnosis before any fix is attempted.
Phase 7: FIX — Address the Root Cause
Fix the root cause, not the symptom. Minimal, targeted change.
-
Design the fix:
-
What is the minimum change that eliminates the root cause?
-
Does the fix handle all cases in the blast radius (Phase 6)?
-
Does the fix introduce any new risks?
-
Implement the fix:
-
Read every file before modifying it
-
Make the smallest change possible
-
Add inline comments only where the fix is non-obvious
-
Verify the fix:
-
Run the reproduction steps from Phase 2 — symptom should be gone
-
Run existing tests — no regressions
-
Run code-quality agent on modified files if the change is substantial
-
Check for similar patterns:
-
If the bug was a pattern (e.g., missing null check), search for the same pattern elsewhere
-
Fix all instances, not just the reported one
Output: Code changes with explanation of what was changed and why.
Phase 8: PREVENT — Ensure It Never Recurs
The investigation isn't complete until recurrence is prevented.
-
Add a regression test that would have caught this bug:
-
The test must fail without the fix and pass with it
-
Use test-automation agent for comprehensive test generation
-
Update project memory if a new pitfall was discovered:
-
Add to MEMORY.md under Common Pitfalls
-
Include the pattern, why it's dangerous, and the safe alternative
-
Suggest structural improvements (optional, only if the bug reveals a design flaw):
-
Propose architectural changes that make this class of bug impossible
-
Present as a suggestion, not an immediate action
-
Write the investigation summary:
Investigation Report
Symptom: <what was reported> Root Cause: <the deepest systemic issue> Causal Chain: root cause → ... → symptom Fix: <what was changed, which files> Blast Radius: <other areas checked/fixed> Regression Test: <test added> Prevention: <memory updated, guard added, pattern documented> Time: <phases completed, hypotheses tested>
Tool Usage by Phase
Phase Primary Tools When to Use Agents
-
OBSERVE Read, Grep, Bash (git log) —
-
REPRODUCE Bash (test runner), Playwright MCP —
-
TRACE Read, Grep, Glob, Bash (git blame) Explore agent for broad searches
-
HYPOTHESIZE sequential-thinking MCP deep-analysis skill
-
PROVE Read, Grep, Bash, WebSearch, context7 MCP library-docs skill, GitHub MCP
-
ROOT CAUSE Read, Grep Explore agent for blast radius
-
FIX Read, Edit, Write, Bash code-quality agent for review
-
PREVENT Write, Edit, Bash test-automation agent for tests
Anti-Patterns — What This Skill Prevents
Bad Habit What /investigate Does Instead
Jump straight to fixing Forces Phases 1-6 before any code change
Fix the symptom 5 Whys drills to root cause
Single theory tunnel vision Requires 2+ competing hypotheses
"It works now" without understanding Demands evidence-based proof
Fix one instance, miss others Blast radius analysis in Phase 6
No regression test Phase 8 mandates a test
Knowledge lost Memory update in Phase 8
When to Use /investigate vs Other Tools
Situation Use
Bug, crash, error, unexpected behavior /investigate
Build a new feature /execute
Quick "what does this code do?" Explore agent directly
Performance slow but unclear why /investigate (treat slowness as symptom)
Known fix, just need to apply it Direct Edit — no investigation needed
Security vulnerability found /investigate
- security-scan
References
See references/investigation-frameworks.md for detailed methodology guides.