Root Cause Analysis
Systematic approaches for identifying the true source of problems, not just symptoms.
RCA Methods Overview
Method Best For Complexity Time
5 Whys Simple, linear problems Low 15-30 min
Fishbone Multi-factor problems Medium 30-60 min
Fault Tree Critical systems, safety High 1-4 hours
Timeline Analysis Incident investigation Medium 30-90 min
5 Whys Method
Iteratively ask "why" to drill down from symptom to root cause.
Process
Problem Statement: [Clear description of the issue] │ ▼ Why #1: [First level cause] │ ▼ Why #2: [Deeper cause] │ ▼ Why #3: [Even deeper] │ ▼ Why #4: [Getting to root] │ ▼ Why #5: [Root cause identified] │ ▼ Action: [Fix that addresses root cause]
Example: Production Outage
Problem: Website was down for 2 hours
Why 1: Why was the website down? → The application server ran out of memory and crashed.
Why 2: Why did the server run out of memory? → A memory leak in the image processing service accumulated over time.
Why 3: Why was there a memory leak? → The service wasn't releasing image buffers after processing.
Why 4: Why weren't buffers being released? → The cleanup code had a bug introduced in last week's release.
Why 5: Why wasn't the bug caught before release? → We don't have automated memory leak detection in our test suite.
Root Cause: Missing automated memory leak testing Action: Add memory profiling to CI pipeline, add cleanup tests
5 Whys Best Practices
Do Don't
Base answers on evidence Guess or assume
Stay focused on one causal chain Branch too early
Keep asking until actionable Stop at symptoms
Involve people closest to issue Assign blame
Document your reasoning Skip steps
When 5 Whys Falls Short
-
Multiple contributing factors (use Fishbone)
-
Complex system interactions (use Fault Tree)
-
Organizational/process issues (need broader analysis)
Fishbone Diagram (Ishikawa)
Visualize multiple potential causes organized by category.
Standard Categories (6 M's)
┌─────────────┐
Methods ────┤ │
│ │
Machines ─────┤ │
│ ├──── PROBLEM
Materials ─────┤ │
│ │
Measurement ────┤ │
│ │
Environment ────┤ │
│ │
People ──────┤ │
└─────────────┘
Software-Specific Categories
┌─────────────┐
Code ─────┤ │
│ │
Infrastructure ────┤ │ │ ├──── BUG/INCIDENT Dependencies ────┤ │ │ │ Configuration ───┤ │ │ │ Process ────┤ │ │ │ People ─────┤ │ └─────────────┘
Fishbone Example: API Latency Spike
┌─────────────────┐
│ │
Code ─────────────────┤ │
│ │ │
├─ N+1 query issue │ │
├─ Missing index │ API LATENCY │
└─ Sync blocking call│ SPIKE │
│ │
Infrastructure ─────────────┤ │ │ │ │ ├─ DB connection pool│ │ ├─ Network saturation│ │ └─ Insufficient RAM │ │ │ │ Dependencies ───────────────┤ │ │ │ │ ├─ External API slow │ │ ├─ Redis timeout │ │ └─ CDN cache miss │ │ └─────────────────┘
Fishbone Process
-
Define the problem clearly (the fish head)
-
Identify major categories (the bones)
-
Brainstorm causes for each category
-
Analyze relationships between causes
-
Prioritize most likely root causes
-
Verify with data/testing
-
Take action on confirmed causes
Fault Tree Analysis (FTA)
Top-down, deductive analysis for critical systems.
FTA Symbols
┌─────┐ │ TOP │ Top Event (the failure being analyzed) └──┬──┘ │ ┌──┴──┐ │ AND │ All inputs must occur for output └─────┘
┌──┴──┐ │ OR │ Any input causes output └─────┘
┌─────┐ │ ○ │ Basic Event (root cause) └─────┘
┌─────┐ │ ◇ │ Undeveloped Event (needs more analysis) └─────┘
FTA Example: Authentication Failure
┌────────────────────┐
│ USER CANNOT │
│ AUTHENTICATE │
└─────────┬──────────┘
│
┌───┴───┐
│ OR │
└───┬───┘
┌──────────────────┼──────────────────┐
│ │ │
┌──────┴──────┐ ┌──────┴──────┐ ┌──────┴──────┐
│ Invalid │ │ Auth │ │ Account │
│ Credentials│ │ Service │ │ Locked │
│ │ │ Down │ │ │
└──────┬──────┘ └──────┬──────┘ └─────────────┘
│ │
┌───┴───┐ ┌───┴───┐
│ OR │ │ OR │
└───┬───┘ └───┬───┘
┌──────┼──────┐ ┌──────┼──────┐
│ │ │ │ │ │
○ ○ ○ ○ ○ ◇ Wrong Expired Token DB Redis External Password Token Invalid Down Down Auth
When to Use FTA
-
Safety-critical systems
-
Complex failure modes
-
Need to identify all paths to failure
-
Regulatory compliance requirements
-
Post-incident analysis for serious outages
Timeline Analysis
Reconstruct sequence of events to identify causation.
Timeline Template
Incident Timeline: [Incident Name]
Summary
- Incident Start: [Timestamp]
- Incident Detected: [Timestamp]
- Incident Resolved: [Timestamp]
- Total Duration: [X hours Y minutes]
- Time to Detect: [X minutes]
- Time to Resolve: [X hours Y minutes]
Detailed Timeline
| Time (UTC) | Event | Source | Actor |
|---|---|---|---|
| 14:00 | Deployment started | CI/CD | automated |
| 14:05 | Deployment completed | CI/CD | automated |
| 14:15 | Error rate increased 10x | Monitoring | - |
| 14:22 | Alert fired | PagerDuty | - |
| 14:25 | On-call acknowledged | PagerDuty | @alice |
| 14:30 | Root cause identified | Investigation | @alice |
| 14:35 | Rollback initiated | Manual | @alice |
| 14:40 | Services recovered | Monitoring | - |
| 14:45 | Incident resolved | Manual | @alice |
Analysis
Contributing Factors:
- [Factor 1]
- [Factor 2]
What Went Well:
- [Positive observation]
What Could Improve:
- [Improvement area]
Action Items
| Action | Owner | Due Date | Status |
|---|---|---|---|
Debugging Decision Tree
Problem Reported
│
▼
Can you reproduce it?
│ │
Yes No
│ │
▼ ▼
Isolate the Gather more
conditions information
│ │
▼ ▼
Recent changes? Check logs,
│ monitoring
Yes │
│ │
▼ ▼
Review diffs Correlation
& deploys analysis
│ │
└─────┬─────┘
│
▼
Form hypothesis
│
▼
Test hypothesis
│
┌─────┴─────┐
│ │
Confirmed Rejected
│ │
▼ ▼
Fix and Next hypothesis
verify
RCA Documentation Template
Root Cause Analysis: [Issue Title]
Issue Summary
Reported: [Date] Severity: P0 / P1 / P2 / P3 Impact: [Description of impact]
Problem Statement
[Clear, specific description of what went wrong]
Investigation
Timeline
[Key events in sequence]
Analysis Method Used
[ ] 5 Whys [ ] Fishbone [ ] Fault Tree [ ] Timeline Analysis
Findings
[Detailed analysis results]
Root Cause(s)
- Primary: [Main root cause]
- Contributing: [Secondary factors]
Immediate Fix
[What was done to resolve the immediate issue]
Preventive Actions
| Action | Owner | Due | Status |
|---|---|---|---|
Lessons Learned
- [Key takeaway]
- [Process improvement]
Appendix
- [Links to logs, graphs, related tickets]
Best Practices
-
Blameless postmortems: Focus on systems, not individuals
-
Automated correlation: Use AI to correlate signals across systems
-
Proactive RCA: Analyze near-misses, not just incidents
-
Knowledge sharing: Document and share RCA findings
-
Metrics-driven: Track time-to-detect, time-to-resolve trends
Related Skills
-
observability-monitoring
-
Gathering data for RCA
-
errors
-
Error pattern analysis
-
resilience-patterns
-
Preventing future incidents
References
-
5 Whys Workshop Guide
-
Fishbone Template
Version: 1.0.0 (January )