root-cause-analysis

Systematic approaches for identifying the true source of problems, not just symptoms.

Root Cause Analysis

RCA Methods Overview

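| Method            | Best For                 | Complexity | Time      |
|-------------------|--------------------------|------------|-----------|
| 5 Whys            | Simple, linear problems  | Low        | 15-30 min |
| Fishbone          | Multi-factor problems    | Medium     | 30-60 min |
| Fault Tree        | Critical systems, safety | High       | 1-4 hours |
| Timeline Analysis | Incident investigation   | Medium     | 30-90 min |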

5 Whys Method

Iteratively ask "why" to drill down from symptom to root cause.

Process

Problem Statement: [Clear description of the issue]
        │
        ▼
Why #1: [First level cause]
        │
        ▼
Why #2: [Deeper cause]
        │
        ▼
Why #3: [Even deeper]
        │
        ▼
Why #4: [Getting to root]
        │
        ▼
Why #5: [Root cause identified]
        │
        ▼
Action: [Fix that addresses root cause]

Example: Production Outage

Problem: Website was down for 2 hours

Why 1: Why was the website down? → The application server ran out of memory and crashed.

Why 2: Why did the server run out of memory? → A memory leak in the image processing service accumulated over time.

Why 3: Why was there a memory leak? → The service wasn't releasing image buffers after processing.

Why 4: Why weren't buffers being released? → The cleanup code had a bug introduced in last week's release.

Why 5: Why wasn't the bug caught before release? → We don't have automated memory leak detection in our test suite.

Root Cause: Missing automated memory leak testing

Action: Add memory profiling to CI pipeline, add cleanup tests
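The chain above can also be kept as structured data so that each answer stays tied to its question and its evidence. The following is a minimal sketch, not part of the original skill: the `Why` and `FiveWhys` classes, the `evidence` field, and the bracketed evidence placeholders are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    question: str
    answer: str
    evidence: str = ""  # hypothetical field: link to a log, graph, or ticket


@dataclass
class FiveWhys:
    problem: str
    whys: list = field(default_factory=list)
    action: str = ""

    def ask(self, question, answer, evidence=""):
        """Record one level of the causal chain and return self for chaining."""
        self.whys.append(Why(question, answer, evidence))
        return self

    def root_cause(self):
        """Treat the deepest answer recorded so far as the root cause."""
        if not self.whys:
            raise ValueError("no whys recorded yet")
        return self.whys[-1].answer

    def unsupported(self):
        """1-based positions of answers that have no evidence attached."""
        return [i for i, w in enumerate(self.whys, start=1) if not w.evidence]

    def render(self):
        lines = [f"Problem: {self.problem}"]
        for i, w in enumerate(self.whys, start=1):
            lines.append(f"Why {i}: {w.question} → {w.answer}")
        lines.append(f"Root Cause: {self.root_cause()}")
        if self.action:
            lines.append(f"Action: {self.action}")
        return "\n".join(lines)


# The production outage example; evidence strings are placeholders.
rca = FiveWhys("Website was down for 2 hours")
rca.ask("Why was the website down?",
        "The application server ran out of memory and crashed.",
        evidence="[link to crash log]")
rca.ask("Why did the server run out of memory?",
        "A memory leak in the image processing service accumulated over time.",
        evidence="[link to memory graph]")
rca.ask("Why was there a memory leak?",
        "The service wasn't releasing image buffers after processing.")
rca.ask("Why weren't buffers being released?",
        "The cleanup code had a bug introduced in last week's release.")
rca.ask("Why wasn't the bug caught before release?",
        "We don't have automated memory leak detection in our test suite.")
rca.action = "Add memory profiling to CI pipeline, add cleanup tests"

print(rca.render())
print("Answers still needing evidence:", rca.unsupported())
```

Listing the answers without evidence gives a quick check that the chain is not resting on guesses before the action is agreed.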

5 Whys Best Practices

| Do                              | Don't            |
|---------------------------------|------------------|
| Base answers on evidence        | Guess or assume  |
| Stay focused on one causal chain | Branch too early |
| Keep asking until actionable    | Stop at symptoms |
| Involve people closest to issue | Assign blame     |
| Document your reasoning         | Skip steps       |

When 5 Whys Falls Short

  • Multiple contributing factors (use Fishbone)

  • Complex system interactions (use Fault Tree)

  • Organizational/process issues (need broader analysis)

Fishbone Diagram (Ishikawa)

Visualize multiple potential causes organized by category.

Standard Categories (6 M's)

                ┌─────────────┐
    Methods ────┤             │
                │             │
  Machines ─────┤             │
                │             ├──── PROBLEM
 Materials ─────┤             │
                │             │
Measurement ────┤             │
                │             │
Environment ────┤             │
                │             │
   People ──────┤             │
                └─────────────┘

Software-Specific Categories

                ┌─────────────┐
      Code ─────┤             │
                │             │
Infrastructure ─┤             │
                │             ├──── BUG/INCIDENT
Dependencies ───┤             │
                │             │
Configuration ──┤             │
                │             │
    Process ────┤             │
                │             │
     People ────┤             │
                └─────────────┘

Fishbone Example: API Latency Spike

                          ┌─────────────────┐
                          │                 │
    Code ─────────────────┤                 │
     │                    │                 │
     ├─ N+1 query issue   │                 │
     ├─ Missing index     │   API LATENCY   │
     └─ Sync blocking call│      SPIKE      │
                          │                 │
Infrastructure ───────────┤                 │
     │                    │                 │
     ├─ DB connection pool│                 │
     ├─ Network saturation│                 │
     └─ Insufficient RAM  │                 │
                          │                 │
Dependencies ─────────────┤                 │
     │                    │                 │
     ├─ External API slow │                 │
     ├─ Redis timeout     │                 │
     └─ CDN cache miss    │                 │
                          └─────────────────┘

Fishbone Process

  • Define the problem clearly (the fish head)

  • Identify major categories (the bones)

  • Brainstorm causes for each category

  • Analyze relationships between causes

  • Prioritize most likely root causes

  • Verify with data/testing

  • Take action on confirmed causes
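The brainstorm, prioritize, and verify steps can be tracked in a small data structure. This is a rough sketch under stated assumptions: the `Cause` class, the 1-5 `likelihood` score, and the `verified` flag are hypothetical, and the scores below are illustrative values, not measurements from the API latency example.

```python
from dataclasses import dataclass


@dataclass
class Cause:
    category: str
    description: str
    likelihood: int          # assumed 1-5 team estimate, higher is more likely
    verified: bool = False   # set to True once data or a test confirms it


def by_category(causes):
    """Group cause descriptions under their fishbone category (the bones)."""
    grouped = {}
    for cause in causes:
        grouped.setdefault(cause.category, []).append(cause.description)
    return grouped


def prioritize(causes):
    """Most likely candidates first; ties keep their brainstormed order."""
    return sorted(causes, key=lambda c: c.likelihood, reverse=True)


def actionable(causes):
    """Only confirmed causes should drive action items."""
    return [c for c in causes if c.verified]


# Causes taken from the API latency example; scores are made up.
causes = [
    Cause("Code", "N+1 query issue", likelihood=4),
    Cause("Code", "Missing index", likelihood=5),
    Cause("Infrastructure", "DB connection pool", likelihood=3),
    Cause("Dependencies", "Redis timeout", likelihood=2),
]

ranked = prioritize(causes)
ranked[0].verified = True  # pretend the top candidate was confirmed by testing

print(by_category(causes))
print([c.description for c in ranked])
print([c.description for c in actionable(causes)])
```

Keeping `verified` separate from `likelihood` mirrors the process: a high estimate is only a reason to test a cause first, not a reason to act on it.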

Fault Tree Analysis (FTA)

Top-down, deductive analysis for critical systems.

FTA Symbols

┌─────┐
│ TOP │  Top Event (the failure being analyzed)
└──┬──┘
   │
┌──┴──┐
│ AND │  All inputs must occur for output
└─────┘

┌──┴──┐
│ OR  │  Any input causes output
└─────┘

┌─────┐
│  ○  │  Basic Event (root cause)
└─────┘

┌─────┐
│  ◇  │  Undeveloped Event (needs more analysis)
└─────┘

FTA Example: Authentication Failure

                ┌────────────────────┐
                │   USER CANNOT      │
                │   AUTHENTICATE     │
                └─────────┬──────────┘
                          │
                      ┌───┴───┐
                      │  OR   │
                      └───┬───┘
       ┌──────────────────┼──────────────────┐
       │                  │                  │
┌──────┴──────┐    ┌──────┴──────┐    ┌──────┴──────┐
│  Invalid    │    │   Auth      │    │  Account    │
│  Credentials│    │   Service   │    │  Locked     │
│             │    │   Down      │    │             │
└──────┬──────┘    └──────┬──────┘    └─────────────┘
       │                  │
   ┌───┴───┐          ┌───┴───┐
   │  OR   │          │  OR   │
   └───┬───┘          └───┬───┘
┌──────┼──────┐    ┌──────┼──────┐
│      │      │    │      │      │
○      ○      ○    ○      ○      ◇
│      │      │    │      │      └ External Auth
│      │      │    │      └ Redis Down
│      │      │    └ DB Down
│      │      └ Token Invalid
│      └ Expired Token
└ Wrong Password
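A fault tree like the one above can be evaluated mechanically. The sketch below is an illustration rather than a standard FTA tool: `Event`, `Gate`, `occurs`, and `cut_sets` are hypothetical names, "Account Locked" and the undeveloped "External Auth" event are treated as plain leaves, and the cut sets it returns are not minimized.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class Event:
    """Leaf of the tree: a basic (○) or undeveloped (◇) event."""
    name: str


@dataclass
class Gate:
    """Intermediate event whose inputs are combined with AND or OR."""
    name: str
    kind: str    # "AND" or "OR"
    inputs: list


def occurs(node, active):
    """True if `node` happens given the set of leaf event names in `active`."""
    if isinstance(node, Event):
        return node.name in active
    results = [occurs(child, active) for child in node.inputs]
    if node.kind == "AND":
        return all(results)
    if node.kind == "OR":
        return any(results)
    raise ValueError(f"unknown gate kind: {node.kind}")


def cut_sets(node):
    """Sets of leaf events that are each sufficient to cause `node`."""
    if isinstance(node, Event):
        return [frozenset([node.name])]
    child_sets = [cut_sets(child) for child in node.inputs]
    if node.kind == "OR":
        # Any single input's cut set is enough on its own.
        return [s for sets in child_sets for s in sets]
    if node.kind == "AND":
        # Need one cut set from every input at the same time.
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate kind: {node.kind}")


# The authentication failure tree from the example.
top = Gate("User cannot authenticate", "OR", [
    Gate("Invalid Credentials", "OR", [
        Event("Wrong Password"), Event("Expired Token"), Event("Token Invalid"),
    ]),
    Gate("Auth Service Down", "OR", [
        Event("DB Down"), Event("Redis Down"), Event("External Auth"),
    ]),
    Event("Account Locked"),
])

print(occurs(top, {"Redis Down"}))
for path in cut_sets(top):
    print(sorted(path))
```

Because every gate in this example is an OR, each leaf is a single point of failure on its own. An AND gate would instead produce combined sets, which is where redundancy shows up in the analysis.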

When to Use FTA

  • Safety-critical systems

  • Complex failure modes

  • Need to identify all paths to failure

  • Regulatory compliance requirements

  • Post-incident analysis for serious outages

Timeline Analysis

Reconstruct the sequence of events to identify causation.

Timeline Template

Incident Timeline: [Incident Name]

Summary

  • Incident Start: [Timestamp]
  • Incident Detected: [Timestamp]
  • Incident Resolved: [Timestamp]
  • Total Duration: [X hours Y minutes]
  • Time to Detect: [X minutes]
  • Time to Resolve: [X hours Y minutes]

Detailed Timeline

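| Time (UTC) | Event                    | Source        | Actor     |
|------------|--------------------------|---------------|-----------|
| 14:00      | Deployment started       | CI/CD         | automated |
| 14:05      | Deployment completed     | CI/CD         | automated |
| 14:15      | Error rate increased 10x | Monitoring    | -         |
| 14:22      | Alert fired              | PagerDuty     | -         |
| 14:25      | On-call acknowledged     | PagerDuty     | @alice    |
| 14:30      | Root cause identified    | Investigation | @alice    |
| 14:35      | Rollback initiated       | Manual        | @alice    |
| 14:40      | Services recovered       | Monitoring    | -         |
| 14:45      | Incident resolved        | Manual        | @alice    |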

Analysis

Contributing Factors:

  1. [Factor 1]
  2. [Factor 2]

What Went Well:

  1. [Positive observation]

What Could Improve:

  1. [Improvement area]

Action Items

| Action | Owner | Due Date | Status |
|--------|-------|----------|--------|
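The summary fields in the template can be derived from the detailed timeline rather than filled in by hand. The sketch below uses the sample rows from the template and makes three assumptions that the template itself does not state: the incident starts at the first visible symptom (the error rate increase), detection is the moment the alert fired, and Time to Resolve is measured from detection while Total Duration is measured from the start.

```python
from datetime import datetime

# (time in UTC, event) pairs copied from the sample timeline.
TIMELINE = [
    ("14:00", "Deployment started"),
    ("14:05", "Deployment completed"),
    ("14:15", "Error rate increased 10x"),
    ("14:22", "Alert fired"),
    ("14:25", "On-call acknowledged"),
    ("14:30", "Root cause identified"),
    ("14:35", "Rollback initiated"),
    ("14:40", "Services recovered"),
    ("14:45", "Incident resolved"),
]


def at(event):
    """Return the timestamp of the named event (same-day times assumed)."""
    for time, name in TIMELINE:
        if name == event:
            return datetime.strptime(time, "%H:%M")
    raise KeyError(event)


def minutes(delta):
    """Whole minutes in a timedelta."""
    return int(delta.total_seconds() // 60)


# Assumed mapping from template fields to timeline events.
start = at("Error rate increased 10x")
detected = at("Alert fired")
resolved = at("Incident resolved")

time_to_detect = detected - start
time_to_resolve = resolved - detected
total_duration = resolved - start

print(f"Time to Detect: {minutes(time_to_detect)} minutes")
print(f"Time to Resolve: {minutes(time_to_resolve)} minutes")
print(f"Total Duration: {minutes(total_duration)} minutes")
```

With the sample rows this gives 7, 23, and 30 minutes. Teams that count the incident from the triggering deployment instead would change only the `start` line.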

Debugging Decision Tree

                Problem Reported
                      │
                      ▼
           Can you reproduce it?
                │           │
               Yes          No
                │           │
                ▼           ▼
        Isolate the      Gather more
        conditions       information
                │           │
                ▼           ▼
        Recent changes?  Check logs,
                │        monitoring
               Yes          │
                │           │
                ▼           ▼
        Review diffs    Correlation
        & deploys       analysis
                │           │
                └─────┬─────┘
                      │
                      ▼
               Form hypothesis
                      │
                      ▼
                Test hypothesis
                      │
                ┌─────┴─────┐
                │           │
           Confirmed     Rejected
                │           │
                ▼           ▼
           Fix and      Next hypothesis
           verify
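The lower half of the tree, where hypotheses are formed and tested until one is confirmed, is essentially a loop. The sketch below is one hedged way to express it: `investigate`, the `observations` dictionary, and the two hypotheses are hypothetical stand-ins for real checks against logs, diffs, and monitoring.

```python
from typing import Callable, Optional


def investigate(
    hypotheses: list[tuple[str, Callable[[], bool]]],
) -> tuple[Optional[str], list[str]]:
    """Test hypotheses in order; return the confirmed one and those rejected."""
    rejected = []
    for description, test in hypotheses:
        if test():
            return description, rejected
        rejected.append(description)  # keep for the RCA write-up
    return None, rejected


# Hypothetical observations gathered while isolating the problem.
observations = {
    "db_cpu_high": False,
    "deploy_in_last_hour": True,
    "errors_stop_after_rollback": True,
}

hypotheses = [
    ("Database is overloaded",
     lambda: observations["db_cpu_high"]),
    ("Latest deploy introduced the regression",
     lambda: observations["deploy_in_last_hour"]
     and observations["errors_stop_after_rollback"]),
]

confirmed, rejected = investigate(hypotheses)
print("Confirmed:", confirmed)
print("Rejected:", rejected)
```

Returning the rejected hypotheses alongside the confirmed one keeps a record of what was ruled out, which belongs in the findings of the final report.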

RCA Documentation Template

Root Cause Analysis: [Issue Title]

Issue Summary

Reported: [Date]
Severity: P0 / P1 / P2 / P3
Impact: [Description of impact]

Problem Statement

[Clear, specific description of what went wrong]

Investigation

Timeline

[Key events in sequence]

Analysis Method Used

  [ ] 5 Whys
  [ ] Fishbone
  [ ] Fault Tree
  [ ] Timeline Analysis

Findings

[Detailed analysis results]

Root Cause(s)

  1. Primary: [Main root cause]
  2. Contributing: [Secondary factors]

Immediate Fix

[What was done to resolve the immediate issue]

Preventive Actions

| Action | Owner | Due | Status |
|--------|-------|-----|--------|

Lessons Learned

  1. [Key takeaway]
  2. [Process improvement]

Appendix

  • [Links to logs, graphs, related tickets]

Best Practices

  • Blameless postmortems: Focus on systems, not individuals

  • Automated correlation: Use AI to correlate signals across systems

  • Proactive RCA: Analyze near-misses, not just incidents

  • Knowledge sharing: Document and share RCA findings

  • Metrics-driven: Track time-to-detect, time-to-resolve trends

Related Skills

  • observability-monitoring: Gathering data for RCA

  • errors: Error pattern analysis

  • resilience-patterns: Preventing future incidents

References

  • 5 Whys Workshop Guide

  • Fishbone Template

Version: 1.0.0 (January )
