GameDay Planning
A guide to planning and running GameDay exercises: organized chaos drills that test system resilience and incident response.
When to Use This Skill
- Planning GameDay exercises
- Designing failure scenarios
- Preparing teams for chaos experiments
- Running disaster recovery drills
- Improving incident response readiness
What is a GameDay?
GameDay = Planned chaos exercise for your systems
Like a fire drill, but for infrastructure:
- Scheduled in advance
- Controlled environment
- Practice for real incidents
- Learn and improve
Not the same as chaos engineering:
- GameDay: Scheduled team exercise
- Chaos engineering: Continuous experiments
GameDays include:
- Failure injection
- Incident response practice
- Team coordination
- Runbook validation
GameDay Types
By Scope
- Component GameDay
  └── Single service or component
  └── Focused scenarios
  └── 2-4 hours
- Service GameDay
  └── Multiple related services
  └── Integration scenarios
  └── Half day
- Full System GameDay
  └── Complete system
  └── Disaster scenarios
  └── Full day
- Cross-Team GameDay
  └── Multiple teams involved
  └── Complex scenarios
  └── 1-2 days
By Objective
- Resilience validation
  └── Does the system handle failures?
- Recovery practice
  └── Can we restore from backup?
- Incident response training
  └── How well do we coordinate?
- Runbook validation
  └── Do our runbooks work?
- Capacity testing
  └── What happens under load?
Planning Phase
Timeline Overview
Week -4: Initial planning
├── Define objectives
├── Identify stakeholders
└── Draft scenario ideas
Week -3: Scenario design
├── Detail failure scenarios
├── Define success criteria
└── Identify risks
Week -2: Preparation
├── Review with stakeholders
├── Prepare monitoring
├── Update runbooks
└── Brief participants
Week -1: Final prep
├── Confirm participants
├── Test monitoring
├── Walkthrough scenarios
└── Prepare rollback plans
Day of: Execute
├── Pre-GameDay briefing
├── Run scenarios
├── Document observations
└── Hot debrief
Objective Setting
Good objectives:
- "Validate that failover to the secondary region completes in under 5 minutes"
- "Confirm team can diagnose database issues using runbooks"
- "Test load balancer behavior when 50% of nodes fail"
Bad objectives:
- "See what breaks" (too vague)
- "Test everything" (too broad)
- "Find all bugs" (unrealistic)
SMART objectives:
- Specific: Clear scenario
- Measurable: Defined success criteria
- Achievable: Within team capability
- Relevant: Tests real risks
- Time-bound: Fits in GameDay
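A measurable objective can be written down as a threshold check before the exercise, so nobody argues afterwards about whether it was met. The sketch below is one possible shape in Python; the `Objective` class, its field names, and the `is_met` helper are assumptions made for illustration, not part of any GameDay tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """A measurable GameDay objective (hypothetical structure)."""
    description: str
    metric: str        # name of the quantity measured during the scenario
    threshold: float   # the objective is met when the measurement is below this
    unit: str


def is_met(objective: Objective, measured: float) -> bool:
    """Return True when the measured value is strictly below the threshold."""
    return measured < objective.threshold


# The first "good objective" above, expressed as data.
failover = Objective(
    description="Validate failover to secondary region",
    metric="failover_duration",
    threshold=5.0,
    unit="minutes",
)
```

Calling `is_met(failover, 3.0)` returns `True`, while a measurement of exactly 5 minutes does not pass because the objective says "under".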
Scenario Design
Scenario template:
Name: [Descriptive name]
Type: [Infrastructure/Application/Data/Process]
Duration: [Expected time]
Objective: What are we testing?
Hypothesis: "When [fault], the system will [expected behavior]"
Setup:
- [Pre-condition 1]
- [Pre-condition 2]
Execution:
- [Injection step 1]
- [Injection step 2]
Expected Outcome:
- [Metric] should [behavior]
- [Alert] should [fire/not fire]
- [Recovery] should [happen]
Success Criteria:
□ [Criterion 1]
□ [Criterion 2]
Abort Conditions:
- [Condition] → Stop immediately
- [Condition] → Pause and assess
Rollback Steps:
- [Rollback step 1]
- [Rollback step 2]
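Teams that keep scenarios in a repository may prefer the template as structured data, so that incomplete scenarios can be caught before the GameDay. The following Python sketch mirrors the template fields; the `Scenario` class, its method names, and the choice of which sections are mandatory are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List

ALLOWED_TYPES = {"Infrastructure", "Application", "Data", "Process"}


@dataclass
class Scenario:
    """One GameDay scenario, mirroring the template fields (illustrative)."""
    name: str
    type: str
    duration_minutes: int
    objective: str
    fault: str                 # the "[fault]" part of the hypothesis
    expected_behavior: str     # the "[expected behavior]" part
    setup: List[str] = field(default_factory=list)
    execution: List[str] = field(default_factory=list)
    success_criteria: List[str] = field(default_factory=list)
    abort_conditions: List[str] = field(default_factory=list)
    rollback_steps: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject types that are not one of the four template categories.
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unknown scenario type: {self.type!r}")

    def hypothesis(self) -> str:
        """Render the hypothesis in the template's wording."""
        return f"When {self.fault}, the system will {self.expected_behavior}"

    def missing_sections(self) -> List[str]:
        """Names of sections this sketch treats as mandatory but that are empty."""
        required = ("execution", "success_criteria",
                    "abort_conditions", "rollback_steps")
        return [name for name in required if not getattr(self, name)]
```

A scenario with no rollback steps would report `["rollback_steps"]` from `missing_sections()`, which makes a convenient gate in a Week -1 review.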
Common Scenarios
Infrastructure:
□ Kill primary database instance
□ Network partition between zones
□ Full disk on critical service
□ Memory exhaustion
□ Certificate expiration
Application:
□ Deploy bad configuration
□ Overwhelm with traffic
□ Corrupt cache entries
□ Exhaust connection pool
□ API dependency failure
Data:
□ Restore from backup
□ Data corruption detection
□ Replication lag
□ Schema migration failure
Process:
□ Key team member unavailable
□ Credentials rotation
□ Access revocation
□ Runbook-only resolution
Preparation Phase
Stakeholder Communication
Communication plan:
Leadership:
- What: GameDay overview, risks, benefits
- When: Week -3 (approval)
- How: Meeting + document
Participating teams:
- What: Detailed plan, roles, expectations
- When: Week -2 (kickoff)
- How: Meeting + documentation
Adjacent teams:
- What: Notification, potential impact
- When: Week -1
- How: Email + calendar block
On-call:
- What: Extra vigilance, escalation paths
- When: Day before
- How: Briefing + runbook
Participant Briefing
Briefing contents:
- Objectives: What are we testing and why?
- Roles: Who does what during GameDay?
- Schedule: Timeline and scenario order
- Ground rules: What's allowed, what's not
- Safety: Kill switches, abort conditions
- Communication: Channels, updates, escalation
- Questions: Clear up any confusion
Monitoring Preparation
Before GameDay:
- Verify dashboards work
  - All relevant metrics visible
  - Baselines understood
- Configure extra alerting
  - GameDay-specific alerts
  - Lower thresholds if needed
- Prepare queries
  - Log queries ready
  - Trace searches prepared
- Test recording
  - Screen recording if needed
  - Metrics export configured
- Clear noise
  - Suppress known alerts
  - Reduce background chatter
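"Baselines understood" is easier to act on when the baseline is a number. As a rough sketch, assuming metric samples have already been exported as a plain list of floats, a baseline can be summarized by its mean and standard deviation, and later readings compared against it. The function names and the three-sigma default are assumptions, not a recommendation for any particular monitoring system.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """Summarize pre-GameDay samples as (mean, sample standard deviation).

    Needs at least two samples, because stdev() is undefined for fewer.
    """
    return mean(samples), stdev(samples)


def is_abnormal(value: float, base_mean: float, base_stdev: float,
                tolerance: float = 3.0) -> bool:
    """True when value is more than `tolerance` standard deviations from the mean."""
    return abs(value - base_mean) > tolerance * base_stdev
```

For example, with request latencies sampled before the exercise, `is_abnormal(reading, *baseline(samples))` gives observers a quick, consistent answer to "is this normal?" during a scenario.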
Safety Measures
Required safety measures:
Kill switches:
- Immediate stop for each scenario
- Multiple people can trigger
- Tested before GameDay
Blast radius limits:
- Maximum affected users/traffic
- Automatic enforcement
- Clear escalation if exceeded
Rollback plans:
- Documented for each scenario
- Tested rollback procedures
- Time-limited scenarios
Communication:
- Dedicated channel
- Clear "STOP" command
- Status page ready to update
Customer protection:
- Synthetic traffic if possible
- Canary approach
- Quick customer comm ready
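The kill switch and blast radius rules above can be combined into a single check that runs before each injection step. This is a minimal in-process sketch in Python: the `KillSwitch` class and `may_continue` function are hypothetical, and a real setup would normally back the flag with something shared between machines (a feature flag or a chat command, for instance).

```python
import threading
from typing import Optional


class KillSwitch:
    """A stop flag that any participant can trigger (illustrative sketch)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._stopped = threading.Event()
        self.triggered_by: Optional[str] = None

    def trigger(self, who: str) -> None:
        # Record only the first person to trigger; later calls change nothing.
        with self._lock:
            if not self._stopped.is_set():
                self.triggered_by = who
                self._stopped.set()

    @property
    def stopped(self) -> bool:
        return self._stopped.is_set()


def may_continue(switch: KillSwitch, affected_fraction: float,
                 max_fraction: float) -> bool:
    """Allow the next injection step only if nobody has called STOP and the
    affected share of users/traffic is within the agreed blast radius."""
    return not switch.stopped and affected_fraction <= max_fraction
```

The executor would call `may_continue(...)` between steps and begin rollback as soon as it returns `False`.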
Execution Phase
Day-of Structure
Typical GameDay schedule:
08:00 - Pre-GameDay briefing
        └── Review objectives, roles, safety
08:30 - Monitoring baseline
        └── Capture normal state
09:00 - Scenario 1
        └── Execute, observe, document
10:30 - Break + quick debrief
11:00 - Scenario 2
        └── Execute, observe, document
12:30 - Lunch break
13:30 - Scenario 3
        └── Execute, observe, document
15:00 - Scenario 4 (if time)
16:00 - Hot debrief
        └── Initial observations
16:30 - Cleanup
        └── Ensure all reverted
Roles During Execution
GameDay Lead:
- Runs the overall exercise
- Makes go/no-go decisions
- Controls pacing
- Manages safety
Scenario Executor:
- Injects faults
- Monitors injection
- Has kill switch
- Reports status
Observers:
- Watch system behavior
- Document findings
- Note unexpected events
- Track metrics
Incident Responders:
- Act as if it is a real incident
- Follow runbooks
- Practice coordination
- Optionally go in without knowing the scenarios in advance
Scribe:
- Records timeline
- Documents decisions
- Captures quotes
- Notes action items
Documentation During Execution
Timeline template:
[TIME] [ACTOR] [ACTION/OBSERVATION]
09:00 GameDay Lead: Starting Scenario 1 - DB failover
09:01 Executor: Triggered primary DB shutdown
09:02 Observer: Alert fired: DB connection errors
09:03 Observer: Failover initiated automatically
09:05 Observer: Secondary promoted to primary
09:07 Responder: Services reconnected
09:10 Observer: Error rate returning to normal
09:12 GameDay Lead: Scenario 1 complete - success
Capture:
- Exact times
- Who did what
- System responses
- Deviations from expected
- Interesting observations
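A scribe working from a terminal could capture entries in the format above with a few lines of Python. The sketch below is only one way to do it; `Timeline`, `TimelineEntry`, and their methods are invented names, and timestamps default to the local clock unless one is passed explicitly.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class TimelineEntry:
    time: datetime
    actor: str
    note: str

    def render(self) -> str:
        # Matches the "[TIME] [ACTOR] [ACTION/OBSERVATION]" layout.
        return f"{self.time:%H:%M} {self.actor}: {self.note}"


class Timeline:
    """Collects entries and renders them in chronological order."""

    def __init__(self) -> None:
        self.entries: List[TimelineEntry] = []

    def record(self, actor: str, note: str,
               at: Optional[datetime] = None) -> TimelineEntry:
        # Use the supplied time for back-filled entries, else the current time.
        entry = TimelineEntry(at or datetime.now(), actor, note)
        self.entries.append(entry)
        return entry

    def render(self) -> str:
        ordered = sorted(self.entries, key=lambda e: e.time)
        return "\n".join(e.render() for e in ordered)
```

Because `render()` sorts by time, an observation noted a minute late can still be recorded with its true timestamp and will appear in the right place.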
Handling Real Incidents
If a real incident occurs during GameDay:
- STOP GameDay immediately: announce "GameDay paused - real incident"
- Assess the real incident: is it related to GameDay?
- Revert any GameDay changes that could be contributing
- Handle the real incident through the normal incident process
- Decide on continuation: resume or reschedule GameDay?
Always prioritize real incidents over GameDay.
Follow-Up Phase
Hot Debrief
Immediately after GameDay:
Duration: 30-60 minutes
Participants: All GameDay participants
Agenda:
- What happened? (5 min per scenario)
  - Timeline walk-through
  - Key observations
- What worked well?
  - Celebrate successes
  - Note effective practices
- What didn't work?
  - Issues discovered
  - Gaps in tools/process
- Initial action items
  - Quick fixes
  - Further investigation needed
- Next steps
  - Postmortem schedule
  - Owner assignments
Formal Postmortem
Within 1 week of GameDay:
GameDay Postmortem
Executive Summary
Brief overview of objectives, execution, outcomes
Scenarios Executed
| Scenario | Outcome | Key Findings |
|---|---|---|
| DB failover | Success | 3 min recovery |
| Network partition | Partial | Manual intervention needed |
Detailed Findings
Scenario 1: Database Failover
- Hypothesis: Automatic failover < 5 min
- Result: CONFIRMED (3 min actual)
- Observations: [Details]
Scenario 2: Network Partition
- Hypothesis: Services continue in degraded mode
- Result: PARTIALLY CONFIRMED
- Gap: Service X didn't handle the partition gracefully
- Observations: [Details]
Action Items
| Action | Owner | Priority | Due Date |
|---|---|---|---|
| Fix Service X partition handling | @engineer | P1 | 2024-02-01 |
| Update runbook for DB failover | @oncall | P2 | 2024-02-15 |
Recommendations for Next GameDay
- [Suggestion 1]
- [Suggestion 2]
Action Item Tracking
Every action item needs:
- Clear description
- Single owner
- Priority level
- Due date
- Definition of done
Track in:
- Issue tracker
- Dedicated dashboard
- Regular review meetings
Don't let action items languish. The point is to improve.
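The five required fields lend themselves to an automated check in a regular review meeting. The Python sketch below assumes action items are still open and have been exported from the issue tracker into simple records; the `ActionItem` class and `problems` function are illustrative names, and the comma test for multiple owners is a deliberately crude assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class ActionItem:
    """One open GameDay follow-up item (field names are illustrative)."""
    description: str
    owner: str                  # exactly one owner, e.g. "@engineer"
    priority: str               # e.g. "P1"
    due: Optional[date]
    definition_of_done: str


def problems(item: ActionItem, today: date) -> List[str]:
    """List what keeps this item from being trackable, in a fixed order."""
    found = []
    if not item.description.strip():
        found.append("missing description")
    if not item.owner.strip() or "," in item.owner:
        found.append("needs a single owner")
    if not item.priority.strip():
        found.append("missing priority")
    if item.due is None:
        found.append("missing due date")
    elif item.due < today:
        found.append("overdue")
    if not item.definition_of_done.strip():
        found.append("missing definition of done")
    return found
```

An empty list means the item meets all five requirements and is not past its due date; anything else is a prompt for the review meeting.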
Best Practices
Planning
- Start small: First GameDay should be simple
- Clear objectives: Know what you're testing
- Stakeholder buy-in: Get approval and support
- Thorough preparation: Don't rush the prep work
- Documented scenarios: Written plans, not in heads
Execution
- Safety first: Kill switches ready
- Communicate constantly: Everyone knows what's happening
- Document everything: You'll forget otherwise
- Stay on schedule: Don't let scenarios run over
- Be flexible: Adapt to unexpected situations
Follow-Up
- Debrief immediately: Hot debrief same day
- Formal postmortem: Within a week
- Track action items: Don't let them die
- Share learnings: Spread knowledge broadly
- Plan the next one: Make it a regular practice
Common Pitfalls
Pitfall: Scope creep
Fix: Strict scenario limits, time boxes
Pitfall: Insufficient preparation
Fix: Checklists, dry runs
Pitfall: No safety measures
Fix: Required kill switches, abort criteria
Pitfall: Skipping documentation
Fix: Dedicated scribe, templates
Pitfall: Orphaned action items
Fix: Tracked, owned, reviewed
Pitfall: Infrequent GameDays
Fix: Quarterly schedule, smaller scope
Maturity Progression
Level 1: Ad-hoc
- First GameDay
- Simple scenarios
- Manual execution
Level 2: Regular
- Quarterly GameDays
- Multiple scenarios
- Basic automation
Level 3: Integrated
- Monthly GameDays
- Complex scenarios
- Good documentation
- Action item tracking
Level 4: Continuous
- Weekly smaller drills
- Quarterly large GameDays
- Automated scenarios
- Metrics-driven improvement
Related Skills
- chaos-engineering-fundamentals: Continuous chaos experiments
- incident-response: Handling real incidents
- resilience-patterns: Building resilient systems