# Multi-Region Deployment
Comprehensive guide to deploying applications across multiple geographic regions for availability, performance, and disaster recovery.
## When to Use This Skill

- Designing globally distributed applications
- Implementing disaster recovery (DR)
- Reducing latency for global users
- Meeting data residency requirements
- Achieving high availability (99.99%+)
- Planning failover strategies
## Multi-Region Fundamentals

### Why Multi-Region?
```
Reasons for Multi-Region:

1. High Availability
   └── Survive region-wide failures
   └── Natural disasters, power outages
   └── Target: 99.99%+ uptime

2. Low Latency
   └── Serve users from nearest region
   └── Reduce round-trip time
   └── Better user experience

3. Data Residency
   └── GDPR, data sovereignty laws
   └── Keep data in specific countries
   └── Compliance requirements

4. Disaster Recovery
   └── Business continuity
   └── RTO/RPO requirements
   └── Regulatory requirements
```
Trade-offs:

Benefits:
- Higher availability
- Lower latency globally
- Compliance capability

Costs:
- Higher cost (2x-3x or more)
- Increased complexity
- Data consistency challenges
### Deployment Models
```
Model 1: Active-Passive (DR)

┌───────────────────┐        ┌───────────────────┐
│ PRIMARY (Active)  │        │SECONDARY (Passive)│
│  ┌─────────────┐  │        │  ┌─────────────┐  │
│  │     App     │  │  ──►   │  │     App     │  │
│  │   (Live)    │  │  Sync  │  │  (Standby)  │  │
│  └─────────────┘  │        │  └─────────────┘  │
│  ┌─────────────┐  │        │  ┌─────────────┐  │
│  │     DB      │  │  ──►   │  │     DB      │  │
│  │  (Primary)  │  │ Replic │  │  (Replica)  │  │
│  └─────────────┘  │        │  └─────────────┘  │
└───────────────────┘        └───────────────────┘
     All traffic                 Failover only
```
```
Model 2: Active-Active (Load Distributed)

┌───────────────────┐          ┌───────────────────┐
│     REGION A      │          │     REGION B      │
│  ┌─────────────┐  │  ◄────►  │  ┌─────────────┐  │
│  │     App     │  │  Users   │  │     App     │  │
│  │  (Active)   │  │  routed  │  │  (Active)   │  │
│  └─────────────┘  │    by    │  └─────────────┘  │
│  ┌─────────────┐  │ location │  ┌─────────────┐  │
│  │     DB      │  │  ◄────►  │  │     DB      │  │
│  │  (Primary)  │  │  Replic  │  │  (Primary)  │  │
│  └─────────────┘  │both ways │  └─────────────┘  │
└───────────────────┘          └───────────────────┘
  Serves Region A                Serves Region B
```
```
Model 3: Active-Active-Active (Global)

┌──────┐     ┌──────┐     ┌──────┐
│  US  │◄───►│  EU  │◄───►│ APAC │
│Active│     │Active│     │Active│
└──┬───┘     └──┬───┘     └──┬───┘
   │            │            │
   └────────────┼────────────┘
                │
  Global Load Balancer routes by location
```
## Region Selection

### Selection Criteria

```
Region Selection Factors:

1. User Location
   □ Where are your users?
   □ Latency requirements per region?
   □ User concentration (80/20 rule)?

2. Compliance Requirements
   □ Data residency laws (GDPR, etc.)
   □ Government regulations
   □ Industry requirements (HIPAA, PCI)

3. Cloud Provider Availability
   □ Not all services in all regions
   □ Service feature parity
   □ Regional pricing differences

4. Network Connectivity
   □ Internet exchange points
   □ Direct connect options
   □ Cross-region latency

5. Disaster Risk
   □ Natural disaster patterns
   □ Political stability
   □ Infrastructure reliability

6. Cost
   □ Compute/storage pricing varies
   □ Data transfer costs (egress)
   □ Support availability
```
### Common Region Pairs

```
Region Pair Strategy:

Americas:
- Primary:   US East (N. Virginia)
- Secondary: US West (Oregon) or US East (Ohio)
- Distance:  2,500-3,000 km
- Latency:   ~60ms

Europe:
- Primary:   EU West (Ireland)
- Secondary: EU Central (Frankfurt) or EU West (London)
- Distance:  ~1,000-1,500 km
- Latency:   ~20-30ms

Asia Pacific:
- Primary:   Singapore or Tokyo
- Secondary: Sydney or Mumbai
- Distance:  5,000-7,000 km
- Latency:   ~100-150ms

Global Triad:
- US East + EU West + Singapore/Tokyo
- Covers most global users
- <100ms to 80%+ of users
```

Latency figures are approximate inter-region round-trip times.
## Data Replication

### Replication Patterns
```
Pattern 1: Async Replication (Most Common)

Primary ──────► Replica
                lag: ms to seconds

✓ Lower latency for writes
✓ Primary not blocked by replica
✗ Potential data loss on failover (RPO > 0)
✗ Replication lag visible

Pattern 2: Sync Replication

Primary ◄─────► Replica
                both confirm

✓ No data loss on failover (RPO = 0)
✓ Strong consistency
✗ Higher write latency
✗ Availability coupled to both regions

Pattern 3: Semi-Sync Replication

Primary ──────► At least 1 replica (sync)
        └─────► Other replicas (async)

✓ Guaranteed durability for some replicas
✓ Balance of latency and durability
✗ More complex failure handling
```
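With async replication, replay lag is the quantity to watch, since it bounds your effective RPO at any moment. Below is a minimal lag probe, a sketch assuming PostgreSQL streaming replication and the psycopg2 driver; the DSN you pass is a placeholder for your own replica's connection string.

```python
# Hedged lag probe for PostgreSQL streaming replication (run against the
# replica). Connection details are placeholders.

import psycopg2

def replica_lag_seconds(replica_dsn: str) -> float:
    """Return seconds of replay lag on a PostgreSQL streaming replica."""
    with psycopg2.connect(replica_dsn) as conn:
        with conn.cursor() as cur:
            # pg_last_xact_replay_timestamp() is NULL on a primary or before
            # the replica has replayed any transaction.
            cur.execute(
                "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
            )
            lag = cur.fetchone()[0]
    return float(lag) if lag is not None else 0.0

# Usage: alert if replica_lag_seconds("postgresql://replica.eu/app") exceeds
# the RPO you promised for this tier.
```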
### Conflict Resolution

```
Multi-Primary Conflict Resolution:

Scenario: Same record updated in two regions simultaneously

Resolution Strategies:

1. Last Write Wins (LWW)
   └── Timestamp-based
   └── Simple but can lose data
   └── Clock sync important

2. First Write Wins
   └── First committed wins
   └── Later writes rejected or queued
   └── Good for "create once" data

3. Application-Level Resolution
   └── Custom merge logic
   └── Most flexible
   └── Most complex

4. CRDTs (Conflict-free Replicated Data Types)
   └── Mathematically guaranteed convergence
   └── Counters, sets, maps
   └── Good for specific use cases

Best Practice:
- Design to avoid conflicts where possible
- Partition data by region when appropriate
- Use single-primary for conflict-sensitive data
```
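To make the CRDT strategy concrete, here is a self-contained sketch of a G-Counter, one of the simplest CRDTs: each region increments only its own slot, and merge takes the element-wise maximum, which is commutative, associative, and idempotent, so replicas converge regardless of merge order. Region names are illustrative.

```python
# Minimal G-Counter CRDT sketch: a grow-only counter that converges across
# regions no matter how merges are ordered or repeated.

class GCounter:
    """Grow-only counter; one slot per region, merge is element-wise max."""

    def __init__(self, region: str):
        self.region = region
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.region] = self.counts.get(self.region, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # Element-wise max is commutative, associative, and idempotent,
        # which is what guarantees convergence.
        for region, count in other.counts.items():
            self.counts[region] = max(self.counts.get(region, 0), count)

# Concurrent increments in two regions converge after merging both ways.
us = GCounter("us-east-1")
eu = GCounter("eu-west-1")
us.increment(3)
eu.increment(2)
us.merge(eu)
eu.merge(us)
assert us.value() == eu.value() == 5
```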
## Failover Strategies

### Failover Types
```
Failover Types:

1. DNS-Based Failover

┌─────────────────────────────────────────┐
│ DNS Health Check                        │
│ ├── Check primary every 10-30s          │
│ ├── 3 consecutive failures = unhealthy  │
│ └── Update DNS to point to secondary    │
└─────────────────────────────────────────┘

RTO:  60-300 seconds (DNS TTL + propagation)
Pros: Simple, works with any app
Cons: Slow failover, DNS caching issues
```
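As one concrete example, AWS Route 53 implements this pattern with a health check plus PRIMARY/SECONDARY failover records. The boto3 sketch below is illustrative only; the hosted zone ID, domain names, and IP addresses are placeholders.

```python
# Sketch of DNS-based failover with Route 53 via boto3. Zone ID, domain,
# and addresses are placeholders; error handling is omitted for brevity.

import boto3

route53 = boto3.client("route53")

# Health check: probe the primary every 30s; 3 failures marks it unhealthy.
health = route53.create_health_check(
    CallerReference="primary-hc-001",  # must be unique per request
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",
        "Port": 443,
        "ResourcePath": "/health",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

def upsert_failover_record(role: str, ip: str, health_check_id: str | None) -> None:
    """Create one half of a PRIMARY/SECONDARY failover record pair."""
    record = {
        "Name": "app.example.com",
        "Type": "A",
        "SetIdentifier": role.lower(),
        "Failover": role,                  # "PRIMARY" or "SECONDARY"
        "TTL": 60,                         # keep TTL low to speed failover
        "ResourceRecords": [{"Value": ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

upsert_failover_record("PRIMARY", "203.0.113.10", health["HealthCheck"]["Id"])
upsert_failover_record("SECONDARY", "198.51.100.20", None)
```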
```
2. Load Balancer Failover

┌─────────────────────────────────────────┐
│ Global Load Balancer                    │
│ ├── Continuous health checks            │
│ ├── Instant routing changes             │
│ └── No DNS propagation wait             │
└─────────────────────────────────────────┘

RTO:  10-60 seconds
Pros: Fast, reliable
Cons: Requires GLB, potential single point of failure
```
```
3. Application-Level Failover

┌─────────────────────────────────────────┐
│ Client/App Aware                        │
│ ├── Client retries to alternate region  │
│ ├── SDK handles failover                │
│ └── No infrastructure dependency        │
└─────────────────────────────────────────┘

RTO:  1-10 seconds
Pros: Fastest, most control
Cons: Requires client changes
```
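A minimal client-side sketch of this pattern using the requests library: try regions in preference order and fall through on failure. The endpoints are illustrative.

```python
# Client-side failover sketch: walk an ordered list of regional endpoints
# and return the first successful response. URLs are illustrative.

import requests

REGION_ENDPOINTS = [
    "https://api.us-east-1.example.com",
    "https://api.eu-west-1.example.com",
    "https://api.ap-southeast-1.example.com",
]

def get_with_failover(path: str, timeout: float = 2.0) -> requests.Response:
    """Try each region in order; raise only if all regions fail."""
    last_error: Exception | None = None
    for base in REGION_ENDPOINTS:
        try:
            resp = requests.get(f"{base}{path}", timeout=timeout)
            if resp.status_code < 500:
                return resp        # healthy (or a client error: don't retry)
            last_error = RuntimeError(f"{base} returned {resp.status_code}")
        except requests.RequestException as exc:
            last_error = exc       # network failure: try the next region
    raise RuntimeError("all regions failed") from last_error

# Usage: response = get_with_failover("/v1/orders")
```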
### RTO and RPO

```
Recovery Objectives:

RTO (Recovery Time Objective):
└── Maximum acceptable downtime
└── Time from failure to recovery
└── Drives failover automation investment

RPO (Recovery Point Objective):
└── Maximum acceptable data loss
└── Time between last backup and failure
└── Drives replication strategy

Common Targets:

┌──────────────┬───────────┬───────────┬───────────────────┐
│ Tier         │ RTO       │ RPO       │ Strategy          │
├──────────────┼───────────┼───────────┼───────────────────┤
│ Critical     │ <1 min    │ 0         │ Active-active,    │
│              │           │           │ sync replication  │
├──────────────┼───────────┼───────────┼───────────────────┤
│ High         │ <15 min   │ <1 min    │ Active-passive,   │
│              │           │           │ hot standby       │
├──────────────┼───────────┼───────────┼───────────────────┤
│ Medium       │ <4 hours  │ <1 hour   │ Warm standby,     │
│              │           │           │ async replication │
├──────────────┼───────────┼───────────┼───────────────────┤
│ Low          │ <24 hours │ <24 hours │ Backup/restore,   │
│              │           │           │ pilot light       │
└──────────────┴───────────┴───────────┴───────────────────┘
```
## Traffic Routing

### Global Load Balancing

```
GLB Routing Policies:

1. Geolocation Routing
   └── Route by user's geographic location
   └── Europe users → EU region
   └── Fallback for unmapped locations

2. Latency-Based Routing
   └── Route to lowest-latency region
   └── Based on real measurements
   └── Adapts to network conditions

3. Weighted Routing
   └── Split traffic by percentage
   └── Good for rollouts, testing
   └── Example: 90% primary, 10% secondary

4. Failover Routing
   └── Primary region until unhealthy
   └── Automatic switch to secondary
   └── Health-check driven
```
Cloud Implementations:
- AWS: Route 53, Global Accelerator
- Azure: Traffic Manager, Front Door
- GCP: Cloud Load Balancing
- Cloudflare: Load Balancing
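To illustrate weighted routing without tying the example to one provider, here is a toy in-process router that picks a region in proportion to its weight; real GLBs apply the same idea at the DNS or edge layer. Region names and weights are illustrative.

```python
# Toy weighted router for the 90/10 split described above: choose a region
# per request with probability proportional to its weight.

import random

REGION_WEIGHTS = {"us-east-1": 90, "eu-west-1": 10}

def pick_region(weights: dict[str, int] = REGION_WEIGHTS) -> str:
    """Choose a region with probability proportional to its weight."""
    regions = list(weights)
    return random.choices(regions, weights=[weights[r] for r in regions], k=1)[0]

# Shifting the weights gradually (90/10 -> 50/50 -> 0/100) gives a safe
# rollout or region-migration path.
```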
### Session Handling

```
Session Affinity in Multi-Region:

Challenge: User session state across regions

Option 1: Sticky Sessions
└── User stays in same region for session
└── Failover loses session
└── Simple but limited DR

Option 2: Centralized Session Store
└── Session in Redis/database
└── All regions access same store
└── Adds latency, single point of failure

Option 3: Distributed Session Store
└── Redis Cluster across regions
└── Session replicated
└── Complex but resilient

Option 4: Stateless (JWT/Token)
└── Session in client-side token
└── No server-side state
└── Best for multi-region
```
Recommendation:
- Prefer stateless where possible
- If stateful, use distributed store
- Design for session loss on failover
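As a sketch of the stateless option (Option 4), the example below uses the PyJWT library: the session lives entirely in a signed token, so any region holding the verification key can serve the user with no shared session store. The key and claims are illustrative; a production setup would prefer asymmetric keys and rotation.

```python
# Stateless session sketch with PyJWT: every region validates tokens with
# the same key, so failover loses no session state. The key is a
# placeholder; manage and rotate real keys properly.

import time
import jwt  # PyJWT

SECRET = "shared-secret-distributed-to-all-regions"  # illustrative only

def issue_session(user_id: str, ttl_seconds: int = 3600) -> str:
    claims = {
        "sub": user_id,
        "iat": int(time.time()),
        "exp": int(time.time()) + ttl_seconds,
        "region": "us-east-1",  # issuing region, informational only
    }
    return jwt.encode(claims, SECRET, algorithm="HS256")

def validate_session(token: str) -> dict:
    # Raises jwt.ExpiredSignatureError / jwt.InvalidTokenError on bad tokens;
    # works identically in every region with no cross-region lookup.
    return jwt.decode(token, SECRET, algorithms=["HS256"])
```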
## Database Patterns

### Database Deployment Options
```
Option 1: Single Primary + Read Replicas

┌───────────────┐          ┌───────────────┐
│    US-EAST    │          │    EU-WEST    │
│  ┌─────────┐  │   ───►   │  ┌─────────┐  │
│  │ Primary │  │   Async  │  │ Replica │  │
│  │  (R/W)  │  │   Replic │  │ (Read)  │  │
│  └─────────┘  │          │  └─────────┘  │
└───────────────┘          └───────────────┘
```
- Writes go to primary region
- Reads served locally
- Failover promotes replica
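A small routing sketch for Option 1: writes always cross to the primary region, reads stay local. The DSNs and hostnames are hypothetical.

```python
# Hypothetical connection router for Option 1: writes go to the primary
# region, reads go to the local replica. Hostnames are illustrative.

PRIMARY_DSN = "postgresql://db.us-east-1.internal:5432/app"
REPLICA_DSNS = {
    "us-east-1": "postgresql://db.us-east-1.internal:5432/app",
    "eu-west-1": "postgresql://replica.eu-west-1.internal:5432/app",
}

def dsn_for(operation: str, local_region: str) -> str:
    """Route writes to the primary; serve reads from the nearest replica."""
    if operation == "write":
        return PRIMARY_DSN
    # Fall back to the primary if the local region has no replica.
    return REPLICA_DSNS.get(local_region, PRIMARY_DSN)

# Usage: connect with dsn_for("read", "eu-west-1") for local reads.
```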
```
Option 2: Multi-Primary (Active-Active)

┌───────────────┐          ┌───────────────┐
│    US-EAST    │◄────────►│    EU-WEST    │
│  ┌─────────┐  │  Bi-dir  │  ┌─────────┐  │
│  │ Primary │  │  Replic  │  │ Primary │  │
│  │  (R/W)  │  │          │  │  (R/W)  │  │
│  └─────────┘  │          │  └─────────┘  │
└───────────────┘          └───────────────┘
```
- Writes accepted in both regions
- Conflict resolution required
- Complex but lowest latency
```
Option 3: Globally Distributed Database

┌─────────────────────────────────────────┐
│    CockroachDB / Spanner / YugabyteDB   │
│                                         │
│   ┌─────┐      ┌─────┐      ┌─────┐     │
│   │ US  │──────│ EU  │──────│ APAC│     │
│   └─────┘      └─────┘      └─────┘     │
│                                         │
│   Automatic sharding and replication    │
└─────────────────────────────────────────┘
```
- Database handles distribution
- Strong consistency available
- Higher latency for writes
## Testing and Validation

### Chaos Engineering for Multi-Region

```
Multi-Region Chaos Tests:

1. Region Failover Test
   □ Fail primary region completely
   □ Measure failover time
   □ Verify data integrity
   □ Test user experience

2. Network Partition Test
   □ Block inter-region communication
   □ Verify split-brain handling
   □ Test conflict resolution

3. Partial Failure Test
   □ Fail subset of services in region
   □ Test degraded operation
   □ Verify monitoring/alerting

4. Data Replication Lag Test
   □ Introduce artificial lag
   □ Test application behavior
   □ Verify consistency expectations

5. Failback Test
   □ Restore failed region
   □ Test data sync
   □ Test traffic redistribution
```
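One way to put a number on "measure failover time" from the first test above: poll a health endpoint once per second and report how long it stays unreachable. A rough sketch assuming the requests library; the URL is a placeholder for your own endpoint.

```python
# Rough failover timer: report seconds from first observed failure until
# the service responds again, i.e. the observed RTO for this drill.

import time
import requests

def measure_failover(url: str, timeout: float = 2.0, max_wait: float = 600.0) -> float:
    """Return seconds from first observed failure until service recovers."""
    failed_at = None
    deadline = time.monotonic() + max_wait
    while time.monotonic() < deadline:
        try:
            ok = requests.get(url, timeout=timeout).status_code == 200
        except requests.RequestException:
            ok = False
        now = time.monotonic()
        if not ok and failed_at is None:
            failed_at = now           # outage begins
        elif ok and failed_at is not None:
            return now - failed_at    # recovered: report observed RTO
        time.sleep(1)
    raise TimeoutError("no failure-and-recovery observed within max_wait")
```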
Schedule:
- Failover tests: Monthly
- Full DR drill: Quarterly
- Chaos experiments: Weekly
## Best Practices

```
Multi-Region Best Practices:

1. Design for Failure
   □ Assume any region can fail
   □ No single points of failure
   □ Automated failover
   □ Regular testing

2. Data Strategy
   □ Define consistency requirements
   □ Choose appropriate replication
   □ Plan for conflicts
   □ Consider data residency

3. Observability
   □ Cross-region metrics
   □ Distributed tracing
   □ Centralized logging
   □ Region-aware alerting

4. Cost Management
   □ Right-size standby resources
   □ Use reserved capacity wisely
   □ Monitor data transfer costs
   □ Consider traffic patterns

5. Operational Readiness
   □ Runbooks for failover
   □ Regular DR drills
   □ On-call training
   □ Post-incident reviews
```
## Related Skills

- `latency-optimization`: Reducing global latency
- `distributed-consensus`: Consistency patterns
- `cdn-architecture`: Edge caching for multi-region
- `chaos-engineering-fundamentals`: Testing resilience