Blue-Green & Deployment Strategies
Implement zero-downtime deployment patterns for production systems.
When to Use This Skill
Use this skill when:
-
Implementing zero-downtime deployments
-
Reducing deployment risk
-
Enabling instant rollbacks
-
Running canary releases
-
Performing A/B testing in production
Prerequisites
-
Load balancer or ingress controller
-
Container orchestration (K8s) or cloud platform
-
CI/CD pipeline
-
Health check endpoints
Deployment Strategy Overview
┌─────────────────────────────────────────────────────────────┐ │ DEPLOYMENT STRATEGIES │ ├─────────────┬─────────────┬─────────────┬──────────────────┤ │ Blue-Green │ Canary │ Rolling │ Recreate │ ├─────────────┼─────────────┼─────────────┼──────────────────┤ │ Full env │ Gradual % │ Pod by pod │ All at once │ │ swap │ rollout │ replacement │ │ ├─────────────┼─────────────┼─────────────┼──────────────────┤ │ Instant │ Slow, safe │ Moderate │ Fast, risky │ │ rollback │ rollback │ rollback │ │ ├─────────────┼─────────────┼─────────────┼──────────────────┤ │ 2x resources│ +10-25% │ Same │ Same │ │ needed │ resources │ resources │ │ └─────────────┴─────────────┴─────────────┴──────────────────┘
Blue-Green Deployment
Concept
Before: ┌─────────┐ ┌───────────────┐ │ Users │────▶│ Blue (v1) │ ◀── Active └─────────┘ └───────────────┘ ┌───────────────┐ │ Green (v2) │ ◀── Staging └───────────────┘
After Switch: ┌─────────┐ ┌───────────────┐ │ Users │ │ Blue (v1) │ ◀── Standby └─────────┘ └───────────────┘ │ ┌───────────────┐ └────────▶│ Green (v2) │ ◀── Active └───────────────┘
Kubernetes Implementation
blue-deployment.yaml
apiVersion: apps/v1 kind: Deployment metadata: name: myapp-blue labels: app: myapp version: blue spec: replicas: 3 selector: matchLabels: app: myapp version: blue template: metadata: labels: app: myapp version: blue spec: containers: - name: myapp image: myapp:v1.0.0 ports: - containerPort: 8080 readinessProbe: httpGet: path: /health port: 8080 initialDelaySeconds: 5 periodSeconds: 5
green-deployment.yaml
apiVersion: apps/v1 kind: Deployment metadata: name: myapp-green labels: app: myapp version: green spec: replicas: 3 selector: matchLabels: app: myapp version: green template: metadata: labels: app: myapp version: green spec: containers: - name: myapp image: myapp:v2.0.0 ports: - containerPort: 8080 readinessProbe: httpGet: path: /health port: 8080 initialDelaySeconds: 5 periodSeconds: 5
service.yaml - Switch by changing selector
apiVersion: v1 kind: Service metadata: name: myapp spec: selector: app: myapp version: blue # Change to 'green' to switch ports:
- port: 80 targetPort: 8080
Switch Script
#!/bin/bash
blue-green-switch.sh
CURRENT=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}') NEW_VERSION=$1
echo "Current version: $CURRENT" echo "Switching to: $NEW_VERSION"
Verify new deployment is ready
kubectl rollout status deployment/myapp-$NEW_VERSION
Check health
HEALTH=$(kubectl exec -it deployment/myapp-$NEW_VERSION -- curl -s localhost:8080/health) if [ "$HEALTH" != "ok" ]; then echo "Health check failed" exit 1 fi
Switch traffic
kubectl patch svc myapp -p "{"spec":{"selector":{"version":"$NEW_VERSION"}}}"
echo "Switched to $NEW_VERSION"
AWS ECS Blue-Green
AWS CodeDeploy appspec.yml
version: 0.0 Resources:
- TargetService: Type: AWS::ECS::Service Properties: TaskDefinition: "arn:aws:ecs:region:account:task-definition/myapp:2" LoadBalancerInfo: ContainerName: "myapp" ContainerPort: 8080 Hooks:
- BeforeInstall: "LambdaFunctionToValidateBeforeTrafficShift"
- AfterInstall: "LambdaFunctionToValidateAfterTrafficShift"
- AfterAllowTestTraffic: "LambdaFunctionToValidateTestTraffic"
- BeforeAllowTraffic: "LambdaFunctionToValidateBeforeAllowTraffic"
- AfterAllowTraffic: "LambdaFunctionToValidateAfterAllowTraffic"
Canary Deployment
Kubernetes with Istio
VirtualService for traffic splitting
apiVersion: networking.istio.io/v1beta1 kind: VirtualService metadata: name: myapp spec: hosts:
- myapp http:
- match:
- headers: x-canary: exact: "true" route:
- destination: host: myapp subset: canary
- route:
- destination: host: myapp subset: stable weight: 90
- destination: host: myapp subset: canary weight: 10
apiVersion: networking.istio.io/v1beta1 kind: DestinationRule metadata: name: myapp spec: host: myapp subsets:
- name: stable labels: version: stable
- name: canary labels: version: canary
Argo Rollouts
apiVersion: argoproj.io/v1alpha1 kind: Rollout metadata: name: myapp spec: replicas: 5 strategy: canary: steps: - setWeight: 10 - pause: {duration: 5m} - setWeight: 25 - pause: {duration: 5m} - setWeight: 50 - pause: {duration: 5m} - setWeight: 75 - pause: {duration: 5m} analysis: templates: - templateName: success-rate startingStep: 2 args: - name: service-name value: myapp selector: matchLabels: app: myapp template: metadata: labels: app: myapp spec: containers: - name: myapp image: myapp:v2.0.0 ports: - containerPort: 8080
apiVersion: argoproj.io/v1alpha1 kind: AnalysisTemplate metadata: name: success-rate spec: args:
- name: service-name metrics:
- name: success-rate interval: 1m successCondition: result[0] >= 0.95 failureLimit: 3 provider: prometheus: address: http://prometheus:9090 query: | sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.*"}[5m])) / sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
Rolling Deployment
Kubernetes Default
apiVersion: apps/v1 kind: Deployment metadata: name: myapp spec: replicas: 5 strategy: type: RollingUpdate rollingUpdate: maxSurge: 1 # Max pods above desired maxUnavailable: 0 # Max pods unavailable selector: matchLabels: app: myapp template: metadata: labels: app: myapp spec: containers: - name: myapp image: myapp:v2.0.0 ports: - containerPort: 8080 readinessProbe: httpGet: path: /ready port: 8080 initialDelaySeconds: 5 periodSeconds: 5 livenessProbe: httpGet: path: /health port: 8080 initialDelaySeconds: 10 periodSeconds: 10
Rolling Update Commands
Update image
kubectl set image deployment/myapp myapp=myapp:v2.0.0
Watch rollout
kubectl rollout status deployment/myapp
Pause rollout
kubectl rollout pause deployment/myapp
Resume rollout
kubectl rollout resume deployment/myapp
Rollback
kubectl rollout undo deployment/myapp
Rollback to specific revision
kubectl rollout undo deployment/myapp --to-revision=2
View history
kubectl rollout history deployment/myapp
Health Checks
Comprehensive Health Endpoint
Flask health endpoint
from flask import Flask, jsonify import psycopg2 import redis
app = Flask(name)
@app.route('/health') def health(): """Liveness probe - is the app running?""" return jsonify({'status': 'healthy'}), 200
@app.route('/ready') def ready(): """Readiness probe - can the app serve traffic?""" checks = {}
# Database check
try:
conn = psycopg2.connect(DATABASE_URL)
conn.close()
checks['database'] = 'ok'
except Exception as e:
checks['database'] = str(e)
return jsonify({'status': 'unhealthy', 'checks': checks}), 503
# Redis check
try:
r = redis.from_url(REDIS_URL)
r.ping()
checks['redis'] = 'ok'
except Exception as e:
checks['redis'] = str(e)
return jsonify({'status': 'unhealthy', 'checks': checks}), 503
return jsonify({'status': 'healthy', 'checks': checks}), 200
Rollback Procedures
Automated Rollback
#!/bin/bash
auto-rollback.sh
DEPLOYMENT=$1 THRESHOLD=0.95 INTERVAL=60
echo "Monitoring deployment $DEPLOYMENT"
while true; do
Get success rate from Prometheus
SUCCESS_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=sum(rate(http_requests_total{status=~\"2.*\"}[5m]))/sum(rate(http_requests_total[5m]))" | jq -r '.data.result[0].value[1]')
echo "Current success rate: $SUCCESS_RATE"
if (( $(echo "$SUCCESS_RATE < $THRESHOLD" | bc -l) )); then echo "Success rate below threshold! Rolling back..." kubectl rollout undo deployment/$DEPLOYMENT exit 1 fi
sleep $INTERVAL done
Manual Rollback Checklist
Rollback Checklist
Before Rollback
- Confirm issue is deployment-related
- Document current error rates
- Notify team in #deployments channel
During Rollback
- Execute rollback command
- Monitor rollback progress
- Verify old version is serving traffic
After Rollback
- Confirm error rates normalized
- Update incident ticket
- Schedule post-mortem
Common Issues
Issue: Slow Deployments
Problem: Rollout takes too long Solution: Increase maxSurge, decrease minReadySeconds
Issue: Failed Health Checks
Problem: Pods not becoming ready Solution: Check probe endpoints, increase timeouts
Issue: Traffic During Rollback
Problem: Errors during switch Solution: Use connection draining, implement graceful shutdown
Best Practices
-
Always implement health checks
-
Use connection draining
-
Test rollback procedures regularly
-
Monitor key metrics during deployment
-
Implement circuit breakers
-
Use deployment slots/environments
-
Automate deployment verification
-
Document rollback procedures
Related Skills
-
kubernetes-ops - K8s deployment basics
-
argocd-gitops - GitOps deployments
-
feature-flags - Progressive rollout