Blue-Green & Deployment Strategies

Implement zero-downtime deployment patterns for production systems.

When to Use This Skill

Use this skill when:

Implementing zero-downtime deployments
Reducing deployment risk
Enabling instant rollbacks
Running canary releases
Performing A/B testing in production

Prerequisites

Load balancer or ingress controller
Container orchestration (K8s) or cloud platform
CI/CD pipeline
Health check endpoints

Deployment Strategy Overview

┌────────────────────────────────────────────────────────────┐
│                   DEPLOYMENT STRATEGIES                    │
├─────────────┬─────────────┬─────────────┬──────────────────┤
│ Blue-Green  │ Canary      │ Rolling     │ Recreate         │
├─────────────┼─────────────┼─────────────┼──────────────────┤
│ Full env    │ Gradual %   │ Pod by pod  │ All at once      │
│ swap        │ rollout     │ replacement │                  │
├─────────────┼─────────────┼─────────────┼──────────────────┤
│ Instant     │ Slow, safe  │ Moderate    │ Fast, risky      │
│ rollback    │ rollback    │ rollback    │                  │
├─────────────┼─────────────┼─────────────┼──────────────────┤
│ 2x resources│ +10-25%     │ Same        │ Same             │
│ needed      │ resources   │ resources   │                  │
└─────────────┴─────────────┴─────────────┴──────────────────┘

Blue-Green Deployment

Concept

Before:
┌─────────┐     ┌───────────────┐
│  Users  │────▶│   Blue (v1)   │ ◀── Active
└─────────┘     └───────────────┘
                ┌───────────────┐
                │  Green (v2)   │ ◀── Staging
                └───────────────┘

After Switch:
┌─────────┐     ┌───────────────┐
│  Users  │     │   Blue (v1)   │ ◀── Standby
└─────────┘     └───────────────┘
     │          ┌───────────────┐
     └─────────▶│  Green (v2)   │ ◀── Active
                └───────────────┘

Kubernetes Implementation

blue-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
  labels:
    app: myapp
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: myapp
          image: myapp:v1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

green-deployment.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
  labels:
    app: myapp
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5

service.yaml - Switch by changing selector

apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch
  ports:
    - port: 80
      targetPort: 8080

Switch Script

#!/bin/bash
# blue-green-switch.sh
set -euo pipefail

NEW_VERSION=${1:?"Usage: $0 <blue|green>"}
CURRENT=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')

echo "Current version: $CURRENT"
echo "Switching to: $NEW_VERSION"

# Verify new deployment is ready
kubectl rollout status "deployment/myapp-$NEW_VERSION"

# Check health. No -it here: a script has no TTY. curl -f fails on HTTP errors,
# so the exit code is checked rather than the response body.
if ! kubectl exec "deployment/myapp-$NEW_VERSION" -- curl -sf localhost:8080/health > /dev/null; then
  echo "Health check failed"
  exit 1
fi

# Switch traffic (inner quotes escaped so the patch is valid JSON)
kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$NEW_VERSION\"}}}"

echo "Switched to $NEW_VERSION"
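The traffic switch depends on the patch body being valid JSON, which is easy to break with nested shell quoting. A minimal sketch of building that body programmatically, assuming the Service selects on the `version` label as shown above. `build_selector_patch` and `ALLOWED_VERSIONS` are hypothetical names for illustration, not part of kubectl:

```python
import json

# Assumption: only these two environments exist.
ALLOWED_VERSIONS = {"blue", "green"}


def build_selector_patch(new_version: str) -> str:
    """Return the JSON body to pass to `kubectl patch svc myapp -p ...`."""
    if new_version not in ALLOWED_VERSIONS:
        raise ValueError(f"unknown version: {new_version!r}")
    return json.dumps({"spec": {"selector": {"version": new_version}}})


if __name__ == "__main__":
    print(build_selector_patch("green"))
```

Rejecting unknown values before patching avoids pointing the Service at a selector that matches no pods.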

AWS ECS Blue-Green

AWS CodeDeploy appspec.yml

version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "arn:aws:ecs:region:account:task-definition/myapp:2"
        LoadBalancerInfo:
          ContainerName: "myapp"
          ContainerPort: 8080
Hooks:
  - BeforeInstall: "LambdaFunctionToValidateBeforeTrafficShift"
  - AfterInstall: "LambdaFunctionToValidateAfterTrafficShift"
  - AfterAllowTestTraffic: "LambdaFunctionToValidateTestTraffic"
  - BeforeAllowTraffic: "LambdaFunctionToValidateBeforeAllowTraffic"
  - AfterAllowTraffic: "LambdaFunctionToValidateAfterAllowTraffic"

Canary Deployment

Kubernetes with Istio

VirtualService for traffic splitting

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp
  http:
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: myapp
            subset: canary
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 90
        - destination:
            host: myapp
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
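The routing rules above apply in order: a request carrying `x-canary: true` always goes to the canary subset, and everything else is split 90/10 by weight. A rough sketch of that decision logic follows. It is a simplification: Istio chooses per request according to the weights, while this sketch hashes a request ID so the result is repeatable. `pick_subset` and `request_id` are hypothetical names.

```python
import hashlib


def pick_subset(request_id: str, headers: dict, canary_weight: int = 10) -> str:
    """Mimic the VirtualService: header match first, then a weighted split."""
    if headers.get("x-canary") == "true":
        return "canary"
    # Map the ID to a bucket in 0..99; the same ID always lands in the same bucket.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_weight else "stable"


if __name__ == "__main__":
    ids = [f"req-{i}" for i in range(10_000)]
    share = sum(pick_subset(i, {}) == "canary" for i in ids) / len(ids)
    print(f"canary share without header: {share:.3f}")
```

The header rule is useful for testers who want to hit the canary on purpose, regardless of the current weight.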

Argo Rollouts

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 75
        - pause: {duration: 5m}
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
        args:
          - name: service-name
            value: myapp
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0
          ports:
            - containerPort: 8080
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.*"}[5m])) /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
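To reason about how `successCondition` and `failureLimit` interact, the analysis can be modeled as a loop over measurements. This is a sketch under an assumption worth checking against the Argo Rollouts documentation: the run fails once the number of failed measurements exceeds `failureLimit`. `evaluate_analysis` is a hypothetical helper, not an Argo API.

```python
from typing import Iterable


def evaluate_analysis(
    measurements: Iterable[float],
    threshold: float = 0.95,
    failure_limit: int = 3,
) -> str:
    """Return 'Failed' once failures exceed failure_limit, else 'Successful'."""
    failures = 0
    for value in measurements:
        # Mirrors successCondition: result[0] >= 0.95
        if value < threshold:
            failures += 1
            if failures > failure_limit:
                return "Failed"
    return "Successful"


if __name__ == "__main__":
    print(evaluate_analysis([0.99, 0.90, 0.98, 0.97]))
    print(evaluate_analysis([0.90, 0.91, 0.92, 0.93]))
```

Under this model, a few noisy one-minute samples do not abort the rollout, but a sustained drop does.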

Rolling Deployment

Kubernetes Default

apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max pods above desired
      maxUnavailable: 0  # Max pods unavailable
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
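The `maxSurge` and `maxUnavailable` settings bound how many pods exist and how many stay available while the rollout runs. A small sketch of that arithmetic, assuming the Kubernetes rule that a percentage `maxSurge` rounds up and a percentage `maxUnavailable` rounds down. `rollout_bounds` and `resolve` are hypothetical helpers.

```python
import math
from typing import Tuple, Union

IntOrPercent = Union[int, str]


def resolve(value: IntOrPercent, replicas: int, round_up: bool) -> int:
    """Turn an int or a percentage string such as '25%' into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        exact = replicas * int(value[:-1]) / 100
        return math.ceil(exact) if round_up else math.floor(exact)
    return int(value)


def rollout_bounds(
    replicas: int, max_surge: IntOrPercent, max_unavailable: IntOrPercent
) -> Tuple[int, int]:
    """Return (max total pods, min available pods) during a rolling update."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable


if __name__ == "__main__":
    print(rollout_bounds(5, 1, 0))          # the manifest above
    print(rollout_bounds(10, "25%", "25%"))
```

For the manifest above this gives at most 6 pods and never fewer than 5 available, which is why the rollout replaces one pod at a time.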

Rolling Update Commands

Update image

kubectl set image deployment/myapp myapp=myapp:v2.0.0

Watch rollout

kubectl rollout status deployment/myapp

Pause rollout

kubectl rollout pause deployment/myapp

Resume rollout

kubectl rollout resume deployment/myapp

Rollback

kubectl rollout undo deployment/myapp

Rollback to specific revision

kubectl rollout undo deployment/myapp --to-revision=2

View history

kubectl rollout history deployment/myapp

Health Checks

Comprehensive Health Endpoint

Flask health endpoint

import os

from flask import Flask, jsonify
import psycopg2
import redis

# Assumption: connection strings are supplied through the environment
DATABASE_URL = os.environ['DATABASE_URL']
REDIS_URL = os.environ['REDIS_URL']

app = Flask(__name__)


@app.route('/health')
def health():
    """Liveness probe - is the app running?"""
    return jsonify({'status': 'healthy'}), 200


@app.route('/ready')
def ready():
    """Readiness probe - can the app serve traffic?"""
    checks = {}

    # Database check
    try:
        conn = psycopg2.connect(DATABASE_URL)
        conn.close()
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = str(e)
        return jsonify({'status': 'unhealthy', 'checks': checks}), 503

    # Redis check
    try:
        r = redis.from_url(REDIS_URL)
        r.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = str(e)
        return jsonify({'status': 'unhealthy', 'checks': checks}), 503

    return jsonify({'status': 'healthy', 'checks': checks}), 200
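The readiness endpoint above returns on the first failing dependency, so later dependencies go unreported. One possible variant runs every check and reports all results at once. This is a framework-free sketch; `run_checks` is a hypothetical helper, and each check is assumed to be a callable that raises on failure.

```python
from typing import Callable, Dict, Tuple


def run_checks(checks: Dict[str, Callable[[], None]]) -> Tuple[dict, int]:
    """Run every check and return (response body, HTTP status code)."""
    results = {}
    failed = False
    for name, check in checks.items():
        try:
            check()  # a check passes if it returns without raising
            results[name] = "ok"
        except Exception as exc:
            results[name] = str(exc)
            failed = True
    status = "unhealthy" if failed else "healthy"
    return {"status": status, "checks": results}, 503 if failed else 200


if __name__ == "__main__":
    def broken_redis() -> None:
        raise ConnectionError("redis unreachable")

    print(run_checks({"database": lambda: None, "redis": broken_redis}))
```

In a Flask view the tuple could be returned as `jsonify(body), code`, keeping the 503 that tells Kubernetes to stop routing traffic to the pod.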

Rollback Procedures

Automated Rollback

#!/bin/bash
# auto-rollback.sh
set -uo pipefail

DEPLOYMENT=${1:?"Usage: $0 <deployment>"}
THRESHOLD=0.95
INTERVAL=60
QUERY='sum(rate(http_requests_total{status=~"2.*"}[5m]))/sum(rate(http_requests_total[5m]))'

echo "Monitoring deployment $DEPLOYMENT"

while true; do
  # Get success rate from Prometheus.
  # -G with --data-urlencode escapes the braces and brackets in the query.
  SUCCESS_RATE=$(curl -sG "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=$QUERY" | jq -r '.data.result[0].value[1] // empty')

  echo "Current success rate: ${SUCCESS_RATE:-unavailable}"

  # Skip the comparison when Prometheus returned no sample
  if [ -n "$SUCCESS_RATE" ] && (( $(echo "$SUCCESS_RATE < $THRESHOLD" | bc -l) )); then
    echo "Success rate below threshold! Rolling back..."
    kubectl rollout undo "deployment/$DEPLOYMENT"
    exit 1
  fi

  sleep "$INTERVAL"
done
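The rollback decision hinges on reading one number out of the Prometheus instant-query response, and the script has to cope with an empty result. A minimal sketch of that parsing step, assuming the standard `/api/v1/query` response shape. `extract_rate` and `should_roll_back` are hypothetical helpers, and treating "no data" as "do not roll back" is a policy choice rather than a rule.

```python
import math
from typing import Optional


def extract_rate(response: dict) -> Optional[float]:
    """Pull the first sample value out of an instant-query response."""
    if response.get("status") != "success":
        return None
    result = response.get("data", {}).get("result", [])
    if not result:
        return None
    # Each sample is [timestamp, "value-as-string"].
    return float(result[0]["value"][1])


def should_roll_back(response: dict, threshold: float = 0.95) -> bool:
    """Roll back only when a usable success rate is below the threshold."""
    rate = extract_rate(response)
    if rate is None or math.isnan(rate):  # no traffic or failed query
        return False
    return rate < threshold


if __name__ == "__main__":
    sample = {
        "status": "success",
        "data": {"resultType": "vector",
                 "result": [{"metric": {}, "value": [1700000000, "0.91"]}]},
    }
    print(should_roll_back(sample))
```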

Manual Rollback Checklist

Rollback Checklist

Before Rollback

Confirm issue is deployment-related
Document current error rates
Notify team in #deployments channel

During Rollback

Execute rollback command
Monitor rollback progress
Verify old version is serving traffic

After Rollback

Confirm error rates normalized
Update incident ticket
Schedule post-mortem

Common Issues

Issue: Slow Deployments

Problem: Rollout takes too long
Solution: Increase maxSurge, decrease minReadySeconds

Issue: Failed Health Checks

Problem: Pods not becoming ready
Solution: Check probe endpoints, increase timeouts

Issue: Traffic During Rollback

Problem: Errors during switch
Solution: Use connection draining, implement graceful shutdown
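Connection draining means the old version stops accepting new requests but finishes the ones already in flight before it exits. One way to sketch the idea in-process is shown below. `DrainGate` is a hypothetical class, and a real service would typically call `drain` from its SIGTERM handler with a timeout shorter than the pod's termination grace period.

```python
import threading


class DrainGate:
    """Track in-flight requests and refuse new ones once draining starts."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False
        self._idle = threading.Event()
        self._idle.set()  # nothing in flight yet

    def try_enter(self) -> bool:
        """Call at request start; False means reject (e.g. with a 503)."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def leave(self) -> None:
        """Call when a request finishes."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def drain(self, timeout: float) -> bool:
        """Stop admitting requests; True if in-flight work finished in time."""
        with self._lock:
            self._draining = True
        return self._idle.wait(timeout)
```

A request handler would wrap its work in `try_enter()` and `leave()`, and the shutdown path would exit only after `drain()` returns or times out.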

Best Practices

Always implement health checks
Use connection draining
Test rollback procedures regularly
Monitor key metrics during deployment
Implement circuit breakers
Use deployment slots/environments
Automate deployment verification
Document rollback procedures

Related Skills

kubernetes-ops - K8s deployment basics
argocd-gitops - GitOps deployments
feature-flags - Progressive rollout
