DevOps Expert
Expert guidance for DevOps practices, culture, CI/CD pipelines, infrastructure automation, and operational excellence.
Core Concepts
DevOps Culture
-
Collaboration and communication
-
Shared responsibility
-
Continuous improvement
-
Breaking down silos
-
Blameless culture
-
Measuring everything
Automation
-
Infrastructure as Code (IaC)
-
Configuration management
-
Deployment automation
-
Testing automation
-
Monitoring automation
-
Self-service platforms
CI/CD
-
Continuous Integration
-
Continuous Delivery
-
Continuous Deployment
-
Pipeline as Code
-
Artifact management
-
Release strategies
CI/CD Pipeline
GitHub Actions Example
name: CI/CD Pipeline
on: push: branches: [main, develop] pull_request: branches: [main]
env: REGISTRY: ghcr.io IMAGE_NAME: ${{ github.repository }}
jobs: test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3
- name: Setup Node.js
uses: actions/setup-node@v3
with:
node-version: '18'
cache: 'npm'
- name: Install dependencies
run: npm ci
- name: Run linting
run: npm run lint
- name: Run tests
run: npm test
- name: Run security scan
run: npm audit
- name: Upload coverage
uses: codecov/codecov-action@v3
build: needs: test runs-on: ubuntu-latest permissions: contents: read packages: write
steps:
- uses: actions/checkout@v3
- name: Log in to Container Registry
uses: docker/login-action@v2
with:
registry: ${{ env.REGISTRY }}
username: ${{ github.actor }}
password: ${{ secrets.GITHUB_TOKEN }}
- name: Extract metadata
id: meta
uses: docker/metadata-action@v4
with:
images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
- name: Build and push Docker image
uses: docker/build-push-action@v4
with:
context: .
push: true
tags: ${{ steps.meta.outputs.tags }}
labels: ${{ steps.meta.outputs.labels }}
deploy-staging: needs: build if: github.ref == 'refs/heads/develop' runs-on: ubuntu-latest environment: staging
steps:
- name: Deploy to staging
run: |
kubectl set image deployment/myapp \
myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
--namespace=staging
- name: Wait for rollout
run: kubectl rollout status deployment/myapp -n staging
- name: Run smoke tests
run: npm run test:smoke
deploy-production: needs: build if: github.ref == 'refs/heads/main' runs-on: ubuntu-latest environment: production
steps:
- name: Deploy to production
run: |
kubectl set image deployment/myapp \
myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
--namespace=production
- name: Wait for rollout
run: kubectl rollout status deployment/myapp -n production
Infrastructure as Code
Pulumi Infrastructure as Code
import pulumi import pulumi_aws as aws
VPC
vpc = aws.ec2.Vpc("main-vpc", cidr_block="10.0.0.0/16", enable_dns_hostnames=True, enable_dns_support=True, tags={"Name": "main-vpc"})
Subnets
public_subnet = aws.ec2.Subnet("public-subnet", vpc_id=vpc.id, cidr_block="10.0.1.0/24", availability_zone="us-east-1a", map_public_ip_on_launch=True, tags={"Name": "public-subnet"})
private_subnet = aws.ec2.Subnet("private-subnet", vpc_id=vpc.id, cidr_block="10.0.2.0/24", availability_zone="us-east-1b", tags={"Name": "private-subnet"})
Internet Gateway
igw = aws.ec2.InternetGateway("igw", vpc_id=vpc.id, tags={"Name": "main-igw"})
Route Table
route_table = aws.ec2.RouteTable("public-rt", vpc_id=vpc.id, routes=[ aws.ec2.RouteTableRouteArgs( cidr_block="0.0.0.0/0", gateway_id=igw.id, ) ], tags={"Name": "public-rt"})
Security Group
security_group = aws.ec2.SecurityGroup("web-sg", vpc_id=vpc.id, description="Allow HTTP and HTTPS traffic", ingress=[ aws.ec2.SecurityGroupIngressArgs( protocol="tcp", from_port=80, to_port=80, cidr_blocks=["0.0.0.0/0"], ), aws.ec2.SecurityGroupIngressArgs( protocol="tcp", from_port=443, to_port=443, cidr_blocks=["0.0.0.0/0"], ), ], egress=[ aws.ec2.SecurityGroupEgressArgs( protocol="-1", from_port=0, to_port=0, cidr_blocks=["0.0.0.0/0"], ) ])
EKS Cluster
cluster = aws.eks.Cluster("app-cluster", role_arn=cluster_role.arn, vpc_config=aws.eks.ClusterVpcConfigArgs( subnet_ids=[public_subnet.id, private_subnet.id], security_group_ids=[security_group.id], ))
Export outputs
pulumi.export("vpc_id", vpc.id) pulumi.export("cluster_name", cluster.name) pulumi.export("cluster_endpoint", cluster.endpoint)
Deployment Strategies
from typing import List, Dict import time
class DeploymentStrategy: """Implement various deployment strategies"""
def __init__(self, service_name: str):
self.service_name = service_name
def blue_green_deployment(self, blue_version: str, green_version: str):
"""Blue-Green deployment"""
# Deploy green environment
self.deploy_environment("green", green_version)
# Run tests on green
if self.run_tests("green"):
# Switch traffic to green
self.switch_traffic("green")
# Keep blue for rollback
print(f"Deployment successful. Blue ({blue_version}) kept for rollback.")
else:
# Rollback - keep blue active
print("Tests failed on green. Keeping blue active.")
def canary_deployment(self, current_version: str, new_version: str,
canary_percentage: int = 10):
"""Canary deployment"""
# Deploy canary with small percentage
self.deploy_canary(new_version, canary_percentage)
# Monitor metrics
metrics = self.monitor_canary_metrics(duration_minutes=10)
if metrics['error_rate'] < 0.1 and metrics['latency_p95'] < 500:
# Gradually increase canary traffic
for percentage in [25, 50, 75, 100]:
self.update_canary_traffic(percentage)
time.sleep(300) # 5 minutes between increases
if not self.check_health():
self.rollback(current_version)
return False
print(f"Canary deployment successful: {new_version}")
return True
else:
self.rollback(current_version)
print("Canary deployment failed - rolled back")
return False
def rolling_deployment(self, version: str, batch_size: int = 1):
"""Rolling deployment"""
instances = self.get_instances()
for i in range(0, len(instances), batch_size):
batch = instances[i:i + batch_size]
# Update batch
for instance in batch:
self.update_instance(instance, version)
self.wait_for_healthy(instance)
# Verify batch health
if not self.check_health():
print(f"Rolling deployment failed at batch {i//batch_size + 1}")
return False
print(f"Rolling deployment successful: {version}")
return True
def feature_flag_deployment(self, feature_name: str, enabled: bool,
rollout_percentage: int = 100):
"""Feature flag based deployment"""
return {
'feature': feature_name,
'enabled': enabled,
'rollout_percentage': rollout_percentage,
'targeting': {
'user_segments': ['beta_users'] if rollout_percentage < 100 else ['all']
}
}
Configuration Management
from typing import Dict, Any import yaml import json
class ConfigurationManager: """Manage application configuration"""
def __init__(self, environment: str):
self.environment = environment
self.config = {}
def load_config(self, config_file: str):
"""Load configuration from file"""
with open(config_file, 'r') as f:
if config_file.endswith('.yaml') or config_file.endswith('.yml'):
self.config = yaml.safe_load(f)
elif config_file.endswith('.json'):
self.config = json.load(f)
def get(self, key: str, default: Any = None) -> Any:
"""Get configuration value"""
keys = key.split('.')
value = self.config
for k in keys:
if isinstance(value, dict):
value = value.get(k)
else:
return default
if value is None:
return default
return value
def merge_environment_config(self, env_config: Dict):
"""Merge environment-specific configuration"""
self.config = self._deep_merge(self.config, env_config)
def _deep_merge(self, base: Dict, override: Dict) -> Dict:
"""Deep merge two dictionaries"""
result = base.copy()
for key, value in override.items():
if key in result and isinstance(result[key], dict) and isinstance(value, dict):
result[key] = self._deep_merge(result[key], value)
else:
result[key] = value
return result
def validate_required_keys(self, required_keys: List[str]) -> List[str]:
"""Validate that required configuration keys exist"""
missing = []
for key in required_keys:
if self.get(key) is None:
missing.append(key)
return missing
Monitoring and Observability
import logging from opencensus.ext.azure import metrics_exporter from opencensus.stats import aggregation as aggregation_module from opencensus.stats import measure as measure_module from opencensus.stats import stats as stats_module from opencensus.stats import view as view_module from opencensus.tags import tag_map as tag_map_module
class ObservabilityStack: """Implement observability best practices"""
def __init__(self):
self.logger = self._setup_logging()
self.stats = stats_module.stats
self.view_manager = self.stats.view_manager
def _setup_logging(self) -> logging.Logger:
"""Setup structured logging"""
logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
formatter = logging.Formatter(
'{"time": "%(asctime)s", "level": "%(levelname)s", '
'"service": "%(name)s", "message": "%(message)s"}'
)
handler.setFormatter(formatter)
logger.addHandler(handler)
logger.setLevel(logging.INFO)
return logger
def log_with_context(self, level: str, message: str, **context):
"""Log with additional context"""
log_func = getattr(self.logger, level)
log_func(message, extra=context)
def track_custom_metric(self, metric_name: str, value: float,
tags: Dict[str, str]):
"""Track custom application metric"""
# Implementation would send to metrics backend
pass
def create_distributed_trace(self, operation_name: str):
"""Create distributed trace span"""
# Implementation would use OpenTelemetry or similar
pass
Best Practices
Culture & Process
-
Foster collaboration between Dev and Ops
-
Automate everything possible
-
Measure and monitor continuously
-
Practice blameless post-mortems
-
Share knowledge and documentation
-
Encourage experimentation
-
Celebrate successes and learn from failures
CI/CD
-
Keep builds fast (<10 minutes)
-
Run tests in parallel
-
Use pipeline as code
-
Implement automated rollbacks
-
Require code review before merge
-
Use trunk-based development
-
Deploy small, frequent changes
Infrastructure
-
Use Infrastructure as Code
-
Version everything (code, config, infrastructure)
-
Implement disaster recovery
-
Practice chaos engineering
-
Use immutable infrastructure
-
Automate security scanning
-
Monitor cloud costs
Anti-Patterns
❌ Manual deployments ❌ Configuration drift ❌ No automated testing ❌ Long-lived feature branches ❌ Blame culture ❌ Siloed teams ❌ Ignoring technical debt
Resources
-
The Phoenix Project: https://itrevolution.com/the-phoenix-project/
-
DevOps Handbook: https://itrevolution.com/the-devops-handbook/
-
State of DevOps Report: https://www.devops-research.com/research.html
-
GitLab CI/CD: https://docs.gitlab.com/ee/ci/
-
GitHub Actions: https://docs.github.com/en/actions