DevOps Expert

Expert guidance for DevOps practices, culture, CI/CD pipelines, infrastructure automation, and operational excellence.

Core Concepts

DevOps Culture

Collaboration and communication
Shared responsibility
Continuous improvement
Breaking down silos
Blameless culture
Measuring everything

Automation

Infrastructure as Code (IaC)
Configuration management
Deployment automation
Testing automation
Monitoring automation
Self-service platforms

CI/CD

Continuous Integration
Continuous Delivery
Continuous Deployment
Pipeline as Code
Artifact management
Release strategies

CI/CD Pipeline

GitHub Actions Example

name: CI/CD Pipeline

on: push: branches: [main, develop] pull_request: branches: [main]

env: REGISTRY: ghcr.io IMAGE_NAME: ${{ github.repository }}

jobs: test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3

  - name: Setup Node.js
    uses: actions/setup-node@v3
    with:
      node-version: '18'
      cache: 'npm'

  - name: Install dependencies
    run: npm ci

  - name: Run linting
    run: npm run lint

  - name: Run tests
    run: npm test

  - name: Run security scan
    run: npm audit

  - name: Upload coverage
    uses: codecov/codecov-action@v3

build: needs: test runs-on: ubuntu-latest permissions: contents: read packages: write

steps:
  - uses: actions/checkout@v3

  - name: Log in to Container Registry
    uses: docker/login-action@v2
    with:
      registry: ${{ env.REGISTRY }}
      username: ${{ github.actor }}
      password: ${{ secrets.GITHUB_TOKEN }}

  - name: Extract metadata
    id: meta
    uses: docker/metadata-action@v4
    with:
      images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}

  - name: Build and push Docker image
    uses: docker/build-push-action@v4
    with:
      context: .
      push: true
      tags: ${{ steps.meta.outputs.tags }}
      labels: ${{ steps.meta.outputs.labels }}

deploy-staging: needs: build if: github.ref == 'refs/heads/develop' runs-on: ubuntu-latest environment: staging

steps:
  - name: Deploy to staging
    run: |
      kubectl set image deployment/myapp \
        myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
        --namespace=staging

  - name: Wait for rollout
    run: kubectl rollout status deployment/myapp -n staging

  - name: Run smoke tests
    run: npm run test:smoke

deploy-production: needs: build if: github.ref == 'refs/heads/main' runs-on: ubuntu-latest environment: production

steps:
  - name: Deploy to production
    run: |
      kubectl set image deployment/myapp \
        myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
        --namespace=production

  - name: Wait for rollout
    run: kubectl rollout status deployment/myapp -n production

Infrastructure as Code

Pulumi Infrastructure as Code

import pulumi import pulumi_aws as aws

VPC

vpc = aws.ec2.Vpc("main-vpc", cidr_block="10.0.0.0/16", enable_dns_hostnames=True, enable_dns_support=True, tags={"Name": "main-vpc"})

Subnets

public_subnet = aws.ec2.Subnet("public-subnet", vpc_id=vpc.id, cidr_block="10.0.1.0/24", availability_zone="us-east-1a", map_public_ip_on_launch=True, tags={"Name": "public-subnet"})

private_subnet = aws.ec2.Subnet("private-subnet", vpc_id=vpc.id, cidr_block="10.0.2.0/24", availability_zone="us-east-1b", tags={"Name": "private-subnet"})

Internet Gateway

igw = aws.ec2.InternetGateway("igw", vpc_id=vpc.id, tags={"Name": "main-igw"})

Route Table

route_table = aws.ec2.RouteTable("public-rt", vpc_id=vpc.id, routes=[ aws.ec2.RouteTableRouteArgs( cidr_block="0.0.0.0/0", gateway_id=igw.id, ) ], tags={"Name": "public-rt"})

Security Group

security_group = aws.ec2.SecurityGroup("web-sg", vpc_id=vpc.id, description="Allow HTTP and HTTPS traffic", ingress=[ aws.ec2.SecurityGroupIngressArgs( protocol="tcp", from_port=80, to_port=80, cidr_blocks=["0.0.0.0/0"], ), aws.ec2.SecurityGroupIngressArgs( protocol="tcp", from_port=443, to_port=443, cidr_blocks=["0.0.0.0/0"], ), ], egress=[ aws.ec2.SecurityGroupEgressArgs( protocol="-1", from_port=0, to_port=0, cidr_blocks=["0.0.0.0/0"], ) ])

EKS Cluster

cluster = aws.eks.Cluster("app-cluster", role_arn=cluster_role.arn, vpc_config=aws.eks.ClusterVpcConfigArgs( subnet_ids=[public_subnet.id, private_subnet.id], security_group_ids=[security_group.id], ))

Export outputs

pulumi.export("vpc_id", vpc.id) pulumi.export("cluster_name", cluster.name) pulumi.export("cluster_endpoint", cluster.endpoint)

Deployment Strategies

from typing import List, Dict import time

class DeploymentStrategy: """Implement various deployment strategies"""

def __init__(self, service_name: str):
    self.service_name = service_name

def blue_green_deployment(self, blue_version: str, green_version: str):
    """Blue-Green deployment"""
    # Deploy green environment
    self.deploy_environment("green", green_version)

    # Run tests on green
    if self.run_tests("green"):
        # Switch traffic to green
        self.switch_traffic("green")

        # Keep blue for rollback
        print(f"Deployment successful. Blue ({blue_version}) kept for rollback.")
    else:
        # Rollback - keep blue active
        print("Tests failed on green. Keeping blue active.")

def canary_deployment(self, current_version: str, new_version: str,
                     canary_percentage: int = 10):
    """Canary deployment"""
    # Deploy canary with small percentage
    self.deploy_canary(new_version, canary_percentage)

    # Monitor metrics
    metrics = self.monitor_canary_metrics(duration_minutes=10)

    if metrics['error_rate'] &#x3C; 0.1 and metrics['latency_p95'] &#x3C; 500:
        # Gradually increase canary traffic
        for percentage in [25, 50, 75, 100]:
            self.update_canary_traffic(percentage)
            time.sleep(300)  # 5 minutes between increases

            if not self.check_health():
                self.rollback(current_version)
                return False

        print(f"Canary deployment successful: {new_version}")
        return True
    else:
        self.rollback(current_version)
        print("Canary deployment failed - rolled back")
        return False

def rolling_deployment(self, version: str, batch_size: int = 1):
    """Rolling deployment"""
    instances = self.get_instances()

    for i in range(0, len(instances), batch_size):
        batch = instances[i:i + batch_size]

        # Update batch
        for instance in batch:
            self.update_instance(instance, version)
            self.wait_for_healthy(instance)

        # Verify batch health
        if not self.check_health():
            print(f"Rolling deployment failed at batch {i//batch_size + 1}")
            return False

    print(f"Rolling deployment successful: {version}")
    return True

def feature_flag_deployment(self, feature_name: str, enabled: bool,
                           rollout_percentage: int = 100):
    """Feature flag based deployment"""
    return {
        'feature': feature_name,
        'enabled': enabled,
        'rollout_percentage': rollout_percentage,
        'targeting': {
            'user_segments': ['beta_users'] if rollout_percentage &#x3C; 100 else ['all']
        }
    }

Configuration Management

from typing import Dict, Any import yaml import json

class ConfigurationManager: """Manage application configuration"""

def __init__(self, environment: str):
    self.environment = environment
    self.config = {}

def load_config(self, config_file: str):
    """Load configuration from file"""
    with open(config_file, 'r') as f:
        if config_file.endswith('.yaml') or config_file.endswith('.yml'):
            self.config = yaml.safe_load(f)
        elif config_file.endswith('.json'):
            self.config = json.load(f)

def get(self, key: str, default: Any = None) -> Any:
    """Get configuration value"""
    keys = key.split('.')
    value = self.config

    for k in keys:
        if isinstance(value, dict):
            value = value.get(k)
        else:
            return default

        if value is None:
            return default

    return value

def merge_environment_config(self, env_config: Dict):
    """Merge environment-specific configuration"""
    self.config = self._deep_merge(self.config, env_config)

def _deep_merge(self, base: Dict, override: Dict) -> Dict:
    """Deep merge two dictionaries"""
    result = base.copy()

    for key, value in override.items():
        if key in result and isinstance(result[key], dict) and isinstance(value, dict):
            result[key] = self._deep_merge(result[key], value)
        else:
            result[key] = value

    return result

def validate_required_keys(self, required_keys: List[str]) -> List[str]:
    """Validate that required configuration keys exist"""
    missing = []

    for key in required_keys:
        if self.get(key) is None:
            missing.append(key)

    return missing

Monitoring and Observability

import logging from opencensus.ext.azure import metrics_exporter from opencensus.stats import aggregation as aggregation_module from opencensus.stats import measure as measure_module from opencensus.stats import stats as stats_module from opencensus.stats import view as view_module from opencensus.tags import tag_map as tag_map_module

class ObservabilityStack: """Implement observability best practices"""

def __init__(self):
    self.logger = self._setup_logging()
    self.stats = stats_module.stats
    self.view_manager = self.stats.view_manager

def _setup_logging(self) -> logging.Logger:
    """Setup structured logging"""
    logger = logging.getLogger(__name__)
    handler = logging.StreamHandler()

    formatter = logging.Formatter(
        '{"time": "%(asctime)s", "level": "%(levelname)s", '
        '"service": "%(name)s", "message": "%(message)s"}'
    )
    handler.setFormatter(formatter)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)

    return logger

def log_with_context(self, level: str, message: str, **context):
    """Log with additional context"""
    log_func = getattr(self.logger, level)
    log_func(message, extra=context)

def track_custom_metric(self, metric_name: str, value: float,
                       tags: Dict[str, str]):
    """Track custom application metric"""
    # Implementation would send to metrics backend
    pass

def create_distributed_trace(self, operation_name: str):
    """Create distributed trace span"""
    # Implementation would use OpenTelemetry or similar
    pass

Best Practices

Culture & Process

Foster collaboration between Dev and Ops
Automate everything possible
Measure and monitor continuously
Practice blameless post-mortems
Share knowledge and documentation
Encourage experimentation
Celebrate successes and learn from failures

CI/CD

Keep builds fast (<10 minutes)
Run tests in parallel
Use pipeline as code
Implement automated rollbacks
Require code review before merge
Use trunk-based development
Deploy small, frequent changes

Infrastructure

Use Infrastructure as Code
Version everything (code, config, infrastructure)
Implement disaster recovery
Practice chaos engineering
Use immutable infrastructure
Automate security scanning
Monitor cloud costs

Anti-Patterns

❌ Manual deployments ❌ Configuration drift ❌ No automated testing ❌ Long-lived feature branches ❌ Blame culture ❌ Siloed teams ❌ Ignoring technical debt

Resources

The Phoenix Project: https://itrevolution.com/the-phoenix-project/
DevOps Handbook: https://itrevolution.com/the-devops-handbook/
State of DevOps Report: https://www.devops-research.com/research.html
GitLab CI/CD: https://docs.gitlab.com/ee/ci/
GitHub Actions: https://docs.github.com/en/actions

devops-expert

Safety Notice

Copy this and send it to your AI assistant to learn

GitHub Actions Example

Pulumi Infrastructure as Code

VPC

Subnets

Internet Gateway

Route Table

Security Group

EKS Cluster

Export outputs

Source Transparency

Related Skills

python-expert

code-review-expert

typescript-expert