
Install skill "devops-expert" with this command: npx skills add personamanagmentlayer/pcl/personamanagmentlayer-pcl-devops-expert

DevOps Expert

Expert guidance for DevOps practices, culture, CI/CD pipelines, infrastructure automation, and operational excellence.

Core Concepts

DevOps Culture

  • Collaboration and communication

  • Shared responsibility

  • Continuous improvement

  • Breaking down silos

  • Blameless culture

  • Measuring everything

Automation

  • Infrastructure as Code (IaC)

  • Configuration management

  • Deployment automation

  • Testing automation

  • Monitoring automation

  • Self-service platforms

CI/CD

  • Continuous Integration

  • Continuous Delivery

  • Continuous Deployment

  • Pipeline as Code

  • Artifact management

  • Release strategies

CI/CD Pipeline

GitHub Actions Example

name: CI/CD Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

env:
  REGISTRY: ghcr.io
  IMAGE_NAME: ${{ github.repository }}

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Node.js
        uses: actions/setup-node@v3
        with:
          node-version: '18'
          cache: 'npm'

      - name: Install dependencies
        run: npm ci

      - name: Run linting
        run: npm run lint

      - name: Run tests
        run: npm test

      - name: Run security scan
        run: npm audit

      - name: Upload coverage
        uses: codecov/codecov-action@v3

  build:
    needs: test
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write
    steps:
      - uses: actions/checkout@v3

      - name: Log in to Container Registry
        uses: docker/login-action@v2
        with:
          registry: ${{ env.REGISTRY }}
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v4
        with:
          images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}
          # Publish the full commit SHA as a tag so the deploy jobs
          # below can reference ${{ github.sha }} directly.
          tags: |
            type=sha,format=long,prefix=

      - name: Build and push Docker image
        uses: docker/build-push-action@v4
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/develop'
    runs-on: ubuntu-latest
    environment: staging
    steps:
      - name: Deploy to staging
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=staging

      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp -n staging

      - name: Run smoke tests
        run: npm run test:smoke

  deploy-production:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production
    steps:
      - name: Deploy to production
        run: |
          kubectl set image deployment/myapp \
            myapp=${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ github.sha }} \
            --namespace=production

      - name: Wait for rollout
        run: kubectl rollout status deployment/myapp -n production

Infrastructure as Code

Pulumi Example (Python)

import pulumi
import pulumi_aws as aws

# VPC
vpc = aws.ec2.Vpc(
    "main-vpc",
    cidr_block="10.0.0.0/16",
    enable_dns_hostnames=True,
    enable_dns_support=True,
    tags={"Name": "main-vpc"},
)

# Subnets
public_subnet = aws.ec2.Subnet(
    "public-subnet",
    vpc_id=vpc.id,
    cidr_block="10.0.1.0/24",
    availability_zone="us-east-1a",
    map_public_ip_on_launch=True,
    tags={"Name": "public-subnet"},
)

private_subnet = aws.ec2.Subnet(
    "private-subnet",
    vpc_id=vpc.id,
    cidr_block="10.0.2.0/24",
    availability_zone="us-east-1b",
    tags={"Name": "private-subnet"},
)

# Internet Gateway
igw = aws.ec2.InternetGateway("igw", vpc_id=vpc.id, tags={"Name": "main-igw"})

# Route Table
route_table = aws.ec2.RouteTable(
    "public-rt",
    vpc_id=vpc.id,
    routes=[
        aws.ec2.RouteTableRouteArgs(
            cidr_block="0.0.0.0/0",
            gateway_id=igw.id,
        )
    ],
    tags={"Name": "public-rt"},
)

# Security Group
security_group = aws.ec2.SecurityGroup(
    "web-sg",
    vpc_id=vpc.id,
    description="Allow HTTP and HTTPS traffic",
    ingress=[
        aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp", from_port=80, to_port=80, cidr_blocks=["0.0.0.0/0"],
        ),
        aws.ec2.SecurityGroupIngressArgs(
            protocol="tcp", from_port=443, to_port=443, cidr_blocks=["0.0.0.0/0"],
        ),
    ],
    egress=[
        aws.ec2.SecurityGroupEgressArgs(
            protocol="-1", from_port=0, to_port=0, cidr_blocks=["0.0.0.0/0"],
        )
    ],
)

# EKS Cluster
# Assumes cluster_role (an aws.iam.Role the EKS control plane can assume)
# is defined elsewhere in the program.
cluster = aws.eks.Cluster(
    "app-cluster",
    role_arn=cluster_role.arn,
    vpc_config=aws.eks.ClusterVpcConfigArgs(
        subnet_ids=[public_subnet.id, private_subnet.id],
        security_group_ids=[security_group.id],
    ),
)

# Export outputs
pulumi.export("vpc_id", vpc.id)
pulumi.export("cluster_name", cluster.name)
pulumi.export("cluster_endpoint", cluster.endpoint)
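The EKS example above references a `cluster_role` it never defines. A minimal sketch of that role, assuming the AWS-managed `AmazonEKSClusterPolicy` is sufficient (the resource names here are illustrative):

```
import json
import pulumi_aws as aws

# Trust policy letting the EKS control plane assume the role.
eks_trust_policy = json.dumps({
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "eks.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
})

cluster_role = aws.iam.Role("cluster-role", assume_role_policy=eks_trust_policy)

# Attach the managed policy EKS control planes require.
aws.iam.RolePolicyAttachment(
    "cluster-role-eks-policy",
    role=cluster_role.name,
    policy_arn="arn:aws:iam::aws:policy/AmazonEKSClusterPolicy",
)
```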

Deployment Strategies

import time


class DeploymentStrategy:
    """Implement various deployment strategies.

    Helper methods such as deploy_environment, run_tests, switch_traffic,
    deploy_canary, monitor_canary_metrics, update_canary_traffic,
    check_health, rollback, get_instances, update_instance and
    wait_for_healthy are assumed to be provided by the surrounding
    deployment platform.
    """

    def __init__(self, service_name: str):
        self.service_name = service_name

    def blue_green_deployment(self, blue_version: str, green_version: str):
        """Blue-green deployment: stand up green, test it, then cut over"""
        # Deploy green environment
        self.deploy_environment("green", green_version)

        # Run tests on green
        if self.run_tests("green"):
            # Switch traffic to green
            self.switch_traffic("green")

            # Keep blue for rollback
            print(f"Deployment successful. Blue ({blue_version}) kept for rollback.")
        else:
            # Tests failed - keep blue active
            print("Tests failed on green. Keeping blue active.")

    def canary_deployment(self, current_version: str, new_version: str,
                          canary_percentage: int = 10):
        """Canary deployment: shift traffic gradually while watching metrics"""
        # Deploy canary with a small traffic percentage
        self.deploy_canary(new_version, canary_percentage)

        # Monitor metrics
        metrics = self.monitor_canary_metrics(duration_minutes=10)

        if metrics['error_rate'] < 0.1 and metrics['latency_p95'] < 500:
            # Gradually increase canary traffic
            for percentage in [25, 50, 75, 100]:
                self.update_canary_traffic(percentage)
                time.sleep(300)  # 5 minutes between increases

                if not self.check_health():
                    self.rollback(current_version)
                    return False

            print(f"Canary deployment successful: {new_version}")
            return True
        else:
            self.rollback(current_version)
            print("Canary deployment failed - rolled back")
            return False

    def rolling_deployment(self, version: str, batch_size: int = 1):
        """Rolling deployment: update instances in small batches"""
        instances = self.get_instances()

        for i in range(0, len(instances), batch_size):
            batch = instances[i:i + batch_size]

            # Update batch
            for instance in batch:
                self.update_instance(instance, version)
                self.wait_for_healthy(instance)

            # Verify batch health
            if not self.check_health():
                print(f"Rolling deployment failed at batch {i // batch_size + 1}")
                return False

        print(f"Rolling deployment successful: {version}")
        return True

    def feature_flag_deployment(self, feature_name: str, enabled: bool,
                                rollout_percentage: int = 100):
        """Feature-flag based deployment"""
        return {
            'feature': feature_name,
            'enabled': enabled,
            'rollout_percentage': rollout_percentage,
            'targeting': {
                'user_segments': ['beta_users'] if rollout_percentage < 100 else ['all']
            }
        }
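The promotion gate in `canary_deployment` can be pulled out and exercised on its own. The thresholds mirror the method above; the metric values are made up:

```python
def canary_healthy(metrics: dict) -> bool:
    # Same gate as canary_deployment: promote only while the error rate
    # stays under 10% and p95 latency under 500 ms.
    return metrics["error_rate"] < 0.1 and metrics["latency_p95"] < 500

good = {"error_rate": 0.02, "latency_p95": 230}
bad = {"error_rate": 0.15, "latency_p95": 230}
print(canary_healthy(good), canary_healthy(bad))  # True False
```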

Configuration Management

import json
from typing import Any, Dict, List

import yaml


class ConfigurationManager:
    """Manage application configuration"""

    def __init__(self, environment: str):
        self.environment = environment
        self.config: Dict[str, Any] = {}

    def load_config(self, config_file: str):
        """Load configuration from a YAML or JSON file"""
        with open(config_file, 'r') as f:
            if config_file.endswith(('.yaml', '.yml')):
                self.config = yaml.safe_load(f)
            elif config_file.endswith('.json'):
                self.config = json.load(f)

    def get(self, key: str, default: Any = None) -> Any:
        """Get a configuration value by dotted key, e.g. 'db.host'"""
        keys = key.split('.')
        value = self.config

        for k in keys:
            if isinstance(value, dict):
                value = value.get(k)
            else:
                return default

            if value is None:
                return default

        return value

    def merge_environment_config(self, env_config: Dict):
        """Merge environment-specific configuration"""
        self.config = self._deep_merge(self.config, env_config)

    def _deep_merge(self, base: Dict, override: Dict) -> Dict:
        """Deep merge two dictionaries"""
        result = base.copy()

        for key, value in override.items():
            if key in result and isinstance(result[key], dict) and isinstance(value, dict):
                result[key] = self._deep_merge(result[key], value)
            else:
                result[key] = value

        return result

    def validate_required_keys(self, required_keys: List[str]) -> List[str]:
        """Return the required configuration keys that are missing"""
        missing = []

        for key in required_keys:
            if self.get(key) is None:
                missing.append(key)

        return missing
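A quick sketch of the deep-merge behaviour above, written as a free function so it runs on its own (the logic is copied from `_deep_merge`; the config values are invented):

```python
def deep_merge(base: dict, override: dict) -> dict:
    # Same logic as ConfigurationManager._deep_merge: nested dicts are
    # merged key by key, and scalar values in `override` win.
    result = base.copy()
    for key, value in override.items():
        if key in result and isinstance(result[key], dict) and isinstance(value, dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = value
    return result

base = {"db": {"host": "localhost", "port": 5432}, "debug": True}
prod = {"db": {"host": "db.internal"}, "debug": False}
merged = deep_merge(base, prod)
# merged == {"db": {"host": "db.internal", "port": 5432}, "debug": False}
```

Note that the environment override only has to mention the keys it changes; `db.port` survives the merge untouched.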

Monitoring and Observability

import logging
from typing import Dict

from opencensus.ext.azure import metrics_exporter
from opencensus.stats import aggregation as aggregation_module
from opencensus.stats import measure as measure_module
from opencensus.stats import stats as stats_module
from opencensus.stats import view as view_module
from opencensus.tags import tag_map as tag_map_module


class ObservabilityStack:
    """Implement observability best practices"""

    def __init__(self):
        self.logger = self._setup_logging()
        self.stats = stats_module.stats
        self.view_manager = self.stats.view_manager

    def _setup_logging(self) -> logging.Logger:
        """Set up structured (JSON) logging"""
        logger = logging.getLogger(__name__)
        handler = logging.StreamHandler()

        formatter = logging.Formatter(
            '{"time": "%(asctime)s", "level": "%(levelname)s", '
            '"service": "%(name)s", "message": "%(message)s"}'
        )
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        logger.setLevel(logging.INFO)

        return logger

    def log_with_context(self, level: str, message: str, **context):
        """Log with additional context fields"""
        log_func = getattr(self.logger, level)
        log_func(message, extra=context)

    def track_custom_metric(self, metric_name: str, value: float,
                            tags: Dict[str, str]):
        """Track a custom application metric"""
        # Implementation would send to a metrics backend
        pass

    def create_distributed_trace(self, operation_name: str):
        """Create a distributed trace span"""
        # Implementation would use OpenTelemetry or similar
        pass
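The JSON formatter in `_setup_logging` emits one parseable object per log line. A minimal stand-alone sketch (the logger name and message are arbitrary; the stream is captured so the output can be inspected):

```python
import io
import json
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter(
    '{"time": "%(asctime)s", "level": "%(levelname)s", '
    '"service": "%(name)s", "message": "%(message)s"}'
))

logger = logging.getLogger("payments")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("charge accepted")

# Each line is valid JSON, so log shippers can index the fields directly.
record = json.loads(stream.getvalue())
print(record["level"], record["service"])  # INFO payments
```

This naive formatter breaks if a message contains a double quote; production setups usually use a dedicated JSON log library instead.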

Best Practices

Culture & Process

  • Foster collaboration between Dev and Ops

  • Automate everything possible

  • Measure and monitor continuously

  • Practice blameless post-mortems

  • Share knowledge and documentation

  • Encourage experimentation

  • Celebrate successes and learn from failures

CI/CD

  • Keep builds fast (<10 minutes)

  • Run tests in parallel

  • Use pipeline as code

  • Implement automated rollbacks

  • Require code review before merge

  • Use trunk-based development

  • Deploy small, frequent changes

Infrastructure

  • Use Infrastructure as Code

  • Version everything (code, config, infrastructure)

  • Implement disaster recovery

  • Practice chaos engineering

  • Use immutable infrastructure

  • Automate security scanning

  • Monitor cloud costs

Anti-Patterns

❌ Manual deployments
❌ Configuration drift
❌ No automated testing
❌ Long-lived feature branches
❌ Blame culture
❌ Siloed teams
❌ Ignoring technical debt
