Cloud Devops Expert

Core Services:

Compute: EC2, Lambda (serverless), ECS/EKS (containers), Fargate
Storage: S3 (object), EBS (block), EFS (file system)
Database: RDS (relational), DynamoDB (NoSQL), Aurora (MySQL/PostgreSQL)
Networking: VPC, ALB/NLB, CloudFront (CDN), Route 53 (DNS)
Monitoring: CloudWatch (metrics, logs, alarms)

Best Practices:

Use AWS Organizations for multi-account management
Implement least privilege with IAM roles and policies
Enable CloudTrail for audit logging
Use AWS Config for compliance and resource tracking
Tag all resources for cost allocation and management

GCP (Google Cloud Platform) Patterns

Core Services:

Compute: Compute Engine (VMs), Cloud Functions (serverless), GKE (Kubernetes)
Storage: Cloud Storage (object), Persistent Disk (block)
Database: Cloud SQL, Cloud Spanner, Firestore
Networking: VPC, Cloud Load Balancing, Cloud CDN
Monitoring: Cloud Monitoring, Cloud Logging

Best Practices:

Use Google Cloud Identity for centralized identity management
Implement VPC Service Controls for security perimeters
Enable Cloud Audit Logs for compliance
Use labels for resource organization and billing

Azure Patterns

Core Services:

Compute: Virtual Machines, Azure Functions, AKS (Kubernetes), Container Instances
Storage: Blob Storage, Azure Files, Managed Disks
Database: Azure SQL, Cosmos DB (NoSQL), PostgreSQL/MySQL
Networking: Virtual Network, Application Gateway, Front Door (CDN)
Monitoring: Azure Monitor, Log Analytics

Best Practices:

Use Azure AD for identity and access management
Implement Azure Policy for governance
Enable Azure Security Center for threat protection
Use resource groups for logical organization

Terraform Best Practices

Project Structure:

terraform/ ├── environments/ │ ├── dev/ │ │ ├── main.tf │ │ ├── variables.tf │ │ └── terraform.tfvars │ ├── staging/ │ └── prod/ ├── modules/ │ ├── vpc/ │ ├── eks/ │ └── rds/ └── global/ └── backend.tf

Code Organization:

Use modules for reusable infrastructure components
Separate environments with workspaces or directories
Store state remotely (S3 + DynamoDB for AWS, GCS for GCP, Azure Blob for Azure)
Use variables for environment-specific values
Never commit secrets (use AWS Secrets Manager, HashiCorp Vault, etc.)

Terraform Workflow:

Initialize

terraform init

Plan (review changes)

terraform plan -out=tfplan

Apply (execute changes)

terraform apply tfplan

Destroy (when needed)

terraform destroy

Best Practices:

Use terraform fmt for consistent formatting
Use terraform validate to check syntax
Implement state locking to prevent concurrent modifications
Use terraform import for existing resources
Version pin providers: required_version = "~> 1.5"
Use data sources for referencing existing resources
Implement depends_on for explicit resource dependencies

Kubernetes Deployment Patterns

Deployment Strategies:

Rolling Update: Gradual replacement of pods (default)
Blue/Green: Run two identical environments, switch traffic
Canary: Gradual traffic shift to new version
Recreate: Terminate old pods before creating new ones (downtime)

Resource Management:

apiVersion: apps/v1 kind: Deployment metadata: name: myapp spec: replicas: 3 selector: matchLabels: app: myapp template: metadata: labels: app: myapp spec: containers: - name: myapp image: myapp:v1.0.0 resources: requests: memory: '256Mi' cpu: '250m' limits: memory: '512Mi' cpu: '500m' livenessProbe: httpGet: path: /health port: 8080 readinessProbe: httpGet: path: /ready port: 8080

Best Practices:

Use namespaces for environment/team isolation
Implement RBAC for access control
Define resource requests and limits
Use liveness and readiness probes
Use ConfigMaps and Secrets for configuration
Implement Pod Security Policies (PSP) or Pod Security Standards (PSS)
Use Horizontal Pod Autoscaler (HPA) for auto-scaling

CI/CD Pipeline Patterns

GitHub Actions Example:

name: CI/CD Pipeline

on: push: branches: [main, develop] pull_request: branches: [main]

jobs: test: runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Run tests run: npm test

build: needs: test runs-on: ubuntu-latest steps: - uses: actions/checkout@v3 - name: Build Docker image run: docker build -t myapp:${{ github.sha }} . - name: Push to registry run: docker push myapp:${{ github.sha }}

deploy: needs: build runs-on: ubuntu-latest if: github.ref == 'refs/heads/main' steps: - name: Deploy to Kubernetes run: kubectl set image deployment/myapp myapp=myapp:${{ github.sha }}

Best Practices:

Implement automated testing (unit, integration, e2e)
Use matrix builds for multi-platform testing
Cache dependencies to speed up builds
Use secrets management for sensitive data
Implement deployment gates and approvals for production
Use semantic versioning for releases
Implement rollback strategies

Infrastructure as Code (IaC) Principles

Version Control:

Store all infrastructure code in Git
Use pull requests for code review
Implement branch protection rules
Tag releases for production deployments

Testing:

Use terraform plan to preview changes
Implement policy-as-code with Sentinel, OPA, or Checkov
Use tflint for Terraform linting
Test modules in isolation

Documentation:

Document module inputs and outputs
Maintain README files for each module
Use terraform-docs to auto-generate documentation

Monitoring and Observability

The Three Pillars:

Metrics (Prometheus + Grafana)

Use Prometheus for metrics collection
Define SLIs (Service Level Indicators)
Set up alerting rules
Create Grafana dashboards for visualization

Logs (ELK Stack, CloudWatch, Cloud Logging)

Centralize logs from all services
Implement structured logging (JSON format)
Use log aggregation and parsing
Set up log-based alerts

Traces (Jaeger, Zipkin, X-Ray)

Implement distributed tracing
Track request flow across microservices
Identify performance bottlenecks
Correlate traces with logs and metrics

Observability Best Practices:

Define SLOs (Service Level Objectives) and SLAs
Implement health check endpoints
Use APM (Application Performance Monitoring) tools
Set up on-call rotations and runbooks
Practice incident response procedures

Container Orchestration (Kubernetes)

Helm Charts:

Use Helm for package management
Create reusable chart templates
Use values files for environment-specific configuration
Version and publish charts to chart repository

Kubernetes Operators:

Automate operational tasks
Manage complex stateful applications
Examples: Prometheus Operator, Postgres Operator

Service Mesh (Istio, Linkerd):

Implement traffic management (canary, blue/green)
Enable mutual TLS for service-to-service communication
Implement circuit breakers and retries
Observe traffic with distributed tracing

Cost Optimization

AWS Cost Optimization:

Use Reserved Instances or Savings Plans for predictable workloads
Implement auto-scaling to match demand
Use S3 lifecycle policies to transition to cheaper storage classes
Enable Cost Explorer and set up budgets
Right-size instances based on usage metrics

Multi-Cloud Cost Management:

Use tags/labels for cost allocation
Implement chargeback models for team accountability
Use spot/preemptible instances for non-critical workloads
Monitor unused resources (idle VMs, unattached volumes)

Cloudflare Developer Platform

Cloudflare Workers & Pages:

Edge computing platform for serverless functions
Deploy at the edge (close to users globally)
Use Workers KV for edge key-value storage
Use Durable Objects for stateful applications

Cloudflare Primitives:

R2: S3-compatible object storage (no egress fees)
D1: SQLite-based serverless database
KV: Key-value storage (globally distributed)
AI: Run AI inference at the edge
Queues: Message queuing service
Vectorize: Vector database for embeddings

Configuration (wrangler.toml):

name = "my-worker" main = "src/index.ts" compatibility_date = "2024-01-01"

[[kv_namespaces]] binding = "MY_KV" id = "xxx"

[[r2_buckets]] binding = "MY_BUCKET" bucket_name = "my-bucket"

[[d1_databases]] binding = "DB" database_name = "my-db" database_id = "xxx"

Consolidated Skills

This expert skill consolidates 1 individual skills:

cloudflare-developer-tools-rule

Related Skills

docker-compose
Container orchestration and multi-container application management

Memory Protocol (MANDATORY)

Before starting:

cat .claude/context/memory/learnings.md

After completing: Record any new patterns or exceptions discovered.

ASSUME INTERRUPTION: Your context may reset. If it's not in memory, it didn't happen.

cloud-devops-expert

Safety Notice

Copy this and send it to your AI assistant to learn

Initialize

Plan (review changes)

Apply (execute changes)

Destroy (when needed)

Source Transparency

Related Skills

pyqt6-ui-development-rules

code-analyzer

gcloud-cli