DevOps Engineering — Quick Reference
This skill equips teams with actionable templates, checklists, and patterns for building self-service platforms, automating infrastructure with GitOps, deploying securely with DevSecOps, scaling with Kubernetes, ensuring reliability through SRE practices, and operating production systems with strong observability.
Modern baseline (2026): IaC (Terraform/OpenTofu/Pulumi), GitOps (Argo CD/Flux), Kubernetes (follow upstream supported releases), OpenTelemetry + Prometheus/Grafana, supply-chain security (SBOM + signing + provenance), policy-as-code (OPA/Gatekeeper or Kyverno), and eBPF-powered networking/security/observability (e.g., Cilium + Tetragon).
Quick Reference
| Task | Tool/Framework | Command | When to Use |
|------|----------------|---------|-------------|
| Infrastructure as Code | Terraform / OpenTofu | terraform plan && terraform apply | Provision cloud resources declaratively |
| GitOps Deployment | Argo CD / Flux | argocd app sync myapp | Continuous reconciliation, declarative deployments |
| Container Build | Docker Engine | docker build -t app:v1 . | Package applications with dependencies |
| Kubernetes Deployment | kubectl / Helm (Kubernetes) | kubectl apply -f deploy.yaml / helm upgrade app ./chart | Deploy to K8s cluster, manage releases |
| CI/CD Pipeline | GitHub Actions | Define workflow in .github/workflows/ci.yml | Automated testing, building, deploying |
| Security Scanning | Trivy / Falco / Tetragon | trivy image myapp:latest | Vulnerability scanning, runtime security, eBPF enforcement |
| Monitoring & Alerts | Prometheus + Grafana | Configure ServiceMonitor and AlertManager | Observability, SLO tracking, incident alerts |
| Load Testing | k6 / Locust | k6 run load-test.js | Performance validation, capacity planning |
| Incident Response | PagerDuty / Opsgenie | Configure escalation policies | On-call management, automated escalation |
| Platform Engineering | Backstage / Port | Deploy internal developer portal | Self-service infrastructure, golden paths |
Decision Tree: Choosing DevOps Approach
What do you need to accomplish?
├─ Infrastructure provisioning?
│  ├─ Cloud-agnostic → Terraform or OpenTofu (OSS fork)
│  ├─ Programming-first → Pulumi (TypeScript/Python/Go)
│  ├─ AWS-specific → CloudFormation or Terraform/OpenTofu
│  ├─ GCP-specific → Deployment Manager or Terraform/OpenTofu
│  └─ Azure-specific → ARM/Bicep or Terraform/OpenTofu
│
├─ Application deployment?
│  ├─ Kubernetes cluster?
│  │  ├─ Simple deploy → kubectl apply -f manifests/
│  │  ├─ Complex app → Helm charts
│  │  └─ GitOps workflow → ArgoCD or FluxCD
│  └─ Serverless?
│     ├─ AWS → Lambda + SAM/Serverless Framework
│     ├─ GCP → Cloud Functions
│     └─ Azure → Azure Functions
│
├─ CI/CD pipeline setup?
│  ├─ GitHub-based → GitHub Actions (template-github-actions.md)
│  ├─ GitLab-based → GitLab CI
│  ├─ Enterprise → Jenkins or Tekton
│  └─ Security-first → Add SAST/DAST/SCA scans (template-ci-cd.md)
│
├─ Observability & monitoring?
│  ├─ Metrics → Prometheus + Grafana
│  ├─ Distributed tracing → Jaeger or OpenTelemetry
│  ├─ Logs → Loki or ELK stack
│  ├─ eBPF-based → Cilium + Hubble (sidecarless)
│  └─ Unified platform → Datadog or New Relic
│
├─ Incident management?
│  ├─ On-call rotation → PagerDuty or Opsgenie
│  ├─ Postmortem → template-postmortem.md
│  └─ Communication → template-incident-comm.md
│
├─ Platform engineering?
│  ├─ Self-service → Backstage or Port (internal developer portal)
│  ├─ Policy enforcement → OPA/Gatekeeper
│  └─ Golden paths → Template repositories + automation
│
└─ Security hardening?
   ├─ Container scanning → Trivy or Grype
   ├─ Runtime security → Falco or Sysdig
   ├─ Secrets management → HashiCorp Vault or cloud-native KMS
   └─ Compliance → CIS Benchmarks, template-security-hardening.md
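The decision tree above can also be kept as data, for example inside a small internal CLI or portal plugin. The following is a minimal sketch that encodes only a subset of the branches; the names `DECISION_TREE` and `recommend` are hypothetical and not part of this skill's templates.

```python
from typing import Union

# A subset of the decision tree, encoded as nested dicts.
# Inner dicts are questions; strings are recommendations (leaves).
Tree = dict[str, Union[str, "Tree"]]

DECISION_TREE: Tree = {
    "infrastructure provisioning": {
        "cloud-agnostic": "Terraform or OpenTofu",
        "programming-first": "Pulumi",
        "aws-specific": "CloudFormation or Terraform/OpenTofu",
    },
    "application deployment": {
        "kubernetes": {
            "simple deploy": "kubectl apply -f manifests/",
            "complex app": "Helm charts",
            "gitops workflow": "ArgoCD or FluxCD",
        },
    },
}


def recommend(*path: str) -> str:
    """Walk the tree along `path` and return the recommendation at the leaf."""
    node: Union[str, Tree] = DECISION_TREE
    for step in path:
        if not isinstance(node, dict) or step not in node:
            raise KeyError(f"no branch {step!r} on path {path}")
        node = node[step]
    if isinstance(node, dict):
        # The path ended on a question, not an answer.
        raise ValueError(f"path {path} needs one more choice: {sorted(node)}")
    return node


if __name__ == "__main__":
    print(recommend("application deployment", "kubernetes", "complex app"))
```

Keeping the tree as plain data means new branches can be added without touching the lookup logic.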
When to Use This Skill
Claude should invoke this skill when users request:
- Platform engineering patterns (self-service developer platforms, internal tools)
- GitOps workflows (ArgoCD, FluxCD, declarative infrastructure management)
- Infrastructure as Code patterns (Terraform, K8s manifests, policy as code)
- CI/CD pipelines with DevSecOps (GitHub Actions, security scanning, SAST/DAST/SCA)
- SRE incident management, escalation, and postmortem templates
- eBPF-based observability (Cilium, Hubble, kernel-level insights, OpenTelemetry)
- Kubernetes operational patterns (day-2 operations, resource management, workload placement)
- Cloud-native monitoring (Prometheus, Grafana, unified observability platforms)
- Team workflow, communication, handover guides, and runbooks
Resources (Best Practices Guides)
Operational best practices by domain:
- DevOps/SRE Operations: references/devops-best-practices.md - Core patterns for safe infrastructure changes, deployments, and incident response
- Platform Engineering: references/platform-engineering-patterns.md - Self-service platforms, golden paths, internal developer portals, policy as code
- GitOps Workflows: references/gitops-workflows.md - Continuous reconciliation, multi-environment promotion, ArgoCD/FluxCD patterns, progressive delivery
- SRE Incident Management: references/sre-incident-management.md - Severity classification, escalation procedures, blameless postmortems, alert correlation, and runbooks
- Operational Standards: references/operational-patterns.md - Platform engineering blueprints, CI/CD safety, SLOs, and reliability drills
- AIOps: references/aiops-patterns.md - Self-healing systems, automated operations, AI-assisted analysis
Templates (Copy-Paste Ready)
Production templates organized by tech stack:
AWS Cloud
- assets/aws/template-aws-ops.md - AWS service operations and best practices
- assets/aws/template-aws-terraform.md - Terraform modules for AWS infrastructure
- assets/aws/template-cost-optimization.md - AWS cost optimization strategies
GCP Cloud
- assets/gcp/template-gcp-ops.md - GCP service operations
- assets/gcp/template-gcp-terraform.md - Terraform modules for GCP
Azure Cloud
- assets/azure/template-azure-ops.md - Azure service operations
Kubernetes
- assets/kubernetes/template-kubernetes-ops.md - Day-to-day K8s operations
- assets/kubernetes/template-ha-dr.md - High availability and disaster recovery
- assets/kubernetes/template-platform-api.md - Platform API patterns
- assets/kubernetes/template-k8s-deploy.yaml - Deployment manifests
Docker
- assets/docker/template-docker-ops.md - Container build, security, and operations
Kafka
- assets/kafka/template-kafka-ops.md - Kafka cluster operations and streaming
Terraform & IaC
- assets/terraform-iac/template-iac-terraform.md - Infrastructure as Code patterns
- assets/terraform-iac/template-module.md - Reusable Terraform modules
- assets/terraform-iac/template-env-promotion.md - Environment promotion strategies
CI/CD Pipelines
- assets/cicd-pipelines/template-ci-cd.md - General CI/CD patterns
- assets/cicd-pipelines/template-github-actions.md - GitHub Actions workflows
- assets/cicd-pipelines/template-gitops.md - GitOps deployment patterns
- assets/cicd-pipelines/template-release-safety.md - Safe release practices
Monitoring & Observability
- assets/monitoring-observability/template-slo.md - Service level objectives
- assets/monitoring-observability/template-alert-rules.md - Alert configuration
- assets/monitoring-observability/template-observability-slo.md - Observability patterns
- assets/monitoring-observability/template-loadtest-perf.md - Load testing and performance
Incident Response
- assets/incident-response/template-postmortem.md - Incident postmortems
- assets/incident-response/template-runbook-starter.md - Runbook starter template
- assets/incident-response/template-incident-comm.md - Incident communication
- assets/incident-response/template-incident-response.md - Incident response procedures
Security
- assets/security/template-security-hardening.md - Security hardening checklists
Shared Utilities
Centralized patterns from software-clean-code-standard (extract, don't duplicate):
- config-validation.md - Zod 3.24+, secrets management (Vault, 1Password, Doppler)
- resilience-utilities.md - p-retry v6, circuit breaker, OTel spans
- logging-utilities.md - pino v9 + OpenTelemetry integration
- observability-utilities.md - OpenTelemetry SDK, tracing, metrics
Related Skills
Operations & Infrastructure:
- ../qa-resilience/SKILL.md - Resilience, chaos engineering, and failure handling patterns
- ../data-sql-optimization/SKILL.md - Database tuning, high availability, and migrations
- ../qa-observability/SKILL.md - Monitoring, tracing, profiling, and performance optimization
- ../qa-debugging/SKILL.md - Production debugging, log analysis, and root cause investigation
Security & Compliance:
- ../software-security-appsec/SKILL.md - Application-layer security patterns and OWASP best practices
Software Development:
- ../software-backend/SKILL.md - Service-level design and integration patterns
- ../software-architecture-design/SKILL.md - System design, scalability, and architectural patterns
- ../dev-api-design/SKILL.md - RESTful API design and versioning
- ../dev-git-workflow/SKILL.md - Git branching strategies and CI/CD integration
Optional: AI/Automation (Related Skills):
- ../ai-mlops/SKILL.md - ML model deployment, monitoring, and lifecycle management
Cost Governance & Capacity Planning
assets/cost-governance/template-cost-governance.md — Production cost control for cloud infrastructure.
Key Sections
- Cost Governance Framework - Tagging strategy, budget alerts, anomaly detection
- Cloud Cost Optimization - Right-sizing, reserved capacity, storage tiering
- Kubernetes Cost Control - Resource requests/limits, quotas, autoscaler config
- Capacity Planning - Utilization baseline, growth projections, scaling triggers
- FinOps Practices - Monthly review agenda, optimization workflow
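The capacity-planning items above (utilization baseline, growth projection, scaling trigger) can be connected with simple compound-growth arithmetic. This is a minimal sketch, assuming a steady month-over-month growth rate and a fixed trigger threshold; the function name and all numbers are hypothetical and are not taken from the template.

```python
import math


def months_until_trigger(baseline_util: float, monthly_growth: float, trigger_util: float) -> int:
    """Return the number of whole months until utilization reaches the trigger.

    baseline_util and trigger_util are fractions of capacity in (0, 1];
    monthly_growth is the assumed compound growth per month (0.05 means 5%).
    """
    if not 0 < baseline_util <= 1 or not 0 < trigger_util <= 1:
        raise ValueError("utilization values must be in (0, 1]")
    if monthly_growth <= 0:
        raise ValueError("monthly_growth must be positive")
    if baseline_util >= trigger_util:
        return 0  # already at or past the scaling trigger
    # Smallest integer n with baseline * (1 + growth)^n >= trigger.
    n = math.log(trigger_util / baseline_util) / math.log(1 + monthly_growth)
    return math.ceil(n)


if __name__ == "__main__":
    # Hypothetical: 45% utilized today, 5% monthly growth, scale out at 70%.
    print(months_until_trigger(0.45, 0.05, 0.70))  # 10
```

Real growth is rarely this smooth, so the result is best read as a planning horizon to revisit at each review rather than a forecast.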
Do / Avoid
Do
- Tag all resources at creation time
- Set budget alerts before hitting limits
- Review right-sizing recommendations monthly
- Use spot/preemptible for fault-tolerant workloads
- Set Kubernetes resource requests on all pods
- Enable cluster autoscaler with scale-down
- Document capacity planning assumptions
- Run blameless postmortems after every SEV1/2
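The "resource requests on all pods" item above lends itself to an automated check. This is a minimal sketch, assuming the pod spec has already been parsed into a dict shaped like the `spec` block of a Pod manifest; the function name is hypothetical. It inspects explicitly declared requests only, and it ignores init containers and any defaults a LimitRange might apply.

```python
from typing import Any


def containers_missing_requests(pod_spec: dict[str, Any]) -> list[str]:
    """Return names of containers that do not declare both CPU and memory requests."""
    missing = []
    for container in pod_spec.get("containers", []):
        # `resources` or `requests` may be absent or empty; treat both as "not set".
        requests = (container.get("resources") or {}).get("requests") or {}
        if "cpu" not in requests or "memory" not in requests:
            missing.append(container.get("name", "<unnamed>"))
    return missing


if __name__ == "__main__":
    # Hypothetical pod spec with one compliant and one non-compliant container.
    pod_spec = {
        "containers": [
            {"name": "api", "resources": {"requests": {"cpu": "250m", "memory": "256Mi"}}},
            {"name": "sidecar", "resources": {}},
        ]
    }
    print(containers_missing_requests(pod_spec))  # ['sidecar']
```

In practice a policy engine such as Kyverno or OPA/Gatekeeper would usually enforce this at admission time; a script like this is more suited to a quick pre-merge check.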
Avoid
| Anti-Pattern | Problem | Fix |
|--------------|---------|-----|
| No cost tags | Can't attribute spend | Enforce tags in CI/CD |
| Dev runs 24/7 | ~70% waste | Scheduled shutdown |
| Over-provisioned | Paying for idle capacity | Monthly right-sizing review |
| No reservations | On-demand premium | 60-70% reserved coverage target |
| Alert fatigue | Real issues missed | SLO-based alerting, tuned thresholds |
| Snowflake infra | Unreproducible, undocumented | Everything in Terraform/IaC |
| Clickops drift | Config outside IaC | Enforce GitOps reconciliation |
| No postmortems | Same incidents repeat | Blameless postmortem for SEV1/2 |
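One way to arrive at a figure like the "~70% waste" quoted for always-on dev environments is to compare the hours an environment is actually needed with the 168 hours in a week. This is a sketch under stated assumptions: the schedule below is hypothetical, and idle time only translates directly into wasted spend when billing is proportional to running hours (storage and reserved commitments keep costing money while instances are stopped).

```python
HOURS_PER_WEEK = 24 * 7  # 168


def idle_fraction(active_hours_per_day: float, active_days_per_week: int) -> float:
    """Fraction of the week an always-on environment runs without being needed."""
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    if not 0 <= active_days_per_week <= 7:
        raise ValueError("active_days_per_week must be between 0 and 7")
    active_hours = active_hours_per_day * active_days_per_week
    return 1 - active_hours / HOURS_PER_WEEK


if __name__ == "__main__":
    # Hypothetical schedule: dev is needed 10 hours a day, Monday to Friday.
    print(f"{idle_fraction(10, 5):.0%}")  # 70%
```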
Optional: AI/Automation (AIOps)
AI can assist with analysis and triage, but infrastructure/cost/incident decisions require human approval and an audit trail.
See references/aiops-patterns.md for self-healing systems, automated operations, AI-assisted analysis, and bounded claims.
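The human-approval and audit-trail requirement above can be sketched as a small gate around any AI-proposed action. This is a minimal illustration, not the pattern from references/aiops-patterns.md; `AuditRecord`, `run_with_approval`, and the example action names are hypothetical, and a real system would write to durable, append-only storage instead of an in-memory list.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class AuditRecord:
    action: str
    proposed_by: str
    approved_by: Optional[str]
    executed: bool
    timestamp: str


def run_with_approval(
    action: str,
    proposed_by: str,
    approved_by: Optional[str],
    execute: Callable[[], None],
    audit_log: list[AuditRecord],
) -> bool:
    """Run `execute` only if a human approver is named; always append an audit record."""
    approved = bool(approved_by and approved_by.strip())
    executed = False
    try:
        if approved:
            execute()
            executed = True
    finally:
        # Record the attempt even when it was refused or `execute` raised.
        audit_log.append(
            AuditRecord(
                action=action,
                proposed_by=proposed_by,
                approved_by=approved_by,
                executed=executed,
                timestamp=datetime.now(timezone.utc).isoformat(),
            )
        )
    return executed


if __name__ == "__main__":
    log: list[AuditRecord] = []
    # Refused: no approver named, so the action does not run.
    run_with_approval("scale-down node pool", "ai-triage", None, lambda: None, log)
    # Allowed: a named human approved the same proposal.
    run_with_approval("scale-down node pool", "ai-triage", "alice", lambda: None, log)
    for record in log:
        print(json.dumps(asdict(record)))
```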
Operational Deep Dives
See references/operational-patterns.md for:
- Platform engineering blueprints and GitOps reconciliation checklists
- DevSecOps CI/CD gates, SLO/SLI playbooks, and rollout verification steps
- Observability patterns (eBPF), incident noise reduction, and reliability drills
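The SLO/SLI playbooks mentioned above typically rest on error-budget burn rate: the observed error ratio divided by the budget the SLO allows. This is a minimal sketch of that arithmetic, assuming a 30-day SLO period; the function names and sample numbers are illustrative and are not taken from references/operational-patterns.md.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def budget_consumed(rate: float, window_hours: float, slo_period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget spent if `rate` is sustained for the window."""
    return rate * window_hours / slo_period_hours


if __name__ == "__main__":
    # Hypothetical: 99.9% SLO over 30 days, 1.44% of requests failing in the last hour.
    rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
    print(round(rate, 1))                      # 14.4
    print(round(budget_consumed(rate, 1), 3))  # 0.02
```

Alerting on a sustained high burn rate, rather than on raw error counts, is one common way to reduce the alert fatigue listed in the anti-pattern table.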
External Resources
See data/sources.json for curated sources organized by tech stack:
- Cloud Platforms: AWS, GCP, Azure documentation and best practices
- Container Orchestration: Kubernetes, Helm, Kustomize, Docker
- Infrastructure as Code: Terraform, OpenTofu, Pulumi, CloudFormation, ARM templates
- CI/CD & GitOps: GitHub Actions, GitLab CI, Jenkins, ArgoCD, FluxCD
- Streaming: Apache Kafka, Confluent, Strimzi
- Monitoring: Prometheus, Grafana, Datadog, OpenTelemetry, Jaeger, Cilium/Hubble, Tetragon
- SRE: Google SRE books, incident response patterns
- Security: OWASP DevSecOps, CIS Benchmarks, Trivy, Falco
- Tools: kubectl, k9s, stern, Cosign, Syft, Terragrunt
Use this skill as a hub for safe, modern, and production-grade DevOps patterns. All templates and patterns are operational—no theory or book summaries.
Trend Awareness Protocol
When users ask recommendation questions about DevOps, platform engineering, or cloud infrastructure, validate time-sensitive details (versions, deprecations, licensing, major releases) against primary sources.
Trigger Conditions
- "What's the best tool for [Kubernetes/IaC/CI-CD/monitoring]?"
- "What should I use for [container orchestration/GitOps/observability]?"
- "What's the latest in DevOps/platform engineering?"
- "Current best practices for [Terraform/ArgoCD/Prometheus]?"
- "Is [tool/approach] still relevant in 2026?"
- "[Kubernetes] vs [alternative]?" or "[ArgoCD] vs [FluxCD]?"
- "Best cloud provider for [use case]?"
- "What orchestration/monitoring tool should I use?"
Minimum Verification (Preferred Order)
- Check the official docs + release notes linked in data/sources.json for the specific tools you recommend.
- If internet access is available, confirm recent releases, breaking changes, and deprecations from those release pages.
- If internet access is not available, state that versions may have changed and focus on stable selection criteria (operational fit, ecosystem, maturity, team skills, compliance).
What to Report
After searching, provide:
- Current landscape: What tools/approaches are popular NOW (not 6 months ago)
- Emerging trends: New tools, patterns, or practices gaining traction
- Deprecated/declining: Tools/approaches losing relevance or support
- Recommendation: Based on fresh data, not just static knowledge
Example Topics (verify with fresh search)
- Kubernetes versions and ecosystem tools (1.33+, Cilium, Gateway API)
- Infrastructure as Code (Terraform, OpenTofu, Pulumi, CDK)
- GitOps platforms (ArgoCD, FluxCD, Codefresh)
- Observability stacks (OpenTelemetry, Grafana stack, Datadog)
- Platform engineering tools (Backstage, Port, Kratix)
- CI/CD platforms (GitHub Actions, GitLab CI, Dagger)
- Cloud-native security (Falco, Trivy, policy engines)
Fact-Checking
- Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.
- Prefer primary sources; report source links and dates for volatile information.
- If web access is unavailable, state the limitation and mark guidance as unverified.