CI/CD Expert
Create, debug, and optimize CI/CD pipelines.
Platform Selection
Platform Best For Key Strength
GitHub Actions GitHub-hosted repos, open source Native GitHub integration, marketplace
GitLab CI GitLab repos, self-hosted Built-in registry, Auto DevOps
CircleCI Complex workflows, speed Parallelism, orbs ecosystem
Azure DevOps Microsoft/enterprise, multi-repo Azure integration, YAML templates
Bitbucket Atlassian stack, Jira integration Pipes marketplace, deployments
Quick pick:
-
GitHub repo? → GitHub Actions
-
GitLab repo? → GitLab CI
-
Need extreme parallelism? → CircleCI
-
Azure/Microsoft shop? → Azure DevOps
-
Using Jira/Confluence? → Bitbucket Pipelines
Core Workflows
- Pipeline Creation
Requirements → Platform Selection → Structure → Implementation → Validation
Steps:
-
Identify triggers (push, PR, schedule, manual)
-
Define stages (build, test, deploy)
-
Map environments (dev, staging, prod)
-
Configure secrets and variables
-
Set up caching strategy
-
Implement deployment gates
Minimal viable pipeline:
Every pipeline needs these elements
triggers: # When to run stages: # What to run (in order)
- build # Compile/bundle
- test # Validate
- deploy # Ship (optional) caching: # Speed optimization environment: # Secrets/variables
- Pipeline Review
Review checklist (in order of priority):
Security
-
Secrets not hardcoded
-
Minimal permissions (least privilege)
-
Dependencies pinned (no @latest )
-
Untrusted input sanitized
Reliability
-
Idempotent steps
-
Explicit failure handling
-
Timeouts configured
-
Retry logic for flaky steps
Efficiency
-
Caching implemented
-
Parallelization where possible
-
Conditional execution (skip unchanged)
-
Resource sizing appropriate
Maintainability
-
DRY (reusable workflows/templates)
-
Clear naming conventions
-
Comments on non-obvious logic
- Pipeline Debugging
Diagnostic framework:
Symptom → Category → Platform-specific fix
Common failure categories:
Symptom Likely Cause First Check
"Command not found" Missing dependency Environment setup
"Permission denied" File/secret access Permissions config
"Connection refused" Service not ready Health checks, wait
"Out of memory" Resource limits Runner sizing
"Timeout exceeded" Slow step or deadlock Step isolation
"Exit code 1" Script failure Run locally first
Debug approach:
-
Read the full error (scroll up!)
-
Identify which step failed
-
Check if it works locally
-
Add debug output (set -x , verbose flags)
-
Check platform-specific logs
-
Deployment Strategy Selection
Risk Tolerance ↑ Low │ High │ ┌─────────────────┼─────────────────┐ │ │ │
Slow │ Blue-Green │ Rolling │ Fast │ (safest) │ (balanced) │ │ │ │ ├─────────────────┼─────────────────┤ │ │ │ │ Canary │ Direct │ │ (measured) │ (fastest) │ │ │ │ └─────────────────┴─────────────────┘ Rollback Speed →
Decision guide:
Strategy Use When Avoid When
Blue-Green Zero downtime critical, simple rollback needed Limited infrastructure budget
Canary Need to validate with real traffic, gradual rollout Urgent hotfix, no traffic splitting
Rolling Large clusters, cost-conscious Stateful apps, breaking changes
Direct Dev/test environments, low risk Production, user-facing services
- Build Optimization
Optimization priority:
-
Caching (highest impact)
-
Parallelization
-
Conditional execution
-
Resource sizing
-
Image optimization
Caching hierarchy:
Dependencies (npm, pip) → Build artifacts → Docker layers → Test fixtures ↑ ↑ ↑ ↑ Always cache If build slow If using Docker If tests slow
Cache key strategies:
Dependencies - hash lockfile
key: deps-${{ hashFiles('package-lock.json') }}
Build artifacts - hash source
key: build-${{ hashFiles('src/**') }}
Combined - hash both
key: ${{ runner.os }}-${{ hashFiles('**/lockfiles') }}
Common Patterns
Secret Management
✅ Good: Environment variables
env: API_KEY: ${{ secrets.API_KEY }}
❌ Bad: Inline secrets
run: curl -H "Authorization: hardcoded-secret"
Secret rotation checklist:
-
Secrets stored in platform secret store
-
Secrets scoped to minimum required (repo/org/env)
-
Rotation procedure documented
-
No secrets in logs (mask sensitive output)
Environment Progression
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │ Dev │ → │Staging │ → │ QA │ → │ Prod │ └─────────┘ └─────────┘ └─────────┘ └─────────┘ Auto Auto Manual Manual deploy deploy approval approval
Artifact Management
Upload
- name: Upload build uses: actions/upload-artifact@v4 with: name: dist path: dist/ retention-days: 7
Download (later job)
- name: Download build uses: actions/download-artifact@v4 with: name: dist
Matrix Builds
Test across multiple versions
strategy: matrix: node: [18, 20, 22] os: [ubuntu-latest, macos-latest] fail-fast: false # Continue other jobs if one fails
Anti-Patterns
❌ Secrets in Logs
BAD
run: echo "Deploying with key $API_KEY"
GOOD
run: echo "Deploying..." && deploy --key "$API_KEY"
❌ No Caching
BAD: Downloads deps every run
- run: npm install
GOOD: Cache node_modules
- uses: actions/cache@v4 with: path: node_modules key: ${{ hashFiles('package-lock.json') }}
- run: npm ci
❌ Mutable Tags
BAD: Could change anytime
uses: actions/checkout@master
GOOD: Pinned version
uses: actions/checkout@v4
❌ Overly Broad Triggers
BAD: Runs on every push to every branch
on: push
GOOD: Specific triggers
on: push: branches: [main] pull_request: branches: [main]
❌ Sequential When Parallel Works
BAD: 15 min total
jobs: lint: ... test: needs: lint # Waits for lint ...
GOOD: 10 min total (parallel)
jobs: lint: ... test: ... # Runs alongside lint
Troubleshooting Quick Reference
Error Pattern Platform Solution
EACCES: permission denied
All chmod +x script.sh or check file ownership
rate limit exceeded
GitHub Use GITHUB_TOKEN , implement backoff
no space left on device
All Clean workspace, use smaller runner
SSL certificate problem
All Update CA certs, check proxy settings
container not found
Docker-based Check registry auth, image exists
workflow not triggered
GitHub Check trigger conditions, branch patterns
job timeout
All Increase timeout, optimize slow steps
Integration Points
With version control:
-
Branch protection rules
-
Required status checks
-
Auto-merge on success
With deployment targets:
-
Cloud providers (AWS, GCP, Azure)
-
Container registries (Docker Hub, ECR, GCR)
-
Kubernetes clusters
-
Serverless (Lambda, Cloud Functions)
With monitoring:
-
Build metrics (duration, success rate)
-
Deployment tracking
-
Error alerting (Slack, PagerDuty)
Skill Usage
Creating a pipeline:
-
Identify platform from repo hosting
-
Read platform-specific reference
-
Follow creation workflow above
-
Validate with platform's linter
Debugging a failure:
-
Categorize the error (table above)
-
Read debugging-guide.md for detailed steps
-
Check platform-specific quirks
Implementing deployments:
-
Choose strategy using decision guide
-
Read deployment-strategies.md
-
Implement with platform-specific syntax
Security audit:
-
Use review checklist above
-
Read security-checklist.md for comprehensive review
-
Generate ohno tasks for findings
References:
-
references/github-actions.md — GitHub Actions syntax, patterns, marketplace actions
-
references/gitlab-ci.md — GitLab CI syntax, Auto DevOps, runners
-
references/circleci.md — CircleCI orbs, parallelism, caching
-
references/azure-devops.md — Azure Pipelines, templates, stages
-
references/bitbucket-pipelines.md — Bitbucket Pipelines, pipes, deployments
-
references/deployment-strategies.md — Blue-green, canary, rolling implementations
-
references/debugging-guide.md — Platform-specific debugging techniques
-
references/security-checklist.md — OWASP CI/CD security, supply chain