Self-Service Infrastructure
Patterns for enabling developers to provision infrastructure without tickets, while maintaining governance and control.
When to Use This Skill
-
Designing infrastructure self-service capabilities
-
Creating reusable Terraform/Pulumi modules
-
Building environment provisioning systems
-
Implementing infrastructure guardrails
-
Reducing infrastructure request bottlenecks
-
Balancing developer autonomy with governance
Self-Service Fundamentals
What is Self-Service Infrastructure?
Self-Service Infrastructure: Enabling developers to provision and manage infrastructure directly, without filing tickets or waiting for ops teams.
Traditional Model: ┌─────────────────────────────────────────────────────────────┐ │ Developer → Ticket → Ops Review → Manual Provision → Done │ │ │ │ Timeline: Days to weeks │ │ Bottleneck: Ops team capacity │ │ Result: Shadow IT, workarounds, frustration │ └─────────────────────────────────────────────────────────────┘
Self-Service Model: ┌─────────────────────────────────────────────────────────────┐ │ Developer → Portal/API → Automatic Provision → Done │ │ │ │ Timeline: Minutes to hours │ │ Bottleneck: None (automated) │ │ Result: Speed, consistency, compliance │ └─────────────────────────────────────────────────────────────┘
Self-Service Spectrum: ├── Fully Managed: Click a button, get a database ├── Template-Based: Customize from approved templates ├── Policy-Constrained: Write IaC within guardrails └── Full Freedom: Any infrastructure (risky)
Sweet Spot: Template-Based with Policy Guardrails
Key Benefits
Self-Service Benefits:
For Developers: ├── Speed: Minutes instead of days ├── Autonomy: Provision when needed ├── Consistency: Same infrastructure every time ├── Learning: Understand infrastructure better └── Ownership: More responsibility, more control
For Operations: ├── Scale: Handle more requests without more people ├── Consistency: Enforce standards automatically ├── Focus: Work on platform, not tickets ├── Audit: Clear trail of who provisioned what └── Compliance: Built-in policy enforcement
For Organization: ├── Velocity: Faster time to market ├── Cost: Reduced ops overhead ├── Governance: Better compliance posture ├── Security: Consistent security controls └── Efficiency: Resources provisioned when needed
Self-Service Architecture
Component Architecture
Self-Service Infrastructure Architecture:
┌─────────────────────────────────────────────────────────────┐ │ USER INTERFACE │ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │ │ Portal │ │ CLI │ │ API │ │ │ │ (Web UI) │ │ (Terraform) │ │ (REST/gRPC)│ │ │ └──────┬──────┘ └──────┬──────┘ └──────┬──────┘ │ │ └────────────────┼────────────────┘ │ │ │ │ ├──────────────────────────┼───────────────────────────────────┤ │ ▼ │ │ ┌─────────────────────────────────────────────────────┐ │ │ │ ORCHESTRATION LAYER │ │ │ │ ├── Request validation │ │ │ │ ├── Policy evaluation (OPA/Sentinel) │ │ │ │ ├── Cost estimation │ │ │ │ ├── Approval workflow (if needed) │ │ │ │ └── Execution orchestration │ │ │ └─────────────────────────────────────────────────────┘ │ │ │ │ ├──────────────────────────┼───────────────────────────────────┤ │ ▼ │ │ ┌─────────────────────────────────────────────────────┐ │ │ │ TEMPLATE LIBRARY │ │ │ │ ├── Database modules (RDS, Cloud SQL) │ │ │ │ ├── Compute modules (EKS, GKE, VMs) │ │ │ │ ├── Storage modules (S3, GCS) │ │ │ │ ├── Network modules (VPC, subnets) │ │ │ │ └── Composite modules (full environments) │ │ │ └─────────────────────────────────────────────────────┘ │ │ │ │ ├──────────────────────────┼───────────────────────────────────┤ │ ▼ │ │ ┌─────────────────────────────────────────────────────┐ │ │ │ EXECUTION ENGINE │ │ │ │ ├── Terraform Cloud/Enterprise │ │ │ │ ├── Pulumi Service │ │ │ │ ├── Crossplane │ │ │ │ └── Cloud-native (CDK, ARM, Deployment Manager) │ │ │ └─────────────────────────────────────────────────────┘ │ │ │ │ ├──────────────────────────┼───────────────────────────────────┤ │ ▼ │ │ ┌─────────────────────────────────────────────────────┐ │ │ │ CLOUD PROVIDERS │ │ │ │ AWS │ GCP │ Azure │ Kubernetes │ Others │ │ │ └─────────────────────────────────────────────────────┘ │ └─────────────────────────────────────────────────────────────┘
Request Flow
Self-Service Request Flow:
┌─────────────────────────────────────────────────────────────┐ │ 1. REQUEST │ │ Developer: "I need a PostgreSQL database for staging" │ │ └── Via portal, CLI, or API │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 2. VALIDATION │ │ ├── User has permission? ✓ Team member │ │ ├── Request well-formed? ✓ Valid config │ │ ├── Within quotas? ✓ Under team limit │ │ └── Meets policy? ✓ Allowed instance type│ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 3. ENRICHMENT │ │ ├── Apply defaults db.t3.medium │ │ ├── Generate names myapp-staging-db │ │ ├── Assign network staging-vpc │ │ ├── Configure monitoring Datadog integration │ │ └── Estimate cost ~$50/month │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 4. APPROVAL (if required) │ │ ├── Auto-approve: staging, dev ✓ Auto-approved │ │ ├── Manual approve: production (Would need approval) │ │ └── Cost threshold: >$500/month (Would need approval) │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 5. EXECUTION │ │ ├── Generate Terraform Based on template │ │ ├── Plan Preview changes │ │ ├── Apply Create resources │ │ └── Verify Health checks │ └─────────────────────────────────────────────────────────────┘ │ ▼ ┌─────────────────────────────────────────────────────────────┐ │ 6. DELIVERY │ │ ├── Connection string → Vault │ │ ├── Notification → Slack/email │ │ ├── Documentation → Auto-generated │ │ └── Registration → Service catalog │ └─────────────────────────────────────────────────────────────┘
IaC Module Design
Terraform Module Patterns
Terraform Module Structure:
Organization-Wide Module Library: terraform-modules/ ├── databases/ │ ├── rds-postgres/ │ │ ├── main.tf │ │ ├── variables.tf │ │ ├── outputs.tf │ │ ├── versions.tf │ │ ├── README.md │ │ └── examples/ │ │ ├── simple/ │ │ └── production/ │ └── elasticache-redis/ ├── compute/ │ ├── eks-cluster/ │ └── ecs-service/ ├── storage/ │ └── s3-bucket/ └── network/ └── vpc/
Module Design Principles:
-
Opinionated Defaults
variables.tf
variable "instance_class" { type = string default = "db.t3.medium" # Sensible default description = "RDS instance type"
validation { condition = can(regex("^db\.(t3|r5|m5)", var.instance_class)) error_message = "Only approved instance families allowed." } }
-
Minimal Required Inputs
Only require what can't be defaulted
variable "name" { type = string description = "Database identifier" }
variable "environment" { type = string description = "Environment (dev, staging, prod)" }
-
Complete Outputs
outputs.tf
output "endpoint" { description = "Database connection endpoint" value = aws_db_instance.main.endpoint }
output "connection_secret_arn" { description = "ARN of secret with credentials" value = aws_secretsmanager_secret.db_credentials.arn }
-
Built-in Best Practices
Security hardened by default
resource "aws_db_instance" "main" {
Encryption always on
storage_encrypted = true
No public access
publicly_accessible = false
Automated backups
backup_retention_period = var.environment == "prod" ? 30 : 7
Enhanced monitoring
monitoring_interval = 60 }
Module Versioning
Module Versioning Strategy:
Semantic Versioning: ├── MAJOR: Breaking changes (new required inputs, removed outputs) ├── MINOR: New features (new optional inputs, new outputs) └── PATCH: Bug fixes (no interface changes)
Version Constraints:
Allow patch updates automatically
module "database" { source = "terraform.company.com/modules/rds-postgres" version = "~> 2.1.0" # >=2.1.0, <2.2.0 }
Pin to exact version (production)
module "database" { source = "terraform.company.com/modules/rds-postgres" version = "= 2.1.3" }
Deprecation Policy: ┌─────────────────────────────────────────────────────────────┐ │ Module Version Lifecycle │ ├─────────────────────────────────────────────────────────────┤ │ Current (v2.x): Supported, new features │ │ Previous (v1.x): Supported, security fixes only │ │ Deprecated (v0.x): Warning on use, no support │ │ Removed: Will not work │ │ │ │ Notification: │ │ ├── Slack announcement when version deprecated │ │ ├── Warning in terraform plan output │ │ ├── Dashboard showing deprecated module usage │ │ └── Migration guide provided │ └─────────────────────────────────────────────────────────────┘
Policy and Guardrails
Policy as Code
Policy as Code Options:
-
HashiCorp Sentinel (Terraform Enterprise)
Require encryption for all storage
import "tfplan/v2" as tfplan
s3_buckets = filter tfplan.resource_changes as _, rc { rc.type is "aws_s3_bucket" and rc.mode is "managed" and (rc.change.actions contains "create" or rc.change.actions contains "update") }
encryption_enabled = rule { all s3_buckets as _, bucket { bucket.change.after.server_side_encryption_configuration is not null } }
main = rule { encryption_enabled }
-
Open Policy Agent (OPA)
Rego policy for Kubernetes
package kubernetes.admission
deny[msg] { input.request.kind.kind == "Pod" container := input.request.object.spec.containers[_] not container.securityContext.runAsNonRoot msg := "Containers must run as non-root" }
-
Cloud-Native Policies
AWS Service Control Policy
{ "Version": "2012-10-17", "Statement": [{ "Sid": "RequireEncryption", "Effect": "Deny", "Action": ["s3:CreateBucket"], "Resource": "*", "Condition": { "StringNotEquals": { "s3:x-amz-server-side-encryption": "AES256" } } }] }
Guardrail Categories
Infrastructure Guardrails:
-
Security Guardrails ├── Encryption required (at-rest, in-transit) ├── No public access by default ├── Required security groups ├── IAM role requirements └── Vulnerability scanning
-
Cost Guardrails ├── Instance type restrictions ├── Storage size limits ├── Required cost tags ├── Budget thresholds └── Approval for large resources
-
Compliance Guardrails ├── Allowed regions (data residency) ├── Required logging ├── Backup requirements ├── Retention policies └── Audit trail requirements
-
Operational Guardrails ├── Naming conventions ├── Required tags (owner, cost-center) ├── Resource quotas per team ├── Monitoring requirements └── Deletion protection
Guardrail Implementation: ┌─────────────────────────────────────────────────────────────┐ │ Guardrail Timing │ ├─────────────────────────────────────────────────────────────┤ │ │ │ Pre-Plan (fastest feedback): │ │ ├── Validate terraform files │ │ ├── Static analysis (tfsec, checkov) │ │ └── Module version checks │ │ │ │ Post-Plan (resource-aware): │ │ ├── OPA/Sentinel policy evaluation │ │ ├── Cost estimation │ │ └── Blast radius assessment │ │ │ │ Post-Apply (verification): │ │ ├── Configuration validation │ │ ├── Security scanning │ │ └── Compliance audit │ │ │ └─────────────────────────────────────────────────────────────┘
Environment Provisioning
Environment Templates
Environment Provisioning:
Environment Types: ┌─────────────────────────────────────────────────────────────┐ │ Development Environment │ │ ├── Purpose: Individual developer testing │ │ ├── Lifetime: Hours to days │ │ ├── Resources: Minimal (smallest instances) │ │ ├── Data: Synthetic or anonymized │ │ └── Approval: None (within quota) │ ├─────────────────────────────────────────────────────────────┤ │ Staging Environment │ │ ├── Purpose: Integration testing, QA │ │ ├── Lifetime: Persistent per service │ │ ├── Resources: Production-like (scaled down) │ │ ├── Data: Sanitized production subset │ │ └── Approval: None (within quota) │ ├─────────────────────────────────────────────────────────────┤ │ Production Environment │ │ ├── Purpose: Live customer traffic │ │ ├── Lifetime: Permanent │ │ ├── Resources: Full capacity │ │ ├── Data: Real customer data │ │ └── Approval: Required (security review) │ └─────────────────────────────────────────────────────────────┘
Environment Template:
environment/main.tf
module "network" { source = "../modules/vpc" environment = var.environment cidr_block = var.network_cidr }
module "kubernetes" { source = "../modules/eks" environment = var.environment vpc_id = module.network.vpc_id node_count = var.environment == "prod" ? 5 : 2 }
module "database" { source = "../modules/rds" environment = var.environment vpc_id = module.network.vpc_id instance_class = var.environment == "prod" ? "db.r5.xlarge" : "db.t3.medium" multi_az = var.environment == "prod" }
module "cache" { source = "../modules/elasticache" environment = var.environment vpc_id = module.network.vpc_id node_type = var.environment == "prod" ? "cache.r5.large" : "cache.t3.micro" }
Ephemeral Environments
Ephemeral/Preview Environments:
Use Cases: ├── PR preview environments ├── Feature branch testing ├── Demo environments ├── Load testing environments └── Incident reproduction
Lifecycle: ┌─────────────────────────────────────────────────────────────┐ │ │ │ PR Created ──► Environment Created ──► Tests Run │ │ │ │ │ │ │ │ ▼ ▼ │ │ │ Preview URL PR Updated │ │ │ Posted to PR │ │ │ │ │ │ │ ▼ ▼ │ │ PR Merged ───────────────────────► Environment Destroyed │ │ │ │ Timeout: Auto-destroy after 7 days of inactivity │ │ │ └─────────────────────────────────────────────────────────────┘
Implementation:
.github/workflows/preview.yml
name: Preview Environment
on: pull_request: types: [opened, synchronize]
jobs:
deploy-preview:
runs-on: ubuntu-latest
steps:
- name: Create/Update Environment
run: |
terraform workspace select pr-${{ github.event.pull_request.number }} ||
terraform workspace new pr-${{ github.event.pull_request.number }}
terraform apply -auto-approve
- name: Comment Preview URL
uses: actions/github-script@v6
with:
script: |
github.rest.issues.createComment({
issue_number: context.issue.number,
body: '🚀 Preview: https://pr-${{ github.event.pull_request.number }}.preview.company.com'
})
Technology Options
Self-Service Platforms
Platform Comparison:
-
Terraform Cloud/Enterprise ├── Native Terraform experience ├── Policy as Code (Sentinel) ├── Private module registry ├── Cost estimation └── Enterprise features (SSO, audit)
-
Pulumi ├── Real programming languages ├── Strong typing and IDE support ├── Policy as Code (CrossGuard) └── Automation API
-
Crossplane ├── Kubernetes-native ├── GitOps workflow ├── Composition for modules └── Multi-cloud abstraction
-
Backstage + Terraform ├── Unified developer portal ├── Software templates ├── Plugin ecosystem └── Service catalog integration
-
Port/Cortex/OpsLevel ├── Commercial developer portals ├── Quick to implement ├── Built-in integrations └── Self-service workflows
Selection Criteria: ┌────────────────────────────────────────────────────────────┐ │ Factor │ Best Fit │ ├──────────────────────┼─────────────────────────────────────┤ │ Existing Terraform │ Terraform Cloud/Enterprise │ │ Kubernetes-first │ Crossplane │ │ Developer portal │ Backstage or commercial │ │ Programming language │ Pulumi │ │ Quick start │ Commercial (Port, OpsLevel) │ │ Maximum control │ Build custom │ └────────────────────────────────────────────────────────────┘
Cost Management
Cost Controls
Cost Management in Self-Service:
-
Cost Visibility ├── Estimated cost shown before provisioning ├── Cost tags automatically applied ├── Per-team/project dashboards └── Anomaly detection and alerts
-
Cost Guardrails ├── Instance type restrictions ├── Budget thresholds by team ├── Approval required above threshold └── Auto-shutdown of unused resources
-
Cost Optimization ├── Right-sizing recommendations ├── Reserved instance suggestions ├── Spot instance for non-production └── Scheduled scaling
Cost Estimation Flow: ┌─────────────────────────────────────────────────────────────┐ │ Request: PostgreSQL database for staging │ ├─────────────────────────────────────────────────────────────┤ │ │ │ Cost Estimate: │ │ ├── Compute (db.t3.medium): $30/month │ │ ├── Storage (100GB gp3): $10/month │ │ ├── Backup storage: ~$5/month │ │ └── Data transfer: ~$5/month │ │ ───────── │ │ Estimated Total: ~$50/month │ │ │ │ ✓ Within team budget ($500/month quota) │ │ ✓ No approval required │ │ │ │ [Proceed] [Modify] [Cancel] │ └─────────────────────────────────────────────────────────────┘
Best Practices
Self-Service Infrastructure Best Practices:
-
Start Small, Expand Gradually ├── Begin with 2-3 common resources ├── Add based on demand ├── Iterate on feedback └── Don't try to cover everything day 1
-
Balance Autonomy and Governance ├── Guardrails not gates ├── Automate approvals where safe ├── Clear escalation paths └── Trust but verify
-
Optimize for Developer Experience ├── Minimal required inputs ├── Sensible defaults ├── Clear error messages └── Fast feedback loops
-
Maintain Module Quality ├── Automated testing ├── Documentation requirements ├── Versioning strategy └── Deprecation process
-
Monitor and Improve ├── Track provisioning success rate ├── Measure time to provision ├── Gather user feedback └── Identify automation opportunities
-
Handle Edge Cases ├── What if provisioning fails? ├── How to handle orphaned resources? ├── What about existing resources? └── How to migrate between versions?
Anti-Patterns
Self-Service Anti-Patterns:
-
"Self-Service Everything" ❌ Every possible configuration option ✓ Curated set of approved patterns
-
"Security Theater" ❌ Manual approvals that don't add value ✓ Automated policy enforcement
-
"Configuration Explosion" ❌ 50 parameters per resource ✓ Sensible defaults with few overrides
-
"Ignore Cost" ❌ No visibility into provisioned cost ✓ Cost estimation and budgets
-
"Build vs Buy Wrong" ❌ Building everything from scratch ✓ Use existing tools where appropriate
-
"No Escape Hatch" ❌ Blocking legitimate exceptions ✓ Process for justified deviations
Related Skills
-
internal-developer-platform
-
Platform engineering overview
-
golden-paths
-
Standardized workflows
-
container-orchestration
-
Kubernetes infrastructure
-
serverless-patterns
-
Serverless infrastructure