Self-Service Infrastructure

Patterns for enabling developers to provision infrastructure without tickets, while maintaining governance and control.

When to Use This Skill

Designing infrastructure self-service capabilities
Creating reusable Terraform/Pulumi modules
Building environment provisioning systems
Implementing infrastructure guardrails
Reducing infrastructure request bottlenecks
Balancing developer autonomy with governance

Self-Service Fundamentals

What is Self-Service Infrastructure?

Self-Service Infrastructure: Enabling developers to provision and manage infrastructure directly, without filing tickets or waiting for ops teams.

Traditional Model:
  Developer → Ticket → Ops Review → Manual Provision → Done

  Timeline: Days to weeks
  Bottleneck: Ops team capacity
  Result: Shadow IT, workarounds, frustration

Self-Service Model:
  Developer → Portal/API → Automatic Provision → Done

  Timeline: Minutes to hours
  Bottleneck: None (automated)
  Result: Speed, consistency, compliance

Self-Service Spectrum:
├── Fully Managed: Click a button, get a database
├── Template-Based: Customize from approved templates
├── Policy-Constrained: Write IaC within guardrails
└── Full Freedom: Any infrastructure (risky)

Sweet Spot: Template-Based with Policy Guardrails

Key Benefits

Self-Service Benefits:

For Developers:
├── Speed: Minutes instead of days
├── Autonomy: Provision when needed
├── Consistency: Same infrastructure every time
├── Learning: Understand infrastructure better
└── Ownership: More responsibility, more control

For Operations:
├── Scale: Handle more requests without more people
├── Consistency: Enforce standards automatically
├── Focus: Work on platform, not tickets
├── Audit: Clear trail of who provisioned what
└── Compliance: Built-in policy enforcement

For Organization:
├── Velocity: Faster time to market
├── Cost: Reduced ops overhead
├── Governance: Better compliance posture
├── Security: Consistent security controls
└── Efficiency: Resources provisioned when needed

Self-Service Architecture

Component Architecture

Self-Service Infrastructure Architecture:

USER INTERFACE
├── Portal (Web UI)
├── CLI (Terraform)
└── API (REST/gRPC)
        │
        ▼
ORCHESTRATION LAYER
├── Request validation
├── Policy evaluation (OPA/Sentinel)
├── Cost estimation
├── Approval workflow (if needed)
└── Execution orchestration
        │
        ▼
TEMPLATE LIBRARY
├── Database modules (RDS, Cloud SQL)
├── Compute modules (EKS, GKE, VMs)
├── Storage modules (S3, GCS)
├── Network modules (VPC, subnets)
└── Composite modules (full environments)
        │
        ▼
EXECUTION ENGINE
├── Terraform Cloud/Enterprise
├── Pulumi Service
├── Crossplane
└── Cloud-native (CDK, ARM, Deployment Manager)
        │
        ▼
CLOUD PROVIDERS
AWS │ GCP │ Azure │ Kubernetes │ Others

Request Flow

Self-Service Request Flow:

1. REQUEST
   Developer: "I need a PostgreSQL database for staging"
   └── Via portal, CLI, or API
        │
        ▼
2. VALIDATION
   ├── User has permission?         ✓ Team member
   ├── Request well-formed?         ✓ Valid config
   ├── Within quotas?               ✓ Under team limit
   └── Meets policy?                ✓ Allowed instance type
        │
        ▼
3. ENRICHMENT
   ├── Apply defaults               db.t3.medium
   ├── Generate names               myapp-staging-db
   ├── Assign network               staging-vpc
   ├── Configure monitoring         Datadog integration
   └── Estimate cost                ~$50/month
        │
        ▼
4. APPROVAL (if required)
   ├── Auto-approve: staging, dev   ✓ Auto-approved
   ├── Manual approve: production   (Would need approval)
   └── Cost threshold: >$500/month  (Would need approval)
        │
        ▼
5. EXECUTION
   ├── Generate Terraform           Based on template
   ├── Plan                         Preview changes
   ├── Apply                        Create resources
   └── Verify                       Health checks
        │
        ▼
6. DELIVERY
   ├── Connection string → Vault
   ├── Notification → Slack/email
   ├── Documentation → Auto-generated
   └── Registration → Service catalog
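
The validation, enrichment, and approval steps of this flow could be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: the `Request` class, the function names, and the constants are hypothetical, with values borrowed from the example request above.

```python
from dataclasses import dataclass

# Hypothetical policy values, taken from the example flow above.
ALLOWED_FAMILIES = ("db.t3", "db.r5", "db.m5")
AUTO_APPROVE_ENVS = {"dev", "staging"}
COST_APPROVAL_THRESHOLD = 500.0  # USD per month


@dataclass
class Request:
    app: str
    environment: str
    team_quota_remaining: int
    instance_class: str = "db.t3.medium"  # default applied when not specified
    estimated_monthly_cost: float = 50.0


def validate(req: Request) -> list[str]:
    """Return validation errors; an empty list means the request passes."""
    errors = []
    if req.team_quota_remaining <= 0:
        errors.append("team quota exhausted")
    allowed_prefixes = tuple(family + "." for family in ALLOWED_FAMILIES)
    if not req.instance_class.startswith(allowed_prefixes):
        errors.append(f"instance class {req.instance_class} not allowed")
    return errors


def enrich(req: Request) -> dict:
    """Fill in generated names and network placement."""
    return {
        "name": f"{req.app}-{req.environment}-db",
        "network": f"{req.environment}-vpc",
        "instance_class": req.instance_class,
    }


def needs_approval(req: Request) -> bool:
    """Manual approval for non-auto-approved environments or high cost."""
    return (
        req.environment not in AUTO_APPROVE_ENVS
        or req.estimated_monthly_cost > COST_APPROVAL_THRESHOLD
    )
```

Keeping each step a small pure function makes it straightforward to expose the same checks through the portal, the CLI, and the API.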

IaC Module Design

Terraform Module Patterns

Terraform Module Structure:

Organization-Wide Module Library:

terraform-modules/
├── databases/
│   ├── rds-postgres/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   ├── README.md
│   │   └── examples/
│   │       ├── simple/
│   │       └── production/
│   └── elasticache-redis/
├── compute/
│   ├── eks-cluster/
│   └── ecs-service/
├── storage/
│   └── s3-bucket/
└── network/
    └── vpc/

Module Design Principles:

Opinionated Defaults

    # variables.tf
    variable "instance_class" {
      type        = string
      default     = "db.t3.medium" # Sensible default
      description = "RDS instance type"

      validation {
        # Backslashes must be doubled inside an HCL string literal
        condition     = can(regex("^db\\.(t3|r5|m5)", var.instance_class))
        error_message = "Only approved instance families allowed."
      }
    }

Minimal Required Inputs

    # Only require what can't be defaulted
    variable "name" {
      type        = string
      description = "Database identifier"
    }

    variable "environment" {
      type        = string
      description = "Environment (dev, staging, prod)"
    }

Complete Outputs

    # outputs.tf
    output "endpoint" {
      description = "Database connection endpoint"
      value       = aws_db_instance.main.endpoint
    }

    output "connection_secret_arn" {
      description = "ARN of secret with credentials"
      value       = aws_secretsmanager_secret.db_credentials.arn
    }
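
Built-in Best Practices

    # Security hardened by default
    resource "aws_db_instance" "main" {
      # Required arguments (engine, instance_class, ...) omitted for brevity

      # Encryption always on
      storage_encrypted = true

      # No public access
      publicly_accessible = false

      # Automated backups
      backup_retention_period = var.environment == "prod" ? 30 : 7

      # Enhanced monitoring (a non-zero interval also requires monitoring_role_arn;
      # the variable below is an assumed input, not part of the original snippet)
      monitoring_interval = 60
      monitoring_role_arn = var.monitoring_role_arn
    }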
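
A portal or CLI may want to reject bad inputs before Terraform runs at all. The sketch below mirrors the instance-class validation and the backup-retention conditional from the module snippets above; the function names are hypothetical and the rules are simply copied from those snippets.

```python
import re

# Same pattern as the module's validation rule: db.t3, db.r5, and db.m5 families.
APPROVED_INSTANCE_CLASS = re.compile(r"^db\.(t3|r5|m5)")


def is_approved_instance_class(instance_class: str) -> bool:
    """Return True when the instance class starts with an approved family."""
    return APPROVED_INSTANCE_CLASS.match(instance_class) is not None


def backup_retention_days(environment: str) -> int:
    """Production keeps 30 days of backups; every other environment keeps 7."""
    return 30 if environment == "prod" else 7
```

Duplicating a rule in two places has a maintenance cost, so this kind of pre-flight check is probably best limited to the few rules that users hit most often.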

Module Versioning

Module Versioning Strategy:

Semantic Versioning:
├── MAJOR: Breaking changes (new required inputs, removed outputs)
├── MINOR: New features (new optional inputs, new outputs)
└── PATCH: Bug fixes (no interface changes)

Version Constraints:

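    # Private registry sources take the form <HOSTNAME>/<NAMESPACE>/<NAME>/<PROVIDER>;
    # the "aws" provider segment is assumed here.

    # Allow patch updates automatically
    module "database" {
      source  = "terraform.company.com/modules/rds-postgres/aws"
      version = "~> 2.1.0" # >=2.1.0, <2.2.0
    }

    # Pin to exact version (production)
    module "database" {
      source  = "terraform.company.com/modules/rds-postgres/aws"
      version = "= 2.1.3"
    }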
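
To make the `~>` (pessimistic) constraint concrete, here is a small sketch of how it could be evaluated for plain numeric versions. The helper names are hypothetical, and pre-release suffixes such as `-beta1` are deliberately not handled.

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn "2.1.3" into (2, 1, 3)."""
    return tuple(int(part) for part in version.strip().split("."))


def satisfies_pessimistic(version: str, constraint: str) -> bool:
    """Check a version against a "~> X.Y" or "~> X.Y.Z" constraint.

    Only the rightmost component written in the constraint may increase.
    """
    base = parse(constraint.removeprefix("~>"))
    if len(base) < 2:
        raise ValueError("constraint needs at least two components")
    candidate = parse(version)
    # Treat "2.1" as "2.1.0" when the constraint has more components.
    candidate += (0,) * (len(base) - len(candidate))
    # Upper bound: bump the second-to-last component of the constraint.
    upper = base[:-2] + (base[-2] + 1,)
    return base <= candidate and candidate[: len(upper)] < upper
```

With these rules, `~> 2.1.0` accepts 2.1.3 but rejects 2.2.0, while `~> 2.1` accepts anything from 2.1 up to, but not including, 3.0.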

Deprecation Policy:

Module Version Lifecycle
├── Current (v2.x): Supported, new features
├── Previous (v1.x): Supported, security fixes only
├── Deprecated (v0.x): Warning on use, no support
└── Removed: Will not work

Notification:
├── Slack announcement when version deprecated
├── Warning in terraform plan output
├── Dashboard showing deprecated module usage
└── Migration guide provided

Policy and Guardrails

Policy as Code

Policy as Code Options:

HashiCorp Sentinel (Terraform Enterprise)

    # Require encryption for all storage
    import "tfplan/v2" as tfplan

    s3_buckets = filter tfplan.resource_changes as _, rc {
      rc.type is "aws_s3_bucket" and
      rc.mode is "managed" and
      (rc.change.actions contains "create" or rc.change.actions contains "update")
    }

    # Note: with AWS provider v4 and later, bucket encryption is normally declared
    # through the separate aws_s3_bucket_server_side_encryption_configuration
    # resource, so this attribute check may need adapting.
    encryption_enabled = rule {
      all s3_buckets as _, bucket {
        bucket.change.after.server_side_encryption_configuration is not null
      }
    }

    main = rule {
      encryption_enabled
    }

Open Policy Agent (OPA)

    # Rego policy for Kubernetes
    package kubernetes.admission

    deny[msg] {
      input.request.kind.kind == "Pod"
      container := input.request.object.spec.containers[_]
      not container.securityContext.runAsNonRoot
      msg := "Containers must run as non-root"
    }

Cloud-Native Policies

AWS Service Control Policy (denies object uploads that do not request SSE-S3 encryption):

    {
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "RequireEncryption",
        "Effect": "Deny",
        "Action": ["s3:PutObject"],
        "Resource": "*",
        "Condition": {
          "StringNotEquals": {
            "s3:x-amz-server-side-encryption": "AES256"
          }
        }
      }]
    }

Guardrail Categories

Infrastructure Guardrails:

Security Guardrails
├── Encryption required (at-rest, in-transit)
├── No public access by default
├── Required security groups
├── IAM role requirements
└── Vulnerability scanning

Cost Guardrails
├── Instance type restrictions
├── Storage size limits
├── Required cost tags
├── Budget thresholds
└── Approval for large resources

Compliance Guardrails
├── Allowed regions (data residency)
├── Required logging
├── Backup requirements
├── Retention policies
└── Audit trail requirements

Operational Guardrails
├── Naming conventions
├── Required tags (owner, cost-center)
├── Resource quotas per team
├── Monitoring requirements
└── Deletion protection
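
A few of these guardrails (required tags, allowed regions, naming conventions, no public access) can be expressed as simple checks on a resource request. The sketch below is illustrative only: the region list, the naming pattern, and the shape of the `resource` dictionary are assumptions rather than anything defined in this document.

```python
import re

REQUIRED_TAGS = {"owner", "cost-center"}
ALLOWED_REGIONS = {"eu-west-1", "eu-central-1"}  # hypothetical data-residency list
# Hypothetical naming convention: <app>-<env>-<kind>
NAME_PATTERN = re.compile(r"^[a-z0-9]+-(dev|staging|prod)-[a-z0-9]+$")


def guardrail_violations(resource: dict) -> list[str]:
    """Return human-readable violations; an empty list means compliant."""
    violations = []
    missing = REQUIRED_TAGS - set(resource.get("tags", {}))
    if missing:
        violations.append("missing tags: " + ", ".join(sorted(missing)))
    if resource.get("region") not in ALLOWED_REGIONS:
        violations.append(f"region {resource.get('region')} not allowed")
    if not NAME_PATTERN.match(resource.get("name", "")):
        violations.append("name does not follow <app>-<env>-<kind>")
    if resource.get("publicly_accessible", False):
        violations.append("public access is not allowed")
    return violations
```

Returning every violation at once, rather than stopping at the first, lets a developer fix a request in a single pass.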

Guardrail Implementation:

Guardrail Timing

Pre-Plan (fastest feedback):
├── Validate terraform files
├── Static analysis (tfsec, checkov)
└── Module version checks

Post-Plan (resource-aware):
├── OPA/Sentinel policy evaluation
├── Cost estimation
└── Blast radius assessment

Post-Apply (verification):
├── Configuration validation
├── Security scanning
└── Compliance audit
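
As an illustration of a post-plan check, the sketch below walks the `resource_changes` list found in the JSON plan representation (as produced by `terraform show -json`) and reports database instances that would be created or updated without storage encryption. The function name is hypothetical, and a real setup would more likely express this rule in OPA or Sentinel.

```python
def unencrypted_databases(plan: dict) -> list[str]:
    """Return addresses of aws_db_instance resources planned without encryption."""
    offenders = []
    for rc in plan.get("resource_changes", []):
        change = rc.get("change", {})
        actions = set(change.get("actions", []))
        # Only resources being created or updated are relevant.
        if rc.get("type") != "aws_db_instance" or not actions & {"create", "update"}:
            continue
        after = change.get("after") or {}
        if after.get("storage_encrypted") is not True:
            offenders.append(rc["address"])
    return offenders
```

The `plan` argument is expected to be the parsed JSON document, for example the result of `json.load` on a file holding the plan output.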

Environment Provisioning

Environment Templates

Environment Provisioning:

Environment Types:

Development Environment
├── Purpose: Individual developer testing
├── Lifetime: Hours to days
├── Resources: Minimal (smallest instances)
├── Data: Synthetic or anonymized
└── Approval: None (within quota)

Staging Environment
├── Purpose: Integration testing, QA
├── Lifetime: Persistent per service
├── Resources: Production-like (scaled down)
├── Data: Sanitized production subset
└── Approval: None (within quota)

Production Environment
├── Purpose: Live customer traffic
├── Lifetime: Permanent
├── Resources: Full capacity
├── Data: Real customer data
└── Approval: Required (security review)

Environment Template:

    # environment/main.tf
    module "network" {
      source      = "../modules/vpc"
      environment = var.environment
      cidr_block  = var.network_cidr
    }

    module "kubernetes" {
      source      = "../modules/eks"
      environment = var.environment
      vpc_id      = module.network.vpc_id
      node_count  = var.environment == "prod" ? 5 : 2
    }

    module "database" {
      source         = "../modules/rds"
      environment    = var.environment
      vpc_id         = module.network.vpc_id
      instance_class = var.environment == "prod" ? "db.r5.xlarge" : "db.t3.medium"
      multi_az       = var.environment == "prod"
    }

    module "cache" {
      source      = "../modules/elasticache"
      environment = var.environment
      vpc_id      = module.network.vpc_id
      node_type   = var.environment == "prod" ? "cache.r5.large" : "cache.t3.micro"
    }
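
The per-environment conditionals in this template amount to a sizing lookup. A portal that previews what an environment will contain might hold the same data as a table, as in the sketch below; the `SIZING` structure and `sizing_for` helper are hypothetical, with values copied from the template.

```python
# Hypothetical sizing table; values copied from the template's conditionals.
SIZING = {
    "prod": {
        "node_count": 5,
        "db_instance_class": "db.r5.xlarge",
        "multi_az": True,
        "cache_node_type": "cache.r5.large",
    },
    "default": {
        "node_count": 2,
        "db_instance_class": "db.t3.medium",
        "multi_az": False,
        "cache_node_type": "cache.t3.micro",
    },
}


def sizing_for(environment: str) -> dict:
    """Return prod sizing for "prod" and the smaller default otherwise."""
    return SIZING.get(environment, SIZING["default"])
```

A table like this becomes easier to maintain than chained conditionals once a third sizing tier is introduced.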

Ephemeral Environments

Ephemeral/Preview Environments:

Use Cases:
├── PR preview environments
├── Feature branch testing
├── Demo environments
├── Load testing environments
└── Incident reproduction

Lifecycle:

PR Created ──► Environment Created ──► Tests Run
                       │                   │
                       ▼                   ▼
                 Preview URL           PR Updated
                 Posted to PR
                       │                   │
                       ▼                   ▼
PR Merged ───────────────────────► Environment Destroyed

Timeout: Auto-destroy after 7 days of inactivity
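
The cleanup side of this lifecycle could be driven by a scheduled job that selects environments to tear down. The sketch below is one possible shape: the record fields (`workspace`, `pr_state`, `last_activity`) are assumptions, and treating closed-but-unmerged PRs like merged ones is an added assumption as well.

```python
from datetime import datetime, timedelta

INACTIVITY_LIMIT = timedelta(days=7)


def environments_to_destroy(environments: list[dict], now: datetime) -> list[str]:
    """Return workspace names whose PR is finished or idle past the limit."""
    expired = []
    for env in environments:
        idle = now - env["last_activity"]
        finished = env["pr_state"] in ("merged", "closed")
        if finished or idle > INACTIVITY_LIMIT:
            expired.append(env["workspace"])
    return expired
```

Passing `now` in as an argument, instead of reading the clock inside the function, keeps the selection logic easy to test.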

Implementation:

    # .github/workflows/preview.yml
    name: Preview Environment

    on:
      pull_request:
        types: [opened, synchronize]

    jobs:
      deploy-preview:
        runs-on: ubuntu-latest
        steps:
          # Cloud credentials and backend configuration are omitted here
          - uses: actions/checkout@v4
          - uses: hashicorp/setup-terraform@v3

          - name: Create/Update Environment
            run: |
              terraform init
              terraform workspace select pr-${{ github.event.pull_request.number }} || \
                terraform workspace new pr-${{ github.event.pull_request.number }}
              terraform apply -auto-approve

          - name: Comment Preview URL
            uses: actions/github-script@v6
            with:
              script: |
                await github.rest.issues.createComment({
                  owner: context.repo.owner,
                  repo: context.repo.repo,
                  issue_number: context.issue.number,
                  body: '🚀 Preview: https://pr-${{ github.event.pull_request.number }}.preview.company.com'
                })

Technology Options

Self-Service Platforms

Platform Comparison:

Terraform Cloud/Enterprise
├── Native Terraform experience
├── Policy as Code (Sentinel)
├── Private module registry
├── Cost estimation
└── Enterprise features (SSO, audit)

Pulumi
├── Real programming languages
├── Strong typing and IDE support
├── Policy as Code (CrossGuard)
└── Automation API

Crossplane
├── Kubernetes-native
├── GitOps workflow
├── Composition for modules
└── Multi-cloud abstraction

Backstage + Terraform
├── Unified developer portal
├── Software templates
├── Plugin ecosystem
└── Service catalog integration

Port/Cortex/OpsLevel
├── Commercial developer portals
├── Quick to implement
├── Built-in integrations
└── Self-service workflows

Selection Criteria:

Factor               │ Best Fit
─────────────────────┼─────────────────────────────
Existing Terraform   │ Terraform Cloud/Enterprise
Kubernetes-first     │ Crossplane
Developer portal     │ Backstage or commercial
Programming language │ Pulumi
Quick start          │ Commercial (Port, OpsLevel)
Maximum control      │ Build custom

Cost Management

Cost Controls

Cost Management in Self-Service:

Cost Visibility
├── Estimated cost shown before provisioning
├── Cost tags automatically applied
├── Per-team/project dashboards
└── Anomaly detection and alerts

Cost Guardrails
├── Instance type restrictions
├── Budget thresholds by team
├── Approval required above threshold
└── Auto-shutdown of unused resources

Cost Optimization
├── Right-sizing recommendations
├── Reserved instance suggestions
├── Spot instance for non-production
└── Scheduled scaling

Cost Estimation Flow:

Request: PostgreSQL database for staging

Cost Estimate:
├── Compute (db.t3.medium):   $30/month
├── Storage (100GB gp3):      $10/month
├── Backup storage:           ~$5/month
└── Data transfer:            ~$5/month
                              ─────────
Estimated Total:              ~$50/month

✓ Within team budget ($500/month quota)
✓ No approval required

[Proceed] [Modify] [Cancel]
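
The arithmetic behind this estimate and the two checks that follow it can be sketched as below. The function name, the argument names, and the $500 default approval threshold are assumptions, and the line-item prices are the illustrative figures from the example rather than real pricing.

```python
def evaluate_estimate(
    line_items: dict[str, float],
    team_budget_remaining: float,
    approval_threshold: float = 500.0,
) -> dict:
    """Sum monthly line items and compare against budget and threshold."""
    total = sum(line_items.values())
    return {
        "total": total,
        "within_budget": total <= team_budget_remaining,
        "needs_approval": total > approval_threshold,
    }


# Figures from the staging PostgreSQL example (USD per month).
staging_db = {
    "compute": 30.0,
    "storage": 10.0,
    "backup": 5.0,
    "data_transfer": 5.0,
}
result = evaluate_estimate(staging_db, team_budget_remaining=500.0)
```

Here `result["total"]` is 50.0, which is within the budget and below the approval threshold, matching the outcome shown in the flow.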

Best Practices

Self-Service Infrastructure Best Practices:

Start Small, Expand Gradually
├── Begin with 2-3 common resources
├── Add based on demand
├── Iterate on feedback
└── Don't try to cover everything day 1

Balance Autonomy and Governance
├── Guardrails not gates
├── Automate approvals where safe
├── Clear escalation paths
└── Trust but verify

Optimize for Developer Experience
├── Minimal required inputs
├── Sensible defaults
├── Clear error messages
└── Fast feedback loops

Maintain Module Quality
├── Automated testing
├── Documentation requirements
├── Versioning strategy
└── Deprecation process

Monitor and Improve
├── Track provisioning success rate
├── Measure time to provision
├── Gather user feedback
└── Identify automation opportunities

Handle Edge Cases
├── What if provisioning fails?
├── How to handle orphaned resources?
├── What about existing resources?
└── How to migrate between versions?
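
For the "Monitor and Improve" items, provisioning success rate and time to provision can be computed from a log of provisioning runs. The sketch below assumes a hypothetical record shape with `status` and `duration_seconds` fields.

```python
from statistics import median


def provisioning_metrics(runs: list[dict]) -> dict:
    """Compute success rate and median duration of successful runs."""
    if not runs:
        return {"success_rate": None, "median_seconds": None}
    succeeded = [run for run in runs if run["status"] == "succeeded"]
    durations = [run["duration_seconds"] for run in succeeded]
    return {
        "success_rate": len(succeeded) / len(runs),
        "median_seconds": median(durations) if durations else None,
    }
```

The median is used instead of the mean so that a single slow run does not dominate the reported time to provision.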

Anti-Patterns

Self-Service Anti-Patterns:

"Self-Service Everything"
❌ Every possible configuration option
✓ Curated set of approved patterns

"Security Theater"
❌ Manual approvals that don't add value
✓ Automated policy enforcement

"Configuration Explosion"
❌ 50 parameters per resource
✓ Sensible defaults with few overrides

"Ignore Cost"
❌ No visibility into provisioned cost
✓ Cost estimation and budgets

"Build vs Buy Wrong"
❌ Building everything from scratch
✓ Use existing tools where appropriate

"No Escape Hatch"
❌ Blocking legitimate exceptions
✓ Process for justified deviations

Related Skills

internal-developer-platform: Platform engineering overview
golden-paths: Standardized workflows
container-orchestration: Kubernetes infrastructure
serverless-patterns: Serverless infrastructure
