Cost Optimization

Purpose

Cloud cost optimization transforms uncontrolled spending into strategic resource allocation through the FinOps lifecycle: Inform, Optimize, and Operate. This skill provides decision frameworks for commitment-based discounts (Reserved Instances, Savings Plans), right-sizing strategies, Kubernetes cost management, and automated cost governance across multi-cloud environments.

When to Use This Skill

Invoke cost-optimization when:

Reducing cloud spend by 15-40% through systematic optimization
Implementing cost visibility dashboards and allocation tracking
Establishing budget alerts and anomaly detection
Optimizing Kubernetes resource requests and cluster efficiency
Managing Reserved Instances, Savings Plans, or Committed Use Discounts
Automating idle resource cleanup and right-sizing recommendations
Setting up showback/chargeback models for internal teams
Preventing cost overruns through CI/CD cost estimation (Infracost)
Responding to finance team requests for cloud cost reduction

FinOps Principles

The FinOps Lifecycle

┌─────────────────────────────────────────────────────┐
│  INFORM → OPTIMIZE → OPERATE (continuous loop)      │
│    ↓         ↓           ↓                          │
│ Visibility  Action   Automation                     │
└─────────────────────────────────────────────────────┘

Inform Phase: Establish cost visibility

Enable cost allocation tags (Owner, Project, Environment)
Deploy real-time cost dashboards for engineering teams
Integrate cloud billing data (AWS CUR, Azure Consumption API, GCP BigQuery)
Set up Kubernetes cost monitoring (Kubecost, OpenCost)

Optimize Phase: Take action on cost drivers

Purchase commitment-based discounts (40-72% savings)
Right-size over-provisioned resources (target 60-80% utilization)
Implement spot/preemptible instances for fault-tolerant workloads
Clean up idle resources (unattached volumes, old snapshots)

Operate Phase: Automate and govern

Budget alerts with cascading notifications (50%, 75%, 90%, 100%)
Automated cleanup scripts for idle resources
CI/CD cost estimation to prevent surprise increases
Continuous monitoring with anomaly detection

Core FinOps Principles

Collaboration: Cross-functional teams (finance, engineering, operations, product)
Accountability: Teams own the cost of their services
Transparency: All costs visible and understandable to stakeholders
Optimization: Continuous improvement of cost efficiency

For detailed FinOps maturity models and organizational structures, see references/finops-foundations.md.

Cost Optimization Strategies

1. Commitment-Based Discounts

Reserved Instances (RIs): 40-72% discount for 1- or 3-year commitments

Standard RI: Instance type locked, highest discount (up to 72% for 3-year)
Convertible RI: Flexible instance types, moderate discount (up to 66% for 3-year)
Use for: Databases (RDS, ElastiCache), stable production EC2 workloads

Savings Plans: Flexible compute commitments

Compute Savings Plans: Applies to EC2, Fargate, Lambda (up to 66% discount for 3-year)
EC2 Instance Savings Plans: Tied to instance family and region (up to 72% discount for 3-year)
Use for: Workloads that change instance types or regions

GCP Committed Use Discounts (CUDs): 25-70% discount

Resource-based CUDs: Commit to vCPU, memory, GPUs
Spend-based CUDs: Commit to dollar amount (flexible)
Sustained Use Discounts: Automatic discount of up to 20-30%, depending on machine series, for sustained usage (no commitment)

Decision Framework:

Reserve when:
├─ Workload is production-critical (24/7 uptime required)
├─ Usage is predictable (stable baseline over 6+ months)
├─ Architecture is stable (unlikely to change instance types)
└─ Financial commitment acceptable (1-3 year lock-in)

Use On-Demand when:
├─ Development/testing environments
├─ Unpredictable spiky workloads
├─ Short-term projects (<6 months)
└─ Evaluating new instance types

For detailed commitment strategies and RI coverage analysis, see references/commitment-strategies.md.

2. Spot and Preemptible Instances

Discount: 70-90% off on-demand pricing (interruptible on short notice: 2 minutes on AWS, roughly 30 seconds on Azure and GCP)

Use Spot For: CI/CD workers, batch jobs, ML training (with checkpointing), Kubernetes workers, data analytics
Avoid Spot For: Stateful databases, real-time services, long-running jobs without checkpointing

Best Practices:

Diversify instance types and spread across Availability Zones
Implement graceful shutdown handlers
Auto-fallback to on-demand when capacity unavailable
Kubernetes: Mix 70% spot + 30% on-demand nodes with taints/tolerations
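
A 70/30 spot/on-demand mix can be turned into a rough blended-cost estimate. The sketch below is illustrative only: the function name `blended_hourly_cost`, the $0.10/hour on-demand price, and the 70% spot discount are assumptions chosen for the example, not figures from any provider's price list.

```python
def blended_hourly_cost(node_count, on_demand_price, spot_discount, spot_fraction=0.7):
    """Estimate the hourly cost of a mixed spot/on-demand node pool.

    spot_discount is the fraction taken off the on-demand price (0.7 = 70% off).
    spot_fraction is the share of nodes running on spot capacity.
    """
    if not 0 <= spot_fraction <= 1 or not 0 <= spot_discount <= 1:
        raise ValueError("fractions must be between 0 and 1")
    spot_nodes = node_count * spot_fraction
    on_demand_nodes = node_count - spot_nodes
    spot_price = on_demand_price * (1 - spot_discount)
    return spot_nodes * spot_price + on_demand_nodes * on_demand_price


# Hypothetical pool: 10 nodes at $0.10/hour on-demand, spot at 70% off.
all_on_demand = blended_hourly_cost(10, 0.10, 0.7, spot_fraction=0.0)
mixed = blended_hourly_cost(10, 0.10, 0.7)
savings = 1 - mixed / all_on_demand  # fraction saved versus all on-demand
```

Under these assumed prices the mixed pool costs about half as much as an all-on-demand pool. Real savings depend on current spot prices and on how often interrupted capacity falls back to on-demand.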

3. Right-Sizing Strategies

Target Utilization: 60-80% average (leave headroom for spikes)

Compute Right-Sizing:

Analyze actual CPU/memory utilization over 30+ days
Downsize instances with <40% average utilization
Consolidate underutilized workloads
Switch instance families (compute-optimized vs. memory-optimized)
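
The compute thresholds above can be applied mechanically to utilization samples. This is a minimal sketch: the 40% downsize threshold and the 60-80% target band come from this section, while the function name, the "monitor" band between 40% and 60%, and the "upsize-or-scale-out" label above 80% are assumptions added for completeness.

```python
from statistics import mean


def rightsizing_action(cpu_samples, downsize_below=40.0, target_low=60.0, target_high=80.0):
    """Classify an instance from CPU utilization samples (percent, 0-100)."""
    if not cpu_samples:
        raise ValueError("need at least one utilization sample")
    avg = mean(cpu_samples)
    if avg < downsize_below:
        return "downsize"
    if avg > target_high:
        return "upsize-or-scale-out"  # assumed label: no headroom left for spikes
    if avg >= target_low:
        return "on-target"
    return "monitor"  # assumed band: between the downsize threshold and the target range
```

In practice the samples should cover 30+ days, and memory should be checked the same way before any resize.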

Database Right-Sizing:

Analyze connection pool usage (max connections vs. allocated)
Downgrade storage IOPS if utilization <50%
Evaluate read replica necessity (can caching replace it?)
Consider serverless options (Aurora Serverless, Azure SQL Serverless)

Kubernetes Right-Sizing:

Set requests = average usage (not peak)
Set limits = 2-3x requests (allow bursting)
Use Vertical Pod Autoscaler (VPA) for automated recommendations
Identify pods with 0% CPU usage (candidates for consolidation)

Storage Right-Sizing:

Delete unattached volumes (EBS, Azure Disks, GCP Persistent Disks)
Delete old snapshots (older than 90 days and not required by a retention policy)
Implement lifecycle policies (S3 Intelligent-Tiering, Azure Blob Lifecycle)
Compress/deduplicate data
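
The snapshot rule can be expressed as a simple filter over an inventory. The sketch below assumes snapshots have already been listed into plain dictionaries; the keys `id`, `created`, and `retain` are hypothetical and do not correspond to any specific cloud SDK response.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_delete(snapshots, now=None, max_age_days=90):
    """Return IDs of snapshots older than max_age_days that are not flagged for retention.

    Each snapshot is a dict with hypothetical keys: "id", "created"
    (timezone-aware datetime), and an optional "retain" flag.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [
        s["id"]
        for s in snapshots
        if s["created"] < cutoff and not s.get("retain", False)
    ]


# Hypothetical inventory evaluated against a fixed date.
inventory = [
    {"id": "snap-a", "created": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    {"id": "snap-b", "created": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": "snap-c", "created": datetime(2023, 1, 1, tzinfo=timezone.utc), "retain": True},
]
stale = snapshots_to_delete(inventory, now=datetime(2024, 6, 1, tzinfo=timezone.utc))
```

Returning IDs rather than deleting directly keeps the destructive step separate, so the list can be reviewed or dry-run first.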

Right-Sizing Tools:

AWS Compute Optimizer: ML-based EC2, Lambda, EBS recommendations
Azure Advisor: VM rightsizing, reserved instance advice
GCP Recommender: VM, disk, commitment recommendations
VPA (Vertical Pod Autoscaler): Automated container resource requests

4. Kubernetes Cost Management

Resource Requests and Limits:

# Set requests = average usage (enables efficient bin-packing)
resources:
  requests:
    cpu: 500m        # 0.5 CPU cores (average usage)
    memory: 1Gi      # 1 GiB memory (average usage)
  limits:
    cpu: 1500m       # 1.5 CPU cores (3x requests, allows bursting)
    memory: 3Gi      # 3 GiB memory (3x requests)
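
A resources block like the one above can be derived from observed usage. The following sketch assumes usage samples are already available as CPU millicores and memory MiB; the function name `suggest_resources` and the default 3x multiplier are illustrative, and VPA recommendations would normally replace this kind of manual calculation.

```python
from statistics import mean


def suggest_resources(cpu_millicores, memory_mib, limit_multiplier=3):
    """Build a Kubernetes-style resources dict.

    Requests are set to average observed usage; limits are
    limit_multiplier times the requests to allow bursting.
    """
    cpu_request = round(mean(cpu_millicores))
    mem_request = round(mean(memory_mib))
    return {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{mem_request}Mi"},
        "limits": {
            "cpu": f"{cpu_request * limit_multiplier}m",
            "memory": f"{mem_request * limit_multiplier}Mi",
        },
    }


# Hypothetical samples averaging 500m CPU and 1024 MiB memory.
suggested = suggest_resources([400, 500, 600], [1000, 1024, 1048])
```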

Namespace Quotas: Prevent runaway resource consumption

ResourceQuota: Limit total CPU/memory per namespace
LimitRange: Default/max requests per pod
PriorityClass: Ensure critical pods get resources

Cluster Autoscaling:

Scale down idle nodes to reduce costs
Scale-to-zero for dev clusters during off-hours
Use multiple node pools (spot + on-demand mix)
Set max node limits to prevent overspend

Cost Visibility:

Deploy Kubecost or OpenCost for namespace-level cost tracking
Allocate costs by labels (team, project, environment)
Track idle cost (cluster capacity not allocated to workloads)
Generate showback/chargeback reports
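
To illustrate what idle cost means, the sketch below splits a cluster bill by CPU requests and reports the unallocated remainder. It is a deliberate simplification: it considers CPU only, and the function name and `__idle__` key are assumptions for this example. Kubecost and OpenCost use more detailed allocation models.

```python
def allocate_cluster_cost(cluster_cost, cluster_cpu_cores, requested_cores_by_namespace):
    """Split a cluster's cost by CPU requests and report unallocated (idle) cost."""
    if cluster_cpu_cores <= 0:
        raise ValueError("cluster_cpu_cores must be positive")
    cost_per_core = cluster_cost / cluster_cpu_cores
    allocation = {
        namespace: cores * cost_per_core
        for namespace, cores in requested_cores_by_namespace.items()
    }
    # Whatever capacity no namespace requested is paid for but unused.
    allocation["__idle__"] = cluster_cost - sum(allocation.values())
    return allocation


# Hypothetical cluster: $1,000/month, 100 cores, 70 cores requested.
showback = allocate_cluster_cost(1000, 100, {"team-a": 40, "team-b": 30})
```

Here 30% of the bill is idle cost, which is the figure to watch when tuning requests and autoscaling.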

For detailed Kubernetes cost optimization patterns, see references/kubernetes-cost-optimization.md.

Cost Visibility and Monitoring

Tagging for Cost Allocation

Required Tags:

Owner or Team - Responsible team/department
Project or Application - Business unit or application name
Environment - prod, staging, dev, test
CostCenter - Finance cost center code
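
The required tags can be checked before a resource is deployed. This is a minimal sketch of such a check, assuming tags arrive as a plain dictionary; the function name and message wording are illustrative, and in production the same rule would usually live in a policy engine such as Azure Policy or AWS Config.

```python
# Each group lists interchangeable tag keys; at least one must be present.
REQUIRED_TAG_GROUPS = [
    ("Owner", "Team"),
    ("Project", "Application"),
    ("Environment",),
    ("CostCenter",),
]
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "test"}


def tag_violations(tags):
    """Return a list of human-readable problems with a resource's tags."""
    problems = []
    for group in REQUIRED_TAG_GROUPS:
        if not any(tags.get(key) for key in group):
            problems.append("missing " + " or ".join(group))
    environment = tags.get("Environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {environment}")
    return problems
```

An empty list means the resource passes; anything else can be surfaced in CI or used to block the deployment.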

Enable Cost Allocation Tags:

AWS: Activate tags in Cost Allocation Tags console
Azure: Apply tags via Azure Policy enforcement
GCP: Use labels on all resources, export to BigQuery

For comprehensive tagging strategies, see references/tagging-for-cost-allocation.md.

Monitoring and Dashboards

Native Cloud Tools:

AWS Cost Explorer: Analyze spending patterns, forecast costs
Azure Cost Management + Billing: Budget tracking, cost analysis
GCP Cloud Billing: BigQuery export for custom analysis

Third-Party Platforms:

Kubecost: Kubernetes cost visibility and optimization
CloudZero: Unit cost economics, anomaly detection
CloudHealth: Multi-cloud cost management
Infracost: Terraform cost estimation in CI/CD

Key Metrics to Track:

Total monthly cloud spend (trend over time)
Cost per service/team/project (allocation accuracy)
Unit cost metrics (cost per customer, cost per transaction)
Reserved Instance/Savings Plan utilization (target >95%)
Idle resource waste (target <5% of total spend)
Budget variance (forecasted vs. actual)
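
Several of these metrics are simple ratios. The sketch below shows one way to compute three of them; the function names and the sample numbers are assumptions for illustration, and the sign convention for variance (positive means over forecast) is a choice, not a standard.

```python
def commitment_utilization(used_hours, purchased_hours):
    """Fraction of purchased RI/Savings Plan hours actually consumed."""
    return used_hours / purchased_hours


def unit_cost(total_cost, units):
    """Cost per business unit, e.g. per customer or per transaction."""
    return total_cost / units


def budget_variance(forecast, actual):
    """Relative variance; positive means actual spend exceeded the forecast."""
    return (actual - forecast) / forecast


# Hypothetical month: 684 of 720 committed hours used,
# $12,000 spend across 4,000 customers, $11,500 actual vs $10,000 forecast.
utilization = commitment_utilization(684, 720)
cost_per_customer = unit_cost(12000, 4000)
variance = budget_variance(10000, 11500)
```

With these numbers utilization sits right at the 95% target, each customer costs $3, and spend is 15% over forecast.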

Budget Alerts and Anomaly Detection

Cascading Budget Alerts:

50% of budget  → Email to team lead (informational)
75% of budget  → Email + Slack to team (warning)
90% of budget  → Email + Slack + PagerDuty (urgent)
100% of budget → Automated shutdown (non-prod only) or escalation
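
The cascade above maps naturally onto a threshold lookup. The following sketch assumes spend and budget are known for the current period; the level names and function name are hypothetical, and the notification delivery itself (email, Slack, PagerDuty) is left out.

```python
# Thresholds mirror the cascade above; checked from highest to lowest.
ALERT_CASCADE = [
    (1.00, "shutdown-or-escalate"),
    (0.90, "urgent"),
    (0.75, "warning"),
    (0.50, "informational"),
]


def budget_alert(spend, budget):
    """Return the highest alert level reached, or None below 50% of budget."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    ratio = spend / budget
    for threshold, level in ALERT_CASCADE:
        if ratio >= threshold:
            return level
    return None
```

Checking from the highest threshold down means a single call returns only the most severe level, which avoids sending all four notifications at once when spend jumps.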

Anomaly Detection: Alert on unexpected cost spikes

20% cost increase week-over-week
$500 unexpected daily cost spike
New resource types (unusual spend patterns)
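
The first two rules can be sketched as checks over two weeks of daily totals. This example makes assumptions worth stating: the $500 rule is interpreted as a day-over-day jump, the input is exactly 14 daily totals ordered oldest first, and the new-resource-type rule is not covered. Managed anomaly detection services use statistical models rather than fixed thresholds.

```python
def cost_anomalies(daily_costs, wow_threshold=0.20, daily_spike_usd=500.0):
    """Flag a week-over-week increase and day-over-day spikes.

    daily_costs: 14 daily totals in USD, oldest first.
    """
    if len(daily_costs) != 14:
        raise ValueError("expected exactly 14 daily totals")
    findings = []
    previous_week = sum(daily_costs[:7])
    current_week = sum(daily_costs[7:])
    if previous_week > 0:
        change = (current_week - previous_week) / previous_week
        if change >= wow_threshold:
            findings.append("week-over-week increase")
    for index in range(1, len(daily_costs)):
        if daily_costs[index] - daily_costs[index - 1] >= daily_spike_usd:
            findings.append(f"daily spike at index {index}")
    return findings
```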

Budget Granularity:

Organization-level (total cloud spend)
Department-level (engineering, data, marketing)
Project-level (per application/service)
Environment-level (prod vs. dev/staging)

Decision Frameworks

Framework 1: Commitment Discount Decision Tree

Should we purchase Reserved Instances / Savings Plans?

STEP 1: Analyze Historical Usage (6-12 months)
├─ Identify steady-state baseline (minimum usage)
├─ Exclude spiky/seasonal workloads
└─ Calculate: (baseline usage) / (total usage) = commitment %

STEP 2: Choose Commitment Type
├─ RESERVED INSTANCES
│   ├─ Pros: Highest discount (up to 72%)
│   ├─ Cons: Instance type locked (unless convertible)
│   └─ Use for: Databases, stable production workloads
│
├─ SAVINGS PLANS
│   ├─ Pros: Flexible (across instance types, regions)
│   ├─ Cons: Compute Savings Plans carry a lower maximum discount than Standard RIs
│   └─ Use for: Compute workloads, Lambda, Fargate
│
└─ COMMITTED USE DISCOUNTS (GCP)
    ├─ Resource-based: vCPU/memory commitments
    └─ Spend-based: Dollar amount commitments

STEP 3: Determine Commitment Period
├─ 1-year commitment
│   ├─ Lower discount (40-50%)
│   └─ Less risk if architecture changes
│
└─ 3-year commitment
    ├─ Higher discount (60-72%)
    └─ Only for mature, stable workloads

STEP 4: Monitor and Optimize
├─ Target >95% RI/Savings Plan utilization
├─ Sell unused Standard RIs on AWS Reserved Instance Marketplace (Convertible RIs cannot be listed)
└─ Adjust commitments quarterly based on usage trends
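
The Step 1 calculation can be sketched as follows. It assumes usage history is available as a list of hourly instance counts (or vCPUs) and takes the minimum as the steady-state baseline; the function name is hypothetical, and a percentile could be substituted for the minimum when a few outlier hours would otherwise drag the baseline down.

```python
def commitment_baseline(hourly_usage):
    """Return (baseline, commitment fraction) from hourly usage samples.

    baseline is the minimum observed usage; the fraction is
    baseline usage over the period divided by total usage.
    """
    if not hourly_usage:
        raise ValueError("need at least one usage sample")
    total = sum(hourly_usage)
    if total == 0:
        return 0, 0.0
    baseline = min(hourly_usage)
    covered = baseline * len(hourly_usage)
    return baseline, covered / total


# Hypothetical six-hour window of instance counts.
baseline, fraction = commitment_baseline([10, 12, 20, 10, 18, 10])
```

In this example a commitment sized to 10 instances would cover 75% of usage, with the remaining 25% left on on-demand or spot.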

Framework 2: Right-Sizing Priority Matrix

Cost Impact vs. Effort:

High Impact, Low Effort (DO FIRST):

Idle resources (100% waste): Stopped instances, unattached volumes, old snapshots
Unused NAT Gateways ($32/month each)
Over-provisioned databases (<20% CPU for 30 days)
Kubernetes pods with no resource requests set

High Impact, Medium Effort (DO SECOND):

Over-provisioned compute (<40% CPU/memory for 30 days)
Lambda functions with configured memory >2x peak used memory
Storage optimization (S3 Intelligent-Tiering, gp3 vs. gp2)

Low Impact, High Effort (DO LAST):

Application code optimization (requires profiling, refactoring)
Architecture redesign (serverless migration, multi-region optimization)

Weekly Optimization Routine:

Delete idle resources (automated script)
Review top 10 cost drivers (manual analysis)
Right-size 3-5 instances/week (incremental approach)
Monitor impact (cost trend over 4 weeks)

Framework 3: Spot vs. On-Demand Decision

Should this workload use Spot/Preemptible instances?

├─ Is the workload fault-tolerant?
│   ├─ NO → Use On-Demand
│   └─ YES → Continue
│
├─ Is the workload stateless (or has checkpointing)?
│   ├─ NO → Use On-Demand (data loss risk)
│   └─ YES → Continue
│
├─ Can the workload handle interruptions gracefully?
│   ├─ NO → Use On-Demand
│   └─ YES → Continue
│
└─ Workload Type Assessment:
    ├─ Batch Jobs / CI/CD → ✅ Use Spot (70-90% savings)
    ├─ ML Training → ✅ Use Spot (with checkpointing)
    ├─ Kubernetes Workers → ✅ Use Spot (mixed with on-demand)
    ├─ Production API Servers → ⚠️ Mixed fleet (70% spot, 30% on-demand)
    ├─ Databases → ❌ Use On-Demand (or Reserved)
    └─ Real-time Services → ❌ Use On-Demand (or Reserved)
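
The decision tree above can be encoded as a small function. This is a sketch only: the workload type keys and return labels are hypothetical names chosen for the example, and defaulting unknown workload types to on-demand is a conservative assumption rather than part of the framework.

```python
# Outcomes for workloads that pass all three gating questions.
RECOMMENDATION_BY_TYPE = {
    "batch": "spot",
    "ci-cd": "spot",
    "ml-training": "spot-with-checkpointing",
    "kubernetes-workers": "spot-mixed-with-on-demand",
    "production-api": "mixed-fleet",
    "database": "on-demand",
    "real-time": "on-demand",
}


def spot_recommendation(fault_tolerant, stateless_or_checkpointed,
                        handles_interruptions, workload_type):
    """Walk the three gating questions, then look up the workload type."""
    if not fault_tolerant:
        return "on-demand"
    if not stateless_or_checkpointed:
        return "on-demand"  # data loss risk
    if not handles_interruptions:
        return "on-demand"
    # Unknown workload types fall back to on-demand (assumed default).
    return RECOMMENDATION_BY_TYPE.get(workload_type, "on-demand")
```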

Tool Selection Guide

By Platform

Platform	Cost Visibility	Right-Sizing	Automation
AWS	Cost Explorer, CUR	Compute Optimizer	AWS Budgets, Lambda cleanup
Azure	Cost Management	Azure Advisor	Azure Policy, Automation
GCP	Cloud Billing	Recommender	Budget Alerts, Cloud Functions
Kubernetes	Kubecost, OpenCost	VPA	Cluster Autoscaler
Multi-Cloud	CloudZero, CloudHealth	Densify	ParkMyCloud

By Use Case

Use Case	Recommended Tool	Key Feature
K8s cost visibility	Kubecost	Real-time namespace cost allocation
Terraform cost estimation	Infracost	PR comments with cost diffs
Multi-cloud aggregation	CloudHealth	Unified cost view across AWS/Azure/GCP
Automated optimization	nOps (AWS), CAST AI (K8s)	ML-based automation
Unit cost economics	CloudZero	Cost per customer/transaction tracking
Spot instance management	Spot.io	Automated spot orchestration

For detailed tool comparisons and selection criteria, see references/tools-comparison.md.

Cloud-Specific Tactics

AWS Optimization Tactics

Enable Cost & Usage Reports (CUR): Export detailed billing to S3
Use AWS Compute Optimizer: ML-based EC2 rightsizing recommendations
Implement Savings Plans: More flexible than Reserved Instances
S3 Intelligent-Tiering: Automatic storage class optimization
Lambda Right-Sizing: Adjust memory allocation (CPU scales proportionally)
EBS gp3 Migration: Up to 20% cheaper per GB than gp2, with baseline IOPS and throughput that do not depend on volume size

Azure Optimization Tactics

Enable Azure Advisor: VM rightsizing and reserved instance recommendations
Azure Hybrid Benefit: Bring Windows Server licenses for discounts
Dev/Test Pricing: Reduced rates for non-production workloads
Azure Spot VMs: Up to 90% discount for interruptible workloads
Storage Lifecycle Management: Auto-tier blobs to cool/archive tiers

GCP Optimization Tactics

Export Billing to BigQuery: Custom cost analysis with SQL
Sustained Use Discounts: Automatic discount of up to 20-30%, depending on machine series (no commitment)
Committed Use Discounts: 52-70% savings for 3-year commitments
Preemptible VMs: Up to 91% discount for batch workloads
GCP Recommender: Idle VM detection and rightsizing advice

For cloud-specific deep dives, see references/cloud-specific-tactics.md.

Implementation Checklist

Phase 1: Establish Visibility (Week 1-2)

Enable cost allocation tags (Owner, Project, Environment)
Activate cost allocation tags in cloud billing console
Deploy Kubecost for Kubernetes cost visibility (if using K8s)
Create cost dashboards (Grafana, CloudWatch, Azure Monitor, GCP)
Set up weekly cost reports (emailed to team leads)

Phase 2: Set Up Governance (Week 2-3)

Create budget alerts (50%, 75%, 90%, 100% thresholds)
Enable anomaly detection (>20% WoW increase)
Implement tagging policy enforcement (Azure Policy, AWS Config, GCP Org Policy)
Establish showback reports (cost by team/project)
Document cost ownership (who owns which services)

Phase 3: Quick Wins (Week 3-4)

Delete idle resources (unattached volumes, old snapshots)
Stop/terminate unused development instances
Right-size top 10 over-provisioned instances (<40% utilization)
Implement S3 Intelligent-Tiering or lifecycle policies
Evaluate Reserved Instance/Savings Plan coverage

Phase 4: Commitment Discounts (Month 2)

Analyze 6-12 months usage history
Calculate baseline usage for commitment sizing
Purchase Reserved Instances for databases
Purchase Savings Plans for compute workloads
Monitor RI/SP utilization (target >95%)

Phase 5: Automation (Month 2-3)

Deploy automated cleanup scripts (weekly schedule)
Integrate Infracost into CI/CD pipelines
Implement auto-shutdown for dev/test environments (off-hours)
Enable Vertical Pod Autoscaler (VPA) for K8s rightsizing
Set up Spot instance automation (Spot.io, CAST AI, or native)
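
The payoff from off-hours auto-shutdown is easy to estimate up front. The sketch below computes the fraction of weekly compute hours avoided; the function name and the 12-hours-by-5-days schedule are assumptions for illustration. It covers compute hours only, since attached storage continues to bill while instances are stopped.

```python
HOURS_PER_WEEK = 24 * 7


def off_hours_savings(running_hours_per_day, running_days_per_week):
    """Fraction of weekly compute hours avoided by shutting down off-hours."""
    if not 0 <= running_hours_per_day <= 24:
        raise ValueError("running_hours_per_day must be between 0 and 24")
    if not 0 <= running_days_per_week <= 7:
        raise ValueError("running_days_per_week must be between 0 and 7")
    running_hours = running_hours_per_day * running_days_per_week
    return 1 - running_hours / HOURS_PER_WEEK


# Hypothetical schedule: dev environment up 12 hours a day, weekdays only.
dev_savings = off_hours_savings(12, 5)
```

A weekday 12-hour schedule runs 60 of 168 hours, so roughly 64% of compute hours are avoided.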

Phase 6: Continuous Optimization (Ongoing)

Weekly cost reviews with engineering teams
Monthly optimization sprints (top cost drivers)
Quarterly commitment adjustments (RI/SP coverage)
Annual FinOps maturity assessment

Common Pitfalls

Pitfall 1: No Cost Visibility

❌ Problem: Finance team sees cloud bill at end of month, surprises everywhere
✅ Solution: Deploy real-time cost dashboards, daily Slack reports to engineering teams

Pitfall 2: Reserved Instance Underutilization

❌ Problem: Purchased 100 RIs, only using 60 (40% wasted commitment)
✅ Solution: Monitor RI utilization weekly (target >95%), sell unused RIs on marketplace

Pitfall 3: Missing Kubernetes Resource Requests

❌ Problem: Pods with no requests set → inefficient bin-packing → wasted nodes
✅ Solution: Use VPA to auto-generate recommendations, enforce via admission control

Pitfall 4: Idle Resources Not Cleaned Up

❌ Problem: 50 stopped EC2 instances (still paying for EBS), 200 unattached volumes
✅ Solution: Weekly automated cleanup of idle resources >7 days old

Pitfall 5: No Budget Alerts

❌ Problem: Accidentally left test cluster running, $10K bill surprise
✅ Solution: Budget alerts at 50%, 75%, 90%, 100% with Slack/PagerDuty notifications

Related Skills

resource-tagging: Cost allocation tags enable showback/chargeback models
kubernetes-operations: K8s rightsizing, VPA, cluster autoscaling for cost optimization
infrastructure-as-code: Infracost for Terraform cost estimation and policy-as-code
aws-patterns: AWS-specific cost optimization tactics (EC2, RDS, S3, Lambda)
gcp-patterns: GCP-specific optimizations (Compute Engine, BigQuery, Cloud Storage)
azure-patterns: Azure-specific optimizations (VMs, Storage, App Service, Functions)
platform-engineering: Internal FinOps platforms and self-service cost dashboards
disaster-recovery: Balance cost vs. RTO/RPO (warm standby vs. cold standby)

Examples

See examples/ directory for:

terraform/: AWS, Azure, GCP cost optimization infrastructure (budgets, alerts)
kubernetes/: Kubecost deployment, resource quotas, VPA configurations
ci-cd/: Infracost GitHub Actions, cost approval workflows
dashboards/: Grafana cost dashboards, CloudWatch alarms

Scripts

See scripts/ directory for:

cleanup_idle_resources.py: Automated AWS/Azure/GCP idle resource cleanup
ri_coverage_report.py: Reserved Instance coverage analysis
cost_allocation_report.py: Generate showback/chargeback reports
spot_savings_calculator.py: Estimate savings from spot instances
k8s_rightsizing_audit.py: Find K8s pods with missing resource requests

Key Takeaways

FinOps is a Culture: Collaboration between finance, engineering, and operations
Visibility First: You can't optimize what you can't measure (tags + dashboards mandatory)
Commitment = Savings: Reserved Instances/Savings Plans provide 40-72% discounts
Right-Size Continuously: Target 60-80% utilization (leave headroom for spikes)
Automate Cleanup: Idle resources are 100% waste (weekly automated deletion)
Kubernetes Costs Are Hidden: Use Kubecost/OpenCost for namespace-level visibility
Shift-Left Cost Awareness: Infracost in CI/CD prevents surprise cost increases
Budget Alerts Prevent Overspend: Cascading notifications at 50%, 75%, 90%, 100%
Spot for Fault-Tolerant Workloads: 70-90% discount (CI/CD, batch jobs, ML training)
Unit Cost Metrics Drive Value: Track cost per customer, cost per transaction
