Deco Site Scaling Tuning

Analyze a site's Prometheus metrics to discover the optimal autoscaling parameters. This skill helps you find the CPU/concurrency threshold where latency degrades and recommends scaling configuration accordingly.

When to Use This Skill

A site is overscaled (too many pods for its traffic)
A site oscillates between scaling up and down (panic mode loop)
Need to switch scaling metric (concurrency vs CPU vs RPS)
Need to find the right target value for a site
After deploying scaling changes, to verify they're working

Prerequisites

kubectl access to the target cluster
Prometheus accessible via port-forward (from kube-prometheus-stack in monitoring namespace)
Python 3 for analysis scripts
At least 6 hours of metric history for meaningful analysis
For direct latency data: queue-proxy PodMonitor must be applied (see Step 0)

Quick Start

ENABLE METRICS → Apply queue-proxy PodMonitor if not already done
PORT-FORWARD → kubectl port-forward prometheus-pod 19090:9090
COLLECT DATA → Run analysis scripts against Prometheus
ANALYZE → Find CPU threshold where latency degrades
RECOMMEND → Choose scaling metric and target
APPLY → Use deco-site-deployment skill to apply changes
VERIFY → Monitor for 1-2 hours after change

Files in This Skill

| File | Purpose |
|------|---------|
| SKILL.md | Overview, methodology, analysis procedures |
| analysis-scripts.md | Ready-to-use Python scripts for Prometheus queries |

Step 0: Enable Queue-Proxy Metrics (one-time)

Queue-proxy runs as a sidecar on every Knative pod and exposes request latency histograms. These are critical for precise tuning but are not scraped by default.

Save this PodMonitor as queue-proxy-podmonitor.yaml and apply it:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: knative-queue-proxy
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    any: true
  selector:
    matchExpressions:
      - key: serving.knative.dev/revision
        operator: Exists
  podMetricsEndpoints:
    - port: http-usermetric
      path: /metrics
      interval: 15s

kubectl apply -f queue-proxy-podmonitor.yaml

Wait 2-3 hours for data to accumulate before running latency analysis.

Metrics unlocked by this PodMonitor:

revision_app_request_latencies_bucket — request latency histogram (p50/p95/p99)
revision_app_request_latencies_sum / _count — for avg latency
revision_app_request_count — request rate by response code
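As a rough illustration of how these metrics can be turned into latency queries, the sketch below builds PromQL strings for a latency quantile and for average latency. The function names are hypothetical, and it assumes the scraped series carry a `namespace` label and that a 5-minute rate window suits the site's traffic; adjust both to what your Prometheus actually exposes.

```python
def latency_quantile_query(namespace, quantile=0.95, window="5m"):
    """Build a PromQL expression for a request-latency quantile in one namespace.

    Assumption: the series carry a `namespace` label.
    """
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be between 0 and 1 (exclusive)")
    selector = f'revision_app_request_latencies_bucket{{namespace="{namespace}"}}'
    return f"histogram_quantile({quantile}, sum by (le) (rate({selector}[{window}])))"


def avg_latency_query(namespace, window="5m"):
    """Average latency: rate of the _sum series divided by rate of the _count series."""
    sel = f'{{namespace="{namespace}"}}'
    return (
        f"sum(rate(revision_app_request_latencies_sum{sel}[{window}]))"
        f" / sum(rate(revision_app_request_latencies_count{sel}[{window}]))"
    )


# "sites-example" is a placeholder namespace.
print(latency_quantile_query("sites-example"))
print(avg_latency_query("sites-example"))
```

The resulting strings can be pasted into the Prometheus UI or sent to the HTTP API once the port-forward from Step 1 is running.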

Step 1: Establish Prometheus Connection

PROM_POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n monitoring "$PROM_POD" 19090:9090 &

Verify the connection:

curl -s "http://127.0.0.1:19090/api/v1/query?query=up" | jq '.status'
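With the port-forward in place, analysis scripts typically pull time series through the Prometheus `query_range` endpoint. The following is a minimal sketch of that pattern using only the standard library; it is not the content of analysis-scripts.md. The helper names are hypothetical, and the 60s step and 30s timeout are arbitrary choices.

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://127.0.0.1:19090"  # local port-forward address from Step 1


def range_query_url(query, start, end, step="60s", base=PROM_URL):
    """Build a /api/v1/query_range URL; start and end are Unix timestamps."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base}/api/v1/query_range?{params}"


def parse_matrix(payload):
    """Convert a query_range response into {sorted label tuple: [(ts, value), ...]}."""
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    series = {}
    for item in payload["data"]["result"]:
        key = tuple(sorted(item["metric"].items()))
        # Prometheus returns sample values as strings; convert them to floats.
        series[key] = [(float(ts), float(value)) for ts, value in item["values"]]
    return series


def fetch_range(query, start, end, step="60s"):
    """Run a range query against the port-forwarded Prometheus."""
    with urllib.request.urlopen(range_query_url(query, start, end, step), timeout=30) as resp:
        return parse_matrix(json.load(resp))
```

For example, `fetch_range("up", time.time() - 6 * 3600, time.time())` would fetch the last six hours of the `up` series, provided `time` is imported and the port-forward is active.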

Step 2: Collect Current State

Before analyzing, understand what the site is currently configured for.

2a. Read current autoscaler config

SITENAME="<sitename>"
NS="sites-${SITENAME}"

Current revision annotations (active revisions only):

kubectl get rev -n "$NS" -o json |
  jq '.items[] | select(any(.status.conditions[]?; .type == "Active" and .status == "True")) | {name: .metadata.name, annotations: (.metadata.annotations | with_entries(select(.key | startswith("autoscaling"))))}'

Global autoscaler defaults:

kubectl get cm config-autoscaler -n knative-serving -o json | jq '.data | del(._example)'

2b. Current pod count and resources

kubectl get pods -n "$NS" --no-headers | wc -l
kubectl top pods -n "$NS" --no-headers | head -20

Step 3: Run Analysis

Use the scripts in analysis-scripts.md. The analysis follows this methodology:

Methodology: Finding the Optimal CPU Target

Goal: Find the CPU level at which latency starts to degrade. This is your scaling target — keep pods below this CPU to maintain good latency.

Approach:

Collect CPU per pod, concurrency per pod, pod count, and (if available) request latency over 6-12 hours

Bucket data by CPU range (0-200m, 200-300m, ..., 700m+)

For each bucket, compute avg/p95 concurrency per pod

Compute the "latency inflation factor" — how much concurrency increases beyond what the pod count reduction explains:

excess = (avg_conc_above_threshold / avg_conc_below_threshold) / (avg_pods_below / avg_pods_above)

excess = 1.0 → concurrency increase fully explained by fewer pods (no latency degradation)
excess > 1.0 → latency is inflating concurrency (pods are slowing down)
The CPU level where excess crosses ~1.5x is your inflection point

If queue-proxy latency is available, directly plot avg latency vs CPU — the hockey stick inflection is your target
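The bucketing and excess-factor steps above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the script shipped in analysis-scripts.md. Each sample is assumed to be a `(cpu_millicores, conc_per_pod, pod_count)` tuple taken at one timestamp, and the function names and sample values are made up for the example.

```python
from bisect import bisect_right
from statistics import mean

EDGES = [200, 300, 400, 500, 600, 700]  # bucket boundaries in millicores
LABELS = ["0-200m"] + [f"{lo}-{hi}m" for lo, hi in zip(EDGES, EDGES[1:])] + ["700m+"]


def bucket_of(cpu_m):
    """Return the CPU bucket label for one sample (lower bound inclusive)."""
    return LABELS[bisect_right(EDGES, cpu_m)]


def avg_conc_by_bucket(samples):
    """Average concurrency per pod for each CPU bucket that has data."""
    grouped = {}
    for cpu_m, conc, _pods in samples:
        grouped.setdefault(bucket_of(cpu_m), []).append(conc)
    return {label: mean(values) for label, values in grouped.items()}


def excess_factor(samples, threshold_m):
    """Concurrency growth above the threshold, divided by what fewer pods alone explain."""
    below = [s for s in samples if s[0] < threshold_m]
    above = [s for s in samples if s[0] >= threshold_m]
    if not below or not above:
        return None  # no data on one side of the threshold
    conc_below = mean(s[1] for s in below)
    pods_above = mean(s[2] for s in above)
    if conc_below == 0 or pods_above == 0:
        return None  # ratio is undefined
    conc_ratio = mean(s[1] for s in above) / conc_below
    pod_ratio = mean(s[2] for s in below) / pods_above
    return conc_ratio / pod_ratio


# Synthetic samples for illustration only.
samples = [(300, 4.0, 10), (350, 4.0, 10), (450, 12.0, 5), (500, 12.0, 5)]
print(avg_conc_by_bucket(samples))
print(excess_factor(samples, 400))
```

In this synthetic data, concurrency per pod triples above 400m while the pod count only halves, so the excess factor at that threshold lands at the 1.5 mark described above.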

What to Look For

CPU vs Concurrency/pod:

Low CPU (0-200m)      → Low conc/pod   → Pods are idle (overprovisioned)
Medium CPU (200-400m) → Moderate conc  → Healthy range
★ INFLECTION ★        → Conc jumps     → Latency starting to degrade
High CPU (500m+)      → High conc/pod  → Pods overloaded, latency bad

The inflection point is where you want your scaling target.

Decision Matrix

IMPORTANT: CPU target is in millicores (not percentage). E.g., target: 400 means scale when CPU reaches 400m.

| Inflection CPU | Recommended metric | Target | Notes |
|----------------|--------------------|--------|-------|
| < CPU request | CPU scaling | target = inflection value in millicores | Standard case |
| ~ CPU request | CPU scaling | target = CPU_request × 0.8 | Conservative |
| > CPU request (no limit) | CPU scaling | target = CPU_request × 0.8, increase CPU request | Need more CPU headroom |
| No clear inflection | Concurrency scaling | Keep current but tune target | CPU isn't the bottleneck |
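The decision matrix can be expressed as a small helper, sketched below. The function name is hypothetical, and the 10% tolerance used to decide when the inflection is "roughly equal" to the CPU request is an assumption introduced for the example; the matrix itself does not define that band.

```python
def recommend_scaling(inflection_m, cpu_request_m, tolerance=0.1):
    """Map an observed inflection point (millicores) to a row of the decision matrix.

    `tolerance` is an assumed band for "inflection ~ CPU request".
    Pass inflection_m=None when no clear inflection was found.
    """
    if inflection_m is None:
        return {"type": "concurrency", "target": None,
                "note": "keep current metric and tune its target"}
    conservative = round(cpu_request_m * 0.8)
    if inflection_m < cpu_request_m * (1 - tolerance):
        return {"type": "cpu", "target": inflection_m, "note": "standard case"}
    if inflection_m <= cpu_request_m * (1 + tolerance):
        return {"type": "cpu", "target": conservative, "note": "conservative"}
    return {"type": "cpu", "target": conservative,
            "note": "increase CPU request for more headroom"}


# Example: inflection observed at 400m on a pod requesting 500m.
print(recommend_scaling(400, 500))
```

Treat the output as a starting point for the target value and confirm it against the per-bucket data before applying it.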

Common Patterns

Pattern: CPU-bound app (Deno SSR)

Baseline CPU: 200-300m (Deno runtime + V8 JIT)
Inflection: 400-500m
Recommendation: CPU scaling with target = inflection (e.g., 400 millicores)

Pattern: IO-bound app (mostly external API calls)

CPU stays low even under high concurrency
Inflection not visible in CPU
Recommendation: Keep concurrency scaling, tune the target

Pattern: Oscillating (panic loop)

Symptoms: pods cycle between min and max
Cause: concurrency scaling + low target + scale-down-delay ratchet
Fix: Switch to CPU scaling (breaks the latency→concurrency feedback loop)

Step 4: Apply Changes

Use the deco-site-deployment skill to:

Update the state secret with new scaling config
Redeploy on both clouds

Example for CPU-based scaling (target is in millicores):

NEW_STATE=$(echo "$STATE" | jq ' .scaling.metric = { "type": "cpu", "target": 400 } ')

Step 5: Verify After Change

Monitor for 1-2 hours after applying changes:

Watch the pod count stabilize:

watch -n 10 "kubectl get pods -n sites-<sitename> --no-headers | wc -l"

Panic mode should not trigger: HPA has no panic mode, which is one of the advantages of CPU scaling.

Verify that the HPA is active:

kubectl get hpa -n sites-<sitename>

Check the HPA status:

kubectl describe hpa -n sites-<sitename>

Success Criteria

Pod count stabilizes (no more oscillation)
Avg CPU per pod stays below your target during normal traffic
CPU crosses target only during genuine traffic spikes (and scales up proportionally)
No panic mode events (HPA doesn't have panic mode)
Latency stays acceptable (check with queue-proxy metrics if available)
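Two of these criteria lend themselves to a quick numeric check once pod counts and per-pod CPU have been sampled over the verification window. The helpers below are a hypothetical sketch. How many reversals count as oscillation, and what share of samples should sit below the target, are judgment calls left to the operator.

```python
def direction_reversals(pod_counts):
    """Count how often the pod count flips between scaling up and scaling down."""
    reversals, last_sign = 0, 0
    for prev, cur in zip(pod_counts, pod_counts[1:]):
        sign = (cur > prev) - (cur < prev)  # +1 up, -1 down, 0 unchanged
        if sign and last_sign and sign != last_sign:
            reversals += 1
        if sign:
            last_sign = sign
    return reversals


def share_below_target(cpu_samples_m, target_m):
    """Fraction of per-pod CPU samples (millicores) that sit below the scaling target."""
    if not cpu_samples_m:
        raise ValueError("no CPU samples")
    return sum(1 for c in cpu_samples_m if c < target_m) / len(cpu_samples_m)


# Synthetic series for illustration: one stable, one oscillating.
print(direction_reversals([3, 3, 3, 4, 4, 4]))
print(direction_reversals([2, 8, 2, 8, 2]))
print(share_below_target([300, 350, 450, 380], 400))
```

A stable site should show few or no reversals over the 1-2 hour window, with most CPU samples below the target outside of genuine traffic spikes.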

Rollback

If the new scaling is worse, revert by changing the state secret back to concurrency scaling:

NEW_STATE=$(echo "$STATE" | jq ' .scaling.metric = { "type": "concurrency", "target": 15, "targetUtilizationPercentage": 70 } ')

Related Skills

deco-site-deployment — Apply scaling changes and redeploy
deco-site-memory-debugging — Debug memory issues on running pods
deco-incident-debugging — Incident response and triage
