Deco Site Scaling Tuning
Analyze a site's Prometheus metrics to discover the optimal autoscaling parameters. This skill helps you find the CPU/concurrency threshold where latency degrades and recommends scaling configuration accordingly.
When to Use This Skill
- A site is overscaled (too many pods for its traffic)
- A site oscillates between scaling up and down (panic mode loop)
- You need to switch the scaling metric (concurrency vs CPU vs RPS)
- You need to find the right target value for a site
- After deploying scaling changes, to verify they're working
Prerequisites
- kubectl access to the target cluster
- Prometheus accessible via port-forward (from kube-prometheus-stack in the monitoring namespace)
- Python 3 for analysis scripts
- At least 6 hours of metric history for meaningful analysis
- For direct latency data: the queue-proxy PodMonitor must be applied (see Step 0)
Quick Start
- ENABLE METRICS → Apply queue-proxy PodMonitor if not already done
- PORT-FORWARD → kubectl port-forward prometheus-pod 19090:9090
- COLLECT DATA → Run analysis scripts against Prometheus
- ANALYZE → Find CPU threshold where latency degrades
- RECOMMEND → Choose scaling metric and target
- APPLY → Use deco-site-deployment skill to apply changes
- VERIFY → Monitor for 1-2 hours after change
Files in This Skill
File | Purpose
SKILL.md | Overview, methodology, analysis procedures
analysis-scripts.md | Ready-to-use Python scripts for Prometheus queries
Step 0: Enable Queue-Proxy Metrics (one-time)
Queue-proxy runs as a sidecar on every Knative pod and exposes request latency histograms. These are critical for precise tuning but are not scraped by default.
Apply this PodMonitor:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: knative-queue-proxy
  namespace: monitoring
  labels:
    release: kube-prometheus-stack
spec:
  namespaceSelector:
    any: true
  selector:
    matchExpressions:
      - key: serving.knative.dev/revision
        operator: Exists
  podMetricsEndpoints:
    - port: http-usermetric
      path: /metrics
      interval: 15s
kubectl apply -f queue-proxy-podmonitor.yaml
Wait 2-3 hours for data to accumulate before running latency analysis
Metrics unlocked by this PodMonitor:
- revision_app_request_latencies_bucket: request latency histogram (p50/p95/p99)
- revision_app_request_latencies_sum / _count: for avg latency
- revision_app_request_count: request rate by response code
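As a rough illustration of how these metrics turn into latency numbers, the sketch below builds the PromQL strings for a latency quantile and for average latency. It assumes the series carry a `namespace` label (normally attached by Prometheus to PodMonitor targets); the function names and the `sites-demo` namespace are hypothetical, so adjust the selector to whatever labels your series actually have.

```python
def latency_quantile_query(namespace: str, quantile: float = 0.95, window: str = "5m") -> str:
    """Build a PromQL expression for a latency quantile across all pods in a namespace."""
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be strictly between 0 and 1")
    # Assumption: the scraped series carry a `namespace` label.
    selector = f'revision_app_request_latencies_bucket{{namespace="{namespace}"}}'
    return f"histogram_quantile({quantile}, sum by (le) (rate({selector}[{window}])))"


def avg_latency_query(namespace: str, window: str = "5m") -> str:
    """Build a PromQL expression for average latency: rate(_sum) / rate(_count)."""
    sel = f'{{namespace="{namespace}"}}'
    return (
        f"sum(rate(revision_app_request_latencies_sum{sel}[{window}]))"
        f" / sum(rate(revision_app_request_latencies_count{sel}[{window}]))"
    )


if __name__ == "__main__":
    print(latency_quantile_query("sites-demo"))
    print(avg_latency_query("sites-demo"))
```

Summing by `le` before calling `histogram_quantile` aggregates the histogram across pods, which gives a site-wide quantile rather than one value per pod.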
Step 1: Establish Prometheus Connection
PROM_POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o jsonpath='{.items[0].metadata.name}')
kubectl port-forward -n monitoring "$PROM_POD" 19090:9090 &
Verify
curl -s "http://127.0.0.1:19090/api/v1/query?query=up" | jq '.status'
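Once the port-forward is up, range queries can be issued from Python with only the standard library. The following is a minimal sketch, not the contents of analysis-scripts.md: the helper names are hypothetical, and it assumes Prometheus is reachable at the forwarded address used above.

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://127.0.0.1:19090"  # matches the port-forward above


def build_range_url(query, start, end, step="60s", base=PROM_URL):
    """Build a /api/v1/query_range URL; start and end are Unix timestamps."""
    params = urllib.parse.urlencode(
        {"query": query, "start": start, "end": end, "step": step}
    )
    return f"{base}/api/v1/query_range?{params}"


def query_range(query, start, end, step="60s", base=PROM_URL):
    """Run a range query and return the list of series from the response."""
    url = build_range_url(query, start, end, step, base)
    with urllib.request.urlopen(url, timeout=30) as resp:
        body = json.load(resp)
    if body.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {body}")
    # Each series looks like {"metric": {...}, "values": [[timestamp, "value"], ...]}
    return body["data"]["result"]
```

For example, `query_range("up", time.time() - 6 * 3600, time.time())` would fetch six hours of the `up` series at one-minute resolution.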
Step 2: Collect Current State
Before analyzing, understand what the site is currently configured for.
2a. Read current autoscaler config
SITENAME="<sitename>"
NS="sites-${SITENAME}"
Current revision annotations
kubectl get rev -n "$NS" -o json |
  jq '.items[]
      | select(any(.status.conditions[]?; .type == "Active" and .status == "True"))
      | {name: .metadata.name,
         annotations: (.metadata.annotations | with_entries(select(.key | startswith("autoscaling"))))}'
Global autoscaler defaults
kubectl get cm config-autoscaler -n knative-serving -o json | jq '.data | del(._example)'
2b. Current pod count and resources
kubectl get pods -n "$NS" --no-headers | wc -l
kubectl top pods -n "$NS" --no-headers | head -20
Step 3: Run Analysis
Use the scripts in analysis-scripts.md. The analysis follows this methodology:
Methodology: Finding the Optimal CPU Target
Goal: Find the CPU level at which latency starts to degrade. This is your scaling target — keep pods below this CPU to maintain good latency.
Approach:
1. Collect CPU per pod, concurrency per pod, pod count, and (if available) request latency over 6-12 hours.
2. Bucket the data by CPU range (0-200m, 200-300m, ..., 700m+).
3. For each bucket, compute the avg/p95 concurrency per pod.
4. Compute the "latency inflation factor": how much concurrency increases beyond what the pod count reduction explains:
   excess = (avg_conc_above_threshold / avg_conc_below_threshold) / (avg_pods_below / avg_pods_above)
   - excess = 1.0 → concurrency increase fully explained by fewer pods (no latency degradation)
   - excess > 1.0 → latency is inflating concurrency (pods are slowing down)
   - The CPU level where excess crosses ~1.5x is your inflection point
5. If queue-proxy latency is available, plot avg latency vs CPU directly. The hockey-stick inflection is your target.
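The excess formula above can be sketched in a few lines of Python. This is an illustrative implementation under stated assumptions, not the script from analysis-scripts.md: the function name is hypothetical, and each sample is assumed to be a `(cpu_millicores, concurrency_per_pod, pod_count)` tuple taken at one point in time.

```python
from statistics import mean


def excess_factor(samples, threshold_m):
    """Latency inflation factor for one candidate CPU threshold (millicores).

    samples: list of (cpu_millicores, concurrency_per_pod, pod_count) tuples.
    Returns None when either side of the threshold has no usable data.
    """
    below = [s for s in samples if s[0] < threshold_m]
    above = [s for s in samples if s[0] >= threshold_m]
    if not below or not above:
        return None
    conc_below = mean(s[1] for s in below)
    pods_above = mean(s[2] for s in above)
    if conc_below == 0 or pods_above == 0:
        return None
    conc_ratio = mean(s[1] for s in above) / conc_below
    pod_ratio = mean(s[2] for s in below) / pods_above
    if pod_ratio == 0:
        return None
    return conc_ratio / pod_ratio


if __name__ == "__main__":
    samples = [(200, 2.0, 10), (250, 2.0, 10), (450, 6.0, 5), (500, 6.0, 5)]
    # Concurrency tripled while pods only halved: 3.0 / 2.0 = 1.5
    print(excess_factor(samples, 400))
```

Evaluating the function at several candidate thresholds (for example every 50m) and looking for the first one where the result crosses roughly 1.5 gives the inflection point described in step 4.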
What to Look For
CPU vs Concurrency/pod:
Low CPU (0-200m)       → Low conc/pod   → Pods are idle (overprovisioned)
Medium CPU (200-400m)  → Moderate conc  → Healthy range
★ INFLECTION ★         → Conc jumps     → Latency starting to degrade
High CPU (500m+)       → High conc/pod  → Pods overloaded, latency bad
The inflection point is where you want your scaling target.
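To make the pattern above concrete, here is a small sketch that buckets samples by CPU range and reports the first bucket where average concurrency per pod jumps relative to the previous one. The bucket edges follow the ranges listed in the methodology; the 1.5x jump threshold and the function names are assumptions for illustration, so tune them to your data.

```python
from bisect import bisect_right
from statistics import mean

EDGES_M = [200, 300, 400, 500, 600, 700]  # bucket boundaries in millicores


def bucket_label(index):
    """Human-readable label for bucket `index` (0 is the lowest CPU range)."""
    if index == 0:
        return f"0-{EDGES_M[0]}m"
    if index == len(EDGES_M):
        return f"{EDGES_M[-1]}m+"
    return f"{EDGES_M[index - 1]}-{EDGES_M[index]}m"


def avg_concurrency_by_bucket(samples):
    """samples: iterable of (cpu_millicores, concurrency_per_pod) pairs."""
    buckets = {}
    for cpu_m, conc in samples:
        buckets.setdefault(bisect_right(EDGES_M, cpu_m), []).append(conc)
    return [(bucket_label(i), mean(v)) for i, v in sorted(buckets.items())]


def find_inflection(bucket_averages, jump=1.5):
    """Return the first bucket whose average is >= jump x the previous bucket's."""
    for (_, prev), (label, cur) in zip(bucket_averages, bucket_averages[1:]):
        if prev > 0 and cur / prev >= jump:
            return label
    return None  # no clear inflection in CPU


if __name__ == "__main__":
    data = [(150, 1.0), (250, 1.2), (350, 1.4), (450, 3.0), (550, 5.0)]
    print(find_inflection(avg_concurrency_by_bucket(data)))
```

A result of `None` corresponds to the "no clear inflection" case, where CPU is probably not the bottleneck.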
Decision Matrix
IMPORTANT: CPU target is in millicores (not percentage). E.g., target: 400 means scale when CPU reaches 400m.
Inflection CPU | Recommended metric | Target | Notes
< CPU request | CPU scaling | target = inflection value in millicores | Standard case
~ CPU request | CPU scaling | target = CPU_request × 0.8 | Conservative
> CPU request (no limit) | CPU scaling | target = CPU_request × 0.8, increase CPU request | Need more CPU headroom
No clear inflection | Concurrency scaling | Keep current but tune target | CPU isn't the bottleneck
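The matrix can be expressed as a small function, which may help when evaluating many sites at once. This is a sketch only: the function name is hypothetical, and the 10% tolerance used to decide that the inflection is "approximately" the CPU request is an assumption, since the matrix does not define it.

```python
def recommend_scaling(inflection_m, cpu_request_m, tolerance=0.1):
    """Map an inflection point (millicores, or None) to a scaling recommendation."""
    if inflection_m is None:
        return {"metric": "concurrency", "target": None,
                "note": "CPU isn't the bottleneck; keep concurrency scaling and tune the target"}
    # Assumption: "approximately equal" means within `tolerance` of the request.
    if abs(inflection_m - cpu_request_m) <= tolerance * cpu_request_m:
        return {"metric": "cpu", "target": round(cpu_request_m * 0.8),
                "note": "Conservative"}
    if inflection_m < cpu_request_m:
        return {"metric": "cpu", "target": inflection_m,
                "note": "Standard case"}
    return {"metric": "cpu", "target": round(cpu_request_m * 0.8),
            "note": "Need more CPU headroom; increase the CPU request"}


if __name__ == "__main__":
    print(recommend_scaling(400, 1000))
    print(recommend_scaling(480, 500))
    print(recommend_scaling(900, 500))
    print(recommend_scaling(None, 500))
```

All targets are in millicores, consistent with the note above the matrix.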
Common Patterns
Pattern: CPU-bound app (Deno SSR)
- Baseline CPU: 200-300m (Deno runtime + V8 JIT)
- Inflection: 400-500m
- Recommendation: CPU scaling with target = inflection (e.g., 400 millicores)
Pattern: IO-bound app (mostly external API calls)
- CPU stays low even under high concurrency
- Inflection not visible in CPU
- Recommendation: Keep concurrency scaling, tune the target
Pattern: Oscillating (panic loop)
- Symptoms: pods cycle between min and max
- Cause: concurrency scaling + low target + scale-down-delay ratchet
- Fix: Switch to CPU scaling (breaks the latency→concurrency feedback loop)
Step 4: Apply Changes
Use the deco-site-deployment skill to:
- Update the state secret with new scaling config
- Redeploy on both clouds
Example for CPU-based scaling (target is in millicores):
NEW_STATE=$(echo "$STATE" | jq ' .scaling.metric = { "type": "cpu", "target": 400 } ')
Step 5: Verify After Change
Monitor for 1-2 hours after applying changes:
Watch the pod count stabilize:
watch -n 10 "kubectl get pods -n sites-<sitename> --no-headers | wc -l"
Panic mode does not apply to CPU scaling, because the HPA has no panic mode. This is one of the advantages of switching.
Verify the HPA is active:
kubectl get hpa -n sites-<sitename>
Check the HPA status:
kubectl describe hpa -n sites-<sitename>
Success Criteria
- Pod count stabilizes (no more oscillation)
- Avg CPU per pod stays below your target during normal traffic
- CPU crosses target only during genuine traffic spikes (and scales up proportionally)
- No panic mode events (HPA doesn't have panic mode)
- Latency stays acceptable (check with queue-proxy metrics if available)
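One way to put a number on the first criterion is to count how often the pod count reverses direction over the observation window. The sketch below is a hypothetical helper, not part of the skill's scripts; it takes a plain list of pod counts sampled at regular intervals, and what counts as "too many" reversals is left to your judgment.

```python
def count_reversals(pod_counts):
    """Count how often the pod count switches between growing and shrinking."""
    reversals = 0
    last_direction = 0  # +1 scaling up, -1 scaling down, 0 not yet known
    for prev, cur in zip(pod_counts, pod_counts[1:]):
        direction = (cur > prev) - (cur < prev)
        if direction == 0:
            continue  # flat segment: no change in direction
        if last_direction and direction != last_direction:
            reversals += 1
        last_direction = direction
    return reversals


if __name__ == "__main__":
    print(count_reversals([2, 8, 2, 8, 2]))   # oscillating series
    print(count_reversals([2, 4, 6, 6, 6]))   # scales up once, then stable
```

A series that scales up for a spike and back down afterwards yields a single reversal, while a panic loop yields many within the same 1-2 hour window.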
Rollback
If the new scaling is worse, revert by changing the state secret back to concurrency scaling:
NEW_STATE=$(echo "$STATE" | jq ' .scaling.metric = { "type": "concurrency", "target": 15, "targetUtilizationPercentage": 70 } ')
Related Skills
- deco-site-deployment: Apply scaling changes and redeploy
- deco-site-memory-debugging: Debug memory issues on running pods
- deco-incident-debugging: Incident response and triage