Performance Engineering System

From "it's slow" to "here's why and here's the fix" — a complete methodology for measuring, diagnosing, optimizing, and preventing performance problems.

Phase 1: Performance Investigation Brief

Before touching anything, define the problem.

# performance-brief.yaml
investigation:
  reported_by: ""
  reported_date: ""
  system: ""              # service/app name
  environment: ""         # production, staging, dev

problem_statement:
  symptom: ""             # "API response time increased 3x"
  impact: ""              # "15% of users seeing timeouts"
  since_when: ""          # "After deploy v2.14 on Feb 20"
  affected_scope: ""      # "All endpoints" | "Only /search" | "Users in EU"

baselines:
  target_p50: ""          # e.g., "200ms"
  target_p95: ""          # e.g., "500ms"
  target_p99: ""          # e.g., "1000ms"
  current_p50: ""
  current_p95: ""
  current_p99: ""
  throughput_target: ""   # e.g., "1000 rps"
  error_rate_target: ""   # e.g., "<0.1%"

constraints:
  budget: ""              # time/money for optimization
  risk_tolerance: ""      # "Can we change the schema?" "Can we add caching?"
  deadline: ""            # "Must fix before Black Friday"

hypothesis:
  primary: ""             # "N+1 queries in the new recommendation engine"
  secondary: ""           # "Connection pool exhaustion under load"
  evidence: ""            # "Slow query log shows 200+ queries per request"

Performance Budget Framework

Set budgets BEFORE building, not after complaints:

Metric	Web App	API	Mobile	Batch Job
P50 response	<200ms	<100ms	<300ms	N/A
P95 response	<500ms	<250ms	<800ms	N/A
P99 response	<1s	<500ms	<1.5s	N/A
Error rate	<0.1%	<0.01%	<0.5%	<0.001%
Time to Interactive	<3s	N/A	<2s	N/A
Memory per request	<50MB	<20MB	<100MB	<1GB
CPU per request	<100ms	<50ms	<200ms	N/A
Throughput	100+ rps	500+ rps	N/A	items/min
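
A budget is only useful if something checks it. The sketch below shows one way the API column could be turned into an automated check. The names `API_BUDGET_MS`, `percentile`, and `check_budget` are illustrative assumptions, and the nearest-rank percentile is just one of several common definitions.

```python
import math

# Budgets from the "API" column above, in milliseconds (percentile -> limit).
API_BUDGET_MS = {50: 100, 95: 250, 99: 500}

def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def check_budget(samples_ms, budget_ms=API_BUDGET_MS):
    """Return {percentile: (measured, limit, within_budget)} for each budgeted percentile."""
    report = {}
    for pct, limit in budget_ms.items():
        measured = percentile(samples_ms, pct)
        report[pct] = (measured, limit, measured < limit)
    return report
```

Feeding it the latency samples from a load test or a metrics export gives a pass/fail answer per percentile rather than a single average.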

Phase 2: Measurement & Profiling

The Golden Rule

Never optimize without measuring first. Never measure without a hypothesis.

Profiling Decision Tree

Is it slow?
├── YES → Where is time spent?
│   ├── CPU-bound → Profile CPU (flame graph)
│   │   ├── Hot function found → Optimize algorithm/data structure
│   │   └── Spread evenly → Architecture problem (too many layers)
│   ├── I/O-bound → Profile I/O
│   │   ├── Database → Query analysis (Phase 4)
│   │   ├── Network → Connection profiling
│   │   ├── Disk → I/O scheduler + buffering
│   │   └── External API → Caching + async + circuit breaker
│   ├── Memory-bound → Profile allocations
│   │   ├── GC pressure → Reduce allocations, pool objects
│   │   ├── Memory leak → Heap snapshot comparison
│   │   └── Cache thrashing → Resize or eviction policy
│   └── Concurrency-bound → Profile locks/contention
│       ├── Lock contention → Reduce critical section, lock-free structures
│       ├── Thread starvation → Pool sizing
│       └── Deadlock → Lock ordering analysis
└── NO → Define "fast enough" (see budgets above)

CPU Profiling by Language

Node.js

# Built-in profiler (V8)
node --prof app.js
node --prof-process isolate-*.log > profile.txt

# Inspector-based (connect Chrome DevTools)
node --inspect app.js
# Open chrome://inspect → Profiler → Start

# Clinic.js (best overall Node.js profiler)
npx clinic doctor -- node app.js
npx clinic flame -- node app.js    # Flame graph
npx clinic bubbleprof -- node app.js  # Async bottlenecks

# 0x (flame graphs)
npx 0x app.js

Python

# cProfile (built-in)
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
# ... code to profile ...
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats('cumulative')
stats.print_stats(20)  # Top 20

# Line profiler (pip install line-profiler)
# Add @profile decorator, then:
# kernprof -l -v script.py

# py-spy (sampling profiler, no code changes)
# pip install py-spy
# py-spy top --pid <PID>
# py-spy record -o profile.svg --pid <PID>  # Flame graph

# Scalene (CPU + memory + GPU)
# pip install scalene
# scalene script.py

Go

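// Built-in pprof: the blank import registers /debug/pprof/ handlers
// on http.DefaultServeMux
package main

import (
    "log"
    "net/http"
    _ "net/http/pprof"
)

func main() {
    // Serve pprof on a localhost-only port next to your existing server
    // Access: http://localhost:6060/debug/pprof/
    go func() { log.Println(http.ListenAndServe("localhost:6060", nil)) }()

    // ... start your application here ...
    select {} // placeholder: block forever
}

// CLI analysis
// go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// go tool pprof -http=:8080 profile.out  # Web UI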

Java

# async-profiler (best for JVM)
# https://github.com/async-profiler/async-profiler
./asprof -d 30 -f profile.html <PID>

# JFR (built-in since JDK 11)
java -XX:StartFlightRecording=duration=60s,filename=rec.jfr MyApp
jfr print --events CPULoad rec.jfr

# jstack (thread dump)
jstack <PID> > threads.txt

Memory Profiling

Leak Detection Pattern (any language)

1. Take heap snapshot at T0
2. Run suspected operation N times
3. Force GC
4. Take heap snapshot at T1
5. Compare: objects that grew = potential leak
6. Check: are they reachable? From where? (retention path)
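
As a rough illustration, steps 1 to 5 can be sketched in Python with the built-in tracemalloc module. The leaking `suspected_operation` and the helper `measure_growth` are made up for the example. Step 6 (the retention path) still needs a tool such as objgraph.

```python
import gc
import tracemalloc

_leaky_registry = []  # hypothetical leak: grows on every call and is never cleared

def suspected_operation():
    _leaky_registry.append(bytearray(10_000))

def measure_growth(operation, repetitions=100):
    """Run `operation` N times between two snapshots and return the lines that grew."""
    tracemalloc.start()
    gc.collect()
    before = tracemalloc.take_snapshot()   # step 1: snapshot at T0
    for _ in range(repetitions):           # step 2: run the operation N times
        operation()
    gc.collect()                           # step 3: force GC
    after = tracemalloc.take_snapshot()    # step 4: snapshot at T1
    tracemalloc.stop()
    # step 5: compare; a positive size_diff is memory that survived the GC
    return [s for s in after.compare_to(before, "lineno") if s.size_diff > 0]

growth = measure_growth(suspected_operation)
for stat in growth[:3]:
    print(stat)
```

Growth that scales with `repetitions` points to a leak. Growth that stays flat when N doubles is more likely one-time initialization.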

Node.js Memory

// Heap snapshot
const v8 = require('v8');

function takeSnapshot(label) {
  // writeHeapSnapshot blocks the event loop and returns the file name it wrote
  const file = v8.writeHeapSnapshot(`${label}-${Date.now()}.heapsnapshot`);
  console.log(`Heap snapshot written to ${file}`);
  return file;
}

// Process memory monitoring
setInterval(() => {
  const mem = process.memoryUsage();
  console.log({
    rss_mb: (mem.rss / 1048576).toFixed(1),
    heap_used_mb: (mem.heapUsed / 1048576).toFixed(1),
    heap_total_mb: (mem.heapTotal / 1048576).toFixed(1),
    external_mb: (mem.external / 1048576).toFixed(1),
  });
}, 10000);

Python Memory

# tracemalloc (built-in)
import tracemalloc

tracemalloc.start()
# ... code ...
snapshot = tracemalloc.take_snapshot()
top = snapshot.statistics('lineno')
for stat in top[:10]:
    print(stat)

# objgraph (pip install objgraph)
import objgraph
objgraph.show_most_common_types(limit=20)
objgraph.show_growth(limit=10)  # Call twice to see what's growing

Flame Graph Interpretation

Reading a flame graph:
┌─────────────────────────────────────────────┐
│                  main()                      │  ← Entry point (bottom)
├──────────────────────┬──────────────────────┤
│     processData()    │    renderOutput()     │  ← Width = time spent
├──────────┬───────────┤                      │
│ parseCSV │ validate  │                      │  ← Tall = deep call stack
├──────────┤           │                      │
│ readline │           │                      │  ← Top = where CPU burns
└──────────┴───────────┴──────────────────────┘

WHAT TO LOOK FOR:
1. Wide plateaus at top → CPU-intensive leaf function (optimize this!)
2. Many thin towers → excessive function calls (batch or reduce)
3. Recursive patterns → potential stack overflow risk
4. Unexpected width → function taking more time than expected
5. GC/runtime frames → memory pressure

ACTION RULES:
- Plateau >20% width → must investigate
- Plateau >40% width → almost certainly the bottleneck
- If top 3 functions = 80% of time → focused optimization will work
- If evenly distributed → architectural change needed

Phase 3: Common Optimization Patterns

Algorithm & Data Structure Optimizations

Problem	Bad O()	Fix	Good O()
Repeated search of unsorted array	O(n) per lookup	Sort once + binary search, or build a Set/Map	O(log n) or O(1) per lookup
Nested loop matching	O(n²)	Hash map lookup	O(n)
Repeated string concat	O(n²)	StringBuilder/join array	O(n)
Sorting already-sorted data	O(n log n)	Check if sorted first	O(n)
Finding duplicates	O(n²)	Set-based detection	O(n)
Frequent min/max of changing data	O(n) per query	Heap/priority queue	O(log n)
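
To make the "nested loop matching" row concrete, here is a small sketch that matches users to their orders both ways. The record shapes (`id`, `user_id`) and function names are assumptions for the example.

```python
def match_orders_quadratic(users, orders):
    """O(n*m): scans every order for every user."""
    return [(u["id"], o["id"]) for u in users for o in orders if o["user_id"] == u["id"]]

def match_orders_hashed(users, orders):
    """O(n+m): index orders by user_id once, then look up each user in O(1)."""
    by_user = {}
    for o in orders:
        by_user.setdefault(o["user_id"], []).append(o["id"])
    return [(u["id"], order_id) for u in users for order_id in by_user.get(u["id"], [])]
```

Both return the same pairs. Only the second stays fast as the inputs grow, at the cost of the extra memory for the index.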

Caching Strategy Decision Matrix

Should you cache this?
├── Does the same input always produce the same output?
│   ├── YES → Cache candidate ✓
│   └── NO → Can you define a valid TTL?
│       ├── YES → Cache with TTL ✓
│       └── NO → Don't cache ✗
├── Is it called frequently?
│   ├── <10x/min → Probably not worth caching
│   └── >10x/min → Cache ✓
├── Is the source data expensive to compute/fetch?
│   ├── <10ms → Probably not worth caching
│   └── >10ms → Cache ✓
└── Does staleness cause problems?
    ├── Critical (financial, auth) → Short TTL or cache-aside with invalidation
    ├── Important (user data) → 1-5 min TTL with invalidation
    └── Tolerant (content, search) → 5-60 min TTL

CACHE LAYERS (use in order):
1. In-process (Map/LRU) → <1μs, limited by memory, per-instance
2. Shared cache (Redis/Memcached) → <1ms, shared across instances
3. CDN/edge cache → <10ms, geographic distribution
4. Browser cache → 0ms for user, stale risk

INVALIDATION STRATEGIES:
- TTL-based: simplest, best for read-heavy + staleness-tolerant
- Event-based: publish cache-invalidate on write, best for consistency
- Write-through: update cache on every write, best for write-read patterns
- Cache-aside: app manages cache explicitly, most flexible
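
A minimal sketch of layer 1 (in-process) combining cache-aside, TTL expiry, and event-based invalidation might look like the following. The class `TTLCache` and its methods are hypothetical. It has no size limit or eviction, so a real deployment would likely add an LRU bound.

```python
import time

class TTLCache:
    """Minimal in-process cache-aside helper with per-entry expiry."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock        # injectable so tests can control time
        self._entries = {}        # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                    # fresh hit
        value = loader(key)                    # miss or expired: go to the source
        self._entries[key] = (now + self.ttl, value)
        return value

    def invalidate(self, key):
        """Event-based invalidation: call this from the write path."""
        self._entries.pop(key, None)
```

Usage would be along the lines of `cache.get_or_load(user_id, load_user_from_db)`, where `load_user_from_db` stands in for whatever expensive fetch is being cached.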

Connection Pooling

# Sizing formula
pool_size: min(available_cores * 2 + effective_spindle_count, max_connections / num_instances)

# Rules of thumb:
# - PostgreSQL: connections = DB server cores * 2 + effective_spindle_count (starting point from the PostgreSQL wiki)
# - MySQL: keep total connections < 150 for most workloads (default max_connections is 151)
# - HTTP clients: match to concurrent request volume
# - Redis: usually 5-10 per instance is enough

# Warning signs of pool problems:
# - "connection timeout" errors under load
# - Response time spikes at regular intervals
# - Idle connections holding resources
# - Connection count hitting max_connections
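
The sizing formula above can be written as a small function. This is a sketch of that formula only. The function name and the floor of one connection are assumptions, and the result is a starting point to validate under load.

```python
def pool_size(cores, spindles, max_connections, num_instances):
    """min(cores * 2 + spindles, this instance's share of the server connection limit)."""
    by_hardware = cores * 2 + spindles
    by_server_limit = max_connections // num_instances
    return max(1, min(by_hardware, by_server_limit))
```

For example, a 4-core database with one disk, `max_connections` of 100, and 5 app instances gives `min(9, 20)`, so 9 connections per instance.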

Async & Concurrency Patterns

// BAD: Sequential when independent
const user = await getUser(id);
const orders = await getOrders(id);
const prefs = await getPreferences(id);
// Total: user_time + orders_time + prefs_time

// GOOD: Parallel when independent
const [user, orders, prefs] = await Promise.all([
  getUser(id),
  getOrders(id),
  getPreferences(id),
]);
// Total: max(user_time, orders_time, prefs_time)

// GOOD: Controlled concurrency for many items
// (npm: p-limit, p-map, or manual semaphore)
import pLimit from 'p-limit';
const limit = pLimit(10); // Max 10 concurrent
const results = await Promise.all(
  items.map(item => limit(() => processItem(item)))
);

# Python: asyncio for I/O-bound
import asyncio

async def fetch_all(ids):
    # Parallel: gather runs all coroutines concurrently
    tasks = [fetch_one(item_id) for item_id in ids]
    return await asyncio.gather(*tasks)

# Python: ProcessPoolExecutor for CPU-bound
from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":  # guard is required where workers are spawned (Windows, macOS)
    with ProcessPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(cpu_intensive_fn, items))
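
The p-limit pattern shown for JavaScript has a close Python analogue using `asyncio.Semaphore`. The helper name `map_limited` is an assumption. This is a sketch, not a replacement for a library with retries or cancellation handling.

```python
import asyncio

async def map_limited(func, items, limit=10):
    """Run func(item) for every item with at most `limit` coroutines in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(item):
        async with semaphore:          # waits here while `limit` calls are already running
            return await func(item)

    # gather preserves input order in its result list
    return await asyncio.gather(*(guarded(item) for item in items))
```

Something like `asyncio.run(map_limited(fetch_one, ids, limit=10))` keeps a large batch from opening hundreds of connections at once.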

N+1 Query Detection & Fix

SYMPTOM: Response time scales linearly with result count
DETECTION: Enable query logging, count queries per request

# Bad: N+1
users = db.query("SELECT * FROM users LIMIT 100")
for user in users:
    orders = db.query("SELECT * FROM orders WHERE user_id = %s", (user.id,))  # placeholder style depends on the driver
# Result: 1 + 100 = 101 queries

# Fix 1: JOIN (limit users in a subquery; a trailing LIMIT would cap joined rows, not users)
SELECT u.*, o.* FROM (SELECT * FROM users LIMIT 100) u
LEFT JOIN orders o ON o.user_id = u.id

# Fix 2: Batch load (better for large datasets)
users = db.query("SELECT * FROM users LIMIT 100")
user_ids = [u.id for u in users]
placeholders = ",".join(["%s"] * len(user_ids))  # bind values; never interpolate them into SQL
orders = db.query(f"SELECT * FROM orders WHERE user_id IN ({placeholders})", user_ids)
# Result: 2 queries regardless of count

# Fix 3: ORM eager loading
# Drizzle: db.query.users.findMany({ with: { orders: true } })
# SQLAlchemy: joinedload(User.orders)
# Prisma: include: { orders: true }
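
For a runnable version of the batch-load fix, the sketch below uses the standard-library sqlite3 module as a stand-in for whatever database is in use. The schema and the function `load_users_with_orders` are assumptions for illustration.

```python
import sqlite3

def load_users_with_orders(conn, limit=100):
    """Batch load: one query for users, one for all of their orders."""
    users = conn.execute(
        "SELECT id, name FROM users ORDER BY id LIMIT ?", (limit,)
    ).fetchall()
    user_ids = [row[0] for row in users]
    if not user_ids:
        return {}
    placeholders = ",".join("?" * len(user_ids))  # "?,?,?": values are bound, not interpolated
    orders = conn.execute(
        f"SELECT user_id, id FROM orders WHERE user_id IN ({placeholders}) ORDER BY id",
        user_ids,
    ).fetchall()
    result = {uid: [] for uid in user_ids}
    for user_id, order_id in orders:
        result[user_id].append(order_id)
    return result
```

Registering `conn.set_trace_callback(statements.append)` before the call and checking `len(statements)` is one simple way to count queries per request in a test. Very large ID lists may need chunking, since databases cap the number of bound parameters.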

Phase 4: Database Performance

Query Optimization Checklist

For every slow query:
□ Run EXPLAIN ANALYZE (not just EXPLAIN)
□ Check: is it doing a sequential scan on a large table?
□ Check: is the row estimate accurate? (bad stats = bad plan)
□ Check: are there implicit type casts preventing index use?
□ Check: is it sorting more data than needed? (add LIMIT earlier)
□ Check: is it joining in the right order?
□ Check: can a covering index eliminate table lookups?
□ Check: is the query running during peak hours? (schedule if batch)

EXPLAIN ANALYZE Interpretation

-- PostgreSQL EXPLAIN output reading guide:
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT ...;

-- Key metrics to check:
-- 1. Actual time per node (find where the time goes; "cost" is in arbitrary units, not ms)
-- 2. Rows actual vs estimated (>10x off = stale stats → run ANALYZE)
-- 3. Seq Scan on large table (>10K rows) with a selective filter = likely needs index
-- 4. Sort with external merge = needs more work_mem or index
-- 5. Nested Loop with large outer = consider hash/merge join
-- 6. Buffers shared hit vs read (low hit ratio = needs more shared_buffers)

Index Strategy Guide

WHEN TO ADD AN INDEX:
✓ WHERE clause column (equality or range)
✓ JOIN condition column
✓ ORDER BY column (lets the planner skip the sort, especially with LIMIT)
✓ Foreign key column (avoids a full scan of the child table on parent delete/update)
✓ Column in a unique constraint (most databases create this index automatically)

WHEN NOT TO ADD AN INDEX:
✗ Table has <1000 rows (seq scan is fine)
✗ Column has very low cardinality (boolean, status with 3 values)
✗ Write-heavy table where reads are rare
✗ You already have 8+ indexes on the table (diminishing returns, write penalty)

INDEX TYPES:
- B-tree (default): equality, range, sorting, LIKE 'prefix%'
- Hash: equality only (rarely better than B-tree)
- GIN: arrays, JSONB, full-text search
- GiST: geometry, range types, full-text
- BRIN: large tables with natural ordering (timestamps, sequential IDs)

COMPOSITE INDEX RULES:
1. Equality columns first, then range columns
2. Most selective column first (if all equality)
3. Index on (a, b) works for WHERE a=1 AND b=2 AND for WHERE a=1 alone
4. Index on (a, b) does NOT work for WHERE b=2 alone
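
Composite rules 3 and 4 are easy to verify by asking the planner. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` through the standard-library sqlite3 module as a stand-in. The table, index, and helper names are hypothetical. PostgreSQL output looks different, but the leading-column behavior is the same in the common case.

```python
import sqlite3

def query_plan(conn, sql):
    """Return SQLite's plan description for a query as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # column 3 holds the human-readable detail

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (a INTEGER, b INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_events_a_b ON events (a, b)")

plan_both = query_plan(conn, "SELECT payload FROM events WHERE a = 1 AND b = 2")
plan_leading = query_plan(conn, "SELECT payload FROM events WHERE a = 1")
plan_trailing = query_plan(conn, "SELECT payload FROM events WHERE b = 2")

for label, plan in [("a AND b", plan_both), ("a only", plan_leading), ("b only", plan_trailing)]:
    print(f"{label}: {plan}")
```

The first two plans should mention `idx_events_a_b`, while the `b only` query falls back to a table scan. The exact wording of the plan text varies between SQLite versions.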

Phase 5: Load Testing

Load Test Design

# load-test-plan.yaml
test_name: ""
target: ""              # URL/endpoint
date: ""

scenarios:
  - name: "Baseline"
    description: "Normal traffic pattern"
    vus: 50               # Virtual users
    duration: "5m"
    ramp_up: "30s"
    think_time: "1-3s"    # Pause between requests

  - name: "Peak"
    description: "2x normal traffic (expected peak)"
    vus: 100
    duration: "10m"
    ramp_up: "1m"

  - name: "Stress"
    description: "Find the breaking point"
    vus_start: 50
    vus_end: 500
    step_duration: "2m"   # Add users every 2 min
    step_size: 50

  - name: "Soak"
    description: "Memory leaks, connection exhaustion"
    vus: 50
    duration: "2h"

pass_criteria:
  p95_response_ms: 500
  error_rate_pct: 0.1
  throughput_rps: 200

k6 Load Test Template

// load-test.js (run: k6 run load-test.js)
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const errorRate = new Rate('errors');
const responseTime = new Trend('response_time');

export const options = {
  stages: [
    { duration: '30s', target: 20 },   // Ramp up
    { duration: '3m', target: 20 },    // Steady
    { duration: '30s', target: 50 },   // Peak
    { duration: '3m', target: 50 },    // Steady peak
    { duration: '30s', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],   // 95% under 500ms
    errors: ['rate<0.01'],              // <1% error rate
  },
};

export default function () {
  const res = http.get('https://api.example.com/endpoint');

  check(res, {
    'status 200': (r) => r.status === 200,
    'response < 500ms': (r) => r.timings.duration < 500,
  });

  errorRate.add(res.status !== 200);
  responseTime.add(res.timings.duration);

  sleep(Math.random() * 2 + 1); // 1-3s think time
}

Load Test Results Analysis

READING RESULTS:
┌──────────────────────────────────────────┐
┌─────────────────┬─────────┬──────────┬────────┐
│ Metric          │ Healthy │ Warning  │ Bad    │
├─────────────────┼─────────┼──────────┼────────┤
│ p95/p50 ratio   │ <2x     │ 2-5x     │ >5x    │  ← High ratio = tail latency problem
│ p99/p95 ratio   │ <2x     │ 2-3x     │ >3x    │  ← Outliers affecting some users
│ Error rate      │ <0.1%   │ 0.1-1%   │ >1%    │  ← Above 1% = user-visible
│ Throughput drop │ <5%     │ 5-20%    │ >20%   │  ← System under stress
│ CPU at peak     │ <70%    │ 70-85%   │ >85%   │  ← No headroom
│ Memory at peak  │ <75%    │ 75-90%   │ >90%   │  ← Risk of OOM
│ GC pause time   │ <50ms   │ 50-200ms │ >200ms │  ← GC storm
└─────────────────┴─────────┴──────────┴────────┘

BOTTLENECK IDENTIFICATION:
- Throughput plateaus but CPU is low → I/O bound (DB, network, disk)
- Throughput plateaus and CPU is high → CPU bound (optimize hot path)
- Response time climbs linearly → Queue building (capacity limit)
- Response time climbs exponentially → Resource exhaustion (connection pool, memory)
- Errors spike at specific VU count → Hard limit hit (max connections, file descriptors)
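
The tail-latency rows of the results table can be applied mechanically. In this sketch each ratio is the higher percentile divided by the lower one, and the thresholds are the ones from the table. The function names are assumptions.

```python
def classify(value, warning_at, bad_at):
    """Map a ratio onto the Healthy / Warning / Bad bands."""
    if value > bad_at:
        return "bad"
    if value >= warning_at:
        return "warning"
    return "healthy"

def tail_latency_report(p50_ms, p95_ms, p99_ms):
    """Classify tail ratios: p95 relative to p50, and p99 relative to p95."""
    return {
        "p95/p50": classify(p95_ms / p50_ms, warning_at=2, bad_at=5),
        "p99/p95": classify(p99_ms / p95_ms, warning_at=2, bad_at=3),
    }
```

A result such as healthy p95/p50 with bad p99/p95 suggests a small set of requests hitting a very different code path, which is worth tracing individually.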

Phase 6: Frontend Performance

Core Web Vitals Optimization

METRIC      │ GOOD    │ NEEDS WORK │ POOR   │ HOW TO FIX
────────────┼─────────┼────────────┼────────┼────────────────────────
LCP         │ <2.5s   │ 2.5-4s     │ >4s    │ Optimize largest image/text
INP         │ <200ms  │ 200-500ms  │ >500ms │ Break up long tasks, defer JS (INP replaced FID in March 2024)
CLS         │ <0.1    │ 0.1-0.25   │ >0.25  │ Set dimensions, font-display

LCP FIXES (in priority order):
1. Preload the LCP image: <link rel="preload" as="image" href="...">
2. Use responsive images: srcset with correct sizes
3. Serve WebP/AVIF (30-50% smaller)
4. Remove render-blocking CSS/JS from <head>
5. Use CDN for static assets
6. Server-side render the above-fold content

INP FIXES:
1. Break long tasks (>50ms) with requestIdleCallback or setTimeout(0)
2. Use web workers for CPU-intensive work
3. Debounce/throttle event handlers
4. Defer non-critical JS: <script defer> or dynamic import()
5. Avoid layout thrashing (batch DOM reads, then batch writes)

CLS FIXES:
1. Always set width/height on <img> and <video>
2. Use aspect-ratio CSS for dynamic content
3. Reserve space for ads/embeds
4. Use font-display: swap with size-adjusted fallback
5. Never insert content above existing content

Bundle Optimization

ANALYSIS:
- Webpack: npx webpack-bundle-analyzer stats.json
- Vite: npx vite-bundle-visualizer
- Next.js: @next/bundle-analyzer

REDUCTION STRATEGIES (in order of impact):
1. Code splitting: dynamic import() for routes and heavy components
2. Tree shaking: use ESM imports, avoid barrel files (index.ts re-exports)
3. Replace heavy libraries:
   - moment.js (~300KB minified with locales) → date-fns (tree-shakeable) or dayjs (~2KB gzipped)
   - lodash (~70KB minified full build) → lodash-es (tree-shakeable) or native JS
   - chart.js → lightweight alternative for simple charts
4. Lazy load below-fold components
5. Externalize large deps to CDN (React, etc.)
6. Compress: Brotli > gzip (15-20% smaller)

Phase 7: Infrastructure & Scaling

Scaling Decision Framework

VERTICAL SCALING (scale up):
✓ Quick fix, no code changes
✓ Database servers (often best first move)
✓ Memory-bound workloads
✗ Diminishing returns past 8-16 cores
✗ Single point of failure
✗ Expensive at high end

HORIZONTAL SCALING (scale out):
✓ Stateless services (APIs, workers)
✓ Read-heavy workloads (read replicas)
✓ Geographic distribution
✗ Requires stateless design
✗ Adds complexity (load balancing, session management)
✗ Not all workloads parallelize

SCALING CHECKLIST:
□ Can we optimize the code first? (cheapest option)
□ Can we add caching? (often 10-100x improvement)
□ Can we add a read replica? (if read-heavy)
□ Can we queue and process async? (if latency-tolerant)
□ Can we scale vertically? (if CPU/memory bound)
□ Do we need horizontal scaling? (if all above exhausted)

Auto-scaling Configuration

# Kubernetes HPA example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70    # Scale at 70% CPU
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60    # Wait 1m before scaling up
      policies:
        - type: Percent
          value: 50                      # Max 50% increase per step
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # Wait 5m before scaling down
      policies:
        - type: Percent
          value: 25                      # Max 25% decrease per step
          periodSeconds: 120
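
To reason about what a config like this will do, it helps to know the core HPA calculation: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min/max bounds. The sketch below implements only that step. It deliberately ignores the tolerance band, stabilization windows, and the per-step scaling policies, and the function name is an assumption.

```python
import math

def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=2, max_replicas=20):
    """Core HPA ratio, clamped to the configured replica bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))
```

With 4 replicas at 90% CPU against the 70% target, the ratio asks for 6 replicas. The 50%-per-step scale-up policy above happens to allow exactly that jump in a single period.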

Phase 8: Capacity Planning

Capacity Model Template

# capacity-model.yaml
service: ""
last_updated: ""

current_state:
  daily_requests: 0
  peak_rps: 0
  avg_response_ms: 0
  instances: 0
  cpu_peak_pct: 0
  memory_peak_pct: 0
  db_connections_peak: 0
  storage_used_gb: 0

growth_model:
  request_growth_monthly_pct: 0    # e.g., 15%
  storage_growth_monthly_gb: 0
  seasonal_peak_multiplier: 0      # e.g., 3x for Black Friday

projections:
  # Formula: current * (1 + growth_rate)^months * seasonal_multiplier
  3_month:
    daily_requests: 0
    peak_rps: 0
    instances_needed: 0
    storage_gb: 0
    estimated_cost: ""
  6_month:
    daily_requests: 0
    peak_rps: 0
    instances_needed: 0
    storage_gb: 0
    estimated_cost: ""
  12_month:
    daily_requests: 0
    peak_rps: 0
    instances_needed: 0
    storage_gb: 0
    estimated_cost: ""

headroom_rules:
  cpu: "Scale when sustained >70% for 5m"
  memory: "Scale when >80%"
  storage: "Alert when >75%, expand when >85%"
  db_connections: "Alert when >80% of max"
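
The projection formula in the template can be scripted so the model is recomputed rather than hand-edited. In this sketch `project` follows the formula in the comment. `instances_needed` is an additional assumption: it presumes a measured per-instance capacity and plans to the 70% headroom rule.

```python
import math

def project(current, monthly_growth_pct, months, seasonal_multiplier=1.0):
    """current * (1 + growth_rate)^months * seasonal_multiplier"""
    return current * (1 + monthly_growth_pct / 100) ** months * seasonal_multiplier

def instances_needed(peak_rps, rps_per_instance, target_utilization=0.7):
    """Instances required to serve peak_rps while staying under the utilization target."""
    return math.ceil(peak_rps / (rps_per_instance * target_utilization))
```

For example, 1000 peak rps growing 15% per month reaches roughly 1521 rps after three months. At an assumed 200 rps per instance that needs 11 instances to stay under 70% utilization.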

Cost-Performance Tradeoff Analysis

For every optimization, calculate:

Monthly ROI = (time_saved_per_month × cost_per_hour) / implementation_cost
Payback (months) = implementation_cost / (time_saved_per_month × cost_per_hour)

EXAMPLE:
- P95 latency: 800ms → 200ms after optimization
- Requests/month: 10M
- Time saved: 600ms × 10M = 1,667 hours of compute (upper bound: assumes every request saves the full P95 delta)
- Compute cost: $0.05/hour = $83/month savings
- Implementation: 16 hours × $150/hr = $2,400
- Payback: 29 months ← NOT WORTH IT for cost alone

BUT ALSO CONSIDER:
- User experience improvement → conversion rate
- Reduced infrastructure needs → fewer instances
- Headroom for growth → delayed scaling investment
- Developer productivity → faster local dev cycles
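
The payback arithmetic from the example can be captured in a few lines, which makes it easy to rerun with different assumptions. The function name and parameters are illustrative. It covers compute cost only, not the qualitative factors listed above.

```python
def payback_months(latency_saved_ms, requests_per_month, compute_cost_per_hour,
                   implementation_cost):
    """Months until compute savings alone repay the implementation cost."""
    hours_saved = latency_saved_ms / 1000 * requests_per_month / 3600
    monthly_savings = hours_saved * compute_cost_per_hour
    return implementation_cost / monthly_savings

# The example above: 600ms saved, 10M requests/month, $0.05/hour, $2,400 to implement
print(round(payback_months(600, 10_000_000, 0.05, 2400), 1))
```

Plugging in the numbers from the example gives a payback of about 29 months, matching the figure above.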

Phase 9: Performance in CI/CD

Automated Performance Gates

# .github/workflows/perf-gate.yml
name: Performance Gate
on: pull_request

jobs:
  benchmark:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run benchmarks
        run: |
          # Run your benchmark suite
          npm run benchmark -- --json > bench-results.json

      - name: Compare with baseline
        run: |
          # Compare against main branch baseline
          node scripts/compare-benchmarks.js \
            --baseline benchmarks/baseline.json \
            --current bench-results.json \
            --threshold 10  # Fail if >10% regression

      - name: Load test (on staging)
        if: github.base_ref == 'main'
        run: |
          # Assumes k6 is already installed on the runner; add an install step if not
          k6 run --out json=load-results.json tests/load-test.js
          # Check thresholds automatically via k6

      - name: Bundle size check
        run: |
          npm run build
          node scripts/check-bundle-size.js \
            --max-size 250KB \
            --max-increase 5%

Performance Regression Detection

AUTOMATED CHECKS (run on every PR):
□ Unit benchmarks: critical path functions < threshold
□ Bundle size: total and per-chunk limits
□ Lighthouse CI: Core Web Vitals pass
□ Query count: no N+1 regressions (count queries per test)
□ Memory: no leak patterns in test suite

WEEKLY CHECKS (cron job):
□ Production p50/p95/p99 trends (compare to 4-week average)
□ Error rate trends
□ Database slow query log review
□ Infrastructure cost vs traffic ratio
□ Cache hit rates

MONTHLY REVIEW:
□ Capacity model update
□ Performance budget review
□ Top 10 slowest endpoints → optimization candidates
□ Cost-performance analysis
□ Load test full suite against staging

Phase 10: Performance Culture

Performance Review Checklist

Score your system (0-100):

MEASUREMENT (25 points):
□ (5) Performance budgets defined for all key metrics
□ (5) Real User Monitoring (RUM) in production
□ (5) Alerting on p95 degradation
□ (5) Dashboards visible to team
□ (5) Regular load testing

PREVENTION (25 points):
□ (5) Performance gates in CI/CD
□ (5) Bundle size limits enforced
□ (5) Query count checks in tests
□ (5) Code review includes perf review
□ (5) Capacity planning model maintained

OPTIMIZATION (25 points):
□ (5) Caching strategy documented
□ (5) Database indexes reviewed quarterly
□ (5) No known N+1 queries
□ (5) Connection pools properly sized
□ (5) Async patterns used for I/O

OPERATIONS (25 points):
□ (5) Auto-scaling configured and tested
□ (5) Slow query logging enabled
□ (5) Memory leak monitoring
□ (5) Performance incident runbook exists
□ (5) Monthly performance review

Common Anti-Patterns

1. PREMATURE OPTIMIZATION
   Problem: Optimizing before measuring
   Fix: Profile first, optimize the measured bottleneck

2. MICRO-BENCHMARKING IN ISOLATION
   Problem: Function is fast alone but slow in context (cache, contention)
   Fix: Always benchmark in realistic conditions with realistic data

3. OPTIMIZING THE WRONG LAYER
   Problem: Tuning app code when the DB is the bottleneck
   Fix: Use distributed tracing to find the actual bottleneck

4. CACHING EVERYTHING
   Problem: Cache invalidation bugs, stale data, memory pressure
   Fix: Cache selectively using the decision matrix (Phase 3)

5. PREMATURE HORIZONTAL SCALING
   Problem: Adding instances when single instance is underoptimized
   Fix: Vertical optimization first, scale second

6. IGNORING TAIL LATENCY
   Problem: p50 is fine but p99 is terrible
   Fix: Investigate outliers; they are often your heaviest (and most valuable) users

7. LOAD TESTING IN DEV
   Problem: Dev environment doesn't match production
   Fix: Load test against staging with production-like data

8. OPTIMIZING COLD PATHS
   Problem: Spending time on rarely-executed code
   Fix: Profile in production to find actual hot paths

Quick Reference: Tool Selection

Task	Recommended Tool	Alternative
HTTP benchmarking	k6	wrk, ab, hey
CPU profiling (Node)	clinic flame	0x, --prof
CPU profiling (Python)	py-spy	Scalene, cProfile
CPU profiling (Go)	pprof	go tool trace
CPU profiling (Java)	async-profiler	JFR, VisualVM
Memory profiling	language-specific (see Phase 2)
CLI benchmarking	hyperfine	time
Bundle analysis	webpack-bundle-analyzer	source-map-explorer
Web performance	Lighthouse	WebPageTest
DB query analysis	EXPLAIN ANALYZE	pgMustard, pganalyze
Distributed tracing	Jaeger, Zipkin	OpenTelemetry
APM	Datadog, New Relic	Grafana + Prometheus
Continuous profiling	Pyroscope	Parca

Natural Language Commands

"Profile this function"     → CPU profiling with flame graph
"Why is this endpoint slow" → Full investigation brief + profiling
"Load test the API"         → k6 test design and execution
"Check for memory leaks"    → Heap snapshot comparison workflow
"Optimize this query"       → EXPLAIN ANALYZE + index recommendations
"Review frontend perf"      → Core Web Vitals audit + bundle analysis
"Plan capacity for 10x"     → Capacity model with projections
"Set up perf monitoring"    → CI/CD gates + dashboards + alerts
"Find the bottleneck"       → Profiling decision tree walkthrough
"Score our performance"     → Performance review checklist (0-100)
"Compare before and after"  → Benchmark comparison methodology
"Reduce bundle size"        → Bundle analysis + reduction strategies