SRE Monitoring and Observability
Building comprehensive monitoring and observability systems.
Four Golden Signals
Latency
Time to process requests:
Request duration
http_request_duration_seconds
95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
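To show what a quantile query over histogram buckets computes, here is a minimal sketch of quantile estimation by linear interpolation inside cumulative buckets. It is a simplification for illustration, not Prometheus's implementation; the function name `estimateQuantile` and the bucket shape `{ le, count }` are assumptions made for this example.

```javascript
// Simplified quantile estimation from cumulative histogram buckets.
// `buckets` must be sorted by upper bound (`le`) and end with le = Infinity.
function estimateQuantile(q, buckets) {
  if (q < 0 || q > 1 || buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  // First bucket whose cumulative count reaches the target rank.
  const i = buckets.findIndex((b) => b.count >= rank);

  // Rank falls in the +Inf bucket: report the highest finite upper bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;

  const lowerBound = i === 0 ? 0 : buckets[i - 1].le;
  const countBelow = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - countBelow;
  if (inBucket === 0) return lowerBound;

  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (buckets[i].le - lowerBound) * ((rank - countBelow) / inBucket);
}

const exampleBuckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.95, exampleBuckets)); // about 0.75 seconds
```

The estimate is only as precise as the bucket boundaries, so it helps to place bucket bounds close to the latency thresholds you alert on.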
Traffic
Demand on the system:
Requests per second
rate(http_requests_total[5m])
By endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
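As a rough illustration of what a per-second rate over a counter means, the sketch below computes the average increase per second from raw counter samples and tolerates counter resets. It is deliberately simplified (Prometheus's `rate()` also extrapolates to the window boundaries, which is omitted here); `counterRate` and the sample shape `{ t, v }` are names assumed for this example.

```javascript
// Simplified per-second rate over counter samples, tolerating counter resets.
// `samples` is [{ t: seconds, v: counterValue }, ...] sorted by time.
function counterRate(samples) {
  if (samples.length < 2) return NaN;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter restarted from zero, so count the new value itself.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : NaN;
}

const exampleSamples = [
  { t: 0, v: 100 },
  { t: 100, v: 160 },
  { t: 200, v: 30 }, // process restarted between these two samples
  { t: 300, v: 90 },
];
console.log(counterRate(exampleSamples)); // 150 requests over 300 s = 0.5 req/s
```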
Errors
Rate of failed requests:
Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
Error budget remaining (slo_target here is the allowed error rate, e.g. 0.001 for a 99.9% SLO)
1 - (error_rate / slo_target)
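The arithmetic linking an error rate to an SLO can be sketched as follows, assuming the SLO is stated as an availability target (for example 0.999). The function name `errorBudget` and its parameter `availabilityTarget` are hypothetical and chosen only for this example.

```javascript
// Error budget arithmetic for a ratio-based SLO.
// availabilityTarget is the fraction of requests that must succeed, e.g. 0.999.
function errorBudget(errorRate, availabilityTarget) {
  const allowedErrorRate = 1 - availabilityTarget; // the budget, as a fraction of requests
  const burnRate = errorRate / allowedErrorRate;   // 1.0 means spending exactly on budget
  return {
    allowedErrorRate,
    burnRate,
    // Fraction of the budget left if this error rate held for the whole SLO window.
    remaining: 1 - burnRate,
  };
}

// 0.05% errors against a 99.9% target burns the budget at roughly half speed.
console.log(errorBudget(0.0005, 0.999));
```

A burn rate above 1 means the budget will run out before the SLO window ends if nothing changes, which is why burn rate is a common basis for paging alerts.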
Saturation
Resource utilization:
CPU usage
100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Memory usage
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
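The two saturation expressions above reduce to simple arithmetic. The sketch below reproduces it for a single core and a single host, using hypothetical function names and made-up readings.

```javascript
// CPU busy percentage for one core from two readings of its idle-seconds counter.
function cpuUsagePercent(idleStart, idleEnd, elapsedSeconds) {
  const idleFraction = (idleEnd - idleStart) / elapsedSeconds; // idle seconds per second
  return 100 - idleFraction * 100;
}

// Memory in use as a percentage of total, based on "available" rather than "free" memory.
function memoryUsagePercent(totalBytes, availableBytes) {
  return ((totalBytes - availableBytes) / totalBytes) * 100;
}

console.log(cpuUsagePercent(1000, 1045, 60)); // idle for 45 of 60 s, so 25% busy
console.log(memoryUsagePercent(16, 4));       // 12 of 16 units in use, so 75%
```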
Service Level Indicators (SLIs)
Availability SLI
Successful requests / Total requests
sum(rate(http_requests_total{status=~"[23].."}[30d])) / sum(rate(http_requests_total[30d]))
Latency SLI
Requests faster than threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d])) / sum(rate(http_request_duration_seconds_count[30d]))
Throughput SLI
Requests processed within capacity
clamp_max(rate(http_requests_total[5m]) / capacity_requests_per_second, 1.0)
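Each SLI above is a ratio of good events to total events. A minimal sketch of the same three calculations from raw counts follows; the function names are illustrative assumptions, not part of any library.

```javascript
// Ratio helper shared by the event-based SLIs; NaN when there was no traffic.
function ratio(good, total) {
  return total > 0 ? good / total : NaN;
}

// Availability: successful requests / total requests.
const availabilitySli = (successful, total) => ratio(successful, total);

// Latency: requests faster than the threshold / total requests.
const latencySli = (fasterThanThreshold, total) => ratio(fasterThanThreshold, total);

// Throughput: served load relative to capacity, capped at 1.0 like clamp_max.
const throughputSli = (requestsPerSecond, capacityPerSecond) =>
  Math.min(requestsPerSecond / capacityPerSecond, 1.0);

console.log(availabilitySli(9990, 10000)); // 0.999
console.log(latencySli(9500, 10000));      // 0.95
console.log(throughputSli(1200, 1000));    // capped at 1
```

Returning NaN for an empty window keeps "no traffic" distinct from "100% good", which matters when the SLI feeds an alert.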
Alerting
Alert Severity Levels
P0 - Critical: Service down or severe degradation
P1 - High: Significant impact, error budget at risk
P2 - Medium: Degradation, not user-facing yet
P3 - Low: Awareness, no immediate action needed
Example Alerts
High error rate
groups:
  - name: sre
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            / sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1.0
        for: 10m
        labels:
          severity: warning
      # sli_availability and error_budget_remaining are assumed to be
      # recording rules defined elsewhere.
      - alert: ErrorBudgetBurn
        expr: |
          (1 - sli_availability) > (error_budget_remaining * 10)
        for: 1h
        labels:
          severity: high
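The `for:` field in the rules above delays firing until the condition has held continuously for the given duration. A minimal sketch of that gating logic, assuming one evaluation per entry and a hypothetical `alertStates` helper:

```javascript
// Sketch of how a `for:` duration gates an alert: the condition must hold at
// every evaluation for the full duration before the alert moves to "firing".
function alertStates(evaluations, forSeconds) {
  let activeSince = null; // time the condition first became true in the current streak
  return evaluations.map(({ t, conditionTrue }) => {
    if (!conditionTrue) {
      activeSince = null; // any false evaluation resets the streak
      return 'inactive';
    }
    if (activeSince === null) activeSince = t;
    return t - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// Evaluated every 60 s with `for: 5m`; the dip at t=120 restarts the clock.
const evaluations = [0, 60, 120, 180, 240, 300, 360, 420, 480].map((t) => ({
  t,
  conditionTrue: t !== 120,
}));
console.log(alertStates(evaluations, 300));
```

A longer `for:` suppresses short spikes at the cost of slower detection, so critical alerts usually get shorter durations than warnings.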
Dashboards
Overview Dashboard
- Service health (red/yellow/green)
- Request rate
- Error rate
- Latency percentiles (p50, p95, p99)
- Saturation metrics
Detailed Dashboard
- Per-endpoint metrics
- Dependency health
- Database performance
- Cache hit rates
- Queue depths
Distributed Tracing
OpenTelemetry
const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

// processRequest is the application's own handler (not shown here).
async function handleRequest(req) {
  const span = tracer.startSpan('handle_request');
  try {
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('request.path', req.path);

    const result = await processRequest(req);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // Attach the exception to the span before marking it as failed.
    span.recordException(error);
    span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
    throw error;
  } finally {
    // Always end the span so it is exported, even when the request fails.
    span.end();
  }
}
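The try/catch/finally pattern above can be factored into a reusable wrapper. The sketch below is one possible shape, not an OpenTelemetry API: `withSpan` and `createRecordingTracer` are hypothetical helpers, and the recording tracer is a stand-in so the example runs without the SDK installed. The numeric status codes mirror `SpanStatusCode` in `@opentelemetry/api` (OK = 1, ERROR = 2).

```javascript
// Mirrors SpanStatusCode in @opentelemetry/api so the sketch has no dependencies.
const STATUS_OK = 1;
const STATUS_ERROR = 2;

// Hypothetical helper: runs `fn` inside a span and guarantees the span is ended.
async function withSpan(tracer, name, fn) {
  const span = tracer.startSpan(name);
  try {
    const result = await fn(span);
    span.setStatus({ code: STATUS_OK });
    return result;
  } catch (error) {
    span.recordException(error);
    span.setStatus({ code: STATUS_ERROR, message: error.message });
    throw error;
  } finally {
    span.end();
  }
}

// Stand-in tracer that records calls instead of exporting spans.
function createRecordingTracer(log) {
  return {
    startSpan(name) {
      log.push(`start ${name}`);
      return {
        setAttribute: (key, value) => log.push(`attr ${key}=${value}`),
        setStatus: ({ code }) => log.push(`status ${code}`),
        recordException: (error) => log.push(`exception ${error.message}`),
        end: () => log.push(`end ${name}`),
      };
    },
  };
}
```

With a real tracer from `trace.getTracer(...)`, a handler body would become `withSpan(tracer, 'handle_request', (span) => processRequest(req))`, keeping status handling in one place.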
Structured Logging
logger.info('request_processed', {
  request_id: req.id,
  user_id: req.user.id,
  endpoint: req.path,
  method: req.method,
  status_code: res.statusCode,
  duration_ms: duration,
  error: error?.message,
});
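The `logger` in the call above is not specified, so here is a minimal sketch of a logger that would accept it, emitting one JSON object per line. `createLogger` and its injectable `write` parameter are assumptions for illustration; in practice an established logging library would fill this role.

```javascript
// Minimal structured logger sketch: one JSON object per line.
// `write` is injectable so output can go to stdout, a file, or a test buffer.
function createLogger(write = (line) => process.stdout.write(line + '\n')) {
  const log = (level) => (message, fields = {}) => {
    // JSON.stringify drops keys whose value is undefined, so optional fields
    // such as `error` disappear from entries for successful requests.
    write(JSON.stringify({ timestamp: new Date().toISOString(), level, message, ...fields }));
  };
  return { info: log('info'), error: log('error') };
}

const lines = [];
const logger = createLogger((line) => lines.push(line));
logger.info('request_processed', {
  request_id: 'req-1',
  status_code: 200,
  duration_ms: 12,
  error: undefined,
});
console.log(lines[0]);
```

Keeping field names stable (`request_id`, `status_code`, `duration_ms`) is what makes the logs queryable, so it is worth fixing them in one place like this.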
Best Practices
USE Method
For resources:
- Utilization: % time resource is busy
- Saturation: Work queued but not serviced
- Errors: Error count
RED Method
For requests:
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Request latency distribution
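The three RED measurements can be derived from a window of request records. The sketch below assumes a hypothetical record shape `{ durationMs, failed }` and uses nearest-rank percentiles; a production system would more likely compute these from histograms.

```javascript
// RED summary for a window of request records: [{ durationMs, failed }, ...].
function redSummary(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  // Nearest-rank percentile over the sorted durations.
  const percentile = (p) =>
    durations.length === 0
      ? NaN
      : durations[Math.max(0, Math.ceil((p * durations.length) / 100) - 1)];
  return {
    ratePerSecond: requests.length / windowSeconds,
    errorsPerSecond: requests.filter((r) => r.failed).length / windowSeconds,
    durationMs: { p50: percentile(50), p95: percentile(95), p99: percentile(99) },
  };
}

// 100 requests over 10 s with durations 1..100 ms; every tenth request fails.
const sampleRequests = Array.from({ length: 100 }, (_, i) => ({
  durationMs: i + 1,
  failed: (i + 1) % 10 === 0,
}));
console.log(redSummary(sampleRequests, 10));
```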
Alert on Symptoms, Not Causes
Good - alert on user impact
- alert: HighLatency
  expr: p95_latency > 1  # seconds
Bad - alert on potential cause
- alert: HighCPU
  expr: cpu_usage > 80  # percent
Runbook Links
annotations:
  runbook: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/abc123"
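One way to keep runbook links from being forgotten is to check rule files for them, for example in CI. The sketch below assumes the rule file has already been parsed into a plain object and that the annotation key is `runbook`, as in the snippet above; `alertsMissingRunbook` is a hypothetical helper.

```javascript
// Lists alerting rules that lack a runbook annotation.
// `ruleFile` is an already-parsed rule file: { groups: [{ name, rules: [...] }] }.
function alertsMissingRunbook(ruleFile) {
  const missing = [];
  for (const group of ruleFile.groups ?? []) {
    for (const rule of group.rules ?? []) {
      // Recording rules have `record` instead of `alert`; skip them.
      if (rule.alert && !rule.annotations?.runbook) {
        missing.push(`${group.name}/${rule.alert}`);
      }
    }
  }
  return missing;
}

const exampleRuleFile = {
  groups: [
    {
      name: 'sre',
      rules: [
        {
          alert: 'HighErrorRate',
          annotations: { runbook: 'https://wiki.example.com/runbooks/high-error-rate' },
        },
        { alert: 'LatencyP95High', labels: { severity: 'warning' } },
        { record: 'sli_availability', expr: '...' },
      ],
    },
  ],
};
console.log(alertsMissingRunbook(exampleRuleFile)); // [ 'sre/LatencyP95High' ]
```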