Observability
Three Pillars
- Logs
Discrete events with context.
{ "timestamp": "2024-01-01T12:00:00Z", "level": "error", "message": "Failed to process order", "orderId": "123", "error": "Payment declined", "traceId": "abc123" }
- Metrics
Numeric measurements over time.
http_requests_total{method="GET", status="200"} 1234 http_request_duration_seconds{quantile="0.95"} 0.23
- Traces
Request flow through services.
Trace: abc123 ├── API Gateway (50ms) │ ├── Auth Service (10ms) │ └── Order Service (35ms) │ └── Database (20ms)
OpenTelemetry Setup
import { NodeSDK } from '@opentelemetry/sdk-node'; import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({ traceExporter: new OTLPTraceExporter({ url: 'http://collector:4318/v1/traces', }), serviceName: 'my-service', });
sdk.start();
Key Metrics
RED Method (Request-focused)
-
Rate: Requests per second
-
Errors: Failed requests per second
-
Duration: Request latency
USE Method (Resource-focused)
-
Utilization: % time busy
-
Saturation: Queue depth
-
Errors: Error count
Alerting
Good Alerts
-
Actionable: Something can be done
-
Urgent: Needs immediate attention
-
Specific: Clear what's wrong
Alert Template
alert: HighErrorRate expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.01 for: 5m labels: severity: critical annotations: summary: "High error rate on {{ $labels.service }}" description: "Error rate is {{ $value | humanizePercentage }}"
Dashboards
Essential panels:
-
Request rate
-
Error rate
-
Latency (P50, P95, P99)
-
Saturation (CPU, memory)
-
Active alerts