Observability

Three Pillars

Logs

Discrete events with context.

{
  "timestamp": "2024-01-01T12:00:00Z",
  "level": "error",
  "message": "Failed to process order",
  "orderId": "123",
  "error": "Payment declined",
  "traceId": "abc123"
}
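A minimal sketch of how a service might emit entries shaped like the one above. The `formatLog` helper and its types are hypothetical, not part of any logging library; the field names simply mirror the sample entry.

```typescript
// Hypothetical helper: field names mirror the sample log entry.
type LogLevel = "debug" | "info" | "warn" | "error";

interface LogContext {
  traceId?: string;
  [key: string]: unknown;
}

function formatLog(
  level: LogLevel,
  message: string,
  context: LogContext = {},
  now: Date = new Date(),
): string {
  // Fixed fields come first so every line starts with the same shape;
  // context fields (orderId, traceId, ...) are appended after them.
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...context,
  });
}

const line = formatLog(
  "error",
  "Failed to process order",
  { orderId: "123", error: "Payment declined", traceId: "abc123" },
  new Date("2024-01-01T12:00:00Z"),
);
console.log(line);
```

Carrying the `traceId` in every entry is what lets a log line be joined to the matching trace later.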

Metrics

Numeric measurements over time.

http_requests_total{method="GET", status="200"} 1234
http_request_duration_seconds{quantile="0.95"} 0.23
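To illustrate how a labeled counter maps onto lines like the ones above, here is a small in-memory sketch. The `Counter` class is hypothetical and skips label-value escaping; a real service would normally rely on a metrics client library instead.

```typescript
// Hypothetical in-memory counter, for illustration only.
class Counter {
  private readonly name: string;
  private readonly values = new Map<string, number>();

  constructor(name: string) {
    this.name = name;
  }

  // Increment the series identified by this exact label set.
  inc(labels: Record<string, string>, amount = 1): void {
    const key = Counter.serialize(labels);
    this.values.set(key, (this.values.get(key) ?? 0) + amount);
  }

  // Render one line per series: name{label="value",...} value
  render(): string {
    return Array.from(this.values.entries())
      .map(([key, value]) => `${this.name}{${key}} ${value}`)
      .join("\n");
  }

  // Sort label names so {a,b} and {b,a} map to the same series.
  private static serialize(labels: Record<string, string>): string {
    return Object.keys(labels)
      .sort()
      .map((k) => `${k}="${labels[k]}"`)
      .join(",");
  }
}

const httpRequestsTotal = new Counter("http_requests_total");
httpRequestsTotal.inc({ method: "GET", status: "200" });
httpRequestsTotal.inc({ status: "200", method: "GET" });
httpRequestsTotal.inc({ method: "POST", status: "500" });
console.log(httpRequestsTotal.render());
```

Each distinct label combination becomes its own time series, which is why high-cardinality labels such as user IDs are usually kept out of metrics.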

Traces

Request flow through services.

Trace: abc123
├── API Gateway (50ms)
│   ├── Auth Service (10ms)
│   └── Order Service (35ms)
│       └── Database (20ms)
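One common way to read a trace like this is to ask how much time each service spent on its own work rather than waiting on callees. The sketch below models the same tree with a hypothetical `Span` type and assumes child spans run one after another, not in parallel.

```typescript
interface Span {
  name: string;
  durationMs: number;
  children: Span[];
}

// Self time = the span's duration minus the time covered by its direct children.
// Assumes children run sequentially; overlapping children would need interval math.
function selfTimeMs(span: Span): number {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  return Math.max(0, span.durationMs - childTotal);
}

// Walk the tree depth-first and collect self time per span name.
function selfTimes(
  root: Span,
  out: Record<string, number> = {},
): Record<string, number> {
  out[root.name] = selfTimeMs(root);
  for (const child of root.children) {
    selfTimes(child, out);
  }
  return out;
}

const trace: Span = {
  name: "API Gateway",
  durationMs: 50,
  children: [
    { name: "Auth Service", durationMs: 10, children: [] },
    {
      name: "Order Service",
      durationMs: 35,
      children: [{ name: "Database", durationMs: 20, children: [] }],
    },
  ],
};

console.log(selfTimes(trace));
```

For this trace, the database accounts for the largest share of self time (20ms), followed by the Order Service (15ms).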

OpenTelemetry Setup

import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: 'http://collector:4318/v1/traces',
  }),
  serviceName: 'my-service',
});

sdk.start();

Key Metrics

RED Method (Request-focused)

Rate: Requests per second
Errors: Failed requests per second
Duration: Request latency
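As a rough illustration, the three RED signals can be derived from raw request records for one observation window. The `RequestRecord` shape and the choice to count only 5xx responses as failures are assumptions made for this sketch.

```typescript
interface RequestRecord {
  durationMs: number;
  status: number;
}

interface RedSummary {
  ratePerSec: number;
  errorsPerSec: number;
  meanDurationMs: number;
}

// Summarize one observation window. Treating 5xx as "failed" is an assumption.
function redSummary(records: RequestRecord[], windowSeconds: number): RedSummary {
  const failed = records.filter((r) => r.status >= 500).length;
  const totalMs = records.reduce((sum, r) => sum + r.durationMs, 0);
  return {
    ratePerSec: records.length / windowSeconds,
    errorsPerSec: failed / windowSeconds,
    meanDurationMs: records.length === 0 ? 0 : totalMs / records.length,
  };
}

const red = redSummary(
  [
    { durationMs: 100, status: 200 },
    { durationMs: 200, status: 200 },
    { durationMs: 300, status: 503 },
    { durationMs: 400, status: 200 },
  ],
  2, // window length in seconds
);
console.log(red);
```

The mean is used here only to keep the sketch short; in practice duration is usually tracked as percentiles, since a mean hides tail latency.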

USE Method (Resource-focused)

Utilization: % time busy
Saturation: Queue depth
Errors: Error count
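The USE signals can be sketched the same way from periodic samples of a single resource, such as a worker pool or a disk. The `ResourceSample` shape is hypothetical; how "busy" and "queue depth" are measured depends on the resource.

```typescript
interface ResourceSample {
  busy: boolean; // was the resource doing work at sample time?
  queueDepth: number; // work items waiting at sample time
  errors: number; // errors observed since the previous sample
}

interface UseSummary {
  utilizationPct: number;
  avgQueueDepth: number;
  errorCount: number;
}

// Aggregate evenly spaced samples into the three USE signals.
function useSummary(samples: ResourceSample[]): UseSummary {
  if (samples.length === 0) {
    return { utilizationPct: 0, avgQueueDepth: 0, errorCount: 0 };
  }
  const busyCount = samples.filter((s) => s.busy).length;
  const queueTotal = samples.reduce((sum, s) => sum + s.queueDepth, 0);
  const errorCount = samples.reduce((sum, s) => sum + s.errors, 0);
  return {
    utilizationPct: (busyCount / samples.length) * 100,
    avgQueueDepth: queueTotal / samples.length,
    errorCount,
  };
}

const usage = useSummary([
  { busy: true, queueDepth: 0, errors: 0 },
  { busy: true, queueDepth: 2, errors: 1 },
  { busy: true, queueDepth: 4, errors: 0 },
  { busy: false, queueDepth: 2, errors: 0 },
]);
console.log(usage);
```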

Alerting

Good Alerts

Actionable: Something can be done
Urgent: Needs immediate attention
Specific: Clear what's wrong

Alert Template

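alert: HighErrorRate
# Ratio of 5xx responses to all responses, so 0.01 means 1%.
expr: |
  sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
    / sum by (service) (rate(http_requests_total[5m])) > 0.01
for: 5m
labels:
  severity: critical
annotations:
  summary: "High error rate on {{ $labels.service }}"
  description: "Error rate is {{ $value | humanizePercentage }}"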
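The `for: 5m` clause means the condition has to hold across consecutive evaluations for five minutes before the alert fires; until then it is only pending. The sketch below replays that behavior in simplified form. The `Evaluation` type and `evaluateAlert` function are hypothetical, and the input is assumed to be an error ratio sampled at each evaluation.

```typescript
type AlertState = "inactive" | "pending" | "firing";

interface Evaluation {
  atSeconds: number; // evaluation timestamp
  errorRatio: number; // failed requests / all requests at that time
}

// Replay evaluations in time order and return the state after each one.
function evaluateAlert(
  evals: Evaluation[],
  threshold: number,
  forSeconds: number,
): AlertState[] {
  const states: AlertState[] = [];
  let breachedSince: number | null = null;

  for (const { atSeconds, errorRatio } of evals) {
    if (errorRatio <= threshold) {
      // Any healthy evaluation resets the timer.
      breachedSince = null;
      states.push("inactive");
      continue;
    }
    if (breachedSince === null) {
      breachedSince = atSeconds;
    }
    states.push(atSeconds - breachedSince >= forSeconds ? "firing" : "pending");
  }
  return states;
}

// One evaluation per minute; the ratio stays above 1% from t=60s to t=360s.
const ratios = [0.005, 0.02, 0.02, 0.02, 0.02, 0.02, 0.02, 0.005];
const states = evaluateAlert(
  ratios.map((errorRatio, i) => ({ atSeconds: i * 60, errorRatio })),
  0.01,
  300,
);
console.log(states);
```

The pending period is what keeps a brief spike from paging anyone, which supports the "Urgent" and "Actionable" criteria above.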

Dashboards

Essential panels:

Request rate
Error rate
Latency (P50, P95, P99)
Saturation (CPU, memory)
Active alerts
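The latency panel relies on percentiles, not an average. As a rough sketch, the nearest-rank method below shows how P50, P95, and P99 could be computed from raw samples; the `percentile` helper and the sample values are made up for illustration, and monitoring backends typically estimate percentiles from histograms instead.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("percentile requires at least one sample");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  // Multiply before dividing to avoid floating-point drift in the rank.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

// Hypothetical request latencies in milliseconds.
const latenciesMs = [120, 80, 95, 300, 110, 105, 90, 1000, 100, 115];

console.log({
  p50: percentile(latenciesMs, 50),
  p95: percentile(latenciesMs, 95),
  p99: percentile(latenciesMs, 99),
});
```

With these samples the median is 105ms while P95 and P99 are both 1000ms, which is the kind of tail a mean-only panel would hide.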
