Observability - Monitoring, Logging & Tracing
Production patterns for the three pillars: metrics, logs, and traces
When to Use This Skill
Use this skill when:
-
Setting up application monitoring
-
Implementing structured logging
-
Adding error tracking (Sentry, Bugsnag)
-
Configuring distributed tracing
-
Building health check endpoints
-
Creating alerting rules
Don't use this skill when:
-
Development/local debugging only
-
Using managed platforms with built-in observability
Critical Patterns
Pattern 1: Structured Logging
When: Creating queryable, parseable logs
// ✅ GOOD: Structured logging with pino import pino from 'pino';
const logger = pino({ level: process.env.LOG_LEVEL || 'info', formatters: { level: (label) => ({ level: label }), bindings: (bindings) => ({ pid: bindings.pid, hostname: bindings.hostname, service: process.env.SERVICE_NAME, }), }, timestamp: pino.stdTimeFunctions.isoTime, });
// Create request-scoped logger function createRequestLogger(requestId: string, userId?: string) { return logger.child({ requestId, userId, }); }
// Usage in handlers async function handleRequest(req: Request) { const log = createRequestLogger(req.id, req.user?.id);
log.info({ path: req.url, method: req.method }, 'Request started');
try { const result = await processRequest(req); log.info({ duration: Date.now() - req.startTime }, 'Request completed'); return result; } catch (error) { log.error({ error: error.message, stack: error.stack, duration: Date.now() - req.startTime, }, 'Request failed'); throw error; } }
// ❌ BAD: Unstructured console.log console.log('User ' + userId + ' did ' + action); // Can't query! console.log(error); // Object won't serialize properly
Pattern 2: Error Tracking
When: Capturing and analyzing production errors
// ✅ GOOD: Sentry integration import * as Sentry from '@sentry/node';
Sentry.init({ dsn: process.env.SENTRY_DSN, environment: process.env.NODE_ENV, release: process.env.GIT_SHA, tracesSampleRate: 0.1, // 10% of transactions integrations: [ new Sentry.Integrations.Http({ tracing: true }), new Sentry.Integrations.Prisma({ client: prisma }), ], beforeSend(event, hint) { // Filter out expected errors const error = hint.originalException; if (error instanceof AppError && error.isOperational) { return null; // Don't send operational errors } return event; }, });
// Capture with context function captureError(error: Error, context?: Record<string, any>) { Sentry.withScope((scope) => { if (context) { scope.setExtras(context); } if (context?.userId) { scope.setUser({ id: context.userId }); } Sentry.captureException(error); }); }
// Usage in error handler app.use((err, req, res, next) => { captureError(err, { requestId: req.id, userId: req.user?.id, path: req.path, method: req.method, body: req.body, });
res.status(500).json({ error: 'Internal server error' }); });
// ❌ BAD: No error tracking app.use((err, req, res, next) => { console.error(err); // Lost when container restarts! res.status(500).json({ error: err.message }); // Exposes internals });
Pattern 3: Health Checks
When: Monitoring application and dependency health
// ✅ GOOD: Comprehensive health check // app/api/health/route.ts import { NextResponse } from 'next/server';
interface HealthCheck { status: 'healthy' | 'degraded' | 'unhealthy'; checks: Record<string, CheckResult>; version: string; uptime: number; }
interface CheckResult { status: 'pass' | 'fail'; latency?: number; message?: string; }
async function checkDatabase(): Promise<CheckResult> {
const start = Date.now();
try {
await prisma.$queryRawSELECT 1;
return { status: 'pass', latency: Date.now() - start };
} catch (error) {
return { status: 'fail', message: error.message };
}
}
async function checkRedis(): Promise<CheckResult> { const start = Date.now(); try { await redis.ping(); return { status: 'pass', latency: Date.now() - start }; } catch (error) { return { status: 'fail', message: error.message }; } }
async function checkExternalAPI(): Promise<CheckResult> { const start = Date.now(); try { const res = await fetch(process.env.EXTERNAL_API_URL + '/health', { signal: AbortSignal.timeout(5000), }); return { status: res.ok ? 'pass' : 'fail', latency: Date.now() - start, }; } catch (error) { return { status: 'fail', message: error.message }; } }
export async function GET() { const checks = { database: await checkDatabase(), redis: await checkRedis(), externalApi: await checkExternalAPI(), };
const allPassing = Object.values(checks).every(c => c.status === 'pass'); const anyFailing = Object.values(checks).some(c => c.status === 'fail');
const health: HealthCheck = { status: allPassing ? 'healthy' : anyFailing ? 'unhealthy' : 'degraded', checks, version: process.env.GIT_SHA || 'unknown', uptime: process.uptime(), };
return NextResponse.json(health, { status: health.status === 'healthy' ? 200 : 503, }); }
// ❌ BAD: Simple health check that always passes export async function GET() { return NextResponse.json({ status: 'ok' }); // Doesn't check anything! }
Pattern 4: Distributed Tracing
When: Tracking requests across services
// ✅ GOOD: OpenTelemetry tracing import { NodeSDK } from '@opentelemetry/sdk-node'; import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'; import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({ traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT, }), instrumentations: [getNodeAutoInstrumentations()], serviceName: process.env.SERVICE_NAME, });
sdk.start();
// Manual span creation import { trace, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('my-service');
async function processOrder(orderId: string) { return tracer.startActiveSpan('processOrder', async (span) => { span.setAttribute('order.id', orderId);
try {
// Child spans are automatically linked
const order = await fetchOrder(orderId);
span.setAttribute('order.total', order.total);
await tracer.startActiveSpan('validateInventory', async (childSpan) => {
await validateInventory(order.items);
childSpan.end();
});
await tracer.startActiveSpan('processPayment', async (childSpan) => {
await processPayment(order);
childSpan.end();
});
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
span.recordException(error);
throw error;
} finally {
span.end();
}
}); }
// Propagate trace context to external services async function callExternalService(data: any) { const headers: Record<string, string> = {};
// Inject trace context into headers propagation.inject(context.active(), headers);
return fetch(EXTERNAL_SERVICE_URL, { method: 'POST', headers: { 'Content-Type': 'application/json', ...headers, // Includes traceparent header }, body: JSON.stringify(data), }); }
Pattern 5: Metrics Collection
When: Measuring application performance and business metrics
// ✅ GOOD: Prometheus metrics import { Registry, Counter, Histogram, Gauge } from 'prom-client';
const registry = new Registry();
// HTTP request metrics const httpRequestsTotal = new Counter({ name: 'http_requests_total', help: 'Total number of HTTP requests', labelNames: ['method', 'path', 'status'], registers: [registry], });
const httpRequestDuration = new Histogram({ name: 'http_request_duration_seconds', help: 'HTTP request duration in seconds', labelNames: ['method', 'path'], buckets: [0.01, 0.05, 0.1, 0.5, 1, 5], registers: [registry], });
// Business metrics const ordersTotal = new Counter({ name: 'orders_total', help: 'Total number of orders', labelNames: ['status'], registers: [registry], });
const activeUsers = new Gauge({ name: 'active_users', help: 'Number of currently active users', registers: [registry], });
// Middleware to track requests app.use((req, res, next) => { const start = Date.now();
res.on('finish', () => { const duration = (Date.now() - start) / 1000; const path = req.route?.path || req.path;
httpRequestsTotal.inc({
method: req.method,
path,
status: res.statusCode,
});
httpRequestDuration.observe(
{ method: req.method, path },
duration
);
});
next(); });
// Expose metrics endpoint app.get('/metrics', async (req, res) => { res.set('Content-Type', registry.contentType); res.send(await registry.metrics()); });
// Track business events async function createOrder(data: CreateOrderInput) { const order = await db.order.create({ data }); ordersTotal.inc({ status: 'created' }); return order; }
Code Examples
For complete, production-ready examples, see references/examples.md:
-
Request Logging Middleware
-
Dashboard Alerts Configuration (Prometheus)
-
Error Boundary with Reporting
-
Custom Metrics (Prometheus)
Anti-Patterns
Don't: Log Sensitive Data
// ❌ BAD: Logging passwords, tokens, PII logger.info({ user: { email, password } }, 'User login'); logger.info({ authorization: req.headers.authorization }, 'Request');
// ✅ GOOD: Redact sensitive fields logger.info({ userId: user.id, email: user.email }, 'User login'); logger.info({ hasAuth: !!req.headers.authorization }, 'Request');
Don't: Sample Everything at 100%
// ❌ BAD: Trace every request tracesSampleRate: 1.0, // Very expensive at scale!
// ✅ GOOD: Sample appropriately tracesSampleRate: 0.1, // 10% of transactions // Or use dynamic sampling based on endpoint
Don't: Ignore Alert Fatigue
// ❌ BAD: Alert on every error if (error) sendAlert(error); // Gets ignored due to noise
// ✅ GOOD: Alert on actionable thresholds // Only alert when error rate exceeds 5% for 5 minutes
Quick Reference
Pillar Tool Use Case
Logs Pino, Winston Structured application logs
Metrics Prometheus Request counts, latencies
Traces OpenTelemetry Distributed request flow
Errors Sentry Exception tracking
APM Datadog, New Relic Full observability suite
Resources
Related Skills:
-
error-handling: Exception patterns
-
performance: Optimization metrics
-
ci-cd: Deployment monitoring
Keywords
observability , monitoring , logging , tracing , metrics , sentry , prometheus , opentelemetry , health-check , alerting