observability & monitoring

Observability & Monitoring Skill

Comprehensive frameworks for implementing observability including structured logging, metrics, distributed tracing, and alerting.

Overview

Setting up application monitoring
Implementing structured logging
Adding metrics and dashboards
Configuring distributed tracing
Creating alerting rules
Debugging production issues

Three Pillars of Observability

References

Logging Patterns

See: references/logging-patterns.md

Key topics covered:

Correlation IDs for cross-service request tracking
Log sampling strategies for high-traffic systems
LogQL queries for Loki log aggregation
OrchestKit structlog configuration example

Metrics Collection

See: references/metrics-collection.md

Key topics covered:

Counter, Gauge, Histogram, Summary metric types
Cardinality management and limits
Custom business metrics (LLM tokens, cache hit rates)
LLM cost tracking with Prometheus

Distributed Tracing

See: references/distributed-tracing.md

Key topics covered:

OpenTelemetry setup and auto-instrumentation
Span relationships (parent/child, parallel)
Head-based and tail-based sampling strategies
Trace context propagation across services

Alerting and Dashboards

See: references/alerting-dashboards.md

Key topics covered:

Alert severity levels and response times
Alert grouping and inhibition rules
Escalation policies and runbook links
Golden Signals dashboard design
SLO/SLI definitions and error budgets

Quick Reference

Log Levels

Level Use Case

ERROR Unhandled exceptions, failed operations

WARN Deprecated API, retry attempts

INFO Business events, successful operations

DEBUG Development troubleshooting

RED Method (Rate, Errors, Duration)

Essential metrics for any service:

Rate - Requests per second
Errors - Failed requests per second
Duration - Request latency distribution

Prometheus Buckets

// HTTP request latency buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5]

// Database query latency buckets: [0.001, 0.01, 0.05, 0.1, 0.5, 1]

Key Alerts

Alert Condition Severity

ServiceDown up == 0 for 1m Critical

HighErrorRate 5xx > 5% for 5m Critical

HighLatency p95 > 2s for 5m High

LowCacheHitRate < 70% for 10m Medium

Health Checks (Kubernetes)

Probe Purpose Endpoint

Liveness Is app running? /health

Readiness Ready for traffic? /ready

Startup Finished starting? /startup

Observability Checklist

Implementation

JSON structured logging
Request correlation IDs
RED metrics (Rate, Errors, Duration)
Business metrics
Distributed tracing
Health check endpoints

Alerting

Service outage alerts
Error rate thresholds
Latency thresholds
Resource utilization alerts

Dashboards

Service overview
Error analysis
Performance metrics

Templates Reference

Template Purpose

structured-logging.ts

Winston logger with request middleware

prometheus-metrics.ts

HTTP, DB, cache metrics with middleware

opentelemetry-tracing.ts

Distributed tracing setup

alerting-rules.yml

Prometheus alerting rules

health-checks.ts

Liveness, readiness, startup probes

Langfuse Integration

For LLM observability, use Langfuse decorators:

from langfuse.decorators import observe, langfuse_context

@observe(name="analyze_content") async def analyze_content(url: str) -> AnalysisResult: langfuse_context.update_current_trace( name="content_analysis", user_id="system", metadata={"url": url} ) # ... workflow implementation

See examples/orchestkit-monitoring-dashboard.md for real-world examples.

Extended Thinking Triggers

Use Opus 4.6 adaptive thinking for:

Incident investigation - Correlating logs, metrics, traces
Alert tuning - Reducing noise, catching real issues
Architecture decisions - Choosing monitoring solutions
Performance debugging - Cross-service latency analysis

Related Skills

defense-in-depth
Layer 8 observability as part of security architecture
devops-deployment
Observability integration with CI/CD and Kubernetes
resilience-patterns
Monitoring circuit breakers and failure scenarios
fastapi-advanced
FastAPI-specific middleware for logging and metrics

Key Decisions

Decision Choice Rationale

Log format Structured JSON Machine-parseable, supports log aggregation, enables queries

Metric types RED method (Rate, Errors, Duration) Industry standard, covers essential service health indicators

Tracing OpenTelemetry Vendor-neutral, auto-instrumentation, broad ecosystem support

Alerting severity 4 levels (Critical, High, Medium, Low) Clear escalation paths, appropriate response times

Capability Details

structured-logging

Keywords: logging, structured log, json log, correlation id, log level, winston, pino, structlog Solves:

How do I set up structured logging?
Implement correlation IDs across services
JSON logging best practices

correlation-tracking

Keywords: correlation id, request tracking, trace context, distributed logs Solves:

How do I track requests across services?
Implement correlation IDs in middleware
Find all logs for a single request

log-sampling

Keywords: log sampling, high traffic logging, sampling rate, log volume Solves:

How do I reduce log volume in production?
Sample INFO logs while keeping all errors

prometheus-metrics

Keywords: metrics, prometheus, counter, histogram, gauge, summary, red method Solves:

How do I collect application metrics?
Implement RED method (Rate, Errors, Duration)
Choose between Counter, Gauge, Histogram

metric-types

Keywords: counter, gauge, histogram, summary, bucket, quantile Solves:

When to use Counter vs Gauge?
Histogram vs Summary for latency
Configure histogram buckets

cardinality-management

Keywords: cardinality, label explosion, time series, prometheus performance Solves:

How do I prevent label cardinality explosions?
Fix unbounded labels (user IDs, request IDs)

distributed-tracing

Keywords: tracing, distributed tracing, opentelemetry, span, trace id, waterfall Solves:

How do I implement distributed tracing?
OpenTelemetry setup with auto-instrumentation
Create manual spans for custom operations

trace-sampling

Keywords: trace sampling, head-based sampling, tail-based sampling Solves:

How do I reduce trace volume?
Sample 10% of traces but keep all errors

alerting-strategy

Keywords: alert, alerting, notification, threshold, pagerduty, slack, severity Solves:

How do I set up effective alerts?
Define alert severity levels (P1-P4)

alert-fatigue-prevention

Keywords: alert fatigue, alert grouping, inhibition, escalation Solves:

How do I reduce alert noise?
Group related alerts together

dashboards

Keywords: dashboard, visualization, grafana, golden signals, red method Solves:

How do I create monitoring dashboards?
Design Golden Signals dashboard layout
Build SLO/SLI dashboards

health-checks

Keywords: health check, liveness, readiness, startup probe, kubernetes Solves:

How do I implement health check endpoints?
Difference between liveness and readiness

langfuse-observability

Keywords: langfuse, llm observability, llm tracing, token usage, llm cost tracking Solves:

How do I monitor LLM calls with Langfuse?
Track LLM token usage and cost

llm-cost-tracking

Keywords: llm cost, token tracking, cost optimization, prometheus llm metrics Solves:

How do I track LLM costs with Prometheus?
Measure token usage by model and operation

observability & monitoring

Safety Notice

Copy this and send it to your AI assistant to learn

Source Transparency

Related Skills

responsive-patterns

domain-driven-design

dashboard-patterns

rag-retrieval