Distributed Tracing
Patterns and practices for implementing distributed tracing across microservices and understanding request flows in distributed systems.
When to Use This Skill
- Implementing distributed tracing in microservices
- Debugging cross-service request issues
- Understanding trace propagation
- Choosing tracing infrastructure
- Correlating logs, metrics, and traces
Why Distributed Tracing?
Problem: a request flows through multiple services. How do you debug when something fails?
Without tracing:

```
User → API → ??? → ??? → Error somewhere
```

With tracing:

```
User → API (50ms) → OrderService (20ms) → PaymentService (ERROR: timeout)
└── Full visibility into request flow
```
Core Concepts
Traces, Spans, and Context
```
Trace: End-to-end request journey
├── Span: Single operation within a service
│   ├── SpanID: Unique identifier
│   ├── ParentSpanID: Link to parent span
│   ├── TraceID: Shared across all spans
│   ├── Operation Name: What is being done
│   ├── Start/End Time: Duration
│   ├── Status: Success/Error
│   ├── Attributes: Key-value metadata
│   └── Events: Point-in-time annotations
└── Context: Propagated across service boundaries
    ├── TraceID
    ├── SpanID
    ├── Trace Flags
    └── Trace State
```
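The fields above can be sketched as a minimal data model. This is an illustrative pure-Python sketch, not the OpenTelemetry API; the class and method names are hypothetical.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Illustrative span record: every span in a trace shares the TraceID."""
    trace_id: str                      # shared across all spans in the trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_span_id: Optional[str] = None
    name: str = "unnamed"
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    status: str = "OK"
    attributes: dict = field(default_factory=dict)

    def child(self, name: str) -> "Span":
        # A child span inherits the TraceID and links back via ParentSpanID.
        return Span(trace_id=self.trace_id, parent_span_id=self.span_id, name=name)

# A root span starts a new trace: fresh 128-bit (32 hex chars) TraceID
root = Span(trace_id=secrets.token_hex(16), name="HTTP GET /api/orders")
db = root.child("db.query orders")
```

Note how the tree structure emerges purely from `trace_id` plus `parent_span_id`; backends reassemble the trace visualization from these links.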
Trace Visualization
```
TraceID: abc123

Service A (API Gateway)
├──────────────────────────────────────────────────────┤ 200ms
│
└─► Service B (Order Service)
    ├───────────────────────────────────┤ 150ms
    │
    ├─► Service C (Inventory)
    │   ├───────────────┤ 50ms
    │
    └─► Service D (Payment)
        ├───────────────────────┤ 80ms
        │
        └─► External API
            ├─────────┤ 60ms
```
OpenTelemetry
Overview
OpenTelemetry = Unified observability framework
Components:

```
┌─────────────────────────────────────────────────────┐
│                    Application                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │     SDK     │  │   Tracer    │  │    Meter    │  │
│  │             │  │  Provider   │  │  Provider   │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────┘
        │               │               │
        └───────────────┼───────────────┘
                        ▼
          ┌─────────────────────────┐
          │      OTLP Exporter      │
          └─────────────────────────┘
                        │
                        ▼
          ┌─────────────────────────┐
          │        Collector        │
          │       (Optional)        │
          └─────────────────────────┘
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
   ┌─────────┐     ┌─────────┐     ┌─────────┐
   │ Jaeger  │     │ Zipkin  │     │  Tempo  │
   └─────────┘     └─────────┘     └─────────┘
```
Trace Context Propagation
HTTP headers (W3C Trace Context):

```
traceparent: 00-{trace-id}-{span-id}-{flags}
tracestate: vendor1=value1,vendor2=value2
```

Example:

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             │  │                                │                │
             │  │                                │                └─ flags (01 = sampled)
             │  │                                └─ parent span id
             │  └─ trace id (128-bit)
             └─ version
```
Propagation across services:

```
┌─────────────┐                       ┌─────────────┐
│  Service A  │ ─── HTTP ───────────► │  Service B  │
│             │  traceparent: 00-...  │             │
│ Create Span │                       │ Extract     │
│ Inject      │                       │ Create Span │
└─────────────┘                       └─────────────┘
```
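The inject/extract round trip can be sketched in a few lines of plain Python against the `traceparent` format shown above. This is a hand-rolled illustration, not the OpenTelemetry propagator API; `inject` and `extract` are hypothetical names.

```python
import secrets

def inject(trace_id: str, span_id: str, sampled: bool, headers: dict) -> None:
    """Service A: write the current span context into outgoing HTTP headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"

def extract(headers: dict):
    """Service B: parse traceparent into (trace_id, parent_span_id, sampled)."""
    version, trace_id, span_id, flags = headers["traceparent"].split("-")
    assert version == "00" and len(trace_id) == 32 and len(span_id) == 16
    return trace_id, span_id, flags == "01"

# Service A injects before the outgoing call...
headers = {}
inject("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331", True, headers)

# ...Service B extracts and starts a child span under the SAME trace id
trace_id, parent_span_id, sampled = extract(headers)
child_span_id = secrets.token_hex(8)  # new 64-bit span id, same trace_id
```

In practice a propagator library does this for you on every HTTP/gRPC hop; the point is that only the small context (TraceID, SpanID, flags) travels on the wire, not the spans themselves.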
Span Attributes
Semantic conventions (standard attributes):
HTTP:
- http.method: GET, POST, etc.
- http.url: Full URL
- http.status_code: 200, 404, 500
- http.route: /users/{id}
Database:
- db.system: postgresql, mysql
- db.statement: SELECT * FROM...
- db.operation: query, insert
RPC:
- rpc.system: grpc
- rpc.service: OrderService
- rpc.method: CreateOrder
Custom:
- user.id: 12345
- order.total: 99.99
- feature.flag: experiment_v2
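A small helper can centralize these convention keys so every service emits them consistently. A minimal sketch, assuming the older `http.*` attribute names listed above; the helper itself is hypothetical.

```python
def http_server_attributes(method: str, url: str, status_code: int, route: str) -> dict:
    """Build span attributes using the semantic-convention keys listed above."""
    return {
        "http.method": method,
        "http.url": url,
        "http.status_code": status_code,
        # Use the templated route, not the concrete URL, to keep cardinality low
        "http.route": route,
    }

attrs = http_server_attributes(
    "GET", "https://api.example.com/users/42", 200, "/users/{id}"
)
```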
Tracing Backends
Jaeger
Features:
- Open source (CNCF)
- Built-in UI
- Multiple storage backends
- OpenTelemetry native
Architecture:

```
┌─────────────┐   ┌─────────────┐   ┌────────────────┐
│    Agent    │──►│  Collector  │──►│    Storage     │
│ (optional)  │   │             │   │  (Cassandra/   │
└─────────────┘   └─────────────┘   │ Elasticsearch) │
                                    └────────────────┘
                                            │
                                            ▼
                                    ┌─────────────┐
                                    │    Query    │
                                    │   Service   │
                                    └─────────────┘
                                            │
                                            ▼
                                    ┌─────────────┐
                                    │     UI      │
                                    └─────────────┘
```
Zipkin
Features:
- Mature, battle-tested
- Simple architecture
- Low resource overhead
- Good ecosystem support
Best for:
- Simpler setups
- Lower resource environments
- Teams familiar with Zipkin
Grafana Tempo
Features:
- Object storage backend (cheap)
- Deep Grafana integration
- Log-based trace discovery
- Exemplars support
Best for:
- Grafana-heavy environments
- Cost-sensitive deployments
- Large-scale traces
Cloud Native Options
| Provider | Service              | Integration               |
|----------|----------------------|---------------------------|
| AWS      | X-Ray                | Native AWS services       |
| GCP      | Cloud Trace          | Native GCP services       |
| Azure    | Application Insights | Native Azure services     |
| Datadog  | APM                  | Full-stack observability  |
Sampling Strategies
Why Sample?
High-traffic systems generate millions of spans. Storing all spans is expensive and often unnecessary.
Sampling: Collect a subset of traces
Goal: Keep enough data to debug issues while managing costs
Sampling Types
- Head-based sampling (at trace start):
  - Decision made when the trace begins
  - Consistent across services
  - Simple, but may miss rare events
- Tail-based sampling (after the trace completes):
  - Decision made after seeing the full trace
  - Can keep interesting traces (errors, slow requests)
  - Requires buffering spans
  - More complex infrastructure
- Priority sampling:
  - Assign priority based on attributes
  - Keep all errors, sample normal traffic
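Head-based sampling stays consistent across services when each one derives the same decision from the TraceID alone, similar in spirit to OpenTelemetry's ratio-based sampler. A minimal sketch, not the real sampler API:

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head-based sampling from the low 64 bits of the TraceID.

    Every service computes the same answer from the same TraceID, so the
    sampling decision is consistent across the whole trace without any
    coordination.
    """
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound

# Same trace id → same decision on every service that sees it
tid = "0af7651916cd43dd8448eb211c80319c"
decision_service_a = head_sample(tid, 0.10)
decision_service_b = head_sample(tid, 0.10)
```

In practice the root service makes the decision and downstream services honor the sampled flag in `traceparent`, but the deterministic form also protects against a missing or stripped flag.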
Example Sampling Policies
Rate-based:
- Sample 10% of all traces
- Simple, predictable cost
Priority-based:
- 100% of errors
- 100% of slow requests (>1s)
- 5% of normal requests
Adaptive:
- Adjust rate based on traffic
- Target specific traces/second
- Handle traffic spikes
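The priority-based policy above reduces to one small decision function. This is an illustrative sketch of the rule "100% of errors, 100% of slow requests, 5% of the rest", not a production sampler:

```python
import random
from typing import Optional

def sampling_decision(status: str, duration_s: float, base_rate: float = 0.05,
                      rng: Optional[random.Random] = None) -> bool:
    """Priority-based sampling: keep everything interesting, sample the rest."""
    rng = rng or random.Random()
    if status == "ERROR":
        return True                    # 100% of errors
    if duration_s > 1.0:
        return True                    # 100% of slow requests (>1s)
    return rng.random() < base_rate    # ~5% of normal requests

kept_error = sampling_decision("ERROR", 0.1)
kept_slow = sampling_decision("OK", 2.0)
```

Note that checking `duration_s` means the decision needs the finished span, so this rule really belongs in tail-based sampling (e.g. in a collector), while the error check alone could run head-based.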
Correlation Patterns
Logs-Traces-Metrics
Three Pillars of Observability:
```
Logs ◄──────────► Traces ◄──────────► Metrics
  │                  │                   │
  │ trace_id         │ exemplars         │
  │ span_id          │                   │
  └──────────────────┴───────────────────┘
```
Correlation:
- Add trace_id/span_id to log entries
- Add exemplars (trace links) to metrics
- Click from metric → trace → logs
Log Correlation
Structured log with trace context:
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "message": "Payment failed",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "service": "payment-service",
  "user_id": "12345",
  "error": "Card declined"
}
```

Query in the log aggregator: `trace_id:"abc123def456"` → see all logs for this request.
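With Python's standard `logging` module, a filter can stamp the trace context onto every record so the formatter can emit it. A minimal sketch: here the ids are passed in explicitly, whereas a real setup would read them from the active span context.

```python
import io
import json
import logging

class TraceContextFilter(logging.Filter):
    """Attach trace/span ids to every log record emitted by this logger."""
    def __init__(self, trace_id: str, span_id: str):
        super().__init__()
        self.trace_id, self.span_id = trace_id, span_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True  # never drop records, only enrich them

# Write JSON-shaped log lines to an in-memory buffer for demonstration
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "message": "%(message)s", '
    '"trace_id": "%(trace_id)s", "span_id": "%(span_id)s"}'))

logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter("abc123def456", "789xyz"))
logger.error("Payment failed")

entry = json.loads(buf.getvalue())
```

Every line this logger writes now carries `trace_id`, so the aggregator query shown above finds it.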
Exemplars (Metrics to Traces)
Metric with exemplar:

```
http_request_duration{service="api"} = 2.5s
└── exemplar: trace_id=abc123
```
When latency spikes:
- See metric spike in dashboard
- Click on data point
- Jump directly to slow trace
- See exactly what caused latency
Instrumentation Patterns
Automatic Instrumentation
Zero-code instrumentation:
- HTTP clients/servers
- Database clients
- Message queues
- gRPC
Pros: Easy, comprehensive
Cons: Less control, more noise
Manual Instrumentation
Add spans for business logic:
```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

# order_id, items, order, and process() come from the surrounding code
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.items", len(items))

    result = process(order)

    if result.error:
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(result.error)
```
Pros: Precise, business-relevant
Cons: More code, maintenance
Hybrid Approach (Recommended)
- Auto-instrument infrastructure:
  - HTTP, database, and queue calls
- Manually instrument business logic:
  - Key operations
  - Business metrics
  - Error context
Best Practices
Span Design
Good span names:

- `HTTP GET /api/orders/{id}`
- `ProcessPayment`
- `db.query users`

Bad span names:

- `Handler` (too generic)
- `/api/orders/12345` (cardinality explosion)
- `doStuff` (meaningless)
Attribute Guidelines
Do:
- Use semantic conventions
- Add business context (user_id, order_id)
- Keep cardinality low
- Include error details
Don't:
- Add PII (personally identifiable info)
- Use high-cardinality values as attributes
- Add large payloads
- Include sensitive data
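These guidelines can be enforced centrally with a small sanitizer applied before attributes reach a span. A minimal sketch; the denylist keys and size limit are illustrative assumptions, not a standard:

```python
MAX_VALUE_LEN = 256
# Hypothetical PII denylist -- each team would maintain its own
BLOCKED_KEYS = {"email", "ssn", "password", "credit_card"}

def sanitize_attributes(attrs: dict) -> dict:
    """Drop PII keys and truncate oversized values before attaching to a span."""
    clean = {}
    for key, value in attrs.items():
        if key in BLOCKED_KEYS:
            continue  # never record PII / sensitive data
        text = str(value)
        if len(text) > MAX_VALUE_LEN:
            text = text[:MAX_VALUE_LEN] + "...[truncated]"  # no large payloads
        clean[key] = text
    return clean

clean = sanitize_attributes({
    "user.id": 12345,
    "email": "user@example.com",   # dropped
    "payload": "x" * 5000,         # truncated
})
```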
Performance Considerations
- Use async span export
- Sample appropriately
- Limit attribute count
- Use span processor batching
- Consider span limits
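Batching is the key lever: export spans in groups rather than one network call each. The sketch below shows only the buffering core; the real OpenTelemetry `BatchSpanProcessor` also flushes on a timer from a background thread, which this illustration omits.

```python
from typing import Callable, List

class BatchingSpanProcessor:
    """Minimal batching sketch: buffer finished spans, export in batches."""

    def __init__(self, export: Callable[[List[dict]], None], max_batch: int = 512):
        self.export = export
        self.max_batch = max_batch
        self.buffer: List[dict] = []

    def on_end(self, span: dict) -> None:
        """Called once per finished span; exports when the buffer fills."""
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch:
            self.force_flush()

    def force_flush(self) -> None:
        """Export whatever is buffered (call this on shutdown)."""
        if self.buffer:
            self.export(self.buffer)
            self.buffer = []

# 7 spans with max_batch=3 → two full batches plus a final partial flush
batches: List[List[dict]] = []
proc = BatchingSpanProcessor(batches.append, max_batch=3)
for i in range(7):
    proc.on_end({"name": f"span-{i}"})
proc.force_flush()
```

One export call per batch instead of per span is what keeps tracing overhead off the request path, together with running the export asynchronously.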
Troubleshooting with Traces
Common Patterns
Finding slow requests:

- Query traces by duration > threshold
- Identify slow spans
- Check span attributes for context

Finding errors:

- Query traces by status = ERROR
- See the error span and its context
- Check exception details

Finding dependencies:

- View the service map built from traces
- Identify critical paths
- Find hidden dependencies
Related Skills
- observability-patterns: Three pillars overview
- slo-sli-error-budget: Using traces for SLIs
- incident-response: Using traces in incidents