Distributed Tracing
Patterns and practices for implementing distributed tracing across microservices and understanding request flows in distributed systems.
When to Use This Skill
- Implementing distributed tracing in microservices
- Debugging cross-service request issues
- Understanding trace propagation
- Choosing tracing infrastructure
- Correlating logs, metrics, and traces
Why Distributed Tracing?
Problem: a request flows through multiple services. How do you debug when something fails?
Without tracing:

```
User → API → ??? → ??? → Error somewhere
```

With tracing:

```
User → API (50ms) → OrderService (20ms) → PaymentService (ERROR: timeout)
└── Full visibility into request flow
```
Core Concepts
Traces, Spans, and Context
```
Trace: End-to-end request journey
├── Span: Single operation within a service
│   ├── SpanID: Unique identifier
│   ├── ParentSpanID: Link to parent span
│   ├── TraceID: Shared across all spans
│   ├── Operation Name: What is being done
│   ├── Start/End Time: Duration
│   ├── Status: Success/Error
│   ├── Attributes: Key-value metadata
│   └── Events: Point-in-time annotations
└── Context: Propagated across service boundaries
    ├── TraceID
    ├── SpanID
    ├── Trace Flags
    └── Trace State
```
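The fields above can be sketched as a minimal data model. This is an illustrative pure-Python sketch, not the OpenTelemetry API; the class and method names are hypothetical.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    """Illustrative span record: every span in a trace shares the TraceID."""
    trace_id: str                      # shared across all spans in the trace
    span_id: str = field(default_factory=lambda: secrets.token_hex(8))
    parent_span_id: Optional[str] = None
    name: str = "unnamed"
    start_time: float = field(default_factory=time.time)
    end_time: Optional[float] = None
    status: str = "OK"
    attributes: dict = field(default_factory=dict)

    def child(self, name: str) -> "Span":
        # A child span inherits the TraceID and links back via ParentSpanID.
        return Span(trace_id=self.trace_id, parent_span_id=self.span_id, name=name)

# A root span starts a new trace: fresh 128-bit (32 hex chars) TraceID
root = Span(trace_id=secrets.token_hex(16), name="HTTP GET /api/orders")
db = root.child("db.query orders")
```

Note how the tree structure emerges purely from `trace_id` plus `parent_span_id`; backends reassemble the trace visualization from these links.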
Trace Visualization
```
TraceID: abc123

Service A (API Gateway)
├──────────────────────────────────────────────────────┤ 200ms
│
└─► Service B (Order Service)
    ├───────────────────────────────────┤ 150ms
    │
    ├─► Service C (Inventory)
    │   ├───────────────┤ 50ms
    │
    └─► Service D (Payment)
        ├───────────────────────┤ 80ms
        │
        └─► External API
            ├─────────┤ 60ms
```
OpenTelemetry
Overview
OpenTelemetry = Unified observability framework
Components:

```
┌─────────────────────────────────────────────────────┐
│                    Application                      │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  │
│  │     SDK     │  │   Tracer    │  │    Meter    │  │
│  │             │  │  Provider   │  │  Provider   │  │
│  └─────────────┘  └─────────────┘  └─────────────┘  │
└─────────────────────────────────────────────────────┘
        │               │               │
        └───────────────┼───────────────┘
                        ▼
          ┌─────────────────────────┐
          │      OTLP Exporter      │
          └─────────────────────────┘
                        │
                        ▼
          ┌─────────────────────────┐
          │        Collector        │
          │       (Optional)        │
          └─────────────────────────┘
                        │
        ┌───────────────┼───────────────┐
        ▼               ▼               ▼
   ┌─────────┐     ┌─────────┐     ┌─────────┐
   │ Jaeger  │     │ Zipkin  │     │  Tempo  │
   └─────────┘     └─────────┘     └─────────┘
```
Trace Context Propagation
HTTP headers (W3C Trace Context):

```
traceparent: 00-{trace-id}-{span-id}-{flags}
tracestate: vendor1=value1,vendor2=value2
```

Example:

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             │  │                                │                │
             │  │                                │                └─ flags (01 = sampled)
             │  │                                └─ parent span id
             │  └─ trace id (128-bit)
             └─ version
```
Propagation across services:

```
┌─────────────┐                       ┌─────────────┐
│  Service A  │ ─── HTTP ───────────► │  Service B  │
│             │  traceparent: 00-...  │             │
│ Create Span │                       │ Extract     │
│ Inject      │                       │ Create Span │
└─────────────┘                       └─────────────┘
```
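The inject/extract round trip can be sketched in a few lines of plain Python against the `traceparent` format shown above. This is a hand-rolled illustration, not the OpenTelemetry propagator API; `inject` and `extract` are hypothetical names.

```python
import secrets

def inject(trace_id: str, span_id: str, sampled: bool, headers: dict) -> None:
    """Service A: write the current span context into outgoing HTTP headers."""
    flags = "01" if sampled else "00"
    headers["traceparent"] = f"00-{trace_id}-{span_id}-{flags}"

def extract(headers: dict):
    """Service B: parse traceparent into (trace_id, parent_span_id, sampled)."""
    version, trace_id, span_id, flags = headers["traceparent"].split("-")
    assert version == "00" and len(trace_id) == 32 and len(span_id) == 16
    return trace_id, span_id, flags == "01"

# Service A injects before the outgoing call...
headers = {}
inject("0af7651916cd43dd8448eb211c80319c", "b7ad6b7169203331", True, headers)

# ...Service B extracts and starts a child span under the SAME trace id
trace_id, parent_span_id, sampled = extract(headers)
child_span_id = secrets.token_hex(8)  # new 64-bit span id, same trace_id
```

In practice a propagator library does this for you on every HTTP/gRPC hop; the point is that only the small context (TraceID, SpanID, flags) travels on the wire, not the spans themselves.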
Span Attributes
Semantic conventions (standard attributes):
HTTP:
- http.method: GET, POST, etc.
- http.url: Full URL
- http.status_code: 200, 404, 500
- http.route: /users/{id}
Database:
- db.system: postgresql, mysql
- db.statement: SELECT * FROM...
- db.operation: query, insert
RPC:
- rpc.system: grpc
- rpc.service: OrderService
- rpc.method: CreateOrder
Custom:
- user.id: 12345
- order.total: 99.99
- feature.flag: experiment_v2
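A small helper can centralize these convention keys so every service emits them consistently. A minimal sketch, assuming the older `http.*` attribute names listed above; the helper itself is hypothetical.

```python
def http_server_attributes(method: str, url: str, status_code: int, route: str) -> dict:
    """Build span attributes using the semantic-convention keys listed above."""
    return {
        "http.method": method,
        "http.url": url,
        "http.status_code": status_code,
        # Use the templated route, not the concrete URL, to keep cardinality low
        "http.route": route,
    }

attrs = http_server_attributes(
    "GET", "https://api.example.com/users/42", 200, "/users/{id}"
)
```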
Tracing Backends
Jaeger
Features:
- Open source (CNCF)
- Built-in UI
- Multiple storage backends
- OpenTelemetry native
Architecture:

```
┌─────────────┐   ┌─────────────┐   ┌────────────────┐
│    Agent    │──►│  Collector  │──►│    Storage     │
│ (optional)  │   │             │   │  (Cassandra/   │
└─────────────┘   └─────────────┘   │ Elasticsearch) │
                                    └────────────────┘
                                            │
                                            ▼
                                    ┌─────────────┐
                                    │    Query    │
                                    │   Service   │
                                    └─────────────┘
                                            │
                                            ▼
                                    ┌─────────────┐
                                    │     UI      │
                                    └─────────────┘
```
Zipkin
Features:
- Mature, battle-tested
- Simple architecture
- Low resource overhead
- Good ecosystem support
Best for:
- Simpler setups
- Lower resource environments
- Teams familiar with Zipkin
Grafana Tempo
Features:
- Object storage backend (cheap)
- Deep Grafana integration
- Log-based trace discovery
- Exemplars support
Best for:
- Grafana-heavy environments
- Cost-sensitive deployments
- Large-scale traces
Cloud Native Options
| Provider | Service              | Integration               |
|----------|----------------------|---------------------------|
| AWS      | X-Ray                | Native AWS services       |
| GCP      | Cloud Trace          | Native GCP services       |
| Azure    | Application Insights | Native Azure services     |
| Datadog  | APM                  | Full-stack observability  |
Sampling Strategies
Why Sample?
High-traffic systems generate millions of spans. Storing all spans is expensive and often unnecessary.
Sampling: Collect a subset of traces
Goal: Keep enough data to debug issues while managing costs
Sampling Types
- Head-based sampling (at trace start):
  - Decision made when the trace begins
  - Consistent across services
  - Simple, but may miss rare events
- Tail-based sampling (after the trace completes):
  - Decision made after seeing the full trace
  - Can keep interesting traces (errors, slow requests)
  - Requires buffering spans
  - More complex infrastructure
- Priority sampling:
  - Assign priority based on attributes
  - Keep all errors, sample normal traffic
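Head-based sampling stays consistent across services when each one derives the same decision from the TraceID alone, similar in spirit to OpenTelemetry's ratio-based sampler. A minimal sketch, not the real sampler API:

```python
def head_sample(trace_id: str, ratio: float) -> bool:
    """Deterministic head-based sampling from the low 64 bits of the TraceID.

    Every service computes the same answer from the same TraceID, so the
    sampling decision is consistent across the whole trace without any
    coordination.
    """
    bound = int(ratio * (1 << 64))
    return int(trace_id[-16:], 16) < bound

# Same trace id → same decision on every service that sees it
tid = "0af7651916cd43dd8448eb211c80319c"
decision_service_a = head_sample(tid, 0.10)
decision_service_b = head_sample(tid, 0.10)
```

In practice the root service makes the decision and downstream services honor the sampled flag in `traceparent`, but the deterministic form also protects against a missing or stripped flag.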
Example Sampling Policies
Rate-based:
- Sample 10% of all traces
- Simple, predictable cost
Priority-based:
- 100% of errors
- 100% of slow requests (>1s)
- 5% of normal requests
Adaptive:
- Adjust rate based on traffic
- Target specific traces/second
- Handle traffic spikes
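The priority-based policy above reduces to one small decision function. This is an illustrative sketch of the rule "100% of errors, 100% of slow requests, 5% of the rest", not a production sampler:

```python
import random
from typing import Optional

def sampling_decision(status: str, duration_s: float, base_rate: float = 0.05,
                      rng: Optional[random.Random] = None) -> bool:
    """Priority-based sampling: keep everything interesting, sample the rest."""
    rng = rng or random.Random()
    if status == "ERROR":
        return True                    # 100% of errors
    if duration_s > 1.0:
        return True                    # 100% of slow requests (>1s)
    return rng.random() < base_rate    # ~5% of normal requests

kept_error = sampling_decision("ERROR", 0.1)
kept_slow = sampling_decision("OK", 2.0)
```

Note that checking `duration_s` means the decision needs the finished span, so this rule really belongs in tail-based sampling (e.g. in a collector), while the error check alone could run head-based.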
Correlation Patterns
Logs-Traces-Metrics
Three Pillars of Observability:
```
Logs ◄──────────► Traces ◄──────────► Metrics
  │                  │                   │
  │ trace_id         │ exemplars         │
  │ span_id          │                   │
  └──────────────────┴───────────────────┘
```
Correlation:
- Add trace_id/span_id to log entries
- Add exemplars (trace links) to metrics
- Click from metric → trace → logs
Log Correlation
Structured log with trace context:
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "ERROR",
  "message": "Payment failed",
  "trace_id": "abc123def456",
  "span_id": "789xyz",
  "service": "payment-service",
  "user_id": "12345",
  "error": "Card declined"
}
```

Query in the log aggregator: `trace_id:"abc123def456"` → see all logs for this request.
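With Python's standard `logging` module, a filter can stamp the trace context onto every record so the formatter can emit it. A minimal sketch: here the ids are passed in explicitly, whereas a real setup would read them from the active span context.

```python
import io
import json
import logging

class TraceContextFilter(logging.Filter):
    """Attach trace/span ids to every log record emitted by this logger."""
    def __init__(self, trace_id: str, span_id: str):
        super().__init__()
        self.trace_id, self.span_id = trace_id, span_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = self.trace_id
        record.span_id = self.span_id
        return True  # never drop records, only enrich them

# Write JSON-shaped log lines to an in-memory buffer for demonstration
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter(
    '{"level": "%(levelname)s", "message": "%(message)s", '
    '"trace_id": "%(trace_id)s", "span_id": "%(span_id)s"}'))

logger = logging.getLogger("payment-service")
logger.addHandler(handler)
logger.addFilter(TraceContextFilter("abc123def456", "789xyz"))
logger.error("Payment failed")

entry = json.loads(buf.getvalue())
```

Every line this logger writes now carries `trace_id`, so the aggregator query shown above finds it.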
Exemplars (Metrics to Traces)
Metric with exemplar:

```
http_request_duration{service="api"} = 2.5s
└── exemplar: trace_id=abc123
```
When latency spikes:
- See metric spike in dashboard
- Click on data point
- Jump directly to slow trace
- See exactly what caused latency
Instrumentation Patterns
Automatic Instrumentation
Zero-code instrumentation:
- HTTP clients/servers
- Database clients
- Message queues
- gRPC
Pros: Easy, comprehensive
Cons: Less control, more noise
Manual Instrumentation
Add spans for business logic:
```python
from opentelemetry import trace
from opentelemetry.trace import Status, StatusCode

tracer = trace.get_tracer(__name__)

# order_id, items, order, and process() come from the surrounding code
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.items", len(items))

    result = process(order)

    if result.error:
        span.set_status(Status(StatusCode.ERROR))
        span.record_exception(result.error)
```
Pros: Precise, business-relevant
Cons: More code, maintenance
Hybrid Approach (Recommended)
- Auto-instrument infrastructure:
  - HTTP, database, and queue calls
- Manually instrument business logic:
  - Key operations
  - Business metrics
  - Error context
Best Practices
Span Design
Good span names:

- `HTTP GET /api/orders/{id}`
- `ProcessPayment`
- `db.query users`

Bad span names:

- `Handler` (too generic)
- `/api/orders/12345` (cardinality explosion)
- `doStuff` (meaningless)
Attribute Guidelines
Do:
- Use semantic conventions
- Add business context (user_id, order_id)
- Keep cardinality low
- Include error details
Don't:
- Add PII (personally identifiable info)
- Use high-cardinality values as attributes
- Add large payloads
- Include sensitive data
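These guidelines can be enforced centrally with a small sanitizer applied before attributes reach a span. A minimal sketch; the denylist keys and size limit are illustrative assumptions, not a standard:

```python
MAX_VALUE_LEN = 256
# Hypothetical PII denylist -- each team would maintain its own
BLOCKED_KEYS = {"email", "ssn", "password", "credit_card"}

def sanitize_attributes(attrs: dict) -> dict:
    """Drop PII keys and truncate oversized values before attaching to a span."""
    clean = {}
    for key, value in attrs.items():
        if key in BLOCKED_KEYS:
            continue  # never record PII / sensitive data
        text = str(value)
        if len(text) > MAX_VALUE_LEN:
            text = text[:MAX_VALUE_LEN] + "...[truncated]"  # no large payloads
        clean[key] = text
    return clean

clean = sanitize_attributes({
    "user.id": 12345,
    "email": "user@example.com",   # dropped
    "payload": "x" * 5000,         # truncated
})
```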
Performance Considerations
- Use async span export
- Sample appropriately
- Limit attribute count
- Use span processor batching
- Consider span limits
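Batching is the key lever: export spans in groups rather than one network call each. The sketch below shows only the buffering core; the real OpenTelemetry `BatchSpanProcessor` also flushes on a timer from a background thread, which this illustration omits.

```python
from typing import Callable, List

class BatchingSpanProcessor:
    """Minimal batching sketch: buffer finished spans, export in batches."""

    def __init__(self, export: Callable[[List[dict]], None], max_batch: int = 512):
        self.export = export
        self.max_batch = max_batch
        self.buffer: List[dict] = []

    def on_end(self, span: dict) -> None:
        """Called once per finished span; exports when the buffer fills."""
        self.buffer.append(span)
        if len(self.buffer) >= self.max_batch:
            self.force_flush()

    def force_flush(self) -> None:
        """Export whatever is buffered (call this on shutdown)."""
        if self.buffer:
            self.export(self.buffer)
            self.buffer = []

# 7 spans with max_batch=3 → two full batches plus a final partial flush
batches: List[List[dict]] = []
proc = BatchingSpanProcessor(batches.append, max_batch=3)
for i in range(7):
    proc.on_end({"name": f"span-{i}"})
proc.force_flush()
```

One export call per batch instead of per span is what keeps tracing overhead off the request path, together with running the export asynchronously.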
Troubleshooting with Traces
Common Patterns
Finding slow requests:

- Query traces by duration > threshold
- Identify slow spans
- Check span attributes for context

Finding errors:

- Query traces by status = ERROR
- See the error span and its context
- Check exception details

Finding dependencies:

- View the service map built from traces
- Identify critical paths
- Find hidden dependencies
Related Skills
- observability-patterns: Three pillars overview
- slo-sli-error-budget: Using traces for SLIs
- incident-response: Using traces in incidents