Instrumentation Planning
Strategic planning for application instrumentation before implementation.
When to Use This Skill
-
Planning instrumentation for new services
-
Reviewing instrumentation strategy
-
Establishing naming conventions
-
Managing telemetry cardinality
-
Setting instrumentation budgets
Instrumentation Strategy Framework
What to Instrument
Instrumentation Layers:
┌─────────────────────────────────────────────────────────────────┐ │ Layer 1: Automatic/Library Instrumentation │ │ - HTTP clients/servers (auto-captured) │ │ - Database clients (auto-captured) │ │ - Message queue clients (auto-captured) │ │ - Framework-provided metrics │ │ Effort: Low | Coverage: Broad | Customization: Limited │ ├─────────────────────────────────────────────────────────────────┤ │ Layer 2: Business Transaction Instrumentation │ │ - Key user journeys │ │ - Business operations (checkout, signup, etc.) │ │ - Revenue-generating flows │ │ - SLA-bound operations │ │ Effort: Medium | Coverage: Targeted | Value: High │ ├─────────────────────────────────────────────────────────────────┤ │ Layer 3: Debug/Diagnostic Instrumentation │ │ - Algorithmic hot paths │ │ - Cache behavior │ │ - Circuit breaker states │ │ - Retry/fallback paths │ │ Effort: Medium | Coverage: Deep | Use: Troubleshooting │ ├─────────────────────────────────────────────────────────────────┤ │ Layer 4: Business Metrics │ │ - Domain-specific counters │ │ - Conversion rates │ │ - Feature usage │ │ - Customer behavior │ │ Effort: High | Coverage: Custom | Value: Business Insights │ └─────────────────────────────────────────────────────────────────┘
Instrumentation Decision Matrix
instrumentation_decisions: always_instrument: - "Inbound HTTP/gRPC requests" - "Outbound HTTP/gRPC calls" - "Database queries" - "Message publish/consume" - "Authentication/authorization" - "External API calls" - "Cache operations"
consider_instrumenting: - "Complex business logic" - "Feature flags evaluation" - "Background jobs" - "Scheduled tasks" - "File I/O operations" - "CPU-intensive operations"
avoid_instrumenting: - "Every method call (too noisy)" - "Tight loops (performance impact)" - "Data transformation (low value)" - "Validation helpers" - "Utility functions"
decision_criteria: business_value: weight: 0.3 question: "Does this help understand business outcomes?"
debugging_value:
weight: 0.25
question: "Does this help diagnose production issues?"
slo_relevance:
weight: 0.25
question: "Does this contribute to SLI measurement?"
cost_impact:
weight: 0.2
question: "Is the cardinality/volume acceptable?"
Naming Conventions
Metric Naming
metric_naming: format: "[namespace][subsystem][name]_[unit]"
rules: case: "snake_case" unit_suffix: "Always include (_seconds, _bytes, _total)" base_units: "Use base units (seconds not milliseconds)" counter_suffix: "_total for counters"
examples: good: - "http_server_requests_total" - "http_server_request_duration_seconds" - "http_server_response_size_bytes" - "db_connections_current" - "order_processing_duration_seconds" - "payment_transactions_total"
bad:
- "requests (no unit, no namespace)"
- "HttpRequestDuration (wrong case)"
- "order_latency_ms (use base units)"
- "totalOrders (camelCase, no unit)"
label_naming: case: "snake_case" avoid: - "Embedded values in name (path=/users)" - "High cardinality labels" good_labels: - "method, status_code, path" - "service, version, environment" bad_labels: - "user_id (high cardinality)" - "request_id (high cardinality)" - "timestamp (not a dimension)"
Span Naming
span_naming: format: "[operation] [resource]"
rules: - "Use verb + noun pattern" - "Keep names low cardinality" - "Include operation type, not specific values" - "Be consistent across services"
examples: http: pattern: "HTTP {METHOD} {route_template}" good: "HTTP GET /users/{id}" bad: "HTTP GET /users/12345"
database:
pattern: "{operation} {table}"
good: "SELECT orders"
bad: "SELECT * FROM orders WHERE id=123"
messaging:
pattern: "{operation} {queue/topic}"
good: "PUBLISH order-events"
bad: "publish message to order-events queue"
rpc:
pattern: "{service}/{method}"
good: "OrderService/CreateOrder"
bad: "grpc call to order service"
attributes: required: - "service.name" - "service.version" - "deployment.environment"
recommended:
http:
- "http.method"
- "http.route"
- "http.status_code"
- "http.target"
database:
- "db.system"
- "db.name"
- "db.operation"
- "db.statement (sanitized)"
messaging:
- "messaging.system"
- "messaging.destination"
- "messaging.operation"
Log Field Naming
log_naming: format: "snake_case for all fields"
standard_fields: timestamp: "ISO 8601 format" level: "INFO, WARN, ERROR, etc." message: "Human-readable description" service: "Service name" trace_id: "Correlation ID" span_id: "Current span"
domain_fields: pattern: "{domain}_{field}" examples: - "order_id" - "customer_id" - "payment_amount" - "product_sku"
avoid: - "Nested objects (flatten for indexing)" - "Arrays of unknown length" - "Large text blobs" - "Sensitive data (PII, secrets)"
Cardinality Management
Understanding Cardinality
Cardinality = Number of unique time series
Example: http_requests_total{method="GET", path="/api/users", status="200"}
Cardinality = methods × paths × statuses = 5 × 100 × 10 = 5,000 time series
With user_id (1M users): = 5 × 100 × 10 × 1,000,000 = 5,000,000,000 time series ← EXPLOSION!
Cardinality Budget
cardinality_budget: planning: total_budget: 100000 # Target max time series per service allocation: automatic_instrumentation: 30% # 30,000 business_transactions: 40% # 40,000 custom_metrics: 20% # 20,000 buffer: 10% # 10,000
per_metric_limits: low_cardinality: max_series: 100 example: "status codes, methods"
medium_cardinality:
max_series: 1000
example: "endpoints, operations"
high_cardinality:
max_series: 10000
example: "aggregated by hour"
requires: "Justification and approval"
monitoring: - "Alert on cardinality growth > 10% per day" - "Weekly cardinality reviews" - "Automatic label value limiting"
Cardinality Reduction Techniques
cardinality_reduction: bucketing: before: "path=/users/12345" after: "path=/users/{id}" technique: "Path template extraction"
sampling: description: "Sample high-volume, low-value traces" strategies: head_sampling: "Decide at trace start" tail_sampling: "Decide after seeing full trace" adaptive: "Adjust rate based on volume"
aggregation: description: "Pre-aggregate before export" example: "Count by status, not per request"
value_limiting: description: "Cap unique values per label" example: "Max 100 unique paths, then 'other'"
dropping: description: "Drop low-value dimensions" candidates: - "Instance ID (use service name)" - "Request ID (not for metrics)" - "Full URLs (use route templates)"
Instrumentation Budget
Performance Impact
performance_budget: cpu_overhead: target: "< 1% CPU increase" measurement: "Profile with/without instrumentation"
memory_overhead: target: "< 50MB additional heap" components: - "Metric registries" - "Span buffers" - "Log buffers"
latency_overhead: target: "< 1ms per request" hot_paths: "< 100μs"
data_volume: metrics: target: "< 1GB/day per service" calculation: "series × scrape_interval × 8 bytes"
traces:
target: "< 10GB/day per service (with sampling)"
sampling_rate: "1-10% for high-volume services"
logs:
target: "< 5GB/day per service"
strategies: "Sampling, level gating"
Cost Planning
cost_planning: estimation_formula: metrics: monthly_cost: "time_series × $0.003 (typical cloud pricing)" example: "10,000 series × $0.003 = $30/month"
traces:
monthly_cost: "spans_per_month × $0.000005"
example: "100M spans × $0.000005 = $500/month"
logs:
monthly_cost: "GB_per_month × $0.50"
example: "500GB × $0.50 = $250/month"
optimization_strategies: - "Increase scrape interval (15s → 60s)" - "Reduce trace sampling rate" - "Log level gating in production" - "Shorter retention for debug data" - "Downsample old metrics"
Instrumentation Plan Template
instrumentation_plan: service: "{Service Name}" version: "1.0" date: "{Date}" owner: "{Team}"
objectives: - "Track SLIs for order processing" - "Enable distributed tracing for debugging" - "Monitor payment success rates"
automatic_instrumentation: framework: "OpenTelemetry .NET" enabled: - "ASP.NET Core (HTTP server)" - "HttpClient (HTTP client)" - "Entity Framework Core (database)" - "Azure.Messaging.ServiceBus" configuration: sampling_rate: 0.1 # 10% of traces batch_export_interval: 5000 # ms
custom_spans: - name: "ProcessOrder" purpose: "Track order processing duration" attributes: - "order.id" - "order.item_count" - "order.total_amount" events: - "inventory.reserved" - "payment.processed"
- name: "ValidatePayment"
purpose: "Track payment validation steps"
attributes:
- "payment.method"
- "payment.provider"
sensitive: false
custom_metrics: counters: - name: "orders_total" labels: ["status", "payment_method"] purpose: "Count orders by outcome" cardinality_estimate: 20
- name: "payment_failures_total"
labels: ["reason", "provider"]
purpose: "Track payment failure reasons"
cardinality_estimate: 50
histograms:
- name: "order_processing_duration_seconds"
labels: ["order_type"]
purpose: "Track order processing latency"
buckets: [0.1, 0.5, 1, 2, 5, 10]
cardinality_estimate: 10
gauges:
- name: "pending_orders_current"
labels: []
purpose: "Current pending order count"
cardinality_estimate: 1
cardinality_summary: estimated_total: 81 budget: 1000 status: "Within budget"
log_strategy: production_level: "INFO" structured_fields: standard: - "trace_id" - "span_id" - "service" - "environment" domain: - "order_id" - "customer_id (hashed)" sampling: debug_logs: "1% in production"
cost_estimate: monthly: metrics: "$30" traces: "$200" logs: "$150" total: "$380"
review_schedule: frequency: "Quarterly" metrics_to_review: - "Cardinality growth" - "Data volume" - "Cost vs budget"
Related Skills
-
observability-patterns
-
Three pillars overview
-
distributed-tracing
-
Trace implementation details
-
slo-sli-error-budget
-
What to measure for SLOs
Last Updated: 2025-12-26