instrumentation-planning

Instrumentation Planning

Strategic planning for application instrumentation before implementation.

When to Use This Skill

Planning instrumentation for new services
Reviewing instrumentation strategy
Establishing naming conventions
Managing telemetry cardinality
Setting instrumentation budgets

Instrumentation Strategy Framework

What to Instrument

Instrumentation Layers:

┌─────────────────────────────────────────────────────────────────┐
│ Layer 1: Automatic/Library Instrumentation                      │
│ - HTTP clients/servers (auto-captured)                          │
│ - Database clients (auto-captured)                              │
│ - Message queue clients (auto-captured)                         │
│ - Framework-provided metrics                                    │
│ Effort: Low | Coverage: Broad | Customization: Limited          │
├─────────────────────────────────────────────────────────────────┤
│ Layer 2: Business Transaction Instrumentation                   │
│ - Key user journeys                                             │
│ - Business operations (checkout, signup, etc.)                  │
│ - Revenue-generating flows                                      │
│ - SLA-bound operations                                          │
│ Effort: Medium | Coverage: Targeted | Value: High               │
├─────────────────────────────────────────────────────────────────┤
│ Layer 3: Debug/Diagnostic Instrumentation                       │
│ - Algorithmic hot paths                                         │
│ - Cache behavior                                                │
│ - Circuit breaker states                                        │
│ - Retry/fallback paths                                          │
│ Effort: Medium | Coverage: Deep | Use: Troubleshooting          │
├─────────────────────────────────────────────────────────────────┤
│ Layer 4: Business Metrics                                       │
│ - Domain-specific counters                                      │
│ - Conversion rates                                              │
│ - Feature usage                                                 │
│ - Customer behavior                                             │
│ Effort: High | Coverage: Custom | Value: Business Insights      │
└─────────────────────────────────────────────────────────────────┘

Instrumentation Decision Matrix

instrumentation_decisions:
  always_instrument:
    - "Inbound HTTP/gRPC requests"
    - "Outbound HTTP/gRPC calls"
    - "Database queries"
    - "Message publish/consume"
    - "Authentication/authorization"
    - "External API calls"
    - "Cache operations"

  consider_instrumenting:
    - "Complex business logic"
    - "Feature flags evaluation"
    - "Background jobs"
    - "Scheduled tasks"
    - "File I/O operations"
    - "CPU-intensive operations"

  avoid_instrumenting:
    - "Every method call (too noisy)"
    - "Tight loops (performance impact)"
    - "Data transformation (low value)"
    - "Validation helpers"
    - "Utility functions"

  decision_criteria:
    business_value:
      weight: 0.3
      question: "Does this help understand business outcomes?"

    debugging_value:
      weight: 0.25
      question: "Does this help diagnose production issues?"

    slo_relevance:
      weight: 0.25
      question: "Does this contribute to SLI measurement?"

    cost_impact:
      weight: 0.2
      question: "Is the cardinality/volume acceptable?"
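
The weighted criteria above can be combined into a single score per candidate. The following is a minimal sketch, assuming each criterion is rated by a reviewer on a 0.0 to 1.0 scale; the function names and the 0.6 threshold are hypothetical and not part of the matrix itself.

```python
# Weights copied from decision_criteria. The 0.0-1.0 rating scale and the
# default threshold are assumptions made for this sketch.
WEIGHTS = {
    "business_value": 0.3,
    "debugging_value": 0.25,
    "slo_relevance": 0.25,
    "cost_impact": 0.2,
}


def instrumentation_score(ratings: dict[str, float]) -> float:
    """Return the weighted sum of the four criterion ratings."""
    missing = WEIGHTS.keys() - ratings.keys()
    if missing:
        raise ValueError(f"missing ratings: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= ratings[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return sum(weight * ratings[name] for name, weight in WEIGHTS.items())


def should_instrument(ratings: dict[str, float], threshold: float = 0.6) -> bool:
    """Hypothetical cut-off: instrument when the score reaches the threshold."""
    return instrumentation_score(ratings) >= threshold
```

A checkout flow rated high on business value and SLO relevance but only moderate on debugging value and cost would pass the threshold, while a utility helper rated near zero on every criterion would not.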

Naming Conventions

Metric Naming

metric_naming:
  format: "[namespace]_[subsystem]_[name]_[unit]"

  rules:
    case: "snake_case"
    unit_suffix: "Always include (_seconds, _bytes, _total)"
    base_units: "Use base units (seconds not milliseconds)"
    counter_suffix: "_total for counters"

  examples:
    good:
      - "http_server_requests_total"
      - "http_server_request_duration_seconds"
      - "http_server_response_size_bytes"
      - "db_connections_current"
      - "order_processing_duration_seconds"
      - "payment_transactions_total"

    bad:
      - "requests (no unit, no namespace)"
      - "HttpRequestDuration (wrong case)"
      - "order_latency_ms (use base units)"
      - "totalOrders (camelCase, no unit)"

  label_naming:
    case: "snake_case"
    avoid:
      - "Embedded values in name (path=/users)"
      - "High cardinality labels"
    good_labels:
      - "method, status_code, path"
      - "service, version, environment"
    bad_labels:
      - "user_id (high cardinality)"
      - "request_id (high cardinality)"
      - "timestamp (not a dimension)"
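
The naming rules above lend themselves to an automated check, for example in code review tooling. This is a rough sketch rather than a complete linter: the accepted suffix list is taken from the good examples (including `_current` for gauges), while the list of non-base-unit suffixes and the function name `check_metric_name` are assumptions.

```python
import re

SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")

# Suffixes seen in the "good" examples. Extend for your own units (assumption).
ACCEPTED_SUFFIXES = ("_total", "_seconds", "_bytes", "_current")
# A few common non-base-unit suffixes; this list is illustrative, not exhaustive.
NON_BASE_UNIT_SUFFIXES = ("_ms", "_milliseconds", "_kb", "_mb")


def check_metric_name(name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if not SNAKE_CASE.match(name):
        problems.append("not snake_case")
    if name.endswith(NON_BASE_UNIT_SUFFIXES):
        problems.append("use base units")
    elif not name.endswith(ACCEPTED_SUFFIXES):
        problems.append("missing unit suffix")
    return problems
```

Running it over the examples: `http_server_requests_total` passes, `order_latency_ms` is flagged for its unit, and `HttpRequestDuration` is flagged for both case and the missing suffix.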

Span Naming

span_naming:
  format: "[operation] [resource]"

  rules:
    - "Use verb + noun pattern"
    - "Keep names low cardinality"
    - "Include operation type, not specific values"
    - "Be consistent across services"

  examples:
    http:
      pattern: "HTTP {METHOD} {route_template}"
      good: "HTTP GET /users/{id}"
      bad: "HTTP GET /users/12345"

    database:
      pattern: "{operation} {table}"
      good: "SELECT orders"
      bad: "SELECT * FROM orders WHERE id=123"

    messaging:
      pattern: "{operation} {queue/topic}"
      good: "PUBLISH order-events"
      bad: "publish message to order-events queue"

    rpc:
      pattern: "{service}/{method}"
      good: "OrderService/CreateOrder"
      bad: "grpc call to order service"

  attributes:
    required:
      - "service.name"
      - "service.version"
      - "deployment.environment"

    recommended:
      http:
        - "http.method"
        - "http.route"
        - "http.status_code"
        - "http.target"

      database:
        - "db.system"
        - "db.name"
        - "db.operation"
        - "db.statement (sanitized)"

      messaging:
        - "messaging.system"
        - "messaging.destination"
        - "messaging.operation"
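
The HTTP and database patterns above can be sketched as small helpers that build low-cardinality span names. The helper names are hypothetical, and the SQL handling is a naive regex that only covers simple single-table statements; it is not a SQL parser and real instrumentation libraries usually supply the operation and table for you.

```python
import re

_IDENT = r"([A-Za-z_][\w.]*)"

# Where to find the table name for each operation (simple statements only).
_TABLE_PATTERNS = {
    "SELECT": re.compile(r"\bFROM\s+" + _IDENT, re.IGNORECASE),
    "DELETE": re.compile(r"\bFROM\s+" + _IDENT, re.IGNORECASE),
    "INSERT": re.compile(r"\bINTO\s+" + _IDENT, re.IGNORECASE),
    "UPDATE": re.compile(r"^\s*UPDATE\s+" + _IDENT, re.IGNORECASE),
}


def http_span_name(method: str, route_template: str) -> str:
    """Build 'HTTP {METHOD} {route_template}' from the route, not the raw path."""
    return f"HTTP {method.upper()} {route_template}"


def db_span_name(statement: str) -> str:
    """Build '{operation} {table}' without leaking query values into the name."""
    parts = statement.split(None, 1)
    operation = parts[0].upper() if parts else ""
    pattern = _TABLE_PATTERNS.get(operation)
    match = pattern.search(statement) if pattern else None
    # Fall back to the bare operation rather than the full statement.
    return f"{operation} {match.group(1)}" if match else operation
```

Falling back to the bare operation keeps the name low cardinality even when the statement is too complex for the regex; the sanitized statement can still be attached as an attribute.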

Log Field Naming

log_naming:
  format: "snake_case for all fields"

  standard_fields:
    timestamp: "ISO 8601 format"
    level: "INFO, WARN, ERROR, etc."
    message: "Human-readable description"
    service: "Service name"
    trace_id: "Correlation ID"
    span_id: "Current span"

  domain_fields:
    pattern: "{domain}_{field}"
    examples:
      - "order_id"
      - "customer_id"
      - "payment_amount"
      - "product_sku"

  avoid:
    - "Nested objects (flatten for indexing)"
    - "Arrays of unknown length"
    - "Large text blobs"
    - "Sensitive data (PII, secrets)"
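
As an illustration of the "flatten for indexing" and "no sensitive data" rules, the sketch below turns nested objects into `{domain}_{field}` keys and drops denied fields. The deny-list contents and the function name are assumptions; a real service would maintain its own list of sensitive keys.

```python
# Hypothetical deny-list; replace with the fields your service treats as sensitive.
SENSITIVE_FIELDS = {"password", "token", "email"}


def flatten_log_fields(fields: dict, prefix: str = "") -> dict:
    """Flatten nested dicts into snake_case keys and drop sensitive fields."""
    flat = {}
    for key, value in fields.items():
        if key in SENSITIVE_FIELDS:
            continue  # never emit sensitive values, at any nesting depth
        name = f"{prefix}_{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten_log_fields(value, name))
        else:
            flat[name] = value
    return flat
```

For example, `{"order": {"id": "o-1"}, "customer": {"id": "c-9", "email": "a@example.com"}}` becomes `{"order_id": "o-1", "customer_id": "c-9"}`.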

Cardinality Management

Understanding Cardinality

Cardinality = Number of unique time series

Example:
  http_requests_total{method="GET", path="/api/users", status="200"}

Cardinality = methods × paths × statuses
            = 5 × 100 × 10
            = 5,000 time series

With user_id (1M users):
            = 5 × 100 × 10 × 1,000,000
            = 5,000,000,000 time series ← EXPLOSION!
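
The arithmetic above is a product of the number of distinct values per label, which is worth computing before a new label ships. A minimal sketch, with `series_count` as a hypothetical helper name; it gives the worst case, since not every label combination necessarily occurs in practice.

```python
from math import prod


def series_count(distinct_values_per_label: dict[str, int]) -> int:
    """Worst-case number of time series for one metric.

    A metric with no labels has exactly one series, which is what
    prod() of an empty sequence returns.
    """
    return prod(distinct_values_per_label.values())


base = {"method": 5, "path": 100, "status": 10}
with_user_id = {**base, "user_id": 1_000_000}
```

Here `series_count(base)` gives 5,000 and `series_count(with_user_id)` gives 5,000,000,000, matching the figures above.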

Cardinality Budget

cardinality_budget:
  planning:
    total_budget: 100000  # Target max time series per service
    allocation:
      automatic_instrumentation: 30%  # 30,000
      business_transactions: 40%      # 40,000
      custom_metrics: 20%             # 20,000
      buffer: 10%                     # 10,000

  per_metric_limits:
    low_cardinality:
      max_series: 100
      example: "status codes, methods"

    medium_cardinality:
      max_series: 1000
      example: "endpoints, operations"

    high_cardinality:
      max_series: 10000
      example: "aggregated by hour"
      requires: "Justification and approval"

  monitoring:
    - "Alert on cardinality growth > 10% per day"
    - "Weekly cardinality reviews"
    - "Automatic label value limiting"
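
The allocation percentages and the 10% daily growth alert can be expressed as two small functions. This is a sketch using the figures from the budget above; the function names and the way daily series counts are obtained are assumptions.

```python
TOTAL_BUDGET = 100_000
ALLOCATION_PERCENT = {
    "automatic_instrumentation": 30,
    "business_transactions": 40,
    "custom_metrics": 20,
    "buffer": 10,
}


def allocate(total: int, percents: dict[str, int]) -> dict[str, int]:
    """Split the total series budget according to whole-number percentages."""
    if sum(percents.values()) != 100:
        raise ValueError("allocation percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in percents.items()}


def growth_alert(yesterday: int, today: int, limit_percent: int = 10) -> bool:
    """True when the series count grew by more than limit_percent in a day."""
    if yesterday <= 0:
        return today > 0
    # Integer comparison avoids floating-point edge cases at exactly 10%.
    return (today - yesterday) * 100 > yesterday * limit_percent
```

With these inputs, `allocate(TOTAL_BUDGET, ALLOCATION_PERCENT)` reproduces the 30,000 / 40,000 / 20,000 / 10,000 split, and growth from 10,000 to 11,500 series in a day would trigger the alert.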

Cardinality Reduction Techniques

cardinality_reduction:
  bucketing:
    before: "path=/users/12345"
    after: "path=/users/{id}"
    technique: "Path template extraction"

  sampling:
    description: "Sample high-volume, low-value traces"
    strategies:
      head_sampling: "Decide at trace start"
      tail_sampling: "Decide after seeing full trace"
      adaptive: "Adjust rate based on volume"

  aggregation:
    description: "Pre-aggregate before export"
    example: "Count by status, not per request"

  value_limiting:
    description: "Cap unique values per label"
    example: "Max 100 unique paths, then 'other'"

  dropping:
    description: "Drop low-value dimensions"
    candidates:
      - "Instance ID (use service name)"
      - "Request ID (not for metrics)"
      - "Full URLs (use route templates)"
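
Two of the techniques above, path template extraction and value limiting, are easy to sketch in code. The version below assumes that only purely numeric path segments are identifiers, which is a simplification (a web framework's own route template is more reliable when available); `template_path` and `LabelValueLimiter` are hypothetical names.

```python
import re

# A path segment made only of digits, followed by "/" or the end of the path.
_NUMERIC_SEGMENT = re.compile(r"/\d+(?=/|$)")


def template_path(path: str) -> str:
    """Replace numeric segments with '{id}' (assumes IDs are numeric)."""
    return _NUMERIC_SEGMENT.sub("/{id}", path)


class LabelValueLimiter:
    """Pass through the first max_values distinct values, then return overflow."""

    def __init__(self, max_values: int = 100, overflow: str = "other"):
        self.max_values = max_values
        self.overflow = overflow
        self._seen: set[str] = set()

    def limit(self, value: str) -> str:
        if value in self._seen:
            return value
        if len(self._seen) < self.max_values:
            self._seen.add(value)
            return value
        return self.overflow
```

Applying `template_path` before `LabelValueLimiter.limit` keeps the limiter's slots for genuinely distinct routes instead of filling them with per-ID paths.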

Instrumentation Budget

Performance Impact

performance_budget:
  cpu_overhead:
    target: "< 1% CPU increase"
    measurement: "Profile with/without instrumentation"

  memory_overhead:
    target: "< 50MB additional heap"
    components:
      - "Metric registries"
      - "Span buffers"
      - "Log buffers"

  latency_overhead:
    target: "< 1ms per request"
    hot_paths: "< 100μs"

  data_volume:
    metrics:
      target: "< 1GB/day per service"
      calculation: "series × samples_per_day × 8 bytes (samples_per_day = 86,400 / scrape_interval_seconds)"

    traces:
      target: "< 10GB/day per service (with sampling)"
      sampling_rate: "1-10% for high-volume services"

    logs:
      target: "< 5GB/day per service"
      strategies: "Sampling, level gating"
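
To check a service against the metrics volume target, the daily volume can be estimated from the series count and the scrape interval. This is a back-of-the-envelope sketch: it uses the 8 bytes per sample figure from the budget above and ignores compression and label storage, so treat the result as an upper-bound style estimate. The function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400


def metric_bytes_per_day(
    series: int, scrape_interval_seconds: int, bytes_per_sample: int = 8
) -> int:
    """Estimate raw metric sample volume per day for one service."""
    if scrape_interval_seconds <= 0:
        raise ValueError("scrape interval must be positive")
    samples_per_series = SECONDS_PER_DAY // scrape_interval_seconds
    return series * samples_per_series * bytes_per_sample
```

For 10,000 series scraped every 15 seconds this gives 460,800,000 bytes (about 0.46 GB) per day; moving to a 60 second interval cuts that to 115,200,000 bytes.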

Cost Planning

cost_planning:
  estimation_formula:
    metrics:
      monthly_cost: "time_series × $0.003 (typical cloud pricing)"
      example: "10,000 series × $0.003 = $30/month"

    traces:
      monthly_cost: "spans_per_month × $0.000005"
      example: "100M spans × $0.000005 = $500/month"

    logs:
      monthly_cost: "GB_per_month × $0.50"
      example: "500GB × $0.50 = $250/month"

  optimization_strategies:
    - "Increase scrape interval (15s → 60s)"
    - "Reduce trace sampling rate"
    - "Log level gating in production"
    - "Shorter retention for debug data"
    - "Downsample old metrics"
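
The three estimation formulas can be combined into one monthly estimate. The sketch below uses the unit prices quoted above, which the document describes as typical rather than vendor-specific, so substitute your provider's actual rates; `monthly_cost` is a hypothetical helper, and `Decimal` is used only to keep the currency arithmetic exact.

```python
from decimal import Decimal

# Unit prices from the estimation formulas above; replace with real vendor rates.
RATES = {
    "per_time_series": Decimal("0.003"),
    "per_span": Decimal("0.000005"),
    "per_log_gb": Decimal("0.50"),
}


def monthly_cost(time_series: int, spans: int, log_gb: int) -> dict[str, Decimal]:
    """Return the estimated monthly cost per signal plus the total, in dollars."""
    costs = {
        "metrics": time_series * RATES["per_time_series"],
        "traces": spans * RATES["per_span"],
        "logs": log_gb * RATES["per_log_gb"],
    }
    costs["total"] = sum(costs.values(), Decimal("0"))
    return costs
```

Feeding in the example volumes (10,000 series, 100M spans, 500 GB of logs) reproduces the $30, $500, and $250 figures, for a total of $780 per month.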

Instrumentation Plan Template

instrumentation_plan:
  service: "{Service Name}"
  version: "1.0"
  date: "{Date}"
  owner: "{Team}"

  objectives:
    - "Track SLIs for order processing"
    - "Enable distributed tracing for debugging"
    - "Monitor payment success rates"

  automatic_instrumentation:
    framework: "OpenTelemetry .NET"
    enabled:
      - "ASP.NET Core (HTTP server)"
      - "HttpClient (HTTP client)"
      - "Entity Framework Core (database)"
      - "Azure.Messaging.ServiceBus"
    configuration:
      sampling_rate: 0.1           # 10% of traces
      batch_export_interval: 5000  # ms

  custom_spans:
    - name: "ProcessOrder"
      purpose: "Track order processing duration"
      attributes:
        - "order.id"
        - "order.item_count"
        - "order.total_amount"
      events:
        - "inventory.reserved"
        - "payment.processed"

    - name: "ValidatePayment"
      purpose: "Track payment validation steps"
      attributes:
        - "payment.method"
        - "payment.provider"
      sensitive: false

  custom_metrics:
    counters:
      - name: "orders_total"
        labels: ["status", "payment_method"]
        purpose: "Count orders by outcome"
        cardinality_estimate: 20

      - name: "payment_failures_total"
        labels: ["reason", "provider"]
        purpose: "Track payment failure reasons"
        cardinality_estimate: 50

    histograms:
      - name: "order_processing_duration_seconds"
        labels: ["order_type"]
        purpose: "Track order processing latency"
        buckets: [0.1, 0.5, 1, 2, 5, 10]
        cardinality_estimate: 10

    gauges:
      - name: "pending_orders_current"
        labels: []
        purpose: "Current pending order count"
        cardinality_estimate: 1

  cardinality_summary:
    estimated_total: 81
    budget: 1000
    status: "Within budget"

  log_strategy:
    production_level: "INFO"
    structured_fields:
      standard:
        - "trace_id"
        - "span_id"
        - "service"
        - "environment"
      domain:
        - "order_id"
        - "customer_id (hashed)"
    sampling:
      debug_logs: "1% in production"

  cost_estimate:
    monthly:
      metrics: "$30"
      traces: "$200"
      logs: "$150"
      total: "$380"

  review_schedule:
    frequency: "Quarterly"
    metrics_to_review:
      - "Cardinality growth"
      - "Data volume"
      - "Cost vs budget"
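
The `cardinality_summary` in the template is derived from the per-metric estimates, so it can be recomputed rather than maintained by hand. A small sketch, assuming the plan has already been loaded into a dictionary shaped like `custom_metrics` above; the function name and the "Over budget" status string are hypothetical.

```python
# Per-metric estimates copied from the template's custom_metrics section.
custom_metrics = {
    "counters": [
        {"name": "orders_total", "cardinality_estimate": 20},
        {"name": "payment_failures_total", "cardinality_estimate": 50},
    ],
    "histograms": [
        {"name": "order_processing_duration_seconds", "cardinality_estimate": 10},
    ],
    "gauges": [
        {"name": "pending_orders_current", "cardinality_estimate": 1},
    ],
}


def summarize_cardinality(metrics: dict, budget: int) -> dict:
    """Sum the per-metric estimates and compare the total against the budget."""
    total = sum(
        metric["cardinality_estimate"]
        for group in metrics.values()
        for metric in group
    )
    status = "Within budget" if total <= budget else "Over budget"
    return {"estimated_total": total, "budget": budget, "status": status}
```

With the template's figures, `summarize_cardinality(custom_metrics, 1000)` returns an estimated total of 81 and the status "Within budget", matching the summary in the plan.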

Related Skills

observability-patterns: Three pillars overview
distributed-tracing: Trace implementation details
slo-sli-error-budget: What to measure for SLOs

Last Updated: 2025-12-26
