service-mesh-observability

Service Mesh Observability

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "service-mesh-observability" with this command: npx skills add wshobson/agents/wshobson-agents-service-mesh-observability

Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.

When to Use This Skill

  • Setting up distributed tracing across services

  • Implementing service mesh metrics and dashboards

  • Debugging latency and error issues

  • Defining SLOs for service communication

  • Visualizing service dependencies

  • Troubleshooting mesh connectivity

Core Concepts

  1. Three Pillars of Observability

┌─────────────────────────────────────────────────────┐ │ Observability │ ├─────────────────┬─────────────────┬─────────────────┤ │ Metrics │ Traces │ Logs │ │ │ │ │ │ • Request rate │ • Span context │ • Access logs │ │ • Error rate │ • Latency │ • Error details │ │ • Latency P50 │ • Dependencies │ • Debug info │ │ • Saturation │ • Bottlenecks │ • Audit trail │ └─────────────────┴─────────────────┴─────────────────┘

  1. Golden Signals for Mesh

Signal Description Alert Threshold

Latency Request duration P50, P99 P99 > 500ms

Traffic Requests per second Anomaly detection

Errors 5xx error rate

1%

Saturation Resource utilization

80%

Templates

Template 1: Istio with Prometheus & Grafana

Install Prometheus

apiVersion: v1 kind: ConfigMap metadata: name: prometheus namespace: istio-system data: prometheus.yml: | global: scrape_interval: 15s scrape_configs: - job_name: 'istio-mesh' kubernetes_sd_configs: - role: endpoints namespaces: names: - istio-system relabel_configs: - source_labels: [__meta_kubernetes_service_name] action: keep regex: istio-telemetry

ServiceMonitor for Prometheus Operator

apiVersion: monitoring.coreos.com/v1 kind: ServiceMonitor metadata: name: istio-mesh namespace: istio-system spec: selector: matchLabels: app: istiod endpoints: - port: http-monitoring interval: 15s

Template 2: Key Istio Metrics Queries

Request rate by service

sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

Error rate (5xx)

sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m])) / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

P99 latency

histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service_name))

TCP connections

sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)

Request size

histogram_quantile(0.99, sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m])) by (le, destination_service_name))

Template 3: Jaeger Distributed Tracing

Jaeger installation for Istio

apiVersion: install.istio.io/v1alpha1 kind: IstioOperator spec: meshConfig: enableTracing: true defaultConfig: tracing: sampling: 100.0 # 100% in dev, lower in prod zipkin: address: jaeger-collector.istio-system:9411

Jaeger deployment

apiVersion: apps/v1 kind: Deployment metadata: name: jaeger namespace: istio-system spec: selector: matchLabels: app: jaeger template: metadata: labels: app: jaeger spec: containers: - name: jaeger image: jaegertracing/all-in-one:1.50 ports: - containerPort: 5775 # UDP - containerPort: 6831 # Thrift - containerPort: 6832 # Thrift - containerPort: 5778 # Config - containerPort: 16686 # UI - containerPort: 14268 # HTTP - containerPort: 14250 # gRPC - containerPort: 9411 # Zipkin env: - name: COLLECTOR_ZIPKIN_HOST_PORT value: ":9411"

Template 4: Linkerd Viz Dashboard

Install Linkerd viz extension

linkerd viz install | kubectl apply -f -

Access dashboard

linkerd viz dashboard

CLI commands for observability

Top requests

linkerd viz top deploy/my-app

Per-route metrics

linkerd viz routes deploy/my-app --to deploy/backend

Live traffic inspection

linkerd viz tap deploy/my-app --to deploy/backend

Service edges (dependencies)

linkerd viz edges deployment -n my-namespace

Template 5: Grafana Dashboard JSON

{ "dashboard": { "title": "Service Mesh Overview", "panels": [ { "title": "Request Rate", "type": "graph", "targets": [ { "expr": "sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)", "legendFormat": "{{destination_service_name}}" } ] }, { "title": "Error Rate", "type": "gauge", "targets": [ { "expr": "sum(rate(istio_requests_total{response_code=~"5.."}[5m])) / sum(rate(istio_requests_total[5m])) * 100" } ], "fieldConfig": { "defaults": { "thresholds": { "steps": [ { "value": 0, "color": "green" }, { "value": 1, "color": "yellow" }, { "value": 5, "color": "red" } ] } } } }, { "title": "P99 Latency", "type": "graph", "targets": [ { "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service_name))", "legendFormat": "{{destination_service_name}}" } ] }, { "title": "Service Topology", "type": "nodeGraph", "targets": [ { "expr": "sum(rate(istio_requests_total{reporter="destination"}[5m])) by (source_workload, destination_service_name)" } ] } ] } }

Template 6: Kiali Service Mesh Visualization

Kiali installation

apiVersion: kiali.io/v1alpha1 kind: Kiali metadata: name: kiali namespace: istio-system spec: auth: strategy: anonymous # or openid, token deployment: accessible_namespaces: - "**" external_services: prometheus: url: http://prometheus.istio-system:9090 tracing: url: http://jaeger-query.istio-system:16686 grafana: url: http://grafana.istio-system:3000

Template 7: OpenTelemetry Integration

OpenTelemetry Collector for mesh

apiVersion: v1 kind: ConfigMap metadata: name: otel-collector-config data: config.yaml: | receivers: otlp: protocols: grpc: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 zipkin: endpoint: 0.0.0.0:9411

processors:
  batch:
    timeout: 10s

exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    traces:
      receivers: [otlp, zipkin]
      processors: [batch]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheus]

Istio Telemetry v2 with OTel

apiVersion: telemetry.istio.io/v1alpha1 kind: Telemetry metadata: name: mesh-default namespace: istio-system spec: tracing: - providers: - name: otel randomSamplingPercentage: 10

Alerting Rules

apiVersion: monitoring.coreos.com/v1 kind: PrometheusRule metadata: name: mesh-alerts namespace: istio-system spec: groups: - name: mesh.rules rules: - alert: HighErrorRate expr: | sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name) / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05 for: 5m labels: severity: critical annotations: summary: "High error rate for {{ $labels.destination_service_name }}"

    - alert: HighLatency
      expr: |
        histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
        by (le, destination_service_name)) > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High P99 latency for {{ $labels.destination_service_name }}"

    - alert: MeshCertExpiring
      expr: |
        (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
      labels:
        severity: warning
      annotations:
        summary: "Mesh certificate expiring in less than 7 days"

Best Practices

Do's

  • Sample appropriately - 100% in dev, 1-10% in prod

  • Use trace context - Propagate headers consistently

  • Set up alerts - For golden signals

  • Correlate metrics/traces - Use exemplars

  • Retain strategically - Hot/cold storage tiers

Don'ts

  • Don't over-sample - Storage costs add up

  • Don't ignore cardinality - Limit label values

  • Don't skip dashboards - Visualize dependencies

  • Don't forget costs - Monitor observability costs

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

tailwind-design-system

Tailwind Design System (v4)

Repository Source
31.3K19K
wshobson
Automation

api-design-principles

No summary provided by upstream source.

Repository SourceNeeds Review
Automation

nodejs-backend-patterns

No summary provided by upstream source.

Repository SourceNeeds Review