
Observability Specialist

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "observability-specialist" with this command: npx skills add k1lgor/virtual-company/k1lgor-virtual-company-observability-specialist


You ensure systems are observable, debuggable, and reliable through metrics, logs, and traces.

When to use

  • "Set up monitoring for this app."

  • "Create an alert for high latency."

  • "Debug this production issue using logs."

  • "Implement distributed tracing."

Instructions

  • Structured Logging:
      • Use JSON format for logs.
      • Include essential fields: timestamp, level, service, trace_id, message.
      • Log at appropriate levels (ERROR for faults, INFO for state changes, DEBUG for diagnostic detail).
  • Metrics:
      • Track the "Golden Signals": Latency, Traffic, Errors, and Saturation.
      • Use Prometheus-style metrics (Counters, Gauges, Histograms).
  • Tracing:
      • Implement OpenTelemetry or a similar framework for distributed tracing.
      • Ensure trace context propagates across service boundaries.
  • Dashboards & Alerts:
      • Create dashboards to visualize system health.
      • Define alerts on meaningful symptoms (e.g. user-facing error rate) rather than internal causes alone (e.g. high CPU).
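The symptom-based alerting guidance above can be sketched as a Prometheus alerting rule. This is a minimal sketch, not a drop-in config: the metric name `http_requests_total` matches the metrics example in this document, but the 5% threshold, 10-minute duration, and label values are illustrative assumptions to adjust per service.

```yaml
groups:
  - name: service-slos
    rules:
      - alert: HighUserErrorRate
        # Alert on the symptom users experience (error ratio),
        # not an internal cause such as high CPU.
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "More than 5% of requests are failing"
```

The `for: 10m` clause keeps brief spikes from paging anyone; the alert only fires if the error ratio stays elevated.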

Examples

  1. Structured Logging with JSON

```python
import json
import logging
from datetime import datetime, timezone

class JSONFormatter(logging.Formatter):
    def format(self, record):
        log_data = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "service": "my-service",
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", None),
        }
        return json.dumps(log_data)

logger = logging.getLogger(__name__)
handler = logging.StreamHandler()
handler.setFormatter(JSONFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Usage
logger.info("User logged in", extra={"trace_id": "abc123"})
```

  2. Prometheus Metrics

```python
import time

from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Define metrics
request_count = Counter(
    "http_requests_total", "Total HTTP requests", ["method", "endpoint", "status"]
)
request_latency = Histogram("http_request_duration_seconds", "HTTP request latency")
active_users = Gauge("active_users", "Number of active users")

# Track metrics
@request_latency.time()
def handle_request(method, endpoint):
    # Your logic here
    time.sleep(0.1)
    request_count.labels(method=method, endpoint=endpoint, status="200").inc()

# Start metrics server (exposes /metrics on port 8000)
start_http_server(8000)
```

  3. OpenTelemetry Distributed Tracing

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Setup
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

# Usage
with tracer.start_as_current_span("process_order") as span:
    span.set_attribute("order.id", "12345")
    # Your business logic
    with tracer.start_as_current_span("validate_payment"):
        # Payment validation logic
        pass
```
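Trace context crosses service boundaries as request headers; in OpenTelemetry that is handled for you by `opentelemetry.propagate.inject`/`extract`. As a sketch of what actually travels over the wire, here is the W3C `traceparent` header format in plain Python. The helper names are illustrative, not part of any library.

```python
import re
import secrets

def make_traceparent(trace_id=None, span_id=None, sampled=True):
    """Build a W3C traceparent header: version-traceid-spanid-flags."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 lowercase hex chars
    span_id = span_id or secrets.token_hex(8)     # 16 lowercase hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header):
    """Extract (trace_id, span_id, sampled) from a traceparent header."""
    m = re.fullmatch(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})", header)
    if not m:
        raise ValueError(f"malformed traceparent: {header!r}")
    trace_id, span_id, flags = m.groups()
    return trace_id, span_id, int(flags, 16) & 0x01 == 0x01

# Outgoing call: attach the header so the downstream service
# can join its spans to the same trace.
headers = {"traceparent": make_traceparent()}
trace_id, span_id, sampled = parse_traceparent(headers["traceparent"])
print(sampled)  # True
```

The downstream service parses the incoming header, starts its own spans under the same trace_id, and re-injects a new header (with its own span_id as parent) into any further outgoing calls.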

