SRE Monitoring and Observability

Guidance for building comprehensive monitoring and observability systems: golden signals, SLIs, alerting, dashboards, tracing, and structured logging.

Four Golden Signals

Latency

Time to process requests:

Request duration

http_request_duration_seconds

95th-percentile latency over the last 5 minutes, aggregated across all series

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
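To make the percentile query less of a black box, the sketch below estimates a quantile from cumulative histogram buckets by linear interpolation inside the bucket that contains the target rank. This is roughly the approach histogram_quantile takes, not a copy of its implementation. The function name `estimateQuantile` and the bucket shape `{ le, count }` are assumptions made for illustration.

```javascript
// Estimate a quantile from cumulative histogram buckets.
// buckets: [{ le: upperBound, count: cumulativeCount }], sorted by le,
// with the last bucket being le = Infinity (hypothetical input shape).
function estimateQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  for (let i = 0; i < buckets.length; i++) {
    if (buckets[i].count >= rank) {
      const upper = buckets[i].le;
      // The +Inf bucket has no upper bound, so fall back to the highest
      // finite bound.
      if (upper === Infinity) return i > 0 ? buckets[i - 1].le : NaN;
      const lower = i > 0 ? buckets[i - 1].le : 0;
      const below = i > 0 ? buckets[i - 1].count : 0;
      const inBucket = buckets[i].count - below;
      if (inBucket === 0) return lower;
      // Assume observations are spread evenly inside the bucket.
      return lower + (upper - lower) * ((rank - below) / inBucket);
    }
  }
  return NaN;
}

// 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1.0s.
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
  { le: Infinity, count: 100 },
];
const p95 = estimateQuantile(0.95, buckets);
console.log(p95);
```

Because the estimate interpolates within a bucket, its accuracy depends on how closely the bucket boundaries bracket the latencies you care about.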

Traffic

Demand on the system:

Requests per second

rate(http_requests_total[5m])

By endpoint

sum(rate(http_requests_total[5m])) by (endpoint)

Errors

Rate of failed requests:

Error rate

sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

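SLI compliance, as the fraction of error budget remaining (error_rate and slo_target are placeholders; slo_target is the target success ratio, e.g. 0.999)

1 - (error_rate / (1 - slo_target))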
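As a worked example of the error arithmetic, the sketch below computes an error ratio from raw counts and the share of error budget left. It assumes the SLO target is expressed as a success ratio (for example 0.999), so the allowed error ratio is 1 minus the target. The function names are hypothetical.

```javascript
// Fraction of requests that failed in the window.
function errorRatio(failed, total) {
  return total === 0 ? 0 : failed / total;
}

// Share of the error budget still unspent.
// Assumption: sloTarget is a success ratio such as 0.999.
// 1 means the budget is untouched, 0 means fully spent,
// and a negative value means the SLO is being violated.
function errorBudgetRemaining(ratio, sloTarget) {
  const allowed = 1 - sloTarget;
  return 1 - ratio / allowed;
}

const ratio = errorRatio(5, 10000);                  // 5 failures in 10,000 requests
const remaining = errorBudgetRemaining(ratio, 0.999);
console.log(ratio, remaining);
```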

Saturation

Resource utilization:

CPU usage

100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Memory usage

(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100
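The CPU query works on a cumulative idle-time counter, which can be hard to picture. The sketch below repeats the same arithmetic on two raw snapshots: the per-core idle fraction is the change in idle seconds divided by elapsed wall time, and usage is what remains. The snapshot shape `{ t, idle }` and the function name are assumptions for illustration.

```javascript
// prev and curr: { t: wallClockSeconds, idle: [cumulative idle seconds per core] }
function cpuUsagePercent(prev, curr) {
  const elapsed = curr.t - prev.t;
  // Fraction of the interval each core spent idle.
  const idleFractions = curr.idle.map((v, i) => (v - prev.idle[i]) / elapsed);
  const avgIdle =
    idleFractions.reduce((a, b) => a + b, 0) / idleFractions.length;
  return 100 - avgIdle * 100;
}

// Two cores over 10 seconds: one idle for 5s, the other idle for 7.5s.
const usage = cpuUsagePercent(
  { t: 0, idle: [100, 200] },
  { t: 10, idle: [105, 207.5] }
);
console.log(usage);
```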

Service Level Indicators (SLIs)

Availability SLI

Successful requests / Total requests

sum(rate(http_requests_total{status=~"[23].."}[30d])) / sum(rate(http_requests_total[30d]))

Latency SLI

Requests faster than threshold / Total requests

sum(rate(http_request_duration_seconds_bucket{le="0.5"}[30d])) / sum(rate(http_request_duration_seconds_count[30d]))

Throughput SLI

Requests processed relative to capacity, capped at 1 (capacity_requests_per_second is a placeholder for a capacity value you define)

clamp_max(rate(http_requests_total[5m]) / capacity_requests_per_second, 1.0)
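All three SLIs above reduce to a ratio between 0 and 1. The sketch below writes them as plain functions over already-aggregated counts, which can be handy for sanity-checking dashboard numbers. The function names and inputs are hypothetical.

```javascript
// Successful requests / total requests.
function availabilitySli(successful, total) {
  return total === 0 ? 1 : successful / total;
}

// Requests faster than the threshold / total requests.
function latencySli(fasterThanThreshold, total) {
  return total === 0 ? 1 : fasterThanThreshold / total;
}

// Request rate relative to capacity, capped at 1 (mirrors clamp_max).
function throughputSli(requestsPerSecond, capacityPerSecond) {
  return Math.min(requestsPerSecond / capacityPerSecond, 1.0);
}

console.log(availabilitySli(9990, 10000)); // share of successful requests
console.log(latencySli(950, 1000));        // share of requests under the threshold
console.log(throughputSli(1200, 1000));    // capped at 1 when over capacity
```

Treating an empty window as fully compliant (returning 1 when there is no traffic) is a design choice in this sketch; some teams prefer to report no data instead.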

Alerting

Alert Severity Levels

P0 - Critical: Service down or severe degradation
P1 - High: Significant impact, error budget at risk
P2 - Medium: Degradation, not user-facing yet
P3 - Low: Awareness, no immediate action needed

Example Alerts

High error rate

groups:
  - name: sre
    rules:
      - alert: HighErrorRate
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m]))
            > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"

      - alert: LatencyP95High
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1.0
        for: 10m
        labels:
          severity: warning

      # sli_availability and error_budget_remaining are placeholder series,
      # for example recording rules you define yourself.
      - alert: ErrorBudgetBurn
        expr: |
          (1 - sli_availability) > (error_budget_remaining * 10)
        for: 1h
        labels:
          severity: high
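Each rule above pairs a threshold with a `for:` duration, which means the condition must hold continuously for that long before the alert fires. The sketch below simulates that behaviour over a series of evaluations. The function name `firingAt` and the sample shape `{ t, value }` are assumptions, and this is a simplified model rather than Prometheus's own evaluation code.

```javascript
// samples: [{ t: evaluationTimeSeconds, value }] in time order.
// Returns the evaluation times at which the alert would be firing.
function firingAt(samples, threshold, forSeconds) {
  let pendingSince = null; // when the condition first became true
  const firing = [];
  for (const { t, value } of samples) {
    if (value > threshold) {
      if (pendingSince === null) pendingSince = t;
      if (t - pendingSince >= forSeconds) firing.push(t);
    } else {
      // Any evaluation below the threshold resets the pending timer.
      pendingSince = null;
    }
  }
  return firing;
}

// Error ratio evaluated once a minute; threshold 0.05, for: 5m.
const values = [0.01, 0.06, 0.07, 0.08, 0.09, 0.10, 0.06, 0.02, 0.07];
const samples = values.map((value, i) => ({ t: i * 60, value }));
const firingTimes = firingAt(samples, 0.05, 300);
console.log(firingTimes);
```

In this run the condition first becomes true at t=60 and the alert fires at t=360, then resets when the ratio dips at t=420. A short spike never pages anyone, which is the point of `for:`.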

Dashboards

Overview Dashboard

  • Service health (red/yellow/green)

  • Request rate

  • Error rate

  • Latency percentiles (p50, p95, p99)

  • Saturation metrics

Detailed Dashboard

  • Per-endpoint metrics

  • Dependency health

  • Database performance

  • Cache hit rates

  • Queue depths

Distributed Tracing

OpenTelemetry

const { trace, SpanStatusCode } = require('@opentelemetry/api');

const tracer = trace.getTracer('my-service');

async function handleRequest(req) {
  const span = tracer.startSpan('handle_request');

  try {
    span.setAttribute('user.id', req.user.id);
    span.setAttribute('request.path', req.path);

    const result = await processRequest(req);

    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (error) {
    // Mark the span as failed so the trace shows where the error occurred.
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    // Always end the span, on success or failure.
    span.end();
  }
}

Structured Logging

logger.info('request_processed', {
  request_id: req.id,
  user_id: req.user.id,
  endpoint: req.path,
  method: req.method,
  status_code: res.statusCode,
  duration_ms: duration,
  error: error?.message,
});
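The call above assumes a `logger` object that emits structured records. As a minimal sketch of what such a logger could look like, the version below writes one JSON object per line. The `createLogger` name, its injectable `write` and `now` parameters, and the `timestamp`/`level`/`event` field names are all assumptions, not part of any particular logging library.

```javascript
// Build a logger that serialises each event as a single JSON line.
// write and now are injectable so output can be captured and made deterministic.
function createLogger(
  write = (line) => process.stdout.write(line + '\n'),
  now = () => new Date()
) {
  const log = (level) => (event, fields = {}) => {
    const entry = { timestamp: now().toISOString(), level, event, ...fields };
    // JSON.stringify drops fields whose value is undefined,
    // so an absent error simply does not appear in the record.
    write(JSON.stringify(entry));
  };
  return { info: log('info'), warn: log('warn'), error: log('error') };
}

const captured = [];
const logger = createLogger((line) => captured.push(line));
logger.info('request_processed', {
  request_id: 'req-1',
  status_code: 200,
  duration_ms: 42,
  error: undefined,
});
console.log(captured[0]);
```

One JSON object per line keeps records easy to parse and to filter by field in most log pipelines.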

Best Practices

USE Method

For resources:

  • Utilization: % time resource is busy

  • Saturation: Work queued but not serviced

  • Errors: Error count

RED Method

For requests:

  • Rate: Requests per second

  • Errors: Failed requests per second

  • Duration: Request latency distribution
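The three RED measurements can be derived from a plain list of request records. The sketch below does so for a fixed time window, using the nearest-rank method for percentiles and treating 5xx statuses as failures. The record shape `{ status, durationMs }` and the function name are assumptions for illustration.

```javascript
// requests: [{ status: httpStatusCode, durationMs: number }]
// windowSeconds: length of the observation window.
function redMetrics(requests, windowSeconds) {
  const durations = requests.map((r) => r.durationMs).sort((a, b) => a - b);
  const failed = requests.filter((r) => r.status >= 500).length;
  // Nearest-rank percentile: the smallest value with at least p of the data at or below it.
  const percentile = (p) =>
    durations.length === 0
      ? NaN
      : durations[Math.min(durations.length - 1, Math.ceil(p * durations.length) - 1)];
  return {
    rate: requests.length / windowSeconds,   // Rate
    errorsPerSecond: failed / windowSeconds, // Errors
    p50: percentile(0.5),                    // Duration
    p95: percentile(0.95),
  };
}

// Ten requests in a 10-second window, two of them failing.
const sample = Array.from({ length: 10 }, (_, i) => ({
  status: i < 2 ? 500 : 200,
  durationMs: (i + 1) * 10,
}));
const red = redMetrics(sample, 10);
console.log(red);
```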

Alert on Symptoms, Not Causes

Good - alert on user impact

  - alert: HighLatency
    expr: p95_latency > 1    # seconds; p95_latency is a placeholder series

Bad - alert on potential cause

  - alert: HighCPU
    expr: cpu_usage > 80     # percent; high CPU may never affect users

Runbook Links

annotations:
  runbook: "https://wiki.example.com/runbooks/high-error-rate"
  dashboard: "https://grafana.example.com/d/abc123"
