Observability & Monitoring
A comprehensive skill for implementing production-grade observability and monitoring using Prometheus, Grafana, and the wider cloud-native monitoring ecosystem. This skill covers metrics collection, time-series analysis, alerting, visualization, and operational excellence patterns.
When to Use This Skill
Use this skill when:
- Setting up monitoring for production systems and applications
- Implementing metrics collection and observability for microservices
- Creating dashboards and visualizations for system health monitoring
- Defining alerting rules and incident response automation
- Analyzing system performance and capacity using time-series data
- Implementing SLIs, SLOs, and SLAs for service reliability
- Debugging production issues using metrics and traces
- Building custom exporters for application-specific metrics
- Setting up federation for multi-cluster monitoring
- Migrating from legacy monitoring to cloud-native solutions
- Implementing cost monitoring and optimization tracking
- Creating real-time operational dashboards for DevOps teams
Core Concepts
The Four Pillars of Observability
Modern observability is built on four fundamental pillars:
Metrics: Numerical measurements of system behavior over time (see the client sketch below)

- Counter: Monotonically increasing values (requests served, errors)
- Gauge: Point-in-time values that can go up and down (memory usage, temperature)
- Histogram: Distribution of values in configurable buckets (request duration)
- Summary: Similar to a histogram, but quantiles are calculated client-side
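A minimal sketch of the four metric types using the official Python client, prometheus_client; metric names and values here are illustrative:

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: only ever goes up (resets on process restart)
REQUESTS = Counter('demo_requests_total', 'Requests served')

# Gauge: can go up and down
MEMORY = Gauge('demo_memory_usage_bytes', 'Resident memory in bytes')

# Histogram: observations counted into cumulative buckets; quantiles
# are derived server-side with histogram_quantile()
LATENCY = Histogram('demo_request_duration_seconds', 'Request latency',
                    buckets=[0.05, 0.1, 0.5, 1.0])

# Summary: client-side aggregation (the Python client exposes count and sum)
PAYLOAD = Summary('demo_payload_size_bytes', 'Payload size')

REQUESTS.inc()
MEMORY.set(512 * 1024 * 1024)
LATENCY.observe(0.23)
PAYLOAD.observe(2048)

start_http_server(8000)  # exposes all of the above at :8000/metrics
```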
Logs: Discrete events with contextual information (a structured-logging sketch follows)

- Structured logging (JSON, key-value pairs)
- Centralized log aggregation (ELK, Loki)
- Correlation with metrics and traces
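As a concrete illustration of structured logging, a stdlib-only sketch that emits one JSON object per log line; the field names (including trace_id for correlation) are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # Illustrative correlation field for joining logs with traces
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `extra` attaches attributes to the record, which the formatter picks up
logger.info("order created", extra={"trace_id": "4bf92f3577b34da6"})
```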
Traces: Request flow through distributed systems (a minimal tracing sketch follows)

- Span: Single unit of work with start/end time
- Trace: Collection of spans representing an end-to-end request
- OpenTelemetry for distributed tracing
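A minimal tracing sketch with the OpenTelemetry Python SDK (assumes the opentelemetry-sdk package is installed; span and attribute names are illustrative):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout instead of exporting to a collector
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("http.method", "GET")
    with tracer.start_as_current_span("db_query"):
        pass  # the query itself would run inside the child span
```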
Events: Significant occurrences in the system lifecycle

- Deployments, configuration changes
- Scaling events, incidents
- Business events and user actions
Prometheus Architecture
Prometheus is a pull-based monitoring system with key components:
Time-Series Database (TSDB)

- Stores metrics as time-series data
- Efficient compression and configurable retention
- Local storage, with optional remote storage integrations

Scrape Targets

- Service discovery (Kubernetes, Consul, EC2, etc.)
- Static configuration
- Relabeling for flexible target selection

PromQL Query Engine

- Powerful query language for metrics analysis
- Aggregation, filtering, and mathematical operations
- Range vectors and instant vectors

Alertmanager

- Receives alerts that Prometheus fires from its alert rules
- Grouping, silencing, and routing
- Integrations with PagerDuty, Slack, email, and webhooks

Exporters

- Bridge between Prometheus and systems that don't expose native metrics
- Node exporter, cAdvisor, custom exporters
- Third-party exporters for databases and services
Metric Labels and Cardinality
Labels are key-value pairs attached to metrics:

http_requests_total{method="GET", endpoint="/api/users", status="200"}

Label Best Practices:

- Use labels for dimensions you query or aggregate by
- Avoid high-cardinality labels (user IDs, timestamps)
- Keep label names consistent across metrics
- Use relabeling to normalize external labels

Cardinality Considerations (a back-of-the-envelope sketch follows):

- Each unique label combination creates a new time-series
- High cardinality means increased memory and storage
- Monitor series growth with prometheus_tsdb_head_series (the symbol-table size, prometheus_tsdb_symbol_table_size_bytes, also grows with cardinality)
- Use recording rules to pre-aggregate high-cardinality metrics
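Series counts grow multiplicatively with label values, which is why a single unbounded label can overwhelm a server. A back-of-the-envelope sketch (all counts are illustrative):

```python
from math import prod

# Distinct values per label on a single metric (illustrative counts)
label_values = {"method": 5, "endpoint": 40, "status": 10}

series = prod(label_values.values())
print(f"worst-case series for this metric: {series:,}")   # 2,000

# Adding an unbounded label such as user_id multiplies everything again
print(f"with 10,000 user IDs: {series * 10_000:,}")        # 20,000,000
```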
Recording Rules
Pre-compute frequently-used or expensive queries:
```yaml
groups:
  - name: api_performance
    interval: 30s
    rules:
      - record: api:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
      - record: api:http_requests:rate5m
        expr: rate(http_requests_total[5m])
```
Benefits:

- Faster dashboard loading
- Reduced query load on Prometheus
- Consistent metric naming conventions
- Enable complex aggregations
Service Level Objectives (SLOs)
Define and track reliability targets:
SLI (Service Level Indicator): a metric measuring service quality

- Availability: % of successful requests
- Latency: % of requests under a threshold
- Throughput: requests per second

SLO (Service Level Objective): a target for an SLI

- 99.9% availability (43.8 minutes of downtime per month)
- 95% of requests < 200ms
- 1000 RPS sustained

SLA (Service Level Agreement): a contract with consequences

- External commitments to customers
- Financial penalties for violations

Error Budget: the acceptable failure rate (see the arithmetic sketch below)

- Error budget = 100% - SLO
- A 99.9% SLO leaves a 0.1% error budget
- Spend the budget on the innovation vs. reliability tradeoff
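Converting an availability SLO into allowed downtime is straight arithmetic; a sketch using a 30.44-day average month:

```python
def allowed_downtime_minutes(slo: float, days: float = 30.44) -> float:
    """Minutes of downtime per window that a given availability SLO permits."""
    return (1 - slo) * days * 24 * 60

for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} SLO -> {allowed_downtime_minutes(slo):.1f} min/month")

# 99.90% SLO -> 43.8 min/month
# 99.95% SLO -> 21.9 min/month
# 99.99% SLO -> 4.4 min/month
```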
Prometheus Setup and Configuration
Basic Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s       # Default scrape interval
  evaluation_interval: 15s   # Rule evaluation interval
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - 'rules/*.yml'
  - 'alerts/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '${1}'

  # Application metrics
  - job_name: 'api'
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
        labels:
          env: 'production'
          tier: 'backend'
```
Kubernetes Service Discovery
```yaml
scrape_configs:
  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Kubernetes pods with prometheus.io annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape: "true" annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the port from the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '${1}:${2}'
        target_label: __address__
      # Use the path from the prometheus.io/path annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: '(.+)'
        target_label: __metrics_path__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Kubernetes services probed via the blackbox exporter
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    metrics_path: /probe
    params:
      module: [http_2xx]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance
```
Storage and Retention
The data path and retention are set with command-line flags rather than in prometheus.yml:

```bash
# Storage configuration (flags, not YAML)
prometheus \
  --storage.tsdb.path=/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB
```

Remote write and remote read are configured in prometheus.yml:

```yaml
# Remote write for long-term storage
remote_write:
  - url: "https://prometheus-remote-storage.example.com/api/v1/write"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_storage_password
    queue_config:
      capacity: 10000
      max_shards: 50
      max_samples_per_send: 5000
    write_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: 'container_network_.*'
        action: drop

# Remote read for querying historical data
remote_read:
  - url: "https://prometheus-remote-storage.example.com/api/v1/read"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_storage_password
    read_recent: true
```
PromQL: The Prometheus Query Language
Instant Vectors and Selectors
```promql
# Basic metric selection
http_requests_total

# Filter by label
http_requests_total{job="api", status="200"}

# Regex matching
http_requests_total{status=~"2..|3.."}

# Negative matching
http_requests_total{status!="500"}

# Multiple label matchers
http_requests_total{job="api", method="GET", status=~"2.."}
```
Range Vectors and Aggregations
```promql
# 5-minute range vector
http_requests_total[5m]

# Per-second rate of increase
rate(http_requests_total[5m])

# Increase over a time window
increase(http_requests_total[1h])

# Average over time
avg_over_time(cpu_usage[5m])

# Max/min over time
max_over_time(response_time_seconds[10m])
min_over_time(response_time_seconds[10m])

# Standard deviation
stddev_over_time(response_time_seconds[5m])
```
Aggregation Operators
```promql
# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum grouped by job
sum by (job) (rate(http_requests_total[5m]))

# Average grouped by multiple labels
avg by (job, instance) (cpu_usage)

# Count the number of series that are up
count(up == 1)

# Top-k and bottom-k
topk(5, rate(http_requests_total[5m]))
bottomk(3, node_memory_MemAvailable_bytes)

# Quantile across instances
quantile(0.95, http_request_duration_seconds)
```
Mathematical Operations
```promql
# Arithmetic: memory utilization %
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100

# Comparison operators (filter out non-matching series)
http_request_duration_seconds > 0.5

# Logical operators
up == 1 and rate(http_requests_total[5m]) > 100

# Vector matching
rate(http_requests_total[5m]) / on(instance) group_left rate(http_responses_total[5m])
```
Advanced PromQL Patterns
```promql
# Request success rate (%)
sum(rate(http_requests_total{status=~"2.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Latency percentiles from a histogram
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Predict linear growth (free bytes 4 hours from now)
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)

# Detect anomalies: more than 3 standard deviations from the 1h mean
abs(cpu_usage - avg_over_time(cpu_usage[1h])) > 3 * stddev_over_time(cpu_usage[1h])

# Average CPU utilization per instance (USE method)
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
  / count(node_cpu_seconds_total{mode="idle"}) by (instance)

# Burn rate for a 99.9% SLO
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
) > (14.4 * (1 - 0.999))
```
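These expressions can also be evaluated programmatically through the Prometheus HTTP API (GET /api/v1/query); a sketch using requests, with the server URL as an assumption:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed local Prometheus

def instant_query(expr: str):
    """Evaluate a PromQL expression via GET /api/v1/query."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success":
        raise RuntimeError(body)
    return body["data"]["result"]

for sample in instant_query('sum by (job) (rate(http_requests_total[5m]))'):
    labels, (ts, value) = sample["metric"], sample["value"]
    print(labels.get("job", "<none>"), value)
```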
Alerting with Prometheus and Alertmanager
Alert Rule Definitions
```yaml
# alerts/api_alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      # High error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service)
            > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
          runbook_url: "https://runbooks.example.com/HighErrorRate"

      # High latency alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s on {{ $labels.service }}"

      # Service down alert
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 2m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes"

      # Disk space alert
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is {{ $value | humanize }}% on {{ $labels.instance }}"

      # Memory pressure alert
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}"

      # CPU saturation alert
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"
```
Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Templates for notifications
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Route tree for alert distribution
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-default'
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    # Critical alerts also go to Slack
    - match:
        severity: critical
      receiver: 'slack-critical'
      group_wait: 0s
    # Warning alerts to Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'
    # Team-specific routing
    - match:
        team: backend
      receiver: 'team-backend'
    - match:
        team: frontend
      receiver: 'team-frontend'
    # Database alerts to the DBA team
    - match_re:
        service: 'postgres|mysql|mongodb'
      receiver: 'team-dba'

# Alert receivers/integrations
receivers:
  - name: 'team-default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        severity: '{{ .CommonLabels.severity }}'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#incidents'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
        color: 'danger'
        send_resolved: true

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#monitoring'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: 'warning'
        send_resolved: true

  - name: 'team-backend'
    slack_configs:
      - channel: '#team-backend'
        send_resolved: true
    email_configs:
      - to: 'backend-team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password_file: '/etc/alertmanager/email_password'

  - name: 'team-frontend'
    slack_configs:
      - channel: '#team-frontend'
        send_resolved: true

  - name: 'team-dba'
    slack_configs:
      - channel: '#team-dba'
        send_resolved: true
    pagerduty_configs:
      - service_key: 'DBA_PAGERDUTY_KEY'

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Inhibit warnings if a critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
  # Don't alert on instance down if the whole cluster is down
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: 'InstanceDown|ServiceDown'
    equal: ['cluster']
```
Multi-Window Multi-Burn-Rate Alerts for SLOs
```yaml
# SLO-based alerting using burn rate
groups:
  - name: slo_alerts
    interval: 30s
    rules:
      # Fast burn (1h long window, 5m short window)
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
          slo: "99.9%"
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 14.4x faster than sustainable"

      # Slow burn (6h long window, 30m short window)
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
              / sum(rate(http_requests_total[6h]))
          ) > (6 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
              / sum(rate(http_requests_total[30m]))
          ) > (6 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
          slo: "99.9%"
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 6x faster than sustainable"
```
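The 14.4 and 6 multipliers come from the budget fraction each window is allowed to consume: a burn rate of B sustained over a window W eats B × W / T of a budget defined over period T. A quick sanity check:

```python
def burn_rate_factor(budget_fraction: float, window_hours: float,
                     slo_period_hours: float = 30 * 24) -> float:
    """Burn-rate factor that consumes `budget_fraction` of the error
    budget within `window_hours` of a `slo_period_hours`-long SLO period."""
    return budget_fraction * slo_period_hours / window_hours

print(burn_rate_factor(0.02, 1))   # 14.4 -> fast burn: 2% of budget in 1h
print(burn_rate_factor(0.05, 6))   # 6.0  -> slow burn: 5% of budget in 6h
```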
Grafana Dashboards and Visualization
Dashboard JSON Structure
```json
{
"dashboard": {
"title": "API Performance Dashboard",
"tags": ["api", "performance", "production"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m"],
"time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
},
"templating": {
"list": [
{
"name": "cluster",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up, cluster)",
"refresh": 1,
"multi": false,
"includeAll": false
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{cluster="$cluster"}, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "interval",
"type": "interval",
"query": "1m,5m,10m,30m,1h",
"auto": true,
"auto_count": 30,
"auto_min": "10s"
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service="$service"}[$interval])) by (service)",
"legendFormat": "{{ service }}",
"refId": "A"
}
],
"yaxes": [
{"format": "reqps", "label": "Requests/sec"},
{"format": "short"}
],
"legend": {
"show": true,
"values": true,
"current": true,
"avg": true,
"max": true
}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service="$service",status="5.."}[$interval])) by (service) / sum(rate(http_requests_total{service="$service"}[$interval])) by (service) * 100",
"legendFormat": "{{ service }} error %",
"refId": "A"
}
],
"yaxes": [
{"format": "percent", "label": "Error Rate"},
{"format": "short"}
],
"alert": {
"conditions": [
{
"evaluator": {"params": [5], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"reducer": {"params": [], "type": "avg"},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"name": "High Error Rate",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 3,
"title": "Latency Percentiles",
"type": "graph",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p99",
            "refId": "A"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p95",
            "refId": "B"
          },
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p50",
            "refId": "C"
          }
}
],
"yaxes": [
{"format": "s", "label": "Duration"},
{"format": "short"}
]
}
]
}
}
```
RED Method Dashboard
The RED method focuses on Request rate, Error rate, and Duration:
{ "panels": [ { "title": "Request Rate (per service)", "targets": [ { "expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)" } ] }, { "title": "Error Rate % (per service)", "targets": [ { "expr": "sum(rate(http_requests_total{status=~"5.."}[$__rate_interval])) by (service) / sum(rate(http_requests_total[$__rate_interval])) by (service) * 100" } ] }, { "title": "Duration p99 (per service)", "targets": [ { "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (service, le))" } ] } ] }
USE Method Dashboard
The USE method monitors Utilization, Saturation, and Errors:
{ "panels": [ { "title": "CPU Utilization %", "targets": [ { "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) * 100)" } ] }, { "title": "CPU Saturation (Load Average)", "targets": [ { "expr": "node_load1" } ] }, { "title": "Memory Utilization %", "targets": [ { "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100" } ] }, { "title": "Disk I/O Utilization %", "targets": [ { "expr": "rate(node_disk_io_time_seconds_total[$__rate_interval]) * 100" } ] }, { "title": "Network Errors", "targets": [ { "expr": "rate(node_network_receive_errs_total[$__rate_interval]) + rate(node_network_transmit_errs_total[$__rate_interval])" } ] } ] }
Exporters and Metric Collection
Node Exporter for System Metrics
```bash
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64

./node_exporter --web.listen-address=":9100" \
  --collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)($|/)" \
  --collector.netclass.ignored-devices="^(veth.*|br.*|docker.*|lo)$"
```
Key Metrics from Node Exporter (a scrape-and-compute sketch follows the list):

- node_cpu_seconds_total: CPU usage by mode
- node_memory_MemTotal_bytes, node_memory_MemAvailable_bytes: memory
- node_disk_io_time_seconds_total: disk I/O
- node_network_receive_bytes_total, node_network_transmit_bytes_total: network
- node_filesystem_size_bytes, node_filesystem_avail_bytes: disk space
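node_exporter serves these metrics as plain text on :9100/metrics, so they are easy to inspect outside Prometheus; a sketch that scrapes the endpoint and derives memory utilization (the localhost URL is an assumption):

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Assumed local node_exporter endpoint
text = requests.get("http://localhost:9100/metrics").text

wanted = {"node_memory_MemTotal_bytes", "node_memory_MemAvailable_bytes"}
values = {}
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name in wanted:
            values[sample.name] = sample.value

used_pct = (1 - values["node_memory_MemAvailable_bytes"]
            / values["node_memory_MemTotal_bytes"]) * 100
print(f"memory utilization: {used_pct:.1f}%")
```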
Custom Application Exporter (Python)
app_exporter.py
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary import time import random
Define metrics
REQUEST_COUNT = Counter( 'app_requests_total', 'Total app requests', ['method', 'endpoint', 'status'] )
REQUEST_DURATION = Histogram( 'app_request_duration_seconds', 'Request duration in seconds', ['method', 'endpoint'], buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0] )
ACTIVE_USERS = Gauge( 'app_active_users', 'Number of active users' )
QUEUE_SIZE = Gauge( 'app_queue_size', 'Current queue size', ['queue_name'] )
DATABASE_CONNECTIONS = Gauge( 'app_database_connections', 'Number of database connections', ['pool', 'state'] )
CACHE_HITS = Counter( 'app_cache_hits_total', 'Total cache hits', ['cache_name'] )
CACHE_MISSES = Counter( 'app_cache_misses_total', 'Total cache misses', ['cache_name'] )
def simulate_metrics(): """Simulate application metrics""" while True: # Simulate requests method = random.choice(['GET', 'POST', 'PUT', 'DELETE']) endpoint = random.choice(['/api/users', '/api/products', '/api/orders']) status = random.choice(['200', '200', '200', '400', '500'])
REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()
# Simulate request duration
duration = random.uniform(0.01, 2.0)
REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(duration)
# Update gauges
ACTIVE_USERS.set(random.randint(100, 1000))
QUEUE_SIZE.labels(queue_name='jobs').set(random.randint(0, 50))
QUEUE_SIZE.labels(queue_name='emails').set(random.randint(0, 20))
# Database connection pool
DATABASE_CONNECTIONS.labels(pool='main', state='active').set(random.randint(5, 20))
DATABASE_CONNECTIONS.labels(pool='main', state='idle').set(random.randint(10, 30))
# Cache metrics
if random.random() > 0.3:
CACHE_HITS.labels(cache_name='redis').inc()
else:
CACHE_MISSES.labels(cache_name='redis').inc()
time.sleep(1)
if name == 'main': # Start metrics server on port 8000 start_http_server(8000) print("Metrics server started on port 8000") simulate_metrics()
Custom Exporter (Go)
```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "app_requests_total",
			Help: "Total number of requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "app_request_duration_seconds",
			Help:    "Request duration in seconds",
			Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
		},
		[]string{"method", "endpoint"},
	)

	activeUsers = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "app_active_users",
			Help: "Number of active users",
		},
	)

	databaseConnections = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "app_database_connections",
			Help: "Number of database connections",
		},
		[]string{"pool", "state"},
	)
)

func init() {
	prometheus.MustRegister(requestsTotal)
	prometheus.MustRegister(requestDuration)
	prometheus.MustRegister(activeUsers)
	prometheus.MustRegister(databaseConnections)
}

func simulateMetrics() {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		// Simulate requests
		methods := []string{"GET", "POST", "PUT", "DELETE"}
		endpoints := []string{"/api/users", "/api/products", "/api/orders"}
		statuses := []string{"200", "200", "200", "400", "500"}

		method := methods[rand.Intn(len(methods))]
		endpoint := endpoints[rand.Intn(len(endpoints))]
		status := statuses[rand.Intn(len(statuses))]

		requestsTotal.WithLabelValues(method, endpoint, status).Inc()
		requestDuration.WithLabelValues(method, endpoint).Observe(rand.Float64() * 2)

		// Update gauges
		activeUsers.Set(float64(rand.Intn(900) + 100))
		databaseConnections.WithLabelValues("main", "active").Set(float64(rand.Intn(15) + 5))
		databaseConnections.WithLabelValues("main", "idle").Set(float64(rand.Intn(20) + 10))
	}
}

func main() {
	go simulateMetrics()

	http.Handle("/metrics", promhttp.Handler())
	log.Println("Starting metrics server on :8000")
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```
PostgreSQL Exporter
```yaml
# docker-compose.yml for postgres_exporter
version: '3.8'
services:
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
    ports:
      - "9187:9187"
    command:
      - '--collector.stat_statements'
      - '--collector.stat_database'
      - '--collector.replication'
```
Key PostgreSQL Metrics:

- pg_up: database reachability
- pg_stat_database_tup_returned: rows read
- pg_stat_database_tup_inserted: rows inserted
- pg_stat_database_deadlocks: deadlock count
- pg_stat_replication_lag: replication lag in seconds
- pg_locks_count: active locks
Blackbox Exporter for Probing
```yaml
# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: "ip4"

  http_post_json:
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key":"value"}'
      valid_status_codes: [200, 201]

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
```
```yaml
# Prometheus config for the blackbox exporter
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```
Best Practices
Metric Naming Conventions
Follow Prometheus naming best practices:
Format: <namespace>_<subsystem>_<name>_<unit>

```promql
# Good examples
http_requests_total               # Counter
http_request_duration_seconds     # Histogram
database_connections_active      # Gauge
cache_hits_total                  # Counter
memory_usage_bytes                # Gauge

# Include unit suffixes:
# _seconds, _bytes, _total, _ratio, _percentage

# Avoid
RequestCount     # Use snake_case
http_requests    # Missing _total for a counter
request_time     # Missing unit (should be _seconds)
```
Label Guidelines
```promql
# Good: low-cardinality labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}

# Bad: high-cardinality labels (avoid)
http_requests_total{user_id="12345", session_id="abc-def-ghi"}

# Good: bounded label values
http_requests_total{status_class="2xx"}

# Bad: unbounded label values
http_requests_total{response_size="1234567"}
```
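One way to enforce bounded label values is to derive them in code before incrementing the metric; a sketch with an illustrative metric name:

```python
from prometheus_client import Counter

REQUESTS = Counter('app_requests_total', 'Requests', ['method', 'status_class'])

def record_request(method: str, status_code: int) -> None:
    # Collapse the unbounded status-code space into five classes: 1xx..5xx
    status_class = f"{status_code // 100}xx"
    REQUESTS.labels(method=method, status_class=status_class).inc()

record_request("GET", 204)   # counted as 2xx
record_request("GET", 503)   # counted as 5xx
```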
Recording Rule Patterns
```yaml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-aggregate expensive queries
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Namespace aggregations
      - record: namespace:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (namespace)

      # SLI calculations
      - record: job:http_requests_success:rate5m
        expr: sum(rate(http_requests_total{status=~"2.."}[5m])) by (job)

      - record: job:http_requests_error_rate:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job)
```
Alert Design Principles
- Alert on symptoms, not causes: alert on user-facing issues
- Make alerts actionable: include runbook links
- Use appropriate severity levels: critical, warning, info
- Set proper thresholds: base them on historical data
- Include context in annotations: help the on-call engineer
- Group related alerts: reduce alert fatigue
- Use inhibition rules: suppress redundant alerts
- Test alert rules: verify they fire when expected
Dashboard Best Practices
- One dashboard per audience: SRE, developers, business
- Use consistent time ranges: make comparisons easier
- Include SLI/SLO metrics: show business impact
- Add annotations for deploys: correlate changes with metrics
- Use template variables: make dashboards reusable
- Show trends and aggregates: not just raw metrics
- Include links to runbooks: enable quick response
- Use appropriate visualizations: graphs, gauges, tables
High Availability Setup
Prometheus HA with Thanos
```yaml
# Deploy multiple Prometheus instances with the same scrape config;
# Thanos deduplicates them by replica label and provides a global view.

# prometheus-1.yml
global:
  external_labels:
    cluster: 'prod'
    replica: '1'

# prometheus-2.yml
global:
  external_labels:
    cluster: 'prod'
    replica: '2'

# The Thanos sidecar runs next to each Prometheus instance:
# - uploads TSDB blocks to object storage
# - exposes a StoreAPI so queries can span all replicas and history
```
Capacity Planning Queries
```promql
# Disk space exhaustion prediction (negative = full within 4 hours)
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0

# Memory growth trend (projected available memory in 24 hours)
predict_linear(node_memory_MemAvailable_bytes[1h], 24 * 3600)

# Request rate growth (projected rate in 7 days)
predict_linear(sum(rate(http_requests_total[1h]))[24h:1h], 7 * 24 * 3600)

# Storage capacity planning
prometheus_tsdb_storage_blocks_bytes / (30 * 24 * 3600)
```
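predict_linear is a least-squares extrapolation over the range vector; a roughly equivalent sketch in Python (assumes numpy; the sample data is synthetic):

```python
import numpy as np

def predict_linear(samples: list[tuple[float, float]], horizon_s: float) -> float:
    """Least-squares fit over (unix_ts, value) samples, extrapolated
    horizon_s seconds past the last sample (mirrors PromQL's predict_linear)."""
    ts = np.array([t for t, _ in samples])
    vals = np.array([v for _, v in samples])
    slope, intercept = np.polyfit(ts, vals, deg=1)
    return slope * (ts[-1] + horizon_s) + intercept

# Free bytes shrinking ~1 MiB per minute; projected value in 4 hours
samples = [(60.0 * i, 50e9 - i * 1024**2) for i in range(60)]
print(predict_linear(samples, 4 * 3600))
```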
Advanced Patterns
Federation for Multi-Cluster Monitoring
```yaml
# Global Prometheus federating from per-cluster Prometheus instances
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'  # All recording rules
    static_configs:
      - targets:
          - 'prometheus-us-west:9090'
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-central:9090'
```
Cost Monitoring Pattern
```yaml
# Track cloud costs with custom metrics
groups:
  - name: cost_tracking
    rules:
      - record: cloud:cost:hourly_rate
        expr: |
          (
            sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.03  # CPU cost/hour
            +
            sum(kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024) * 0.005  # Memory cost/hour
          )

      - record: cloud:cost:monthly_estimate
        expr: cloud:cost:hourly_rate * 730  # Hours in an average month
```
Custom SLO Implementation
```yaml
# SLO: 99.9% availability for the API
groups:
  - name: api_slo
    interval: 30s
    rules:
      # Success rate SLI
      - record: api:sli:success_rate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"2.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))

      # Error budget remaining (fraction of the budget left)
      - record: api:error_budget:remaining
        expr: |
          1 - (
            (1 - api:sli:success_rate)
              / (1 - 0.999)
          )

      # Latency SLI: 1 when p99 < 500ms, otherwise 0
      - record: api:sli:latency_success_rate
        expr: |
          (
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
            ) < bool 0.5
          )
```
Examples Summary
This skill includes 20+ comprehensive examples covering:
- Prometheus configuration (basic, Kubernetes SD, storage)
- PromQL queries (instant vectors, range vectors, aggregations)
- Mathematical operations and advanced patterns
- Alert rule definitions (error rate, latency, resource usage)
- Alertmanager configuration (routing, receivers, inhibition)
- Multi-window multi-burn-rate SLO alerts
- Grafana dashboard JSON (full dashboard, RED method, USE method)
- Custom exporters (Python, Go)
- Third-party exporters (PostgreSQL, Blackbox)
- Recording rules for performance
- Federation for multi-cluster monitoring
- Cost monitoring and SLO implementation
- High availability patterns
- Capacity planning queries
Skill Version: 1.0.0
Last Updated: October 2025
Skill Category: Observability, Monitoring, SRE, DevOps
Compatible With: Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes