Observability & Monitoring
A comprehensive skill for implementing production-grade observability and monitoring using Prometheus, Grafana, and the wider cloud-native monitoring ecosystem. This skill covers metrics collection, time-series analysis, alerting, visualization, and operational excellence patterns.
When to Use This Skill
Use this skill when:
- Setting up monitoring for production systems and applications
- Implementing metrics collection and observability for microservices
- Creating dashboards and visualizations for system health monitoring
- Defining alerting rules and incident response automation
- Analyzing system performance and capacity using time-series data
- Implementing SLIs, SLOs, and SLAs for service reliability
- Debugging production issues using metrics and traces
- Building custom exporters for application-specific metrics
- Setting up federation for multi-cluster monitoring
- Migrating from legacy monitoring to cloud-native solutions
- Implementing cost monitoring and optimization tracking
- Creating real-time operational dashboards for DevOps teams
Core Concepts
The Four Pillars of Observability
Modern observability is built on four fundamental pillars:
Metrics: Numerical measurements of system behavior over time (see the client sketch below)

- Counter: Monotonically increasing values (requests served, errors)
- Gauge: Point-in-time values that can go up and down (memory usage, temperature)
- Histogram: Distribution of values in configurable buckets (request duration)
- Summary: Similar to a histogram, but quantiles are calculated client-side
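A minimal sketch of the four metric types using the official Python client, prometheus_client; metric names and values here are illustrative:

```python
from prometheus_client import Counter, Gauge, Histogram, Summary, start_http_server

# Counter: only ever goes up (resets on process restart)
REQUESTS = Counter('demo_requests_total', 'Requests served')

# Gauge: can go up and down
MEMORY = Gauge('demo_memory_usage_bytes', 'Resident memory in bytes')

# Histogram: observations counted into cumulative buckets; quantiles
# are derived server-side with histogram_quantile()
LATENCY = Histogram('demo_request_duration_seconds', 'Request latency',
                    buckets=[0.05, 0.1, 0.5, 1.0])

# Summary: client-side aggregation (the Python client exposes count and sum)
PAYLOAD = Summary('demo_payload_size_bytes', 'Payload size')

REQUESTS.inc()
MEMORY.set(512 * 1024 * 1024)
LATENCY.observe(0.23)
PAYLOAD.observe(2048)

start_http_server(8000)  # exposes all of the above at :8000/metrics
```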
Logs: Discrete events with contextual information (a structured-logging sketch follows)

- Structured logging (JSON, key-value pairs)
- Centralized log aggregation (ELK, Loki)
- Correlation with metrics and traces
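As a concrete illustration of structured logging, a stdlib-only sketch that emits one JSON object per log line; the field names (including trace_id for correlation) are illustrative:

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
            # Illustrative correlation field for joining logs with traces
            "trace_id": getattr(record, "trace_id", None),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# `extra` attaches attributes to the record, which the formatter picks up
logger.info("order created", extra={"trace_id": "4bf92f3577b34da6"})
```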
Traces: Request flow through distributed systems (a minimal tracing sketch follows)

- Span: Single unit of work with start/end time
- Trace: Collection of spans representing an end-to-end request
- OpenTelemetry for distributed tracing
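A minimal tracing sketch with the OpenTelemetry Python SDK (assumes the opentelemetry-sdk package is installed; span and attribute names are illustrative):

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Print finished spans to stdout instead of exporting to a collector
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("checkout-service")

with tracer.start_as_current_span("handle_request") as span:
    span.set_attribute("http.method", "GET")
    with tracer.start_as_current_span("db_query"):
        pass  # the query itself would run inside the child span
```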
Events: Significant occurrences in the system lifecycle

- Deployments, configuration changes
- Scaling events, incidents
- Business events and user actions
Prometheus Architecture
Prometheus is a pull-based monitoring system with key components:
Time-Series Database (TSDB)

- Stores metrics as time-series data
- Efficient compression and configurable retention
- Local storage, with optional remote storage integrations

Scrape Targets

- Service discovery (Kubernetes, Consul, EC2, etc.)
- Static configuration
- Relabeling for flexible target selection

PromQL Query Engine

- Powerful query language for metrics analysis
- Aggregation, filtering, and mathematical operations
- Range vectors and instant vectors

Alertmanager

- Receives alerts that Prometheus fires from its alert rules
- Grouping, silencing, and routing
- Integrations with PagerDuty, Slack, email, and webhooks

Exporters

- Bridge between Prometheus and systems that don't expose native metrics
- Node exporter, cAdvisor, custom exporters
- Third-party exporters for databases and services
Metric Labels and Cardinality
Labels are key-value pairs attached to metrics:

http_requests_total{method="GET", endpoint="/api/users", status="200"}

Label Best Practices:

- Use labels for dimensions you query or aggregate by
- Avoid high-cardinality labels (user IDs, timestamps)
- Keep label names consistent across metrics
- Use relabeling to normalize external labels

Cardinality Considerations (a back-of-the-envelope sketch follows):

- Each unique label combination creates a new time-series
- High cardinality means increased memory and storage
- Monitor series growth with prometheus_tsdb_head_series (the symbol-table size, prometheus_tsdb_symbol_table_size_bytes, also grows with cardinality)
- Use recording rules to pre-aggregate high-cardinality metrics
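Series counts grow multiplicatively with label values, which is why a single unbounded label can overwhelm a server. A back-of-the-envelope sketch (all counts are illustrative):

```python
from math import prod

# Distinct values per label on a single metric (illustrative counts)
label_values = {"method": 5, "endpoint": 40, "status": 10}

series = prod(label_values.values())
print(f"worst-case series for this metric: {series:,}")   # 2,000

# Adding an unbounded label such as user_id multiplies everything again
print(f"with 10,000 user IDs: {series * 10_000:,}")        # 20,000,000
```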
Recording Rules
Pre-compute frequently-used or expensive queries:
```yaml
groups:
  - name: api_performance
    interval: 30s
    rules:
      - record: api:http_request_duration_seconds:p99
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
      - record: api:http_requests:rate5m
        expr: rate(http_requests_total[5m])
```
Benefits:

- Faster dashboard loading
- Reduced query load on Prometheus
- Consistent metric naming conventions
- Enable complex aggregations
Service Level Objectives (SLOs)
Define and track reliability targets:
SLI (Service Level Indicator): a metric measuring service quality

- Availability: % of successful requests
- Latency: % of requests under a threshold
- Throughput: requests per second

SLO (Service Level Objective): a target for an SLI

- 99.9% availability (43.8 minutes of downtime per month)
- 95% of requests < 200ms
- 1000 RPS sustained

SLA (Service Level Agreement): a contract with consequences

- External commitments to customers
- Financial penalties for violations

Error Budget: the acceptable failure rate (see the arithmetic sketch below)

- Error budget = 100% - SLO
- A 99.9% SLO leaves a 0.1% error budget
- Spend the budget on the innovation vs. reliability tradeoff
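Converting an availability SLO into allowed downtime is straight arithmetic; a sketch using a 30.44-day average month:

```python
def allowed_downtime_minutes(slo: float, days: float = 30.44) -> float:
    """Minutes of downtime per window that a given availability SLO permits."""
    return (1 - slo) * days * 24 * 60

for slo in (0.999, 0.9995, 0.9999):
    print(f"{slo:.2%} SLO -> {allowed_downtime_minutes(slo):.1f} min/month")

# 99.90% SLO -> 43.8 min/month
# 99.95% SLO -> 21.9 min/month
# 99.99% SLO -> 4.4 min/month
```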
Prometheus Setup and Configuration
Basic Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s       # Default scrape interval
  evaluation_interval: 15s   # Rule evaluation interval
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules
rule_files:
  - 'rules/*.yml'
  - 'alerts/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporter for system metrics
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+):.*'
        replacement: '${1}'

  # Application metrics
  - job_name: 'api'
    static_configs:
      - targets: ['api-1:8080', 'api-2:8080', 'api-3:8080']
        labels:
          env: 'production'
          tier: 'backend'
```
Kubernetes Service Discovery
```yaml
scrape_configs:
  # Kubernetes API server
  - job_name: 'kubernetes-apiservers'
    kubernetes_sd_configs:
      - role: endpoints
    scheme: https
    tls_config:
      ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    relabel_configs:
      - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
        action: keep
        regex: default;kubernetes;https

  # Kubernetes pods with prometheus.io annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Only scrape pods with the prometheus.io/scrape: "true" annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Use the port from the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '${1}:${2}'
        target_label: __address__
      # Use the path from the prometheus.io/path annotation
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        regex: '(.+)'
        target_label: __metrics_path__
      # Add namespace label
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      # Add pod name label
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

  # Kubernetes services probed via the blackbox exporter
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    metrics_path: /probe
    params:
      module: [http_2xx]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_probe]
        action: keep
        regex: true
      - source_labels: [__address__]
        target_label: __param_target
      - target_label: __address__
        replacement: blackbox-exporter:9115
      - source_labels: [__param_target]
        target_label: instance
```
Storage and Retention
The data path and retention are set with command-line flags rather than in prometheus.yml:

```bash
# Storage configuration (flags, not YAML)
prometheus \
  --storage.tsdb.path=/prometheus/data \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=50GB
```

Remote write and remote read are configured in prometheus.yml:

```yaml
# Remote write for long-term storage
remote_write:
  - url: "https://prometheus-remote-storage.example.com/api/v1/write"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_storage_password
    queue_config:
      capacity: 10000
      max_shards: 50
      max_samples_per_send: 5000
    write_relabel_configs:
      # Drop high-cardinality metrics
      - source_labels: [__name__]
        regex: 'container_network_.*'
        action: drop

# Remote read for querying historical data
remote_read:
  - url: "https://prometheus-remote-storage.example.com/api/v1/read"
    basic_auth:
      username: prometheus
      password_file: /etc/prometheus/remote_storage_password
    read_recent: true
```
PromQL: The Prometheus Query Language
Instant Vectors and Selectors
```promql
# Basic metric selection
http_requests_total

# Filter by label
http_requests_total{job="api", status="200"}

# Regex matching
http_requests_total{status=~"2..|3.."}

# Negative matching
http_requests_total{status!="500"}

# Multiple label matchers
http_requests_total{job="api", method="GET", status=~"2.."}
```
Range Vectors and Aggregations
```promql
# 5-minute range vector
http_requests_total[5m]

# Per-second rate of increase
rate(http_requests_total[5m])

# Increase over a time window
increase(http_requests_total[1h])

# Average over time
avg_over_time(cpu_usage[5m])

# Max/min over time
max_over_time(response_time_seconds[10m])
min_over_time(response_time_seconds[10m])

# Standard deviation
stddev_over_time(response_time_seconds[5m])
```
Aggregation Operators
```promql
# Sum across all instances
sum(rate(http_requests_total[5m]))

# Sum grouped by job
sum by (job) (rate(http_requests_total[5m]))

# Average grouped by multiple labels
avg by (job, instance) (cpu_usage)

# Count the number of series that are up
count(up == 1)

# Top-k and bottom-k
topk(5, rate(http_requests_total[5m]))
bottomk(3, node_memory_MemAvailable_bytes)

# Quantile across instances
quantile(0.95, http_request_duration_seconds)
```
Mathematical Operations
```promql
# Arithmetic: memory utilization %
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
  / node_memory_MemTotal_bytes * 100

# Comparison operators (filter out non-matching series)
http_request_duration_seconds > 0.5

# Logical operators
up == 1 and rate(http_requests_total[5m]) > 100

# Vector matching
rate(http_requests_total[5m]) / on(instance) group_left rate(http_responses_total[5m])
```
Advanced PromQL Patterns
```promql
# Request success rate (%)
sum(rate(http_requests_total{status=~"2.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Error rate (%)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m])) * 100

# Latency percentiles from a histogram
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# Predict linear growth (free bytes 4 hours from now)
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600)

# Detect anomalies: more than 3 standard deviations from the 1h mean
abs(cpu_usage - avg_over_time(cpu_usage[1h])) > 3 * stddev_over_time(cpu_usage[1h])

# Average CPU utilization per instance (USE method)
sum(rate(node_cpu_seconds_total{mode!="idle"}[5m])) by (instance)
  / count(node_cpu_seconds_total{mode="idle"}) by (instance)

# Burn rate for a 99.9% SLO
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
) > (14.4 * (1 - 0.999))
```
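These expressions can also be evaluated programmatically through the Prometheus HTTP API (GET /api/v1/query); a sketch using requests, with the server URL as an assumption:

```python
import requests

PROM_URL = "http://localhost:9090"  # assumed local Prometheus

def instant_query(expr: str):
    """Evaluate a PromQL expression via GET /api/v1/query."""
    resp = requests.get(f"{PROM_URL}/api/v1/query", params={"query": expr})
    resp.raise_for_status()
    body = resp.json()
    if body["status"] != "success":
        raise RuntimeError(body)
    return body["data"]["result"]

for sample in instant_query('sum by (job) (rate(http_requests_total[5m]))'):
    labels, (ts, value) = sample["metric"], sample["value"]
    print(labels.get("job", "<none>"), value)
```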
Alerting with Prometheus and Alertmanager
Alert Rule Definitions
```yaml
# alerts/api_alerts.yml
groups:
  - name: api_alerts
    interval: 30s
    rules:
      # High error rate alert
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
            /
          sum(rate(http_requests_total[5m])) by (service)
            > 0.05
        for: 5m
        labels:
          severity: critical
          team: backend
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} on {{ $labels.service }}"
          runbook_url: "https://runbooks.example.com/HighErrorRate"

      # High latency alert
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by (service, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 10m
        labels:
          severity: warning
          team: backend
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "P99 latency is {{ $value }}s on {{ $labels.service }}"

      # Service down alert
      - alert: ServiceDown
        expr: up{job="api"} == 0
        for: 2m
        labels:
          severity: critical
          team: sre
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} on {{ $labels.instance }} has been down for more than 2 minutes"

      # Disk space alert
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 10
        for: 5m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is {{ $value | humanize }}% on {{ $labels.instance }}"

      # Memory pressure alert
      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanize }}% on {{ $labels.instance }}"

      # CPU saturation alert
      - alert: HighCPUUsage
        expr: |
          100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 15m
        labels:
          severity: warning
          team: sre
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | humanize }}% on {{ $labels.instance }}"
```
Alertmanager Configuration
```yaml
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
  pagerduty_url: 'https://events.pagerduty.com/v2/enqueue'

# Templates for notifications
templates:
  - '/etc/alertmanager/templates/*.tmpl'

# Route tree for alert distribution
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'team-default'
  routes:
    # Critical alerts go to PagerDuty
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
      continue: true
    # Critical alerts also go to Slack
    - match:
        severity: critical
      receiver: 'slack-critical'
      group_wait: 0s
    # Warning alerts to Slack only
    - match:
        severity: warning
      receiver: 'slack-warnings'
    # Team-specific routing
    - match:
        team: backend
      receiver: 'team-backend'
    - match:
        team: frontend
      receiver: 'team-frontend'
    # Database alerts to the DBA team
    - match_re:
        service: 'postgres|mysql|mongodb'
      receiver: 'team-dba'

# Alert receivers/integrations
receivers:
  - name: 'team-default'
    slack_configs:
      - channel: '#alerts'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        send_resolved: true

  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_PAGERDUTY_SERVICE_KEY'
        description: '{{ .GroupLabels.alertname }}: {{ .GroupLabels.service }}'
        severity: '{{ .CommonLabels.severity }}'

  - name: 'slack-critical'
    slack_configs:
      - channel: '#incidents'
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}\n{{ .Annotations.description }}{{ end }}'
        color: 'danger'
        send_resolved: true

  - name: 'slack-warnings'
    slack_configs:
      - channel: '#monitoring'
        title: 'Warning: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
        color: 'warning'
        send_resolved: true

  - name: 'team-backend'
    slack_configs:
      - channel: '#team-backend'
        send_resolved: true
    email_configs:
      - to: 'backend-team@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.gmail.com:587'
        auth_username: 'alertmanager@example.com'
        auth_password_file: '/etc/alertmanager/email_password'

  - name: 'team-frontend'
    slack_configs:
      - channel: '#team-frontend'
        send_resolved: true

  - name: 'team-dba'
    slack_configs:
      - channel: '#team-dba'
        send_resolved: true
    pagerduty_configs:
      - service_key: 'DBA_PAGERDUTY_KEY'

# Inhibition rules (suppress alerts)
inhibit_rules:
  # Inhibit warnings if a critical alert is firing
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
  # Don't alert on instance down if the whole cluster is down
  - source_match:
      alertname: 'ClusterDown'
    target_match_re:
      alertname: 'InstanceDown|ServiceDown'
    equal: ['cluster']
```
Multi-Window Multi-Burn-Rate Alerts for SLOs
```yaml
# SLO-based alerting using burn rate
groups:
  - name: slo_alerts
    interval: 30s
    rules:
      # Fast burn (1h long window, 5m short window)
      - alert: ErrorBudgetBurnFast
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
              / sum(rate(http_requests_total[1h]))
          ) > (14.4 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))
          ) > (14.4 * (1 - 0.999))
        for: 2m
        labels:
          severity: critical
          slo: "99.9%"
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 14.4x faster than sustainable"

      # Slow burn (6h long window, 30m short window)
      - alert: ErrorBudgetBurnSlow
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[6h]))
              / sum(rate(http_requests_total[6h]))
          ) > (6 * (1 - 0.999))
          and
          (
            sum(rate(http_requests_total{status=~"5.."}[30m]))
              / sum(rate(http_requests_total[30m]))
          ) > (6 * (1 - 0.999))
        for: 15m
        labels:
          severity: warning
          slo: "99.9%"
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error rate is burning through the 99.9% SLO budget 6x faster than sustainable"
```
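The 14.4 and 6 multipliers come from the budget fraction each window is allowed to consume: a burn rate of B sustained over a window W eats B × W / T of a budget defined over period T. A quick sanity check:

```python
def burn_rate_factor(budget_fraction: float, window_hours: float,
                     slo_period_hours: float = 30 * 24) -> float:
    """Burn-rate factor that consumes `budget_fraction` of the error
    budget within `window_hours` of a `slo_period_hours`-long SLO period."""
    return budget_fraction * slo_period_hours / window_hours

print(burn_rate_factor(0.02, 1))   # 14.4 -> fast burn: 2% of budget in 1h
print(burn_rate_factor(0.05, 6))   # 6.0  -> slow burn: 5% of budget in 6h
```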
Grafana Dashboards and Visualization
Dashboard JSON Structure
```json
{
"dashboard": {
"title": "API Performance Dashboard",
"tags": ["api", "performance", "production"],
"timezone": "browser",
"editable": true,
"graphTooltip": 1,
"time": {
"from": "now-6h",
"to": "now"
},
"timepicker": {
"refresh_intervals": ["5s", "10s", "30s", "1m", "5m", "15m"],
"time_options": ["5m", "15m", "1h", "6h", "12h", "24h", "7d"]
},
"templating": {
"list": [
{
"name": "cluster",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up, cluster)",
"refresh": 1,
"multi": false,
"includeAll": false
},
{
"name": "service",
"type": "query",
"datasource": "Prometheus",
"query": "label_values(up{cluster="$cluster"}, service)",
"refresh": 1,
"multi": true,
"includeAll": true
},
{
"name": "interval",
"type": "interval",
"query": "1m,5m,10m,30m,1h",
"auto": true,
"auto_count": 30,
"auto_min": "10s"
}
]
},
"panels": [
{
"id": 1,
"title": "Request Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service="$service"}[$interval])) by (service)",
"legendFormat": "{{ service }}",
"refId": "A"
}
],
"yaxes": [
{"format": "reqps", "label": "Requests/sec"},
{"format": "short"}
],
"legend": {
"show": true,
"values": true,
"current": true,
"avg": true,
"max": true
}
},
{
"id": 2,
"title": "Error Rate",
"type": "graph",
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"targets": [
{
"expr": "sum(rate(http_requests_total{service="$service",status="5.."}[$interval])) by (service) / sum(rate(http_requests_total{service="$service"}[$interval])) by (service) * 100",
"legendFormat": "{{ service }} error %",
"refId": "A"
}
],
"yaxes": [
{"format": "percent", "label": "Error Rate"},
{"format": "short"}
],
"alert": {
"conditions": [
{
"evaluator": {"params": [5], "type": "gt"},
"operator": {"type": "and"},
"query": {"params": ["A", "5m", "now"]},
"reducer": {"params": [], "type": "avg"},
"type": "query"
}
],
"executionErrorState": "alerting",
"frequency": "1m",
"handler": 1,
"name": "High Error Rate",
"noDataState": "no_data",
"notifications": []
}
},
{
"id": 3,
"title": "Latency Percentiles",
"type": "graph",
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p99",
            "refId": "A"
          },
          {
            "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p95",
            "refId": "B"
          },
          {
            "expr": "histogram_quantile(0.50, sum(rate(http_request_duration_seconds_bucket{service=~\"$service\"}[$interval])) by (service, le))",
            "legendFormat": "{{ service }} p50",
            "refId": "C"
          }
}
],
"yaxes": [
{"format": "s", "label": "Duration"},
{"format": "short"}
]
}
]
}
}
```
RED Method Dashboard
The RED method focuses on Request rate, Error rate, and Duration:
{ "panels": [ { "title": "Request Rate (per service)", "targets": [ { "expr": "sum(rate(http_requests_total[$__rate_interval])) by (service)" } ] }, { "title": "Error Rate % (per service)", "targets": [ { "expr": "sum(rate(http_requests_total{status=~"5.."}[$__rate_interval])) by (service) / sum(rate(http_requests_total[$__rate_interval])) by (service) * 100" } ] }, { "title": "Duration p99 (per service)", "targets": [ { "expr": "histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[$__rate_interval])) by (service, le))" } ] } ] }
USE Method Dashboard
The USE method monitors Utilization, Saturation, and Errors:
{ "panels": [ { "title": "CPU Utilization %", "targets": [ { "expr": "100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[$__rate_interval])) * 100)" } ] }, { "title": "CPU Saturation (Load Average)", "targets": [ { "expr": "node_load1" } ] }, { "title": "Memory Utilization %", "targets": [ { "expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100" } ] }, { "title": "Disk I/O Utilization %", "targets": [ { "expr": "rate(node_disk_io_time_seconds_total[$__rate_interval]) * 100" } ] }, { "title": "Network Errors", "targets": [ { "expr": "rate(node_network_receive_errs_total[$__rate_interval]) + rate(node_network_transmit_errs_total[$__rate_interval])" } ] } ] }
Exporters and Metric Collection
Node Exporter for System Metrics
```bash
# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
cd node_exporter-1.6.1.linux-amd64

./node_exporter --web.listen-address=":9100" \
  --collector.filesystem.mount-points-exclude="^/(dev|proc|sys|var/lib/docker/.+)($|/)" \
  --collector.netclass.ignored-devices="^(veth.*|br.*|docker.*|lo)$"
```
Key Metrics from Node Exporter (a scrape-and-compute sketch follows the list):

- node_cpu_seconds_total: CPU usage by mode
- node_memory_MemTotal_bytes, node_memory_MemAvailable_bytes: memory
- node_disk_io_time_seconds_total: disk I/O
- node_network_receive_bytes_total, node_network_transmit_bytes_total: network
- node_filesystem_size_bytes, node_filesystem_avail_bytes: disk space
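node_exporter serves these metrics as plain text on :9100/metrics, so they are easy to inspect outside Prometheus; a sketch that scrapes the endpoint and derives memory utilization (the localhost URL is an assumption):

```python
import requests
from prometheus_client.parser import text_string_to_metric_families

# Assumed local node_exporter endpoint
text = requests.get("http://localhost:9100/metrics").text

wanted = {"node_memory_MemTotal_bytes", "node_memory_MemAvailable_bytes"}
values = {}
for family in text_string_to_metric_families(text):
    for sample in family.samples:
        if sample.name in wanted:
            values[sample.name] = sample.value

used_pct = (1 - values["node_memory_MemAvailable_bytes"]
            / values["node_memory_MemTotal_bytes"]) * 100
print(f"memory utilization: {used_pct:.1f}%")
```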
Custom Application Exporter (Python)
app_exporter.py
from prometheus_client import start_http_server, Counter, Gauge, Histogram, Summary import time import random
Define metrics
REQUEST_COUNT = Counter( 'app_requests_total', 'Total app requests', ['method', 'endpoint', 'status'] )
REQUEST_DURATION = Histogram( 'app_request_duration_seconds', 'Request duration in seconds', ['method', 'endpoint'], buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0] )
ACTIVE_USERS = Gauge( 'app_active_users', 'Number of active users' )
QUEUE_SIZE = Gauge( 'app_queue_size', 'Current queue size', ['queue_name'] )
DATABASE_CONNECTIONS = Gauge( 'app_database_connections', 'Number of database connections', ['pool', 'state'] )
CACHE_HITS = Counter( 'app_cache_hits_total', 'Total cache hits', ['cache_name'] )
CACHE_MISSES = Counter( 'app_cache_misses_total', 'Total cache misses', ['cache_name'] )
def simulate_metrics(): """Simulate application metrics""" while True: # Simulate requests method = random.choice(['GET', 'POST', 'PUT', 'DELETE']) endpoint = random.choice(['/api/users', '/api/products', '/api/orders']) status = random.choice(['200', '200', '200', '400', '500'])
REQUEST_COUNT.labels(method=method, endpoint=endpoint, status=status).inc()
# Simulate request duration
duration = random.uniform(0.01, 2.0)
REQUEST_DURATION.labels(method=method, endpoint=endpoint).observe(duration)
# Update gauges
ACTIVE_USERS.set(random.randint(100, 1000))
QUEUE_SIZE.labels(queue_name='jobs').set(random.randint(0, 50))
QUEUE_SIZE.labels(queue_name='emails').set(random.randint(0, 20))
# Database connection pool
DATABASE_CONNECTIONS.labels(pool='main', state='active').set(random.randint(5, 20))
DATABASE_CONNECTIONS.labels(pool='main', state='idle').set(random.randint(10, 30))
# Cache metrics
if random.random() > 0.3:
CACHE_HITS.labels(cache_name='redis').inc()
else:
CACHE_MISSES.labels(cache_name='redis').inc()
time.sleep(1)
if name == 'main': # Start metrics server on port 8000 start_http_server(8000) print("Metrics server started on port 8000") simulate_metrics()
Custom Exporter (Go)
```go
package main

import (
	"log"
	"math/rand"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	requestsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "app_requests_total",
			Help: "Total number of requests",
		},
		[]string{"method", "endpoint", "status"},
	)

	requestDuration = prometheus.NewHistogramVec(
		prometheus.HistogramOpts{
			Name:    "app_request_duration_seconds",
			Help:    "Request duration in seconds",
			Buckets: prometheus.ExponentialBuckets(0.01, 2, 10),
		},
		[]string{"method", "endpoint"},
	)

	activeUsers = prometheus.NewGauge(
		prometheus.GaugeOpts{
			Name: "app_active_users",
			Help: "Number of active users",
		},
	)

	databaseConnections = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Name: "app_database_connections",
			Help: "Number of database connections",
		},
		[]string{"pool", "state"},
	)
)

func init() {
	prometheus.MustRegister(requestsTotal)
	prometheus.MustRegister(requestDuration)
	prometheus.MustRegister(activeUsers)
	prometheus.MustRegister(databaseConnections)
}

func simulateMetrics() {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		// Simulate requests
		methods := []string{"GET", "POST", "PUT", "DELETE"}
		endpoints := []string{"/api/users", "/api/products", "/api/orders"}
		statuses := []string{"200", "200", "200", "400", "500"}

		method := methods[rand.Intn(len(methods))]
		endpoint := endpoints[rand.Intn(len(endpoints))]
		status := statuses[rand.Intn(len(statuses))]

		requestsTotal.WithLabelValues(method, endpoint, status).Inc()
		requestDuration.WithLabelValues(method, endpoint).Observe(rand.Float64() * 2)

		// Update gauges
		activeUsers.Set(float64(rand.Intn(900) + 100))
		databaseConnections.WithLabelValues("main", "active").Set(float64(rand.Intn(15) + 5))
		databaseConnections.WithLabelValues("main", "idle").Set(float64(rand.Intn(20) + 10))
	}
}

func main() {
	go simulateMetrics()

	http.Handle("/metrics", promhttp.Handler())
	log.Println("Starting metrics server on :8000")
	log.Fatal(http.ListenAndServe(":8000", nil))
}
```
PostgreSQL Exporter
```yaml
# docker-compose.yml for postgres_exporter
version: '3.8'
services:
  postgres-exporter:
    image: prometheuscommunity/postgres-exporter
    environment:
      DATA_SOURCE_NAME: "postgresql://user:password@postgres:5432/dbname?sslmode=disable"
    ports:
      - "9187:9187"
    command:
      - '--collector.stat_statements'
      - '--collector.stat_database'
      - '--collector.replication'
```
Key PostgreSQL Metrics:

- pg_up: database reachability
- pg_stat_database_tup_returned: rows read
- pg_stat_database_tup_inserted: rows inserted
- pg_stat_database_deadlocks: deadlock count
- pg_stat_replication_lag: replication lag in seconds
- pg_locks_count: active locks
Blackbox Exporter for Probing
```yaml
# blackbox.yml
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      valid_http_versions: ["HTTP/1.1", "HTTP/2.0"]
      valid_status_codes: [200]
      method: GET
      preferred_ip_protocol: "ip4"

  http_post_json:
    prober: http
    http:
      method: POST
      headers:
        Content-Type: application/json
      body: '{"key":"value"}'
      valid_status_codes: [200, 201]

  tcp_connect:
    prober: tcp
    timeout: 5s

  icmp:
    prober: icmp
    timeout: 5s
    icmp:
      preferred_ip_protocol: "ip4"
```
```yaml
# Prometheus config for the blackbox exporter
scrape_configs:
  - job_name: 'blackbox-http'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com/health
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
```
Best Practices
Metric Naming Conventions
Follow Prometheus naming best practices:
Format: <namespace>_<subsystem>_<name>_<unit>

```promql
# Good examples
http_requests_total               # Counter
http_request_duration_seconds     # Histogram
database_connections_active      # Gauge
cache_hits_total                  # Counter
memory_usage_bytes                # Gauge

# Include unit suffixes:
# _seconds, _bytes, _total, _ratio, _percentage

# Avoid
RequestCount     # Use snake_case
http_requests    # Missing _total for a counter
request_time     # Missing unit (should be _seconds)
```
Label Guidelines
```promql
# Good: low-cardinality labels
http_requests_total{method="GET", endpoint="/api/users", status="200"}

# Bad: high-cardinality labels (avoid)
http_requests_total{user_id="12345", session_id="abc-def-ghi"}

# Good: bounded label values
http_requests_total{status_class="2xx"}

# Bad: unbounded label values
http_requests_total{response_size="1234567"}
```
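One way to enforce bounded label values is to derive them in code before incrementing the metric; a sketch with an illustrative metric name:

```python
from prometheus_client import Counter

REQUESTS = Counter('app_requests_total', 'Requests', ['method', 'status_class'])

def record_request(method: str, status_code: int) -> None:
    # Collapse the unbounded status-code space into five classes: 1xx..5xx
    status_class = f"{status_code // 100}xx"
    REQUESTS.labels(method=method, status_class=status_class).inc()

record_request("GET", 204)   # counted as 2xx
record_request("GET", 503)   # counted as 5xx
```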
Recording Rule Patterns
```yaml
groups:
  - name: performance_rules
    interval: 30s
    rules:
      # Pre-aggregate expensive queries
      - record: job:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (job)

      # Namespace aggregations
      - record: namespace:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (namespace)

      # SLI calculations
      - record: job:http_requests_success:rate5m
        expr: sum(rate(http_requests_total{status=~"2.."}[5m])) by (job)

      - record: job:http_requests_error_rate:ratio
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (job)
            / sum(rate(http_requests_total[5m])) by (job)
```
Alert Design Principles
- Alert on symptoms, not causes: alert on user-facing issues
- Make alerts actionable: include runbook links
- Use appropriate severity levels: critical, warning, info
- Set proper thresholds: base them on historical data
- Include context in annotations: help the on-call engineer
- Group related alerts: reduce alert fatigue
- Use inhibition rules: suppress redundant alerts
- Test alert rules: verify they fire when expected
Dashboard Best Practices
- One dashboard per audience: SRE, developers, business
- Use consistent time ranges: make comparisons easier
- Include SLI/SLO metrics: show business impact
- Add annotations for deploys: correlate changes with metrics
- Use template variables: make dashboards reusable
- Show trends and aggregates: not just raw metrics
- Include links to runbooks: enable quick response
- Use appropriate visualizations: graphs, gauges, tables
High Availability Setup
Prometheus HA with Thanos
```yaml
# Deploy multiple Prometheus instances with the same scrape config;
# Thanos deduplicates them by replica label and provides a global view.

# prometheus-1.yml
global:
  external_labels:
    cluster: 'prod'
    replica: '1'

# prometheus-2.yml
global:
  external_labels:
    cluster: 'prod'
    replica: '2'

# The Thanos sidecar runs next to each Prometheus instance:
# - uploads TSDB blocks to object storage
# - exposes a StoreAPI so queries can span all replicas and history
```
Capacity Planning Queries
```promql
# Disk space exhaustion prediction (negative = full within 4 hours)
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0

# Memory growth trend (projected available memory in 24 hours)
predict_linear(node_memory_MemAvailable_bytes[1h], 24 * 3600)

# Request rate growth (projected rate in 7 days)
predict_linear(sum(rate(http_requests_total[1h]))[24h:1h], 7 * 24 * 3600)

# Storage capacity planning
prometheus_tsdb_storage_blocks_bytes / (30 * 24 * 3600)
```
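predict_linear is a least-squares extrapolation over the range vector; a roughly equivalent sketch in Python (assumes numpy; the sample data is synthetic):

```python
import numpy as np

def predict_linear(samples: list[tuple[float, float]], horizon_s: float) -> float:
    """Least-squares fit over (unix_ts, value) samples, extrapolated
    horizon_s seconds past the last sample (mirrors PromQL's predict_linear)."""
    ts = np.array([t for t, _ in samples])
    vals = np.array([v for _, v in samples])
    slope, intercept = np.polyfit(ts, vals, deg=1)
    return slope * (ts[-1] + horizon_s) + intercept

# Free bytes shrinking ~1 MiB per minute; projected value in 4 hours
samples = [(60.0 * i, 50e9 - i * 1024**2) for i in range(60)]
print(predict_linear(samples, 4 * 3600))
```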
Advanced Patterns
Federation for Multi-Cluster Monitoring
```yaml
# Global Prometheus federating from per-cluster Prometheus instances
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'  # All recording rules
    static_configs:
      - targets:
          - 'prometheus-us-west:9090'
          - 'prometheus-us-east:9090'
          - 'prometheus-eu-central:9090'
```
Cost Monitoring Pattern
```yaml
# Track cloud costs with custom metrics
groups:
  - name: cost_tracking
    rules:
      - record: cloud:cost:hourly_rate
        expr: |
          (
            sum(kube_pod_container_resource_requests{resource="cpu"}) * 0.03  # CPU cost/hour
            +
            sum(kube_pod_container_resource_requests{resource="memory"} / 1024 / 1024 / 1024) * 0.005  # Memory cost/hour
          )

      - record: cloud:cost:monthly_estimate
        expr: cloud:cost:hourly_rate * 730  # Hours in an average month
```
Custom SLO Implementation
```yaml
# SLO: 99.9% availability for the API
groups:
  - name: api_slo
    interval: 30s
    rules:
      # Success rate SLI
      - record: api:sli:success_rate
        expr: |
          sum(rate(http_requests_total{job="api",status=~"2.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m]))

      # Error budget remaining (fraction of the budget left)
      - record: api:error_budget:remaining
        expr: |
          1 - (
            (1 - api:sli:success_rate)
              / (1 - 0.999)
          )

      # Latency SLI: 1 when p99 < 500ms, otherwise 0
      - record: api:sli:latency_success_rate
        expr: |
          (
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="api"}[5m])) by (le)
            ) < bool 0.5
          )
```
Examples Summary
This skill includes 20+ comprehensive examples covering:
- Prometheus configuration (basic, Kubernetes SD, storage)
- PromQL queries (instant vectors, range vectors, aggregations)
- Mathematical operations and advanced patterns
- Alert rule definitions (error rate, latency, resource usage)
- Alertmanager configuration (routing, receivers, inhibition)
- Multi-window multi-burn-rate SLO alerts
- Grafana dashboard JSON (full dashboard, RED method, USE method)
- Custom exporters (Python, Go)
- Third-party exporters (PostgreSQL, Blackbox)
- Recording rules for performance
- Federation for multi-cluster monitoring
- Cost monitoring and SLO implementation
- High availability patterns
- Capacity planning queries
Skill Version: 1.0.0
Last Updated: October 2025
Skill Category: Observability, Monitoring, SRE, DevOps
Compatible With: Prometheus, Grafana, Alertmanager, OpenTelemetry, Kubernetes