
Prometheus & Grafana

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "prometheus-grafana" with this command: npx skills add bagelhole/devops-security-agent-skills/bagelhole-devops-security-agent-skills-prometheus-grafana


Collect metrics and visualize system performance with the Prometheus-Grafana stack.

When to Use This Skill

Use this skill when:

  • Setting up metrics collection infrastructure

  • Creating monitoring dashboards

  • Writing PromQL queries for analysis

  • Configuring alerting rules

  • Monitoring Kubernetes clusters

Prerequisites

  • Docker or Kubernetes for deployment

  • Network access to monitored targets

  • Basic understanding of metrics concepts

Prometheus Setup

Docker Deployment

docker-compose.yml

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=15d'

  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    volumes:
      - grafana-data:/var/lib/grafana
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

volumes:
  prometheus-data:
  grafana-data:

Configuration

prometheus.yml

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - /etc/prometheus/rules/*.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
      - targets:
          - 'node-exporter:9100'

  - job_name: 'applications'
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'app1:8080'
          - 'app2:8080'

Kubernetes Deployment

Using Helm

Add Prometheus community Helm repo

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

Install kube-prometheus-stack

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set grafana.adminPassword=admin

ServiceMonitor

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: myapp
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
  namespaceSelector:
    matchNames:
      - default

PromQL Queries

Basic Queries

CPU usage percentage (derived from the idle counter)

100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))

Rate of HTTP requests per second

rate(http_requests_total[5m])

Average response time over the last 5 minutes

sum(rate(http_request_duration_seconds_sum[5m])) / sum(rate(http_request_duration_seconds_count[5m]))

Memory usage percentage

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
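Conceptually, the rate() function used above walks the counter samples inside the window and divides the observed increase by the elapsed time, treating any drop in value as a counter reset. A rough stdlib-only Python sketch (illustrative only; real Prometheus additionally extrapolates to the window boundaries):

```python
def simple_rate(samples):
    """Approximate PromQL rate() over (timestamp, counter_value) samples,
    oldest first. Counters only go up; a drop means the process restarted,
    so everything accumulated since the reset is counted as new increase."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:       # counter dropped: reset detected
            increase += value  # count everything since the restart
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0

# 90 observed increments over 50 seconds, despite a reset at t=30 -> 1.8/s
samples = [(0, 0), (10, 20), (20, 40), (30, 10), (40, 30), (50, 50)]
```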

Aggregations

Sum requests by status code

sum by (status_code) (rate(http_requests_total[5m]))

Average CPU by instance

avg by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m]))

Top 5 endpoints by request count

topk(5, sum by (endpoint) (rate(http_requests_total[5m])))

95th percentile latency

histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
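histogram_quantile() estimates a quantile from cumulative buckets: it finds the bucket containing the target rank and interpolates linearly inside it. A simplified Python sketch (assumes the lowest bucket starts at 0 and ignores native histograms):

```python
def hist_quantile(q, buckets):
    """Sketch of PromQL histogram_quantile().
    buckets: {upper_bound_le: cumulative_count}, must include float('inf')."""
    bounds = sorted(buckets)
    total = buckets[float('inf')]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if le == float('inf'):
                return prev_bound  # quantile beyond the last finite bucket
            # interpolate linearly within this bucket
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (le - prev_bound) * fraction
        prev_bound, prev_count = le, count

# 100 observations; rank 95 lands halfway into the 0.5-1.0s bucket -> 0.75s
buckets = {0.1: 50, 0.5: 90, 1.0: 100, float('inf'): 100}
```

This is also why bucket boundaries matter: the estimate can never be more precise than the bucket the quantile falls into.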

Time-Based Queries

Compare to 1 hour ago

http_requests_total - http_requests_total offset 1h

Predict disk space in 4 hours

predict_linear(node_filesystem_avail_bytes[1h], 4 * 3600)

Changes in last 5 minutes

changes(up[5m])

Average of a gauge over 24 hours

avg_over_time(node_memory_MemAvailable_bytes[24h])
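predict_linear() fits a straight line to the samples in the window and extrapolates it into the future. A hedged stdlib-only sketch using ordinary least squares:

```python
def predict_linear(samples, seconds_ahead):
    """Sketch of PromQL predict_linear(): least-squares fit over
    (timestamp, value) samples, extrapolated past the newest sample."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    slope = cov / var
    intercept = mean_v - slope * mean_t
    # predict relative to the newest sample's timestamp
    return slope * (samples[-1][0] + seconds_ahead) + intercept

# Disk shrinking by 1 GB per hour; 4 hours past the last sample -> 94 GB
samples = [(0, 100e9), (3600, 99e9), (7200, 98e9)]
```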

Alerting Rules

rules/alerts.yml

groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"

      - alert: HighMemoryUsage
        expr: |
          (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"}
            / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
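The for: clause in these rules means a breached expression first sits in a pending state and only becomes firing after it has been continuously breached for the configured duration; any evaluation that clears resets it. A hedged Python sketch of that state machine:

```python
def alert_state(evaluations, for_seconds):
    """Sketch of Prometheus `for:` semantics.
    evaluations: list of (timestamp, expr_breached) pairs, oldest first."""
    pending_since = None
    state = 'inactive'
    for ts, breached in evaluations:
        if not breached:
            # any single non-breach resets the pending timer
            pending_since, state = None, 'inactive'
            continue
        if pending_since is None:
            pending_since = ts
        state = 'firing' if ts - pending_since >= for_seconds else 'pending'
    return state
```

This is why brief spikes with a 5m for: window never page anyone: the breach must survive every evaluation in the window.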

Alertmanager

alertmanager.yml

global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/xxx'

route:
  receiver: 'slack-notifications'
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - channel: '#alerts'
        send_resolved: true
        title: '{{ .Status | toUpper }}: {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'pagerduty'
    pagerduty_configs:
      - service_key: 'xxx'
        severity: critical

Grafana Dashboards

Dashboard JSON Structure

{
  "dashboard": {
    "title": "Application Metrics",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(http_requests_total[5m])) by (status_code)",
            "legendFormat": "{{ status_code }}"
          }
        ],
        "gridPos": {"x": 0, "y": 0, "w": 12, "h": 8}
      },
      {
        "title": "Latency P95",
        "type": "gauge",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))"
          }
        ],
        "gridPos": {"x": 12, "y": 0, "w": 6, "h": 8}
      }
    ]
  }
}
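Because this is plain JSON, dashboards can be generated programmatically and kept in version control rather than hand-edited in the UI. A minimal sketch (the graph_panel helper is hypothetical, not part of any Grafana SDK):

```python
import json

def graph_panel(title, expr, x, y, w=12, h=8):
    """Build one graph-panel dict in the dashboard JSON shape above."""
    return {
        "title": title,
        "type": "graph",
        "targets": [{"expr": expr}],
        "gridPos": {"x": x, "y": y, "w": w, "h": h},
    }

dashboard = {"dashboard": {"title": "Application Metrics", "panels": [
    graph_panel("Request Rate", "sum(rate(http_requests_total[5m]))", 0, 0),
]}}
doc = json.dumps(dashboard, indent=2)  # commit this file to git
```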

Provisioning Dashboards

grafana/provisioning/dashboards/dashboards.yml

apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards

Data Source Provisioning

grafana/provisioning/datasources/prometheus.yml

apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

Recording Rules

rules/recording.yml

groups:
  - name: aggregations
    interval: 30s
    rules:
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      - record: instance:node_cpu:avg_rate5m
        expr: |
          avg by (instance) (
            rate(node_cpu_seconds_total{mode!="idle"}[5m])
          )

      - record: job:http_latency:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

Application Instrumentation

Go Application

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var httpRequests = prometheus.NewCounterVec(
    prometheus.CounterOpts{
        Name: "http_requests_total",
        Help: "Total HTTP requests",
    },
    []string{"method", "endpoint", "status"},
)

func init() {
    prometheus.MustRegister(httpRequests)
}

func main() {
    // Expose the metrics endpoint
    http.Handle("/metrics", promhttp.Handler())
    http.ListenAndServe(":8080", nil)
}

Node.js Application

const client = require('prom-client');

const httpRequests = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'endpoint', 'status']
});

// Middleware: count every finished request
app.use((req, res, next) => {
  res.on('finish', () => {
    httpRequests.inc({
      method: req.method,
      endpoint: req.path,
      status: res.statusCode
    });
  });
  next();
});

// Expose metrics
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
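Both snippets ultimately serve the same plain-text exposition format, which Prometheus parses on every scrape. A stdlib-only Python sketch of what one counter looks like on the wire (real applications should use an official client library, not hand-rolled rendering):

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.
    samples: list of (label_pairs, value), where label_pairs is a list of
    (label, value) tuples."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        body = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{body}}} {value}")
    return "\n".join(lines) + "\n"

text = render_counter(
    "http_requests_total", "Total HTTP requests",
    [([("method", "GET"), ("endpoint", "/api"), ("status", "200")], 42)],
)
```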

Common Issues

Issue: Targets Not Discovered

Problem: Prometheus is not scraping one or more targets.
Solution: Check network connectivity from the Prometheus container, verify the target addresses and labels in scrape_configs, and inspect the Targets page in the Prometheus UI.

Issue: High Memory Usage

Problem: Prometheus is using excessive memory.
Solution: Reduce the retention period, precompute expensive queries with recording rules, and limit label cardinality.

Issue: Slow Queries

Problem: PromQL queries are timing out.
Solution: Precompute with recording rules, narrow the time range, and simplify high-cardinality aggregations.

Issue: Missing Data Points

Problem: There are gaps in the metrics data.
Solution: Check the scrape interval and confirm the targets stayed reachable during the gaps.

Best Practices

  • Use recording rules for frequently-used queries

  • Limit label cardinality to prevent memory issues

  • Set appropriate retention based on storage capacity

  • Use histogram metrics for latency measurement

  • Implement proper alerting thresholds

  • Version control dashboards as code

  • Use federation for large-scale deployments

  • Regularly review and prune unused metrics
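Cardinality is multiplicative: the worst-case series count for one metric is the product of the distinct values of each label, which is why unbounded labels such as user IDs are dangerous. A quick sanity-check sketch:

```python
def worst_case_series(label_values):
    """Worst-case time-series count for a single metric: the product of
    the number of distinct values of each label."""
    total = 1
    for values in label_values.values():
        total *= len(set(values))
    return total

# 10 endpoints x 5 methods = 50 series; adding a user_id label with 500
# users multiplies that to 25,000 series for one metric.
```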

Related Skills

  • alerting-oncall - Alert management

  • loki-logging - Log aggregation

  • kubernetes-ops - K8s monitoring

