Monitoring Resource Authoring

This skill covers creating and modifying monitoring resources. For querying Prometheus or investigating alerts, see the prometheus skill and sre skill.

Resource Types Overview

| Resource | API Group | Purpose | CRD Provider |
|---|---|---|---|
| PrometheusRule | `monitoring.coreos.com/v1` | Alert rules and recording rules | kube-prometheus-stack |
| ServiceMonitor | `monitoring.coreos.com/v1` | Scrape metrics from Services | kube-prometheus-stack |
| PodMonitor | `monitoring.coreos.com/v1` | Scrape metrics from Pods directly | kube-prometheus-stack |
| ScrapeConfig | `monitoring.coreos.com/v1alpha1` | Advanced scrape configuration (relabeling, multi-target) | kube-prometheus-stack |
| AlertmanagerConfig | `monitoring.coreos.com/v1alpha1` | Routing, receivers, silencing | kube-prometheus-stack |
| Silence | `observability.giantswarm.io/v1alpha2` | Declarative Alertmanager silences | silence-operator |
| Canary | `canaries.flanksource.com/v1` | Synthetic health checks (HTTP, TCP, K8s) | canary-checker |

File Placement

Monitoring resources go in different locations depending on scope:

| Scope | Path | When to Use |
|---|---|---|
| Platform-wide alerts/monitors | `kubernetes/platform/config/monitoring/` | Alerts for platform components (Cilium, Istio, cert-manager, etc.) |
| Subsystem-specific alerts | `kubernetes/platform/config/<subsystem>/` | Alerts bundled with the subsystem they monitor (e.g., `dragonfly/prometheus-rules.yaml`) |
| Cluster-specific silences | `kubernetes/clusters/<cluster>/config/silences/` | Silences for known issues on specific clusters |
| Cluster-specific alerts | `kubernetes/clusters/<cluster>/config/` | Alerts that only apply to a specific cluster |
| Canary health checks | `kubernetes/platform/config/canary-checker/` | Platform-wide synthetic checks |

File Naming Conventions

Observed patterns in the config/monitoring/ directory:

| Pattern | Example | When |
|---|---|---|
| `<component>-alerts.yaml` | `cilium-alerts.yaml`, `grafana-alerts.yaml` | PrometheusRule files |
| `<component>-recording-rules.yaml` | `loki-mixin-recording-rules.yaml` | Recording rules |
| `<component>-servicemonitors.yaml` | `istio-servicemonitors.yaml` | ServiceMonitor/PodMonitor files |
| `<component>-canary.yaml` | `alertmanager-canary.yaml` | Canary health checks |
| `<component>-route.yaml` | `grafana-route.yaml` | HTTPRoute for gateway access |
| `<component>-secret.yaml` | `discord-secret.yaml` | ExternalSecrets for monitoring |
| `<component>-scrape.yaml` | `hardware-monitoring-scrape.yaml` | ScrapeConfig resources |

Registration

After creating a file in config/monitoring/, add it to kubernetes/platform/config/monitoring/kustomization.yaml:

```yaml
resources:
  # ...existing resources...
  - my-new-alerts.yaml # Add alphabetically by component
```

For subsystem-specific alerts (e.g., config/dragonfly/prometheus-rules.yaml), add to that subsystem's kustomization.yaml instead.
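As an illustration, a subsystem kustomization that bundles its own rules might look like the following sketch (only prometheus-rules.yaml is named in this doc; any other resources in the real file are omitted):

```yaml
# kubernetes/platform/config/dragonfly/kustomization.yaml (illustrative)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - prometheus-rules.yaml # alerts bundled with the subsystem
```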

PrometheusRule Authoring

Required Structure

Every PrometheusRule must include the release: kube-prometheus-stack label for Prometheus to discover it. The YAML schema comment enables editor validation.


```yaml
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/prometheusrule_v1.json
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: <component>-alerts
  labels:
    app.kubernetes.io/name: <component>
    release: kube-prometheus-stack # REQUIRED - Prometheus selector
spec:
  groups:
    - name: <component>.rules # or <component>-<subsystem> for sub-groups
      rules:
        - alert: AlertName
          expr: <PromQL expression>
          for: 5m
          labels:
            severity: critical # critical | warning | info
          annotations:
            summary: "Short human-readable summary with {{ $labels.instance }}"
            description: >-
              Detailed explanation of what is happening, what it means, and
              what to investigate. Use template variables for context.
```

Label Requirements

| Label | Required? | Purpose |
|---|---|---|
| `release: kube-prometheus-stack` | Yes | Prometheus discovery selector |
| `app.kubernetes.io/name: <component>` | Recommended | Organizational grouping |

Some files use additional labels such as `prometheus: kube-prometheus-stack` (e.g., dragonfly), but `release: kube-prometheus-stack` is the critical one for discovery.

Severity Conventions

| Severity | `for` Duration | Use Case | Alertmanager Routing |
|---|---|---|---|
| critical | 2m-5m | Service down, data loss risk, immediate action needed | Routed to Discord |
| warning | 5m-15m | Degraded performance, approaching limits, needs attention | Default receiver (Discord) |
| info | 10m-30m | Informational, capacity planning, non-urgent | Silenced by InfoInhibitor |

Guidelines for the `for` duration:

  • Shorter for = faster alert, more noise. Longer = quieter, slower response.

  • for: 0m (immediate) only for truly instant failures (e.g., SMART health check fail).

  • Most alerts: 5m is a good default.

  • Flap-prone metrics (error rates, latency): 10m-15m to avoid false positives.

  • Absence detection: 5m (metric may genuinely disappear briefly during restarts).
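As a sketch of the two extremes of that guidance (alert names, metric names, and thresholds here are hypothetical):

```yaml
# Truly instant failure: fire immediately
- alert: DiskSMARTFailure
  expr: smartmon_device_smart_healthy == 0 # hypothetical metric
  for: 0m
  labels:
    severity: critical

# Flap-prone ratio: wait out transient spikes
- alert: AppHighErrorRate
  expr: app:requests_error_rate:ratio5m > 0.05 # hypothetical recording rule
  for: 15m
  labels:
    severity: warning
```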

Annotation Templates

Standard annotations used across this repository:

```yaml
annotations:
  summary: "Short title with {{ $labels.relevant_label }}"
  description: >-
    Multi-line description explaining what happened, the impact, and what
    to investigate. Reference threshold values and current values using
    template functions.
  runbook_url: "https://github.com/ionfury/homelab/blob/main/docs/runbooks/<runbook>.md"
```

The runbook_url annotation is optional but recommended for critical alerts that have established recovery procedures.

PromQL Template Functions

Functions available in summary and description annotations:

| Function | Input | Output | Example |
|---|---|---|---|
| `humanize` | Number | Human-readable number | `{{ $value \| humanize }}` -> "1.234k" |
| `humanizePercentage` | Float (0-1) | Percentage string | `{{ $value \| humanizePercentage }}` -> "45.6%" |
| `humanizeDuration` | Seconds | Duration string | `{{ $value \| humanizeDuration }}` -> "2h 30m" |
| `printf` | Format string | Formatted value | `{{ printf "%.2f" $value }}` -> "1.23" |

Label Variables in Annotations

Access alert labels via {{ $labels.<label_name> }} and the expression value via {{ $value }} :

```yaml
summary: "Cilium agent down on {{ $labels.instance }}"
description: >-
  BPF map {{ $labels.map_name }} on {{ $labels.instance }} is at
  {{ $value | humanizePercentage }}.
```

Common Alert Patterns

Target down (availability):

```yaml
- alert: <Component>Down
  expr: up{job="<job-name>"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is down on {{ $labels.instance }}"
```

Absence detection (component disappeared entirely):

```yaml
- alert: <Component>Down
  expr: absent(up{job="<job-name>"} == 1)
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> is unavailable"
```

Error rate (ratio):

```yaml
- alert: <Component>HighErrorRate
  expr: |
    (
      sum(rate(http_requests_total{job="<job>",status=~"5.."}[5m]))
      /
      sum(rate(http_requests_total{job="<job>"}[5m]))
    ) > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> error rate above 5%"
    description: "Error rate is {{ $value | humanizePercentage }}"
```

Latency (histogram quantile):

```yaml
- alert: <Component>HighLatency
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket{job="<job>"}[5m])) by (le)
    ) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    summary: "<Component> p99 latency above 1s"
    description: "P99 latency is {{ $value | humanizeDuration }}"
```

Resource pressure (capacity):

```yaml
- alert: <Component>ResourcePressure
  expr: <resource_used> / <resource_total> > 0.9
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: "<Component> at {{ $value | humanizePercentage }} capacity"
```

PVC space low:

```yaml
- alert: <Component>PVCLow
  expr: |
    kubelet_volume_stats_available_bytes{persistentvolumeclaim=~".*<component>.*"}
    /
    kubelet_volume_stats_capacity_bytes{persistentvolumeclaim=~".*<component>.*"}
    < 0.15
  for: 15m
  labels:
    severity: warning
  annotations:
    summary: "PVC {{ $labels.persistentvolumeclaim }} running low"
    description: "{{ $value | humanizePercentage }} free space remaining"
```

Alert Grouping

Group related alerts in named rule groups. The name field groups alerts in the Prometheus UI and affects evaluation order:

```yaml
spec:
  groups:
    - name: cilium-agent # Agent availability and health
      rules: [...]
    - name: cilium-bpf # BPF subsystem alerts
      rules: [...]
    - name: cilium-policy # Network policy alerts
      rules: [...]
    - name: cilium-network # General networking alerts
      rules: [...]
```

Recording Rules

Recording rules pre-compute expensive queries for dashboard performance. Place them alongside alerts in the same PrometheusRule file or in a dedicated *-recording-rules.yaml file.

```yaml
spec:
  groups:
    - name: <component>-recording-rules
      rules:
        - record: <namespace>:<metric>:<aggregation>
          expr: |
            <PromQL aggregation query>
```

Naming Convention

Recording rule names follow the pattern `level:metric:operations`:

```
loki:request_duration_seconds:p99
loki:requests_total:rate5m
loki:requests_error_rate:ratio5m
```

When to Create Recording Rules

  • Dashboard queries that aggregate across many series (e.g., sum/rate across all pods)

  • Queries used by multiple alerts (avoids redundant computation)

  • Complex multi-step computations that are hard to read inline

Example: Loki Recording Rules

```yaml
- record: loki:request_duration_seconds:p99
  expr: |
    histogram_quantile(0.99,
      sum(rate(loki_request_duration_seconds_bucket[5m])) by (le, job, namespace)
    )
- record: loki:requests_error_rate:ratio5m
  expr: |
    sum(rate(loki_request_duration_seconds_count{status_code=~"5.."}[5m])) by (job, namespace)
    /
    sum(rate(loki_request_duration_seconds_count[5m])) by (job, namespace)
```

ServiceMonitor and PodMonitor

Via Helm Values (Preferred)

Most charts support enabling ServiceMonitor through values. Always prefer this over manual resources:

kubernetes/platform/charts/<app-name>.yaml:

```yaml
serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
```

Manual ServiceMonitor

When a chart does not support ServiceMonitor creation, create one manually. The resource lives in the monitoring namespace and uses namespaceSelector to reach across namespaces.


```yaml
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/servicemonitor_v1.json
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace> # Namespace where the service lives
  selector:
    matchLabels:
      app.kubernetes.io/name: <component> # Must match service labels
  endpoints:
    - port: http-monitoring # Must match service port name
      path: /metrics
      interval: 30s
```

Manual PodMonitor

Use PodMonitor when pods expose metrics but don't have a Service (e.g., DaemonSets, sidecars):


```yaml
# yaml-language-server: $schema=https://raw.githubusercontent.com/datreeio/CRDs-catalog/main/monitoring.coreos.com/podmonitor_v1.json
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: <component>
  namespace: monitoring
  labels:
    release: kube-prometheus-stack # REQUIRED for discovery
spec:
  namespaceSelector:
    matchNames:
      - <target-namespace>
  selector:
    matchLabels:
      app: <component>
  podMetricsEndpoints:
    - port: "15020" # Port name or number (quoted if numeric)
      path: /stats/prometheus
      interval: 30s
```

Cross-Namespace Pattern

All ServiceMonitors and PodMonitors in this repo live in the monitoring namespace and use namespaceSelector to reach pods in other namespaces. This centralizes monitoring configuration and avoids needing release: kube-prometheus-stack labels on resources in app namespaces.

Advanced: matchExpressions

For selecting multiple pod labels (e.g., all Flux controllers):

```yaml
selector:
  matchExpressions:
    - key: app
      operator: In
      values:
        - helm-controller
        - source-controller
        - kustomize-controller
```

AlertmanagerConfig

The platform Alertmanager configuration lives in config/monitoring/alertmanager-config.yaml . It defines routing and receivers for the entire platform.

Current Routing Architecture

```
All alerts
├── InfoInhibitor → "null" receiver (silenced)
├── Watchdog → heartbeat receiver (webhook to healthchecks.io, every 2m)
├── severity=critical → discord receiver
└── (default) → discord receiver
```
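Expressed as a sketch in native Alertmanager route syntax (an illustration of the tree above, not the repo's actual file; matcher strings and the repeat interval are assumptions):

```yaml
route:
  receiver: discord # default receiver
  routes:
    - receiver: "null" # silence InfoInhibitor
      matchers: ["alertname = InfoInhibitor"]
    - receiver: heartbeat # Watchdog heartbeat to healthchecks.io
      matchers: ["alertname = Watchdog"]
      repeat_interval: 2m
    - receiver: discord
      matchers: ["severity = critical"]
```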

Receivers

| Receiver | Type | Purpose |
|---|---|---|
| `"null"` | None | Silences matched alerts (e.g., InfoInhibitor) |
| `heartbeat` | Webhook | Sends Watchdog heartbeat to healthchecks.io |
| `discord` | Discord webhook | Sends alerts to Discord channel |

Adding a New Route

To route specific alerts differently (e.g., to a different channel or receiver), add a route entry in alertmanager-config.yaml:

```yaml
routes:
  - receiver: "<receiver-name>"
    matchers:
      - name: alertname
        value: "<AlertName>"
        matchType: =
```

Secrets for Alertmanager

| Secret | Source | File |
|---|---|---|
| `alertmanager-discord-webhook` | ExternalSecret (AWS SSM) | discord-secret.yaml |
| `alertmanager-heartbeat-ping-url` | Replicated from kube-system | heartbeat-secret.yaml |
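For orientation, the Discord webhook ExternalSecret could be sketched roughly as below; the API version, store reference, key names, and SSM path are all assumptions, so check discord-secret.yaml for the real values:

```yaml
apiVersion: external-secrets.io/v1beta1 # assumed version
kind: ExternalSecret
metadata:
  name: alertmanager-discord-webhook
  namespace: monitoring
spec:
  secretStoreRef:
    name: aws-ssm # hypothetical store name
    kind: ClusterSecretStore
  target:
    name: alertmanager-discord-webhook
  data:
    - secretKey: webhook-url # hypothetical key
      remoteRef:
        key: /homelab/alertmanager/discord-webhook # hypothetical SSM path
```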

Silence CRs (silence-operator)

Silences suppress known alerts declaratively. They are per-cluster resources because different clusters have different expected alert profiles.

Placement

```
kubernetes/clusters/<cluster>/config/silences/
├── kustomization.yaml
└── <descriptive-name>.yaml
```

Template


```yaml
# <Comment explaining WHY this alert is silenced>
apiVersion: observability.giantswarm.io/v1alpha2
kind: Silence
metadata:
  name: <descriptive-name>
  namespace: monitoring
spec:
  matchers:
    - name: alertname
      matchType: "=~" # "=" exact, "=~" regex, "!=" negation, "!~" regex negation
      value: "Alert1|Alert2"
    - name: namespace
      matchType: "="
      value: <target-namespace>
```

Matcher Reference

| matchType | Meaning | Example |
|---|---|---|
| `=` | Exact match | `value: "KubePodCrashLooping"` |
| `!=` | Not equal | `value: "Watchdog"` |
| `=~` | Regex match | `value: "KubePod.*\|TargetDown"` |
| `!~` | Regex negation | `value: "Info.*"` |

Requirements

  • Always include a comment explaining why the silence exists (architectural limitation, expected behavior, etc.)

  • Every cluster must maintain a zero firing alerts baseline (excluding Watchdog)

  • Silences are a LAST RESORT — every effort must be made to fix the root cause before resorting to a silence. Only silence when the alert genuinely cannot be fixed: architectural limitations (e.g., single-node Spegel), expected environmental behavior, or confirmed upstream bugs

  • Never leave alerts firing without action — either fix the cause or create a Silence CR. An ignored alert degrades trust in the entire monitoring system and leads to alert fatigue where real incidents get missed

Adding a Silence to a Cluster

1. Create the config/silences/ directory if it does not exist.
2. Add the Silence YAML file.
3. Create or update config/silences/kustomization.yaml:

   ```yaml
   apiVersion: kustomize.config.k8s.io/v1beta1
   kind: Kustomization
   resources:
     - <silence-name>.yaml
   ```

4. Reference silences in config/kustomization.yaml.

Canary Health Checks

Canary resources provide synthetic monitoring using Flanksource canary-checker. They live in config/canary-checker/ for platform checks or alongside app config for app-specific checks.

HTTP Health Check


```yaml
# yaml-language-server: $schema=https://kubernetes-schemas.pages.dev/canaries.flanksource.com/canary_v1.json
apiVersion: canaries.flanksource.com/v1
kind: Canary
metadata:
  name: http-check-<component>
spec:
  schedule: "@every 1m"
  http:
    - name: <component>-health
      url: https://<component>.${internal_domain}/health
      responseCodes: [200]
      maxSSLExpiry: 7 # Alert if TLS cert expires within 7 days
      thresholdMillis: 5000 # Fail if response takes >5s
```

TCP Port Check

```yaml
spec:
  schedule: "@every 1m"
  tcp:
    - name: <component>-port
      host: <service>.<namespace>.svc.cluster.local
      port: 8080
      timeout: 5000
```

Kubernetes Resource Check with CEL

Test that pods are actually healthy using CEL expressions (preferred over `ready: true`, because the built-in flag penalizes pods with restart history):

```yaml
spec:
  interval: 60
  kubernetes:
    - name: <component>-pods-healthy
      kind: Pod
      namespaceSelector:
        name: <namespace>
      resource:
        labelSelector: app.kubernetes.io/name=<component>
      test:
        expr: >
          dyn(results).all(pod,
            pod.Object.status.phase == "Running" &&
            pod.Object.status.conditions.exists(c, c.type == "Ready" && c.status == "True"))
```

Canary Metrics and Alerting

canary-checker exposes metrics that are already monitored by the platform:

  • canary_check == 1 triggers CanaryCheckFailure (critical, 2m)

  • High failure rates trigger CanaryCheckHighFailureRate (warning, 5m)

These alerts are defined in config/canary-checker/prometheus-rules.yaml, so you do not need to create separate alerts for each canary.
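Based on the conditions listed above, the check-failure rule plausibly looks something like this sketch (annotation wording and label names are assumptions; the real rules live in config/canary-checker/prometheus-rules.yaml):

```yaml
- alert: CanaryCheckFailure
  expr: canary_check == 1
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Canary {{ $labels.name }} is failing" # label name illustrative
```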

Workflow: Adding Monitoring for a New Component

Step 1: Determine What Exists

Check if the Helm chart already provides monitoring:

```sh
# Search chart values for monitoring options
kubesearch <chart-name> serviceMonitor
kubesearch <chart-name> prometheusRule
```

Enable via Helm values if available (see deploy-app skill).

Step 2: Create Missing Resources

If the chart does not provide monitoring, create resources manually:

  • ServiceMonitor or PodMonitor for metrics scraping

  • PrometheusRule for alert rules

  • Canary for synthetic health checks (HTTP/TCP)

Step 3: Place Files Correctly

  • If the component has its own config subsystem (config/<component>/ ), add monitoring resources there alongside other config

  • If it is a standalone monitoring addition, add to config/monitoring/

Step 4: Register in Kustomization

Add new files to the appropriate kustomization.yaml .

Step 5: Validate

```sh
task k8s:validate
```

Step 6: Verify After Deployment

Prometheus is behind OAuth2 Proxy, so use kubectl exec or port-forward for API queries:

```sh
# Check the ServiceMonitor is discovered
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/targets' | \
  jq '.data.activeTargets[] | select(.labels.job | contains("<component>"))'

# Check alert rules are loaded
KUBECONFIG=~/.kube/<cluster>.yaml kubectl exec -n monitoring prometheus-kube-prometheus-stack-0 -c prometheus -- \
  wget -qO- 'http://localhost:9090/api/v1/rules' | \
  jq '.data.groups[] | select(.name | contains("<component>"))'

# Check canary status
KUBECONFIG=~/.kube/<cluster>.yaml kubectl get canaries -A | grep <component>
```

Common Mistakes

| Mistake | Impact | Fix |
|---|---|---|
| Missing `release: kube-prometheus-stack` label | Prometheus ignores the resource | Add the label to `metadata.labels` |
| PrometheusRule in wrong namespace without namespaceSelector | Prometheus does not discover it | Place in the monitoring namespace or ensure Prometheus watches the target namespace |
| ServiceMonitor selector does not match any service | No metrics scraped, no error raised | Verify labels match with `kubectl get svc -n <ns> --show-labels` |
| Using `ready: true` in canary-checker Kubernetes checks | False negatives after pod restarts | Use a CEL `test.expr` instead |
| Hardcoding domains in canary URLs | Breaks across clusters | Use the `${internal_domain}` substitution variable |
| Very short `for` duration on flappy metrics | Alert noise | Use 10m+ for error rates and latencies |
| Creating alerts for metrics that do not exist yet | Alert never fires (expression returns no data) | Verify metrics exist in Prometheus before writing rules |
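A quick guard against the first mistake in the table: a small stdlib-only script (illustrative, not part of this repo) that checks a parsed manifest for the required discovery label before you commit it.

```python
def has_discovery_label(manifest: dict) -> bool:
    """Return True if a parsed PrometheusRule/ServiceMonitor manifest carries
    the release: kube-prometheus-stack label Prometheus uses for discovery."""
    labels = manifest.get("metadata", {}).get("labels", {})
    return labels.get("release") == "kube-prometheus-stack"

# Example manifest shapes (trimmed to metadata for brevity)
good = {"kind": "PrometheusRule",
        "metadata": {"labels": {"release": "kube-prometheus-stack"}}}
bad = {"kind": "PrometheusRule", "metadata": {"labels": {}}}

print(has_discovery_label(good))  # True
print(has_discovery_label(bad))   # False
```

Feed it the output of `yaml.safe_load` on each new file (PyYAML assumed) to lint manifests in a pre-commit hook.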

Reference: Existing Alert Files

| File | Component | Alert Count | Subsystems |
|---|---|---|---|
| monitoring/cilium-alerts.yaml | Cilium | 14 | Agent, BPF, Policy, Network |
| monitoring/istio-alerts.yaml | Istio | ~10 | Control plane, mTLS, Gateway |
| monitoring/cert-manager-alerts.yaml | cert-manager | 5 | Expiry, Renewal, Issuance |
| monitoring/network-policy-alerts.yaml | Network Policy | 2 | Enforcement escape hatch |
| monitoring/external-secrets-alerts.yaml | External Secrets | 3 | Sync, Ready, Store health |
| monitoring/grafana-alerts.yaml | Grafana | 4 | Datasource, Errors, Plugins, Down |
| monitoring/loki-mixin-alerts.yaml | Loki | ~5 | Requests, Latency, Ingester |
| monitoring/alloy-alerts.yaml | Alloy | 3 | Dropped entries, Errors, Lag |
| monitoring/hardware-monitoring-alerts.yaml | Hardware | 7 | Temperature, Fans, Disks, Power |
| dragonfly/prometheus-rules.yaml | Dragonfly | 2+ | Down, Memory |
| canary-checker/prometheus-rules.yaml | canary-checker | 2 | Check failure, High failure rate |

Keywords

PrometheusRule, ServiceMonitor, PodMonitor, ScrapeConfig, AlertmanagerConfig, Silence, silence-operator, canary-checker, Canary, recording rules, alert rules, monitoring, observability, scrape targets, prometheus, alertmanager, discord, heartbeat
