Kafka Monitoring & Observability
Expert guidance for implementing comprehensive monitoring and observability for Apache Kafka using Prometheus and Grafana.
When to Use This Skill
I activate when you need help with:
- Monitoring setup: "Set up Kafka monitoring", "configure Prometheus for Kafka", "Grafana dashboards for Kafka"
- Metrics collection: "Kafka JMX metrics", "export Kafka metrics to Prometheus"
- Alerting: "Kafka alerting rules", "alert on under-replicated partitions", "critical Kafka metrics"
- Troubleshooting: "Monitor Kafka performance", "track consumer lag", "broker health monitoring"
What I Know
Available Monitoring Components
This plugin provides a complete monitoring stack:
- Prometheus JMX Exporter Configuration
  - Location: plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml
  - Purpose: Export Kafka JMX metrics to Prometheus format
  - Metrics Exported:
    - Broker topic metrics (bytes in/out, messages in, request rate)
    - Replica manager (under-replicated partitions, ISR shrinks/expands)
    - Controller metrics (active controller, offline partitions, leader elections)
    - Request metrics (produce/fetch latency)
    - Log metrics (flush rate, flush latency)
    - JVM metrics (heap, GC, threads, file descriptors)
- Grafana Dashboards (5 Dashboards)
  - Location: plugins/specweave-kafka/monitoring/grafana/dashboards/
  - Dashboards:
    - kafka-cluster-overview.json - Cluster health and throughput
    - kafka-broker-metrics.json - Per-broker performance
    - kafka-consumer-lag.json - Consumer lag monitoring
    - kafka-topic-metrics.json - Topic-level metrics
    - kafka-jvm-metrics.json - JVM health (heap, GC, threads)
- Grafana Provisioning
  - Location: plugins/specweave-kafka/monitoring/grafana/provisioning/
  - Files:
    - dashboards/kafka.yml - Dashboard provisioning config
    - datasources/prometheus.yml - Prometheus datasource config
Setup Workflow 1: JMX Exporter (Self-Hosted Kafka)
For Kafka running on VMs or bare metal (non-Kubernetes).
Step 1: Download JMX Prometheus Agent
Download JMX Prometheus agent JAR
cd /opt
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar
Copy JMX Exporter config
cp plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml /opt/kafka-jmx-exporter.yml
Step 2: Configure Kafka Broker
Add JMX exporter to Kafka startup script:
Edit Kafka startup (e.g., /etc/systemd/system/kafka.service)
[Service]
Environment="KAFKA_OPTS=-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
Or add to kafka-server-start.sh :
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
Step 3: Restart Kafka and Verify
Restart Kafka broker
sudo systemctl restart kafka
Verify JMX exporter is running (port 7071)
curl localhost:7071/metrics | grep kafka_server
Expected output: kafka_server_broker_topic_metrics_bytesin_total{...} 12345
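The curl-and-grep check above can also be scripted. The following is a minimal Python sketch, assuming the exporter returns the standard Prometheus text exposition format; `parse_metrics`, `has_metric_prefix`, and `SAMPLE_OUTPUT` are hypothetical names, and the sample text is hand-written rather than real broker output.

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$')


def parse_metrics(text):
    """Return (name, labels_text, value) tuples from exposition-format text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match:
            name, labels, value = match.groups()
            samples.append((name, labels or "", float(value)))
    return samples


def has_metric_prefix(text, prefix):
    """True if at least one sample name starts with the given prefix."""
    return any(name.startswith(prefix) for name, _, _ in parse_metrics(text))


# Hand-written sample shaped like exporter output (assumption, not real data).
SAMPLE_OUTPUT = """\
# HELP kafka_server_broker_topic_metrics_bytesin_total Bytes in
# TYPE kafka_server_broker_topic_metrics_bytesin_total counter
kafka_server_broker_topic_metrics_bytesin_total{topic="orders"} 12345
jvm_threads_current 42
"""

print(has_metric_prefix(SAMPLE_OUTPUT, "kafka_server"))
```

In practice the text would come from `http://localhost:7071/metrics` instead of the inline sample.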
Step 4: Configure Prometheus Scraping
Add Kafka brokers to Prometheus config:
prometheus.yml
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets:
          - 'kafka-broker-1:7071'
          - 'kafka-broker-2:7071'
          - 'kafka-broker-3:7071'
    scrape_interval: 30s
Reload Prometheus
sudo systemctl reload prometheus
OR send SIGHUP
kill -HUP $(pidof prometheus)
Verify scraping
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="kafka")'
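To turn the target check into a pass/fail signal, the response can be filtered for targets that are not healthy. This is a rough Python sketch, assuming the usual shape of the Prometheus `/api/v1/targets` response (`data.activeTargets[]` entries with `labels`, `health`, and `lastError`); `unhealthy_targets` and `SAMPLE_RESPONSE` are hypothetical, and the sample payload is hand-written.

```python
def unhealthy_targets(payload, job):
    """Return (instance, last_error) pairs for targets of `job` whose health is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (t.get("labels", {}).get("instance", "?"), t.get("lastError", ""))
        for t in targets
        if t.get("labels", {}).get("job") == job and t.get("health") != "up"
    ]


# Hand-written sample shaped like a targets API response (assumption, not real output).
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {"labels": {"job": "kafka", "instance": "kafka-broker-1:7071"},
             "health": "up", "lastError": ""},
            {"labels": {"job": "kafka", "instance": "kafka-broker-2:7071"},
             "health": "down", "lastError": "connection refused"},
            {"labels": {"job": "node", "instance": "host-1:9100"},
             "health": "down", "lastError": "timeout"},
        ]
    },
}

for instance, error in unhealthy_targets(SAMPLE_RESPONSE, "kafka"):
    print(f"{instance} is not being scraped: {error}")
```

An empty result means every broker listed in the scrape config is currently being scraped.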
Setup Workflow 2: Strimzi (Kubernetes)
For Kafka running on Kubernetes with Strimzi Operator.
Step 1: Create JMX Exporter ConfigMap
Create ConfigMap from JMX exporter config
kubectl create configmap kafka-metrics \
  --from-file=kafka-metrics-config.yml=plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml \
  -n kafka
Step 2: Configure Kafka CR with Metrics
kafka-cluster.yaml (add metricsConfig section)
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-kafka-cluster
  namespace: kafka
spec:
  kafka:
    version: 3.7.0
    replicas: 3
    # ... other config ...
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: kafka-metrics-config.yml
Apply updated Kafka CR
kubectl apply -f kafka-cluster.yaml
Verify metrics endpoint (wait for rolling restart)
kubectl exec -it my-kafka-cluster-kafka-0 -n kafka -- curl localhost:9404/metrics | grep kafka_server
Step 3: Install Prometheus Operator (if not installed)
Add Prometheus Community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
Step 4: Create PodMonitor for Kafka
kafka-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: kafka-metrics
  namespace: kafka
  labels:
    app: strimzi
spec:
  selector:
    matchLabels:
      strimzi.io/kind: Kafka
  podMetricsEndpoints:
    - port: tcp-prometheus
      interval: 30s
Apply PodMonitor
kubectl apply -f kafka-podmonitor.yaml
Verify Prometheus is scraping Kafka
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
Open: http://localhost:9090/targets
Should see kafka-metrics/* targets
Setup Workflow 3: Grafana Dashboards
Installation (Docker Compose)
If using Docker Compose for local development:
docker-compose.yml (add to existing Kafka setup)
version: '3.8'
services:
  # ... Kafka services ...
  prometheus:
    image: prom/prometheus:v2.48.0
    ports:
      - "9090:9090"
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
  grafana:
    image: grafana/grafana:10.2.0
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
      - grafana-data:/var/lib/grafana
volumes:
  prometheus-data:
  grafana-data:
Start monitoring stack
docker-compose up -d prometheus grafana
Access Grafana
URL: http://localhost:3000
Username: admin
Password: admin
Installation (Kubernetes)
With kube-prometheus-stack, Grafana auto-imports dashboards from ConfigMaps labeled grafana_dashboard=1:
Create ConfigMaps for each dashboard
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
name=$(basename "$dashboard" .json)
  kubectl create configmap "kafka-dashboard-$name" \
    --from-file="$dashboard" \
    -n monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
done
Label ConfigMaps for Grafana auto-discovery
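for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  kubectl label configmap -n monitoring "kafka-dashboard-$(basename "$dashboard" .json)" grafana_dashboard=1 --overwrite
done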
Grafana will auto-import dashboards (wait 30-60 seconds)
Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
URL: http://localhost:3000
Username: admin
Password: prom-operator (default kube-prometheus-stack password)
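As an alternative to creating and then labeling each ConfigMap, the manifest can be generated with the label already in place. This is only a sketch in Python, assuming the `kafka-dashboard-<name>` naming and `grafana_dashboard=1` label used above; `dashboard_configmap` is a hypothetical helper, and the printed JSON is meant to be piped to `kubectl apply -f -`.

```python
import json
from pathlib import PurePosixPath


def dashboard_configmap(path, dashboard_json, namespace="monitoring"):
    """Build a ConfigMap manifest (as a dict) that wraps one dashboard JSON file."""
    file_name = PurePosixPath(path).name   # e.g. kafka-consumer-lag.json
    stem = PurePosixPath(path).stem        # e.g. kafka-consumer-lag
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"kafka-dashboard-{stem}",
            "namespace": namespace,
            # Label applied up front, so no separate `kubectl label` step is needed.
            "labels": {"grafana_dashboard": "1"},
        },
        "data": {file_name: dashboard_json},
    }


# Placeholder dashboard body; a real run would read the JSON file from disk.
manifest = dashboard_configmap("dashboards/kafka-consumer-lag.json", '{"title": "demo"}')
print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON manifests as well as YAML, which is why the sketch avoids a YAML dependency.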
Manual Dashboard Import
If auto-provisioning doesn't work:
1. Access Grafana UI
2. Go to: Dashboards → Import
3. Upload JSON files from:
plugins/specweave-kafka/monitoring/grafana/dashboards/
Or use Grafana API
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  jq '{dashboard: (. + {id: null}), overwrite: true}' "$dashboard" |
    curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
      -H "Content-Type: application/json" \
      -d @-
done
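The Grafana dashboard API expects the dashboard to be wrapped in an envelope rather than posted as a bare JSON file. Below is a small Python sketch of that envelope, assuming the `dashboard` and `overwrite` fields of `POST /api/dashboards/db`; `import_payload` is a hypothetical helper and the dashboard dict is a stand-in for a real file.

```python
import json


def import_payload(dashboard, overwrite=True):
    """Wrap a dashboard dict in the envelope expected by POST /api/dashboards/db."""
    body = dict(dashboard)  # shallow copy so the caller's dict is untouched
    body["id"] = None       # a null id lets Grafana assign one, matching by uid if present
    return {"dashboard": body, "overwrite": overwrite}


# Stand-in for the contents of one dashboard JSON file (assumption).
exported = {"id": 7, "uid": "kafka-cluster-overview", "title": "Kafka Cluster Overview"}
print(json.dumps(import_payload(exported)))
```

The resulting JSON string is what goes in the request body, with `Content-Type: application/json`.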
Dashboard Overview
- Kafka Cluster Overview (kafka-cluster-overview.json)
  Purpose: High-level cluster health
  Key Metrics:
  - Active Controller Count (should be exactly 1)
  - Under-Replicated Partitions (should be 0) ⚠️ CRITICAL
  - Offline Partitions Count (should be 0) ⚠️ CRITICAL
  - Unclean Leader Elections (should be 0)
  - Cluster Throughput (bytes in/out per second)
  - Request Rate (produce, fetch requests per second)
  - ISR Changes (shrinks/expands)
  - Leader Election Rate
  Use When: Checking overall cluster health
- Kafka Broker Metrics (kafka-broker-metrics.json)
  Purpose: Per-broker performance
  Key Metrics:
  - Broker CPU Usage (% utilization)
  - Broker Heap Memory Usage
  - Broker Network Throughput (bytes in/out)
  - Request Handler Idle Percentage (low = CPU saturation)
  - File Descriptors (open vs max)
  - Log Flush Latency (p50, p99)
  - JVM GC Collection Count/Time
  Use When: Investigating broker performance issues
- Kafka Consumer Lag (kafka-consumer-lag.json)
  Purpose: Consumer lag monitoring
  Key Metrics:
  - Consumer Lag per Topic/Partition
  - Total Lag per Consumer Group
  - Offset Commit Rate
  - Current Consumer Offset
  - Log End Offset (producer offset)
  - Consumer Group Members
  Use When: Troubleshooting slow consumers or lag spikes
- Kafka Topic Metrics (kafka-topic-metrics.json)
  Purpose: Topic-level metrics
  Key Metrics:
  - Messages Produced per Topic
  - Bytes per Topic (in/out)
  - Partition Count per Topic
  - Replication Factor
  - In-Sync Replicas
  - Log Size per Partition
  - Current Offset per Partition
  - Partition Leader Distribution
  Use When: Analyzing topic throughput and hotspots
- Kafka JVM Metrics (kafka-jvm-metrics.json)
  Purpose: JVM health monitoring
  Key Metrics:
  - Heap Memory Usage (used vs max)
  - Heap Utilization Percentage
  - GC Collection Rate (collections/sec)
  - GC Collection Time (ms/sec)
  - JVM Thread Count
  - Heap Memory by Pool (young gen, old gen, survivor)
  - Off-Heap Memory Usage (metaspace, code cache)
  - GC Pause Time Percentiles (p50, p95, p99)
  Use When: Investigating memory leaks or GC pauses
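The consumer lag dashboard listed above rests on one relationship: lag for a partition is its log end offset minus the group's committed offset. Here is a minimal Python sketch of that arithmetic, using made-up offsets; `partition_lag` and `total_lag` are hypothetical helpers, not part of the plugin or of Kafka Exporter.

```python
def partition_lag(log_end_offsets, committed_offsets):
    """Lag per (topic, partition): log end offset minus committed offset, floored at 0."""
    lag = {}
    for tp, end in log_end_offsets.items():
        committed = committed_offsets.get(tp)
        if committed is None:
            continue  # the group has no committed offset for this partition
        lag[tp] = max(end - committed, 0)
    return lag


def total_lag(lag):
    """Total lag for the group across all partitions it has committed offsets for."""
    return sum(lag.values())


# Made-up offsets for one consumer group (assumption, not real data).
log_end = {("orders", 0): 1500, ("orders", 1): 900, ("orders", 2): 50}
committed = {("orders", 0): 1200, ("orders", 1): 900}

lag = partition_lag(log_end, committed)
print(lag, total_lag(lag))
```

Partitions without a committed offset are skipped here rather than counted as fully lagging; whether that is the right choice depends on the consumer's offset reset policy.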
Critical Alerts Configuration
Create Prometheus alerting rules for critical Kafka metrics:
kafka-alerts.yml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: monitoring
  labels:
    # Needed with kube-prometheus-stack defaults (release name "prometheus"),
    # unless ruleSelectorNilUsesHelmValues=false is set.
    release: prometheus
spec:
  groups:
    - name: kafka.rules
      interval: 30s
      rules:
        # CRITICAL: Under-Replicated Partitions
        - alert: KafkaUnderReplicatedPartitions
          expr: sum(kafka_server_replica_manager_under_replicated_partitions) > 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has under-replicated partitions"
            description: "{{ $value }} partitions are under-replicated. Data loss risk!"
        # CRITICAL: Offline Partitions
        - alert: KafkaOfflinePartitions
          expr: kafka_controller_offline_partitions_count > 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Kafka has offline partitions"
            description: "{{ $value }} partitions are offline. Service degradation!"
        # CRITICAL: No Active Controller
        # Summed across brokers: only the active controller reports 1, every other broker reports 0.
        - alert: KafkaNoActiveController
          expr: sum(kafka_controller_active_controller_count) == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "No active Kafka controller"
            description: "Cluster has no active controller. Cannot perform administrative operations!"
        # WARNING: High Consumer Lag
        - alert: KafkaConsumerLagHigh
          expr: sum by (consumergroup) (kafka_consumergroup_lag) > 10000
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Consumer group {{ $labels.consumergroup }} has high lag"
            description: "Lag is {{ $value }} messages. Consumers may be slow."
        # WARNING: High CPU Usage
        - alert: KafkaBrokerHighCPU
          expr: os_process_cpu_load{job="kafka"} > 0.8
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} has high CPU usage"
            description: "CPU usage is {{ $value | humanizePercentage }}. Consider scaling."
        # WARNING: Low Heap Memory
        - alert: KafkaBrokerLowHeapMemory
          expr: jvm_memory_heap_used_bytes{job="kafka"} / jvm_memory_heap_max_bytes{job="kafka"} > 0.9
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} is low on free heap memory"
            description: "Heap usage is {{ $value | humanizePercentage }}. Risk of OOM!"
        # WARNING: High GC Time
        - alert: KafkaBrokerHighGCTime
          expr: rate(jvm_gc_collection_time_ms_total{job="kafka"}[5m]) > 500
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Broker {{ $labels.instance }} spending too much time in GC"
            description: "GC time is {{ $value }}ms/sec. Application pauses likely."
Apply alerts (Kubernetes)
kubectl apply -f kafka-alerts.yml
Verify alerts loaded
kubectl get prometheusrules -n monitoring
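Each rule above pairs a threshold expression with a `for:` duration, which means the condition must hold at every evaluation for that long before the alert fires. The following Python sketch is a simplified model of that pending-then-firing behavior, not Prometheus code; `first_firing_time` and the sample values are hypothetical.

```python
def first_firing_time(samples, threshold, for_seconds):
    """Return the timestamp at which the alert would fire, or None.

    samples: (timestamp_seconds, value) pairs, one per rule evaluation, in order.
    The alert becomes pending when value > threshold and fires once it has stayed
    above the threshold for at least for_seconds; any dip resets the timer.
    """
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true: pending
            if ts - active_since >= for_seconds:
                return ts          # held long enough: firing
        else:
            active_since = None    # condition cleared: reset
    return None


# Made-up heap utilization ratios, evaluated every 30 seconds (assumption).
heap_ratio = [(0, 0.95), (30, 0.95), (60, 0.50), (90, 0.95), (120, 0.96), (150, 0.97)]
print(first_firing_time(heap_ratio, threshold=0.9, for_seconds=60))
```

In this model the dip at t=60 resets the timer, so a short `for:` suppresses brief spikes while a longer one delays detection of sustained problems.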
Troubleshooting
"Prometheus not scraping Kafka metrics"
Symptoms: No Kafka metrics in Prometheus
Fix:
1. Verify JMX exporter is running
curl http://kafka-broker:7071/metrics
2. Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="kafka")'
3. Check Prometheus logs
kubectl logs -n monitoring prometheus-prometheus-kube-prometheus-prometheus-0 -c prometheus
Common issues:
- Firewall blocking port 7071
- Incorrect scrape config
- Kafka broker not running
"Grafana dashboards not loading"
Symptoms: Dashboards show "No data"
Fix:
1. Verify Prometheus datasource
Grafana UI → Configuration → Data Sources → Prometheus → Test
2. Check if Kafka metrics exist in Prometheus
Prometheus UI → Graph → Enter: kafka_server_broker_topic_metrics_bytesin_total
3. Verify dashboard queries match your Prometheus job name
Dashboard panels use job="kafka" by default
If your job name is different, update dashboard JSON
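Updating the job name by hand across many panels is error-prone, so the rewrite can be scripted. This is a rough Python sketch, assuming panel queries live in string fields named `expr` (the usual place for Prometheus targets in Grafana dashboard JSON); `rewrite_job` is a hypothetical helper and the dashboard dict is a stand-in for a real file.

```python
def rewrite_job(node, old_job, new_job):
    """Return a copy of a dashboard structure with job="old" replaced in every expr field."""
    old, new = f'job="{old_job}"', f'job="{new_job}"'
    if isinstance(node, dict):
        return {
            key: value.replace(old, new)
            if key == "expr" and isinstance(value, str)
            else rewrite_job(value, old_job, new_job)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rewrite_job(item, old_job, new_job) for item in node]
    return node  # strings outside expr fields, numbers, booleans, None


# Stand-in for a loaded dashboard JSON file (assumption).
dashboard = {
    "title": "Kafka Cluster Overview",
    "panels": [
        {"targets": [{"expr": 'sum(rate(kafka_server_broker_topic_metrics_bytesin_total{job="kafka"}[5m]))'}]},
        {"panels": [{"targets": [{"expr": 'kafka_consumergroup_lag{job="kafka-exporter"}'}]}]},
    ],
}

updated = rewrite_job(dashboard, "kafka", "kafka-prod")
print(updated["panels"][0]["targets"][0]["expr"])
```

Because the closing quote is part of the match, `job="kafka-exporter"` is left alone. Template variable queries stored under other keys are not touched by this sketch and may need the same change.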
"Consumer lag metrics missing"
Symptoms: Consumer lag dashboard empty
Fix: Consumer lag metrics require Kafka Exporter (separate from JMX Exporter):
Install Kafka Exporter (Kubernetes)
helm install kafka-exporter prometheus-community/prometheus-kafka-exporter \
  --namespace monitoring \
  --set kafkaServer={kafka-bootstrap:9092}
Or run as Docker container
docker run -d -p 9308:9308 \
  danielqsj/kafka-exporter \
  --kafka.server=kafka:9092 \
  --web.listen-address=:9308
Add to Prometheus scrape config
scrape_configs:
  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']
Integration with Other Skills
- kafka-iac-deployment: Set up monitoring during Terraform deployment
- kafka-kubernetes: Configure monitoring for Strimzi Kafka on K8s
- kafka-architecture: Use cluster sizing metrics to validate capacity planning
- kafka-cli-tools: Use kcat to generate test traffic and verify metrics
Quick Reference Commands
Check JMX exporter metrics
curl http://localhost:7071/metrics | grep -E "(kafka_server|kafka_controller)"
Prometheus query examples
curl -g 'http://localhost:9090/api/v1/query?query=kafka_server_replica_manager_under_replicated_partitions'
Grafana dashboard export
curl http://admin:admin@localhost:3000/api/dashboards/uid/kafka-cluster-overview | jq .dashboard > backup.json
Reload Prometheus config
kill -HUP $(pidof prometheus)
Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="kafka")'
Next Steps After Monitoring Setup:
- Review all 5 Grafana dashboards to familiarize yourself with metrics
- Set up alerting (Slack, PagerDuty, email)
- Create runbooks for critical alerts (under-replicated partitions, offline partitions, no controller)
- Monitor for 7 days to establish baseline metrics
- Tune JVM settings based on GC metrics
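For the GC tuning step, it helps to know how a "GC ms per second" figure is derived from a cumulative counter. The Python sketch below is a deliberately simplified stand-in for PromQL `rate()`: it handles counter resets but does none of the window extrapolation Prometheus performs. `simple_rate` and the sample values are hypothetical.

```python
def simple_rate(samples):
    """Per-second increase of a counter over the sampled window, or None.

    samples: (timestamp_seconds, counter_value) pairs in time order.
    A drop in the counter is treated as a reset (for example a broker restart),
    so the post-reset value is counted as the increase for that step.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Made-up cumulative GC time in milliseconds over a 5-minute window (assumption).
gc_time_ms = [(0, 1000.0), (150, 40000.0), (300, 151000.0)]
print(simple_rate(gc_time_ms))
```

A value of 500 ms/sec means the JVM spent about half of wall-clock time in GC over the window, which is a common starting point for comparing heap sizes or collector settings against a baseline.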