Grafana Observability

Full access to Grafana instances (self-hosted or Grafana Cloud) for network infrastructure observability: dashboards, Prometheus metrics (PromQL), Loki logs (LogQL), alerting rules, incident management, OnCall schedules, annotations, and panel image rendering. 75+ tools via the official Grafana MCP server.

MCP Server

Property Value

Source grafana/mcp-grafana

Transport stdio (default), SSE, or streamable-http

Language Go (runs via uvx mcp-grafana )

Tools 75+ (dashboards, Prometheus, Loki, alerting, incidents, OnCall, annotations, admin)

Auth Service account token (preferred) or username/password

Requires Grafana 9.0+, service account with Editor role or granular RBAC

How to Run

stdio mode (default — used by NetClaw)

uvx mcp-grafana

Read-only mode (prevents dashboard/alert modifications)

uvx mcp-grafana --disable-write

Environment Variables

Variable Required Example Description

GRAFANA_URL

Yes http://grafana.example.com:3000

Grafana instance URL

GRAFANA_SERVICE_ACCOUNT_TOKEN

Yes* glsa_abc123...

Service account token (preferred auth)

GRAFANA_USERNAME

Alt admin

Basic auth username (alternative to token)

GRAFANA_PASSWORD

Alt changeme

Basic auth password

GRAFANA_ORG_ID

No 1

Organization ID for multi-org setups

*Either service account token or username/password required.

Key Tool Categories

Dashboard Operations

Tool What It Does

search_dashboards

Find dashboards by title or metadata

get_dashboard_summary

Lightweight overview (context-efficient — use this first)

get_dashboard_by_uid

Full dashboard JSON (large — use sparingly)

get_dashboard_property

Extract specific fields via JSONPath

get_dashboard_panel_queries

Extract panel query details

update_dashboard

Create or modify dashboards

patch_dashboard

Targeted modifications without full JSON replacement

Prometheus (PromQL)

Tool What It Does

query_prometheus

Execute instant or range PromQL queries

list_prometheus_metric_names

Discover available metrics

list_prometheus_label_names

List labels matching selectors

list_prometheus_label_values

Retrieve values for a specific label

query_prometheus_histogram

Calculate percentiles (p50, p90, p95, p99)

list_prometheus_metric_metadata

Metric type, help text, unit

Loki (LogQL)

Tool What It Does

query_loki_logs

Execute LogQL queries against log streams

list_loki_label_names

Discover available log labels

list_loki_label_values

List values for a specific log label

query_loki_stats

Stream statistics (volume, rate)

query_loki_patterns

Detect log structure patterns

Alerting

Tool What It Does

list_alert_rules

View all Grafana and datasource-managed alert rules

get_alert_rule_by_uid

Retrieve specific alert rule details

create_alert_rule

Create new alert rule

update_alert_rule

Modify existing alert rule

delete_alert_rule

Remove alert rule

list_contact_points

View notification endpoints (email, Slack, PagerDuty, etc.)

Incident Management

Tool What It Does

list_incidents

View Grafana Incidents with filtering

get_incident

Single incident details

create_incident

Create a new incident

add_activity_to_incident

Add timeline entry to incident

OnCall

Tool What It Does

list_oncall_schedules

View on-call rotation schedules

get_oncall_shift

Shift details

get_current_oncall_users

Who is on call right now

list_alert_groups

OnCall alert groups with filtering

Annotations & Rendering

Tool What It Does

get_annotations

Query annotations with time/tag filters

create_annotation

Add annotation to dashboard/panel

get_panel_image

Render a panel or dashboard as PNG image

generate_deeplink

Create accurate Grafana URLs for sharing

Investigation (Sift)

Tool What It Does

list_sift_investigations

List automated investigations

get_sift_investigation

Investigation details

find_error_pattern_logs

Detect elevated error patterns in logs

find_slow_requests

Identify slow requests via Tempo traces

Workflow: Network Infrastructure Monitoring

When checking network device metrics in Grafana:

Find dashboards: search_dashboards with keyword (e.g., "network", "interface", "BGP")
Dashboard overview: get_dashboard_summary for panel list without full JSON
Query metrics: query_prometheus with PromQL for specific metrics:
Interface traffic: rate(ifHCInOctets{instance="router1"}[5m]) * 8
BGP peer state: bgp_peer_state{peer="10.1.1.2"}
CPU utilization: device_cpu_utilization{device="core-rtr-01"}
Interface errors: increase(ifInErrors{device=~".*"}[1h])
Check alerts: list_alert_rules to see active alerting thresholds
Search logs: query_loki_logs for syslog or SNMP trap data
Report: Metrics summary with alert status and log correlation
GAIT: Record all queries in audit trail

Example: Interface Utilization Check

search_dashboards(title="Network Interfaces") get_dashboard_summary(uid="abc123") query_prometheus(expr="rate(ifHCInOctets{device='core-rtr-01'}[5m]) * 8", time_range="1h") query_prometheus(expr="rate(ifHCOutOctets{device='core-rtr-01'}[5m]) * 8", time_range="1h") list_alert_rules(folder="Network")

Workflow: Alert Investigation

When investigating Grafana alerts:

List alerts: list_alert_rules — find firing or pending rules
Alert details: get_alert_rule_by_uid — thresholds, conditions, datasource
Query metrics: query_prometheus — check the metric that triggered the alert
Search logs: query_loki_logs — correlate with log events around alert time
Check incidents: list_incidents — is this already tracked?
Contact points: list_contact_points — verify notification routes
Report: Alert analysis with root cause and metric evidence

Workflow: Incident Response

When responding to a Grafana incident:

List incidents: list_incidents — find open incidents
Incident details: get_incident — timeline, severity, labels
OnCall: get_current_oncall_users — who should be notified
Correlate metrics: query_prometheus — check affected service metrics
Correlate logs: query_loki_logs — find error patterns around incident time
Investigate: find_error_pattern_logs — automated error pattern detection
Update incident: add_activity_to_incident — add findings to timeline
Annotate: create_annotation — mark event on relevant dashboards

Workflow: Log Analysis

When investigating network logs stored in Loki:

Discover labels: list_loki_label_names — find available labels (host, severity, facility)
Label values: list_loki_label_values — enumerate hosts, severity levels
Query logs: query_loki_logs with LogQL:
By device: {host="core-rtr-01"}
By severity: {host="core-rtr-01"} |= "error"
Pattern match: {job="syslog"} |~ "BGP|OSPF"
Patterns: query_loki_patterns — detect recurring log structures
Stats: query_loki_stats — log volume and rate analysis

Integration with Other Skills

Skill Integration

pyats-health-check Cross-reference pyATS health data with Grafana metrics and dashboards

pyats-routing Correlate OSPF/BGP state changes with Grafana metric timelines

gait-session-tracking Record all Grafana queries and findings in GAIT audit trail

slack-network-alerts Grafana alerts fed through Slack + NetClaw for automated investigation

servicenow-change-workflow Annotate Grafana dashboards during change windows; correlate incidents with CRs

te-network-monitoring Pair ThousandEyes path data with Grafana infrastructure metrics

aws-cloud-monitoring Compare Grafana dashboards with CloudWatch data for hybrid visibility

markmap-viz Visualize Grafana alert rule hierarchies as mind maps

Context Window Management

Grafana dashboards can be large JSON documents. Use these strategies:

Always start with get_dashboard_summary — lightweight overview, not full JSON
Use get_dashboard_property with JSONPath for specific fields
Avoid get_dashboard_by_uid unless you need the complete dashboard definition
Use get_dashboard_panel_queries to extract just the query definitions

Important Rules

Prefer read-only operations — use search_dashboards , get_dashboard_summary , query_prometheus , query_loki_logs , list_alert_rules before any write operations
Dashboard modifications require ServiceNow CR — unless in lab/dev Grafana instance
Alert rule changes require approval — creating/updating/deleting alert rules affects production monitoring
Token-efficient queries — use get_dashboard_summary over get_dashboard_by_uid , use time ranges to limit Prometheus/Loki result size
GAIT audit mandatory — record all Grafana queries, dashboard modifications, alert changes, and incident updates
No secrets in queries — never embed credentials or sensitive data in PromQL/LogQL expressions

Error Handling

Auth fails (401/403): Check GRAFANA_URL and GRAFANA_SERVICE_ACCOUNT_TOKEN in ~/.openclaw/.env . Verify service account has Editor role or required RBAC permissions.
Datasource not found: Use list_datasources to discover available datasource UIDs and names.
PromQL/LogQL errors: Use list_prometheus_metric_names or list_loki_label_names to discover valid metric/label names before querying.
Dashboard not found: Use search_dashboards to find dashboards by title before using UID-based tools.
Rate limiting: Grafana may rate-limit API requests; space out large query batches.

grafana-observability

Safety Notice

Copy this and send it to your AI assistant to learn

stdio mode (default — used by NetClaw)

Read-only mode (prevents dashboard/alert modifications)

Source Transparency

Related Skills

pyats-topology

aws-cloud-monitoring

pyats-health-check