ops-mcp-server

Query observability data and execute operational procedures via the ops-mcp-server MCP interface. Covers Kubernetes events, Prometheus metrics, Elasticsearch logs, Jaeger distributed traces, and SOPS runbooks.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "ops-mcp-server" with this command: npx skills add shaowenchen/ops-mcp-server

Ops MCP Server Skill

Access your infrastructure's observability data and execute operational procedures through a unified MCP interface.

Capabilities at a Glance

ModuleToolsWhat it answers
Events (Kubernetes)list-events-from-ops, get-events-from-opsWhat happened to a pod/deployment/node?
Metrics (Prometheus)list-metrics-from-prometheus, query-metrics-from-prometheus, query-metrics-range-from-prometheusIs CPU/memory/traffic normal? What changed over time?
Logs (Elasticsearch)list-log-indices-from-elasticsearch, search-logs-from-elasticsearch, query-logs-from-elasticsearchWhat errors are in the logs? What did service X log?
Traces (Jaeger)get-services-from-jaeger, get-operations-from-jaeger, find-traces-from-jaeger, get-trace-from-jaegerWhy is this request slow? Where did it fail?
SOPSlist-sops-from-ops, list-sops-parameters-from-ops, execute-sops-from-opsRun a standard operational procedure

Setup (first-time)

# 1. Use mcporter with npx (no installation needed)
# Or install globally: npm i -g mcporter

# 2. Register the server
cd ~/.openclaw/workspace
npx mcporter config add ops-mcp-server --url http://localhost/mcp

# 3. Authenticate (if needed)
npx mcporter auth ops-mcp-server
# On failure, add to ~/.openclaw/workspace/config/mcporter.json:
# "headers": { "Authorization": "Bearer YOUR_TOKEN" }

# 4. Verify
npx mcporter list ops-mcp-server
npx mcporter call ops-mcp-server list-events-from-ops page_size=5

# 5. Set env var
export OPS_MCP_SERVER_URL="http://localhost/mcp"

How to Investigate: Decision Guide

When a user describes a problem, use this guide to choose starting tools and build a complete picture.

🔴 "Something is broken / service is down"

  1. Kubernetes Events first — check if pods crashed, restarted, or got evicted
    get-events-from-ops  subject_pattern="ops.clusters.*.namespaces.<ns>.pods.*.events"
    
  2. Logs — search for errors around the time of the incident
    query-logs-from-elasticsearch  query="FROM logs-* | WHERE @timestamp > NOW() - 30 minutes | WHERE level == 'error' | LIMIT 50"
    
  3. Traces — find failed or slow requests
    find-traces-from-jaeger  serviceName=<service>  tags={"error":"true"}
    

🟡 "Performance is degraded / requests are slow"

  1. Metrics — check resource saturation
    query-metrics-from-prometheus  query="100 - (avg(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100)"
    query-metrics-range-from-prometheus  query="node_memory_MemAvailable_bytes"  time_range="1h"  step="1m"
    
  2. Traces — find slow spans
    find-traces-from-jaeger  serviceName=<service>  durationMin=1000
    
  3. Logs — look for timeouts or slow query warnings

🔵 "I need to run a procedure / restart something"

  1. List available SOPs
    list-sops-from-ops
    
  2. Get parameters
    list-sops-parameters-from-ops  sops_id=<id>
    
  3. Execute
    execute-sops-from-ops  sops_id=<id>  parameters='{...}'
    

🟢 "General health check / nothing specific"

Start with events + a key metrics query, then go deeper based on what you find.


Tool Quick Reference

Events — NATS subject pattern format

# Namespace resources
ops.clusters.{cluster}.namespaces.{ns}.{resourceType}.{name}.{observation}

# Node level
ops.clusters.{cluster}.nodes.{nodeName}.{observation}

# Notifications
ops.notifications.providers.{provider}.channels.{channel}.severities.{severity}

Wildcards: * = one segment, > = everything remaining (tail only)

Observation types: status | events | alerts | findings

Time is Unix milliseconds: $(date +%s)000

Logs — ES|QL query patterns

-- Recent errors
FROM logs-* | WHERE @timestamp > NOW() - 30 minutes | WHERE level == 'error' | LIMIT 100

-- Top errors by frequency
FROM logs-* | WHERE @timestamp > NOW() - 1 hour | WHERE level == 'error'
| STATS count() BY message | SORT count DESC | LIMIT 10

-- Specific service
FROM logs-* | WHERE service == 'checkout-service' | WHERE @timestamp > NOW() - 1 hour | LIMIT 50

Metrics — PromQL patterns

# CPU usage
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100)

# Memory available
node_memory_MemAvailable_bytes

# HTTP error rate
rate(http_requests_total{status=~"5.."}[5m])

Detailed Examples & Reference Files

For complete parameter lists, output formats, and advanced patterns, read the relevant file:

  • eventsexamples/events.md
  • metricsexamples/metrics.md
  • logsexamples/logs.md
  • tracesexamples/traces.md
  • sopsexamples/sops.md
  • event subject format designreferences/design.md

Read the relevant example file before making complex tool calls you're unsure about.


What This Skill is NOT For

  • Direct infrastructure changes (use dedicated automation tooling)
  • Real-time alerting (investigation only, not a monitoring agent)
  • Writing to or modifying operational data (all access is read-only)

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

Ai Agent Builder

快速构建和部署支持多工具集成与记忆管理的自定义 AI Agent,适用于客服、数据采集和研究自动化。

Registry SourceRecently Updated
Automation

GolemedIn MCP

Discover AI agents, manage agent profiles, post updates, search jobs, and message other agents on GolemedIn — the open agent registry.

Registry SourceRecently Updated
Automation

Agent HQ

Deploy the Agent HQ mission-control stack (Express + React + Telegram notifier / Jarvis summary) so other Clawdbot teams can spin up the same board, high-priority watcher, and alert automation. Includes setup, telemetry, and automation hooks.

Registry SourceRecently Updated