holmesgpt-skill

Guide for implementing HolmesGPT - an AI agent for troubleshooting cloud-native environments. Use when investigating Kubernetes issues, analyzing alerts from Prometheus/AlertManager/PagerDuty, performing root cause analysis, configuring HolmesGPT installations (CLI/Helm/Docker), setting up AI providers (OpenAI/Anthropic/Azure), creating custom toolsets, or integrating with observability platforms (Grafana, Loki, Tempo, DataDog).

Install skill "holmesgpt-skill" with this command: npx skills add julianobarbosa/claude-code-skills/julianobarbosa-claude-code-skills-holmesgpt-skill

HolmesGPT Skill

AI-powered troubleshooting for Kubernetes and cloud-native environments.

Overview

HolmesGPT is a CNCF Sandbox project that connects AI models with live observability data to investigate infrastructure problems, find root causes, and suggest remediations. It operates with read-only access and respects RBAC permissions, making it safe for production environments.

Quick Reference

| Topic | Reference |
| --- | --- |
| Installation | references/installation.md |
| Configuration | references/configuration.md |
| Data Sources | references/data-sources.md |
| Commands | references/commands.md |
| Troubleshooting | references/troubleshooting.md |
| HTTP API | references/http-api.md |
| Integrations | references/integrations.md |

Key Features

  • Root Cause Analysis: Investigates alerts and cluster issues
  • Multi-Source Integration: 30+ toolsets (K8s, Prometheus, Grafana)
  • Alert Integration: AlertManager, PagerDuty, OpsGenie, Jira, Slack
  • Interactive Mode: Troubleshooting with /run, /show, /clear
  • Custom Toolsets: Extend with proprietary tools via YAML configuration
  • CI/CD Integration: Automated deployment failure investigation

Installation Quick Start

CLI (Homebrew)

brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt
export ANTHROPIC_API_KEY="your-key"  # or OPENAI_API_KEY
holmes ask "what pods are unhealthy?"

Kubernetes (Helm)

helm repo add robusta https://robusta-charts.storage.googleapis.com
helm repo update
helm install holmesgpt robusta/holmes -f values.yaml

Docker

docker run -it --net=host \
  -e OPENAI_API_KEY="your-key" \
  -v ~/.kube/config:/root/.kube/config \
  us-central1-docker.pkg.dev/genuine-flight-317411/devel/holmes \
  ask "what pods are crashing?"

Essential Commands

# Basic investigation
holmes ask "what pods are unhealthy and why?"
holmes ask "why is my deployment failing?"

# Interactive mode
holmes ask "investigate issue" --interactive

# Alert investigation
holmes investigate alertmanager --alertmanager-url http://localhost:9093
holmes investigate pagerduty --pagerduty-api-key <KEY> --update

# With file context
holmes ask "summarize the key points" -f ./logs.txt

# CI/CD integration (send the answer to Slack)
holmes ask "why did deployment fail?" --destination slack --slack-token <TOKEN>
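
In a pipeline, the investigation is usually only worth running when a deploy step has failed. The sketch below wraps an arbitrary deploy command and calls `holmes ask` with the flags shown above only on failure. The function name `investigate_on_failure` and the `SLACK_TOKEN` variable are hypothetical names for illustration, not part of HolmesGPT.

```shell
# Run the given command; if it fails, ask HolmesGPT why and send the answer
# to Slack, then return the original failure status to the CI system.
# SLACK_TOKEN is a hypothetical variable assumed to be injected by CI.
investigate_on_failure() {
  deploy_status=0
  "$@" || deploy_status=$?

  if [ "$deploy_status" -ne 0 ]; then
    holmes ask "why did deployment fail?" \
      --destination slack --slack-token "$SLACK_TOKEN" ||
      echo "holmes investigation itself failed" >&2
  fi

  return "$deploy_status"
}
```

A pipeline step could then call, for example, `investigate_on_failure helm upgrade --install my-app ./chart`, where the release name and chart path are placeholders. Returning the original status keeps the job red even when the investigation succeeds.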

Supported AI Providers

| Provider | Environment Variable | Models |
| --- | --- | --- |
| Anthropic | ANTHROPIC_API_KEY | Sonnet 4, Opus 4.5 |
| OpenAI | OPENAI_API_KEY | GPT-4.1, GPT-4o |
| Azure OpenAI | AZURE_API_KEY | GPT-4.1 |
| AWS Bedrock | AWS credentials | Claude 3.5 Sonnet |
| Google Gemini | GEMINI_API_KEY | Gemini 1.5 Pro |
| Vertex AI | VERTEXAI_PROJECT | Gemini 1.5 Pro |
| Ollama | Local install | Llama 3.1, Mistral |
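
Before running `holmes`, it can help to confirm that at least one provider credential from the table is exported. The following is a minimal sketch, assuming only the API-key variables listed above. The function name `detect_provider` and the precedence order are illustrative choices, not HolmesGPT behavior.

```shell
# Print the first provider whose API-key variable is set and non-empty.
# The order of the checks is an assumption made for this example.
detect_provider() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo anthropic
  elif [ -n "${OPENAI_API_KEY:-}" ]; then
    echo openai
  elif [ -n "${AZURE_API_KEY:-}" ]; then
    echo azure
  elif [ -n "${GEMINI_API_KEY:-}" ]; then
    echo gemini
  else
    echo "no provider API key found" >&2
    return 1
  fi
}
```

Calling `detect_provider || exit 1` at the top of a script fails early with a clear message instead of a provider error halfway through an investigation.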

Basic Helm Values Structure

# values.yaml for Kubernetes deployment
image:
  repository: robustadev/holmes
  tag: latest

env:
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: holmesgpt-secrets
        key: anthropic-api-key

# Model configuration
modelList:
  sonnet:
    api_key: "{{ env.ANTHROPIC_API_KEY }}"
    model: anthropic/claude-sonnet-4-20250514
    temperature: 0

# Toolsets to enable
toolsets:
  kubernetes/core:
    enabled: true
  kubernetes/logs:
    enabled: true
  prometheus/metrics:
    enabled: true

# Resources
resources:
  requests:
    memory: "1024Mi"
    cpu: "100m"
  limits:
    memory: "1024Mi"

# RBAC (read-only by default)
createServiceAccount: true
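
The values file above reads `ANTHROPIC_API_KEY` from a Secret named `holmesgpt-secrets` with the key `anthropic-api-key`, so that Secret has to exist before the release starts. A minimal sketch of creating it with `kubectl` follows. The function name `create_holmes_secret` and the namespace argument are assumptions; use the namespace the Helm release is installed into.

```shell
# Create the Secret referenced by env[].valueFrom.secretKeyRef in values.yaml.
# $1: namespace of the Helm release (assumed), $2: the API key value.
create_holmes_secret() {
  secret_namespace="$1"
  secret_api_key="$2"

  kubectl create secret generic holmesgpt-secrets \
    --namespace "$secret_namespace" \
    --from-literal=anthropic-api-key="$secret_api_key"
}
```

For example, `create_holmes_secret holmes "$ANTHROPIC_API_KEY"` would create the Secret in a namespace called `holmes`, which is a placeholder name.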

Interactive Mode Commands

| Command | Description |
| --- | --- |
| /clear | Reset context when changing topics |
| /run | Execute custom commands and share output with AI |
| /show | Display complete tool outputs |
| /context | Review accumulated investigation information |

Custom Toolset Example

# custom-toolset.yaml
toolsets:
  my-custom-tool:
    description: "Custom diagnostic tool"
    tools:
      - name: check_service_health
        description: "Check health of a specific service"
        command: |
          curl -s http://{{ service_name }}.{{ namespace }}.svc.cluster.local/health
        parameters:
          - name: service_name
            description: "Name of the service"
          - name: namespace
            description: "Kubernetes namespace"

Use with: holmes ask "check health" -t custom-toolset.yaml
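
To see what the `check_service_health` tool would run for a given pair of parameters, the placeholder substitution can be previewed by hand. The sketch below only mimics that substitution with `sed`; HolmesGPT does its own templating, and the function name `render_health_check` is hypothetical.

```shell
# Substitute the two placeholders from custom-toolset.yaml and print the
# resulting command. Assumes the values are plain Kubernetes names, which
# contain no characters that are special to sed.
render_health_check() {
  service_name="$1"
  namespace="$2"
  template='curl -s http://{{ service_name }}.{{ namespace }}.svc.cluster.local/health'

  printf '%s\n' "$template" |
    sed -e "s/{{ service_name }}/$service_name/" \
        -e "s/{{ namespace }}/$namespace/"
}
```

Running `render_health_check payment-service prod` prints `curl -s http://payment-service.prod.svc.cluster.local/health`, which is a quick way to check that the URL resolves inside the cluster before handing the tool to the agent.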

Kubernetes Annotations for Integration

# Add to Services/Deployments for HolmesGPT context
metadata:
  annotations:
    holmesgpt.dev/runbook: |
      This service handles payment processing.
      Common issues: database connectivity, API rate limits.
      Check: kubectl logs -l app=payment-service
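
The same annotation can be attached to an existing resource without editing its manifest. This is a minimal sketch using `kubectl annotate` with the annotation key from the example above. The function name `add_runbook` and the resource names in the usage note are placeholders.

```shell
# Attach or replace the runbook annotation on a resource.
# $1: resource kind (for example deployment), $2: resource name, $3: runbook text.
add_runbook() {
  resource_kind="$1"
  resource_name="$2"
  runbook_text="$3"

  kubectl annotate "$resource_kind" "$resource_name" \
    "holmesgpt.dev/runbook=$runbook_text" --overwrite
}
```

For example: `add_runbook deployment payment-service "Common issues: database connectivity, API rate limits."`. The `--overwrite` flag lets the same call update a runbook that is already present.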

Environment Variables Reference

| Variable | Description | Default |
| --- | --- | --- |
| HOLMES_CONFIG_PATH | Config file path | ~/.holmes/config.yaml |
| HOLMES_LOG_LEVEL | Log verbosity | INFO |
| PROMETHEUS_URL | Prometheus server URL | - |
| GITHUB_TOKEN | GitHub API token | - |
| DATADOG_API_KEY | DataDog API key | - |
| CONFLUENCE_BASE_URL | Confluence URL | - |
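
Wrapper scripts sometimes need to know which config file will be used. The sketch below applies the default from the table with ordinary shell parameter expansion. It assumes the variable and default listed above, and the function name `resolve_holmes_config` is hypothetical.

```shell
# Print HOLMES_CONFIG_PATH if it is set and non-empty, otherwise the
# default location listed in the table.
resolve_holmes_config() {
  printf '%s\n' "${HOLMES_CONFIG_PATH:-$HOME/.holmes/config.yaml}"
}
```

A script can then test `[ -f "$(resolve_holmes_config)" ]` before starting an investigation and print a helpful message when the file is missing.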

Best Practices

  1. Use Specific Queries: Include namespace, deployment name, symptoms
  2. Start with Claude Sonnet 4.0/4.5: Best accuracy for complex investigations
  3. Enable Relevant Toolsets: Only enable what you need to reduce noise
  4. Use Interactive Mode: For complex multi-step investigations
  5. Set Up Runbooks: Provide context for known alert types
  6. CI/CD Integration: Automate deployment failure analysis
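
As an illustration of the first practice, a small helper can force every question to carry a namespace, a deployment name, and a symptom. This is a sketch only; `build_query` is a hypothetical helper and the sentence template is one possible wording.

```shell
# Compose a specific question from the three details recommended above.
# $1: namespace, $2: deployment name, $3: observed symptom.
build_query() {
  query_namespace="$1"
  query_deployment="$2"
  query_symptom="$3"

  printf 'In namespace %s, why is deployment %s %s?\n' \
    "$query_namespace" "$query_deployment" "$query_symptom"
}
```

For example, `holmes ask "$(build_query payments payment-service 'restarting every few minutes')"` asks "In namespace payments, why is deployment payment-service restarting every few minutes?" instead of a vague "what is wrong?".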

Security Considerations

  • HolmesGPT uses read-only access (get, list, watch only)
  • Respects existing RBAC permissions
  • Never modifies, creates, or deletes resources
  • API keys stored in Kubernetes Secrets
  • Data not used for model training
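
The read-only claim can be checked against a real cluster with `kubectl auth can-i`, which exits non-zero when the answer is "no". The following is a minimal sketch that checks write verbs on pods only. The function name `check_read_only` is hypothetical, and the service account name must be taken from the actual installation.

```shell
# Fail if the given service account may perform any write verb on pods.
# $1: namespace, $2: service account name (both depend on the installation).
check_read_only() {
  subject="system:serviceaccount:$1:$2"

  for verb in create update patch delete; do
    if kubectl auth can-i "$verb" pods --quiet \
         --as "$subject" --namespace "$1" >/dev/null 2>&1; then
      echo "unexpected write permission: $verb pods" >&2
      return 1
    fi
  done

  echo "read-only for pods: $subject"
}
```

Pods are used here as a representative resource; a stricter check would repeat the loop for other resource types such as deployments and secrets.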
