LGTM Skill - Query Observability Backends
Why subagents matter here
lgtm commands return raw JSON — sometimes thousands of lines. If you run queries directly in the main conversation, you'll flood the context window and make it harder to reason about what actually matters. Haiku subagents are the right tool: they run the queries, distill the results, and hand you back just the signal you need.
The pattern is: you orchestrate, haiku executes.
Orchestrator Pattern
- You (orchestrator): Coordinate the discovery → investigation flow. Evaluate summaries returned by subagents, decide what to query next, synthesize findings for the user. Don't run lgtm commands yourself.
- Haiku subagent: All query execution — discovery, investigation, aggregation, analysis. Fast and sufficient for the vast majority of tasks.
Run independent queries in parallel — spawn multiple Task calls in one message when queries don't depend on each other (e.g., check logs AND metrics AND traces simultaneously).
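The same idea applies inside a single Bash subagent: independent lgtm calls can be backgrounded and joined with `wait`. A runnable sketch, with `echo` stand-ins for the real queries:

```shell
# Run two independent queries concurrently, then collect both results.
logs=$(mktemp); metrics=$(mktemp)
echo "loki result" > "$logs"    &   # stand-in for: lgtm loki query '{app="api"}'
echo "prom result" > "$metrics" &   # stand-in for: lgtm prom query 'up'
wait                                # join both background jobs before reading
OUT="$(cat "$logs" "$metrics")"
rm -f "$logs" "$metrics"
echo "$OUT"
```

This only helps when the queries don't depend on each other; anything that needs discovery results first must stay sequential.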
Two-Phase Approach
Phase 1: Discovery
Before querying blindly, discover what's available. This avoids wasted queries against the wrong label names or nonexistent services.
Task tool call:
subagent_type: "Bash"
model: "haiku"
prompt: "Using lgtm CLI, discover available labels and services.
Run: lgtm loki labels
Run: lgtm loki label-values app
Run: lgtm loki label-values namespace
Run: lgtm tempo tag-values service.name
Return a concise list of available apps, namespaces, and trace services."
Phase 2: Investigation
With concrete label values in hand, query precisely:
Task tool call:
subagent_type: "Bash"
model: "haiku"
prompt: "Using lgtm CLI, investigate errors in the checkout app in prod namespace.
<specific queries based on discovery results>
Return ONLY a concise summary, not raw JSON."
Setup: Config File Required
Before querying, check if the config file exists at ~/.config/lgtm/config.yaml. If it doesn't, stop and tell the user to run lgtm discover (for Grafana Cloud) or create the config manually.
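That check can be sketched as a short shell guard (the config path is from this doc; the exact wording of the messages is illustrative):

```shell
# Guard: confirm the lgtm config exists before spawning any query subagents.
CONFIG="$HOME/.config/lgtm/config.yaml"
if [ -f "$CONFIG" ]; then
  echo "config found: $CONFIG"
else
  # Stop here and surface the fix to the user instead of querying.
  echo "no config at $CONFIG; run 'lgtm discover' (Grafana Cloud) or create it manually"
fi
```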
Grafana Cloud Auto-Discovery
If the user is on Grafana Cloud, they can auto-generate the config:
# Requires a Grafana Cloud Access Policy token with stacks:read scope
# Create at: Grafana Cloud → Administration → Cloud Access Policies
GRAFANA_CLOUD_API_TOKEN=glc_xxx lgtm discover
# Preview without writing
lgtm discover --dry-run
# Discover for a specific org
lgtm discover --org myorg --token glc_xxx
# Overwrite existing entries
lgtm discover --overwrite
This generates config entries for all active stacks with Loki, Prometheus, and Tempo endpoints.
Error Messages
v1.2.0+ shows clean, actionable errors instead of tracebacks:
- Nonexistent instance (-i nonexistent): lists available instances
- Empty config: suggests running lgtm discover
- Unset env vars: warns when ${VAR_NAME} references are not set
CLI Reference
lgtm is installed as a global tool. If it isn't available yet, install it with:
uv tool install lgtm-cli
Config file: ~/.config/lgtm/config.yaml
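The config schema itself is not documented in this skill. A hypothetical shape, inferred only from the commands above (named instances, per-signal endpoints, ${VAR_NAME} env references), might look like this; check a generated config for the real schema:

```yaml
# Hypothetical sketch only, not the documented schema.
instances:
  production:
    loki:
      url: https://logs.example.com
    prometheus:
      url: https://metrics.example.com
      token: ${PROD_METRICS_TOKEN}   # unset env refs trigger the v1.2.0+ warning
    tempo:
      url: https://traces.example.com
```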
Loki (Logs)
# Discovery
lgtm loki labels
lgtm loki label-values app
lgtm loki label-values namespace
# Basic query (defaults: last 15 min, limit 50)
lgtm loki query '{app="myapp"}'
# Filter for errors
lgtm loki query '{app="myapp"} |= "error"'
# Custom time range and limit
lgtm loki query '{app="myapp"}' --start 2024-01-15T10:00:00Z --end 2024-01-15T11:00:00Z --limit 100
# Aggregations (prefer these over raw log fetches for initial overviews)
lgtm loki instant 'count_over_time({app="myapp"} |= "error" [5m])'
lgtm loki instant 'sum by (level) (count_over_time({app="myapp"} | json [5m]))'
Prometheus/Mimir (Metrics)
# Discovery
lgtm prom labels
lgtm prom label-values __name__
lgtm prom metadata --metric http_requests_total
# Instant query
lgtm prom query 'up{job="prometheus"}'
lgtm prom query 'rate(http_requests_total[5m])'
# Range query (defaults: last 15 min, 60s step)
lgtm prom range 'rate(http_requests_total[5m])'
lgtm prom range 'up' --start 2024-01-15T10:00:00Z --end 2024-01-15T11:00:00Z --step 5m
Tempo (Traces)
# Discovery
lgtm tempo tags
lgtm tempo tag-values service.name
# Search (defaults: last 15 min, limit 20)
lgtm tempo search -q '{resource.service.name="api"}'
lgtm tempo search -q '{status=error}'
lgtm tempo search --min-duration 1s
lgtm tempo search -q '{resource.service.name="api" && status=error}' --min-duration 500ms
# Get specific trace by ID
lgtm tempo trace abc123def456
Instance Selection & Discovery
lgtm instances # list configured instances
lgtm -i production loki query '{app="api"}' # use specific instance
lgtm discover # auto-discover Grafana Cloud stacks
lgtm discover --dry-run # preview without writing config
Kubernetes Port-Forward Instances
Some instances require kubectl port-forwarding to reach services inside a cluster.
lgtm port-forward # show port-forward commands for all instances that need them
lgtm -i sandbox port-forward # show for specific instance
Subagent prompt for port-forward instances:
Task tool call:
subagent_type: "Bash"
model: "haiku"
prompt: "Query sandbox cluster metrics using lgtm CLI.
1. Check the port-forward command needed:
lgtm -i sandbox port-forward
2. Start the tunnel in the background:
kubectl port-forward -n monitoring svc/victoria-metrics-server 8428:8428 --context sandbox &
sleep 2 # wait for tunnel to establish
3. Run the query:
lgtm -i sandbox prom query 'sandbox_running_count'
4. Return a summary of the results."
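The tunnel lifecycle in the prompt above follows a start / wait / query / clean-up pattern. A self-contained sketch, with `sleep` standing in for the real `kubectl port-forward` so it runs anywhere:

```shell
# Stand-in for: kubectl port-forward -n monitoring svc/victoria-metrics-server 8428:8428 &
sleep 30 &
TUNNEL_PID=$!
sleep 1                          # give the tunnel time to establish
# ... run the lgtm query against the forwarded port here ...
kill "$TUNNEL_PID" 2>/dev/null   # tear the tunnel down when done
wait "$TUNNEL_PID" 2>/dev/null   # reap it so no zombie lingers
echo "tunnel stopped"
```

Cleaning up the background process matters in subagents: a leaked port-forward keeps the Bash call (and the port) occupied after the query finishes.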
Output Formatting
All commands output JSON. Subagents should use jq to extract what's relevant rather than returning raw output:
# Extract just log lines
lgtm loki query '{app="api"}' | jq -r '.data.result[].values[][] | select(type == "string")'
# Extract metric values
lgtm prom query 'up' | jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"'
# Trace summary
lgtm tempo search -q '{status=error}' | jq -r '.traces[] | "\(.traceID) | \(.rootServiceName) | \(.durationMs)ms"'
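To see the metric extraction in action without a live backend, the same filter can be run against a sample payload shaped like a Prometheus instant-query response (the sample data below is illustrative):

```shell
# Illustrative response body; in practice this comes from `lgtm prom query 'up'`.
SAMPLE='{"status":"success","data":{"resultType":"vector","result":[
  {"metric":{"instance":"api-1:9100"},"value":[1705312800,"1"]},
  {"metric":{"instance":"api-2:9100"},"value":[1705312800,"0"]}]}}'
OUT=$(printf '%s' "$SAMPLE" | jq -r '.data.result[] | "\(.metric.instance): \(.value[1])"')
echo "$OUT"
# api-1:9100: 1
# api-2:9100: 0
```

This is the shape a subagent should hand back: one line per series, not the raw JSON envelope.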
Subagent Prompt Examples
Discovery
Discover available observability data using lgtm CLI.
1. lgtm loki labels
2. lgtm loki label-values app
3. lgtm loki label-values namespace
4. lgtm tempo tag-values service.name
Return a concise list:
- Available apps: [list]
- Available namespaces: [list]
- Available trace services: [list]
- Any other relevant labels
Investigate Error Spike
Investigate errors in the checkout service over the last hour using lgtm CLI.
1. Get error counts: lgtm loki instant 'sum by (level) (count_over_time({app="checkout"} | json [1h]))'
2. If errors found, sample logs: lgtm loki query '{app="checkout"} |= "error"' --limit 30
3. Check traces: lgtm tempo search -q '{resource.service.name="checkout" && status=error}'
Summarize:
- Total error count and trend
- Top 3 most frequent error messages
- When errors started
- Affected components/pods
- Any correlated trace IDs
Return only the summary, not raw JSON.
Service Health Check
Check health of the payment-service using lgtm CLI.
1. Error rate: lgtm loki instant 'sum(count_over_time({app="payment-service"} |= "error" [15m]))'
2. P95 latency: lgtm prom query 'histogram_quantile(0.95, rate(http_request_duration_seconds_bucket{service="payment"}[5m]))'
3. Recent errors: lgtm loki query '{app="payment-service"} |= "error"' --limit 10
Return:
- Status: healthy/degraded/unhealthy
- Error rate (errors per minute)
- P95 latency
- Any critical issues
Trace Investigation
Investigate slow requests in the API gateway using lgtm CLI.
1. Find slow traces: lgtm tempo search -q '{resource.service.name="api-gateway"}' --min-duration 2s --limit 10
2. For the slowest trace: lgtm tempo trace <traceID>
3. Check downstream: lgtm tempo search -q '{resource.service.name="api-gateway"} >> {duration > 1s}'
Summarize:
- How many slow requests in the last 15 min
- Which downstream service is causing delays
- Common patterns in slow requests
Best Practices
Aggregations over raw data — count before you fetch. Pulling all error logs is slow and wasteful; getting a count first tells you whether it's worth digging deeper.
Use specific identifiers when you have them — if the user gives you a trace ID, request ID, or pod name, filter on it directly rather than scanning broadly.
Prefer aggregations for the initial overview:
# Get the lay of the land first
lgtm loki instant 'sum by (app) (count_over_time({namespace="prod"} |= "error" [15m]))'
# Then drill into the specific app
lgtm loki query '{namespace="prod", app="checkout"} |= "error"' --limit 20
Reference
For query syntax, see:
- reference/logql.md - LogQL syntax for Loki
- reference/promql.md - PromQL syntax for Prometheus
- reference/traceql.md - TraceQL syntax for Tempo