monitoring-skill

Monitoring & Observability Skill

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "monitoring-skill" with this command: npx skills add pluginagentmarketplace/custom-plugin-devops/pluginagentmarketplace-custom-plugin-devops-monitoring-skill

Overview

Master the three pillars of observability: metrics, logs, and traces.

Parameters

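| Name   | Type   | Required | Default    | Description          |
|--------|--------|----------|------------|----------------------|
| pillar | string | No       | all        | Observability pillar |
| tool   | string | No       | prometheus | Tool focus           |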

Core Topics

MANDATORY

  • Prometheus metrics and PromQL

  • Grafana dashboards

  • ELK Stack basics

  • SLIs, SLOs, error budgets

  • Alerting rules

OPTIONAL

  • Distributed tracing

  • OpenTelemetry

  • Custom exporters

  • Log correlation

ADVANCED

  • High cardinality handling

  • Recording rules

  • Federation

  • Continuous profiling

Quick Reference

PromQL

# Request rate per service
sum(rate(http_requests_total[5m])) by (service)
# 99th percentile request latency
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
# Error rate as a percentage of all requests
100 * sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
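To make the p99 query above less of a black box, here is a minimal Python sketch of the kind of bucket interpolation that `histogram_quantile` performs. It is a simplified illustration, not Prometheus's actual implementation: it assumes non-negative bucket bounds and cumulative counts, and the function name and sample buckets are hypothetical.

```python
import math

def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: (upper_bound, cumulative_count) pairs, mirroring the `le`
    label and value of *_bucket series. Must include the +Inf bucket.
    Assumes non-negative bounds (e.g. latencies in seconds).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must include the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Quantile falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0] if len(buckets) > 1 else math.nan
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

In PromQL the inputs are per-second bucket rates rather than raw counts, but the interpolation works the same way. The result is only as precise as the bucket boundaries, which is why bucket layout matters for latency SLOs.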

Prometheus API

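# List scrape targets and their health
curl http://localhost:9090/api/v1/targets
# Run an instant query
curl 'http://localhost:9090/api/v1/query?query=up'
# Reload the configuration (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload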
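The instant-query call above can also be scripted. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and the base URL is assumed to be the same local Prometheus address used in the curl examples.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def build_query_url(base_url, expr):
    """Return the instant-query URL for a PromQL expression (URL-encoded)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"

def parse_instant_vector(payload):
    """Turn a decoded /api/v1/query response into (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(payload.get("error", "query failed"))
    data = payload["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is [unix_timestamp, "value-as-string"].
    return [(s["metric"], float(s["value"][1])) for s in data["result"]]

def query(base_url, expr, timeout=5):
    """Run an instant query against a live server (needs network access)."""
    with urlopen(build_query_url(base_url, expr), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, `query("http://localhost:9090", "up")` would return one pair per scrape target, assuming a server is listening at that address.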

Alertmanager

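# Silence an alert for two hours (a comment is typically required; the Alertmanager
# address comes from the amtool config file or --alertmanager.url)
amtool silence add alertname="HighLatency" --duration=2h --comment="investigating"
# List currently firing alerts
amtool alert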

SRE Golden Signals

| Signal     | Metric                            |
|------------|-----------------------------------|
| Latency    | histogram_quantile(0.99, ...)     |
| Traffic    | sum(rate(requests_total[5m]))     |
| Errors     | rate(errors_total[5m])            |
| Saturation | node_memory_MemAvailable_bytes    |
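The Errors signal ties into the SLO and error-budget topic listed under Core Topics. As a rough illustration of the arithmetic, the sketch below converts an availability target and request counts into a budget. The function name, the 99.9% target, and the request counts are hypothetical and not taken from the upstream skill.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return the error budget for a request-based availability SLO.

    slo_target: fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    # The budget is the share of requests that are allowed to fail.
    allowed = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed if allowed > 0 else 0.0
    return {
        "allowed_failures": allowed,
        "remaining_failures": allowed - failed_requests,
        "budget_consumed": consumed,  # 1.0 means the budget is exhausted
    }

# Usage: a 99.9% SLO over 1,000,000 requests with 250 failures.
budget = error_budget(0.999, 1_000_000, 250)
```

Under these assumed numbers roughly a quarter of the budget is spent. A value of `budget_consumed` above 1.0 would mean the SLO has been missed for the window.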

Troubleshooting

Common Failures

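| Symptom          | Root Cause      | Solution           |
|------------------|-----------------|--------------------|
| No data          | Scrape failing  | Check targets page |
| Alert not firing | PromQL error    | Test in UI         |
| High cardinality | Too many labels | Reduce labels      |
| Slow queries     | Too much data   | Add aggregation    |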

Debug Checklist

  • Check targets: /targets

  • Test query in UI

  • Check logs: journalctl -u prometheus

  • Verify time sync (NTP)

Recovery Procedures

Prometheus OOM

  • Check cardinality

  • Reduce retention

  • Add federation
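For the "Check cardinality" step, one way to see where series come from is to count them per metric and count distinct values per label. The sketch below assumes the input is a list of label sets (dicts including `__name__`), in the shape returned by the Prometheus series API; the helper names and sample data are hypothetical.

```python
from collections import Counter, defaultdict

def series_per_metric(series):
    """Count how many series each metric name contributes."""
    return Counter(labels.get("__name__", "") for labels in series)

def label_value_counts(series):
    """Count distinct values per label name (excluding the metric name)."""
    values = defaultdict(set)
    for labels in series:
        for name, value in labels.items():
            if name != "__name__":
                values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}

# Hypothetical sample: a per-user label multiplies the series count.
sample = [
    {"__name__": "http_requests_total", "service": "api", "user_id": "1"},
    {"__name__": "http_requests_total", "service": "api", "user_id": "2"},
    {"__name__": "http_requests_total", "service": "api", "user_id": "3"},
    {"__name__": "up", "service": "api"},
]
top_metrics = series_per_metric(sample).most_common(1)
per_label = label_value_counts(sample)
```

Labels with many distinct values (here the assumed `user_id`) are the usual candidates to drop or aggregate away.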

Resources

  • Prometheus Docs

  • Grafana Docs

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.