qa-observability

QA Observability and Performance Engineering

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Install skill "qa-observability" with this command: npx skills add vasilyu1983/ai-agents-public/vasilyu1983-ai-agents-public-qa-observability

Use telemetry (logs, metrics, traces, profiles) as a QA signal and a debugging substrate.

Core references (see data/sources.json): OpenTelemetry, W3C Trace Context, and SLO practices (Google SRE).

Quick Start (Default)

If key context is missing, ask for: critical user journeys, service/dependency inventory, environments (local/staging/prod), current telemetry stack, and current SLO/SLA commitments (if any).

  • Establish the minimum bar: correlation IDs + structured logs + traces + golden metrics (latency, traffic, errors, saturation).

  • Verify propagation: confirm traceparent (and your request ID) flow across boundaries end-to-end.

  • Make failures diagnosable: every test failure captures a trace link (or trace ID) plus the correlated logs.

  • Define SLIs/SLOs and error budget policy; wire burn-rate alerts (prefer multi-window burn rates).

  • Produce artifacts: a readiness checklist plus an SLO definition and alert rules (use assets/checklists/template-observability-readiness-checklist.md and assets/monitoring/slo/*).
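The propagation check above can be automated with a few lines of validation. A minimal sketch in stdlib-only Python; the header layout (version, trace-id, parent-id, flags as lowercase hex) follows the W3C Trace Context spec:

```python
import re

# W3C Trace Context traceparent: 1-byte version, 16-byte trace-id,
# 8-byte parent-id, 1-byte trace-flags, all lowercase hex, dash-separated.
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header: str) -> dict:
    """Parse a traceparent header; raise ValueError on malformed input."""
    m = TRACEPARENT_RE.match(header)
    if m is None:
        raise ValueError(f"malformed traceparent: {header!r}")
    parts = m.groupdict()
    # All-zero trace-id or parent-id is invalid per the spec.
    if parts["trace_id"] == "0" * 32 or parts["parent_id"] == "0" * 16:
        raise ValueError("all-zero trace-id or parent-id")
    return parts

def same_trace(incoming: str, outgoing: str) -> bool:
    """End-to-end check: an outgoing hop must keep the incoming trace-id."""
    return parse_traceparent(incoming)["trace_id"] == parse_traceparent(outgoing)["trace_id"]
```

In an E2E test, capture the traceparent sent to the entry service and assert same_trace against the header observed at each downstream boundary; any hop that starts a fresh trace fails the check.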

Default QA stance

  • Treat telemetry as part of acceptance criteria (especially for integration/E2E tests).

  • Require correlation: request_id + trace_id (traceparent) across boundaries.

  • Prefer SLO-based release gating and burn-rate alerting over raw infra thresholds.

  • Budget overhead: sampling, cardinality, retention, and cost are quality constraints.

  • Redact PII/secrets by default (logs and attributes).
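The correlation and redaction stances above can share one logging setup. A stdlib-only Python sketch; the field names (request_id, trace_id) and the redaction key list are illustrative choices, not a required schema:

```python
import json
import logging

REDACT_KEYS = {"password", "authorization", "token", "email"}  # extend per your PII policy

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line, with correlation IDs and redacted extras."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "msg": record.getMessage(),
            # Correlation fields; None makes a missing ID visible rather than silent.
            "request_id": getattr(record, "request_id", None),
            "trace_id": getattr(record, "trace_id", None),
        }
        for key, value in getattr(record, "fields", {}).items():
            payload[key] = "[REDACTED]" if key.lower() in REDACT_KEYS else value
        return json.dumps(payload)

logger = logging.getLogger("checkout")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("payment failed", extra={
    "request_id": "req-123",
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "fields": {"user": "u-42", "password": "hunter2"},  # password gets redacted
})
```

Because redaction lives in the formatter, it applies to every handler and cannot be skipped by an individual call site.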

Core workflows

  • Establish the minimum bar (logs + metrics + traces + correlation).

  • Instrument with OpenTelemetry (auto-instrument first, then add manual spans for key paths).

  • Verify context propagation across service boundaries (traceparent in/out).

  • Define SLIs/SLOs and error budget policy; wire burn-rate alerts.

  • Make failures diagnosable: capture a trace link + key logs on every test failure.

  • Profile and load test only after telemetry is reliable; validate against baselines.
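The failure-diagnosability workflow can be a tiny helper invoked from test teardown. A sketch; the Jaeger-style UI base URL and the artifact shape are illustrative assumptions, not a fixed contract:

```python
def failure_artifact(trace_id: str, log_lines: list[str],
                     trace_ui_base: str = "https://jaeger.example.internal") -> dict:
    """Bundle what a debugger needs from a failed test: a clickable trace
    link plus only the log lines correlated with the failing request."""
    return {
        "trace_id": trace_id,
        "trace_link": f"{trace_ui_base}/trace/{trace_id}",  # Jaeger-style URL; adjust per backend
        "logs": [line for line in log_lines if trace_id in line],
    }
```

A pytest or JUnit teardown hook can attach the returned dict to the test report, so every red test carries its trace link and correlated logs instead of a bare assertion message.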

Quick reference

| Task | Recommended default | Notes |
| --- | --- | --- |
| Tracing | OpenTelemetry + Jaeger/Tempo | Prefer OTLP exporters via Collector when possible |
| Metrics | Prometheus + Grafana | Use histograms for latency; watch cardinality |
| Logging | Structured JSON + correlation IDs | Never log secrets/PII; redact aggressively |
| Reliability gates | SLOs + error budgets + burn-rate alerts | Gate releases on sustained burn/regressions |
| Performance | Profiling + load tests + budgets | Add continuous profiling for intermittent issues |
| Zero-code visibility | eBPF (OpenTelemetry zero-code) + continuous profiling (Parca/Pyroscope) | Use when code changes are not feasible |
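The reliability-gate recommendation reduces to simple arithmetic: burn rate is the observed error rate divided by the error budget (1 − SLO target), and a multi-window policy pages only when a short and a long window both burn fast. The 14.4x/6x thresholds below are the example values from Google's SRE Workbook; treat them as starting points, not mandates:

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is burning; 1.0 means exactly on budget."""
    return error_rate / (1.0 - slo_target)

def should_page(err_5m: float, err_1h: float, err_30m: float, err_6h: float,
                slo_target: float = 0.999) -> bool:
    """Multi-window, multi-burn-rate alert: pairing a long window with a
    short one confirms the burn is both sustained and still happening."""
    fast = (burn_rate(err_1h, slo_target) >= 14.4
            and burn_rate(err_5m, slo_target) >= 14.4)
    slow = (burn_rate(err_6h, slo_target) >= 6.0
            and burn_rate(err_30m, slo_target) >= 6.0)
    return fast or slow
```

The same logic expressed in PromQL is what assets/monitoring/slo/prometheus-alert-rules.yaml covers; a sketch like this is handy in release-gate scripts that evaluate recorded error rates.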

Navigation

Open these guides when needed:

| If the user needs... | Read | Also use |
| --- | --- | --- |
| A minimal, production-ready baseline | references/core-observability-patterns.md | assets/checklists/template-observability-readiness-checklist.md |
| Node/Python instrumentation setup | references/opentelemetry-best-practices.md | assets/opentelemetry/nodejs/opentelemetry-nodejs-setup.md, assets/opentelemetry/python/opentelemetry-python-setup.md |
| Working trace propagation across services | references/distributed-tracing-patterns.md | assets/checklists/template-observability-readiness-checklist.md |
| SLOs, burn-rate alerts, and release gates | references/slo-design-guide.md | assets/monitoring/slo/slo-definition.yaml, assets/monitoring/slo/prometheus-alert-rules.yaml |
| Profiling/load testing with evidence | references/performance-profiling-guide.md | assets/load-testing/load-testing-k6.js, assets/load-testing/template-load-test-artillery.yaml |
| A maturity model and roadmap | references/observability-maturity-model.md | assets/checklists/template-observability-readiness-checklist.md |
| What to avoid and how to fix it | references/anti-patterns-best-practices.md | assets/checklists/template-observability-readiness-checklist.md |
| Alert design and fatigue reduction | references/alerting-strategies.md | assets/monitoring/slo/prometheus-alert-rules.yaml |
| Dashboard hierarchy and layout | references/dashboard-design-patterns.md | assets/monitoring/grafana/template-grafana-dashboard-observability.json |
| Structured logging and cost control | references/log-aggregation-patterns.md | assets/observability/template-logging-setup.md |

Implementation guides (deep dives):

  • references/core-observability-patterns.md

  • references/opentelemetry-best-practices.md

  • references/distributed-tracing-patterns.md

  • references/slo-design-guide.md

  • references/performance-profiling-guide.md

  • references/observability-maturity-model.md

  • references/anti-patterns-best-practices.md

  • references/alerting-strategies.md

  • references/dashboard-design-patterns.md

  • references/log-aggregation-patterns.md

Templates (copy/paste):

  • assets/checklists/template-observability-readiness-checklist.md

  • assets/opentelemetry/nodejs/opentelemetry-nodejs-setup.md

  • assets/opentelemetry/python/opentelemetry-python-setup.md

  • assets/monitoring/slo/slo-definition.yaml

  • assets/monitoring/slo/prometheus-alert-rules.yaml

  • assets/monitoring/grafana/grafana-dashboard-slo.json

  • assets/monitoring/grafana/template-grafana-dashboard-observability.json

  • assets/load-testing/load-testing-k6.js

  • assets/load-testing/template-load-test-artillery.yaml

  • assets/performance/frontend/template-lighthouse-ci.json

  • assets/performance/backend/template-nodejs-profiling-config.js

Curated sources:

  • data/sources.json

Scope boundaries (handoffs)

  • Pure infrastructure monitoring (Kubernetes, Docker, CI/CD): ../ops-devops-platform/SKILL.md

  • Database query optimization (SQL tuning, indexing): ../data-sql-optimization/SKILL.md

  • Application-level debugging (stack traces, breakpoints): ../qa-debugging/SKILL.md

  • Test strategy design (coverage, test pyramids): ../qa-testing-strategy/SKILL.md

  • Resilience patterns (retries, circuit breakers): ../qa-resilience/SKILL.md

  • Architecture decisions (microservices, event-driven): ../software-architecture-design/SKILL.md

Tool selection notes (2026)

  • Default to OpenTelemetry + OTLP + Collector where possible.

  • Prefer burn-rate alerting against SLOs over alerting on raw infra metrics.

  • Treat sampling, cardinality, and retention as part of quality (not an afterthought).

  • When asked to pick vendors/tools, start from data/sources.json and validate time-sensitive claims with current docs/releases if the environment allows it.
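The sampling note above has a common concrete form: deterministic head sampling keyed on the trace ID, so every service independently reaches the same keep/drop decision and traces stay complete. A stdlib-only sketch; production setups would normally use the OpenTelemetry SDK's TraceIdRatioBased sampler or tail sampling in the Collector, and the exact boundary arithmetic here is an assumption for illustration:

```python
def keep_trace(trace_id_hex: str, sample_rate: float) -> bool:
    """Deterministic head sampling on a 32-hex-char W3C trace ID: compare
    the low 64 bits against a threshold derived from the sample rate."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be in [0, 1]")
    threshold = int(sample_rate * (1 << 64))
    return int(trace_id_hex, 16) & ((1 << 64) - 1) < threshold
```

Because the decision is a pure function of the trace ID, a trace sampled at one hop is sampled at every hop that uses the same rate, keeping traces unbroken without coordination.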

Fact-Checking

  • Use web search/web fetch to verify current external facts, versions, pricing, deadlines, regulations, or platform behavior before final answers.

  • Prefer primary sources; report source links and dates for volatile information.

  • If web access is unavailable, state the limitation and mark guidance as unverified.
