monitoring-setup

Application monitoring and observability setup for Python/React projects. Use when configuring logging, metrics collection, health checks, alerting rules, or dashboard creation. Covers structured logging with structlog, Prometheus metrics for FastAPI, health check endpoints, alert threshold design, Grafana dashboard patterns, error tracking with Sentry, and uptime monitoring. Does NOT cover incident response procedures (use incident-response) or deployment (use deployment-pipeline).

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "monitoring-setup" with this command: npx skills add hieutrtr/ai1-skills/hieutrtr-ai1-skills-monitoring-setup

Monitoring Setup

When to Use

Activate this skill when:

  • Setting up structured logging for a Python/FastAPI application
  • Configuring Prometheus metrics collection and custom counters/histograms
  • Implementing health check endpoints (liveness and readiness)
  • Designing alert rules and thresholds for production services
  • Creating Grafana dashboards for service monitoring
  • Integrating Sentry for error tracking and performance monitoring
  • Implementing distributed tracing with OpenTelemetry
  • Reviewing or improving existing observability coverage

Output: Write observability configuration summary to monitoring-config.md documenting what was set up (metrics, alerts, dashboards, health checks).

Do NOT use this skill for:

  • Responding to active production incidents (use incident-response)
  • Deploying monitoring infrastructure (use deployment-pipeline)
  • Writing application business logic (use python-backend-expert)
  • Docker container configuration (use docker-best-practices)

Instructions

Four Pillars of Observability

Every production service must implement all four pillars.

┌─────────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY                                │
├────────────────┬───────────────┬──────────────┬────────────────┤
│    METRICS     │   LOGGING     │   TRACING    │   ALERTING     │
│                │               │              │                │
│  Prometheus    │  structlog    │ OpenTelemetry│  Alert rules   │
│  counters,     │  structured   │ distributed  │  thresholds,   │
│  histograms,   │  JSON logs,   │ trace spans, │  notification  │
│  gauges        │  context      │ correlation  │  channels      │
├────────────────┴───────────────┴──────────────┴────────────────┤
│                    DASHBOARDS (Grafana)                         │
│        Visualize metrics, logs, and traces in one place        │
└─────────────────────────────────────────────────────────────────┘

Pillar 1: Metrics (Prometheus)

Use the RED method for request-driven services and the USE method for infrastructure resources.

RED Method (for every API endpoint):

  • Rate -- Requests per second
  • Errors -- Failed requests per second
  • Duration -- Request latency distribution

USE Method (for infrastructure resources):

  • Utilization -- Percentage of resource used (CPU, memory, disk)
  • Saturation -- Work queued or waiting (connection pool, queue depth)
  • Errors -- Error events (OOM kills, connection failures)

Key metrics to instrument:

from prometheus_client import Counter, Histogram, Gauge, Info

# RED metrics
REQUEST_COUNT = Counter(
    "http_requests_total",
    "Total HTTP requests",
    labelnames=["method", "endpoint", "status_code"],
)

REQUEST_DURATION = Histogram(
    "http_request_duration_seconds",
    "HTTP request duration in seconds",
    labelnames=["method", "endpoint"],
    buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0],
)

# USE metrics
DB_POOL_USAGE = Gauge(
    "db_connection_pool_usage",
    "Database connection pool utilization",
    labelnames=["pool_name"],
)

DB_POOL_SIZE = Gauge(
    "db_connection_pool_size",
    "Database connection pool max size",
    labelnames=["pool_name"],
)

REDIS_CONNECTIONS = Gauge(
    "redis_active_connections",
    "Active Redis connections",
)

# Business metrics
ACTIVE_USERS = Gauge(
    "active_users_total",
    "Currently active users",
)

APP_INFO = Info(
    "app",
    "Application metadata",
)

FastAPI middleware for automatic metrics:

import time
from starlette.middleware.base import BaseHTTPMiddleware
from starlette.requests import Request

class PrometheusMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request: Request, call_next):
        method = request.method
        endpoint = request.url.path
        start_time = time.perf_counter()

        response = await call_next(request)

        duration = time.perf_counter() - start_time
        status_code = str(response.status_code)

        REQUEST_COUNT.labels(
            method=method, endpoint=endpoint, status_code=status_code
        ).inc()

        REQUEST_DURATION.labels(
            method=method, endpoint=endpoint
        ).observe(duration)

        return response

See references/metrics-config-template.py for the complete setup.

Pillar 2: Logging (structlog)

Use structured JSON logging with contextual information. Never use print() or unstructured logging in production.

Logging principles:

  1. Structured -- JSON format, machine-parseable
  2. Contextual -- Include request ID, user ID, trace ID in every log
  3. Leveled -- Use appropriate log levels (DEBUG, INFO, WARNING, ERROR, CRITICAL)
  4. Actionable -- Every WARNING/ERROR log should indicate what to investigate
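As a concrete illustration, a log event that satisfies all four principles might render as the following JSON line (all values hypothetical; the context fields match the request middleware shown below):

```json
{
  "timestamp": "2025-01-15T09:30:00.000Z",
  "level": "warning",
  "event": "payment_retry",
  "request_id": "7f3c9a2e-1b4d-4f6a-8c5e-2d9b0e1a3f7c",
  "method": "POST",
  "path": "/api/payments",
  "attempt": 2,
  "max_attempts": 3
}
```

It is machine-parseable, carries request context for correlation, uses the WARNING level for a handled retry, and tells the reader exactly what to investigate.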

Log levels and when to use them:

| Level | When to Use | Example |
|-------|-------------|---------|
| DEBUG | Detailed diagnostic info, disabled in production | Processing item 42 of 100 |
| INFO | Normal operations, significant events | User created, Payment processed |
| WARNING | Unexpected but handled situation | Retry attempt 2 of 3, Cache miss |
| ERROR | Operation failed, needs attention | Database query failed, External API timeout |
| CRITICAL | System-level failure, immediate action | Cannot connect to database, Out of memory |

structlog setup:

import structlog

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.stdlib.filter_by_level,
        structlog.stdlib.add_logger_name,
        structlog.stdlib.add_log_level,
        structlog.stdlib.PositionalArgumentsFormatter(),
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.UnicodeDecoder(),
        structlog.processors.JSONRenderer(),
    ],
    wrapper_class=structlog.stdlib.BoundLogger,
    context_class=dict,
    logger_factory=structlog.stdlib.LoggerFactory(),
    cache_logger_on_first_use=True,
)

Adding request context:

from starlette.middleware.base import BaseHTTPMiddleware
import structlog
import uuid

class LoggingContextMiddleware(BaseHTTPMiddleware):
    async def dispatch(self, request, call_next):
        request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
        structlog.contextvars.clear_contextvars()
        structlog.contextvars.bind_contextvars(
            request_id=request_id,
            method=request.method,
            path=request.url.path,
        )
        response = await call_next(request)
        response.headers["X-Request-ID"] = request_id
        return response
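The middleware relies on contextvars so that concurrent requests each see only their own bound values. That isolation property can be demonstrated with the stdlib alone (names are illustrative; no structlog required):

```python
import asyncio
import contextvars

request_id = contextvars.ContextVar("request_id")

async def handle(rid):
    # Bind this "request"'s ID in our task-local copy of the context.
    request_id.set(rid)
    await asyncio.sleep(0)   # yield control to the other request
    return request_id.get()  # unchanged: tasks do not share bindings

async def main():
    # gather() wraps each coroutine in a Task, and each Task runs in a copy
    # of the current context -- exactly what the middleware relies on.
    return await asyncio.gather(handle("req-a"), handle("req-b"))

results = asyncio.run(main())
print(results)  # ['req-a', 'req-b']
```

Because every task gets its own context copy, a `request_id` bound in one request can never leak into another request's log lines.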

See references/logging-config-template.py for the complete setup.

Pillar 3: Tracing (OpenTelemetry)

Distributed tracing connects logs and metrics across service boundaries.

Trace setup for FastAPI:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def setup_tracing(app, service_name: str = "backend"):
    resource = Resource.create({"service.name": service_name})
    provider = TracerProvider(resource=resource)

    exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
    provider.add_span_processor(BatchSpanProcessor(exporter))

    trace.set_tracer_provider(provider)

    # Auto-instrument FastAPI, SQLAlchemy, Redis
    FastAPIInstrumentor.instrument_app(app)
    SQLAlchemyInstrumentor().instrument()
    RedisInstrumentor().instrument()

Custom spans for business logic:

tracer = trace.get_tracer(__name__)

async def process_order(order_id: str):
    with tracer.start_as_current_span("process_order") as span:
        span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_order"):
            await validate_order(order_id)

        with tracer.start_as_current_span("charge_payment"):
            result = await charge_payment(order_id)
            span.set_attribute("payment.status", result.status)

        with tracer.start_as_current_span("send_confirmation"):
            await send_confirmation(order_id)

Pillar 4: Alerting

Alerts must be actionable. Every alert should indicate what is broken and what to do.

Alert design principles:

  1. Page only for user-impacting issues -- Do not page for non-urgent warnings
  2. Set thresholds based on SLOs -- Not arbitrary numbers
  3. Avoid alert fatigue -- If an alert fires often without action, fix or remove it
  4. Include runbook links -- Every alert should link to a remediation guide
  5. Use multi-window burn rates -- Detect issues faster without false positives
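Principle 5 can be made concrete. With an SLO of 99.9%, the burn rate is the observed error rate divided by the error budget (1 - SLO); the multi-window pattern (popularized by the Google SRE Workbook) pages only when both a long and a short window burn fast, catching real regressions while ignoring blips. A minimal sketch with illustrative thresholds:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'budget-neutral' we are burning error budget."""
    budget = 1.0 - slo            # e.g. 0.001 for a 99.9% SLO
    return error_rate / budget

def should_page(err_1h: float, err_5m: float, slo: float = 0.999,
                threshold: float = 14.4) -> bool:
    # Page only if BOTH windows burn fast: the 1h window filters out noise,
    # the 5m window confirms the problem is still happening right now.
    return burn_rate(err_1h, slo) > threshold and burn_rate(err_5m, slo) > threshold

# A sustained 2% error rate on a 99.9% SLO burns budget 20x too fast -> page.
print(should_page(err_1h=0.02, err_5m=0.02))    # True
# A spike that has already recovered does not page.
print(should_page(err_1h=0.02, err_5m=0.0005))  # False
```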

Alert thresholds for a typical FastAPI application:

| Alert | Condition | Severity | Action |
|-------|-----------|----------|--------|
| High error rate | `http_requests_total{status_code=~"5.."}` > 5% of total for 5 min | SEV2 | Check logs, consider rollback |
| High latency | `http_request_duration_seconds` p99 > 2s for 5 min | SEV3 | Check DB queries, dependencies |
| Service down | Health check fails for 2 min | SEV1 | Restart, check logs, escalate |
| DB connections high | Pool usage > 80% for 5 min | SEV3 | Check for connection leaks |
| DB connections critical | Pool usage > 95% for 2 min | SEV2 | Restart app, investigate |
| Memory high | Container memory > 85% for 10 min | SEV3 | Check for memory leaks |
| Disk space low | Disk usage > 85% | SEV3 | Clean logs, expand volume |
| Certificate expiry | SSL cert expires in < 14 days | SEV4 | Renew certificate |
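The high-error-rate row translates into a Prometheus rule roughly like the following sketch (the label name matches the `http_requests_total` metric defined earlier; the runbook path is a placeholder):

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: sev2
        annotations:
          summary: "5xx error rate above 5% for 5 minutes"
          runbook: "docs/runbooks/high-error-rate.md"
```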

See references/alert-rules-template.yml for Prometheus alerting rules.

Health Check Endpoints

Every service must expose two health endpoints.

Liveness (/health): Is the process running? Returns 200 if the application is alive.

Readiness (/health/ready): Can the service handle requests? Checks all dependencies.

from fastapi import APIRouter, Depends
from fastapi.responses import JSONResponse
from sqlalchemy.ext.asyncio import AsyncSession
from sqlalchemy import text
from datetime import datetime, timezone
import time

# settings, get_db, and the shared redis client come from the application.

router = APIRouter(tags=["health"])

@router.get("/health")
async def liveness():
    """Liveness probe -- is the process running?"""
    return {
        "status": "healthy",
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "version": settings.APP_VERSION,
    }

@router.get("/health/ready")
async def readiness(db: AsyncSession = Depends(get_db)):
    """Readiness probe -- can we handle traffic?"""
    checks = {}

    # Check database
    try:
        start = time.perf_counter()
        await db.execute(text("SELECT 1"))
        latency = (time.perf_counter() - start) * 1000
        checks["database"] = {"status": "ok", "latency_ms": round(latency, 2)}
    except Exception as e:
        checks["database"] = {"status": "error", "error": str(e)}

    # Check Redis
    try:
        start = time.perf_counter()
        await redis.ping()
        latency = (time.perf_counter() - start) * 1000
        checks["redis"] = {"status": "ok", "latency_ms": round(latency, 2)}
    except Exception as e:
        checks["redis"] = {"status": "error", "error": str(e)}

    all_ok = all(c["status"] == "ok" for c in checks.values())
    return JSONResponse(
        status_code=200 if all_ok else 503,
        content={
            "status": "ready" if all_ok else "not_ready",
            "checks": checks,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        },
    )

Error Tracking with Sentry

Sentry captures unhandled exceptions and performance data.

Setup:

import sentry_sdk
from sentry_sdk.integrations.fastapi import FastApiIntegration
from sentry_sdk.integrations.sqlalchemy import SqlalchemyIntegration

def filter_health_checks(event, hint):
    """Do not send health check noise to Sentry."""
    url = event.get("request", {}).get("url", "")
    if url.endswith(("/health", "/health/ready")):
        return None
    return event

sentry_sdk.init(
    dsn=settings.SENTRY_DSN,
    environment=settings.APP_ENV,
    release=settings.APP_VERSION,
    traces_sample_rate=0.1,  # 10% of requests for performance monitoring
    profiles_sample_rate=0.1,
    integrations=[
        FastApiIntegration(),
        SqlalchemyIntegration(),
    ],
    # Do not send PII
    send_default_pii=False,
    # Filter out health check noise (defined above, before init runs)
    before_send=filter_health_checks,
)

Dashboard Design

Grafana dashboards should follow a consistent layout pattern.

Standard dashboard sections:

  1. Overview row -- Key SLIs at a glance (error rate, latency, throughput)
  2. RED metrics row -- Rate, Errors, Duration for each endpoint
  3. Infrastructure row -- CPU, memory, disk, network
  4. Dependencies row -- Database, Redis, external API health
  5. Business metrics row -- Application-specific KPIs
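The RED row typically maps onto PromQL queries against the metrics instrumented earlier; roughly (panel titles illustrative):

```promql
# Request rate by endpoint
sum by (endpoint) (rate(http_requests_total[5m]))

# Error rate (% of requests returning 5xx)
100 * sum(rate(http_requests_total{status_code=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))

# p99 latency by endpoint
histogram_quantile(0.99,
  sum by (endpoint, le) (rate(http_request_duration_seconds_bucket[5m])))
```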

Dashboard layout:

┌─────────────────────────────────────────────────────────┐
│  Service Overview                                       │
│  [Error Rate %] [p99 Latency] [Requests/s] [Uptime]    │
├────────────────────────┬────────────────────────────────┤
│  Request Rate          │  Error Rate                    │
│  (by endpoint)         │  (by endpoint, status code)    │
├────────────────────────┼────────────────────────────────┤
│  Latency (p50/p95/p99) │  Active Connections            │
│  (by endpoint)         │  (DB pool, Redis)              │
├────────────────────────┴────────────────────────────────┤
│  Infrastructure                                         │
│  [CPU %] [Memory %] [Disk %] [Network IO]              │
├─────────────────────────────────────────────────────────┤
│  Dependencies                                           │
│  [DB Latency] [Redis Latency] [External API Status]    │
└─────────────────────────────────────────────────────────┘

See references/dashboard-template.json for a complete Grafana dashboard template.

Uptime Monitoring

External uptime monitoring validates the service from a user's perspective.

What to monitor externally:

  • /health endpoint from multiple geographic regions
  • Key user-facing pages (login, dashboard, API docs)
  • SSL certificate validity and expiration
  • DNS resolution time

Recommended check intervals:

| Check | Interval | Timeout | Regions |
|-------|----------|---------|---------|
| Health endpoint | 30 seconds | 10 seconds | 3+ regions |
| Key pages | 1 minute | 15 seconds | 2+ regions |
| SSL certificate | 6 hours | 30 seconds | 1 region |
| DNS resolution | 5 minutes | 5 seconds | 3+ regions |
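The SSL expiry check can be scripted with the stdlib alone: `ssl.getpeercert()` returns a `notAfter` string that parses into days remaining. A sketch (host and thresholds illustrative; the parsing helper is kept pure so it can be tested offline):

```python
import socket
import ssl
from datetime import datetime, timezone

CERT_DATE_FMT = "%b %d %H:%M:%S %Y %Z"   # format used by ssl.getpeercert()

def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter timestamp."""
    expires = datetime.strptime(not_after, CERT_DATE_FMT).replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400

def certificate_expiring(host, port=443, warn_days=14):
    """True if the cert expires within warn_days (matches the SEV4 alert above)."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=30) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_until_expiry(cert["notAfter"]) < warn_days
```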

Quick Reference

See references/ for complete templates: logging-config-template.py, metrics-config-template.py, alert-rules-template.yml, dashboard-template.json.

Monitoring Checklist for New Services

  • structlog configured with JSON output
  • Request logging middleware with request ID correlation
  • Prometheus metrics endpoint exposed at /metrics
  • RED metrics instrumented (request count, errors, duration)
  • Health check endpoints implemented (/health, /health/ready)
  • Sentry SDK initialized with environment and release tags
  • Alert rules defined for error rate, latency, and availability
  • Grafana dashboard created with standard sections
  • External uptime monitoring configured
  • Log retention policy defined (default: 30 days)

Output File

Write monitoring configuration summary to monitoring-config.md:

# Monitoring Configuration: [Service Name]

## Metrics

| Metric | Type | Labels | Purpose |
|--------|------|--------|---------|
| http_requests_total | Counter | method, endpoint, status | RED: Request rate |
| http_request_duration_seconds | Histogram | method, endpoint | RED: Latency |

## Alerts

| Alert | Condition | Severity | Runbook |
|-------|-----------|----------|---------|
| HighErrorRate | error_rate > 5% for 5m | SEV2 | docs/runbooks/high-error-rate.md |

## Health Checks

- `/health` — Liveness probe
- `/health/ready` — Readiness probe (checks DB, Redis)

## Dashboards

- Grafana: Service Overview (imported from references/dashboard-template.json)

## Next Steps

- Run `/deployment-pipeline` to deploy with monitoring enabled
- Run `/incident-response` if alerts fire

