Datadog Analysis

Authentication

IMPORTANT: Credentials are injected automatically by a proxy layer. Do NOT check for DATADOG_API_KEY or DATADOG_APP_KEY in environment variables - they won't be visible to you. Just run the scripts directly; authentication is handled transparently.

Configuration environment variables you CAN check (non-secret):

DATADOG_SITE
Datadog site (e.g., us5.datadoghq.com , datadoghq.eu )

MANDATORY: Statistics-First Investigation

NEVER dump raw logs. Always follow this pattern:

STATISTICS → SAMPLE → PATTERNS → CORRELATE

Statistics First - Know volume, error rate, and top patterns before sampling
Strategic Sampling - Choose the right strategy based on statistics
Pattern Extraction - Cluster similar errors to find root causes
Context Correlation - Investigate around anomaly timestamps

Available Scripts

All scripts are in .claude/skills/observability-datadog/scripts/

PRIMARY INVESTIGATION SCRIPTS

get_statistics.py - ALWAYS START HERE

Comprehensive statistics with pattern extraction.

python .claude/skills/observability-datadog/scripts/get_statistics.py [--service SERVICE] [--time-range MINUTES]

Examples:

python .claude/skills/observability-datadog/scripts/get_statistics.py --time-range 60 python .claude/skills/observability-datadog/scripts/get_statistics.py --service payment

Output includes:

Total count, error count, error rate percentage
Status distribution (info, warn, error)
Top services by log volume
Top error patterns (crucial for quick triage)
Actionable recommendation

sample_logs.py - Strategic Sampling

Choose the right sampling strategy based on statistics.

python .claude/skills/observability-datadog/scripts/sample_logs.py --strategy STRATEGY [--service SERVICE] [--limit N]

Strategies:

errors_only - Only error logs (default for incidents)

warnings_up - Warning and error logs

around_time - Logs around a specific timestamp

all - All log levels

Examples:

python .claude/skills/observability-datadog/scripts/sample_logs.py --strategy errors_only --service payment python .claude/skills/observability-datadog/scripts/sample_logs.py --strategy around_time --timestamp "2026-01-27T05:00:00Z" --window 5

Datadog Query Language (DQL)

Basic Filters

Service filter

service:payment

Status filter

status:error status:warn

Host filter

host:web-server-01

Combine with AND (space) or OR

service:payment status:error service:payment OR service:checkout

Facet Filters

Tag filter

env:production version:1.2.3

Attribute filter

@http.status_code:>=500 @duration:>1000

Wildcard

service:payment-*

Time Ranges

Relative

@timestamp:[now-1h TO now]

Absolute

@timestamp:[2026-01-27T00:00:00Z TO 2026-01-27T12:00:00Z]

Common Patterns

All errors in last hour

status:error

Errors for specific service

service:api-gateway status:error

Slow requests (>1s)

@duration:>1000000

HTTP 5xx errors

@http.status_code:>=500

Exceptions

exception OR error OR failed

Investigation Workflow

Standard Incident Investigation

┌─────────────────────────────────────────────────────────────┐ │ 1. STATISTICS FIRST (mandatory) │ │ python get_statistics.py --service <service> │ │ → Know volume, error rate, top patterns │ └─────────────────────────────────────────────────────────────┘ │ ▼ High Error Rate? ┌─────────────┴─────────────┐ │ │ YES (>5%) NO │ │ ▼ ▼ ┌─────────────────────────────┐ ┌───────────────────────────────────────────┐ │ 2. FAST PATH │ │ 2. TARGETED INVESTIGATION │ │ Sample errors directly │ │ Filter by specific criteria │ │ python sample_logs.py │ │ python sample_logs.py --strategy all │ │ --strategy errors_only │ │ → Look for anomalies │ └─────────────────────────────┘ └───────────────────────────────────────────┘

Quick Commands Reference

Goal Command

Start investigation get_statistics.py --service X

Sample errors only sample_logs.py --strategy errors_only --service X

Investigate spike sample_logs.py --strategy around_time --timestamp T

All logs sample_logs.py --strategy all --service X --limit 20

Metrics Query Syntax

Basic Structure

aggregation:metric_name{tag_filters}

Aggregations

avg: - Average across series sum: - Sum across series min: - Minimum value max: - Maximum value p50: - 50th percentile (APM) p95: - 95th percentile (APM) p99: - 99th percentile (APM)

Common Metrics

System

avg:system.cpu.user{service:X} avg:system.mem.used{service:X}

APM (traces)

sum:trace.http.request.hits{service:X}.as_rate() sum:trace.http.request.errors{service:X}.as_rate() p95:trace.http.request.duration{service:X}

Anti-Patterns to Avoid

❌ NEVER skip statistics - get_statistics.py is MANDATORY first step
❌ Unbounded queries - Always specify time ranges and limits
❌ Fetching all logs - Use sampling strategies, not unbounded searches
❌ Ignoring error rate - High error rate means immediate investigation
❌ Missing service filter - For multi-service apps, always filter by service

datadog-analysis

Safety Notice

Copy this and send it to your AI assistant to learn

Examples:

Strategies:

errors_only - Only error logs (default for incidents)

warnings_up - Warning and error logs

around_time - Logs around a specific timestamp

all - All log levels

Examples:

Service filter

Status filter

Host filter

Combine with AND (space) or OR

Tag filter

Attribute filter

Wildcard

Relative

Absolute

All errors in last hour

Errors for specific service

Slow requests (>1s)

HTTP 5xx errors

Exceptions

System

APM (traces)

Source Transparency

Related Skills

log-analysis

metrics-analysis

knowledge-base