Datadog Monitor Designer
Design Datadog monitors that page humans only when paging is justified. Acts as a senior SRE who has run a Datadog org with 4,000 monitors, watched 90% of pages get auto-resolved with no action, and rebuilt the monitor estate around SLOs and tier templates until the false-positive rate fell from 70% to 8%.
This skill builds and tunes monitors. It does not replace your incident response process, your SLO design process, or your service catalog — it consumes them. Outputs are concrete monitor JSON (importable via Datadog API or Terraform datadog_monitor resource), notification message templates, downtime windows, and a per-service monitor inventory tied to ownership tags.
Usage
Invoke when:
- Monitors page on every minor blip; on-call comp is rising
- A new T0 service is shipping and needs its monitor pack from day one
- An existing service has 80 monitors and only 6 ever fire
- SLOs were defined in a doc but never wired to alerts
- A postmortem revealed the right monitor existed but conditioned on the wrong tag set
- The Datadog bill jumped because of `avg by` cardinality on a custom metric used in monitors
- Monitor message bodies are unparseable; runbook links are missing or rotted
- A monitor fires across all environments because nobody scoped it to `env:prod`
Basic invocations:
- "Design a T0 monitor pack for our checkout service"
- "Convert our 99.9% latency SLO into burn-rate alerts"
- "Audit our 4,000 monitors and tell me which 80% are noise"
- "Write a notification template the on-call doesn't have to decode"
Inputs Required
- Datadog org + API key (or a monitor export via a `datadog_monitor` Terraform / API dump)
- Service catalog: name, tier (T0/T1/T2), team owner, runbook URL, dashboard URL
- SLOs in scope (objective, window, error budget) — Datadog SLO objects or external doc
- Routing targets: PagerDuty service per team, Slack channels, on-call schedule
- Existing tag taxonomy: `env`, `service`, `team`, `criticality`, `version`, `region`
- Recent incident list (last 90 days) — for monitor coverage analysis
- Custom metric inventory + cost (Datadog → Plan & Usage → Custom Metrics)
- Constraints: budget limit, custom metric cap, log indexing limits
Workflow
1. Inventory existing monitors. API: `GET /api/v1/monitor` (paginate). Tag each monitor with: type, service, tier, fired in last 90d?, ack rate, false-positive rate (acks resolved with "no action"), downstream routing. Anything that hasn't fired in 90 days and isn't tied to a documented invariant is a deletion candidate. (A minimal inventory sketch follows this list.)
2. Classify by tier. Cross-reference the service catalog. Every monitor should be tagged `tier:T0|T1|T2|T3` and `team:<owner>`. Monitors without a tier or owner go to a triage list.
3. Map SLOs to burn-rate alerts. For each SLO, generate the multi-window multi-burn-rate (MWMB) monitor pair: a fast-burn alert (high rate, short window) and a slow-burn alert (low rate, long window). See the SLO recipe section.
4. Apply tier templates. Each tier has a base monitor set: availability, latency, saturation, error rate, dependency health. Generate the tier template with placeholder substitution for service name and metric source.
5. Pick the right monitor type per signal. Threshold for known SLAs and SLOs, anomaly for diurnal/seasonal metrics, forecast for trend-based saturation (disk, certs, quota), outlier for fleets where one host misbehaves, composite for "A AND B for X minutes" without doubling pages.
6. Engineer notification messages. Every monitor message has the same eight required elements (see Notification Anatomy). Use Datadog template variables (`{{value}}`, `{{host.name}}`, `{{ #is_alert }}…{{ /is_alert }}`) for live data; static text for the runbook URL and severity.
7. Wire tag-driven routing. `@pagerduty-<service>` for P0/P1, `@slack-<channel>` for P2/P3. Routing lives in the message body, scoped by `{{ #is_alert }}` so resolved events don't re-page.
8. Set downtime windows. Deploy windows, maintenance windows, known-noisy windows. Use the Downtime API with scope filters (`env:prod service:checkout`); document the expected duration.
9. Configure no-data behavior. `notify_no_data: true` is correct for "this metric should always have data" (heartbeats, uptime). For sparse metrics, use `notify_no_data: false` plus a separate uptime monitor. Never default to `true` on every monitor — it pages on deploys.
10. Group by stable dimensions only. `group by: host` on auto-scaling fleets explodes alert count on scale-up. Group by `service`, `env`, `cluster`, `customer-tier`. Avoid `host` and `pod_name` in monitor groupings unless the monitor is host-specific.
11. Test the monitor. Dry-run with the Test Notifications button or the API (`POST /api/v1/monitor/{id}/notify`). Verify routing, message rendering, runbook link, severity. Fire-drill once per quarter via the Datadog `mute_status_handle` or by deliberately tripping the threshold in a synthetic.
12. Document the monitor. Each monitor has a `runbook_url` tag and the runbook link in the message. The runbook covers: what the monitor means, what to check first, who to escalate to, common causes, common false positives.
13. Schedule audits. Monthly: stale monitor pruning, false-positive review. Quarterly: tier template refresh, SLO re-baselining, cost review.
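A minimal sketch of the inventory step (item 1), assuming Python with `requests` and API/application keys exported as `DD_API_KEY` / `DD_APP_KEY`. It covers pagination and the tier/owner triage check only; the fired-in-90-days and ack-rate analysis come from the monitor event stream and are omitted here.

```python
import os
import requests

MONITOR_API = "https://api.datadoghq.com/api/v1/monitor"
HEADERS = {
    "DD-API-KEY": os.environ["DD_API_KEY"],
    "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"],
}

def fetch_all_monitors(page_size: int = 200) -> list[dict]:
    """Paginate through every monitor in the org."""
    monitors, page = [], 0
    while True:
        resp = requests.get(MONITOR_API, headers=HEADERS,
                            params={"page": page, "page_size": page_size})
        resp.raise_for_status()
        batch = resp.json()
        if not batch:
            return monitors
        monitors.extend(batch)
        page += 1

def triage_candidates(monitors: list[dict]) -> list[dict]:
    """Monitors missing a tier or team tag go straight to the triage list."""
    missing = []
    for m in monitors:
        keys = {t.split(":", 1)[0] for t in m.get("tags", []) if ":" in t}
        if not {"tier", "team"} <= keys:
            missing.append({"id": m["id"], "name": m["name"], "tags": m.get("tags", [])})
    return missing

if __name__ == "__main__":
    inventory = fetch_all_monitors()
    untriaged = triage_candidates(inventory)
    print(f"{len(inventory)} monitors total, {len(untriaged)} without tier/team tags")
```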
Monitor Type Decision Tree
Datadog offers more than a dozen monitor types; most teams use Threshold for everything. That's how you get 4,000 monitors that don't catch real issues. Pick the type that matches the signal shape.
What are you alerting on?
├── A static SLA / SLO threshold (latency < 500ms, error rate < 1%)
│ → Metric Threshold monitor
│
├── A trend that crosses a threshold over time (disk fills, quota exhaust, cert expiry)
│ → Forecast monitor (linear or seasonal forecast)
│
├── A metric with a strong daily/weekly seasonality (traffic, signups)
│ → Anomaly monitor (agile, robust, or basic algorithm)
│
├── One host/pod/instance behaving differently from its peers
│ → Outlier monitor (DBSCAN, MAD, or scaledZ)
│
├── A condition that requires multiple signals to all be true
│ → Composite monitor (AND of two metric monitors)
│
├── An event happening (deploy, security finding, audit log)
│ → Event monitor or Event-V2 monitor
│
├── A log pattern occurring at rate
│ → Log monitor (don't use Threshold on a log-based metric — Log monitor is cheaper)
│
├── An external endpoint being reachable
│ → Synthetic monitor (browser or API test)
│
└── A process / service running on a host
→ Process monitor (legacy) or Service Check monitor
Rules of thumb:
- Anomaly monitors are great for unexpected changes but terrible for known invariants. Use them on traffic, not error rate.
- Forecast monitors require >2 weeks of history; don't use on new metrics.
- Outlier monitors silently break when the fleet has <5 members. Set a min-host gate.
- Composite monitors don't multiply cost; they reduce alert count by AND'ing.
- Log monitors index all matched logs — they cost on log volume, not metric count.
Service Tier Templates
Each tier ships with a fixed monitor pack. A new T0 service goes from zero to fully covered in 15 minutes by importing the template.
T0 Critical (revenue path, auth, payments)
| Monitor | Type | Threshold | Window | Notify |
|---|---|---|---|---|
| Availability (HTTP 5xx rate) | Metric Threshold | >0.5% over 5m | 5m | P0 → PagerDuty (urgent) |
| p99 Latency | Metric Threshold | >1.5x SLO over 10m | 10m | P1 → PagerDuty (high) |
| Error budget burn (fast) | SLO burn-rate | 14.4x burn over 1h | 1h | P0 → PagerDuty (urgent) |
| Error budget burn (slow) | SLO burn-rate | 6x burn over 6h | 6h | P1 → PagerDuty (high) |
| Saturation (CPU/mem) | Forecast | >85% in 24h | 24h forecast | P2 → Slack |
| Dependency health | Composite | upstream availability < 99% AND request rate > 100/s | 5m | P2 → Slack |
| Deploy regression | Anomaly | error rate +3σ post-deploy | 30m | P1 → PagerDuty (high) |
| No-data (heartbeat) | Metric Threshold (notify_no_data) | no data for 5m | 5m | P1 → PagerDuty (high) |
| Cost anomaly (AWS bill tag) | Anomaly | +2σ on 7d window | 24h | P3 → Slack digest |
T1 Important (dashboards, search, internal APIs)
| Monitor | Type | Threshold | Window | Notify |
|---|---|---|---|---|
| Availability | Metric Threshold | >2% over 10m | 10m | P2 → Slack live |
| p95 Latency | Metric Threshold | >2x baseline over 15m | 15m | P3 → Slack digest |
| Error budget burn | SLO burn-rate | 6x burn over 6h | 6h | P2 → Slack live |
| Saturation | Forecast | >90% in 48h | 48h forecast | P3 → Slack digest |
| Deploy regression | Anomaly | error rate +3σ post-deploy | 1h | P2 → Slack live |
| No-data | Metric Threshold | no data for 15m | 15m | P3 → Slack digest |
T2 Best-effort (internal tools, batch jobs, marketing)
| Monitor | Type | Threshold | Window | Notify |
|---|---|---|---|---|
| Availability | Metric Threshold | >5% over 30m | 30m | P3 → Slack digest |
| Job failure (batch) | Event monitor | failed job event | event | P3 → Slack digest |
| Cron heartbeat | Synthetic / heartbeat | missed schedule by 2x interval | 2x | P3 → Slack digest |
T3 (experiments, prototypes) get no monitors. If they need monitoring they're not T3.
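To make the template concrete, here is a sketch of the T0 availability row rendered for one service and created through the monitors API. The metric names (`trace.http.request.*`), runbook URL, and PagerDuty handle are placeholders; substitute the service's real telemetry and routing targets.

```python
import os
import requests

def t0_availability_monitor(service: str, team: str) -> dict:
    """T0 availability row: HTTP 5xx rate > 0.5% over 5m, paged as P0."""
    return {
        "name": f"[T0][{service}] HTTP 5xx rate above 0.5%",
        "type": "metric alert",
        "query": (
            f"sum(last_5m):sum:trace.http.request.errors{{service:{service},env:prod}}.as_count() / "
            f"sum:trace.http.request.hits{{service:{service},env:prod}}.as_count() * 100 > 0.5"
        ),
        "message": (
            f"🚨 [P0] {service} availability below SLO\n"
            f"Runbook: https://runbooks.example.com/{service}/availability\n"
            f"{{{{ #is_alert }}}} @pagerduty-{service} {{{{ /is_alert }}}}"
        ),
        "tags": [f"service:{service}", f"team:{team}", "tier:T0", "env:prod"],
        "priority": 1,
        "options": {
            "thresholds": {"critical": 0.5},
            "notify_no_data": False,   # heartbeat coverage comes from a separate monitor
            "evaluation_delay": 60,    # tolerate late-arriving metrics
        },
    }

resp = requests.post(
    "https://api.datadoghq.com/api/v1/monitor",
    headers={"DD-API-KEY": os.environ["DD_API_KEY"],
             "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"]},
    json=t0_availability_monitor("checkout", "payments"),
)
resp.raise_for_status()
```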
SLO-Driven Monitoring Recipe (Multi-Window Multi-Burn-Rate)
The SRE workbook standard. Every SLO turns into two paired alerts: a fast-burn alert that catches acute outages within minutes and a slow-burn alert that catches gradual budget erosion over hours. Single-window burn-rate alerts either page too late or page on noise.
Definitions (assuming 30-day SLO window, 99.9% objective, error budget = 0.1% = 43.2 minutes):
fast_burn_rate = 14.4 # exhaust full budget in (30d / 14.4) = 50 hours
slow_burn_rate = 6 # exhaust in 30d / 6 = 5 days
fast_window = 1h # short evaluation, fast detection
slow_window = 6h # longer evaluation, fewer false positives
The two monitors per SLO:
# Fast-burn (page urgently)
type: slo alert
slo: <slo_id>
threshold:
- critical: 14.4
- warning: 6
threshold_windows:
- critical: 1h
- warning: 5m # short-window check to confirm
notify: "@pagerduty-{{service}}"
message: "🔥 Fast burn: {{value}}x error budget consumption in last 1h"
# Slow-burn (page during business hours)
type: slo alert
slo: <slo_id>
threshold:
- critical: 6
- warning: 3
threshold_windows:
- critical: 6h
- warning: 30m
notify: "@slack-{{team}}"
message: "🐌 Slow burn: {{value}}x error budget consumption in last 6h"
Burn-rate cheatsheet:
| Burn rate | Time to exhaust | Use as |
|---|---|---|
| 1x | 30 days (full window) | baseline (no alert) |
| 2x | 15 days | tracking, no alert |
| 6x | 5 days | slow-burn alert (P2) |
| 14.4x | 50 hours | fast-burn alert (P1) |
| 36x | 20 hours | critical fast-burn (P0) |
For availability SLOs, the good_events / total_events formulation maps directly to Datadog SLOs. For latency SLOs, define good = requests served under the latency threshold and total = all requests. Datadog supports both via SLO objects.
Notification Message Anatomy
A pageable Datadog notification has eight required elements. Anything missing is a footgun for the on-call. Use this template literally.
# 1. SEVERITY + ONE-LINE TITLE
🚨 [P1] checkout-svc p99 latency above SLO
# 2. CURRENT STATE (template variable)
Current value: {{value}}ms (threshold: 1500ms)
Affected scope: {{scope.name}} env:{{env.name}} region:{{region.name}}
# 3. TIME WINDOW
Triggered at {{last_triggered_at}}
Evaluation window: last 10 minutes
# 4. RUNBOOK LINK (HARD REQUIREMENT)
Runbook: https://runbooks.example.com/checkout/p99-latency
Dashboard: https://app.datadoghq.com/dashboard/abc-checkout
# 5. RECENT CHANGES (auto-injected via Datadog Events overlay)
{{ #is_alert }}
Last deploy: {{ event.deploy.version }} at {{ event.deploy.timestamp }}
{{ /is_alert }}
# 6. WHO TO PAGE (routing)
{{ #is_alert }}
@pagerduty-checkout @slack-checkout-oncall
{{ /is_alert }}
{{ #is_recovery }}
@slack-checkout-oncall ← only post recovery to Slack, NOT PagerDuty
{{ /is_recovery }}
# 7. WHAT TO CHECK FIRST (3 bullets max)
- Is upstream payment-svc healthy? (check service map)
- Did a deploy land in the last 30 min? (releases dashboard)
- Are we in a known traffic spike? (traffic dashboard)
# 8. ESCALATION
If unresolved after 15 min, escalate to @platform-team.
Owner: {{tag.team}} | Tier: {{tag.tier}}
Datadog-specific tips:
- Use `{{ #is_alert }}` blocks for alert-only content. Recovery messages should only post to Slack, never re-page PD.
- `{{ #is_no_data }}` is a separate state — the message should differ from `is_alert`.
- `{{ scope.name }}` resolves to the grouping; for multi-alerts it's the failing group, for simple alerts it's the whole monitor scope.
- Embed runbook + dashboard as hard URLs, not Datadog-internal links — the on-call may be on a phone without Datadog SSO.
Tag Strategy
Tags drive routing, downtime, dashboard filters, and cost attribution. A bad tag taxonomy makes the entire monitor estate brittle. Lock these five tags before designing any monitor.
| Tag | Required | Values | Purpose |
|---|---|---|---|
| `env` | Yes | prod, staging, dev, canary | Scope monitors; never alert on non-prod by default |
| `service` | Yes | checkout, auth, search, ... | Routing, ownership, service map |
| `team` | Yes | payments, platform, data, ... | Routing, on-call mapping |
| `tier` | Yes | T0, T1, T2, T3 | Template inheritance, severity defaults |
| `version` | Yes | git sha or semver | Deploy regression detection, downtime per release |
| `region` | If multi-region | us-east-1, eu-west-1, ... | Scope, regional outage isolation |
| `customer_tier` | If B2B | enterprise, pro, free | Prioritise enterprise-impacting alerts |
Hard rules:
- Every monitor MUST scope to `env:prod` (or an explicit env) — no environment-blind monitors.
- Every monitor MUST have `service` and `team` tags so routing works.
- Don't tag with anything that has unbounded cardinality (`user_id`, `request_id`, `session_id`) — they don't help monitors and they explode metric cost.
- Tag values are case-sensitive in Datadog: `Service:Checkout` ≠ `service:checkout`. Lowercase always.
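A small lint sketch over a monitor export (the list of dicts returned by the inventory script in the Workflow section) that flags environment-blind monitors, missing routing tags, and mixed-case tag values. The required tag keys are an assumption; adapt them to your taxonomy.

```python
REQUIRED_TAG_KEYS = {"env", "service", "team", "tier"}

def lint_monitor(monitor: dict) -> list[str]:
    """Return the list of tag-strategy violations for one monitor."""
    problems = []
    if "env:" not in monitor.get("query", ""):
        problems.append("query has no env scope")
    keys = {t.split(":", 1)[0] for t in monitor.get("tags", []) if ":" in t}
    missing = REQUIRED_TAG_KEYS - keys
    if missing:
        problems.append(f"missing tags: {sorted(missing)}")
    if any(t != t.lower() for t in monitor.get("tags", [])):
        problems.append("mixed-case tag values")
    return problems
```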
Composite Monitor Patterns
Composite monitors (monitor_a && monitor_b) reduce false positives by requiring multiple signals. They're underused because most teams default to standalone thresholds.
Pattern 1 — Error rate AND traffic floor. A 100% error rate on 3 requests/min is noise; on 10k requests/min it's an outage.
composite: error_rate > 5% AND request_rate > 100/sec for 5min
Pattern 2 — Latency AND deploy correlation. p99 latency rising AND a recent deploy = regression. Either alone is normal traffic variance.
composite: p99_latency > SLO * 1.5 AND deploys_in_last_30m > 0
Pattern 3 — Multi-region quorum. Alert only when 2 of 3 regions are degraded.
composite: us_east_errors > 1% AND eu_west_errors > 1%
Pattern 4 — Saturation AND queue depth. CPU high alone is fine if work is getting done; CPU high AND queue growing = real problem.
composite: cpu_avg > 80% AND queue_depth > queue_depth_baseline * 2
Pattern 5 — Dependency degraded AND own service degraded. Don't alert when the upstream is down (alert the upstream's owners); only alert when the upstream's degradation impacts you.
composite: upstream_error_rate > 1% AND own_p99 > SLO
Mute the children. When a composite owns the signal, mute the underlying monitors so they don't double-page. Use Datadog's mute_status_handle or set the child monitor priority to "tracked, no notification."
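Pattern 1 as a concrete composite payload (sketch). The numeric IDs are placeholders for the two existing child monitors, an error-rate threshold and a request-rate floor, both already created and kept from paging on their own.

```python
composite_error_and_traffic = {
    "name": "[checkout] error rate > 5% AND request rate > 100/s",
    "type": "composite",
    # A composite query is simply a boolean expression over existing monitor IDs.
    "query": "111111 && 222222",
    "message": "Real error spike under real load.\n"
               "{{ #is_alert }} @pagerduty-checkout {{ /is_alert }}",
    "tags": ["service:checkout", "team:payments", "tier:T0", "env:prod"],
    "priority": 1,
}
```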
Downtime, Maintenance, and Mute Patterns
Pages during planned maintenance burn trust. Schedule downtime — don't tell humans to ignore alerts.
Recurring downtime for deploy windows:
type: downtime
scope: "env:prod service:checkout deploy:active"
recurrence: rrule "FREQ=WEEKLY;BYDAY=TU;BYHOUR=14"
duration: 30m
message: "Tuesday deploy window — alerts suppressed"
One-shot downtime for DB upgrade:
scope: "env:prod database:rds-prod-001"
start: 2026-05-10T02:00:00Z
end: 2026-05-10T04:00:00Z
mute_first_recovery_notification: true # don't celebrate recovery while still in maintenance
Smart mute for canary deploys: mute the canary's tag for 30 minutes after deploy, but keep the rest of prod alerting normally.
scope: "env:prod canary:true"
duration: 30m
trigger: post-deploy webhook
Mute on incident: during an active P0 incident, mute downstream child alerts so the war room isn't flooded with secondary effects.
scope: "incident:INC-1234" # via tag pushed to all downstream services
auto-unmute: 30 min after incident.resolved=true
Anti-pattern: human-applied mutes. "I'll mute it for an hour" gets forgotten. Always set an explicit end time.
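A sketch of the recurring deploy-window downtime above, created through the v1 Downtimes API (a newer v2 endpoint exists; field names here follow v1). The scope and schedule are illustrative.

```python
import os
import time
import requests

deploy_window_downtime = {
    "scope": ["env:prod", "service:checkout"],
    "message": "Tuesday deploy window - alerts suppressed",
    "start": int(time.time()),           # first occurrence; recurrence repeats it
    "end": int(time.time()) + 30 * 60,   # 30-minute window, explicit end time
    "timezone": "UTC",
    "recurrence": {"type": "weeks", "period": 1, "week_days": ["Tue"]},
}

resp = requests.post(
    "https://api.datadoghq.com/api/v1/downtime",
    headers={"DD-API-KEY": os.environ["DD_API_KEY"],
             "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"]},
    json=deploy_window_downtime,
)
resp.raise_for_status()
```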
No-Data, Sparse-Data, and Heartbeat Monitors
Datadog's notify_no_data and no_data_timeframe are the most-misconfigured options.
Decision matrix:
| Signal | notify_no_data | no_data_timeframe | Notes |
|---|---|---|---|
| Heartbeat (cron, scheduled job) | true | 2x interval | "Job runs every 1h" → no_data_timeframe = 2h |
| Always-on service traffic | true | 10m | Genuine "no traffic" is an outage |
| Sparse error metric | false | n/a | No errors = no data ≠ alert |
| User-action metric (signups) | false | n/a | Quiet hours are normal |
| Saturation metric | true | 30m | Agent died = no data = blind |
| Synthetic check | true | 5m | Synthetic failure = no test = page |
Heartbeat pattern (the right way to monitor a cron):
metric: my.cron.heartbeat (statsd counter, sent at end of each run)
monitor: max(last_2h):default(my.cron.heartbeat{job:nightly-export}, 0) < 1
notify_no_data: true
no_data_timeframe: 120
The default(metric, 0) trick converts "no data" into "value 0" so the threshold fires reliably. More robust than notify_no_data alone, which has subtle behaviour around evaluation windows.
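The heartbeat pattern as a full monitor payload (sketch). The query string mirrors the pseudo-query above; the runbook URL and Slack handle are placeholders.

```python
cron_heartbeat_monitor = {
    "name": "[nightly-export] cron heartbeat missing",
    "type": "metric alert",
    "query": "max(last_2h):default(my.cron.heartbeat{job:nightly-export}, 0) < 1",
    "message": "Nightly export has not reported a heartbeat in 2h.\n"
               "Runbook: https://runbooks.example.com/batch/nightly-export\n"
               "{{ #is_alert }} @slack-data-oncall {{ /is_alert }}",
    "tags": ["service:nightly-export", "team:data", "tier:T2", "env:prod"],
    "options": {
        "thresholds": {"critical": 1},
        "notify_no_data": True,    # belt and braces on top of default(..., 0)
        "no_data_timeframe": 120,  # minutes, i.e. 2x the expected interval
    },
}
```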
Cost Levers (Custom Metrics, Indexed Logs, Ingested Traces)
Datadog billing has three big levers. Monitors interact with all of them.
Custom metrics: billed per unique timeseries (metric name × tag-value combination) per month. A monitor query grouped by `(service, customer_id)` with 10,000 customers creates 10,000 timeseries per metric. Stop tagging by anything with user-level cardinality. Use `service`, `env`, `region` only.
Indexed logs: billed per log indexed (not just ingested). Log monitors index matched logs. A pattern like * indexes every log = bankruptcy. Always scope log monitors to a specific service + level + pattern.
Ingested traces: billed per million spans. Trace-based monitors (APM monitors) consume ingestion budget. Use head-based sampling to drop non-error traces; tail-based for nuanced sampling on errors.
The 80/20 cost audit:
- Plan & Usage → Custom Metrics — top 10 metrics by timeseries count
- For each: which monitor uses it? Drop unused, refactor high-cardinality
- Plan & Usage → Logs — top 10 indexes by volume; cut indexing rules to scope
- Plan & Usage → APM — drop service catalog entries for retired services
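The first audit step can be scripted against the usage metering API. A sketch assuming the v1 `top_avg_metrics` endpoint and its `avg_metric_hour` field; verify the endpoint and fields against your org's API before relying on the numbers.

```python
import os
import requests

resp = requests.get(
    "https://api.datadoghq.com/api/v1/usage/top_avg_metrics",
    headers={"DD-API-KEY": os.environ["DD_API_KEY"],
             "DD-APPLICATION-KEY": os.environ["DD_APP_KEY"]},
    params={"month": "2026-05"},   # YYYY-MM of the month to audit
)
resp.raise_for_status()

top = sorted(resp.json().get("usage", []),
             key=lambda m: m.get("avg_metric_hour", 0), reverse=True)[:10]
for m in top:
    # Cross-reference each name against the monitor inventory export.
    print(f'{m["metric_name"]}: avg {m["avg_metric_hour"]} timeseries/hour')
```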
Anti-patterns
- Threshold monitor on every metric. Most metrics need anomaly or forecast. Threshold pages on every diurnal pattern.
- `group_by: host` on autoscaling fleets. Scale-up doubles your monitor count overnight; scale-down breaks no-data evaluation. Group by service, not host.
- `notify_no_data: true` everywhere. Pages on every deploy, network blip, agent restart. Use only for true heartbeats with a separate uptime monitor.
- No `env:prod` scope. Monitor fires on dev's broken laptop. Always scope to env.
- Same monitor across all environments. Prod thresholds are not staging thresholds. Clone with template variables.
- Single-window burn-rate alert. Either pages too late (long window) or on noise (short window). Always use multi-window multi-burn-rate.
- Custom metric in monitor with high tag cardinality. Each unique tag combo is a custom metric — Datadog charges per timeseries. `user_id` in a monitor scope = bankruptcy.
- Notification message with no runbook link. On-call has no idea what to do. Runbook URL is mandatory.
- Auto-resolving without a recovery message. People don't trust resolutions. Send recovery to Slack with a `{{ #is_recovery }}` block.
- Composite monitor of three child monitors that also alert independently. Both children fire AND the composite fires — triple page. Mute children when the composite owns the signal.
- No downtime windows for known maintenance. PagerDuty fires during your scheduled DB upgrade. Schedule downtime via API, not "I'll mute it manually."
- Monitor sprawl with no owner tag. When the team disbands, monitors live forever, alerting nobody. Owner tag is mandatory; orphans get auto-deleted after 30 days.
- Anomaly monitors on error rate. Errors are not seasonal; anomaly detects "different from baseline," which is what threshold already does. Use anomaly only on traffic-shaped metrics.
- Forecast monitor on a 3-day-old metric. Forecast needs >2 weeks of history. New metrics produce wild forecasts.
- Outlier monitor on fleets of <5. With 4 hosts, "outlier" means "1 of 4," which is just "one host has a problem." Use threshold instead.
- Same alert message across 80 monitors. Engineers stop reading. Each monitor's message is unique to that signal, with the right runbook link.
- PagerDuty integration key shared across services. All alerts route to the same PD service; routing inside PD becomes guesswork. One Datadog `@pagerduty-X` per service.
- `avg over` instead of `min over` for SLO checks. Avg masks brief spikes. SLO breaches are about thresholds being crossed, not averages — use `min`/`max` for breach detection.
- Editing monitors in the UI without Terraform sync. Drift between code and reality. All monitor edits via the Terraform `datadog_monitor` resource or via the API with an audit log.
- Forgetting `evaluation_delay`. When metrics arrive 30s late (common with batch ingestion), an alert evaluating the last minute sees zero data. Set `evaluation_delay: 60` to compensate.
- Treating monitor count as a KPI. "We have 4,000 monitors" is bad, not good. The right number is "every signal that matters has exactly one monitor."
Exit Criteria
- Every monitor in the org tagged with `env`, `service`, `team`, `tier`, `runbook_url`
- Tier templates (T0/T1/T2) imported and applied to every service in the catalog
- Every SLO has a paired multi-window multi-burn-rate alert (fast + slow)
- Notification messages follow the eight-element template; runbook + dashboard URLs present
- Routing matches severity: P0/P1 → PagerDuty, P2/P3 → Slack live/digest
- Downtime windows scheduled for all known maintenance and deploy windows
- No-data behavior set per monitor based on signal sparsity (no global default)
- All `group_by` clauses use stable dimensions; no `host`/`pod_name` groupings on autoscaling fleets
- Custom metric usage audited; high-cardinality offenders refactored or dropped
- Monitor inventory diff: ≥40% reduction in count, ≥70% reduction in noisy pages, no measurable miss in incident detection (verified against last 90 days)
- Monthly recurring audit scheduled with a named owner per service
- Quarterly tier template refresh cadence in calendar
- New-service onboarding doc references the tier template import command (Terraform / API)
- All monitors have a corresponding entry in the service catalog dashboard for visibility