# GCP Observability Best Practices

## Structured Logging

### JSON Log Format

Use structured JSON logging for better queryability:

```json
{
  "severity": "ERROR",
  "message": "Payment failed",
  "httpRequest": {
    "requestMethod": "POST",
    "requestUrl": "/api/payment"
  },
  "labels": {
    "user_id": "123",
    "transaction_id": "abc"
  },
  "timestamp": "2025-01-15T10:30:00Z"
}
```
### Severity Levels

Use appropriate severity for filtering; a mapping from Python log levels follows the list:

- DEBUG: Detailed diagnostic info
- INFO: Normal operations, milestones
- NOTICE: Normal but significant events
- WARNING: Potential issues, degraded performance
- ERROR: Failures that don't stop the service
- CRITICAL: Failures requiring immediate action
- ALERT: A person must take action immediately
- EMERGENCY: System is unusable
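Python's standard `logging` module has no NOTICE, ALERT, or EMERGENCY levels, so services logging through it need a convention for the levels it does have. A sketch of one such mapping (the helper name and the fallback behavior are choices, not a GCP requirement):

```python
import logging

# Python log level -> Cloud Logging severity string.
# NOTICE, ALERT, and EMERGENCY have no standard-library equivalent;
# set them explicitly in the JSON payload when needed.
SEVERITY_BY_LEVEL = {
    logging.DEBUG: "DEBUG",
    logging.INFO: "INFO",
    logging.WARNING: "WARNING",
    logging.ERROR: "ERROR",
    logging.CRITICAL: "CRITICAL",
}

def to_severity(levelno: int) -> str:
    """Return the Cloud Logging severity for a numeric Python log level."""
    # Round down to the nearest mapped level, e.g. a custom level 35 -> WARNING.
    mapped = [lvl for lvl in SEVERITY_BY_LEVEL if lvl <= levelno]
    return SEVERITY_BY_LEVEL[max(mapped)] if mapped else "DEFAULT"

assert to_severity(logging.ERROR) == "ERROR"
assert to_severity(35) == "WARNING"
```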
## Log Filtering Queries

### Common Filters

By severity:

```
severity >= WARNING
```

By resource:

```
resource.type="cloud_run_revision" AND resource.labels.service_name="my-service"
```

By time:

```
timestamp >= "2025-01-15T00:00:00Z"
```

By text content:

```
textPayload =~ "error.*timeout"
```

By JSON field:

```
jsonPayload.user_id = "123"
```

Combined:

```
severity >= ERROR AND resource.labels.service_name="api"
```
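The same filters work from the command line with the gcloud CLI, assuming it is authenticated against the right project:

```sh
gcloud logging read \
    'severity>=ERROR AND resource.labels.service_name="api"' \
    --limit=20 --format=json
```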
### Advanced Queries

Regex matching:

```
textPayload =~ "status=[45][0-9]{2}"
```

Substring search:

```
textPayload : "connection refused"
```

Multiple values:

```
severity = (ERROR OR CRITICAL)
```
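Filters can also be issued programmatically. A sketch using the `google-cloud-logging` Python client, assuming application-default credentials are configured:

```python
from google.cloud import logging

client = logging.Client()

# list_entries accepts the same filter strings as the Logs Explorer.
entries = client.list_entries(
    filter_='severity>=ERROR AND textPayload:"connection refused"',
    max_results=10,
)
for entry in entries:
    print(entry.timestamp, entry.payload)
```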
## Metrics vs Logs vs Traces

### When to Use Each

Metrics: Aggregated numeric data over time

- Request counts, latency percentiles
- Resource utilization (CPU, memory)
- Business KPIs (orders/minute)

Logs: Detailed event records

- Error details and stack traces
- Audit trails
- Debugging specific requests

Traces: Request flow across services

- Latency breakdown by service
- Identifying bottlenecks
- Distributed system debugging
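Log-based metrics bridge the first two categories: they count (or extract values from) log entries matching a filter and expose the result as a Cloud Monitoring metric. A sketch with the gcloud CLI (the metric name `api_errors` is illustrative):

```sh
gcloud logging metrics create api_errors \
    --description="Count of ERROR-and-above entries from the api service" \
    --log-filter='severity>=ERROR AND resource.labels.service_name="api"'
```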
## Alert Policy Design

### Alert Best Practices

- Avoid alert fatigue: Only alert on actionable issues
- Use multi-condition alerts: Reduce noise from transient spikes
- Set appropriate windows: 5-15 min for most metrics
- Include runbook links: Help responders act quickly
### Common Alert Patterns

Error rate:

- Condition: Error rate > 1% for 5 minutes
- Good for: Service health monitoring

Latency:

- Condition: P99 latency > 2s for 10 minutes
- Good for: Performance degradation detection

Resource exhaustion:

- Condition: Memory > 90% for 5 minutes
- Good for: Capacity planning triggers
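As a concrete sketch, the latency pattern above expressed as a Cloud Monitoring AlertPolicy in JSON. Field names follow the v3 API; the metric type assumes a Cloud Run service (latencies are in milliseconds) and will differ for other resources:

```json
{
  "displayName": "P99 latency > 2s",
  "combiner": "OR",
  "conditions": [
    {
      "displayName": "Request latency",
      "conditionThreshold": {
        "filter": "resource.type = \"cloud_run_revision\" AND metric.type = \"run.googleapis.com/request_latencies\"",
        "aggregations": [
          {
            "alignmentPeriod": "60s",
            "perSeriesAligner": "ALIGN_PERCENTILE_99"
          }
        ],
        "comparison": "COMPARISON_GT",
        "thresholdValue": 2000,
        "duration": "600s"
      }
    }
  ]
}
```

A policy file like this can be applied with `gcloud alpha monitoring policies create --policy-from-file=policy.json`.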
## Cost Optimization

### Reducing Log Costs

- Exclusion filters: Drop verbose logs at ingestion
- Sampling: Log only a percentage of high-volume events
- Shorter retention: Reduce the default 30-day retention
- Downgrade logs: Route to cheaper storage buckets
### Exclusion Filter Examples

Exclude health checks (substring match, since `requestUrl` contains the full URL):

```
resource.type="cloud_run_revision" AND httpRequest.requestUrl:"/health"
```

Exclude debug logs in production:

```
severity = DEBUG
```
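Exclusions attach to a log sink; a sketch against the default sink with the gcloud CLI, including a sampled variant that uses the query language's built-in `sample` function (the exclusion names are illustrative, and the quoting may need adjusting for your shell):

```sh
# Drop health-check entries entirely.
gcloud logging sinks update _Default \
    --add-exclusion=name=drop-health-checks,filter='httpRequest.requestUrl:"/health"'

# Drop ~90% of DEBUG entries, keeping a 10% sample.
gcloud logging sinks update _Default \
    --add-exclusion=name=sample-debug,filter='severity=DEBUG AND sample(insertId, 0.9)'
```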
## Debugging Workflow

1. Start with metrics: Identify when issues started
2. Correlate with logs: Filter logs around the problem time (example below)
3. Use traces: Follow specific requests across services
4. Check resource logs: Look for infrastructure issues
5. Compare baselines: Check against known-good periods
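For step 2, a time-bounded filter keeps the result set focused on the incident window (the timestamps and service name here are placeholders):

```sh
gcloud logging read '
    severity>=ERROR
    AND resource.labels.service_name="api"
    AND timestamp>="2025-01-15T10:00:00Z"
    AND timestamp<="2025-01-15T11:00:00Z"
' --order=asc --format=json
```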