Google Cloud Production Rules
Cost Traps
- Stopped Compute Engine VMs still pay for persistent disks and static IPs — delete disks or use snapshots for long-term storage
- Cloud NAT charges per VM and per GB processed — use Private Google Access for GCP API traffic instead
- BigQuery on-demand pricing charges for bytes scanned, not rows returned — partition tables; `LIMIT` caps rows returned but not bytes scanned, so it doesn't reduce cost
- Preemptible VMs save up to 80% but can be terminated anytime — only for fault-tolerant batch workloads
- Egress to the internet is billed; traffic within the same zone over internal IPs is free — keep resources in the same region and use Cloud CDN for global distribution
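The scan-cost and idle-resource traps above can be checked from the CLI. A minimal sketch — the project, dataset, and table names are placeholders:

```shell
# Estimate BigQuery scan cost before running: --dry_run reports the bytes
# that would be billed without executing the query.
bq query --use_legacy_sql=false --dry_run \
  'SELECT user_id FROM `my-project.analytics.events` WHERE event_date = "2024-01-01"'

# Find persistent disks not attached to any VM — these keep billing
# even though the VMs that used them are gone or stopped.
gcloud compute disks list --filter="-users:*" \
  --format="table(name,zone,sizeGb,type)"

# Find reserved static IPs that are not in use (billed while unattached).
gcloud compute addresses list --filter="status=RESERVED" \
  --format="table(name,region,address)"
```

Running the dry run before every new prod query is cheap insurance; the two list commands are worth wiring into a scheduled cost-hygiene job.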
Security Rules
- Service accounts are both identity and resource — one service account can impersonate another via `roles/iam.serviceAccountTokenCreator`
- IAM policy inheritance: Organization → Folder → Project → Resource — deny policies at the org level override allows below
- VPC Service Controls protect against data exfiltration — but break Cloud Console access if not configured with access levels
- Default Compute Engine service account has Editor role — create dedicated service accounts with least privilege
- Workload Identity Federation eliminates service account keys — use for GitHub Actions, GitLab CI, external workloads
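A sketch of the dedicated-service-account pattern from the rules above — project name, account name, and the granted roles are illustrative:

```shell
# Create a dedicated service account instead of relying on the
# default Compute Engine SA (which carries the broad Editor role).
gcloud iam service-accounts create app-backend \
  --display-name="App backend (least privilege)"

# Grant only the roles the workload actually needs.
gcloud projects add-iam-policy-binding my-project \
  --member="serviceAccount:app-backend@my-project.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# Let a trusted identity impersonate the SA (mint short-lived tokens)
# instead of exporting a long-lived JSON key.
gcloud iam service-accounts add-iam-policy-binding \
  app-backend@my-project.iam.gserviceaccount.com \
  --member="user:ci-admin@example.com" \
  --role="roles/iam.serviceAccountTokenCreator"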
Networking
- VPC is global, subnets are regional — unlike AWS, single VPC can span all regions
- Firewall defaults are implicit deny-all ingress and allow-all egress — add explicit deny rules if you need egress control
- Private Google Access is per-subnet setting — enable on every subnet that needs to reach GCP APIs without public IP
- Cloud Load Balancer global vs regional — global for multi-region, but regional is simpler and cheaper for single region
- Shared VPC separates network admin from project admin — host project owns network, service projects consume it
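The Private Google Access and egress-control rules above look roughly like this in practice — network, subnet, and rule names are placeholders:

```shell
# Enable Private Google Access on an existing subnet so VMs without
# public IPs can still reach Google APIs.
gcloud compute networks subnets update app-subnet \
  --region=us-central1 \
  --enable-private-ip-google-access

# Egress is implicitly allowed, so control it with an explicit
# low-priority deny-all rule...
gcloud compute firewall-rules create deny-all-egress \
  --network=app-vpc --direction=EGRESS --action=DENY \
  --rules=all --destination-ranges=0.0.0.0/0 --priority=65000

# ...then punch higher-priority holes for what you do allow, e.g. the
# private.googleapis.com range (199.36.153.8/30) over HTTPS.
gcloud compute firewall-rules create allow-egress-google-apis \
  --network=app-vpc --direction=EGRESS --action=ALLOW \
  --rules=tcp:443 --destination-ranges=199.36.153.8/30 --priority=1000
```

Lower numbers mean higher priority, so the 1000 allow rule is evaluated before the 65000 deny.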
Performance
- Cloud Functions gen1 has a 9-minute timeout — gen2 (built on Cloud Run) allows up to 60 minutes for HTTP-triggered functions
- Cloud SQL connection limits vary by instance size — use connection pooling or Cloud SQL Auth Proxy
- Firestore/Datastore hotspotting on sequential IDs — use UUIDs or reverse timestamps for document IDs
- GKE Autopilot simplifies operations but limits workloads — no privileged containers or host network, and DaemonSets face extra constraints
- Cloud Storage caps a single object at 5 TB — use object composition for larger logical objects and parallel composite uploads for faster transfers
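The composition workaround for the 5 TB object limit can be sketched as follows — bucket and object names are placeholders, and the config property for parallel composite uploads is the one I believe `gcloud storage` honors:

```shell
# Compose several uploaded parts into one logical object
# (up to 32 sources per compose call; composites can be re-composed).
gcloud storage objects compose \
  gs://my-bucket/part-1 gs://my-bucket/part-2 \
  gs://my-bucket/combined

# Speed up large single-object uploads: let gcloud split the file,
# upload parts in parallel, and compose them on the server side.
gcloud config set storage/parallel_composite_upload_enabled True
gcloud storage cp bigfile.bin gs://my-bucket/
```

Note that downloaders of composite objects need CRC32C support, since composite objects carry no MD5 hash.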
Monitoring
- Cloud Logging retention: 30 days default, _Required bucket is 400 days — create custom bucket with longer retention for compliance
- Cloud Monitoring alert policies have 24-hour auto-close — incident disappears even if issue persists, configure notification channels for re-alert
- Error Reporting groups by stack trace — same error with different messages creates duplicates
- Cloud Trace sampling is automatic — may miss rare errors, increase sampling rate for debugging
- Audit logs: Admin Activity always on, Data Access off by default — enable Data Access logs for security compliance
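The retention and Data Access rules above translate to commands like these — bucket name, retention period, and project are illustrative:

```shell
# Create a log bucket with longer retention than the 30-day default,
# then route compliance-relevant logs to it with a log sink.
gcloud logging buckets create compliance-logs \
  --location=global --retention-days=365 \
  --description="Long-retention bucket for compliance"

# Data Access audit logs live in the project IAM policy's auditConfigs;
# export the policy, add the config by hand, and set it back.
gcloud projects get-iam-policy my-project --format=json > policy.json
# ...edit policy.json to include:
# "auditConfigs": [{"service": "allServices",
#   "auditLogConfigs": [{"logType": "DATA_READ"}, {"logType": "DATA_WRITE"}]}]
gcloud projects set-iam-policy my-project policy.json
```

Enabling `DATA_READ` on all services can generate substantial log volume (and cost), so scope it to sensitive services where possible.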
Infrastructure as Code
- Terraform google provider requires project ID everywhere — use the `google_project` data source or variables, never hardcode
- `gcloud` commands are imperative — use Deployment Manager or Terraform for reproducible infra
- Cloud Build triggers on push, but IAM permissions on the first run are confusing — grant the Cloud Build service account the necessary roles before the first deploy
- Project deletion has 30-day recovery period — but project ID is globally unique forever, can't reuse
- Labels propagate to billing — use consistent labeling for cost allocation: `env`, `team`, `service`
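A sketch of the labeling and Cloud Build pre-grant rules above — instance, project, label values, role, and the Cloud Build SA's project number are placeholders:

```shell
# Apply the consistent label set so costs can be sliced in billing export.
gcloud compute instances update my-vm --zone=us-central1-a \
  --update-labels=env=prod,team=payments,service=checkout

# Read the active project instead of hardcoding it in scripts.
PROJECT_ID="$(gcloud config get-value project)"
gcloud projects update "$PROJECT_ID" \
  --update-labels=env=prod,team=payments

# Grant the Cloud Build service account its deploy roles BEFORE the
# first triggered build, so the first run doesn't fail on permissions.
gcloud projects add-iam-policy-binding "$PROJECT_ID" \
  --member="serviceAccount:123456789@cloudbuild.gserviceaccount.com" \
  --role="roles/run.admin"
```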
IAM Best Practices
- Primitive roles (Owner/Editor/Viewer) are too broad — use predefined roles, create custom for least privilege
- Service account keys are security liability — use Workload Identity, impersonation, or attached service accounts instead
- `roles/iam.serviceAccountUser` lets you run as that service account — equivalent to having its permissions, grant carefully
- Organization policies restrict what projects can do — `constraints/compute.vmExternalIpAccess` blocks public VM IPs org-wide
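The custom-role and org-policy rules above can be sketched like this — the role ID, its permission list, and `ORG_ID` are placeholders:

```shell
# Replace primitive Editor with a narrow custom role.
gcloud iam roles create appDeployer --project=my-project \
  --title="App Deployer" \
  --permissions=run.services.get,run.services.update

# Deny external IPs on all VMs org-wide: vmExternalIpAccess is a list
# constraint, so set a denyAll policy (Org Policy v2 YAML format).
cat > no-external-ip.yaml <<'EOF'
name: organizations/ORG_ID/policies/compute.vmExternalIpAccess
spec:
  rules:
    - denyAll: true
EOF
gcloud org-policies set-policy no-external-ip.yaml
```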