# Terraform Plan Reviewer

Reviews a `terraform plan` (or `tofu plan`, or a Terragrunt run-all) and decides whether the apply is safe. It categorizes every change by destructiveness, drift, secret exposure, IAM blast radius, and provider-specific risk, then returns a PR-comment-style review with severity grades, a gating decision, and line-cited suggested fixes. It acts as a senior platform engineer who has approved thousands of applies and reverted the bad ones.
## Usage

Invoke this skill before any non-trivial `terraform apply`, in CI as a gate on PRs, or as a manual second opinion on an environment-affecting change.

Basic invocation:
- Review this terraform plan: [paste plan output]
- Gate this PR — should we apply or block?
- Audit my Terragrunt run-all output for drift

With context:
- Here's the plan and the previous state file — flag any IAM widening
- We're about to apply this to prod — what's the blast radius?
- Generate a GitHub Action that runs this review on every PR
The agent produces a graded review (P0/P1/P2 findings), a single apply gating decision (BLOCK / APPROVE-WITH-CONDITIONS / APPROVE), and per-finding suggested fixes with line numbers from the plan.
## How It Works

### Step 1: Parse The Plan
The agent normalizes input from any of:
- Human-readable `terraform plan` output (most common)
- JSON plan: `terraform show -json plan.out` (preferred — unambiguous)
- Terragrunt aggregated `run-all plan` output
- OpenTofu `tofu plan` (treated identically to terraform)
JSON form is preferred. When only text is available, the agent extracts:
```text
  + create   = N resources
  ~ update   = N resources
-/+ replace  = N resources (destroy then create)
+/- replace  = N resources (create then destroy — usually safer)
  - destroy  = N resources
 <= read     = N data sources
```
The agent also captures resource addresses (`module.x.aws_iam_role.y`), change reasons (`# forces replacement`), and any drift markers (`# (config refers to values not yet known)`, `Note: Objects have changed outside of Terraform`).
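As a sketch of where those facts live in the JSON form (the fields are Terraform's documented plan representation; the jq filter itself is illustrative):

```bash
# List every planned action and its resource address from a JSON plan
# produced by: terraform show -json plan.out > plan.json
jq -r '
  .resource_changes[]
  | select(.change.actions != ["no-op"])
  | [(.change.actions | join("+")), .address]
  | @tsv
' plan.json
```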
### Step 2: Severity Matrix
Every change in the plan gets exactly one severity. The matrix is the contract:
| Severity | Meaning | CI Default |
|---|---|---|
| P0 — Block | Apply will cause data loss, outage, security breach, or unauthorized access. | Hard block. Human override required. |
| P1 — Require Approval | High risk; reasonable in some contexts but needs second pair of eyes. | Require named reviewer + comment justification. |
| P2 — Advisory | Low risk; flagged for awareness. | Allow apply; record in review. |
| OK | Standard, low-risk change. | No flag. |
### Step 3: Detection Rules
The agent runs every change through a layered ruleset. A single change may fire multiple rules; the highest severity wins.
**D — Destructive change rules:**

```text
D1.  destroy on aws_db_instance, aws_rds_cluster, google_sql_database_instance,
     azurerm_mssql_database, azurerm_postgresql_server → P0
D2.  destroy on aws_s3_bucket, google_storage_bucket, azurerm_storage_account
     when not empty (force_destroy=false) → P0
D3.  destroy on stateful workloads (statefulset, persistent volume,
     elasticache, redis, kafka, mq broker) → P0
D4.  -/+ replace on a database, cache, queue, or persistent disk → P0
D5.  -/+ replace on a load balancer with active traffic → P1
D6.  -/+ replace on an IAM role / service account → P1
     (breaks every principal that assumes it)
D7.  destroy on a route53 / cloud DNS zone → P0
D8.  destroy on a KMS key, CMEK, key vault key → P0
     (data encrypted with it becomes unrecoverable)
D9.  destroy + recreate on a network resource carrying production traffic
     (vpc, subnet, vpc peering, transit gateway, vpn) → P0
D10. lifecycle.prevent_destroy=true bypassed by a destroy → P0
```
**S — Secret / output leak rules:**

```text
S1. output without sensitive = true containing a value matching
    password|secret|token|key|credential|api[_-]?key → P1
S2. resource attribute set from a hardcoded string matching the
    secret regex (in plan diff, not just var) → P0
S3. user_data / cloud-init / startup script contains a credential
    string visible in plan diff → P0
S4. tfvars/state file referenced from a public S3 bucket / public
    storage container / public repo → P0
S5. KMS / KeyVault key policy widens decrypt to "*" → P0
```
**I — IAM widening rules:**

```text
I1. Action: "*" or Resource: "*" on a new policy statement → P0
I2. Wildcard service principal in trust policy
    (Principal: {"AWS": "*"} or sts:AssumeRoleWithSAML to *) → P0
I3. iam:PassRole with Resource: "*" and no condition on iam:PassedToService → P0
I4. s3:* / dynamodb:* / kms:* added to a previously scoped policy → P1
I5. AdministratorAccess / Owner / Editor role assignment to a user
    that previously had a narrower role → P1
I6. GCP roles/iam.serviceAccountUser added without resource restriction → P1
I7. Azure RoleAssignment scope = subscription or management-group → P1
I8. Conditions removed from an existing IAM policy (less restriction) → P1
I9. MFA condition removed from an assume-role trust policy → P0
```
**N — Public exposure rules:**

```text
N1. S3 bucket public-access-block disabled OR ACL set to public-read → P0
N2. RDS / Cloud SQL / Azure SQL publicly_accessible = true → P0
N3. Security group ingress 0.0.0.0/0 on port not in {80, 443} → P0
N4. GCS bucket IAM grants allUsers or allAuthenticatedUsers → P0
N5. Azure Storage account public_network_access = "Enabled" → P1
N6. ELB / ALB scheme = internet-facing on a service that previously was
    internal → P1
N7. Cloud Functions / Lambda invocation policy adds Principal: "*" → P0
N8. GKE / EKS / AKS cluster endpoint becomes public → P0
N9. VPC subnet route added to internet gateway with 0.0.0.0/0 → P1
```
**X — Drift & state rules:**

```text
X1. Plan shows changes you didn't intend (drift) — values "changed
    outside of Terraform" → P1
X2. Resource referenced in code but absent from state (needs import) → P1
X3. Resource in state with no matching code (needs terraform state rm
    or restore to code) → P1
X4. Provider version pinning loosened (~> to >=) → P1
X5. Backend config changed (state moves region/bucket/account) → P0
X6. count or for_each shifts re-index a list, causing mass replacement → P0
```
**P — Provider-specific footguns (AWS):**

```text
PA1. aws_launch_configuration deprecated → P2 (use launch_template)
PA2. aws_security_group rules moved between inline blocks and
     aws_security_group_rule resources without an explicit replace
     (silent rule loss) → P1
PA3. aws_lb_target_group port change forces recreate → P1
PA4. aws_eks_cluster version downgrade attempted → P0
PA5. aws_db_instance allocated_storage decreased → P0
PA6. aws_iam_role assume_role_policy replaced (vs updated) → P1
PA7. aws_acm_certificate replaced while an LB still references the old one → P0
```
**P — Provider-specific footguns (GCP):**

```text
PG1. google_sql_database_instance deletion_protection=false → P1
PG2. google_compute_instance metadata_startup_script changed (forces replacement) → P1
PG3. google_project_iam_member vs _binding mixed (binding wipes others) → P0
PG4. google_kms_crypto_key with skip_initial_version_creation=true and
     destroy_scheduled_duration < 7d → P0
PG5. google_container_cluster master_auth changes force cluster replacement → P1
```
**P — Provider-specific footguns (Azure):**

```text
PZ1. azurerm_resource_group destroy (cascades all child resources) → P0
PZ2. azurerm_key_vault soft_delete_enabled toggled off → P0
PZ3. azurerm_storage_account replication type change forces destroy → P0
PZ4. azurerm_role_assignment with scope at subscription level → P1
PZ5. azurerm_virtual_machine deprecated (use _linux/_windows variant) → P2
```
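A rough approximation of two of these rules against the JSON plan, useful for prototyping a gate; the resource-type list and regexes are illustrative, not the agent's actual ruleset:

```bash
# Rough D1 check: stateful databases scheduled for destroy
jq -r '.resource_changes[]
  | select((.change.actions | index("delete"))
      and (.type | test("aws_db_instance|aws_rds_cluster|google_sql_database_instance|azurerm_mssql_database")))
  | "P0 D1 " + .address' plan.json

# Rough I1 check: wildcard Action/Resource in a newly created IAM policy
jq -r '.resource_changes[]
  | select(.type == "aws_iam_policy" and (.change.actions | index("create")))
  | select((.change.after.policy // "") | test("\"(Action|Resource)\"\\s*:\\s*\"\\*\""))
  | "P0 I1 " + .address' plan.json
```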
### Step 4: Apply Gating

The agent renders a single decision based on the highest-severity finding:

```text
ANY P0         → BLOCK (do not apply)
ANY P1, no P0  → APPROVE_WITH_CONDITIONS (named reviewer + justification)
P2 only        → APPROVE (with advisory comments)
no findings    → APPROVE (clean plan)
```
Conditions on APPROVE_WITH_CONDITIONS include: a specific named reviewer, a deploy window restriction (no Friday afternoons), a maintenance-window flag, paired apply (two engineers on call), or a pre-apply backup snapshot.
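In a CI shell step, the table above collapses to a few lines; the `review.json` counts file here is a hypothetical artifact of the review step:

```bash
# Derive the gating decision from finding counts produced by the review step
P0=$(jq '.p0 | length' review.json)
P1=$(jq '.p1 | length' review.json)
if   [ "$P0" -gt 0 ]; then DECISION=BLOCK
elif [ "$P1" -gt 0 ]; then DECISION=APPROVE_WITH_CONDITIONS
else                       DECISION=APPROVE   # P2-only and clean plans both pass
fi
echo "decision=$DECISION"
```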
### Step 5: Drift Detection
Drift is the silent killer of IaC. The agent surfaces three categories:
| Category | Marker | Action |
|---|---|---|
| Config drift | "changed outside of Terraform" | Flag P1; suggest `terraform refresh` then re-plan; or import the manual change as code |
| State drift | resource missing from state but in code | Flag P1; provide the `terraform import` command |
| Code drift | resource in state, removed from code | Flag P1; suggest terraform state rm if intentional, otherwise restore the resource block |
The agent generates the import command with the address and the cloud-side resource id whenever the cloud-side id can be inferred from the plan.
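Where the cloud-side ID is known, the fix can also be expressed declaratively on Terraform 1.5+ so the import itself shows up in the next plan; the address and ID below are illustrative:

```hcl
# CLI equivalent: terraform import aws_security_group.web sg-0a1b2c3d
# Remove this block after the import lands in state.
import {
  to = aws_security_group.web   # address the code expects
  id = "sg-0a1b2c3d"            # cloud-side ID inferred from the plan or console
}
```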
### Step 6: Comment-Style Review Output
The output mimics a senior reviewer's PR comment, formatted for direct paste into GitHub / GitLab / Bitbucket:
## Terraform Plan Review
**Decision:** BLOCK
**Findings:** 2 × P0, 1 × P1, 3 × P2
---
### P0 — Destroying production RDS
`aws_db_instance.payments_prod` is being destroyed. This is a
`stateful + production` resource. Apply will lose all data not
captured in a manual snapshot.
```diff
-resource "aws_db_instance" "payments_prod" {
-  identifier = "payments-prod"
-  ...
-}
```

**Fix options:**
- Add `lifecycle { prevent_destroy = true }` and re-plan; investigate why the resource block was removed from code.
- If the destruction is intentional (decommission), confirm a final snapshot exists: `aws rds describe-db-snapshots --db-instance-identifier payments-prod`.
- If the resource was renamed, run `terraform state mv` instead of destroy + create.

### P0 — IAM wildcard introduced
`aws_iam_policy.deploy` adds `"Action": "*"` on `"Resource": "*"`.
This grants the role full account access.
```diff
+ Action   = "*"
+ Resource = "*"
```
Fix: scope to the specific actions the deploy job actually needs.

### P1 — Secret in output
Output `db_password` lacks `sensitive = true`. Plan will print the
value to logs.
Fix:
```hcl
output "db_password" {
  value     = aws_db_instance.x.password
  sensitive = true
}
```

### P2 — Provider version loosened
`hashicorp/aws` constraint moved from `~> 5.40` to `>= 5.0`. Future
plans may pull a major version. Pin tightly.

### Summary
Apply is blocked until the two P0 findings are addressed. Recommended sequencing:
1. Restore the `aws_db_instance.payments_prod` block, or land a decommission RFC with snapshot evidence.
2. Replace the wildcard IAM policy with a scoped policy.
3. Re-run plan and request review.
### Step 7: CI Integration
The agent emits a ready-to-commit GitHub Action / GitLab CI job:
```yaml
# .github/workflows/terraform-plan-review.yml
name: Terraform Plan Review
on:
  pull_request:
    paths: ['terraform/**', '**.tf', '**.tfvars']
jobs:
  plan-review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with: { terraform_version: 1.7.5 }
      - run: terraform -chdir=terraform init -backend=false
      - id: plan
        run: |
          terraform -chdir=terraform plan -out=plan.out -no-color
          terraform -chdir=terraform show -json plan.out > terraform/plan.json
      - id: review
        uses: ./.github/actions/tf-plan-review
        with:
          plan-json: terraform/plan.json
          gating: strict   # block | strict | advisory
      - if: steps.review.outputs.decision == 'BLOCK'
        run: |
          echo "::error::Plan review blocked the apply"
          exit 1
      - uses: actions/github-script@v7
        with:
          script: |
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: require('fs').readFileSync('review.md', 'utf8')
            });
```
For Atlantis, Spacelift, env0, and Terraform Cloud — the agent emits the equivalent webhook / pre-apply hook config.
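As a sketch, a GitLab CI equivalent could look like this; the review wrapper script and artifact paths are placeholders:

```yaml
# .gitlab-ci.yml (sketch)
plan-review:
  stage: test
  image:
    name: hashicorp/terraform:1.7.5
    entrypoint: [""]          # override the terraform entrypoint so script lines run as shell
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
  script:
    - terraform -chdir=terraform init -backend=false
    - terraform -chdir=terraform plan -out=plan.out -no-color
    - terraform -chdir=terraform show -json plan.out > plan.json
    - ./scripts/tf-plan-review plan.json --gating strict   # hypothetical review wrapper
  artifacts:
    paths: [plan.json, review.md]
```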
### Step 8: Cloud-Specific Risk Lenses
The agent reads each plan with cloud-aware eyes:
AWS lens:
- IAM trust-policy changes carry the highest blast radius
- KMS key policy widening can grant cross-account decrypt — always P0
- Security group changes need effective rules diff, not just resource diff (a removed rule from one SG may be re-added in an inline rule on another)
- Route53 zone changes are slow to revert; mistakes propagate via DNS TTL
GCP lens:
- `google_project_iam_binding` overwrites all members on a role — P0 if mixed with `_member` resources for the same role
- KMS key versions in `DESTROY_SCHEDULED` cannot be recovered once `destroy_scheduled_duration` elapses (default 30 days, configurable as low as 24 hours)
- Cloud SQL `deletion_protection` should be true in prod; a plan toggling it off is P1
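The binding-vs-member footgun, spelled out in HCL (project and group values are placeholders):

```hcl
# Authoritative: after apply, this binding is the complete member list for the role.
# Any member granted elsewhere (console, gcloud, or a google_project_iam_member
# resource) is removed on the next apply.
resource "google_project_iam_binding" "log_viewers" {
  project = "my-project"
  role    = "roles/logging.viewer"
  members = ["group:platform@example.com"]
}

# Additive: grants one member and leaves every other grant on the role untouched.
resource "google_project_iam_member" "log_viewer" {
  project = "my-project"
  role    = "roles/logging.viewer"
  member  = "group:platform@example.com"
}
```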
Azure lens:
- Resource group destroys cascade silently — every child resource is gone
- Key Vault `soft_delete_enabled` and `purge_protection_enabled` interact; turning off either is P0
- Storage Account `account_replication_type` change forces destroy + create
- Role assignments at subscription scope are the equivalent of AWS account-level Admin — P1 minimum
### Step 9: Backup & Pre-Apply Checklist
For any plan with P1+ findings, the agent emits a pre-apply checklist:
- [ ] State backed up (`s3 cp` / `gsutil cp` / `az storage blob copy`)
- [ ] Manual DB snapshot taken for any stateful resource being touched
- [ ] Maintenance window confirmed with stakeholders
- [ ] On-call paged or notified
- [ ] Rollback plan documented in PR description
- [ ] Apply executed by named engineer (not bot)
- [ ] Post-apply: drift re-checked within 30 min
- [ ] Post-apply: smoke tests run against affected services
For P0 plans the checklist is moot — the gating is BLOCK and the plan is rejected.
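The first two checklist items usually reduce to commands like these; the bucket, key path, and DB identifier are placeholders:

```bash
# Copy the current state aside before applying
aws s3 cp s3://my-tf-state/prod/terraform.tfstate \
  s3://my-tf-state/prod/backups/terraform.tfstate.$(date +%Y%m%d%H%M%S)

# Take a manual snapshot of any RDS instance the plan touches
aws rds create-db-snapshot \
  --db-instance-identifier payments-prod \
  --db-snapshot-identifier payments-prod-preapply-$(date +%Y%m%d%H%M%S)
```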
### Step 10: Output Modes
The agent supports four output modes for different integration points:
| Mode | Format | Where It Plugs In |
|---|---|---|
| review-comment | Markdown PR comment with diffs + fixes | GitHub/GitLab PR |
| gating-decision | JSON {decision, p0:[], p1:[], p2:[]} | CI scripts |
| slack-summary | One-screen Slack post with deep links | Deploy channel |
| runbook | Markdown checklist + commands | Pre-apply prep |
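A gating-decision payload might look like the following; the per-finding fields beyond `decision`/`p0`/`p1`/`p2` are illustrative:

```json
{
  "decision": "APPROVE_WITH_CONDITIONS",
  "p0": [],
  "p1": [
    { "rule": "X1", "address": "aws_security_group.web", "summary": "ingress rule added outside Terraform" }
  ],
  "p2": []
}
```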
## Worked Examples

### Example 1: Production Plan — Block

Input (excerpt):
```text
# aws_db_instance.payments_prod will be destroyed
- resource "aws_db_instance" "payments_prod" {
    - identifier = "payments-prod"
    - engine     = "postgres"
    - ...
  }

# aws_iam_role_policy.deploy will be updated in-place
~ resource "aws_iam_role_policy" "deploy" {
    ~ policy = jsonencode(
        ~ {
            ~ Statement = [
                ~ {
                    ~ Action   = "*"
                    ~ Resource = "*"
                  },
              ]
          },
      )
  }

# aws_db_instance.payments_prod will be created
+ resource "aws_db_instance" "payments_prod" {
    + identifier = "payments-prod"
    + ...
  }
```
Wait — same identifier, destroy then create? That's a -/+ masquerading as separate destroy + create blocks because the resource address moved between modules. Rule D4 fires (P0).
**Decision:** BLOCK
**P0:** D4 (DB recreate), I1 (IAM wildcard)
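If the destroy/create pair really is just an address move into a module, a `moved` block (Terraform 1.1+) lets state follow the code without touching the database; the new module path here is illustrative:

```hcl
# Only needed when the address genuinely moved between modules
moved {
  from = aws_db_instance.payments_prod
  to   = module.payments.aws_db_instance.payments_prod
}
```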
### Example 2: Drift With Import Suggestion

Input:
```text
Note: Objects have changed outside of Terraform

# aws_security_group.web has changed
~ resource "aws_security_group" "web" {
      id      = "sg-0a1b2c3d"
    ~ ingress = [
        + {
            + cidr_blocks = ["192.168.10.0/24"]
            + from_port   = 22
            + to_port     = 22
            + protocol    = "tcp"
          },
        ...
      ]
  }

# aws_security_group_rule.ssh will be created
+ resource "aws_security_group_rule" "ssh" {
    + from_port   = 22
    + to_port     = 22
    + protocol    = "tcp"
    + cidr_blocks = ["10.0.0.0/8"]
  }
```
Findings:
- X1 (P1): drift — someone added a 192.168.10.0/24 SSH rule outside Terraform
- X2 (P1): the new code adds a 10.0.0.0/8 rule, which conflicts with the manually-added one and will not remove it
Suggested fixes:
- Import the manual rule first: `terraform import aws_security_group_rule.adhoc_ssh sg-0a1b2c3d_ingress_tcp_22_22_192.168.10.0/24`
- Decide whether to keep both rules or consolidate
- Re-plan and re-review

**Decision:** APPROVE_WITH_CONDITIONS — apply will succeed but state will remain inconsistent until the import is done.
### Example 3: Clean Plan

Input (excerpt):
```text
# aws_lambda_function.x will be updated in-place
~ environment {
    ~ variables = {
        ~ LOG_LEVEL = "info" -> "debug"
      }
  }
```
**Decision:** APPROVE
**Findings:** none
**Comment:** "Plan is clean. Single in-place env var update on a Lambda. No P-level findings. Apply approved."
## Output
The agent produces:
- Severity-graded review — every change classified P0/P1/P2/OK with rule citations
- Single gating decision — BLOCK / APPROVE_WITH_CONDITIONS / APPROVE
- PR-comment markdown — paste-ready into the PR thread
- Suggested fixes — code diffs and commands per finding
- Pre-apply checklist — for P1+ plans
- Drift triage — import / state-mv / state-rm commands per drift entry
- CI workflow — ready-to-commit YAML for GitHub Actions / GitLab CI / Atlantis
- Slack summary — one-screen status for the deploy channel
- Cloud-specific notes — AWS / GCP / Azure footguns relevant to this plan
## Common Scenarios

### "We use Terragrunt run-all — there's 40 modules"
The agent processes each module's plan independently, then aggregates: per-module decision plus a cross-module dependency check (e.g. module A destroys a KMS key that module B's plan still references). Cross-module references that fail produce an additional P0.
"The plan shows no changes but production is different"
Drift outside Terraform. The agent runs through terraform plan -refresh-only mentally and flags X1 across the board. Fix: refresh-only apply, then audit which fields were touched and bring them under code or back to declared state.
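The corresponding commands in modern Terraform (and OpenTofu):

```bash
terraform plan -refresh-only    # show what changed outside Terraform; propose no config changes
terraform apply -refresh-only   # accept the observed values into state before deciding to codify or revert
```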
"Should we apply during business hours?"
Depends on findings. Clean P2-only plans: yes. P1 plans involving load balancers, DNS, or stateful resources: schedule for low-traffic windows. P0: never (and gate blocks anyway).
"How do we handle plans with no JSON, only text?"
The agent's text parser handles standard plan output. JSON gives 100% accuracy on resource addresses and change reasons; text gives ~95% — the gap is mostly in nested module addresses and dynamic blocks.
"Atlantis already gates — why use this?"
Atlantis enforces who can apply, not whether the plan is safe. This skill is the safety review; Atlantis is the policy enforcement. They stack.
## Tips for Best Results

- Provide the JSON plan (`terraform show -json plan.out`) when possible — disambiguates module addresses and change reasons
- Share the previous state file (or its summary) when reviewing drift — distinguishes "drift since last apply" from "first-time import"
- State the environment (prod / staging / dev) before review — gating thresholds shift; a P1 in prod may be a P2 in dev
- Mention any in-flight infra changes outside this PR — concurrent applies can produce false-positive drift
- Specify the cloud (AWS / GCP / Azure) explicitly — provider-detection is reliable but ambiguous for multi-cloud plans
- Indicate the apply executor (CI bot vs human, Atlantis vs Terraform Cloud vs Spacelift) — gating recommendations adapt
## When NOT To Use

- Pre-merge code review of HCL without an actual `plan` — use `tflint` / `tfsec` / `checkov` for static analysis; this skill needs the rendered plan to grade destructiveness
- Ad-hoc CLI applies on a laptop without state in version control — fix the workflow first; reviewing a plan from an unknown state is reviewing fiction
- Plans against an empty state (initial bootstrap) — every resource is a `+ create`; the destruction matrix doesn't apply, so run a separate bootstrap-review skill
- Pulumi / CDKTF / CDK plans — the output format differs; this skill is HCL/JSON-plan specific. Use a CDK-specific reviewer
- Helm / Kustomize / kubectl-apply — Kubernetes-native deploys have a different risk model; use a k8s-manifest reviewer
- Drift-detection-only runs without intent to apply — those need a different output (drift report), not a gating decision