Cross-Model Code Review with Codex CLI

Cross-model validation using the codex binary directly. Claude writes code, Codex reviews it — different architecture, different training distribution, no self-approval bias.

Core insight: Single-model self-review is systematically biased. Cross-model review catches different bug classes because the reviewer has fundamentally different blind spots than the author.

Prerequisite: The codex CLI must be installed and authenticated. Verify with codex --help. User defaults are configured in ~/.codex/config.toml — respect those defaults and never override model or reasoning effort unless the user explicitly requests it.

Two Ways to Invoke Codex

Mode	Command	Best For
`codex review`	Structured diff review with prioritized findings	Pre-PR reviews, commit reviews, WIP checks
`codex exec`	Freeform non-interactive deep-dive with full prompt control	Security audits, architecture review, focused investigation

Key flags:

Flag	Applies To	Purpose
`--base <BRANCH>`	`review` only	Diff against base branch
`--commit <SHA>`	`review` only	Review a specific commit
`--uncommitted`	`review` only	Review working tree changes
`-c model="<model>"`	both	Override default model (only if user requests)
`-m <model>`	`exec` only	Model override shorthand

Review Patterns

Pattern 1: Pre-PR Full Review (Default)

The standard review before opening a PR. Use for any non-trivial change.

Step 1 — Structured review (catches correctness + general issues):
  Run via Bash:
    codex review --base main

Step 2 — Security deep-dive (if code touches auth, input handling, or APIs):
  Run via Bash:
    codex exec "<security prompt from references/prompts.md>"

Step 3 — Fix findings, then re-review:
  Run via Bash:
    codex review --base main

Pattern 2: Commit-Level Review

Quick check after each meaningful commit.

codex review --commit <SHA>

Pattern 3: WIP Check

Review uncommitted work mid-development. Catches issues before they're baked in.

codex review --uncommitted

Pattern 4: Focused Investigation

Surgical deep-dive on a specific concern (error handling, concurrency, data flow).

codex exec \
  "Analyze [specific concern] in the changes between main and HEAD.
   For each issue found: cite file and line, explain the risk,
   suggest a concrete fix. Confidence threshold: only flag issues
   you are >=70% confident about."

Pattern 5: Ralph Loop (Implement-Review-Fix)

Iterative quality enforcement — implement, review, fix, repeat. Max 3 iterations.

Iteration 1:
  Claude -> implement feature
  Bash: codex review --base main -> findings
  Claude -> fix critical/high findings

Iteration 2:
  Bash: codex review --base main -> verify fixes + catch remaining
  Claude -> fix remaining issues

Iteration 3 (final):
  Bash: codex review --base main -> clean or accept trade-offs

STOP after 3 iterations. Diminishing returns beyond this.

Multi-Pass Strategy

For thorough reviews, run multiple focused passes instead of one vague pass. Each pass gets a specific persona and concern domain.

Pass	Focus	Mode
Correctness	Bugs, logic, edge cases, race conditions	`codex review`
Security	OWASP Top 10:2025, injection, auth, secrets	`codex exec` with security prompt
Architecture	Coupling, abstractions, API consistency	`codex exec` with architecture prompt
Performance	O(n^2), N+1 queries, memory leaks	`codex exec` with performance prompt

Run passes sequentially. Fix critical findings between passes to avoid noise compounding.

When to use multi-pass vs single-pass:

Change Size	Strategy
< 50 lines, single concern	Single `codex review`
50-300 lines, feature work	`codex review` + security pass
300+ lines or architecture change	Full 4-pass
Security-sensitive (auth, payments, crypto)	Always include security pass

Decision Tree: Which Pattern?

digraph review_decision {
    rankdir=TB;
    node [shape=diamond];

    "What stage?" -> "Pre-commit" [label="writing code"];
    "What stage?" -> "Pre-PR" [label="ready to submit"];
    "What stage?" -> "Post-commit" [label="just committed"];
    "What stage?" -> "Investigating" [label="specific concern"];

    node [shape=box];
    "Pre-commit" -> "Pattern 3: WIP Check";
    "Pre-PR" -> "How big?";
    "Post-commit" -> "Pattern 2: Commit Review";
    "Investigating" -> "Pattern 4: Focused Investigation";

    "How big?" [shape=diamond];
    "How big?" -> "Pattern 1: Pre-PR Review" [label="< 300 lines"];
    "How big?" -> "Full Multi-Pass" [label=">= 300 lines"];
}

Prompt Engineering Rules

Assign a persona — "senior security engineer" beats "review for security"
Specify what to skip — "Skip formatting, naming style, minor docs gaps" prevents bikeshedding
Require confidence scores — Only act on findings with confidence >= 0.7
Demand file:line citations — Vague findings without location are not actionable
Ask for concrete fixes — "Suggest a specific fix" not just "this is a problem"
One domain per pass — Security-only, architecture-only. Mixing dilutes depth.

Ready-to-use prompt templates are in references/prompts.md.

Anti-Patterns

Anti-Pattern	Why It Fails	Fix
"Review this code"	Too vague — produces surface-level bikeshedding	Use specific domain prompts with persona
Single pass for everything	Context dilution — every dimension gets shallow treatment	Multi-pass with one concern per pass
Self-review (Claude reviews Claude's code)	Systematic bias — models approve their own patterns	Cross-model: Claude writes, Codex reviews
No confidence threshold	Noise floods signal — 0.3 confidence findings waste time	Only act on >= 0.7 confidence
Style comments in review	LLMs default to bikeshedding without explicit skip directives	"Skip: formatting, naming, minor docs"
> 3 review iterations	Diminishing returns, increasing noise, overbaking	Stop at 3. Accept trade-offs.
Review without project context	Generic advice disconnected from codebase conventions	Run from repo root so project memory and source files are visible
Using an MCP wrapper	Unnecessary indirection over a CLI binary	Call `codex` directly via Bash
Hardcoding model names	Overrides user config and goes stale fast	Use the defaults from `~/.codex/config.toml`; never guess model names
Overcomplicating the invocation	Adding unnecessary flags, custom reasoning efforts, or exotic configs	Use `codex review` with simple flags (`--uncommitted`, `--base main`). The defaults are good

What This Skill is NOT

Not a replacement for human review. Cross-model review catches bugs but can't evaluate product direction or user experience.
Not a linter. Don't use Codex review for formatting or style — that's what linters are for.
Not infallible. 5-15% false positive rate is normal. Triage findings, don't blindly fix everything.
Not for self-approval. The whole point is cross-model validation. Don't use Claude to review Claude's code.

References

For ready-to-use prompt templates, see references/prompts.md.

codex-review

Safety Notice

Copy this and send it to your AI assistant to learn