LLM Codex Readiness Integration Test
This skill runs a multi-stage integration test to validate agentic execution quality. It always runs in execute mode (no read-only mode).
Outputs
Each run writes to .codex-readiness-integration-test/<timestamp>/ and updates .codex-readiness-integration-test/latest.json.
New outputs per run:
- agentic_summary.json and logs/agentic.log (agentic loop execution)
- llm_results.json (automatic LLM evaluation)
- summary.txt (human-readable summary)
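To confirm that a run produced the per-run artifacts listed above, a small check like the following could be used. It is a minimal sketch, not part of the skill: the helper name missing_artifacts is hypothetical, and only the four file paths come from the list above.

```python
from pathlib import Path

# Per-run artifacts named in the Outputs list, relative to the run directory.
EXPECTED_ARTIFACTS = (
    "agentic_summary.json",
    "logs/agentic.log",
    "llm_results.json",
    "summary.txt",
)


def missing_artifacts(run_dir: Path) -> list[str]:
    """Return the expected artifacts that are absent from run_dir (hypothetical helper)."""
    return [name for name in EXPECTED_ARTIFACTS if not (run_dir / name).is_file()]
```

Calling missing_artifacts(Path(".codex-readiness-integration-test") / timestamp) with the timestamp of a finished run should return an empty list when all four files were written.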
Pre-conditions (Required)
- Authenticate with the Codex CLI using the repo-local HOME before running the test. Run these in your own terminal (not via the integration test):
  HOME=$PWD/.codex-home XDG_CACHE_HOME=$PWD/.codex-home/.cache codex login
  HOME=$PWD/.codex-home XDG_CACHE_HOME=$PWD/.codex-home/.cache codex login status
- The integration test creates {repo_root}/.codex-home and {repo_root}/.codex-home/.cache/codex as its first step.
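The repo-local environment described above can be reproduced from Python when scripting around the test. This is a minimal sketch under the assumption that only HOME and XDG_CACHE_HOME need to be redirected; the function name codex_env is hypothetical and is not taken from the skill's scripts.

```python
import os
from pathlib import Path


def codex_env(repo_root: Path) -> dict[str, str]:
    """Build an environment that points HOME and XDG_CACHE_HOME at the repo-local .codex-home."""
    home = repo_root / ".codex-home"
    cache = home / ".cache"
    # Mirror the directories the integration test creates as its first step.
    (cache / "codex").mkdir(parents=True, exist_ok=True)
    env = dict(os.environ)  # copy, so the caller's environment is left untouched
    env["HOME"] = str(home)
    env["XDG_CACHE_HOME"] = str(cache)
    return env
```

The returned mapping can be passed as the env argument of subprocess.run, for example with the command ["codex", "login", "status"], to check authentication against the same HOME the test uses.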
Workflow
- Ask the user how to source the task.
  - Offer two explicit options: (a) user provides a custom task/prompt, or (b) auto-generate a task.
  - Do not run the entry point until the user chooses one option.
- Generate or load {out_dir}/prompt.pending.json.
  - Use the integration test's expected prompt path, not prompt.json at the repo root.
  - With the default out dir, this path is .codex-readiness-integration-test/prompt.pending.json.
  - If --seed-task is provided, it is used as the starting task.
  - If not provided, generate a task with skills/codex-readiness-integration-test/references/generate_prompt.md and save the JSON to {out_dir}/prompt.pending.json.
- The user must approve the prompt before execution (no auto-approve mode). Make sure to output a summary of the prompt when asking the user to approve.
- Execute the agentic loop via Codex CLI (uses AGENTS.md and change_prompt).
- Run build/test commands from the prompt plan via skills/codex-readiness-integration-test/scripts/run_plan.py.
- Collect evidence (evidence.json), deterministic checks, and run automatic LLM evals via Codex CLI.
- Score and write the report + summary output.
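The approval step in the workflow above (load the pending prompt, show a summary, require an explicit yes) might be sketched as follows. The schema of prompt.pending.json is not documented here, so the summary simply lists whatever top-level fields are present; all three function names are hypothetical.

```python
import json
from pathlib import Path


def load_pending_prompt(out_dir: Path) -> dict:
    """Load {out_dir}/prompt.pending.json, the path the workflow expects."""
    return json.loads((out_dir / "prompt.pending.json").read_text(encoding="utf-8"))


def summarize_prompt(prompt: dict, width: int = 80) -> str:
    """Render one truncated line per top-level field for the approval request."""
    lines = []
    for key, value in prompt.items():
        text = value if isinstance(value, str) else json.dumps(value)
        if len(text) > width:
            text = text[: width - 3] + "..."
        lines.append(f"{key}: {text}")
    return "\n".join(lines)


def is_approved(answer: str) -> bool:
    """Only an explicit yes counts; anything else blocks execution."""
    return answer.strip().lower() in {"y", "yes"}
```

Treating every answer other than an explicit yes as a refusal matches the rule that there is no auto-approve mode.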
Configuration
Optional fields in {out_dir}/prompt.pending.json:
- agentic_loop: configure Codex CLI invocation for the agentic loop.
- llm_eval: configure Codex CLI invocation for automatic evals.
If these fields are omitted, defaults are used.
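The fallback to defaults can be pictured as a shallow merge per section. The sketch below is illustrative only: the section names agentic_loop and llm_eval come from the list above, but the inner key command and the default values are assumptions, since this document does not list the real fields or defaults.

```python
from copy import deepcopy

# Hypothetical defaults: the real default values and inner keys are not documented here.
DEFAULTS = {
    "agentic_loop": {"command": ["codex"]},
    "llm_eval": {"command": ["codex"]},
}


def resolve_config(prompt: dict, defaults: dict = DEFAULTS) -> dict:
    """Return agentic_loop and llm_eval settings, falling back to defaults when omitted."""
    resolved = deepcopy(defaults)  # never mutate the shared defaults
    for section in ("agentic_loop", "llm_eval"):
        override = prompt.get(section)
        if isinstance(override, dict):
            resolved.setdefault(section, {}).update(override)
    return resolved
```

A prompt file that sets only llm_eval would then keep the default agentic_loop settings unchanged.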
Requirements
- The LLM evaluator must return a failing result if the evidence mentions the phrase "Context compaction enabled".
- Evaluate context usage qualitatively (no strict thresholds).
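The first requirement is stated for the LLM evaluator, but the same condition is easy to pre-check deterministically. The following is a minimal sketch of such a check, not the skill's own mechanism; the function name is hypothetical, and the exact, case-sensitive match is an assumption.

```python
import json

FORBIDDEN_PHRASE = "Context compaction enabled"


def mentions_compaction(evidence) -> bool:
    """Return True if the phrase appears anywhere in the evidence.

    Accepts raw text or a JSON-like structure; the match is exact and
    case-sensitive (an assumption, not documented behavior).
    """
    text = evidence if isinstance(evidence, str) else json.dumps(evidence, ensure_ascii=False)
    return FORBIDDEN_PHRASE in text
```

Serializing structured evidence to a single string keeps the check independent of where in evidence.json the phrase happens to appear.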
What this test covers well
- Runs Codex CLI against the real repo root, producing real filesystem edits and git diffs.
- Executes the approved change prompt and then runs the build/test plan in-repo.
- Captures evidence, deterministic checks, and LLM eval artifacts for review.
What this test does not represent
- The agentic loop may use non-default flags (e.g., bypass approvals/sandbox), so interactive guardrails differ.
- Uses a dedicated HOME (.codex-home), which can change auth/config/cache vs normal CLI use.
- Auto-generated prompts and one-shot execution do not simulate interactive guidance.
- MCP servers/tools are not exercised unless explicitly configured.
Notes
- The prompts in skills/codex-readiness-integration-test/references/ expect strict JSON.
- Use skills/codex-readiness-integration-test/references/json_fix.md to repair invalid JSON output.
- This skill calls the codex CLI. Ensure it is installed and available on PATH, or override the command in {out_dir}/prompt.pending.json.
- If the agentic loop detects sandbox-blocked tool access, it writes requires_escalation: true to {run_dir}/agentic_summary.json and exits with code 3. In that case, re-run the integration test with escalated permissions.
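A wrapper script could detect the escalation case from either signal described in the last note. This is a minimal sketch under the assumption that agentic_summary.json is a JSON object with a top-level requires_escalation field; the function name is hypothetical.

```python
import json
from pathlib import Path

ESCALATION_EXIT_CODE = 3  # exit code described in the note above


def needs_escalation(run_dir: Path, exit_code: int) -> bool:
    """True when the exit code is 3 or agentic_summary.json sets requires_escalation to true."""
    if exit_code == ESCALATION_EXIT_CODE:
        return True
    summary_path = run_dir / "agentic_summary.json"
    if not summary_path.is_file():
        return False
    try:
        summary = json.loads(summary_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        # An unreadable summary gives no escalation signal on its own.
        return False
    return isinstance(summary, dict) and summary.get("requires_escalation") is True
```

When this returns True, the appropriate response is the one given above: re-run the integration test with escalated permissions.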