Debug
Goals
- Find why a run is stuck, retrying, or failing.
- Correlate Linear issue identity to a Codex session quickly.
- Read the right logs in the right order to isolate root cause.
Log Sources
- Primary runtime log: log/symphony.log
  - Default comes from SymphonyElixir.LogFile (log/symphony.log).
  - Includes orchestrator, agent runner, and Codex app-server lifecycle logs.
- Rotated runtime logs: log/symphony.log*
  - Check these when the relevant run predates the current log file.
Correlation Keys
- issue_identifier: human ticket key (example: MT-625)
- issue_id: Linear UUID (stable internal ID)
- session_id: Codex thread-turn pair (<thread_id>-<turn_id>)
elixir/docs/logging.md requires these fields for issue/session lifecycle logs. Use them as your join keys during debugging.
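As a minimal sketch of using these fields as join keys, the helper below pulls them out of a single log line. It assumes fields are written as key=value and that a value ends at whitespace or a semicolon, mirroring the session_id=[^ ;]+ pattern used in the commands in this guide. The function name and the sample line are hypothetical; the real log layout may differ.

```python
import re

# Assumption: fields look like key=value and a value ends at whitespace or ";".
FIELD_RE = re.compile(r"\b(issue_identifier|issue_id|session_id)=([^\s;]+)")


def correlation_keys(line: str) -> dict[str, str]:
    """Return whichever correlation fields appear in one log line."""
    return {key: value for key, value in FIELD_RE.findall(line)}


if __name__ == "__main__":
    # Hypothetical log line, for illustration only.
    sample = (
        "Codex session started issue_identifier=MT-625 "
        "issue_id=<linear-uuid> session_id=<thread>-<turn>"
    )
    print(correlation_keys(sample))
```

A line without any of the three fields yields an empty dict, which is a quick way to spot log statements that lack the required context.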
Quick Triage (Stuck Run)
- Confirm scheduler/worker symptoms for the ticket.
- Find recent lines for the ticket (issue_identifier first).
- Extract session_id from matching lines.
- Trace that session_id across start, stream, completion/failure, and stall handling logs.
- Decide the class of failure: timeout/stall, app-server startup failure, turn failure, or orchestrator retry loop.
Commands
1) Narrow by ticket key (fastest entry point)
rg -n "issue_identifier=MT-625" log/symphony.log*
2) If needed, narrow by Linear UUID
rg -n "issue_id=<linear-uuid>" log/symphony.log*
3) Pull session IDs seen for that ticket
rg "issue_identifier=MT-625" log/symphony.log* | rg -o "session_id=[^ ;]+" | sort -u
4) Trace one session end-to-end
rg -n "session_id=<thread>-<turn>" log/symphony.log*
5) Focus on stuck/retry signals
rg -n "Issue stalled|scheduling retry|turn_timeout|turn_failed|Codex session failed|Codex session ended with error" log/symphony.log*
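When a scripted version of steps 1 and 3 is handy, something like the following could work. It is a sketch under assumptions: rotated logs are plain text (not compressed), fields are written as key=value, and the function name is hypothetical. Unlike a plain substring search, it requires the ticket key to end at whitespace, a semicolon, or end of line, so MT-625 does not also match MT-6250.

```python
import glob
import re

# Assumption: a session_id value ends at whitespace or ";".
SESSION_RE = re.compile(r"session_id=([^\s;]+)")


def session_ids_for_ticket(log_glob: str, ticket_key: str) -> list[str]:
    """Collect distinct session IDs from lines that mention the ticket key."""
    ticket_re = re.compile(
        rf"issue_identifier={re.escape(ticket_key)}(?=[\s;]|$)"
    )
    found = set()
    for path in sorted(glob.glob(log_glob)):
        # errors="replace" keeps the scan going if a log has stray bytes.
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:
                if ticket_re.search(line):
                    found.update(SESSION_RE.findall(line))
    return sorted(found)


if __name__ == "__main__":
    for session_id in session_ids_for_ticket("log/symphony.log*", "MT-625"):
        print(session_id)
```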
Investigation Flow
- Locate the ticket slice:
  - Search by issue_identifier=<KEY>.
  - If noise is high, add issue_id=<UUID>.
- Establish timeline:
  - Identify the first Codex session started ... session_id=... line.
  - Follow with Codex session completed, ended with error, or worker exit lines.
- Classify the problem:
  - Stall loop: Issue stalled ... restarting with backoff.
  - App-server startup: Codex session failed ...
  - Turn execution failure: turn_failed, turn_cancelled, turn_timeout, or ended with error.
  - Worker crash: Agent task exited ... reason=...
- Validate scope:
  - Check whether failures are isolated to one issue/session or repeating across multiple tickets.
- Capture evidence:
  - Save key log lines with timestamps, issue_identifier, issue_id, and session_id.
  - Record probable root cause and the exact failing stage.
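The classify and validate-scope steps can be sketched as a small script. The markers are the message fragments quoted in this guide; the class labels, function names, and the assumption that failure lines carry issue_identifier=... are illustrative and should be checked against real log output.

```python
import re
from collections import defaultdict

# Message fragments quoted in this guide. The first matching class wins.
FAILURE_MARKERS = [
    ("stall loop", ("Issue stalled",)),
    ("app-server startup", ("Codex session failed",)),
    (
        "turn execution failure",
        ("turn_failed", "turn_cancelled", "turn_timeout", "ended with error"),
    ),
    ("worker crash", ("Agent task exited",)),
]

# Assumption: the ticket key is logged as issue_identifier=<KEY>.
ISSUE_RE = re.compile(r"issue_identifier=([^\s;]+)")


def classify(line):
    """Return the failure class for a log line, or None if no marker matches."""
    for label, markers in FAILURE_MARKERS:
        if any(marker in line for marker in markers):
            return label
    return None


def failure_scope(lines):
    """Map each failure class to the set of tickets it was seen on."""
    scope = defaultdict(set)
    for line in lines:
        label = classify(line)
        if label is None:
            continue
        match = ISSUE_RE.search(line)
        scope[label].add(match.group(1) if match else "<unknown>")
    return dict(scope)
```

A class that maps to a single ticket suggests an isolated problem; the same class across many tickets points at something shared, such as the app-server or the orchestrator.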
Reading Codex Session Logs
In Symphony, Codex session diagnostics are emitted into log/symphony.log and keyed by session_id. Read them as a lifecycle:
- Codex session started ... session_id=...
- Session stream/lifecycle events for the same session_id
- Terminal event:
  - Codex session completed ..., or
  - Codex session ended with error ..., or
  - Issue stalled ... restarting with backoff
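To make the lifecycle reading concrete, here is a rough sketch that summarizes one session from a list of log lines. It assumes every relevant line, including the stall line, carries session_id=... for that session, which may not hold in practice. The function name and the summary keys are hypothetical.

```python
import re

# Terminal message fragments quoted in this guide, mapped to short labels.
TERMINAL_MARKERS = {
    "Codex session completed": "completed",
    "Codex session ended with error": "ended with error",
    "Issue stalled": "stalled",
}


def session_lifecycle(lines, session_id):
    """Summarize start, intermediate lines, and terminal event for a session."""
    # Require the ID to end at whitespace, ";", or end of line so that
    # "t1-u1" does not also match "t1-u10".
    pattern = re.compile(rf"session_id={re.escape(session_id)}(?=[\s;]|$)")
    summary = {"started": False, "other_lines": 0, "terminal": None}
    for line in lines:
        if not pattern.search(line):
            continue
        if "Codex session started" in line:
            summary["started"] = True
            continue
        for marker, label in TERMINAL_MARKERS.items():
            if marker in line:
                summary["terminal"] = label  # the last terminal line wins
                break
        else:
            summary["other_lines"] += 1
    return summary
```

A summary with started set but no terminal event usually means the session is still running or its end was logged without the session_id field.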
For one specific session investigation, keep the trace narrow:
- Capture one session_id for the ticket.
- Build a timestamped slice for only that session:
  - rg -n "session_id=<thread>-<turn>" log/symphony.log*
- Mark the exact failing stage:
  - Startup failure before stream events (Codex session failed ...).
  - Turn/runtime failure after stream events (turn_* / ended with error).
  - Stall recovery (Issue stalled ... restarting with backoff).
- Pair findings with issue_identifier and issue_id from nearby lines to confirm you are not mixing concurrent runs or retries.
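One way to check for mixed runs is to group session IDs by ticket and look for tickets with more than one session. This is a sketch that assumes both fields appear on the same line as key=value pairs; the function name is hypothetical.

```python
import re

# Assumption: both fields are logged as key=value on the same line.
FIELD_RE = re.compile(r"\b(issue_identifier|session_id)=([^\s;]+)")


def sessions_by_issue(lines):
    """Group session IDs by ticket key, in first-seen order."""
    grouped = {}
    for line in lines:
        fields = dict(FIELD_RE.findall(line))
        issue = fields.get("issue_identifier")
        session = fields.get("session_id")
        if issue is None or session is None:
            continue  # line lacks one of the two join keys
        seen = grouped.setdefault(issue, [])
        if session not in seen:
            seen.append(session)
    return grouped
```

A ticket with several session IDs has been retried or run concurrently, so trace each session_id separately.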
Notes
- Prefer rg over grep for speed on large logs.
- Check rotated logs (log/symphony.log*) before concluding data is missing.
- If required context fields are missing in new log statements, align them with elixir/docs/logging.md conventions.