Debug

Goals

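Log Sources

Primary log: log/symphony.log
Rotated logs: log/symphony.log* (include these when the run you are tracing may have rotated out of the primary file)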

Correlation Keys

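elixir/docs/logging.md requires these fields for issue/session lifecycle logs. Use them as your join keys during debugging:

issue_identifier: human-readable ticket key (example: MT-625)
issue_id: Linear UUID
session_id: Codex session key in the form <thread>-<turn>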

Quick Triage (Stuck Run)

Confirm scheduler/worker symptoms for the ticket.
Find recent lines for the ticket (issue_identifier first).
Extract session_id from matching lines.
Trace that session_id across start, stream, completion/failure, and stall handling logs.
Decide class of failure: timeout/stall, app-server startup failure, turn failure, or orchestrator retry loop.

Commands

1) Narrow by ticket key (fastest entry point)

rg -n "issue_identifier=MT-625" log/symphony.log*

2) If needed, narrow by Linear UUID

rg -n "issue_id=<linear-uuid>" log/symphony.log*

3) List distinct session IDs in the logs (all tickets)

rg -o --no-filename "session_id=[^ ;]+" log/symphony.log* | sort -u

4) Trace one Codex session end to end

rg -n "session_id=<thread>-<turn>" log/symphony.log*
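The ticket and session searches above can be chained so that only the session IDs belonging to one ticket are listed. The following is a minimal sketch, not part of Symphony: the function name sessions_for_ticket is hypothetical, and it assumes issue_identifier and session_id appear as key=value tokens on the same log line. It also anchors the ticket key so that MT-625 does not match MT-6250.

```shell
# Hypothetical helper (not part of Symphony): distinct session IDs for one ticket.
# Assumes issue_identifier and session_id are key=value tokens on the same line.
sessions_for_ticket() {
  ticket="$1"; shift
  # Require a delimiter or end of line after the key so MT-625 does not match MT-6250.
  pattern="issue_identifier=${ticket}( |;|\$)"
  if command -v rg >/dev/null 2>&1; then
    rg --no-filename --no-line-number -e "$pattern" "$@" \
      | rg -o -e 'session_id=[^ ;]+' -
  else
    # Fallback for hosts without rg; grep -E accepts the same patterns.
    grep -hE -e "$pattern" "$@" \
      | grep -oE -e 'session_id=[^ ;]+'
  fi | sort -u
}
```

Example call: sessions_for_ticket MT-625 log/symphony.log*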

Investigation Flow

Locate the ticket slice:
Search by issue_identifier=<KEY>.
If noise is high, add issue_id=<UUID>.
Establish timeline:
Identify the first "Codex session started ... session_id=..." line.
Follow with "Codex session completed", "ended with error", or worker exit lines.
Classify the problem:
Stall loop: "Issue stalled ... restarting with backoff".
App-server startup: "Codex session failed ...".
Turn execution failure: turn_failed, turn_cancelled, turn_timeout, or "ended with error".
Worker crash: "Agent task exited ... reason=...".
Validate scope:
Check whether failures are isolated to one issue/session or repeating across multiple tickets.
Capture evidence:
Save key log lines with timestamps, issue_identifier, issue_id, and session_id.
Record probable root cause and the exact failing stage.
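The classification and scope steps above can be approximated in one pass by tallying failure classes per ticket. This is a rough sketch under stated assumptions: the function name failure_classes and the class labels (stall, startup, turn, worker_crash) are made up for illustration, each relevant line is assumed to carry an issue_identifier=<KEY> token, and the match strings are the log phrases listed in the classification step. awk is used here because it can group and count in a single pass; on very large logs, pre-filtering with rg first may be faster.

```shell
# Hypothetical helper (not part of Symphony): tally failure classes per ticket.
# Output lines have the form "<ticket> <class> <count>".
failure_classes() {
  awk '
    {
      id = ""
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^issue_identifier=/) {
          id = $i
          sub(/^issue_identifier=/, "", id)
          sub(/;$/, "", id)            # drop a trailing ";" delimiter
        }
      }
      if (id == "") next               # line is not tied to a ticket

      cls = ""
      if ($0 ~ /Issue stalled/) cls = "stall"
      else if ($0 ~ /Codex session failed/) cls = "startup"
      else if ($0 ~ /turn_failed|turn_cancelled|turn_timeout|ended with error/) cls = "turn"
      else if ($0 ~ /Agent task exited/) cls = "worker_crash"

      if (cls != "") count[id " " cls]++
    }
    END { for (k in count) print k, count[k] }
  ' "$@" | LC_ALL=C sort
}
```

Example call: failure_classes log/symphony.log*. If the same class shows up under several tickets, the problem is probably not isolated to one issue.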

Reading Codex Session Logs

In Symphony, Codex session diagnostics are emitted into log/symphony.log and keyed by session_id. Read them as a lifecycle: session start, stream events, completion or failure, and stall handling.

For one specific session investigation, keep the trace narrow:

Capture one session_id for the ticket.
Build a timestamped slice for only that session:
rg -n "session_id=<thread>-<turn>" log/symphony.log*
Mark the exact failing stage:
Startup failure before stream events ("Codex session failed ...").
Turn/runtime failure after stream events (turn_* / "ended with error").
Stall recovery ("Issue stalled ... restarting with backoff").
Pair findings with issue_identifier and issue_id from nearby lines to confirm you are not mixing concurrent retries.

Always pair session findings with issue_identifier/issue_id to avoid mixing concurrent runs.
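One way to do that pairing check mechanically is to list every issue_identifier/issue_id combination that appears on a session's lines. The sketch below is illustrative only: issues_for_session is a hypothetical name, and it assumes all three keys are whitespace-separated key=value tokens, optionally followed by ";". A missing key is printed as "?".

```shell
# Hypothetical helper (not part of Symphony): which issue keys appear
# on the lines of one session. Prints one line per distinct pair.
issues_for_session() {
  session="$1"; shift
  awk -v key="session_id=${session}" '
    {
      hit = 0; ident = "?"; uuid = "?"
      for (i = 1; i <= NF; i++) {
        tok = $i
        sub(/;$/, "", tok)                      # drop a trailing ";" delimiter
        if (tok == key) hit = 1                 # exact token match, so a-1 does not match a-10
        else if (tok ~ /^issue_identifier=/) ident = tok
        else if (tok ~ /^issue_id=/) uuid = tok
      }
      if (hit) seen[ident " " uuid] = 1
    }
    END { for (pair in seen) print pair }
  ' "$@" | LC_ALL=C sort
}
```

Example call: issues_for_session "<thread>-<turn>" log/symphony.log*. More than one output line, or a "?" placeholder, suggests the slice needs a closer look before drawing conclusions.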

Notes

Prefer rg over grep for speed on large logs.
Check rotated logs (log/symphony.log*) before concluding data is missing.
If required context fields are missing in new log statements, align with elixir/docs/logging.md conventions.
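To spot log statements that may be missing context fields, one option is to flag lines that carry a session_id but lack issue_identifier or issue_id. This is a sketch under the assumption that those three keys are the required fields and that they are written as key=value on a single line; the function name missing_context is hypothetical, and elixir/docs/logging.md remains the authority on what is actually required.

```shell
# Hypothetical helper (not part of Symphony): print file:line for session
# lines that lack either issue key.
missing_context() {
  awk '
    /session_id=/ && !(/issue_identifier=/ && /issue_id=/) {
      print FILENAME ":" FNR ": " $0
    }
  ' "$@"
}
```

Example call: missing_context log/symphony.log*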