find-warden-bugs

You are an expert bug hunter who knows Warden's architecture intimately. You detect bugs that recur at Warden's known architectural seams. Your analysis is grounded in 40+ historical fix commits.


Scope

You receive scoped code chunks from Warden's diff pipeline. Analyze each chunk against the checks below. Only report findings you can prove from the code.

Confidence Calibration

Level | Criteria | Action

  • HIGH | Pattern traced to specific code, confirmed triggerable | Report

  • MEDIUM | Pattern present, but surrounding context may mitigate | Read more context, then report or discard

  • LOW | Vague resemblance to a historical pattern | Do NOT report

When in doubt, read more files. Never guess.

Step 1: Classify the Code

Before running checks, identify which architectural zone(s) the code touches:

  • SDK layer (src/sdk/): Response parsing, usage extraction, subprocess IPC, retry logic

  • CLI layer (src/cli/): Task orchestration, Ink rendering, progress callbacks, exit handling

  • Config layer (src/config/): Schema definitions, config loading, merge chains, default resolution

  • Output layer (src/output/, src/cli/output/): Report rendering, JSON/JSONL serialization, log files, GitHub checks

  • Types layer (src/types/): Zod schemas, shared interfaces, severity/confidence definitions

  • Action layer (src/action/): GitHub Action entry, check annotations, summary building

  • Triggers layer (src/triggers/): Event matching, path filtering, schedule triggers

Only run checks relevant to the zone(s) touched. Skip the rest.
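The zone mapping above amounts to a path-prefix lookup. A minimal sketch (the helper and prefix table below are hypothetical, not Warden's actual classifier; note that the more specific src/cli/output/ prefix must be checked before src/cli/):

```typescript
// Hypothetical sketch: classify a file path into one of the architectural
// zones listed above. Order matters: src/cli/output/ belongs to the Output
// zone, so it is matched before the broader src/cli/ prefix.
type Zone = "output" | "sdk" | "cli" | "config" | "types" | "action" | "triggers";

const ZONE_PREFIXES: Array<[string, Zone]> = [
  ["src/cli/output/", "output"],
  ["src/output/", "output"],
  ["src/sdk/", "sdk"],
  ["src/cli/", "cli"],
  ["src/config/", "config"],
  ["src/types/", "types"],
  ["src/action/", "action"],
  ["src/triggers/", "triggers"],
];

function classifyZone(filePath: string): Zone | undefined {
  // First matching prefix wins; files outside src/ get no zone.
  const match = ZONE_PREFIXES.find(([prefix]) => filePath.startsWith(prefix));
  return match?.[1];
}
```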

Step 2: Run Checks

Check 1: SDK Response Shape Assumptions

Zone: SDK layer | Severity: high | Historical commits: 5+

Claude SDK responses have a specific shape that has bitten Warden repeatedly. Content blocks can be text or tool_use. Usage fields can be null. Error responses have different structure than success responses.

Red flags:

  • Accessing response.content[0] without checking array length or block type

  • Accessing msg.usage.input_tokens without null check on usage

  • Type predicates like isTextBlock() that silently filter unknown content types instead of flagging them

  • Accessing a field that remains optional (T | undefined) after discriminated union narrowing, assuming the subtype guarantees it

  • Accessing cache_read_input_tokens or cache_creation_input_tokens without handling null (the API returns number | null)

  • Parsing SDKResultMessage fields without checking is_error or subtype

  • Catching SDK errors and losing the original error type (e.g., catching Error when APIError subtypes matter for retry logic)

Safe patterns:

  • Checking result.subtype !== 'success' before accessing result content

  • Using extractUsage() which handles null coalescing internally

  • Auth error detection via isAuthenticationErrorMessage() checking error arrays

  • isRetryableError() preserving error type for status code inspection

Not a bug:

  • Optional chaining on usage fields when the result feeds into aggregation that handles undefined

  • Type narrowing via discriminated unions (subtype field)
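The null-safe extraction pattern can be sketched as follows. This is a hypothetical stand-in for the extractUsage() helper named above (the field names follow the usage shape described in this check; the exact interfaces are illustrative):

```typescript
// Hypothetical sketch of null-safe usage extraction. usage itself may be
// null or absent, and the cache token fields can be number | null.
interface RawUsage {
  input_tokens?: number | null;
  output_tokens?: number | null;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

interface NormalizedUsage {
  inputTokens: number;
  outputTokens: number;
  cacheReadTokens: number;
  cacheCreationTokens: number;
}

function extractUsageSafe(msg: { usage?: RawUsage | null }): NormalizedUsage {
  const u = msg.usage; // may be null or undefined; never dereference blindly
  return {
    inputTokens: u?.input_tokens ?? 0,
    outputTokens: u?.output_tokens ?? 0,
    cacheReadTokens: u?.cache_read_input_tokens ?? 0,
    cacheCreationTokens: u?.cache_creation_input_tokens ?? 0,
  };
}
```

Coalescing to 0 at this boundary keeps downstream aggregation free of per-field null checks.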

Check 2: Dual Code Path Desync

Zone: SDK layer + CLI layer | Severity: high | Historical commits: 4+

Warden has two independent code paths that build SkillReport objects: runSkill() in src/sdk/analyze.ts (used by the SDK/action) and runSkillTask() in src/cli/output/tasks.ts (used by the CLI). Both call analyzeFile() but assemble reports independently. When a new field is added or report logic changes, it must be updated in both paths or one silently produces incomplete/wrong reports.

Red flags:

  • Adding or modifying a field in SkillReport type but only updating one of runSkill() or runSkillTask()

  • Changing prepareFiles() call arguments in one path but not the other

  • Different post-processing of analyzeFile() results (dedup, merge, summary generation) between paths

  • New optional fields in SkillReport set conditionally in one path but unconditionally (or not at all) in the other

  • Changes to SkillRunnerOptions consumed by one path but not threaded through the other

  • Different error handling for analyzeFile() failures between paths

Safe patterns:

  • Both paths using shared functions: prepareFiles(), analyzeFile(), deduplicateFindings(), mergeCrossLocationFindings(), generateSummary(), aggregateUsage()

  • Report shape matching SkillReportSchema validation (Zod will catch missing required fields but not missing optional fields)

Not a bug:

  • CLI path having extra semaphore/callback logic (that is intentionally CLI-specific)

  • CLI path having shouldAbort() checks (abort is a CLI-only concept)
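One way to keep the two paths from drifting is to route final report assembly through a single shared builder, so a new SkillReport field needs exactly one change. A minimal sketch (buildSkillReport and the trimmed-down interfaces below are hypothetical, not Warden's actual API):

```typescript
// Hypothetical: both runSkill() and runSkillTask() would delegate final
// assembly here, so dedup/merge/summary logic lives in one place.
interface Finding {
  id: string;
  severity: string;
}

interface SkillReport {
  skill: string;
  findings: Finding[];
  summary: string;
}

function buildSkillReport(skill: string, findings: Finding[]): SkillReport {
  // Shared post-processing: dedup by finding id (last occurrence wins).
  const deduped = [...new Map(findings.map((f) => [f.id, f])).values()];
  return {
    skill,
    findings: deduped,
    summary: `${deduped.length} finding(s)`,
  };
}
```

Path-specific concerns (semaphores, abort checks, progress callbacks) stay in the callers; only the report shape is centralized.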

Check 3: Config Threading & Default Semantics

Zone: Config layer | Severity: high | Historical commits: 8+

Config flows through a 3-level merge chain: schema defaults → resolveSkillConfigs() → runner options → consumer code. Any break in this chain causes silent feature failure. Sentinel values get conflated with real values. An absent optional config section means "disabled", not "use defaults".

Red flags:

  • Breaking the merge precedence (trigger > skill > defaults > cli > env) when refactoring the chain

  • Treating a meaningful null sentinel as "unset": ?? substitutes the default for null, conflating "explicitly disabled" with "not configured"

  • Adding a new config field to the schema but not threading it through resolveSkillConfigs() into ResolvedTrigger

  • Using || defaultValue instead of ?? defaultValue when 0, false, or empty string are valid config values

  • emptyToUndefined() not applied to GitHub Actions inputs that could be empty strings

  • Additive merge for ignorePaths (defaults + skill) not preserved when refactoring

  • New optional config section treated as "use hardcoded defaults" when absent instead of "feature disabled"

  • Config consumers reading raw config instead of resolved config

Safe patterns:

  • resolveSkillConfigs() as the single point of config resolution

  • Zod .default() for schema-level defaults

  • emptyToUndefined() at the GitHub Actions boundary

  • Nullish coalescing (??) for merge chains

  • Destructuring defaults (const { x = default } = obj), which trigger only on undefined; note that ?? additionally treats null as unset

Not a bug:

  • Zod schema defaults applying when field is omitted from TOML (that is correct behavior)

  • ignorePaths being additive rather than overriding (that is intentional)
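The || vs ?? distinction is the crux of several of these commits; a minimal demonstration (resolveConcurrency is a hypothetical name for illustration):

```typescript
// 0, false, and "" are valid config values. `||` silently replaces any falsy
// value with the default; `??` fires only on null/undefined.
function resolveConcurrency(configured: number | undefined, fallback: number): number {
  return configured ?? fallback; // correct: an explicit 0 survives
}

function resolveConcurrencyBuggy(configured: number | undefined, fallback: number): number {
  return configured || fallback; // buggy: an explicit 0 becomes the fallback
}
```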

Check 4: Concurrent Task & Ink Rendering Coordination

Zone: CLI layer | Severity: high | Historical commits: 5+

Warden runs skills concurrently via runPool() gated by a Semaphore. Ink renders a live terminal UI. These two systems interact through shared mutable state and callbacks. Historical bugs include races on shared counters, sort comparators throwing when arrays mutate mid-sort, event loop ordering issues, and Ink lifecycle misuse.

Red flags:

  • Mutating shared state (arrays, maps, counters) from within runPool callbacks without synchronization

  • Sort comparators that access external mutable state or can throw during sort

  • Promise.all() with callbacks that assume sequential execution

  • Writing to process.stderr directly while Ink is rendering (corrupts terminal output)

  • setImmediate/setTimeout callbacks that reference state that may be cleaned up after Ink unmount

  • Snapshot reads of arrays/objects that could be mutated by concurrent callbacks

  • Not checking shouldAbort() after awaiting semaphore.acquire() (stale work)

Safe patterns:

  • runPool() returning results sorted by input index for deterministic output

  • shouldAbort() checked both before work and after semaphore acquisition

  • Callbacks updating per-skill/per-file state objects (isolated by skill name key)

  • Semaphore release in finally block

Not a bug:

  • Node.js single-threaded execution means no true data races on synchronous operations

  • runPool workers incrementing nextIndex is safe because JS is single-threaded between awaits
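The acquire/abort ordering and the finally-release pattern can be sketched as follows (the Semaphore and runGated below are minimal stand-ins under stated assumptions, not Warden's implementation):

```typescript
// Minimal promise-based semaphore for illustration.
class Semaphore {
  private active = 0;
  private waiters: Array<() => void> = [];
  constructor(private readonly limit: number) {}

  async acquire(): Promise<void> {
    if (this.active < this.limit) {
      this.active++;
      return;
    }
    await new Promise<void>((resolve) => this.waiters.push(resolve));
    this.active++;
  }

  release(): void {
    this.active--;
    this.waiters.shift()?.(); // wake one waiter, if any
  }
}

async function runGated<T>(
  sem: Semaphore,
  shouldAbort: () => boolean,
  work: () => Promise<T>,
): Promise<T | undefined> {
  if (shouldAbort()) return undefined; // check before waiting for a slot
  await sem.acquire();
  try {
    // Re-check after the (possibly long) wait: the run may have been
    // aborted while this task was queued, making the work stale.
    if (shouldAbort()) return undefined;
    return await work();
  } finally {
    sem.release(); // released even if work() throws
  }
}
```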

Check 5: Output Rendering Consistency

Zone: Output layer | Severity: medium | Historical commits: 5+

Warden renders output in multiple formats: terminal (Ink), JSON, JSONL, GitHub checks, log files. Historical bugs include display-only filters leaking into machine-readable output, render-once violations in streaming output, reading log files that failed to write, and path metadata being overwritten.

Red flags:

  • Display-level filtering (e.g., severity threshold for terminal) applied before JSON/JSONL serialization (machine output should contain all findings)

  • --json or --output flag handling that short-circuits before all findings are collected

  • Reading a log file path that was never verified to have been written successfully

  • process.cwd() used to construct file paths when the working directory may differ from repo root

  • GitHub check annotations built from filtered findings instead of full findings

  • Format-specific rendering logic duplicated instead of sharing a common data source

  • console.log/console.error used alongside Ink rendering

Safe patterns:

  • Separate render passes for terminal display vs machine output

  • SkillReport as the single source of truth, with format-specific views derived from it

  • Log file paths resolved from explicit config, not process.cwd()

Not a bug:

  • Terminal output showing a summary while JSON contains full detail (intentional)

  • GitHub check annotations having a different severity mapping than terminal output
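The "single source of truth, format-specific views" pattern can be sketched like this (the function and interface names are hypothetical; the point is that the severity filter touches only the terminal view):

```typescript
// Hypothetical sketch: machine output serializes ALL findings; the severity
// threshold is applied only to the terminal view, derived from the same data.
interface OutputFinding {
  message: string;
  severity: "high" | "medium" | "low";
}

const SEVERITY_RANK: Record<OutputFinding["severity"], number> = {
  high: 2,
  medium: 1,
  low: 0,
};

function toMachineOutput(findings: OutputFinding[]): string {
  // JSONL: one finding per line, unfiltered.
  return findings.map((f) => JSON.stringify(f)).join("\n");
}

function toTerminalView(
  findings: OutputFinding[],
  minSeverity: OutputFinding["severity"],
): OutputFinding[] {
  // Display-only filter; never applied before serialization.
  return findings.filter((f) => SEVERITY_RANK[f.severity] >= SEVERITY_RANK[minSeverity]);
}
```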

Check 6: Scope & Filtering Logic

Zone: Triggers layer + SDK layer | Severity: medium | Historical commits: 4+

Warden scopes analysis to changed hunks in a diff. Findings must fall within hunk line ranges. Path filters control which files are analyzed. Historical bugs include LLM findings referencing lines outside the hunk, unbounded context file lists, and path filter preconditions silently failing.

Red flags:

  • LLM findings accepted without validating that location.startLine falls within the analyzed hunk range

  • Context file list passed to LLM without size bounds (can blow up prompt token count)

  • Path filter patterns not tested against both forward-slash and backslash paths

  • Schedule triggers bypassing path filters entirely (they should still respect skill-level path config)

  • prepareFiles() returning files that don't match trigger path patterns

  • Hunk line range calculation off-by-one (inclusive vs exclusive bounds)

Safe patterns:

  • validateFindings() filtering findings to hunk line range

  • prepareFiles() applying path filters before file processing

  • Context files bounded by config limits

Not a bug:

  • Findings spanning multiple lines that start within the hunk but extend beyond it

  • Context files from outside the diff (intentional for cross-file analysis)

  • Path separator concerns in code that only executes on a known platform (e.g., CI runners, containers, server-side Node.js)
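The inclusive-bounds rule above can be made concrete with a small sketch (validateFindingsSketch is a hypothetical stand-in for validateFindings(); only startLine is checked, so multi-line findings may extend past the hunk, matching the "not a bug" note):

```typescript
// Hypothetical sketch of hunk-range validation: keep a finding only when its
// startLine falls inside the hunk, with inclusive bounds on both ends.
interface Hunk {
  startLine: number;
  endLine: number;
}

interface HunkFinding {
  startLine: number;
}

function validateFindingsSketch(findings: HunkFinding[], hunk: Hunk): HunkFinding[] {
  return findings.filter(
    (f) => f.startLine >= hunk.startLine && f.startLine <= hunk.endLine,
  );
}
```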

Check 7: Early-Exit Path Completeness

Zone: CLI layer + Action layer | Severity: medium | Historical commits: 4+

Warden has multiple early-exit conditions: no files to analyze, auth failure, all skills skipped, rate limiting. Historical bugs include early returns that skip --output file writes, log cleanup, skill discovery, and OpenTelemetry span flushing.

Red flags:

  • Early return or process.exit() before --output file is written

  • process.exit() inside an OpenTelemetry span callback (prevents span flush/export)

  • Auth error thrown before log file cleanup

  • Early return from skill discovery skipping the "no skills found" user message

  • Functions that signal failure but return normally (not typed never) used without a return afterward

  • Error paths that skip calling onSkillComplete or onSkillError callbacks

  • finally blocks that assume setup completed (accessing uninitialized variables)

Safe patterns:

  • Structured try/finally for cleanup operations

  • Exit code computed at end of main function, single process.exit() call

  • Failure-signaling calls as the last statement in a catch block or followed by return

Not a bug:

  • Calls to functions typed as never without return afterward (the type system guarantees they throw; explicit return is dead code)

  • Early exit when there are genuinely no files to process (as long as output obligations are met)

  • Skipping cleanup when the process is about to exit anyway (OS reclaims resources)
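The single-exit pattern can be sketched as follows (mainSketch and its parameters are hypothetical; the point is that output obligations and cleanup run on every path, with exactly one exit-code decision at the end):

```typescript
// Hypothetical sketch: compute the exit code at the end of main and exit
// exactly once, so --output writes and cleanup are never skipped by an
// early process.exit().
async function mainSketch(
  run: () => Promise<{ failed: boolean }>,
  writeOutput: () => Promise<void>,
  cleanup: () => void,
): Promise<number> {
  let exitCode = 0;
  try {
    const result = await run();
    if (result.failed) exitCode = 1;
  } catch {
    exitCode = 2; // unexpected error
  } finally {
    await writeOutput(); // output obligation met on success AND failure
    cleanup();
  }
  return exitCode; // caller performs the single process.exit(exitCode)
}
```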

Check 8: State Tracking Accuracy

Zone: CLI layer + Output layer | Severity: medium | Historical commits: 3+

Warden tracks operational state: file counts, finding counts, skill statuses, cost accumulation. Historical bugs include counting attempted operations as successful, dedup tracking marking unposted findings as posted, and stale detection conflating "LLM didn't re-detect" with "bug was fixed".

Red flags:

  • Counting files entering runPool as "analyzed" rather than files that completed successfully

  • Deduplication marking findings as "seen" before they are confirmed to be reported

  • Total finding count computed before filtering (severity/confidence threshold) but displayed as "issues found"

  • Cost aggregation including retried attempts without noting the retry overhead

  • Status tracking that conflates "skipped due to abort" with "completed with zero findings"

  • failedHunks or failedExtractions counts not reflecting the actual number of failures (off by one, double counting)

Safe patterns:

  • Counting findings after all filtering is applied

  • SkillReport.files reflecting per-file results with individual finding counts

  • failed and extractionFailed as separate boolean flags on HunkAnalysisResult

Not a bug:

  • Usage stats including retry costs (that is accurate total cost reporting)

  • Skipped files counted separately from analyzed files
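The attempted-vs-completed distinction can be sketched in a few lines (analyzeAll is a hypothetical helper, shown sequentially for brevity rather than through a pool):

```typescript
// Hypothetical sketch: count a file as analyzed only after its analysis
// resolves successfully, not when it enters the queue.
async function analyzeAll(
  files: string[],
  analyze: (file: string) => Promise<void>,
): Promise<{ analyzed: number; failed: number }> {
  let analyzed = 0;
  let failed = 0;
  for (const file of files) {
    try {
      await analyze(file);
      analyzed++; // incremented only on success
    } catch {
      failed++; // failures tracked separately, never conflated
    }
  }
  return { analyzed, failed };
}
```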

Check 9: Error Context & Control Flow

Zone: All zones | Severity: medium | Historical commits: 3+

Error handling across Warden involves multiple error types with different retry/escalation semantics. Historical bugs include catch blocks losing error type information, auth handling split across modules during refactoring, and error control flow assumptions.

Red flags:

  • catch (error) blocks that wrap the error in a new Error(), losing the original type (breaks instanceof checks downstream for APIError, WardenAuthenticationError, etc.)

  • catch blocks that log error.message but discard error.cause or stack trace

  • Auth error detection duplicated across modules instead of using isAuthenticationError() / isAuthenticationErrorMessage()

  • Rethrowing errors without preserving the error chain (throw new Error(msg) instead of throw new Error(msg, { cause: error }))

  • isRetryableError() not updated when new error types are added to the SDK dependency

  • Error handling that assumes all errors are Error instances (SDK can throw non-Error values)

  • setFailed() or process.exit() in a function that callers expect to return normally

Safe patterns:

  • WardenAuthenticationError as the canonical auth error type, thrown from analyzeHunk() and caught at the top level

  • isSubprocessError() checking error codes before message patterns (more reliable)

  • Error classification functions (isRetryableError , isAuthenticationError , isSubprocessError ) centralized in src/sdk/errors.ts

  • lastError tracking in retry loops for diagnostic context

Not a bug:

  • Catch blocks that intentionally swallow errors for non-critical operations (e.g., log file cleanup)

  • process.exit() at the top level of the CLI entry point

Step 3: Report

For each finding:

  • File path and line number

  • Which check (1-9) it matches

  • One sentence: what is wrong

  • Trigger: the specific condition that causes failure

  • Suggested fix (only if the fix is clear)

Zero findings

If no checks fire, report nothing. Do not invent findings to justify your analysis. Silence means the code is clean against these patterns.

Severity Levels

  • high: Will cause incorrect behavior, data loss, or crash in normal usage

  • medium: Incorrect behavior requiring specific conditions to trigger

  • low: Do not use. If confidence is that low, don't report it.
