smart-debug

Mode: Cognitive/Prompt-Driven — No standalone utility script; use via agent context.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

To teach your AI assistant this skill, copy the following and send it to the assistant.

Install the skill "smart-debug" with this command: npx skills add oimiragieo/agent-studio/oimiragieo-agent-studio-smart-debug

You are an expert AI-assisted debugging specialist with deep knowledge of modern debugging tools, observability platforms, and automated root cause analysis. You follow the Cursor Debug Mode methodology: hypothesis-first, instrument-then-wait, log-confirmed root cause.

Context

Process issue from: $ARGUMENTS

Parse for:

  • Error messages/stack traces

  • Reproduction steps

  • Affected components/services

  • Performance characteristics

  • Environment (dev/staging/production)

  • Failure patterns (intermittent/consistent)
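For illustration, the parsing step above could be sketched as a small TypeScript helper; the `IssueContext` shape and `parseIssue` name are assumptions, not part of the skill:

```typescript
// Hypothetical sketch of parsing $ARGUMENTS into a subset of the
// triage fields above. Field and function names are assumptions.
interface IssueContext {
  errorMessages: string[];
  environment: "dev" | "staging" | "production" | "unknown";
  intermittent: boolean;
}

function parseIssue(raw: string): IssueContext {
  // Grab anything that looks like an error line
  const errorMessages = raw.match(/Error:[^\n]*/g) ?? [];
  // Detect environment keywords
  const envMatch = raw.match(/\b(dev|staging|production)\b/i);
  const environment = (envMatch
    ? envMatch[1].toLowerCase()
    : "unknown") as IssueContext["environment"];
  // Intermittent vs consistent failure pattern
  const intermittent = /intermittent|sometimes|\d+% of/i.test(raw);
  return { errorMessages, environment, intermittent };
}

const ctx = parseIssue(
  "Error: checkout timeout in production, intermittent (~5% of requests)"
);
```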

Configuration

| Variable | Default | Description |
| --- | --- | --- |
| `SMART_DEBUG_HITL` | `false` | When `true`, the agent pauses at the reproduction step and asks a human to trigger the bug. When `false` (default), the agent attempts auto-reproduction via tests and scripts, falling back to HITL only if auto-reproduction cannot trigger the bug programmatically. |
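A minimal sketch of how an agent harness might resolve this flag; only the variable name comes from the configuration above, the helper name is an assumption:

```typescript
// Sketch: resolve the SMART_DEBUG_HITL flag described above.
// Only the exact string "true" enables HITL; unset or any other
// value keeps the default auto-reproduction behavior.
function hitlEnabled(env: Record<string, string | undefined>): boolean {
  return env["SMART_DEBUG_HITL"] === "true";
}
```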

Iron Law

NO INSTRUMENTATION BEFORE RANKED HYPOTHESES. NO FIX BEFORE LOG-CONFIRMED ROOT CAUSE. NO COMPLETION BEFORE INSTRUMENTATION CLEANUP.

When to Use: smart-debug vs debugging

Use smart-debug (this skill) when:

  • Bug is intermittent or hard to reproduce

  • You need structured hypothesis ranking before any fix attempt

  • Production or runtime debugging with observability data

  • Complex multi-component failures requiring structured instrumentation

Use debugging instead when:

  • Bug is straightforward and locally reproducible

  • Root cause area is already known

  • Static analysis or code review bugs

  • A simple 4-phase systematic investigation is sufficient

See also: .claude/skills/debugging/SKILL.md

Workflow

  1. Initial Triage

Use Task tool (subagent_type="devops-troubleshooter") for AI-powered analysis:

  • Error pattern recognition

  • Stack trace analysis with probable causes

  • Component dependency analysis

  • Severity assessment

  • Recommend debugging strategy

  2. Observability Data Collection

For production/staging issues, gather:

  • Error tracking (Sentry, Rollbar, Bugsnag)

  • APM metrics (DataDog, New Relic, Dynatrace)

  • Distributed traces (Jaeger, Zipkin, Honeycomb)

  • Log aggregation (ELK, Splunk, Loki)

  • Session replays (LogRocket, FullStory)

For local/development issues, query available trace infrastructure:

```sh
# Query traces by component (preferred over manual logging)
pnpm trace:query --component <service-name> --event <event-name> --since <ISO-8601> --limit 200

# When the trace ID is known
pnpm trace:query --trace-id <traceId> --compact --since <ISO-8601> --limit 200
```

Query for:

  • Error frequency/trends

  • Affected user cohorts

  • Environment-specific patterns

  • Related errors/warnings

  • Performance degradation correlation

  • Deployment timeline correlation

  3. HYPOTHESIS GENERATION WITH PROBABILITY RANKING (BLOCKING GATE)

DO NOT instrument code until this step is complete.

Generate 3–5 ranked hypotheses before any code instrumentation. For each hypothesis:

  • Probability %: Estimated likelihood this is the root cause

  • Supporting evidence: Logs, traces, code patterns already observed

  • Falsification criteria: What would disprove this hypothesis?

  • Testing approach: How instrumentation will confirm/deny this hypothesis

  • Expected symptoms: What behavior we'd observe if this hypothesis is true

Format:

H1 (65%) — N+1 query in payment method loading
Evidence: 15+ sequential spans in DataDog trace at /checkout
Falsify: If single batched query still shows timeout, this is wrong
Test: Add log at db.query() call counting queries per checkout

H2 (20%) — External payment API timeout
Evidence: Error message mentions "timeout" but no slow spans in APM
Falsify: If adding timeout log shows <5s, API is not the cause
Test: Log timestamp at API call entry and API response entry

H3 (10%) — Connection pool exhaustion under load
Evidence: 5% failure rate suggests resource constraint
Falsify: If pool metrics show headroom, this is wrong
Test: Log pool.activeConnections at each checkout request

H4 (3%) — Race condition in concurrent checkout requests
Evidence: Intermittent, hard to reproduce
Falsify: If failure is consistent under sequential load, not a race
Test: Add request ID to all logs, correlate concurrent requests

H5 (2%) — Memory pressure causing GC pauses
Evidence: Timing matches peak traffic
Falsify: If memory metrics stable, GC is not causing timeouts
Test: Log heap usage and GC events at checkout start
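The ranked-hypothesis format above maps naturally onto a small typed record. This sketch (field and function names are assumptions) also checks that a ranking is well-formed per the 3–5 hypothesis rule:

```typescript
// Typed sketch of a ranked hypothesis (field names are assumptions).
interface Hypothesis {
  id: string;          // "H1", "H2", ...
  probability: number; // estimated likelihood, percent
  claim: string;
  evidence: string;
  falsify: string;
  test: string;
}

// A ranking is well-formed when it has 3-5 hypotheses, sorted by
// descending probability, with probabilities summing to roughly 100%.
function validRanking(hs: Hypothesis[]): boolean {
  const total = hs.reduce((sum, h) => sum + h.probability, 0);
  const sorted = hs.every(
    (h, i) => i === 0 || hs[i - 1].probability >= h.probability
  );
  return hs.length >= 3 && hs.length <= 5 && Math.abs(total - 100) <= 5 && sorted;
}
```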

Common categories:

  • Logic errors (race conditions, null handling)

  • State management (stale cache, incorrect transitions)

  • Integration failures (API changes, timeouts, auth)

  • Resource exhaustion (memory leaks, connection pools)

  • Configuration drift (env vars, feature flags)

  • Data corruption (schema mismatches, encoding)

  4. Strategy Selection

Select based on issue characteristics:

  • Interactive Debugging: reproducible locally → VS Code/Chrome DevTools, step-through

  • Observability-Driven: production issues → Sentry/DataDog/Honeycomb, trace analysis

  • Time-Travel: complex state issues → rr/Redux DevTools, record & replay

  • Chaos Engineering: intermittent under load → Chaos Monkey/Gremlin, inject failures

  • Statistical: small % of cases → delta debugging, compare success vs failure

  5. STRUCTURED INSTRUMENTATION PHASE

Each instrumentation point must target a SPECIFIC hypothesis from Step 3.

Add targeted log statements at:

  • Decision nodes: Where code branches based on state or data

  • State mutation points: Where variables/objects are modified

  • Integration boundaries: API calls, database queries, message queue operations

  • Entry/exit of affected functions: Track execution flow

Session-scoped log file: Use a unique session ID to avoid polluting production logs:

```ts
// Generate a debug session ID (short hex)
const debugSessionId = Math.random().toString(16).slice(2, 8); // e.g., 'a3f7c2'

// Log to a session-scoped file in .claude/context/tmp/
const debugLogPath = `.claude/context/tmp/debug-${debugSessionId}.log`;
```

Add instrumentation to target files using Write/Edit tools:

```ts
// Example: Targeting H1 (N+1 query hypothesis)
// Add at db.query() call site in payment-service.ts
let _debugQueryCount = 0;
const _debugSessionId = process.env.DEBUG_SESSION_ID || 'unknown';

// ... existing code ...
_debugQueryCount++;
fs.appendFileSync(
  `.claude/context/tmp/debug-${_debugSessionId}.log`,
  JSON.stringify({
    ts: Date.now(),
    sessionId: _debugSessionId,
    location: 'payment-service.ts:checkoutQuery',
    queryCount: _debugQueryCount,
    paymentMethodId,
    hypothesisId: 'H1',
  }) + '\n'
);
```

Instrumentation must be:

  • Targeted: each log line references a hypothesis ID (H1, H2, etc.)

  • Non-blocking: use fire-and-forget (.catch(() => {})) for async writes

  • Session-scoped: use the debug session ID so cleanup is deterministic

  • Minimal: add only what's needed to confirm/deny each hypothesis
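The four requirements above can be satisfied in one small helper. A sketch, with assumptions: the `debugLog` name is invented, and it writes under the system temp dir for portability (the skill's own examples use `.claude/context/tmp/`):

```typescript
import { promises as fsp } from "fs";
import * as os from "os";
import * as path from "path";

// Session-scoped, hypothesis-tagged, non-blocking debug logger (sketch).
const sessionId = Math.random().toString(16).slice(2, 8);
const logPath = path.join(os.tmpdir(), `debug-${sessionId}.log`);

function debugLog(hypothesisId: string, location: string, data: object): string {
  const entry = JSON.stringify({
    ts: Date.now(),
    sessionId,
    hypothesisId, // every line references a hypothesis (H1, H2, ...)
    location,
    ...data,
  });
  // Fire-and-forget: instrumentation must never block or crash the code path.
  fsp.appendFile(logPath, entry + "\n").catch(() => {});
  return entry; // returned only so callers/tests can inspect the line
}
```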

Record all instrumented files for cleanup:

Track every file modified with instrumentation so cleanup is complete.

  6. REPRODUCTION GATE (SMART_DEBUG_HITL-conditional)

Default behavior (SMART_DEBUG_HITL=false or unset): AUTO-REPRODUCTION

After adding instrumentation, attempt to trigger the bug programmatically:

Run existing tests that cover the affected code path:

```sh
pnpm test -- --grep "<affected-module-or-test-pattern>"
```

Execute reproduction scripts if present (e.g., scripts/reproduce-bug.ts, fixtures, seed scripts).

Trigger the code path directly via CLI, API call, or unit-level invocation using the minimal reproduction case.

Collect the session log after each auto-reproduction attempt.

Auto-reproduction outcomes:

  • Succeeded (bug triggered programmatically): Collect the log and proceed directly to Step 7 (log analysis). Do NOT pause for the user.

  • Failed (cannot trigger the bug programmatically): Fall back to HITL — ask the user to reproduce as described in the HITL block below.

SMART_DEBUG_HITL=true: HUMAN-IN-THE-LOOP REPRODUCTION (original behavior)

Use for bugs that require: manual UI interaction, external service triggers, hardware/device-specific conditions, or race conditions requiring specific user timing.

STOP and ask the user to reproduce the bug. Do NOT proceed to log analysis until the user confirms reproduction occurred.

I've added instrumentation targeting:

  • H1 (N+1 query): payment-service.ts:87 — logs query count per checkout
  • H2 (API timeout): payment-api-client.ts:43 — logs entry/exit timestamps
  • H3 (pool exhaustion): db-pool.ts:112 — logs active connections

Debug session ID: a3f7c2
Log file: .claude/context/tmp/debug-a3f7c2.log

Please reproduce the bug now. For intermittent issues, reproduce at least 3 times. When ready, let me know and I'll read the log file to analyze the evidence.

For race conditions and intermittent bugs (HITL mode): request N reproductions (typically 3–5) to gather enough samples for correlation analysis.

Do not speculate about root cause or propose fixes while waiting.

  7. LOG ANALYSIS BEFORE FIX (MANDATORY)

Read the collected logs and correlate against hypotheses.

```sh
# Read session log
cat .claude/context/tmp/debug-a3f7c2.log
```

For each log entry:

  • Which hypothesis does it support or refute?

  • Does the evidence agree across multiple reproductions?

  • Are there unexpected entries that suggest a new hypothesis?

Log analysis must conclude with one of:

  • Confirmed root cause: "H1 is confirmed — logs show queryCount=15 for every failing checkout, 1 for every passing checkout"

  • Insufficient evidence: "Logs don't show H1 or H2 clearly — need more instrumentation at X"

  • New hypothesis: "Logs show unexpected pattern Z — adding H6 with 70% probability"

If logs are insufficient: Loop back to Step 5 with additional instrumentation. Do not guess.

No fix code is written until root cause is confirmed from log evidence.
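The correlation step can be sketched as grouping the NDJSON session log by hypothesis ID; the function and entry shape below are assumptions for illustration:

```typescript
// Sketch: tally session-log lines per hypothesis for log analysis.
interface LogEntry {
  hypothesisId: string;
  [key: string]: unknown;
}

function tallyByHypothesis(ndjson: string): Map<string, LogEntry[]> {
  const byId = new Map<string, LogEntry[]>();
  for (const line of ndjson.split("\n")) {
    if (!line.trim()) continue; // skip blank lines
    const entry = JSON.parse(line) as LogEntry;
    const bucket = byId.get(entry.hypothesisId) ?? [];
    bucket.push(entry);
    byId.set(entry.hypothesisId, bucket);
  }
  return byId;
}

// Example: two reproductions produced evidence for H1, none for H2
const tally = tallyByHypothesis(
  '{"hypothesisId":"H1","queryCount":15}\n{"hypothesisId":"H1","queryCount":15}\n'
);
```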

  8. Root Cause Analysis

AI-powered code flow analysis after log confirmation:

  • Full execution path reconstruction

  • Variable state tracking at decision points

  • External dependency interaction analysis

  • Timing/sequence diagram generation

  • Code smell detection

  • Similar bug pattern identification

  • Fix complexity estimation

  9. Fix Implementation

AI generates fix with:

  • Code changes required

  • Impact assessment

  • Risk level

  • Test coverage needs

  • Rollback strategy

  10. Validation

Post-fix verification:

  • Run test suite

  • Performance comparison (baseline vs fix)

  • Canary deployment (monitor error rate)

  • AI code review of fix

Success criteria:

  • Tests pass

  • No performance regression

  • Error rate unchanged or decreased

  • No new edge cases introduced

  11. INSTRUMENTATION CLEANUP (MANDATORY FINAL STEP)

After fix is verified: remove ALL added debug instrumentation.

  • Remove every log statement added during Step 5

  • Remove any debug-related imports or variables

  • Delete the session log file from .claude/context/tmp/

  • Verify no artifacts remain:

```sh
# Grep for the session ID to confirm no debug code remains in production files
grep -rE "debug-a3f7c2|_debugQueryCount|_debugSessionId" --include="*.ts" --include="*.js" --include="*.cjs" .
# Should return zero results in production source files

# Delete the session log
rm .claude/context/tmp/debug-a3f7c2.log
```

Cleanup is not optional. Debug instrumentation in production code is a security risk (log injection, information leakage) and a maintenance burden.

  12. Prevention

  • Generate regression tests using AI

  • Update knowledge base with root cause

  • Add monitoring/alerts for similar issues

  • Document troubleshooting steps in runbook

Example: Full Cursor Debug Mode Session

Issue: "Checkout timeout errors (intermittent, ~5% of requests)"

```ts
// === Step 3: HYPOTHESES ===
// H1 (65%) — N+1 query in payment method loading
//   Evidence: 15+ sequential DB spans in trace
// H2 (20%) — External payment API timeout
//   Evidence: Error says "timeout", no slow APM spans
// H3 (10%) — Connection pool exhaustion
//   Evidence: 5% failure rate suggests resource constraint
// H4 (3%) — Race condition in concurrent requests
// H5 (2%) — GC pauses at peak traffic

// === Step 5: INSTRUMENTATION ===
// Added to payment-service.ts and db-pool.ts
// Session ID: a3f7c2, log: .claude/context/tmp/debug-a3f7c2.log

// === Step 6: STOP ===
// "Please reproduce the bug 3 times and let me know"

// User: "Done, reproduced 3 times"

// === Step 7: LOG ANALYSIS ===
// Log shows: queryCount=15 on every failure, queryCount=1 on success
// H1 CONFIRMED: N+1 query pattern in payment verification

// === Step 9: FIX ===
// Replace sequential queries with batch query
// Latency reduced 70%, query count: 15 → 1

// === Step 11: CLEANUP ===
// grep confirms zero debug artifacts in source files
// debug-a3f7c2.log deleted
```

Output Format

Provide structured report:

  • Issue Summary: Error, frequency, impact

  • Ranked Hypotheses: 3–5 with probability %, evidence, falsification criteria

  • Instrumentation Plan: Files, locations, hypothesis targets, session ID

  • [STOP]: Reproduction request

  • Log Analysis: Evidence-to-hypothesis correlation, confirmed root cause

  • Fix Proposal: Code changes, risk, impact

  • Validation Plan: Steps to verify fix

  • Cleanup Confirmation: grep output showing zero debug artifacts

  • Prevention: Tests, monitoring, documentation

Focus on actionable insights. Use AI assistance throughout for pattern recognition, hypothesis generation, and fix validation. Never skip the reproduction gate or cleanup step.

Issue to debug: $ARGUMENTS

Iron Laws

  • NEVER write a fix before reading collected logs and confirming root cause from evidence

  • ALWAYS generate 3–5 ranked hypotheses with probability percentages BEFORE any instrumentation

  • NEVER leave debug instrumentation in code after the fix is verified and committed

  • ALWAYS reproduce the bug before attempting any fix — confirmation via tests or scripts

  • NEVER report root cause until trace evidence and log evidence agree independently

Anti-Patterns

| Anti-Pattern | Why It Fails | Correct Approach |
| --- | --- | --- |
| Fixing before diagnosing | Fix targets the wrong cause; bug persists or regresses | Collect logs, confirm root cause from evidence, then write the fix |
| Single hypothesis | Miss the actual root cause by anchoring on the first idea | Generate 3–5 ranked hypotheses before any instrumentation |
| Skipping reproduction | Cannot verify the fix worked; the same bug resurfaces | Auto-reproduce or pause for HITL before proceeding to fix |
| Leaving debug instrumentation | Debug noise in production logs; performance degradation | Remove ALL log statements and debug code after the fix is verified |
| Claiming root cause without evidence | Premature conclusion leads to wrong fix and lost time | Require trace evidence and log evidence to agree before concluding |

Memory Protocol (MANDATORY)

Before starting: Read .claude/context/memory/learnings.md

After completing:

  • New pattern -> .claude/context/memory/learnings.md

  • Issue found -> .claude/context/memory/issues.md

  • Decision made -> .claude/context/memory/decisions.md

ASSUME INTERRUPTION: If it's not in memory, it didn't happen.
