Structured Problem Solving

When to Use This vs. Direct Fix

Direct fix (skip this skill):

Error message points to exact cause
One-line config/code fix
You've seen this exact problem before

Use this skill:

You'd need to say "可能是..." to explain the cause
2+ components involved
You already tried a fix that didn't work
Wrong fix could cause data loss, privacy leak, or downtime

The Process

Step 0: Question Dissolution (消解层)

Before solving, check if the problem itself is valid. Many problems dissolve when examined properly.

Run these 3 checks sequentially. If any check dissolves the problem, stop and tell the user — a dissolved problem is more valuable than a solved one.

0.1 Language Trap Detection (语言陷阱)

Does the problem statement contain vague, undefined key terms?

Common trap words: "优化" "合适" "更好" "正常" "应该" "稳定"

Test: Can you give a measurable or actionable definition for every key term? If not, the problem can't be solved because it hasn't been stated.

→ If trapped: Ask the user to define the vague term. "你说的'优化'具体指什么？响应时间从 X 降到 Y？还是内存占用？还是用户体验？"

0.2 Hidden Assumption Check (假设检验)

Rewrite the problem as: "This problem assumes X. Is X true?"

Common false assumptions:

"系统变慢了" → assumes it was faster before (was it? measured when?)
"用户不喜欢这个功能" → assumes users have tried it (have they? data?)
"我们需要加这个功能" → assumes the current system can't do it (can it?)

→ If assumption is false: Tell the user. "你的问题假设了「X」，但这个前提可能不成立。如果 X 不成立，问题就消失了。"

0.3 Question vs. Problem Classification

Question: Has a standard answer, can be resolved by looking it up or reading docs
- → Answer directly, don't enter the full diagnostic process
Problem: No standard answer, requires investigation + experimentation
- → Continue to Step 1

If the problem survives all 3 checks, proceed to full diagnosis.

Step 1: Define the Problem

Turn vague "something's wrong" into a precise statement.

问题：[一句话]
现象：[具体发生了什么]
预期：[应该是什么样]
影响：[谁受影响，严重程度]
可复现：[是/否，触发条件]

Rules:

Describe what you observe, not what you think caused it
"webchat replies appear in DingTalk group" = problem ✅
"origin got polluted" = hypothesis, not problem ❌

Step 2: Diagnose

Do not skip to fixing. Trace the data flow end-to-end first.

2.1 Map the call chain

Input → Step A → Step B → Step C → Output
          ↓          ↓          ↓
        Check      Check      Check

2.2 Verify each step

Read actual values (logs, state files, source code). Do not guess.

2.3 Narrow down

Find the first step where output diverges from expected. That's where the bug is.

2.4 Confirm root cause

Three questions before you declare root cause:

Why? — Explain the mechanism, not just the symptom
Sufficient? — If I fix this, will the problem definitely disappear?
Unique? — Is there another cause that could produce the same symptom?

All three must be answered. If not → keep diagnosing.

Diagnostic tools (prefer in order):

Error messages / logs (fastest)
State inspection (config files, DB, session store)
Source code tracing (most reliable)
Minimal reproduction experiment

Step 3: Design Solutions

Generate at least 2 candidate solutions. Compare on:

Dimension	Question
Effectiveness	Fixes root cause or just symptom?
Risk	Could it break something else?
Complexity	How many components touched?
Reversibility	Can we roll back if wrong?
Durability	Survives restarts / updates?
Side effects	Impact on other features?

Present as:

方案 A：[one line]
  ✅ [pros]  ⚠️ [risks]

方案 B：[one line]
  ✅ [pros]  ⚠️ [risks]

→ 推荐 A，因为 [reason]

Always include the "do nothing / workaround" option if viable.

Step 4: Execute

Pre-flight checklist:

Root cause confirmed (not guessed)
Solution evaluated (not first idea)
User confirmed (for risky changes)
Rollback plan ready

Rules:

Change one variable at a time
Record what was changed and what it was before
Minimize scope — don't "fix other things while you're at it"

Step 5: Verify

Three levels of verification:

Direct: Reproduce original trigger → problem gone?
Regression: Related features still work?
Durability: Survives restart / next trigger?

Show evidence, don't say "应该好了".

Step 6: Review

## 复盘：[问题名]
耗时：X 分钟（有效 Y / 弯路 Z）
根因：[一句话]
修复：[一句话]
弯路：[走了什么弯路]
教训：[提炼的规则]

Write lessons to .learnings/ if reusable.

诊断超时与死胡同处理

信号	动作
同一假设连续 3 次验证无结论	停止，换假设或换诊断维度
累计诊断 >15 分钟无进展	暂停，向用户汇报已排除项 + 当前卡点，询问是否有额外线索
累计尝试 >5 个假设均被否定	考虑问题是否需要消解（回 Step 0）或需要更多上下文
修复后问题复现	不叠加补丁，回退到修复前状态，重新走 Step 1

Anti-patterns

Pattern	What it looks like	Fix
Guess-and-fix	See symptom → hypothesize → change immediately	Map call chain first
One-end-only	Check only input or output	Trace full data flow
Surface fix	Change the bad value without asking why it's bad	Ask "why did it become this value?"
Multi-change	Change 3 things at once	One variable at a time
Premature victory	"Should be fixed now" without checking	Show evidence
No rollback	Forget to record original values	Backup before modify

Communication During Problem-Solving

Define: Confirm understanding ("你说的问题是 X 对吗？")
Diagnose: Share progress, don't go silent ("在查 Y 环节，发现了 Z")
Design: Give choices, not just one option
Execute: Confirm before risky operations
Verify: Ask user to check on their end
Throughout: Say "I'm not sure yet" over false confidence

下一步建议（条件触发）

问题解决后，根据结果判断是否推荐下一步。

触发条件	推荐
根因是代码 bug，修复需要多文件改动	「根因清楚了，修复交给 coding-agent spawn Claude Code 来做。」
问题根因值得记录（同类问题可能再犯）	「这个教训值得记下来，写到 .learnings/ 防止再犯。」
问题在消解层被消解（问题本身不成立）	「问题已经消解了。如果背后有更大的决策要做，可以拉出来单独讨论。」
诊断过程发现系统架构层面的隐患	「这次修好了，但架构上还有隐患。要不要排个时间做一次 healthcheck？」

problem-solving

Safety Notice

Copy this and send it to your AI assistant to learn