research-debug

Research-driven debugging workflow that combines web research for similar cases, deep code analysis with task-planner-analyzer, and iterative fixes with modular-code-architect and code-reviewer. Use when encountering training collapse, gibberish generation, or architecture-level bugs where the problem may be documented in research literature. Parallel execution of research and fixes for efficiency.


Install skill "research-debug" with this command: npx skills add iamseungpil/claude-for-dslab/iamseungpil-claude-for-dslab-research-debug

Research-Driven Debugging (research-debug)

An integrated workflow for resolving complex ML training problems and architecture-level bugs. Web research, code analysis, and iterative fixes run in parallel to find and fix root causes.

🎯 When to Use

✅ Ideal Use Cases

  • Training collapse/divergence: training runs well at first, then suddenly collapses
  • Gibberish generation: the model produces meaningless output
  • Known research problem: the issue has likely been discussed in papers or blog posts
  • Architecture-level bugs: a design flaw is suspected rather than a simple implementation bug
  • Performance anomalies: unexpected performance patterns (loss spikes, reward collapse, etc.)

❌ Not Suitable For

  • Simple bugs: fix syntax errors, typos, etc. directly
  • Well-defined requirements: implement clear requests like "Add feature X" directly
  • No literature: for genuinely novel problems, the research phase adds little value

📋 Workflow Phases

Phase 1: Evidence Gathering (run in parallel)

# Start three tasks simultaneously:
# 1. Web research for similar cases
# 2. Deep code analysis with task-planner-analyzer
# 3. Start safe fixes (config changes, monitoring)

Step 1.1: Web Research

WebSearch(
    query="[problem description] [domain] [year]"
)
# Example: "GRPO policy gradient collapse vocabulary 2024 2025"
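A trivial query builder in the same spirit, purely illustrative (the function name and signature are assumptions, not part of the skill):

```python
def build_query(problem, domain, years=("2024", "2025")):
    """Compose a literature-search query from problem, domain, and year hints."""
    return " ".join([problem, domain, *years])

# e.g. build_query("policy gradient collapse", "GRPO")
```

Appending recent years helps surface papers and postmortems rather than generic tutorials.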

What to look for:

  • Documented failure modes
  • Fundamental design flaws
  • Known workarounds
  • Recent papers addressing the issue

Step 1.2: Code Analysis (Task-Planner-Analyzer)

Task(
    subagent_type="task-planner-analyzer",
    prompt=f"""
Analyze {problem_description} in this codebase.

**Context from web research:**
{web_search_findings}

**Files to examine:**
{list_of_relevant_files}

**Your tasks:**
1. Examine codebase structure
2. Identify design flaws matching literature
3. Check for known anti-patterns
4. Create prioritized TODO list with:
   - File paths and line numbers
   - Root causes vs symptoms
   - Risk assessment
   - Dependencies and constraints
    """
)

Step 1.3: Start Safe Fixes (Optional)

# If you already know some safe fixes (e.g., config changes), start them
# while analysis is running
python scripts/train.py --config fixed_config.yaml > logs/new_run.log 2>&1 &

Phase 2: Root Cause Triangulation

Cross-Reference Literature ↔ Code

Create a mapping table:

Literature Finding   Code Location   Match?   Impact     Priority
Token-level issue    line 316        ✅ YES   HIGH       1
Entropy collapse     line 888        ✅ YES   CRITICAL   1
Conflicting grads    multiple        ✅ YES   MEDIUM     2
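One way to keep this mapping machine-readable, so the fix loop in Phase 3 can consume it directly, is a small record type. This is a hedged sketch; the `Finding` class and field names are illustrative, not part of the skill:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    """One row of the literature-to-code mapping table."""
    literature: str   # failure mode documented in a paper/blog
    location: str     # file/line in our codebase
    matches: bool     # does the code actually exhibit the documented flaw?
    impact: str       # CRITICAL / HIGH / MEDIUM / LOW
    priority: int     # fix order

table = [
    Finding("Token-level issue", "line 316", True, "HIGH", 1),
    Finding("Entropy collapse", "line 888", True, "CRITICAL", 1),
    Finding("Conflicting grads", "multiple", True, "MEDIUM", 2),
]

# Only confirmed matches get scheduled for fixes, highest priority first
confirmed = sorted((f for f in table if f.matches), key=lambda f: f.priority)
```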

Prioritize by Impact

  1. CRITICAL: Collapse trigger (direct cause of observed failure)
  2. HIGH: Fundamental flaw (will cause problems at scale)
  3. MEDIUM: Optimization (improves stability but not essential)
  4. LOW: Cosmetic (code quality, not behavior)

Phase 3: Iterative Fix-and-Verify

Fix Loop

FOR each priority (CRITICAL → HIGH → MEDIUM):
    1. Modular-Code-Architect: Apply fix
    2. Code-Reviewer: Verify no side effects
    3. Run tests (if applicable)
    4. Monitor metrics for early warnings
    5. If problem recurs → back to Task-Planner

Apply Fixes with Modular-Code-Architect

Task(
    subagent_type="modular-code-architect",
    prompt=f"""
Implement fix for {root_cause} based on this analysis:
{analysis_from_phase2}

**Constraints:**
{list_of_constraints}

**Verification criteria:**
{how_to_verify_fix_worked}

Follow modular design: minimal changes, plug-and-play.
    """
)

Verify with Code-Reviewer

Task(
    subagent_type="code-reviewer",
    prompt="""
Review recent changes for:
1. Critical issues (logic errors, side effects)
2. Consistency with architecture constraints
3. Whether the fix actually addresses the root cause

Use ultrathink level.
    """
)

Verification Criteria

  • No collapse for 100+ steps (or 10x previous collapse point)
  • Metrics stay within healthy ranges
  • No new issues introduced (regression tests pass)
  • Edge cases handled
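These criteria can be encoded as a single predicate. A hedged sketch — the function name, thresholds, and metric schema are illustrative assumptions:

```python
def fix_verified(steps_survived, prev_collapse_step, metrics, healthy_ranges,
                 regressions_passed):
    """True when all verification criteria above hold."""
    # No collapse for 100+ steps, or 10x the previous collapse point
    no_collapse = steps_survived >= max(100, 10 * prev_collapse_step)
    # Every monitored metric stays within its healthy range
    in_range = all(lo <= metrics[k] <= hi
                   for k, (lo, hi) in healthy_ranges.items())
    return no_collapse and in_range and regressions_passed
```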

Phase 4: Documentation

Create Analysis Document

# File: logs/{problem}_root_cause_analysis.md

## Problem Summary
[What happened]

## Root Causes (Ranked)
1. **RC1**: [Description]
   - Evidence: [Literature + Code]
   - Fix: [What was applied]
   - Verification: [How we know it worked]

## Fixes Applied
[Detailed changelog]

## Verification Results
[Metrics before/after]

## Lessons Learned
[What to watch for next time]

Update Project Memory

Add to MEMORY.md or similar:

  • New failure modes discovered
  • Effective fixes
  • Ineffective approaches (to avoid repeating)
  • Monitoring metrics to add

🔍 Monitoring & Debugging

Key Metrics to Watch

# For ML training issues:
metrics_to_monitor = [
    "loss",              # Should decrease steadily
    "reward",            # Should be stable or improve
    "entropy",           # High = gibberish, Low = mode collapse
    "gradient_norm",     # Should be bounded
    "v_norm",            # For LoRA: should not hit clamp boundary
    "kl_loss",           # For RL: should be non-zero when active
]

Early Warning Signs

# Add to training code (entropy, all_advantages_zero, v_norm, etc. are
# placeholders for values tracked in your own loop):
import math

if entropy > 0.9 * math.log(vocab_size):
    logger.warning("Entropy near maximum - gibberish generation likely")

if all_advantages_zero and mean_reward > 0.5:
    logger.warning("Perfect accuracy but zero variance - KL-only training")

if v_norm >= max_norm * 0.95:
    logger.warning("V-vector at clamp boundary - may be fighting constraint")

📚 Example: GRPO Collapse

Phase 1: Evidence Gathering

WebSearch: Found 5 papers documenting GRPO instability

  • Token-level importance weight fails (DAPO)
  • Catastrophic model collapse (GSPO)
  • Entropy collapse & gibberish (OpenReview)

Task-Planner: Identified 5 root causes in code

  • KL-only collapse trigger (CRITICAL)
  • Token-level mismatch (HIGH)
  • Task overfitting (HIGH)

Phase 2: Triangulation

Finding         Both Sources?   Priority
KL-only steps   ✅ Yes          CRITICAL
Token-level     ✅ Yes          HIGH
Task overfit    Code only       HIGH

Phase 3: Fix & Verify

  1. Fix 1: Advantage-aware KL gating
    • Applied, verified, no collapse at step 20
  2. Fix 2: Length-normalized log-probs
    • Importance ratios stable [0.5, 2.0]
  3. Fix 3: Reduce steps_per_task 20→3
    • New tasks every 3 steps, stable 200+ steps

Phase 4: Documentation

  • ✅ Root cause analysis written
  • ✅ MEMORY.md updated
  • ✅ Workflow recipe created

⚠️ Common Pitfalls

1. Fixing Symptoms Instead of Root Causes

❌ Bad: "Gibberish appeared, let's increase temperature"
✅ Good: "Gibberish = entropy collapse. What causes that? KL-only signal."

2. Serial Execution (Wasting Time)

❌ Bad: WebSearch → wait → Analyze → wait → Fix → wait
✅ Good: WebSearch || Analyze || Start Safe Fixes → Integrate

3. Ignoring Literature

❌ Bad: "This is unique, no point searching"
✅ Good: "Let me check if anyone has seen this before"

4. Not Documenting Failures

❌ Bad: "That didn't work, let's try something else"
✅ Good: "Failed BECAUSE X, documented for future reference"

🚀 Quick Start Template

# 1. Start parallel evidence gathering
WebSearch("problem_description 2024 2025")
Task(subagent_type="task-planner-analyzer", prompt="Analyze {problem}...")

# 2. Start monitoring while waiting
tail -f logs/training.log &

# 3. Apply fixes iteratively
Task(subagent_type="modular-code-architect", prompt="Fix {root_cause}...")
Task(subagent_type="code-reviewer", prompt="Review changes...")

# 4. Document everything
Write("logs/root_cause_analysis.md", content="...")
Edit("MEMORY.md", add="New learnings...")

🎓 Success Criteria

Process Metrics

  • Time to root cause: < 2 hours (with parallel execution)
  • Fix iterations: < 3 (if root cause correct)
  • Regression rate: < 10% (good code review)

Outcome Metrics

  • Problem resolved: Yes/No
  • Stability duration: Steps until next issue
  • Knowledge captured: Documentation complete

📖 Related Skills

  • iterative-code-review: Use after fixes applied for quality verification
  • code-reviewer: Standalone code quality checks
  • debugger: When tests fail during verification phase
  • task-planner-analyzer: Can be used standalone for planning

Remember: This is a flexible workflow, not a rigid process. Adapt to your specific problem while maintaining the core principles: evidence-based, parallel execution, root-cause focused, and well-documented.
