Experiment Pipeline
A structured 4-stage framework for executing research experiments from initial implementation through ablation study, with attempt budgets and gate conditions that prevent wasted effort. This follows the Experiment Tree Search design from the EvoScientist paper, where the engineer agent iteratively generates executable code, runs experiments, and records structured execution results at each stage.
When to Use This Skill
- User has a planned experiment and needs to organize the execution workflow
- User wants to systematically validate a novel method against baselines
- User asks about experiment stages, attempt budgets, or when to move on
- User needs to reproduce baseline results before testing their method
- User mentions "experiment pipeline", "baseline first", "ablation study", "stage budget", "experiment execution"
The Pipeline Mindset
Experiments fail for two reasons: wrong order and no stopping criteria. Most researchers jump straight to testing their novel method without verifying their baseline setup, then wonder why results don't make sense. Others spend weeks tuning hyperparameters without a budget, hoping the next run will work.
The 4-stage pipeline solves both problems. It enforces a strict order (each stage validates assumptions the next stage depends on) and assigns attempt budgets (forcing systematic thinking over brute-force iteration).
Before Starting: Load Prior Knowledge
If coming from idea-tournament, your research proposal (Phase 4) provides the experiment plan — datasets, baselines, metrics, and ablation design — that maps directly to Stages 1-4 below.
Before entering the pipeline, load Experimentation Memory (M_E) from prior cycles:
- Refer to the evo-memory skill → Read M_E at /memory/experiment-memory.md
- Select the top-1 entry (k_E=1) most relevant to the current experiment domain by comparing each entry's Context and Category against the current problem
- The selected strategy informs hyperparameter ranges (Stage 2), debugging approaches (Stages 1-3), and training configurations across all stages
- If M_E doesn't exist yet (first cycle), skip this step and proceed — your results will seed M_E via ESE after pipeline completion
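The top-1 selection above can be sketched as follows. This is only an illustration: the dictionary layout of a memory entry, the `relevance` and `select_top_entry` helpers, and the use of token overlap (Jaccard similarity) as the relevance measure are all assumptions, not the format or scoring that evo-memory actually defines. Only the Context and Category field names come from the text.

```python
import re
from typing import Dict, List, Optional


def _tokens(text: str) -> set:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def relevance(entry: Dict[str, str], problem: str) -> float:
    """Jaccard overlap between an entry's Context + Category and the problem.

    Assumption: token overlap stands in for whatever comparison the
    evo-memory skill really performs.
    """
    entry_tokens = _tokens(entry.get("Context", "") + " " + entry.get("Category", ""))
    problem_tokens = _tokens(problem)
    union = entry_tokens | problem_tokens
    return len(entry_tokens & problem_tokens) / len(union) if union else 0.0


def select_top_entry(entries: List[Dict[str, str]], problem: str) -> Optional[Dict[str, str]]:
    """Return the single most relevant entry (k_E=1), or None on the first cycle."""
    if not entries:
        return None  # M_E does not exist yet: skip and proceed
    return max(entries, key=lambda entry: relevance(entry, problem))
```

Returning `None` for an empty memory mirrors the first-cycle case, where the step is skipped.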
4-Stage Pipeline Overview
Each stage follows a generate → execute → record → diagnose → revise loop:
| Stage | Goal | Budget (N_E^s) | Gate Condition |
|---|---|---|---|
| 1. Initial Implementation | Get baseline code running and reproduce known results | ≤20 attempts | Metrics within 2% of reported values (or within reported variance) |
| 2. Hyperparameter Tuning | Optimize config for your setup | ≤12 attempts | Stable config, variance < 5% across 3 runs |
| 3. Proposed Method | Implement & validate novel method | ≤12 attempts | Outperforms tuned baseline on primary metric, consistent across 3 runs |
| 4. Ablation Study | Prove each component's contribution | ≤18 attempts | All claims evidenced with controlled experiments |
Each stage saves artifacts to /experiments/stageN_name/.
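As a rough sketch, the table above can be held as data so that budgets and artifact paths are looked up rather than retyped. The `Stage` class and `artifact_dir` helper are hypothetical names introduced for illustration; the budgets, gate wording, and directory suffixes are taken from this document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    number: int
    name: str    # directory suffix used under /experiments/
    budget: int  # attempt budget N_E^s
    gate: str    # gate condition, as worded in the overview table


PIPELINE = (
    Stage(1, "baseline", 20, "Metrics within 2% of reported values (or within reported variance)"),
    Stage(2, "tuning", 12, "Stable config, variance < 5% across 3 runs"),
    Stage(3, "method", 12, "Outperforms tuned baseline on primary metric, consistent across 3 runs"),
    Stage(4, "ablation", 18, "All claims evidenced with controlled experiments"),
)


def artifact_dir(stage: Stage) -> str:
    """Directory where a stage saves its artifacts."""
    return f"/experiments/stage{stage.number}_{stage.name}/"


# Upper bound on attempts across the whole pipeline: 20 + 12 + 12 + 18.
total_budget = sum(stage.budget for stage in PIPELINE)
```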
The Stage Loop
Within every stage, repeat this cycle for each attempt:
- Generate: Form a hypothesis or plan for this attempt. What specifically will you try? What do you expect to happen?
- Execute: Run the experiment. Record exact configuration, code changes, and runtime.
- Record: Log results immediately using the stage log template. Include both metrics and observations.
- Diagnose: Compare results to expectations. If they match, assess the gate condition. If they don't, load experiment-craft for the 5-step diagnostic flow.
- Revise: Based on diagnosis, either advance to the next stage (gate met) or plan the next attempt (gate not met).
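The cycle can be sketched as a budgeted loop. This is a minimal illustration under stated assumptions: `run_stage` and its three callables (`generate`, `execute`, `gate_met`) are hypothetical, and the experiment-craft diagnosis on a failed attempt is not modeled here.

```python
from typing import Any, Callable, Dict, List, Tuple

Attempt = Dict[str, Any]


def run_stage(
    budget: int,
    generate: Callable[[List[Attempt]], str],
    execute: Callable[[str], Dict[str, float]],
    gate_met: Callable[[Dict[str, float]], bool],
) -> Tuple[bool, List[Attempt]]:
    """Run attempts until the gate is met or the budget is spent."""
    log: List[Attempt] = []
    for attempt in range(1, budget + 1):
        plan = generate(log)                   # Generate: plan informed by earlier attempts
        metrics = execute(plan)                # Execute: run the experiment
        entry = {"attempt": attempt, "plan": plan, "metrics": metrics}
        log.append(entry)                      # Record: log immediately
        entry["gate_met"] = gate_met(metrics)  # Diagnose: compare against the gate
        if entry["gate_met"]:
            return True, log                   # Revise: gate met, advance
    return False, log                          # Budget exhausted without meeting the gate


# Toy usage with made-up callables and integer scores.
passed, log = run_stage(
    budget=12,
    generate=lambda log: f"variant-{len(log) + 1}",
    execute=lambda plan: {"score": 70 + 5 * int(plan.split("-")[1])},
    gate_met=lambda metrics: metrics["score"] >= 85,
)
print(passed, len(log))  # prints: True 3
```

Passing the log into `generate` is one way to make each plan depend on what earlier attempts showed.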
Stage 1: Initial Implementation
Goal: Find or generate executable baseline code and verify it reproduces published results. This stage corresponds to the paper's "initial implementation" — the engineer agent searches for working code, runs it, and records structured execution results.
Why this matters: If you can't get the baseline running and reproducing known results, every subsequent comparison is meaningless. Initial implementation validates your data pipeline, evaluation code, training infrastructure, and understanding of prior work.
Budget: ≤20 attempts (N_E^1=20). Baselines can be tricky — missing details in papers, version mismatches, unreported preprocessing steps. 20 attempts gives enough room to debug without allowing infinite tinkering.
Gate: Primary metrics within 2% of reported values (or within the reported variance if provided).
Process:
- Find the original baseline code (official repo, re-implementations, or write from paper description)
- Get the code running in your environment — resolve dependencies, fix compatibility issues
- Match the exact training configuration from the paper (dataset splits, preprocessing, hyperparameters)
- Run and compare metrics. If off by >2%, diagnose the gap
- Common pitfalls: different random seeds, different data splits, unreported data augmentation, framework version differences
When to load experiment-craft: If attempts 1-5 all fail significantly (>10% gap), switch to the 5-step diagnostic flow to isolate the cause before burning more attempts.
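A possible check for the Stage 1 gate and the diagnostic trigger is sketched below. It assumes "within 2%" means a relative difference from the reported value; if the paper's metric is itself a percentage, an absolute difference in points may be the intended reading. The function names are hypothetical.

```python
from typing import Optional, Sequence


def relative_gap(reproduced: float, reported: float) -> float:
    """Relative distance of the reproduced metric from the reported one."""
    return abs(reproduced - reported) / abs(reported)


def stage1_gate(
    reproduced: float,
    reported: float,
    reported_std: Optional[float] = None,
    tolerance: float = 0.02,
) -> bool:
    """Gate: within 2% of the reported value, or within the reported variance."""
    if reported_std is not None and abs(reproduced - reported) <= reported_std:
        return True
    return relative_gap(reproduced, reported) <= tolerance


def needs_diagnosis(gaps: Sequence[float], first_n: int = 5, threshold: float = 0.10) -> bool:
    """True when the first five attempts all missed by more than 10%."""
    return len(gaps) >= first_n and all(gap > threshold for gap in gaps[:first_n])
```

For example, a reproduced accuracy of 75.0 against a reported 76.0 is a relative gap of about 1.3%, which passes the gate.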
Output: /experiments/stage1_baseline/ containing results, config, and verified baseline code.
See references/stage-protocols.md for detailed initial implementation checklists.
Stage 2: Hyperparameter Tuning
Goal: Find the optimal hyperparameter configuration for YOUR specific setup.
Why this matters: Published hyperparameters are tuned for the authors' setup. Your hardware, data version, framework version, or subtle implementation differences mean their config may not be optimal for you. Tuning now prevents confounding your novel method's results with suboptimal baselines.
Budget: ≤12 attempts. Hyperparameter tuning has diminishing returns. If 12 structured attempts don't find a stable config, the problem is likely deeper than hyperparameters.
Gate: Stable configuration found — variance < 5% across 3 independent runs with different random seeds.
Process:
- Identify the most sensitive hyperparameters (usually: learning rate, batch size, loss weights)
- Start with coarse search on the most sensitive parameter
- Narrow the range based on results, then move to the next parameter
- Validate final config with 3 independent runs
Priority order for tuning: Learning rate → batch size → loss weights → regularization → architecture-specific params. This order reflects typical sensitivity.
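One way to run the coarse-then-narrow search on a single parameter such as the learning rate is sketched here. The log-spaced grid and the `coarse_grid` and `narrow_range` helpers are illustrative assumptions, not something the pipeline prescribes; how candidate evaluations map onto the attempt budget is also left open.

```python
import math
from typing import List, Sequence, Tuple


def coarse_grid(low: float, high: float, points: int) -> List[float]:
    """Log-spaced candidates from low to high, both included."""
    if points < 2:
        raise ValueError("need at least two points")
    step = (math.log10(high) - math.log10(low)) / (points - 1)
    return [10 ** (math.log10(low) + i * step) for i in range(points)]


def narrow_range(candidates: Sequence[float], scores: Sequence[float]) -> Tuple[float, float]:
    """Bracket the best candidate with its neighbours (higher score is better)."""
    best = max(range(len(candidates)), key=lambda i: scores[i])
    low = candidates[max(best - 1, 0)]
    high = candidates[min(best + 1, len(candidates) - 1)]
    return low, high
```

A second call to `coarse_grid` on the narrowed range gives the finer sweep before moving to the next parameter in the priority order.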
When to load experiment-craft: If results are highly unstable (variance > 20%) across runs, training instability is the likely cause. Use the diagnostic flow.
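The stability gate and the instability trigger could be checked as below. The text does not say how "variance" is measured; this sketch assumes the coefficient of variation (sample standard deviation divided by the mean) of the primary metric across seeds, and the function names are hypothetical.

```python
import statistics
from typing import Sequence


def coefficient_of_variation(scores: Sequence[float]) -> float:
    """Sample standard deviation relative to the mean (needs at least 2 scores)."""
    return statistics.stdev(scores) / abs(statistics.mean(scores))


def stage2_gate(scores: Sequence[float], max_cv: float = 0.05, min_runs: int = 3) -> bool:
    """Gate: spread below 5% across at least 3 runs with different seeds."""
    return len(scores) >= min_runs and coefficient_of_variation(scores) < max_cv


def looks_unstable(scores: Sequence[float], cv_threshold: float = 0.20) -> bool:
    """Trigger for the diagnostic flow: spread above 20% across runs."""
    return len(scores) >= 2 and coefficient_of_variation(scores) > cv_threshold
```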
Output: /experiments/stage2_tuning/ containing tuning logs, final config, and stability verification.
See references/attempt-budget-guide.md for budget rationale and adjustment rules.
Stage 3: Proposed Method
Goal: Implement and validate your novel method, demonstrating improvement over the tuned baseline.
Why this matters: This is the core contribution. But because you've verified the baseline (Stage 1) and optimized the config (Stage 2), any improvement you see is genuinely attributable to your method — not to a better-tuned setup or a broken baseline.
Budget: ≤12 attempts. Your method should work within a reasonable number of iterations if the underlying idea is sound. Excessive attempts suggest a fundamental problem, not a tuning issue.
Gate: Outperforms the tuned baseline on the primary metric. The improvement should be consistent across at least 3 runs.
Process:
- Implement the core method incrementally — don't add everything at once
- Test each component's integration with the baseline pipeline
- Run full training and compare against Stage 2 results
- If underperforming, isolate which component causes the gap
Integration strategy: Add your method's components one at a time to the working baseline. Each added component should stay within 20% of the baseline's performance — if a single component causes a >20% regression, isolate and debug it before proceeding. Never integrate the full method in one shot.
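The one-at-a-time integration rule and the Stage 3 gate can be sketched as follows. The `evaluate` callable, the assumption that a higher metric is better, and the reading of "consistent" as "every run beats the tuned baseline" are illustrative choices rather than definitions from the pipeline.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def regression_fraction(baseline: float, candidate: float) -> float:
    """Fractional drop below the baseline (assumes higher is better)."""
    return (baseline - candidate) / abs(baseline)


def integrate_incrementally(
    baseline_score: float,
    components: Sequence[str],
    evaluate: Callable[[List[str]], float],
    max_regression: float = 0.20,
) -> Tuple[List[str], Optional[str]]:
    """Add components one by one; stop at the first >20% regression."""
    active: List[str] = []
    for name in components:
        score = evaluate(active + [name])
        if regression_fraction(baseline_score, score) > max_regression:
            return active, name  # isolate and debug this component first
        active.append(name)
    return active, None


def stage3_gate(method_scores: Sequence[float], baseline_score: float, min_runs: int = 3) -> bool:
    """Gate: every one of at least 3 runs beats the tuned baseline."""
    return len(method_scores) >= min_runs and all(s > baseline_score for s in method_scores)
```

`integrate_incrementally` returns the components that were integrated cleanly and, if one failed, the name of the component to debug.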
When to load experiment-craft: When your method underperforms the baseline despite correct implementation. The 5-step diagnostic flow will help distinguish between implementation bugs and fundamental issues.
Critical decision — failure classification: If the method underperforms the baseline after exhausting the attempt budget, hand off to evo-memory for IVE (Idea Validation Evolution) — this is evo-memory's job, not this skill's. IVE triggers under two conditions:
- No executable code: Cannot find working code within the budget at any stage.
- Worse than baseline: Experiments complete but the method underperforms.
The evo-memory skill will classify the failure as:
- Implementation failure: Bugs or missing tricks → retryable in a future cycle.
- Fundamental direction failure: Core idea doesn't work → update ideation memory to prevent retrying.
Output: /experiments/stage3_method/ containing method code, results, comparison with baseline.
Stage 4: Ablation Study
Goal: Prove that each component of your method contributes meaningfully to the final result.
Why this matters: Reviewers will ask "is component X really necessary?" for every part of your method. Without ablation, you can't answer. More importantly, ablation helps YOU understand why your method works — sometimes components you thought were important aren't, and vice versa.
Budget: ≤18 attempts. Ablation requires multiple controlled experiments — one per component being ablated, plus interaction effects. 18 attempts covers a method with 4-5 components.
Gate: Every claimed contribution is supported by a controlled experiment showing its effect.
Process:
- List all components of your method that you claim contribute to performance
- Design ablation experiments: remove ONE component at a time, measure the impact
- For components that interact, test interaction effects
- Verify that no single component's removal improves results (an improvement would invalidate the claim that the component contributes)
Three ablation designs:
- Leave-one-out: Remove each component individually. Shows each component's marginal contribution.
- Additive: Start from baseline, add components one at a time. Shows incremental gains.
- Substitution: Replace your component with an alternative approach. Shows your component is better than alternatives, not just better than nothing.
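The first two designs are mechanical enough to generate from a component list, as in this sketch. The helper names and the component labels are placeholders; substitution is omitted because it depends on which alternatives exist for each component.

```python
from typing import List, Sequence


def leave_one_out(components: Sequence[str]) -> List[List[str]]:
    """One configuration per component, with that component removed."""
    return [[c for c in components if c != removed] for removed in components]


def additive(components: Sequence[str]) -> List[List[str]]:
    """Configurations growing from the bare baseline to the full method."""
    return [list(components[:i]) for i in range(len(components) + 1)]


# Placeholder component names for a four-component method.
parts = ["A", "B", "C", "D"]
loo_configs = leave_one_out(parts)  # 4 configurations
add_configs = additive(parts)       # 5 configurations, from [] to the full method
```

Counting configurations this way before running anything shows how much of the Stage 4 budget a given design would use.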
When to load experiment-craft: If ablation results contradict your hypothesis (removing a component improves results), use the diagnostic flow to understand why.
Output: /experiments/stage4_ablation/ containing ablation results table, per-component analysis.
See references/stage-protocols.md for detailed ablation design patterns.
Integrating experiment-craft for Diagnosis
When a stage attempt fails, refer to the experiment-craft skill for structured diagnosis:
- Follow the experiment-craft diagnostic protocol
- Run the 5-step diagnostic flow (observe, hypothesize, test, conclude, prescribe)
- The diagnosis does NOT consume your stage budget — it's a free analysis step
- The diagnosis output (a prescription) becomes the plan for your next attempt
- Return to the pipeline and record the diagnosis in your trajectory log
Trigger points: After any failed attempt in any stage. Especially important:
- Stage 1: After 5+ failed attempts (>10% gap from reported metrics)
- Stage 2: When variance > 20% across runs
- Stage 3: When method consistently underperforms baseline
- Stage 4: When ablation results contradict your hypothesis
Code Trajectory Logging
Every attempt across all stages should be logged in a structured format that captures not just WHAT you did but WHY and WHAT YOU LEARNED. These logs feed into evo-memory's Experiment Strategy Evolution (ESE) mechanism.
For each attempt, record:
- Attempt number and stage
- Hypothesis: What you expected and why
- Code changes: Summary of what was modified (not a full diff, but the key changes)
- Result: Metrics and observations
- Analysis: Whether the hypothesis was confirmed or refuted, and what you learned
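A record holding the fields listed above might look like the following. The `AttemptLog` class, its field names, and the JSON-lines serialization are assumptions made for illustration; the authoritative format is the one in references/code-trajectory-logging.md.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict


@dataclass
class AttemptLog:
    stage: int
    attempt: int
    hypothesis: str            # what you expected and why
    code_changes: str          # summary of the key modifications
    metrics: Dict[str, float]  # result: metrics
    observations: str          # result: qualitative notes
    analysis: str              # what you learned
    hypothesis_confirmed: bool


def to_json_line(entry: AttemptLog) -> str:
    """Serialize one attempt as a single JSON line (hypothetical storage format)."""
    return json.dumps(asdict(entry), sort_keys=True)


# Example entry with made-up values.
example = AttemptLog(
    stage=2,
    attempt=3,
    hypothesis="Lower learning rate reduces run-to-run spread",
    code_changes="Changed the learning rate in the training config",
    metrics={"accuracy": 0.81},
    observations="Loss curve smoother than in attempt 2",
    analysis="Hypothesis confirmed; keep the lower rate",
    hypothesis_confirmed=True,
)
line = to_json_line(example)
```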
See references/code-trajectory-logging.md for the full logging format and how logs feed into evo-memory.
Counterintuitive Pipeline Rules
Prioritize these rules during experiment execution:
- Initial implementation is not wasted time: It validates your entire infrastructure: data pipeline, evaluation code, training setup. Skipping it means every subsequent result is built on unverified ground. Most "method doesn't work" bugs are actually baseline setup bugs.
- Budget limits prevent rabbit holes: Fixed attempt budgets force you to think systematically. When you know you have 12 attempts, you design each one to maximize information. Without limits, attempt #47 is rarely more informative than attempt #12; it's just more desperate.
- Stage order is non-negotiable: Each stage validates assumptions the next depends on. Skipping Stage 1 means Stage 3 results could be wrong due to a broken baseline. Skipping Stage 2 means Stage 3 improvements might just be better hyperparameters, not a better method. There are no shortcuts.
- Ablation is not optional cleanup: It's the primary evidence that your method works for the right reasons. A method that outperforms the baseline but has no ablation is a method you don't understand. Reviewers know this.
- Failed attempts are data, not waste: Each failed attempt narrows the search space and reveals something about the problem. Log failures carefully: they feed into evo-memory and prevent future researchers from repeating the same mistakes.
- Early termination is a feature: Stopping before budget exhaustion is smart, not lazy. If the gate is clearly unachievable after systematic attempts, escalate to evo-memory IVE rather than burning remaining budget on increasingly random variations.
Handoff to Paper Writing
When all four stages are complete, pass these artifacts to paper-writing:
| Artifact | Source Stage | Used By |
|---|---|---|
| Initial implementation results | Stage 1 | Comparison tables, setup verification |
| Optimal hyperparameter config | Stage 2 | Reproducibility section |
| Method vs baseline comparison | Stage 3 | Main results table |
| Ablation study results | Stage 4 | Ablation table, contribution claims |
| Code trajectory logs (all stages) | All stages | Method section details, supplementary |
| Implementation details and tricks | Stages 1-3 | Method section, reproducibility (captured in trajectory log Analysis fields and [Reusable] tags) |
Also pass results to evo-memory for evolution updates:
- If any stage exhausts budget without executable code, OR Stage 3 method underperforms the tuned baseline → trigger IVE (Idea Validation Evolution)
- If all stages succeeded → trigger ESE (Experiment Strategy Evolution)
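The two trigger rules can be written as a small decision function. `StageOutcome` and `evolution_trigger` are hypothetical names; the function returns `None` for outcomes the rules above do not cover, such as a stage whose gate was missed even though code ran.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class StageOutcome:
    gate_met: bool
    budget_exhausted: bool
    has_executable_code: bool


def evolution_trigger(outcomes: Sequence[StageOutcome], method_beats_baseline: bool) -> Optional[str]:
    """Pick which evo-memory update to run after the pipeline."""
    no_code = any(o.budget_exhausted and not o.has_executable_code for o in outcomes)
    if no_code or not method_beats_baseline:
        return "IVE"  # Idea Validation Evolution
    if all(o.gate_met for o in outcomes):
        return "ESE"  # Experiment Strategy Evolution
    return None       # not covered by the two rules above
```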
Skill Integration
Before Starting (load memory)
Refer to the evo-memory skill to read Experimentation Memory:
→ Read M_E at /memory/experiment-memory.md
On Failure (within any stage)
Refer to the experiment-craft skill for 5-step diagnostic: → Run diagnosis → Return to pipeline
On IVE Trigger (budget exhausted or method underperforms)
Refer to the evo-memory skill for failure classification: → Run IVE protocol
On Pipeline Success (all 4 stages complete)
Refer to the evo-memory skill for strategy extraction: → Run ESE protocol with trajectory logs
Handoff to Paper Writing
Refer to the paper-writing skill: → Pass all stage artifacts
Reference Navigation
| Topic | Reference File | When to Use |
|---|---|---|
| Per-stage checklists and patterns | stage-protocols.md | Detailed guidance for each stage |
| Budget rationale and adjustment | attempt-budget-guide.md | When budgets feel too tight or too loose |
| Code trajectory logging format | code-trajectory-logging.md | Recording attempts for evo-memory |
| Stage log template | stage-log-template.md | Logging a single stage's progress |
| Pipeline tracker template | pipeline-tracker-template.md | Tracking the full 4-stage pipeline |