memory-bench-designer

Designs a custom agent-memory benchmark for the user's specific use case. Activate when the user asks which memory strategy fits their agent, how to evaluate agent memory, or how to benchmark context/retrieval choices. Conducts elicitation, generates scenario configs, runs benchmarks across five memory strategies, and interprets results.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "memory-bench-designer" with this command: npx skills add tatsuko-tsukimi/memory-bench-designer

Memory Bench Designer

An agent memory benchmark designer. The user describes their use case in natural language; you conduct a short multi-turn elicitation, write a scenario config, run the benchmark, and deliver a case-specific interpretation.

The central premise: no single memory strategy wins across use cases. Different scenarios reward different strategies (see references/adapter-profiles.md for empirical evidence). Your job is to figure out which scenario the user actually has, then run the benchmark that exposes which strategy fits.

Four-stage flow

Stage 1 Understanding — conversation with the user (3–5 turns) Stage 2 Ideation — generate scenario.yaml + weights.yaml Stage 3 Rollout — invoke the runner CLI Stage 4 Judgment — interpret the results.md for this specific use case

After Stage 4, always offer: "Want to refine the scenario and re-run?" This is the AdaTest-style inner loop.

Stage 1 — Understanding

Goal: extract enough about the user's use case to fill in the scenario DSL.

Turn 1 — examples, not criteria. Ask:

"Give me 1–2 concrete examples of things your agent's memory should keep and retrieve later, and 1–2 examples of things it should discard or at least de-prioritize. Don't worry about defining the rules — just the examples."

Rationale: EvalGen's "criteria drift" finding. Users can't define criteria upfront; they can recognize good/bad examples.

Turn 2 — session shape. Ask two short questions:

"How many conversations/sessions does a typical user have with your agent before memory matters? And how long is one session — roughly how many turns?"

If the user is vague, offer defaults: 10 sessions × 40 steps. These are runner defaults.

Turn 3 — taxonomy check. Show the 4-family × 8-dimension matrix from references/taxonomy.md. Ask which 2–3 dimensions matter most for this use case. Do not force the user to rank all 8 — cognitive load is too high. You are looking for which families to weight.

Turn 4 (optional) — archetype mix. If the use case is ambiguous, show 3 candidate archetype mixes (see references/use-case-patterns.md), let the user pick or modify. Never show more than 3 candidates at once (AdaTest's 3–7 cap, we lean to 3).

By the end of Stage 1 you should know:

  • Archetype mix: fractions for core / evolving / episode / noise
  • Context evolution: random / narrow-band-drift / stable / mode-shifts
  • Themes: 3–6 short lists of vocabulary tokens (ask the user for domain words if they're non-obvious)
  • Which families they care about (for the Judgment stage)

If anything is ambiguous, default to the closest pattern in references/use-case-patterns.md and tell the user which pattern you chose and why.

Stage 2 — Ideation

Write two files into the user's current working directory:

  • scenario-<name>.yaml — the scenario config
  • weights-<name>.yaml — family weights for Judgment (optional)

Use templates/scenario.yaml.tmpl and templates/weights.yaml.tmpl as starting points. Substitute the values from Stage 1.

Show the user the generated scenario.yaml and ask: "Look right, or tweak anything before we run?" Keep this confirmation to one round — don't re-litigate Stage 1.

Stage 3 — Rollout

Invoke the runner via Bash:

memory-bench run --scenario scenario-<name>.yaml --out results/<name>/ --embedding --composite

The --embedding flag enables the sentence-transformers adapter (first run downloads ~90 MB model). The --composite flag enables the weighted multi-signal adapter. Both are recommended — without them you only get three cheap baselines and the leaderboard is thin.

The runner writes results/<name>/results.md and results/<name>/results.json. Read the markdown file.

Expected runtime: 1–5 minutes. If it's slower, sentence-transformers is doing a cold model download — this is normal on first run.

Stage 4 — Judgment

Read results.md. Do not just paste it back to the user. Write a case-specific interpretation with three sections:

1. Capability profile. For each family the user said matters in Stage 1, state the winner, its score, and whether that score is high or low relative to the other scenarios in references/adapter-profiles.md. A winner with score 0.4 means "best available but still weak" — say that out loud.

2. Tradeoffs observed. Point to 1–2 dimensions where a non-winner adapter came close, and what that means. Example: "Composite edges out Embedding in Update Coherence by 5%, but loses Personalization by 10%. For your use case, you care more about X, so Embedding is the safer default."

3. Recommended starting strategy. One sentence: "Start with <adapter> because <why>. If you see <symptom> in production, try <alternative>." Be specific.

After these three sections, ask: "Want to refine the scenario and re-run?" Common refinements:

  • Bump up an archetype fraction that felt underrepresented
  • Switch context evolution type
  • Add or remove themes
  • Adjust weights.yaml to shift family priorities

Key UX rules (full detail in references/elicitation-flow.md)

  • Grade before criteria — ask for examples before asking for rules
  • Cap at 7 — never show more than 7 candidates/options/dimensions at once; prefer 3
  • Ranking always visible — when you show candidates, show why they're ranked in that order
  • Iterate every 5–8 interactions — surface pattern-detected summaries, don't let the conversation wander
  • Organization optional — don't force a taxonomy on the user upfront; let structure emerge from the examples they give

References

  • references/taxonomy.md — the 4×8 matrix shown in Turn 3
  • references/adapter-profiles.md — empirical profile of each strategy (what it wins, what it loses)
  • references/use-case-patterns.md — canonical patterns (game / companion / RAG / coding)
  • references/elicitation-flow.md — the UX rules above, with rationale
  • examples/game-ai-walkthrough.md — a full game-AI scenario elicitation and result
  • examples/npc-cognition-walkthrough.md — long-running NPC with stable persona
  • examples/coding-agent-walkthrough.md — code/PR/design memory with frequent supersedes
  • templates/scenario.yaml.tmpl — the scenario DSL skeleton
  • templates/weights.yaml.tmpl — family weights skeleton

What this skill does not do

  • It does not call any LLM judges — all metrics are mechanical
  • It does not evaluate actual agent responses — it evaluates the retrieval layer feeding them
  • It does not benchmark external memory services (Mem0, Zep, Letta) — it benchmarks algorithmic primitives (Recency, BM25, ACT-R, Embedding, Composite)
  • It does not replace production telemetry — it de-risks the initial strategy choice before you build

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

Canonry Setup

Agent-first AEO operating platform.

Registry SourceRecently Updated
4151arberx
Automation

Pilot Service Agents Entertainment

Games, manga/anime, trivia, and fandom APIs — PokeAPI, Jikan, CheapShark, misc. Use this skill when: 1. Pokémon / PokeAPI lookups 2. Anime or manga metadata...

Registry SourceRecently Updated
Automation

Pilot Service Agents Economics

Macroeconomic indicators — IMF DataMapper, World Bank, Eurostat SDMX, Coinbase reference prices. Use this skill when: 1. Country-level GDP, inflation, or une...

Registry SourceRecently Updated
Automation

Pilot Service Agents Flights

Aircraft tracking and aviation weather — ADS-B feeds (ICAO + bbox), airport directory, METAR/TAF/SIGMET. Use this skill when: 1. Live aircraft positions by I...

Registry SourceRecently Updated