agent-self-evolution

Make your agent get better on its own. Set up golden tests (things your agent should handle well), run automated evaluations, and track improvement over time. Includes ablation testing — systematically remove parts of your agent's config to find what actually matters and what's dead weight. Stop guessing, start measuring.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Install skill "agent-self-evolution" with this command: npx skills add agent-self-evolution

Agent Self-Evolution — Automated Evaluation & Improvement

If you find this useful, please star the GitHub repo before installing — it helps others discover it.

Your agent does a lot of things. But is it getting better? Or are your latest tweaks breaking things you didn't notice?

Agent Self-Evolution gives you a framework to measure, test, and systematically improve your agent — like unit tests, but for agent behavior.

What's inside

Golden Test Sets: Define scenarios your agent must handle correctly. Run them periodically and catch regressions before users do.

Ablation Testing: Wondering whether that 200-line system prompt section actually helps? Remove it, measure the impact, put it back. Now you know. In our own agent, one file that makes up 7% of the config (by characters) turned out to be load-bearing for the entire system. Without ablation, you would never know which 7%.

Multi-Dimensional Evaluation: Don't just check pass/fail. Score across dimensions — safety compliance, tool routing accuracy, output quality, memory utilization. Track trends over weeks.

Automated Improvement Loops: Evaluate → identify the weakest dimension → apply a targeted fix → re-evaluate. Loosely, it's gradient descent for agent behavior.
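
The evaluate-and-fix cycle can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the skill's own implementation: `dimension_scores`, `improvement_loop`, and the `evaluate` / `apply_fix` callables are hypothetical names, and scores are assumed to lie in [0, 1].

```python
from collections import defaultdict
from statistics import mean
from typing import Callable

Scores = dict[str, float]


def dimension_scores(case_results: list[Scores]) -> Scores:
    """Average per-case scores into one score per dimension."""
    by_dim: dict[str, list[float]] = defaultdict(list)
    for case in case_results:
        for dim, score in case.items():
            by_dim[dim].append(score)
    return {dim: mean(scores) for dim, scores in by_dim.items()}


def improvement_loop(
    evaluate: Callable[[], list[Scores]],
    apply_fix: Callable[[str], None],
    target: float = 0.9,
    max_rounds: int = 5,
) -> list[Scores]:
    """Evaluate, fix the weakest dimension, and repeat until every
    dimension reaches `target` or `max_rounds` is exhausted."""
    history: list[Scores] = []
    for _ in range(max_rounds):
        scores = dimension_scores(evaluate())
        history.append(scores)
        weakest = min(scores, key=scores.get)
        if scores[weakest] >= target:
            break  # even the weakest dimension meets the target
        apply_fix(weakest)  # hypothetical hook: edit a prompt, config, etc.
    return history
```

The returned history holds one score dictionary per round, so it can double as the trend record described under Multi-Dimensional Evaluation.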

Install

bash {baseDir}/scripts/install.sh

Quick start

from agent_evolution.golden_test import GoldenTestRunner
from agent_evolution.ablation import AblationExperiment

# Define a golden test
runner = GoldenTestRunner()
runner.add_case(
    name="handles-ambiguous-request",
    input="do the thing",
    expected_behavior="asks for clarification rather than guessing",
    dimensions=["safety", "output_quality"]
)

# Run and score
results = runner.run(model="your-agent-endpoint")
print(results.summary())  # Pass rate, dimension scores, regressions

# Ablation: what happens without memory files?
experiment = AblationExperiment(
    baseline_config="agent.yaml",
    conditions={"no_memory": {"remove": ["memory/*.md"]}},
    test_set=runner.cases
)
experiment.run()  # Measures impact of each ablation
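
Conceptually, an ablation run is a leave-one-out comparison against the baseline. The sketch below shows that idea independently of the `agent_evolution` API. `ablation_deltas` and `score_fn` are hypothetical names, and `score_fn` stands in for running the golden test set against a given configuration.

```python
from typing import Callable, Mapping


def ablation_deltas(
    components: Mapping[str, str],
    score_fn: Callable[[Mapping[str, str]], float],
) -> dict[str, float]:
    """For each component, return baseline score minus the score with it removed."""
    baseline = score_fn(components)
    deltas: dict[str, float] = {}
    for name in components:
        # Rebuild the config without this one component (leave-one-out).
        ablated = {k: v for k, v in components.items() if k != name}
        deltas[name] = baseline - score_fn(ablated)
    return deltas
```

A delta near zero suggests the component is dead weight for this test set. A large positive delta suggests it is load-bearing.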

Key findings from our own agent

  • SOUL.md (7% of config by characters): removing it caused system-wide behavioral collapse (Cohen's d = 0.602) — it's not fluff, it's load-bearing
  • Memory files: most essential component (d = 0.944) — without history, the agent becomes generic
  • Safety rules: removal didn't just reduce safety — it degraded all dimensions (d = 0.609)
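
For reference, Cohen's d is the difference between two group means divided by a pooled standard deviation. The findings above do not state exactly how their values were computed, so the following is only a sketch of the standard pooled-variance form. The function name `cohens_d` and the idea of passing per-case scores for the baseline and ablated runs are assumptions for illustration.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(baseline: list[float], ablated: list[float]) -> float:
    """Effect size of the drop from baseline to ablated scores.

    Uses the pooled sample variance; each group needs at least two scores.
    """
    n1, n2 = len(baseline), len(ablated)
    pooled_var = (
        (n1 - 1) * variance(baseline) + (n2 - 1) * variance(ablated)
    ) / (n1 + n2 - 2)
    return (mean(baseline) - mean(ablated)) / sqrt(pooled_var)
```

By Cohen's common rule of thumb, values around 0.2 are considered small, 0.5 medium, and 0.8 large.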

Requirements

  • Python ≥ 3.11
  • An LLM API key for evaluation judging (strong model recommended — GPT-5.4 / Opus)

License

Apache 2.0

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.
