autoagent

Automatically improve agent guidance through iterative testing and scoring. Use when you want to optimize prompts, AGENTS.md entries, or skill definitions using a Karpathy-style training loop with OpenClaw cron.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "autoagent" with this command: npx skills add PeterPCW/autoagent

Autoagent Skill

Optimize any agent guidance through automated testing and iterative improvement.

Quick Start

/autoagent

What It Does

  1. Setup Phase - Asks where your guidance lives and what it should do
  2. Creates Sandbox - Copies guidance to test folder with fixtures
  3. Runs Optimization Loop - Every 5 minutes via cron:
    • Analyzes current guidance
    • Proposes improvement
    • Tests with subagent
    • Scores result
    • Keeps or discards change
  4. Logs Everything - Check scores.md for history

Setup Phase (Every Invocation Starts Fresh)

Every invocation of /autoagent starts fresh with interactive setup questions.

Step 1: Ask Sandbox Location

Ask the user:

Where should I create the sandbox folder? Default: ../../autoagent-sandbox/ (resolves to /clawd/autoagent-sandbox/)

You can respond with:

  • Empty/default: Press enter to use ../../autoagent-sandbox/
  • Just a name: "news" creates ../../autoagent-news//clawd/autoagent-news/
  • Relative path: "agentDev/optimize" creates ../../agentDev/optimize//clawd/agentDev/optimize/
  • Absolute path: /some/other/path/optimize/ → exact path

Wait for their response (or empty for default).

Step 2: Discuss Success Criteria

Ask the user:

Let's define how we'll measure success. What does a "good" result look like for this task?

Follow up one at a time based on their response:

  • What specific outputs are expected?
  • What format should they be in?
  • What's the minimum viable quality?
  • Any edge cases to consider?

Once you have enough information, propose a draft scoring.md:

## Proposed Scoring Criteria

**Score Components:**
- [Component 1]: [X] points - [description]
- [Component 2]: [Y] points - [description]
- ...

**Total:** 100 points

**[Any additional notes]**

Wait for user approval or modifications.

Step 3: Ask About External Scripts/Tools

Ask the user:

Does the guidance rely on any scripts, tools, or external software?

  • If yes: Note each script/tool path and what functionality it provides
  • The autoagent should analyze these to recommend improvements

Step 4: Ask Cron Schedule

Ask the user:

Run optimization every 5 minutes (default), or different interval?

Step 5: Create Sandbox

After all questions answered, create the sandbox folder at the user-specified path:

sandbox/
├── guidance-under-test.md   # Copy of original guidance
├── current-guidance.md      # Same as guidance-under-test initially
├── fixtures/
│   └── test-cases.json      # {"cases": [{"input": "...", "expected": "..."}]}
├── scoring.md               # Scoring criteria document (user-approved)
├── scores.md                # Score history table
└── scripts/                  # (optional) Copy of referenced scripts/tools

Step 6: Set Up Cron

Use OpenClaw cron syntax to schedule the iteration agent:

  • Default: every 5 minutes (*/5 * * * *)
  • Command: invoke the iteration prompt with the sandbox path

Step 7: Confirm Start

Return confirmation message showing the resolved path:

"Optimization started at /clawd/autoagent-news/. I'll check back every 5 minutes. Monitor progress in scores.md."


Iteration Phase (Runs Every Cron Interval)

Each time the cron triggers, do the following:

Step 1: Analyze Current State

Read from the sandbox:

  • current-guidance.md - The guidance being optimized
  • scores.md - History of scores and changes
  • scoring.md - How to measure success
  • fixtures/test-cases.json - Test inputs (MUST read this to understand what the guidance is being tested against)

Review score history (last 10 runs or all available runs if fewer than 10 exist), identify patterns, note current score. When fewer than 10 runs exist, treat all available scores as the set for plateau detection.

Important: Load the test cases from fixtures/test-cases.json to understand what specific outputs/ behaviors are expected. The edit should address gaps revealed by test case failures or missing criteria.

Step 1b: Analyze External Scripts/Tools (If Applicable)

If the guidance references any scripts, tools, or external software:

  1. Locate each script/tool - Find the actual script files or binary locations
  2. Analyze the functionality - Read the code or documentation to understand what it does
  3. Identify improvement opportunities:
    • For open-source scripts: Can the script be modified to improve functionality?
    • For closed-source/compiled tools: Can wrapper behavior be improved? Can you recommend API/interface changes?
  4. Note findings in the iteration - If script improvements could help test scores, document them

Example outputs:

  • "Script X does Y but could do Z - recommend modification to add feature W"
  • "Tool A is closed-source, recommend changing prompt to work around limitation B"
  • "Script C has bug in function D - fix would improve test outcomes"

Step 2: Propose Edit

Generate ONE specific edit to the guidance that might improve the score.

Analyze Score History First:

  • Read scores.md to find the last 10 runs
  • Identify patterns: Which scoring criteria are consistently low?
  • Look for repeated failures - if the same criterion failed multiple times, that's your target
  • Check what changes were tried before (avoid repeating failed approaches)

Edit Selection Strategy (Priority Order):

  1. If scores exist: Target the lowest-scoring criteria from scoring.md
  2. If all scores high (90+): Add missing detail to any criteria marked as partial
  3. If only 1-2 runs: Assume baseline covered basics, add missing methodology
  4. Prioritize edits that affect multiple scoring criteria at once

The edit should:

  • Be specific and actionable (not vague like "improve clarity")
  • Address a weakness identified in scoring (target the lowest-scoring criteria)
  • Not be identical to recently tried changes (check scores.md for recent descriptions)
  • Include the exact text to add/remove/replace

Format:

## Proposed Edit

**Rationale:** Why this change might help

**Change:**

[Show exact diff or new text]

Step 3: Apply Edit

Write the edited guidance to current-guidance.md

Step 4: Run Test

Use a subagent to run the task with the new guidance:

  • Give the subagent current-guidance.md
  • Provide test inputs from fixtures/test-cases.json
  • Capture the output
  • Subagent invocation: Use sessions_spawn with task containing the full contents of current-guidance.md, include the test cases JSON inline in the task prompt, set timeoutSeconds to 120, and request the subagent to return the raw output (not just pass/fail)

Step 5: Score Result

Evaluate the output against scoring.md criteria. Generate a score 0-100.

Step 6: Log Decision

Append to scores.md:

| N   | Description of change | SCORE | keep/discard |

Where N is the run number (increment from last).

Step 7: Update Guidance

  • If score improved: Keep the edit (current-guidance.md is already updated)
  • If score declined: Revert current-guidance.md to previous version

Step 8: Check Plateau

If last 10 scores are within 5 points of each other:

  • Log "Plateau detected - pausing"
  • Notify user
  • Stop the cron (or pause and await user override)

Files Created in Sandbox

FileDescription
guidance-under-test.mdOriginal copy (read-only reference)
current-guidance.mdWorking version (edited each iteration)
fixtures/test-cases.jsonInput → expected output pairs
scoring.mdScoring methodology
scores.mdScore history log
scripts/(optional) Copies of referenced scripts/tools for analysis

Usage

  1. Invoke: /autoagent
  2. Answer setup questions
  3. Monitor scores.md for progress
  4. Copy improvements to original when satisfied
  5. Stop cron when done

Stopping

  • User can stop cron anytime
  • Auto-stops if score plateaus for 10 runs
  • Check scores.md for progress

Key Principles

  • Non-destructive: Original guidance stays in guidance-under-test.md
  • Learn from history: Don't repeat failed approaches
  • Be specific: Vague changes won't score well
  • Human in the loop: User defines success criteria, can override plateau detection

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Web3

Neko Futures Trader

Automated Binance Futures trading scanner with runner detection and price monitor. Features: - Runner detection (volume spike + momentum + breakout) - Real c...

Registry SourceRecently Updated
Web3

Bridge Stablecoin

Build USDC bridging with Circle Bridge Kit SDK and Crosschain Transfer Protocol (CCTP). Supports bridging USDC between EVM chains, between EVM chains and Sol...

Registry SourceRecently Updated
Web3

Binance Trade Hunter

Binance trading skill for coin analysis, pump detection, and one-click trading via Telegram. Use when the user wants to analyze crypto coins, check market tr...

Registry SourceRecently Updated
Web3

Use Usdc

USDC is Circle's stablecoin deployed across multiple blockchain ecosystems including EVM chains (Ethereum, Base, Arbitrum, Polygon, Arc) and Solana. Use this...

Registry SourceRecently Updated