autoresearch

Autonomously optimize any OpenClaw skill by running it repeatedly, scoring outputs against binary evals, mutating the prompt, and keeping improvements. Based on Karpathy's autoresearch methodology.

Triggers

Use when: optimize this skill, improve this skill, run autoresearch on, make this skill better, self-improve skill, benchmark skill, eval my skill, run evals on.

Description

Autonomous prompt/strategy optimization using Karpathy's autoresearch pattern. Mutate → evaluate → keep improvements. Works on anything with a measurable score: trading strategies, content scripts, thumbnails, ad copy, email subjects.

How It Works

┌──────────────┐     ┌──────────────┐     ┌──────────────┐     ┌───────────────┐
│ 1. BASELINE  │────▶│ 2. MUTATE    │────▶│ 3. EVALUATE  │────▶│ 4. DECIDE     │
│ Score the    │     │ Change one   │     │ Run scoring  │     │ Better?       │
│ current      │     │ thing        │     │ function     │     │ Keep : Revert │
│ version      │     │              │     │              │     │               │
└──────────────┘     └──────────────┘     └──────────────┘     └───────┬───────┘
                                                                       │
                                                          Loop back to 2

Instructions

Step 1: Identify the Mutable File

The mutable file is the thing you're optimizing. It can be:

  • The prompt/instructions in a SKILL.md
  • A trading strategy config (thresholds, parameters)
  • A content template (YouTube script format, ad copy structure)
  • Any text file where changes produce measurable differences

Create or identify this file. Example:

my-skill/
├── SKILL.md          ← this is your mutable file
├── eval/
│   ├── test_cases.json
│   └── score.py
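
The eval/test_cases.json shown above can be as simple as a list of inputs with scoring hints. The field names input and expected here are illustrative, not required by anything:

[
  {"input": "subject line for a product-launch email", "expected": "under 8 words, contains a number"},
  {"input": "subject line for a win-back email", "expected": "starts with a verb"}
]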

Step 2: Create an Evaluation Function

Your eval function must:

  1. Take the current mutable file as input
  2. Run it against test cases
  3. Return a numeric score (higher = better)

The eval can be anything:

  • LLM-as-judge: Send output to an LLM, ask it to score 1-100
  • Backtest: Run a strategy against historical data, measure Sharpe/returns
  • A/B metrics: CTR, engagement, conversion rate
  • Binary pass/fail: Count how many test cases pass out of N

Template eval function (customize for your domain):

# eval/score.py
import json
import sys

def run_and_score(current_version: str, case: dict) -> float:
    """YOUR SCORING LOGIC HERE.

    Example: run the prompt against case["input"], compare the output
    to case["expected"], and return a score (higher = better).
    """
    raise NotImplementedError("replace with domain-specific scoring")

def evaluate(mutable_file_path: str, test_cases_path: str) -> float:
    """
    Score the current version of the mutable file.
    Returns a float — higher is better.
    """
    with open(mutable_file_path) as f:
        current_version = f.read()

    with open(test_cases_path) as f:
        test_cases = json.load(f)

    scores = []
    for case in test_cases:
        scores.append(run_and_score(current_version, case))

    return sum(scores) / len(scores)

if __name__ == "__main__":
    score = evaluate(sys.argv[1], sys.argv[2])
    print(f"SCORE: {score}")

Step 3: Run the Autoresearch Loop

The loop follows this exact pattern (a minimal Python sketch follows the list):

1. Git init (if not already) — every experiment is a commit
2. Run eval on current version → get BASELINE score
3. For each experiment (1..N):
   a. Read the current mutable file
   b. Generate a MUTATION (change one thing — a threshold, a phrase, a rule)
   c. Write the mutated version
   d. Run eval → get NEW score
   e. If NEW > BASELINE:
      - Git commit with message: "exp-{N}: {description} | score: {baseline} → {new}"
      - Update BASELINE = NEW
      - Log: "✅ KEPT — improvement"
   f. If NEW <= BASELINE:
      - Git checkout the mutable file (revert)
      - Log: "❌ REVERTED — no improvement"
4. Print final summary: experiments run, improvements found, final score
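
A minimal driver for this loop, assuming the Step 2 scorer at eval/score.py and a propose_mutation helper (hypothetical here; in practice the agent itself generates each mutation):

# autoresearch_loop.py: a sketch of the keep/revert loop, not a turnkey tool.
import subprocess

MUTABLE = "SKILL.md"

def run_eval() -> float:
    # Runs the Step 2 scorer and parses its "SCORE: <float>" output line.
    out = subprocess.run(
        ["python", "eval/score.py", MUTABLE, "eval/test_cases.json"],
        capture_output=True, text=True, check=True,
    ).stdout
    return float(out.rsplit("SCORE:", 1)[1].strip())

def git(*args: str) -> None:
    subprocess.run(["git", *args], check=True)

baseline = run_eval()
git("add", "-A")
git("commit", "-m", f"baseline: score {baseline}")

for n in range(1, 21):  # default: 20 experiments
    original = open(MUTABLE).read()
    mutated, description = propose_mutation(original)  # hypothetical: change ONE thing
    with open(MUTABLE, "w") as f:
        f.write(mutated)
    new = run_eval()
    if new > baseline:
        git("add", "-A")
        git("commit", "-m", f"exp-{n}: {description} | score: {baseline} → {new}")
        baseline = new
        print(f"✅ KEPT — improvement: {new}")
    else:
        git("checkout", "--", MUTABLE)
        print(f"❌ REVERTED — no improvement: {new}")

print(f"Final score: {baseline}")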

Agent Instructions for Running the Loop

When the user says "run autoresearch on X", follow this procedure:

  1. Locate the mutable file — ask the user or infer from context
  2. Locate or create the eval function — the user must have a way to score
  3. Initialize git tracking in the project directory
  4. Run baseline eval — record the starting score
  5. Begin experiment loop:
    • Read the mutable file
    • Think about what single change might improve the score
    • Make the change (be specific — change ONE thing per experiment)
    • Run eval
    • Keep or revert based on score
    • Log the result
  6. Continue for N experiments (default: 20, or until user stops)
  7. Report results:
    • Starting score → Final score
    • Number of experiments run
    • Number of improvements kept
    • Summary of what changes worked

Mutation Strategy

Good mutations change ONE thing at a time:

  • Numeric parameters: Adjust thresholds, weights, window sizes
  • Prompt wording: Rephrase instructions, add/remove constraints
  • Structure: Reorder sections, add examples, remove redundancy
  • Rules: Add a new rule, tighten an existing one, relax a constraint

Bad mutations change everything at once — you can't learn what worked.
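
As a concrete illustration, using the parameters from the gold case study below (the JSON key names are illustrative), a good mutation is a one-line diff; a bad one changes several knobs at once:

# Good: one attributable change
- "momentum_threshold": 0.003
+ "momentum_threshold": 0.0

# Bad: three changes at once; if the score moves, you can't tell which one did it
- "momentum_threshold": 0.003, "ema_fast": 8, "ema_slow": 24
+ "momentum_threshold": 0.0,   "ema_fast": 5, "ema_slow": 11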

Step 4: Git Tracking

Every experiment MUST be tracked in git:

# Before starting
git init
git add -A
git commit -m "baseline: score {X}"

# After each successful mutation
git add -A
git commit -m "exp-{N}: {what changed} | {old_score} → {new_score}"

# After each failed mutation
git checkout -- {mutable_file}

This gives you:

  • Full history of every experiment
  • Ability to diff any two versions
  • Easy rollback if something breaks
  • A log of what mutations worked vs didn't
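
For example, to compare what changed between two kept experiments after a run:

git log --oneline                      # find the commit hashes for exp-3 and exp-7
git diff <exp-3-sha> <exp-7-sha> -- SKILL.md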

Proven Results

Case Study 1: Gold Trading Strategy

  • Task: Optimize XAUUSD trading parameters
  • Mutable file: Strategy config (EMA periods, momentum threshold, position sizing)
  • Eval function: Backtest on historical data → Sharpe ratio
  • Baseline: Sharpe 5.80
  • Experiments: 86 in 25 minutes
  • Final: Sharpe 12.23 (+111%)
  • Key discoveries: Momentum threshold 0.003→0, EMA 8/24→5/11, position sizing optimization
  • See: references/gold-results.md

Case Study 2: YouTube Shorts Scripts

  • Task: Optimize script-writing prompt for higher quality scores
  • Mutable file: SKILL.md prompt instructions
  • Eval function: LLM judge scoring 1-100
  • Baseline: 94.3/100
  • Experiments: 11
  • Final: 96.7/100 (+2.5%)
  • Key discoveries: Atomic sentences, strict 40-50 word range, stronger negative examples
  • See: references/youtube-results.md

Example Usage

User: "Run autoresearch on my email subject line skill"

Agent workflow:

  1. Read the skill's SKILL.md (mutable file)
  2. Create eval: generate 20 test emails → score subject lines with LLM judge (1-100 on open-rate prediction)
  3. Baseline: 72.4/100
  4. Experiment 1: Add "use numbers in subject lines" → 74.1 ✅ KEPT
  5. Experiment 2: Add "max 6 words" → 71.8 ❌ REVERTED
  6. Experiment 3: Add "start with a verb" → 75.3 ✅ KEPT
  7. ... continue for 20 experiments
  8. Final: 79.2/100 (+9.4%)

User: "Optimize my trading strategy config"

Agent workflow:

  1. Read strategy.json (mutable file)
  2. Eval: run backtest script → Sharpe ratio
  3. Baseline: Sharpe 2.1
  4. Experiment 1: Lower stop-loss from 2% to 1.5% → Sharpe 2.3 ✅
  5. Experiment 2: Increase EMA fast period 12→15 → Sharpe 1.9 ❌
  6. ... continue
  7. Final: Sharpe 3.8 (+81%)
