eval-recipes Runner Skill

Purpose

Run Microsoft's eval-recipes benchmarks to validate amplihack improvements against baseline agents.

When to Use

  • User asks to "test with eval-recipes"

  • User says "run the evals" or "benchmark this change"

  • User wants to validate improvements against codex/claude_code

  • Testing a PR branch to prove it improves scores

Capabilities

I can run eval-recipes benchmarks to:

  • Test specific amplihack branches

  • Compare against baseline agents (codex, claude_code)

  • Run specific tasks (linkedin_drafting, email_drafting, etc.)

  • Compare before/after scores for PRs

  • Generate reports with score improvements

How It Works

Setup (One-Time)

Clone eval-recipes from Microsoft

git clone https://github.com/microsoft/eval-recipes.git ~/eval-recipes

Copy our agent configs (run from your amplihack checkout)

cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

Install dependencies

cd ~/eval-recipes
uv sync
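
As a quick sanity check, confirm the agent config landed in place (the amplihack directory name comes from the config copied above):

ls ~/eval-recipes/data/agents/amplihack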

Running Benchmarks

Test a specific branch:

Update install.dockerfile to point at the branch under test (see the sed sketch below)

Then run the benchmark

cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3
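
To switch branches without hand-editing the file, a sketch assuming install.dockerfile still contains the BRANCH_NAME placeholder shown under Typical Workflow (the substitution works once per fresh copy of the file):

# feat/my-branch is a placeholder for the branch under test
sed -i "s|BRANCH_NAME|feat/my-branch|" .claude/agents/eval-recipes/amplihack/install.dockerfile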

Compare before/after:

Test baseline (main)

uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting

Test PR branch (edit install.dockerfile to check out the PR branch)

uv run eval_recipes/main.py --agent amplihack_pr1443 --task linkedin_drafting

Compare scores
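
eval-recipes writes results under .benchmark_results/ (see Notes). A minimal comparison sketch, assuming each trial leaves a single numeric score in a score.txt; adjust the glob if your layout differs:

cd ~/eval-recipes
for agent in amplihack amplihack_pr1443; do
  printf '%s: ' "$agent"
  cat .benchmark_results/*/"$agent"/*/score.txt 2>/dev/null |
    awk '{s+=$1; n++} END {if (n) printf "%.1f avg over %d runs\n", s/n, n; else print "no results"}'
done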

Available Tasks

Common tasks from eval-recipes:

  • linkedin_drafting: create a tool for LinkedIn posts (scored 6.5/100 before PR #1443)

  • email_drafting: create a CLI tool for emails (scored 26/100 before)

  • arxiv_paper_summarizer: research tool

  • github_docs_extractor: documentation tool

  • Many more in ~/eval-recipes/data/tasks/
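
To see every available task:

ls ~/eval-recipes/data/tasks/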

Typical Workflow

When user says "test this change with eval-recipes":

  • Identify the branch/PR to test

  • Update agent config to use that branch:

In .claude/agents/eval-recipes/amplihack/install.dockerfile

RUN git clone https://github.com/rysweet/...git /tmp/amplihack && \
    cd /tmp/amplihack && \
    git checkout BRANCH_NAME && \
    pip install -e .

  • Copy to eval-recipes: cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

  • Run benchmark: cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task TASK_NAME --trials 3

  • Report scores and compare with baseline (see the consolidated script below)
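
Put together, a sketch of the whole workflow as one script. run_eval.sh is a hypothetical name; it assumes the amplihack repo root as working directory and the BRANCH_NAME placeholder from the Dockerfile above:

#!/usr/bin/env bash
# run_eval.sh BRANCH TASK - hypothetical helper, not part of eval-recipes
set -euo pipefail
BRANCH="${1:?usage: run_eval.sh BRANCH TASK}"
TASK="${2:?usage: run_eval.sh BRANCH TASK}"
# Point the agent Dockerfile at the branch under test
sed -i "s|BRANCH_NAME|$BRANCH|" .claude/agents/eval-recipes/amplihack/install.dockerfile
# Stage the agent config where eval-recipes picks it up
cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/
# Run the benchmark
cd ~/eval-recipes
uv run eval_recipes/main.py --agent amplihack --task "$TASK" --trials 3

Example: ./run_eval.sh feat/issue-1435-task-classification linkedin_drafting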

Expected Scores

Baseline (main branch):

  • Overall: 40.6/100

  • LinkedIn: 6.5/100

  • Email: 26/100

With PR #1443 (task classification):

  • Expected: 55-60/100 (+15-20 points)

  • LinkedIn: 30-40/100 (creates actual tool)

  • Email: 45/100 (consistent execution)

Example Usage

User says: "Test PR #1443 with eval-recipes on the LinkedIn task"

I do:

  • Update install.dockerfile to checkout feat/issue-1435-task-classification

  • Copy to eval-recipes: cp -r .claude/agents/eval-recipes/* ~/eval-recipes/data/agents/

  • Run: cd ~/eval-recipes && uv run eval_recipes/main.py --agent amplihack --task linkedin_drafting --trials 3

  • Report results: "Score: 35.2/100 (up from 6.5 baseline)"

Prerequisites

  • eval-recipes cloned to ~/eval-recipes

  • API key in environment: export ANTHROPIC_API_KEY=sk-ant-...

  • Docker installed (for containerized runs)

  • uv installed: curl -LsSf https://astral.sh/uv/install.sh | sh
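
A quick preflight check for the items above (a minimal sketch):

test -d ~/eval-recipes || echo "missing: eval-recipes clone at ~/eval-recipes"
[ -n "${ANTHROPIC_API_KEY:-}" ] || echo "missing: ANTHROPIC_API_KEY"
command -v docker >/dev/null || echo "missing: docker"
command -v uv >/dev/null || echo "missing: uv"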

Notes

  • Benchmarks take 2-15 minutes per task depending on complexity

  • Multiple trials (3-5) give more reliable averages

  • Docker builds can be cached for speed

  • Results are saved to .benchmark_results/ in the eval-recipes repo

Automation

For fully autonomous testing:

Test suite for a PR

tasks="linkedin_drafting email_drafting arxiv_paper_summarizer" for task in $tasks; do uv run eval_recipes/main.py --agent amplihack --task $task --trials 3 done

Compare results

cat .benchmark_results/*/amplihack/*/score.txt
