PinchBench Benchmark Skill
PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Results are collected on a public leaderboard at pinchbench.com.
Prerequisites
- Python 3.10+
- uv package manager
- OpenClaw instance (this agent)
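uv is the only non-standard prerequisite. If it is missing, it can be installed with its upstream installer or via pip; these commands come from uv's own documentation, not from PinchBench:

```bash
# Official installer from Astral
curl -LsSf https://astral.sh/uv/install.sh | sh
# Or install from PyPI
pip install uv
```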
Quick Start
cd <skill_directory>
# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4
# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only
# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock
# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload
Available Tasks (23)
| Task | Category | Description |
|---|---|---|
| task_00_sanity | Basic | Verify agent works |
| task_01_calendar | Productivity | Calendar event creation |
| task_02_stock | Research | Stock price lookup |
| task_03_blog | Writing | Blog post creation |
| task_04_weather | Coding | Weather script |
| task_05_summary | Analysis | Document summarization |
| task_06_events | Research | Conference research |
| task_07_email | Writing | Email drafting |
| task_08_memory | Memory | Context retrieval |
| task_09_files | Files | File structure creation |
| task_10_workflow | Integration | Multi-step API workflow |
| task_11_clawdhub | Skills | ClawHub interaction |
| task_12_skill_search | Skills | Skill discovery |
| task_13_image_gen | Creative | Image generation |
| task_14_humanizer | Writing | Text humanization |
| task_15_daily_summary | Productivity | Daily digest |
| task_16_email_triage | | Inbox triage |
| task_17_email_search | | Email search |
| task_18_market_research | Research | Market analysis |
| task_19_spreadsheet_summary | Analysis | Spreadsheet analysis |
| task_20_eli5_pdf_summary | Analysis | PDF simplification |
| task_21_openclaw_comprehension | Knowledge | OpenClaw docs comprehension |
| task_22_second_brain | Memory | Knowledge management |
Command Line Options
| Option | Description |
|---|---|
| --model | Model identifier (e.g., anthropic/claude-sonnet-4) |
| --suite | all, automated-only, or comma-separated task IDs |
| --output-dir | Results directory (default: results/) |
| --timeout-multiplier | Scale task timeouts for slower models |
| --runs | Number of runs per task for averaging |
| --no-upload | Skip uploading to leaderboard |
| --register | Request new API token for submissions |
| --upload FILE | Upload previous results JSON |
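For example, a run that averages three attempts per task, doubles each task's timeout, and keeps the results local might look like this; the flag values shown are illustrative, not required:

```bash
uv run benchmark.py --model anthropic/claude-sonnet-4 \
  --suite automated-only --runs 3 --timeout-multiplier 2 \
  --output-dir results/sonnet-4/ --no-upload
```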
Token Registration
To submit results to the leaderboard:
# Register for an API token (one-time)
uv run benchmark.py --register
# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4
Results
Results are saved as JSON in the output directory:
# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json
# Show failed tasks
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json
# Calculate overall score
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json
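Because the glob forms above print one unlabeled object per file, a small loop can tag each overall score with its filename. This sketch assumes the same .tasks[].grading.mean layout used by the commands above:

```bash
# Print "<file>: <overall mean score>" for every result file
for f in results/*.json; do
  printf '%s: ' "$f"
  jq '[.tasks[].grading.mean] | add / length' "$f"
done
```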
Adding Custom Tasks
Create a markdown file in tasks/ following TASK_TEMPLATE.md (see the sketch after this list). Each task needs:
- YAML frontmatter (id, name, category, grading_type, timeout)
- Prompt section
- Expected behavior
- Grading criteria
- Automated checks (Python grading function)
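A minimal task file might look like the sketch below. The frontmatter values, section headings, and grading-function signature shown here are illustrative assumptions; TASK_TEMPLATE.md is the authoritative reference.

````markdown
---
id: task_23_example
name: Example report summary
category: Analysis
grading_type: automated   # illustrative value; check TASK_TEMPLATE.md for valid options
timeout: 120
---

## Prompt
Summarize the attached report in exactly three bullet points.

## Expected behavior
The agent reads the document and replies with three concise bullets.

## Grading criteria
Full credit for three accurate bullets; zero otherwise.

## Automated checks
```python
def grade(transcript: str) -> float:
    # Hypothetical signature: award full credit when the reply contains three bullets.
    bullets = [ln for ln in transcript.splitlines() if ln.strip().startswith("-")]
    return 1.0 if len(bullets) == 3 else 0.0
```
````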
Leaderboard
View results at pinchbench.com. The leaderboard shows:
- Model rankings by overall score
- Per-task breakdowns
- Historical performance trends