pinchbench

Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and repository scripts before running.

Copy this and send it to your AI assistant to install the skill:

Install skill "pinchbench" with this command: npx skills add pinchbench/skill/pinchbench-skill-pinchbench

PinchBench Benchmark Skill

PinchBench measures how well LLMs perform as the brain of an OpenClaw agent. Results are collected on a public leaderboard at pinchbench.com.

Prerequisites

  • Python 3.10+
  • uv package manager
  • OpenClaw instance (this agent)
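
Before the first run, it can help to confirm the tooling is in place (a quick sanity check, not part of the benchmark itself):

# Verify the required tools are available
python3 --version   # expect 3.10 or newer
uv --version        # uv is used to run benchmark.py and resolve its dependencies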

Quick Start

cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload

Available Tasks (23)

Task                             Category       Description
task_00_sanity                   Basic          Verify agent works
task_01_calendar                 Productivity   Calendar event creation
task_02_stock                    Research       Stock price lookup
task_03_blog                     Writing        Blog post creation
task_04_weather                  Coding         Weather script
task_05_summary                  Analysis       Document summarization
task_06_events                   Research       Conference research
task_07_email                    Writing        Email drafting
task_08_memory                   Memory         Context retrieval
task_09_files                    Files          File structure creation
task_10_workflow                 Integration    Multi-step API workflow
task_11_clawdhub                 Skills         ClawHub interaction
task_12_skill_search             Skills         Skill discovery
task_13_image_gen                Creative       Image generation
task_14_humanizer                Writing        Text humanization
task_15_daily_summary            Productivity   Daily digest
task_16_email_triage             Email          Inbox triage
task_17_email_search             Email          Email search
task_18_market_research          Research       Market analysis
task_19_spreadsheet_summary      Analysis       Spreadsheet analysis
task_20_eli5_pdf_summary         Analysis       PDF simplification
task_21_openclaw_comprehension   Knowledge      OpenClaw docs comprehension
task_22_second_brain             Memory         Knowledge management

Command Line Options

Option                  Description
--model                 Model identifier (e.g., anthropic/claude-sonnet-4)
--suite                 all, automated-only, or comma-separated task IDs
--output-dir            Results directory (default: results/)
--timeout-multiplier    Scale task timeouts for slower models
--runs                  Number of runs per task for averaging
--no-upload             Skip uploading to leaderboard
--register              Request new API token for submissions
--upload FILE           Upload previous results JSON
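
These options combine. As a sketch for a slower model, assuming --runs and --timeout-multiplier take plain numeric values as described above (the results-local/ directory is just an example name):

# Average three runs per task, double the timeouts, keep results local
uv run benchmark.py --model anthropic/claude-sonnet-4 \
  --runs 3 \
  --timeout-multiplier 2 \
  --output-dir results-local/ \
  --no-upload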

Token Registration

To submit results to the leaderboard:

# Register for an API token (one-time)
uv run benchmark.py --register

# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4

Results

Results are saved as JSON in the output directory:

# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json

# Show failed tasks
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json

# Calculate overall score
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json

Adding Custom Tasks

Create a markdown file in tasks/ following TASK_TEMPLATE.md. Each task needs:

  • YAML frontmatter (id, name, category, grading_type, timeout)
  • Prompt section
  • Expected behavior
  • Grading criteria
  • Automated checks (Python grading function)
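
A minimal sketch of what such a task file might contain, assuming the frontmatter keys and section names mirror the list above; treat TASK_TEMPLATE.md as the source of truth for the exact schema and grading-function signature (task_23_example and its grade function are purely hypothetical):

# Create a new task file in tasks/ (content below is illustrative only)
cat > tasks/task_23_example.md <<'EOF'
---
id: task_23_example
name: Example fact lookup
category: Research
grading_type: automated
timeout: 120
---

## Prompt
Ask the agent to look up a simple fact and answer in one sentence.

## Expected Behavior
The agent returns a single, factually correct sentence.

## Grading Criteria
Full credit for a correct one-sentence answer; zero otherwise.

## Automated Checks
def grade(result: str) -> float:
    # Hypothetical check: any non-empty answer earns full credit
    return 1.0 if result.strip() else 0.0
EOF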

Leaderboard

View results at pinchbench.com. The leaderboard shows:

  • Model rankings by overall score
  • Per-task breakdowns
  • Historical performance trends

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

  • excel analysis (Research): no summary provided by upstream source; repository source, needs review
  • x-cmd-knowledge (Research): no summary provided by upstream source; repository source, needs review
  • here-now (General): no summary provided by upstream source; repository source, needs review