pinchbench

Run PinchBench benchmarks to evaluate OpenClaw agent performance across real-world tasks. Use when testing model capabilities, comparing models, submitting benchmark results to the leaderboard, or checking how well your OpenClaw setup handles calendar, email, research, coding, and multi-step workflows.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "pinchbench" with this command: npx skills add pinchbench/pinchbench

PinchBench Benchmark Skill

PinchBench measures how well LLM models perform as the brain of an OpenClaw agent. Results are collected on a public leaderboard at pinchbench.com.

Prerequisites

  • Python 3.10+
  • uv package manager
  • OpenClaw instance (this agent)

Quick Start

cd <skill_directory>

# Run benchmark with a specific model
uv run benchmark.py --model anthropic/claude-sonnet-4

# Run only automated tasks (faster)
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite automated-only

# Run specific tasks
uv run benchmark.py --model anthropic/claude-sonnet-4 --suite task_01_calendar,task_02_stock

# Skip uploading results
uv run benchmark.py --model anthropic/claude-sonnet-4 --no-upload

Available Tasks (23)

TaskCategoryDescription
task_00_sanityBasicVerify agent works
task_01_calendarProductivityCalendar event creation
task_02_stockResearchStock price lookup
task_03_blogWritingBlog post creation
task_04_weatherCodingWeather script
task_05_summaryAnalysisDocument summarization
task_06_eventsResearchConference research
task_07_emailWritingEmail drafting
task_08_memoryMemoryContext retrieval
task_09_filesFilesFile structure creation
task_10_workflowIntegrationMulti-step API workflow
task_11_clawdhubSkillsClawHub interaction
task_12_skill_searchSkillsSkill discovery
task_13_image_genCreativeImage generation
task_14_humanizerWritingText humanization
task_15_daily_summaryProductivityDaily digest
task_16_email_triageEmailInbox triage
task_17_email_searchEmailEmail search
task_18_market_researchResearchMarket analysis
task_19_spreadsheet_summaryAnalysisSpreadsheet analysis
task_20_eli5_pdf_summaryAnalysisPDF simplification
task_21_openclaw_comprehensionKnowledgeOpenClaw docs comprehension
task_22_second_brainMemoryKnowledge management

Command Line Options

OptionDescription
--modelModel identifier (e.g., anthropic/claude-sonnet-4)
--suiteall, automated-only, or comma-separated task IDs
--output-dirResults directory (default: results/)
--timeout-multiplierScale task timeouts for slower models
--runsNumber of runs per task for averaging
--no-uploadSkip uploading to leaderboard
--registerRequest new API token for submissions
--upload FILEUpload previous results JSON

Token Registration

To submit results to the leaderboard:

# Register for an API token (one-time)
uv run benchmark.py --register

# Run benchmark (auto-uploads with token)
uv run benchmark.py --model anthropic/claude-sonnet-4

Results

Results are saved as JSON in the output directory:

# View task scores
jq '.tasks[] | {task_id, score: .grading.mean}' results/0001_anthropic-claude-sonnet-4.json

# Show failed tasks
jq '.tasks[] | select(.grading.mean < 0.5)' results/*.json

# Calculate overall score
jq '{average: ([.tasks[].grading.mean] | add / length)}' results/*.json

Adding Custom Tasks

Create a markdown file in tasks/ following TASK_TEMPLATE.md. Each task needs:

  • YAML frontmatter (id, name, category, grading_type, timeout)
  • Prompt section
  • Expected behavior
  • Grading criteria
  • Automated checks (Python grading function)

Leaderboard

View results at pinchbench.com. The leaderboard shows:

  • Model rankings by overall score
  • Per-task breakdowns
  • Historical performance trends

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

Xiaohongshu Ops

小红书端到端运营:账号定位、选题研究、内容生产、发布执行、数据复盘。 Use when: (1) 用户要写小红书笔记/帖子, (2) 用户说"发小红书"/"写个种草文"/"出一篇小红书", (3) 用户讨论小红书选题/热点/爆款分析/竞品对标, (4) 用户提到账号定位/人设/内容方向规划, (5) 用户要求生成...

Registry SourceRecently Updated
Automation

WeMP Ops

微信公众号全流程运营:选题→采集→写作→排版→发布→数据分析→评论管理。 Use when: (1) 用户要写公众号文章或提供了选题方向, (2) 用户说"写一篇关于XXX的文章"/"帮我写篇推文"/"出一篇稿子", (3) 用户要求采集热点/素材/竞品分析, (4) 用户提到公众号日报/周报/数据分析/阅读量/...

Registry SourceRecently Updated
Automation

agent-stock

用于股票行情查询与分析的命令行技能。用户提到 stock 命令、股票代码、最新资讯、市场概览、K 线或配置管理时调用。

Registry SourceRecently Updated