OCR Benchmark v2.0.0
Multi-model OCR accuracy comparison with fuzzy line-level scoring, cost tracking, and PPT report generation.
Setup
1. Install dependencies
cd ~/.openclaw/workspace/skills/ocr-benchmark/ocr-benchmark
pip install -r requirements.txt
2. Configure environment variables
Set the variables for the providers you want to use:
# Bedrock (Claude models) — uses your existing AWS credentials
export AWS_REGION=us-west-2 # or your preferred region
# Gemini (Google AI Studio)
export GOOGLE_API_KEY=your_key_here
# PaddleOCR — OPTIONAL, skip if not available
export PADDLEOCR_ENDPOINT=https://your-paddle-endpoint
export PADDLEOCR_TOKEN=your_token # optional auth token
Note on PaddleOCR: This provider requires an external API endpoint. If
PADDLEOCR_ENDPOINTis not set, it is automatically skipped — no error. If you don't have a PaddleOCR endpoint, simply don't set the env var.
3. Prepare images
Place your images locally (.jpg, .png, .webp). There is no automatic image download — provide local file paths on the command line.
Quick Start
Run benchmark on images
python3 scripts/run_benchmark.py \
--images img1.jpg img2.jpg img3.jpg \
--output-dir ./results \
--ground-truth ground_truth.json
Skip models with missing credentials (no error, just skips)
python3 scripts/run_benchmark.py \
--images img1.jpg \
--auto-skip \
--output-dir ./results
Run only specific models
python3 scripts/run_benchmark.py \
--images img1.jpg \
--models opus sonnet gemini3pro \
--output-dir ./results \
--ground-truth ground_truth.json
Score-only mode (re-score without re-running OCR)
python3 scripts/run_benchmark.py \
--score-only \
--output-dir ./results \
--ground-truth ground_truth.json
Generate PPT report from scored results
python3 scripts/make_report.py \
--results-dir ./results \
--images img1.jpg img2.jpg img3.jpg \
--scores ./results/scores.json \
--output report.pptx
Workflow
- Prepare images — collect your
.jpg/.pngfiles locally - Run benchmark —
run_benchmark.pycalls each model, saves{image}.{model}.json - Create ground truth — see
references/ground-truth-format.mdfor format - Score — run with
--ground-truthto producescores.jsonand a terminal table - Report —
make_report.pygenerates a shareable.pptx
Environment Variables
| Variable | Provider | Required? | Description |
|---|---|---|---|
AWS_REGION | Bedrock | Optional | Default: us-west-2 |
GOOGLE_API_KEY | Gemini | Yes | Google AI Studio API key |
PADDLEOCR_ENDPOINT | PaddleOCR | Optional | Endpoint URL; auto-skipped if unset |
PADDLEOCR_TOKEN | PaddleOCR | Optional | Auth token for PaddleOCR |
Missing variables: If a model's required env var is missing, it is automatically skipped with a warning. Use --auto-skip for completely silent skipping.
Available Models
See references/models.md for full model IDs, pricing, and provider notes.
| Key | Label | Provider |
|---|---|---|
opus | Claude Opus 4.6 | Bedrock |
sonnet | Claude Sonnet 4.6 | Bedrock |
haiku | Claude Haiku 4.5 | Bedrock |
gemini3pro | Gemini 3.1 Pro | Google AI Studio |
gemini3flash | Gemini 3.1 Flash-Lite | Google AI Studio |
paddleocr | PaddleOCR | External endpoint |
Scoring Logic (v2)
Scoring uses fuzzy line-level matching with Levenshtein edit distance (pure Python stdlib, no extra dependencies).
For each ground truth line, the best-matching model output line is found and classified:
| Type | Condition | Score |
|---|---|---|
| EXACT | Identical after normalization | 1.0 |
| CLOSE | Edit distance < 20% of length (punctuation/apostrophe diffs) | 0.8 |
| PARTIAL | Edit distance < 50% of length (real errors but mostly correct) | 0.5 |
| MISS | No matching line found | 0.0 |
Additionally, EXTRA lines are detected: model output lines that don't correspond to any ground truth line.
Normalization strips: whitespace, apostrophes/quotes (', ', `), common punctuation (*, ✓, ,, 、, :, (), 【】 etc.), then lowercases.
Example terminal output
========================================================================
OCR BENCHMARK RESULTS
========================================================================
# Model Score Details
------------------------------------------------------------------------
🥇 Gemini 3.1 Pro 98.7% Image001: 99% | Image002: 98%
🥈 Claude Opus 4.6 88.3% Image001: 90% | Image002: 87%
🥉 Claude Sonnet 4.6 85.1% Image001: 86% | Image002: 84%
4. Gemini 3.1 Flash-Lite 82.0% ...
========================================================================
📄 Image001
──────────────────────────────────────────────────────────────────────
┌─ Claude Opus 4.6 (90.0%)
│ ✅ EXACT │ 小胡鸭
│ 🟡 CLOSE │ GT: Sam's Coffee
│ │ Got: Sams Coffee [dist=2]
│ 🟠 PARTIAL │ GT: 浓郁香气
│ │ Got: 浓都香气 [dist=1]
│ ❌ MISS │ GT: 净含量580克
│ ⚠️ EXTRA lines (1):
│ + "Product of China"
└──────────────────────────────────────────────────────────────────────
Output Files
Each OCR run produces {image}.{model}.json:
{
"text_extracted": ["line1", "line2", ...],
"brand": "...",
"product_name": "...",
"net_weight": "...",
"ingredients": ["..."],
"other_fields": {},
"model": "Claude Opus 4.6",
"model_key": "opus",
"latency_seconds": 23.5,
"input_tokens": 800,
"output_tokens": 500
}
Scoring produces scores.json with per-image, per-line, per-model results.
Key Findings (2026-03, product packaging)
Human-verified ranking:
- Gemini 3.1 Pro (98.7%) — Best accuracy, ~$0.006/image
- Claude Opus 4.6 (92.3%) — High accuracy; occasional missed details
- Gemini 3.1 Flash (89.7%) — Best speed/cost ratio, 9.7s
- Claude Sonnet 4.6 (88.5%) — Stable structured output
- PaddleOCR (67.9%) — Free, character errors on packaging
- Claude Haiku 4.5 (42.3%) — Poor Chinese OCR
Lesson: Never assume any model is ground truth. Human verification is essential.