# 1-Minute Codebase Evaluation
Fast, parallel evaluation of codebases using Claude CLI with structured metrics.
## Features

- ✅ **Smart Scanning**: automatically skips `.claude/`, `node_modules/`, `.git/`, and previous `eval_*` results
- ✅ **Parallel Evaluation**: runs multiple metrics concurrently for speed
- ✅ **Auto Ranking**: submits to the TopVibeCoder API and returns your rank
- ✅ **Progress Tracking**: saves ranking history to track improvements over time
- ✅ **Detailed Reports**: generates comprehensive markdown reports with citations
- ✅ **Terminal Bar Chart**: visual score display using Unicode block characters
## Quick Start

```bash
# Evaluate the current directory (use this by default)
.claude/skills/1-min-eval/scripts/run_eval.sh .

# Evaluate with specific metrics
.claude/skills/1-min-eval/scripts/run_eval.sh /path/to/project --metrics impact,technical

# Full evaluation with all metrics (do NOT use by default)
.claude/skills/1-min-eval/scripts/run_eval.sh /path/to/project --all-metrics
```
## How It Works

1. **Scan**: `scan_codebase.py` extracts the repo tree and source code with line numbers
2. **Evaluate**: runs parallel `claude -p` calls, one per metric (see the sketch below)
3. **Aggregate**: combines the per-metric JSON results into a final report
4. **Visualize**: displays a terminal bar chart of the scores
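The fan-out in step 2 is where the speed comes from. A minimal sketch of that step, assuming the scanned codebase is piped on stdin as in the Manual Usage section below (the real orchestration lives in `run_eval.sh`):

```python
# Sketch of the parallel fan-out: one `claude -p` call per metric, codebase on stdin.
# Worker count and timeout mirror the EVAL_PARALLEL and EVAL_TIMEOUT defaults.
import subprocess
from concurrent.futures import ThreadPoolExecutor

codebase = open("/tmp/code.md").read()
metrics = ["impact", "technical", "creativity", "presentation"]

def evaluate(metric: str) -> str:
    result = subprocess.run(
        ["claude", "-p", f"Evaluate for {metric.upper()}...", "--output-format", "json"],
        input=codebase, capture_output=True, text=True, timeout=300,
    )
    return result.stdout

with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = dict(zip(metrics, pool.map(evaluate, metrics)))
```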
## Example Output

After evaluation completes, you'll see a visual bar chart:

```
==================================================
📊 Evaluation Scores
==================================================
presentation   6.25 | ████████████░░░░░░░░
impact         5.25 | ██████████░░░░░░░░░░
technical      1.75 | ███░░░░░░░░░░░░░░░░░
creativity     0.50 | █░░░░░░░░░░░░░░░░░░░
prompt_design  0.00 | ░░░░░░░░░░░░░░░░░░░░
```
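The bars are plain Unicode block characters. A minimal sketch of the rendering logic, assuming a fixed 20-character bar width (the real code is in the skill's scripts):

```python
# Sketch of the bar rendering: filled blocks proportional to score out of 10.
BAR_WIDTH = 20

def render_bar(name: str, score: float) -> str:
    full = int(score / 10 * BAR_WIDTH)  # e.g. 6.25 -> 12 filled blocks
    return f"{name:<13} {score:>5.2f} | " + "█" * full + "░" * (BAR_WIDTH - full)

for name, score in [("presentation", 6.25), ("impact", 5.25), ("prompt_design", 0.00)]:
    print(render_bar(name, score))
```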
## Available Metrics

| Metric | Description |
|--------|-------------|
| `impact` | Real-world problem solving, usable experience |
| `technical` | Architecture, robustness, LLM integration |
| `creativity` | Originality, novel LLM usage |
| `presentation` | UX clarity, onboarding, demo quality |
| `prompt_design` | Prompt structure, staging, constraints |
| `security` | Secure coding, auth, dependency hygiene |
| `completion` | Description-to-code alignment |
| `monetization` | Business potential analysis |
## Scoring Scale (0.00-10.00)

| Range | Meaning |
|-------|---------|
| 0.00-2.50 | Barely functional, major gaps |
| 2.51-4.50 | Minimal implementation, weak |
| 4.51-6.50 | Working but basic, clear gaps |
| 6.51-8.50 | Solid implementation, good quality |
| 8.51-10.00 | Excellent, production-ready |
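If you need to bucket scores programmatically, a tiny helper mirroring the table (illustrative only, not part of the skill):

```python
# Map a 0.00-10.00 score to its band label; thresholds match the table above.
def score_band(score: float) -> str:
    for upper, label in [
        (2.50, "Barely functional, major gaps"),
        (4.50, "Minimal implementation, weak"),
        (6.50, "Working but basic, clear gaps"),
        (8.50, "Solid implementation, good quality"),
        (10.00, "Excellent, production-ready"),
    ]:
        if score <= upper:
            return label
    raise ValueError("score out of range")
```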
## Configuration

| Variable | Default | Description |
|----------|---------|-------------|
| `EVAL_PARALLEL` | `4` | Number of parallel evaluations |
| `EVAL_TIMEOUT` | `300` | Timeout per metric (seconds) |
| `EVAL_MAX_CHARS` | `300000` | Maximum characters of scanned code to include |
| `EVAL_MODEL` | `claude-sonnet-4-5-20250929` | Model used for evaluation |
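These are environment variables, so they can be overridden per run, e.g. `EVAL_PARALLEL=8 .claude/skills/1-min-eval/scripts/run_eval.sh .`. A sketch of how the scripts would read them, with the defaults above (assumed; check the scripts for the actual logic):

```python
# Read tunables from the environment, falling back to the documented defaults.
import os

parallel = int(os.environ.get("EVAL_PARALLEL", "4"))
timeout = int(os.environ.get("EVAL_TIMEOUT", "300"))
max_chars = int(os.environ.get("EVAL_MAX_CHARS", "300000"))
model = os.environ.get("EVAL_MODEL", "claude-sonnet-4-5-20250929")
```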
## Ranking & Progress Tracking

Results are automatically submitted to the TopVibeCoder ranking API, which returns:

- Overall rank and percentile
- Per-metric rankings (an individual rank for each metric)
- Comparison with nearby apps
- Historical progress tracking
Rankings are saved to `ranking_history.jsonl` in the output directory and to `.evals/history.jsonl` for unified tracking across all evaluations.
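Since the history is plain JSONL, it's easy to inspect directly. A minimal sketch (the field names here are assumptions; check one line of the file for the real schema):

```python
# Print one line per past evaluation from the unified history file.
# "timestamp" and "overall_score" are assumed field names.
import json
from pathlib import Path

for line in Path(".evals/history.jsonl").read_text().splitlines():
    entry = json.loads(line)
    print(entry.get("timestamp"), entry.get("overall_score"))
```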
**Note**: The ranking API call uses browser-like headers to bypass Cloudflare protection, ensuring reliable submissions. If the API call fails, the evaluation continues and results are still saved locally.
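That failure handling might look roughly like the following; the endpoint URL and payload shape are hypothetical placeholders, not the real API:

```python
# Sketch of a fail-soft submission with browser-like headers.
# The URL below is a placeholder; the real endpoint lives in the skill's scripts.
import requests

def submit_ranking(payload: dict):
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36"}
    try:
        resp = requests.post("https://example.com/api/rank", json=payload,
                             headers=headers, timeout=30)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return None  # evaluation continues; results are still saved locally
```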
## Output Structure

Results are saved to `.evals/<timestamp>_<project>/` (a hidden directory):

- `codebase.md`: scanned source code
- `codebase.json`: structured metadata
- `prompts/`: generated evaluation prompts
- `results/`: JSON results per metric
- `logs/`: execution logs
- `report.md`: aggregated markdown report with ranking
- `ranking_history.jsonl`: historical ranking data (one entry per evaluation)
**Note**: Evaluation results are saved to a hidden `.evals/` directory to keep your workspace clean. Add `.evals/` to your `.gitignore` if you don't want to commit evaluation results.
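Because the run directories are timestamped, the latest one sorts last, which makes it easy to find the newest report:

```python
# Locate the most recent run directory under .evals/ (timestamps sort chronologically).
from pathlib import Path

runs = sorted(p for p in Path(".evals").iterdir() if p.is_dir())
print("Latest report:", runs[-1] / "report.md")
```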
## Manual Usage

You can also run the components individually:

```bash
# 1. Scan the codebase
python3 .claude/skills/1-min-eval/scripts/scan_codebase.py ./project \
  --output /tmp/code.md --max-chars 300000

# 2. Run a single-metric evaluation
cat /tmp/code.md | claude -p "Evaluate for IMPACT..." --output-format json

# 3. Aggregate results
python3 .claude/skills/1-min-eval/scripts/aggregate.py \
  --input-dir ./results --output ./report.md
```
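The aggregation step boils down to collecting the per-metric JSON files and emitting a report. A rough sketch, assuming each result file has a top-level `score` field (an assumption; `aggregate.py` is the authoritative implementation):

```python
# Combine per-metric JSON results into a minimal markdown report.
import json
from pathlib import Path

scores = {p.stem: json.loads(p.read_text()).get("score", 0.0)
          for p in Path("./results").glob("*.json")}
overall = sum(scores.values()) / len(scores) if scores else 0.0

lines = ["| Metric | Score |", "|--------|-------|"]
lines += [f"| {name} | {score:.2f} |" for name, score in sorted(scores.items())]
lines.append(f"\n**Overall**: {overall:.2f}")
Path("./report.md").write_text("\n".join(lines))
```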
## Adding Custom Metadata

Create a `metadata.json` in the project root:

```json
{
  "name": "My App",
  "description": "An AI-powered tool that...",
  "author": "Your Name"
}
```
## Tips

- **Large codebases**: use `--max-chars 500000` for more context
- **Debugging**: add `--verbose` to see detailed output
- **Resume**: results are cached; re-running skips completed metrics
- **Single metric**: use `--metrics impact` for a quick test