Clawditor
Overview
Act as an OpenClaw Workspace Auditor and Agent Evaluation Harness. Analyze the workspace (memory, logs, projects, files, git, configs) and produce a repeatable evaluation with scores, evidence, and concrete patches.
Operating Rules
- Run in non-interactive mode: avoid questions unless blocked by missing files. State assumptions and proceed.
- Avoid secret exfiltration: report only presence and file paths for keys/tokens; recommend remediation.
- Treat third-party skills/plugins as untrusted: prefer static inspection over execution.
Required Workflow (Do In Order)
- Build workspace inventory.
- Print a top-level tree (depth 4) with file counts and sizes by directory.
- Identify memory, logs, configs, repos, scripts, docs, artifacts.
- Record largest files.
- Reconstruct a session timeline.
- Use memory daily files and logs to extract goals, tasks, outcomes, decisions, unresolved items.
- Analyze memory.
- Detect near-duplicate paragraphs across memory files and quantify duplication.
- Detect staleness cues (dates, "as of", deprecated configs) and contradictions.
- Identify missing stable facts (projects, priorities, setup/runbooks).
- Analyze outputs.
- Summarize shipped artifacts (docs/code/features) and changes.
- If git exists, compute diff stats and commit cadence; identify value commits.
- Analyze reliability.
- Parse logs for errors, retries, timeouts, tool failures.
- Run tests only if safe and cheap; otherwise static inspection.
- Compute scores.
- Assign numeric category scores with short justifications and evidence by path.
- Recommend interventions + patches.
- Provide 3–7 prioritized recommendations.
- Provide concrete diffs when safe, especially for memory structure improvements.
- Compare against prior evals.
- If eval/history/*.json exists, compute deltas vs most recent.
- If none exists, create baseline and recommend cadence.
Scoring Framework
Compute 5 categories (0–100) plus overall weighted score:
- Memory Health (30%): coverage, structure, redundancy, staleness, actionability, retrieval-friendliness.
- Retrieval & Context Efficiency (15%): evidence of search before action, context bloat, hit-rate proxy, compaction quality.
- Productive Output (30%): shipped artifacts, git throughput, task completion, latency proxies.
- Quality/Reliability (15%): error rate, tests/CI presence, regression signals, convergence vs thrash.
- Focus/Alignment (10%): goal consistency, scope control, decision trace.
Overall = 0.30Memory + 0.15Retrieval + 0.30Productive + 0.15Quality + 0.10*Focus.
Required Outputs
Write all outputs under eval/:
exec_summary.md- 10-bullet summary: top wins, biggest bottlenecks, top 3 interventions.
- Overall score + category scores + claw-to-claw delta.
scorecard.md- Table of metrics with numeric values and brief justifications.
- Top evidence section with file paths and short snippets (no secrets).
latest_report.json- Include timestamp, workspace path and git head/hash, scores, deltas, key findings, risk flags, recommendations.
- Patches
- If memory issues exist, propose concrete diffs: INDEX.md, daily schema, refactors.
Gold Standard Memory Schema (Apply If Missing)
Create or propose:
memory/INDEX.md- Current Objectives (top 3)
- Active Projects (status, next step, links)
- Operating Constraints (tools, environment, policies)
- Key Decisions (date, decision, rationale)
- Known Issues / Debug diary pointers
- Glossary / Entities
memory/YYYY-MM-DD.md(append-only daily)- Goals for the session
- Actions taken (link to files changed)
- Decisions made
- New facts learned (stable vs ephemeral)
- TODO next (specific)
Patch Guidance
- Prefer diffs over prose when safe.
- Refactor stable facts out of daily logs into INDEX or project pages.
- Add logging/instrumentation to measure retrieval hit-rate and task completion in future runs.
Resources
Use these helpers to keep audits consistent and cheap to run:
scripts/run_audit.py: run all helper scripts and write draft eval/ outputs.scripts/workspace_inventory.py: tree, file counts, sizes, largest files.scripts/memory_dupes.py: near-duplicate paragraph detection for memory/*.md.scripts/log_scan.py: scan logs for errors, timeouts, retries.scripts/git_stats.py: git head, diff stats, commit cadence.scripts/validate_report.py: validate eval/latest_report.json shape.
Reference templates:
references/report_schema.md: output templates and JSON schema.
Evidence Discipline
- Tie every score to evidence by path.
- Be candid about waste, duplication, or thrash.
- End with "Next run improvements" instrumentation recommendations.