ING Skill Generator — Complete Knowledge Base
Generate production-ready GitHub Copilot skills from ING documentation repositories. This skill transforms documentation-as-code into self-contained expert knowledge bases that senior engineers can use in their Spring Boot / Java 21 projects.
This skill includes:
- Skill generation from local cloned repos
- Evaluation framework with with-skill vs baseline comparison
- Grading agents for automated assertion checking
- Benchmark aggregation and interactive review viewer
- Description optimization for better triggering
At a high level, the process goes like this:
- User provides a path to a locally cloned ING documentation repository
- Analyze the repo structure, identify tool name, latest version, documentation files
- Extract and synthesize content following ING skill template (8 sections)
- Generate the SKILL.md with proper frontmatter and verbatim code examples
- Run test cases (with-skill vs baseline) to verify quality
- Review results, iterate based on feedback
- Optimize description for better triggering
Your job is to figure out where the user is in this process and help them progress. Maybe they have a freshly cloned repo and want a skill generated. Or maybe they already have a draft and want to improve it. Be flexible — if they say "just generate the skill, I don't need evals", do that instead.
Communicating with the User
The ING skill generator may be used by people across a range of familiarity with coding jargon. While most users are likely senior engineers, pay attention to context cues.
Default assumptions:
- "evaluation" and "benchmark" are OK
- "JSON" and "assertion" — look for cues the user knows these before using without explanation
- ING-specific terms (Baker, Merak, Kingsroad) — explain briefly if unclear
If in doubt, briefly define a term the first time you use it.
1. Process Overview
- Analyze the repository — identify the tool name, latest version, documentation structure
- Extract content — gather all relevant docs, configs, code examples, warnings
- Synthesize knowledge base — merge, dedupe, organize into standard sections
- Output the skill file — produce a valid SKILL.md with proper frontmatter
Creating a Skill from Repository
This is the core workflow for generating ING skills from documentation repositories.
Step 1: Capture Intent
Start by understanding what the user wants. Key questions:
- What repository? Get the local path to the cloned repo (e.g., `/tmp/baker-docs`, `~/projects/merak-sdk`)
- What tool/framework? Confirm the tool name if not obvious from the repo
- Any specific focus? Sometimes users want only certain parts (e.g., "just the API, not the tutorials")
- Run test cases? Suggest yes for complex repos, optional for simple ones
If the conversation already contains this info (e.g., "generate a skill from /tmp/baker"), extract it and confirm.
Step 2: Analyze the Repository
Before writing anything, understand the documentation structure:
# Find all documentation files
find <repo-path> -name "*.md" -o -name "*.adoc" -o -name "*.rst" | head -50
# Check for docs directory
ls -la <repo-path>/docs/ 2>/dev/null || ls -la <repo-path>/
# Look for version info
cat <repo-path>/pom.xml 2>/dev/null | grep -A1 "<version>" | head -5
cat <repo-path>/package.json 2>/dev/null | grep "version"
cat <repo-path>/CHANGELOG.md 2>/dev/null | head -20
Identify:
- Tool name — from repo name, README title, or project config
- Current version — from pom.xml, package.json, build.gradle, or badges
- Documentation structure — where the main docs live, how they're organized
- Code examples — where sample code is located
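The `grep` commands above match every `<version>` tag in a `pom.xml`, including those of dependencies and the parent, so the first hit is not always the project's own version. A minimal sketch of a stricter lookup, assuming the files have already been read into strings (the function names `version_from_pom` and `version_from_package_json` are hypothetical helpers, not part of this skill's scripts):

```python
import json
import xml.etree.ElementTree as ET
from typing import Optional


def _local(tag: str) -> str:
    """Drop the XML namespace prefix that Maven POMs usually carry."""
    return tag.rsplit("}", 1)[-1]


def version_from_pom(pom_xml: str) -> Optional[str]:
    """Return the project's own <version>, falling back to <parent><version>."""
    root = ET.fromstring(pom_xml)
    # Only look at direct children of <project>, so dependency versions are ignored.
    for child in root:
        if _local(child.tag) == "version":
            return (child.text or "").strip() or None
    for child in root:
        if _local(child.tag) == "parent":
            for sub in child:
                if _local(sub.tag) == "version":
                    return (sub.text or "").strip() or None
    return None


def version_from_package_json(text: str) -> Optional[str]:
    """Return the top-level "version" string of a package.json, if present."""
    version = json.loads(text).get("version")
    return version if isinstance(version, str) else None
```

Versions defined through Maven properties (for example `${revision}`) are returned unresolved, so treat the result as a hint to confirm against the changelog or badges.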
Step 3: Map Documentation to Sections
Create a mental map of which source files feed into which output sections:
| Source Files | → Output Section |
|---|---|
| README.md, docs/overview.md, docs/intro.md | 1. Overview |
| docs/concepts.md, docs/architecture.md | 2. Core Concepts |
| docs/configuration.md, application.properties | 3. Configuration Reference |
| examples/, docs/tutorials/, docs/guides/ | 4. Code Examples |
| docs/integration.md, docs/other-tools.md | 5. Integration |
| docs/troubleshooting.md, docs/faq.md, comments in code | 6. Pitfalls & Anti-patterns |
| docs/faq.md (or generate from common questions) | 7. FAQ |
| Terminology in any doc | 8. Glossary |
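The mapping above can also be held as data, which makes it easy to check which sections a given file feeds. This is a sketch only: the matching rules (prefix match for directories, exact match for paths, basename match for bare file names) are assumptions, and `sections_for` is a hypothetical helper.

```python
from pathlib import PurePosixPath
from typing import Dict, List

# Source files per output section, copied from the table above.
# The Glossary is omitted because its terms can come from any document.
SECTION_SOURCES: Dict[str, List[str]] = {
    "1. Overview": ["README.md", "docs/overview.md", "docs/intro.md"],
    "2. Core Concepts": ["docs/concepts.md", "docs/architecture.md"],
    "3. Configuration Reference": ["docs/configuration.md", "application.properties"],
    "4. Code Examples": ["examples/", "docs/tutorials/", "docs/guides/"],
    "5. Integration": ["docs/integration.md", "docs/other-tools.md"],
    "6. Pitfalls & Anti-patterns": ["docs/troubleshooting.md", "docs/faq.md"],
    "7. FAQ": ["docs/faq.md"],
}


def sections_for(path: str) -> List[str]:
    """Return every output section that a repo-relative path contributes to."""
    hits = []
    for section, sources in SECTION_SOURCES.items():
        for src in sources:
            if src.endswith("/"):
                matched = path.startswith(src)           # directory prefix
            elif "/" in src:
                matched = path == src                    # exact relative path
            else:
                matched = PurePosixPath(path).name == src  # bare file name
            if matched:
                hits.append(section)
                break
    return hits
```

A file such as `docs/faq.md` maps to two sections, which is a reminder to deduplicate when the same source feeds both Pitfalls and FAQ.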
Step 4: Extract and Synthesize
Read each relevant file and extract content:
- Copy code verbatim — never summarize or paraphrase code blocks
- Merge duplicates — if the same concept appears in multiple places, combine into one section
- Capture tribal knowledge — look for comments like "WARNING", "NOTE", "IMPORTANT", gotchas in examples
- Mark gaps — if a section is sparse, include it anyway with ⚠️ marker
Step 5: Generate the SKILL.md
Follow the exact template structure in Section 6 (Output Template). Key requirements:
- YAML frontmatter with `name` (kebab-case) and `description` (comprehensive, trigger-friendly)
- All 8 sections present, even if sparse
- Configuration tables with 4 columns: Property, Type, Default, Description
- No hyperlinks — all content inline
Skill Writing Guide
Anatomy of an ING Skill
skill-name/
├── SKILL.md (required)
│ ├── YAML frontmatter (name, description required)
│ └── Markdown instructions (8 sections)
└── Bundled Resources (optional)
├── scripts/ - Executable code for tasks
├── references/ - Additional docs loaded as needed
└── assets/ - Templates, examples
Progressive Disclosure
Skills use a three-level loading system:
- Metadata (name + description) — Always in context (~100 words)
- SKILL.md body — In context when skill triggers (<500 lines ideal)
- Bundled resources — As needed (read explicitly when required)
Key patterns:
- Keep SKILL.md under 500 lines; if approaching limit, move detail to `references/`
- Reference files clearly with guidance on when to read them
- For large reference files (>300 lines), include a table of contents
Writing Patterns
Prefer imperative form in instructions.
Defining output formats:
## Configuration Reference
ALWAYS use this exact table format:
| Property | Type | Default | Description |
|----------|------|---------|-------------|
Examples pattern:
## Code Examples
**Example 1: Basic Recipe**
```java
// Verbatim code from source
Recipe recipe = new Recipe("OrderProcess")
    .withInteraction(validateOrder)
    .withSensoryEvent(orderPlaced);
```
#### Writing Style
Explain **why** things are important rather than heavy-handed MUSTs. Use theory of mind and make the skill general, not narrow to specific examples. Write a draft, then review with fresh eyes and improve.
### Step 6: Test Cases
After generating the skill, create 2-3 realistic test prompts. Save to `evals/evals.json`:
```json
{
  "skill_name": "baker-framework",
  "evals": [
    {
      "id": 1,
      "name": "basic-recipe-creation",
      "prompt": "Generate a skill from the Baker docs at /tmp/baker",
      "expected_output": "SKILL.md with 8 sections, proper frontmatter",
      "files": [],
      "expectations": []
    }
  ]
}
```
See references/schemas.md for the full schema.
2. Naming Rules
Derive the canonical name directly from the repository:
| Source | Priority |
|---|---|
| Repository name | Highest (e.g., ing-bank/baker → baker) |
| Top-level README title | If repo name is generic |
| Project folder name | Fallback |
Critical rules:
- Use exactly what the project is called — no inventing, generalizing, or renaming
- Convert to kebab-case for the skill `name` field (e.g., `Baker Framework` → `baker-framework`)
- If the repo covers multiple tools, derive each tool's name from its module/subfolder/section title
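A minimal sketch of the kebab-case conversion described above. The function names are hypothetical, and the camelCase splitting is an extra assumption beyond the examples given in this section.

```python
import re


def repo_to_tool_name(repo: str) -> str:
    """Take the last path segment of an owner/repo string, e.g. ing-bank/baker."""
    return repo.rstrip("/").rsplit("/", 1)[-1]


def to_kebab_case(name: str) -> str:
    """Lowercase a project name and join its words with single hyphens."""
    # Insert a hyphen at lower-to-upper boundaries (assumption: split camelCase).
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "-", name.strip())
    # Collapse spaces, underscores and other punctuation into hyphens.
    hyphenated = re.sub(r"[^A-Za-z0-9]+", "-", spaced)
    return hyphenated.strip("-").lower()
```

The conversion only normalizes formatting; it does not rename or generalize the project, in line with the critical rules above.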
3. Versioning Strategy
When documentation contains multiple versions:
- Identify latest version using:
  - Explicit version numbers (e.g., `v4.1.0` > `v3.2.0`)
  - Release dates (e.g., `2024` > `2023`)
  - Folder/file naming (e.g., `docs-v2/` > `docs-v1/`)
  - `CHANGELOG.md` or release notes
- Use latest version as source of truth for all:
  - Configuration properties
  - API signatures
  - Code examples
  - Behavioral descriptions
- Document version changes when relevant:
  `📌 Changed in 4.0 — previous behavior was: synchronous execution only`
- Discard deprecated content unless it explains a still-relevant migration path
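Comparing version labels as plain strings gives wrong answers once a component reaches two digits (`v4.10.0` sorts before `v4.9.0`). A small sketch of a numeric comparison, assuming labels contain only numeric components; pre-release suffixes such as `-rc1` are not handled, and both function names are hypothetical.

```python
import re
from typing import Iterable, Tuple


def version_key(label: str) -> Tuple[int, ...]:
    """Turn 'v4.1.0' or 'docs-v2/' into a tuple of integers for comparison."""
    return tuple(int(part) for part in re.findall(r"\d+", label))


def latest_version(labels: Iterable[str]) -> str:
    """Return the label whose numeric components compare highest."""
    return max(labels, key=version_key)
```

For example, `latest_version(["v3.2.0", "v4.1.0"])` picks `v4.1.0`, and the same helper orders folder names like `docs-v1/` and `docs-v2/`.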
4. Content Extraction
4.1 What to Include
| Content Type | Handling |
|---|---|
| Code snippets | Copy verbatim — never summarize |
| Configuration blocks | Copy verbatim with all properties |
| API signatures | Copy verbatim with types and parameters |
| Architecture diagrams (textual) | Include as ASCII or describe structure |
| Warnings / gotchas | Always include, even if brief |
| Anti-patterns | Always include with explanations |
| Tribal knowledge | Capture implicit knowledge from comments, examples |
4.2 What to Exclude
- Hyperlinks (all content must be inline)
- File path references to the source repo
- Installation instructions for the docs themselves
- CI/CD pipeline configs for the docs repo
- Contributor guidelines (unless relevant to framework usage)
4.3 Merging and Deduplication
When a concept appears in multiple files:
- Identify all occurrences
- Merge into one coherent section
- Preserve all unique details from each source
- Remove redundant explanations
5. Handling Sparse Documentation
When documentation is incomplete or ambiguous:
- Add a clear marker:
  `⚠️ Documentation incomplete — verify with team`
- Include whatever partial information exists
- Note specific gaps:
  `⚠️ Default value not documented — verify in source code`
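One place where gaps show up often is the configuration table, where an undocumented default should be flagged instead of guessed. A sketch of rendering the 4-column table with a `⚠️ verify` placeholder, assuming each property is a dict with hypothetical keys `name`, `type`, `default`, and `description`, and that `None` means "not documented":

```python
from typing import Dict, List, Optional


def config_table(props: List[Dict[str, Optional[str]]]) -> str:
    """Render properties as a Property/Type/Default/Description Markdown table."""
    lines = [
        "| Property | Type | Default | Description |",
        "|----------|------|---------|-------------|",
    ]
    for prop in props:
        default = prop.get("default")
        # None means the docs give no default: flag it instead of inventing one.
        cell = f"`{default}`" if default is not None else "⚠️ verify"
        lines.append(
            f"| {prop['name']} | {prop['type']} | {cell} | {prop['description']} |"
        )
    return "\n".join(lines)
```

A documented default of null would be passed as the string `"null"`, which keeps it distinct from an unknown default.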
6. Output Template
CRITICAL: All 8 sections MUST be present in every generated skill. If documentation is sparse for a section, include it anyway with a ⚠️ marker noting what's missing.
The generated skill must follow this exact structure:
---
name: [tool-name-kebab-case]
description: >
  Expert skill for [Tool Name] — an ING-internal framework for [one-line purpose].
  Use this skill when working in any ING Spring Boot / Java 21 project that integrates
  with [Tool Name]. Covers configuration, recipes, integration patterns, pitfalls,
  and verbatim code examples.
---
# [Tool Name] — Complete Knowledge Base
## Table of Contents
1. Overview
2. Core Concepts
3. Configuration Reference
4. Code Examples
5. Integration with Other ING Tools
6. Pitfalls & Anti-patterns
7. FAQ
8. Glossary
---
## 1. Overview
[What the tool does and why it exists inside ING. MUST include current version number.]
### Current Version: X.Y.Z
[Version-specific notes if any]
## 2. Core Concepts
[Mental models, architecture decisions, key abstractions. Use tables for comparisons.]
## 3. Configuration Reference
**MANDATORY: Configuration tables MUST have exactly 4 columns: Property, Type, Default, Description**
| Property | Type | Default | Description |
|----------|------|---------|-------------|
| example.property | String | `null` | Purpose of property |
| example.timeout | int | `3000` | Timeout in milliseconds ⚠️ NOT seconds |
If default is unknown, use: `⚠️ verify` as the value.
## 4. Code Examples
[Verbatim snippets from source docs, organized by use case. NEVER summarize or paraphrase code.]
```java
// Example: Basic interaction definition
@RequiresIngredient("orderId")
@FiresEvent(OrderValidated.class)
public interface ValidateOrder {
    OrderValidated apply(String orderId);
}
```
## 5. Integration with Other ING Tools
[How this tool connects to Baker / Merak SDK / Kingsroad / other ING systems]
⚠️ If no integration docs exist, write: "No documented integrations. Check with the team for internal usage patterns."
## 6. Pitfalls & Anti-patterns
[Exact warnings from docs + implicit gotchas discovered in examples]
❌ Don't: [Anti-pattern description]
✅ Do: [Correct approach]
⚠️ If no pitfalls documented, write: "No pitfalls documented. Exercise standard caution with [relevant concerns]."
## 7. FAQ
Q: [Common question from docs or implied by content]
A: [Answer]
⚠️ If no FAQ exists, generate 2-3 questions based on likely user needs.
## 8. Glossary
| Term | Definition |
|---|---|
| [ING-specific term] | [Precise definition] |
⚠️ If no glossary exists, extract key terms from the documentation and define them.
---
## 7. Frontmatter Requirements
### 7.1 Name Field
- Kebab-case, lowercase
- Derived from repo/project name
- Example: `baker-framework`, `merak-sdk`, `kingsroad-cli`
### 7.2 Description Field
The description is the **primary trigger mechanism**. Make it comprehensive:
1. Start with what the skill is for
2. Include the framework's purpose
3. List specific contexts when to use
4. Mention related keywords that should trigger
**Good example:**
```yaml
description: >
  Expert skill for Baker — an ING-internal framework for orchestrating microservice-based
  process flows using a declarative recipe DSL. Use this skill when working in any ING
  Spring Boot / Java 21 project that integrates with Baker. Covers configuration, recipes,
  interactions, event handling, error strategies, testing, and verbatim code examples.
```
8. Target Audience
The generated skill targets:
- Senior engineers at ING working in Spring Boot / Java 21 projects on Kubernetes
- They know general software engineering
- They do not know ING-internal framework internals
- They need practical, actionable guidance
Write accordingly:
- Explain ING-specific concepts
- Don't explain basic Java/Spring concepts
- Include complete, working examples
- Highlight common mistakes
9. Quality Checklist
Before finalizing a generated skill, verify:
Structural Requirements (MANDATORY):
- YAML frontmatter present with `---` delimiters
- `name` field is kebab-case, derived from repo/project
- `description` field includes purpose and trigger keywords
- All 8 sections present (Overview through Glossary)
- Table of Contents matches section headings
Content Requirements:
- Latest version identified and stated in Overview
- All code snippets copied verbatim (no summarization)
- Configuration table has 4 columns: Property, Type, Default, Description
- No hyperlinks or external URLs anywhere
- Sparse sections marked with ⚠️ (not omitted)
- Version changes marked with 📌
- Content is self-contained and usable in isolation
Common Mistakes to Avoid:
- ❌ Omitting sections because docs are sparse (always include with ⚠️)
- ❌ Missing "Default" column in config tables
- ❌ Summarizing code instead of copying verbatim
- ❌ Using non-kebab-case names (e.g., "Baker_Framework" instead of "baker-framework")
- ❌ Including hyperlinks (convert to inline content)
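Several of the structural checks above are mechanical and can be scripted. The following is a rough sketch, not the bundled `quick_validate.py` (whose behavior is not described here); `check_skill` is a hypothetical helper, and the heading and link patterns are assumptions based on the output template.

```python
import re
from typing import List

REQUIRED_SECTIONS = [
    "Overview", "Core Concepts", "Configuration Reference", "Code Examples",
    "Integration with Other ING Tools", "Pitfalls & Anti-patterns", "FAQ", "Glossary",
]
CONFIG_HEADER = ["Property", "Type", "Default", "Description"]


def check_skill(text: str) -> List[str]:
    """Return a list of checklist violations found in a SKILL.md string."""
    problems = []
    match = re.match(r"---\n(.*?)\n---\n", text, re.DOTALL)
    front = match.group(1) if match else ""
    if not match:
        problems.append("missing YAML frontmatter")
    name = re.search(r"^name:\s*(\S+)\s*$", front, re.MULTILINE)
    if not name or not re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", name.group(1)):
        problems.append("name is missing or not kebab-case")
    if not re.search(r"^description:", front, re.MULTILINE):
        problems.append("description is missing")
    # Each section is expected as a level-2 heading: "## N. Title".
    for number, title in enumerate(REQUIRED_SECTIONS, start=1):
        pattern = rf"^## {number}\. {re.escape(title)}\s*$"
        if not re.search(pattern, text, re.MULTILINE):
            problems.append(f"missing section {number}. {title}")
    # Flag bare URLs and Markdown link syntax.
    if re.search(r"https?://|\]\(", text):
        problems.append("contains a hyperlink or URL")
    headers = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in text.splitlines()
        if line.lstrip().startswith("| Property")
    ]
    if not headers or any(header != CONFIG_HEADER for header in headers):
        problems.append("configuration table does not have the 4 required columns")
    return problems
```

An empty list means the structural checks passed; content checks such as verbatim code and version accuracy still need a human or grader pass.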
10. Example Workflow
When asked to generate a skill from a repo:
- Read the repo structure
  `find <repo-path> -name "*.md" -o -name "*.adoc" | head -50`
  `ls <repo-path>/docs/ 2>/dev/null || ls <repo-path>/`
- Identify the tool name and version
  - Check README.md, pom.xml, build.gradle, package.json
  - Look for version badges, changelog, releases
- Map documentation to output sections
  - Overview/Introduction → Section 1
  - Concepts/Architecture → Section 2
  - Configuration/Properties → Section 3
  - Examples/Tutorials → Section 4
  - Integration guides → Section 5
  - Troubleshooting/Warnings → Section 6
  - FAQ (if exists) → Section 7
  - Glossary/Terms → Section 8
- Extract and synthesize
  - Read each relevant file
  - Copy code blocks verbatim
  - Merge duplicate explanations
  - Note gaps with ⚠️ markers
- Generate the SKILL.md
  - Use exact template structure
  - Validate frontmatter YAML
  - Ensure no broken references
- Save to appropriate location
  - Default: `.github/skills/[tool-name]/SKILL.md`
11. Running and Evaluating Test Cases
After generating a skill, run test cases to verify quality. Put results in <skill-name>-workspace/ as a sibling to the skill directory.
Step 1: Spawn all runs (with-skill AND baseline) in parallel
For each test case, spawn two subagents in the same turn — one with the skill, one without:
With-skill run:
Execute this task:
- Skill path: <path-to-skill>/SKILL.md
- Task: <eval prompt - e.g., "Generate a skill from the Baker docs at /tmp/baker">
- Input files: <path to cloned repo>
- Save outputs to: <workspace>/iteration-<N>/eval-<name>/with_skill/outputs/
- Outputs to save: The generated SKILL.md file
IMPORTANT: First read the skill, then follow its instructions.
Baseline run (no skill):
Execute this task (no skill guidance - baseline):
- Task: <same eval prompt>
- Input files: <same repo path>
- Save outputs to: <workspace>/iteration-<N>/eval-<name>/without_skill/outputs/
- Outputs to save: The generated SKILL.md file
Write an eval_metadata.json for each test case:
{
"eval_id": 1,
"eval_name": "baker-repo-full",
"prompt": "Generate a skill from the Baker docs at /tmp/baker",
"assertions": [
"Output is a valid SKILL.md file with YAML frontmatter",
"Contains all 8 required sections",
"Code examples are verbatim from source"
]
}
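The output paths used in the two run prompts follow one pattern, so it can help to build them in one place. A sketch assuming the layout shown above; `output_dir` is a hypothetical helper name.

```python
from pathlib import PurePosixPath


def output_dir(workspace: str, iteration: int, eval_name: str, with_skill: bool) -> str:
    """Build <workspace>/iteration-<N>/eval-<name>/<variant>/outputs."""
    variant = "with_skill" if with_skill else "without_skill"
    path = (
        PurePosixPath(workspace)
        / f"iteration-{iteration}"
        / f"eval-{eval_name}"
        / variant
        / "outputs"
    )
    return str(path)
```

Using the same function for both runs keeps the with-skill and baseline outputs side by side under the same eval directory.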
Step 2: Draft assertions while runs are in progress
Good assertions for ING skill generation:
Structural:
- "Output has YAML frontmatter with --- delimiters"
- "Frontmatter contains 'name' field in kebab-case"
- "Contains all 8 sections: Overview through Glossary"
- "Configuration table has 4 columns: Property, Type, Default, Description"
Content:
- "Version number X.Y.Z is mentioned in Overview"
- "Code examples are verbatim (not summarized)"
- "No hyperlinks or external URLs"
- "Sparse sections marked with ⚠️"
Version handling:
- "Uses only latest version content"
- "Deprecated content excluded"
- "Version changes marked with 📌"
Step 3: Capture timing data as runs complete
When each subagent completes, save timing to timing.json:
{
"total_tokens": 84852,
"duration_ms": 23332,
"total_duration_seconds": 23.3
}
Step 4: Grade, aggregate, and launch viewer
- Grade each run: spawn a grader subagent that reads `agents/grader.md` and evaluates assertions. Save to `grading.json`:
{
"with_skill": {
"assertions": [
{"text": "Has YAML frontmatter", "passed": true, "evidence": "File starts with ---"},
{"text": "All 8 sections present", "passed": true, "evidence": "Found sections 1-8"}
],
"pass_rate": 1.0
},
"without_skill": {
"assertions": [
{"text": "Has YAML frontmatter", "passed": false, "evidence": "No --- delimiters found"},
{"text": "All 8 sections present", "passed": false, "evidence": "Missing FAQ, Glossary"}
],
"pass_rate": 0.0
},
"winner": "with_skill",
"notes": "Skill enforces ING template structure"
}
- Aggregate into benchmark:
python -m scripts.aggregate_benchmark <workspace>/iteration-N --skill-name <name>
- Launch the viewer:
python eval-viewer/generate_review.py <workspace>/iteration-N \
--skill-name "ing-skill-generator" \
--benchmark <workspace>/iteration-N/benchmark.json
For iteration 2+, add --previous-workspace <workspace>/iteration-<N-1>.
For headless environments, use --static <output.html> instead.
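The `pass_rate` and `winner` fields in `grading.json` follow directly from the assertion results. A sketch of that arithmetic, assuming the structure shown above; this is not the logic of `aggregate_benchmark`, and the `"tie"` outcome is an assumption for equal pass rates.

```python
from typing import Any, Dict, List


def pass_rate(assertions: List[Dict[str, Any]]) -> float:
    """Fraction of assertions with passed == True, rounded to two decimals."""
    if not assertions:
        return 0.0
    passed = sum(1 for assertion in assertions if assertion["passed"])
    return round(passed / len(assertions), 2)


def summarize(grading: Dict[str, Any]) -> Dict[str, Any]:
    """Compute both pass rates and name the better run."""
    rates = {
        variant: pass_rate(grading[variant]["assertions"])
        for variant in ("with_skill", "without_skill")
    }
    if rates["with_skill"] > rates["without_skill"]:
        winner = "with_skill"
    elif rates["with_skill"] < rates["without_skill"]:
        winner = "without_skill"
    else:
        winner = "tie"  # assumption: the source does not say how ties are reported
    return {**rates, "winner": winner}
```

Recomputing the rate from the assertion list is a cheap consistency check on grader output before aggregation.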
Step 5: Read feedback and improve
When the user reviews results, read feedback.json:
{
"reviews": [
{"run_id": "eval-1-with_skill", "feedback": "missing version number in overview"},
{"run_id": "eval-2-with_skill", "feedback": ""}
]
}
Empty feedback = user is satisfied. Focus improvements on cases with complaints.
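Picking out the runs that need attention is a simple filter over `feedback.json`. A minimal sketch, assuming the structure shown above; `runs_needing_work` is a hypothetical helper.

```python
import json
from typing import Dict


def runs_needing_work(feedback_json: str) -> Dict[str, str]:
    """Map run_id to feedback for every review with a non-empty comment."""
    reviews = json.loads(feedback_json).get("reviews", [])
    return {
        review["run_id"]: review["feedback"].strip()
        for review in reviews
        if review.get("feedback", "").strip()
    }
```

For the example above this returns only `eval-1-with_skill`, since the second review is empty.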
12. Improving the Skill
After running test cases and collecting feedback:
How to improve ING skill generation
- Check structural compliance: if outputs are missing sections or using wrong formats, strengthen the template instructions with explicit requirements.
- Check content extraction: if code examples are summarized instead of verbatim, add more emphasis on copying exactly. If warnings/pitfalls are missed, add instructions to scan for keywords like "WARNING", "NOTE", "⚠️".
- Check version handling: if old content leaks in, add clearer instructions to identify and exclude deprecated versions.
- Look at transcripts: read how the subagent processed the docs. If it's doing redundant work or missing files, adjust the workflow instructions.
- Look for repeated work: if all test runs independently wrote similar helper scripts or took the same approach, consider bundling that script in the skill's `scripts/` directory.
The iteration loop
- Apply improvements to `SKILL.md`
- Rerun all test cases into `iteration-<N+1>/`
- Launch viewer with `--previous-workspace` pointing to previous iteration
- Collect feedback, improve, repeat
Keep going until:
- User is happy
- All feedback is empty
- Pass rates are consistently high
13. Advanced: Blind Comparison
For situations where you want a more rigorous comparison between two versions of a skill (e.g., "is the new version actually better?"), there's a blind comparison system.
How it works
- Give two outputs to an independent agent without telling it which is which
- Let it judge quality based purely on the outputs
- Analyze why the winner won
Read agents/comparator.md and agents/analyzer.md for the details.
When to use
- Comparing a new skill version against the previous version
- Deciding between two different approaches to the same problem
- When quantitative metrics (pass rates) are similar but you sense a quality difference
This is optional and requires subagents. The human review loop is usually sufficient.
14. Description Optimization
The description field in SKILL.md frontmatter is the primary mechanism that determines whether Claude invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.
Step 1: Generate trigger eval queries
Create 20 eval queries — a mix of should-trigger (8-10) and should-not-trigger (8-10).
The queries must be realistic — the kind of thing a real Claude Code user would actually type. Include:
- File paths and personal context
- Different lengths and styles (formal, casual, typos)
- Edge cases, not clear-cut examples
Bad examples:
"Format this data"
"Extract text from PDF"
"Create a skill"
Good examples:
"ok so I just cloned the merak-sdk repo to /tmp/merak and my tech lead wants me to turn the docs into something our team can use in their IDE. can you help?"
"I have the Baker framework documentation at ~/projects/ing-bank/baker/docs. Need to create a Copilot skill that covers all the recipe patterns and error handling strategies."
"we're using kingsroad-cli internally and the docs are scattered across like 5 different markdown files. can you consolidate them into a skill?"
For should-trigger queries, think about coverage:
- Different phrasings of the same intent (formal, casual)
- Cases where the user doesn't explicitly say "skill" but clearly needs one
- Mentions of ING frameworks (Baker, Merak, Kingsroad)
- References to documentation repos, docs/ folders
For should-not-trigger queries, the most valuable are near-misses:
- Using the frameworks (not creating skills for them)
- "How do I configure Baker retry policies?" — needs Baker skill, not skill generator
- General Spring Boot/Java questions
- Other types of skill creation (not ING-specific)
The key: don't make should-not-trigger queries obviously irrelevant. "Write a fibonacci function" is too easy — it doesn't test anything. Negative cases should be genuinely tricky.
Step 2: Review with user
Present the eval set for review using the HTML template:
- Read the template from `assets/eval_review.html`
- Replace placeholders:
  - `__EVAL_DATA_PLACEHOLDER__` → the JSON array
  - `__SKILL_NAME_PLACEHOLDER__` → skill name
  - `__SKILL_DESCRIPTION_PLACEHOLDER__` → current description
- Write to temp file and open: `open /tmp/eval_review_ing-skill-generator.html`
- User edits queries, toggles should-trigger, then clicks "Export Eval Set"
- File downloads to `~/Downloads/eval_set.json`
This step matters — bad eval queries lead to bad descriptions.
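The placeholder substitution in this step amounts to three string replacements. A sketch under stated assumptions: the query objects use hypothetical `query` and `should_trigger` keys, and the template embeds the data placeholder where a JSON literal is valid. How `assets/eval_review.html` actually consumes the values is not shown here.

```python
import json
from typing import Any, Dict, List


def fill_review_template(
    template: str,
    queries: List[Dict[str, Any]],
    skill_name: str,
    description: str,
) -> str:
    """Substitute the three placeholders named in the review step."""
    return (
        template
        .replace("__EVAL_DATA_PLACEHOLDER__", json.dumps(queries, indent=2))
        .replace("__SKILL_NAME_PLACEHOLDER__", skill_name)
        .replace("__SKILL_DESCRIPTION_PLACEHOLDER__", description)
    )
```

The result can then be written to the temp file mentioned above and opened for review. Escaping for HTML or script contexts is left out of this sketch.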
Step 3: Run the optimization loop
Tell the user: "This will take some time — I'll run in the background and check periodically."
python -m scripts.run_loop \
--eval-set <workspace>/trigger-eval.json \
--skill-path <skill-path> \
--model <model-id-powering-this-session> \
--max-iterations 5 \
--verbose
Use the model ID from your system prompt so triggering tests match what the user experiences.
The script:
- Splits eval set into 60% train / 40% held-out test
- Evaluates current description (3 runs per query for reliability)
- Proposes improvements based on failures
- Re-evaluates each new description on both train and test
- Selects best by test score (not train) to avoid overfitting
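The split-and-select logic in the list above can be illustrated in a few lines. This is a sketch of the idea only, not the code of `run_loop`: the seeded shuffle, the field names `train_score` and `test_score`, and both function names are assumptions.

```python
import random
from typing import Any, Dict, List, Tuple


def split_eval_set(
    queries: List[Dict[str, Any]], train_fraction: float = 0.6, seed: int = 0
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Shuffle deterministically, then split into train and held-out test sets."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def select_best(candidates: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Pick the description with the highest held-out score, ignoring train score."""
    return max(candidates, key=lambda candidate: candidate["test_score"])
```

Selecting on the held-out score means a description that merely memorizes the training queries does not win, which is the overfitting guard the last bullet describes.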
How skill triggering works
Understanding this helps design better eval queries:
- Skills appear in Claude's `available_skills` list with name + description
- Claude decides whether to consult a skill based on that description
- Important: Claude only consults skills for tasks it can't easily handle on its own
This means:
- Simple queries like "read this file" may not trigger skills even if description matches
- Complex, multi-step, or specialized queries reliably trigger when description matches
- Your eval queries should be substantive enough that Claude would benefit from consulting a skill
Step 4: Apply results
Take best_description from the JSON output and update SKILL.md frontmatter. Show the user before/after and report scores.
Package and Present
If you have access to the present_files tool, package the skill:
python -m scripts.package_skill <path/to/skill-folder>
This creates a .skill file the user can install.
15. Claude.ai-Specific Instructions
In Claude.ai, the core workflow is the same (analyze repo → generate skill → test → review → improve), but some mechanics change because Claude.ai doesn't have subagents.
Running test cases: No subagents means no parallel execution. For each test case:
- Read the skill's SKILL.md
- Follow its instructions to accomplish the test prompt yourself
- Do them one at a time
This is less rigorous than independent subagents (you wrote the skill and you're running it), but it's a useful sanity check — the human review step compensates.
Reviewing results: If you can't open a browser (no display), skip the browser reviewer. Instead, present results directly in the conversation:
- Show the prompt and output for each test case
- If output is a file, save it and tell the user where to download
- Ask for feedback inline: "How does this look? Anything you'd change?"
Benchmarking: Skip quantitative benchmarking — it relies on baseline comparisons which aren't meaningful without subagents. Focus on qualitative feedback.
The iteration loop: Same as before — improve the skill, rerun test cases, ask for feedback — just without the browser reviewer.
Description optimization: Requires claude -p CLI which is only in Claude Code. Skip it on Claude.ai.
Blind comparison: Requires subagents. Skip it.
Packaging: package_skill.py works anywhere with Python. User can download the resulting .skill file.
Updating an existing skill: The user might want to update, not create. In this case:
- Preserve the original name — use unchanged
- Copy to writeable location before editing — installed paths may be read-only
- Stage in `/tmp/` first if packaging manually
16. Cowork-Specific Instructions
If you're in Cowork:
- Subagents work: the main workflow (spawn tests in parallel, run baselines, grade) all works. If timeouts are severe, run tests in series.
- No browser/display: use `--static <output_path>` to write standalone HTML instead of starting a server. Then proffer a link for the user to open.
- IMPORTANT: Generate the eval viewer BEFORE evaluating yourself. Use `generate_review.py` (not custom HTML). Get results in front of the human ASAP!
- Feedback via download: since there's no running server, "Submit All Reviews" downloads `feedback.json`. Read it from Downloads (may need to request access).
- Packaging works: `package_skill.py` just needs Python and filesystem.
- Description optimization: `run_loop.py` / `run_eval.py` should work fine since they use a `claude -p` subprocess, not a browser. Save this until the skill is fully finished and user agrees it's good.
- Updating existing skills: follow the update guidance in Claude.ai section above.
17. Reference Files
The following files support evaluation and improvement:
agents/
- `grader.md`: How to evaluate assertions against outputs
- `comparator.md`: Blind A/B comparison between versions
- `analyzer.md`: Analyze why one version beat another
references/
- `schemas.md`: JSON structures for evals.json, grading.json, benchmark.json
scripts/
- `aggregate_benchmark.py`: Combine grading results into benchmark stats
- `generate_report.py`: Create summary reports
- `improve_description.py`: Generate improved descriptions
- `run_eval.py`: Run trigger evaluation
- `run_loop.py`: Full optimization loop
- `quick_validate.py`: Fast validation checks
eval-viewer/
- `generate_review.py`: Generate interactive review page
- `viewer.html`: Template for review interface