eval-harness-updater

Refresh eval harnesses to keep live + fallback modes actionable under unstable environments.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

Copy this command and send it to your AI assistant to install the skill:

`npx skills add oimiragieo/agent-studio/oimiragieo-agent-studio-eval-harness-updater`

Eval Harness Updater

Refresh eval harnesses to keep live + fallback modes actionable under unstable environments.

Focus Areas

  • Prompt and parser drift

  • Timeout/partial-stream handling

  • SLO and regression gates

  • Dual-run fallback consistency
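The last focus area, dual-run fallback consistency, can be sketched as a check that the live path and the fallback path agree on their verdicts. This is a minimal illustration only; the result shapes and the `consistent` helper are assumptions, not part of this skill's actual API.

```python
# Hypothetical sketch of a dual-run fallback consistency check: run the same
# eval case through the live path and the fallback path, then compare verdicts.
def consistent(live_result: dict, fallback_result: dict) -> bool:
    # The fallback path may legitimately skip cases it cannot handle,
    # but on cases it does handle, its pass/fail verdict must match live.
    if fallback_result.get("skipped"):
        return True
    return live_result["passed"] == fallback_result["passed"]

assert consistent({"passed": True}, {"passed": True})
assert not consistent({"passed": True}, {"passed": False})
assert consistent({"passed": True}, {"skipped": True})
```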

Workflow

  • Resolve harness path.

  • Research test/eval best practices (Exa + arXiv — see Research Gate below).

  • Add RED regressions for parsing and timeout edge cases.

  • Patch minimal harness logic.

  • Validate eval outputs and CI gates.

  • Resolve companion artifact gaps (see Cross-Reference table below).
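The "Add RED regressions" step above can be sketched as tests that fail until the harness degrades gracefully on truncated streams. The function names `parse_model_output` and the test bodies are hypothetical stand-ins for the harness's own helpers, assumed here for illustration.

```python
# Hypothetical sketch of RED regressions for parsing edge cases.
import json

def parse_model_output(raw: str) -> dict:
    """Parse a model response, degrading gracefully on partial streams."""
    try:
        return {"ok": True, "data": json.loads(raw)}
    except json.JSONDecodeError:
        # Truncated/partial stream: report a parse failure instead of raising,
        # so the harness scores it as a miss rather than crashing.
        return {"ok": False, "data": None}

def test_parser_handles_truncated_stream():
    result = parse_model_output('{"answer": "42"')  # cut off mid-stream
    assert result["ok"] is False

def test_parser_accepts_complete_payload():
    result = parse_model_output('{"answer": "42"}')
    assert result["data"] == {"answer": "42"}

test_parser_handles_truncated_stream()
test_parser_accepts_complete_payload()
```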

Research Gate (Exa + arXiv — BOTH MANDATORY)

Before proposing harness changes, gather current best practices:

  • Use Exa for implementation and ecosystem patterns:

      • `mcp__Exa__web_search_exa({ query: 'LLM eval harness 2025 best practices' })`

      • `mcp__Exa__get_code_context_exa({ query: 'eval harness parser reliability timeout handling' })`

  • Search arXiv for academic research on evaluation methodology (mandatory):

      • Via Exa: `mcp__Exa__web_search_exa({ query: 'site:arxiv.org LLM evaluation harness 2024 2025' })`

      • Direct API: `WebFetch({ url: 'https://arxiv.org/search/?query=LLM+evaluation+harness&searchtype=all&start=0' })`

  • Record decisions, constraints, and non-goals in memory learnings.

arXiv is mandatory (not fallback) when topic involves: LLM evaluation, agent evaluation, SLO gates, regression testing methodology, or parser reliability.

Cross-Reference: Creator Ecosystem

This skill is part of the Creator Ecosystem. When research uncovers gaps, trigger the appropriate companion creator:

| Gap Discovered | Required Artifact | Creator to Invoke | When |
| --- | --- | --- | --- |
| Domain knowledge needs a reusable skill | skill | `Skill({ skill: 'skill-creator' })` | Gap is a full skill domain |
| Existing skill has incomplete coverage | skill update | `Skill({ skill: 'skill-updater' })` | Close skill exists but incomplete |
| Capability needs a dedicated agent | agent | `Skill({ skill: 'agent-creator' })` | Agent to own the capability |
| Existing agent needs capability update | agent update | `Skill({ skill: 'agent-updater' })` | Close agent exists but incomplete |
| Domain needs code/project scaffolding | template | `Skill({ skill: 'template-creator' })` | Reusable code patterns needed |
| Behavior needs pre/post execution guards | hook | `Skill({ skill: 'hook-creator' })` | Enforcement behavior required |
| Process needs multi-phase orchestration | workflow | `Skill({ skill: 'workflow-creator' })` | Multi-step coordination needed |
| Artifact needs structured I/O validation | schema | `Skill({ skill: 'schema-creator' })` | JSON schema for artifact I/O |
| User interaction needs a slash command | command | `Skill({ skill: 'command-creator' })` | User-facing shortcut needed |
| Repeated logic needs a reusable CLI tool | tool | `Skill({ skill: 'tool-creator' })` | CLI utility needed |
| Narrow/single-artifact capability | inline | Document within this artifact only | Too specific to generalize |

Iron Laws

  • ALWAYS run the Exa + arXiv research gate before updating any eval harness — updating without current external knowledge produces stale evaluation criteria.

  • NEVER remove existing evaluation criteria without replacing them with equivalent or better ones — reducing test coverage in an eval harness is a regression.

  • ALWAYS cross-reference the creator ecosystem for gaps before declaring the harness complete — missing companion artifacts (skills, agents, schemas) leave the harness unable to test new capabilities.

  • NEVER update an eval harness in isolation from the skill/agent it evaluates — harness and artifact must stay synchronized or the harness tests the wrong behavior.

  • ALWAYS preserve backward compatibility in eval scoring — changing scoring semantics without migrating historical baselines makes trend analysis impossible.
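The last law, migrating historical baselines when scoring semantics change, can be sketched as a simple rescaling pass. The scale values here (0–10 to 0–100) and the `migrate_baseline` helper are illustrative assumptions, not a prescribed migration path.

```python
# Hypothetical sketch of migrating historical baselines when a scoring scale
# changes (e.g. 0-10 to 0-100), so trend analysis stays comparable.
def migrate_baseline(scores: list[float], old_max: float = 10.0,
                     new_max: float = 100.0) -> list[float]:
    """Rescale historical scores onto the new scale, preserving ordering."""
    factor = new_max / old_max
    return [round(s * factor, 2) for s in scores]

history = [7.5, 8.0, 9.25]
assert migrate_baseline(history) == [75.0, 80.0, 92.5]
```

A real migration would also version the baseline files so old and new runs are never compared on mixed scales.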

Anti-Patterns

| Anti-Pattern | Why It Fails | Correct Approach |
| --- | --- | --- |
| Updating eval harness without research gate | Criteria based on outdated knowledge; misses recent evaluation methodology advances | Always run Exa + arXiv research before updating any eval criteria |
| Removing test cases to simplify the harness | Silently reduces coverage; regressions pass undetected | Only remove test cases when the behavior they tested has been deliberately removed |
| Harness and artifact in separate PRs | Harness tests wrong behavior the moment artifact changes; immediate test drift | Always update harness and artifact in the same commit |
| Changing scoring scale mid-project | Historical baselines become incomparable; trend analysis breaks | Define scoring scale once; create a migration if it must change |
| Declaring harness complete without companion check | Missing skills or schemas leave evaluation gaps | Always run companion artifact check before marking harness update complete |

Memory Protocol (MANDATORY)

Before starting: Read .claude/context/memory/learnings.md

After completing:

  • New evaluation pattern → .claude/context/memory/learnings.md

  • Evaluation gap found → .claude/context/memory/issues.md

  • Scoring decision made → .claude/context/memory/decisions.md

ASSUME INTERRUPTION: If it's not in memory, it didn't happen.

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

All four are Automation-category skills from repository sources flagged "Needs Review"; the upstream source provides no summary for any of them:

  • filesystem

  • slack-notifications

  • chrome-browser

  • text-to-sql