
Agent Evaluation Methods

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "agent-evaluation" with this command: npx skills add eyadsibai/ltk/eyadsibai-ltk-agent-evaluation

Agent evaluation requires different approaches than traditional software. Agents are non-deterministic, may take different valid paths, and lack single correct answers.

Key Finding: 95% Performance Drivers

Research on the BrowseComp benchmark found that three factors explain 95% of performance variance:

| Factor | Variance explained | Implication |
|---|---|---|
| Token usage | ~80% | More tokens = better performance |
| Tool calls | ~10% | More exploration helps |
| Model choice | ~5% | Better models multiply efficiency |

Implications: model upgrades beat raw token increases, and the dominance of token usage validates multi-agent architectures, which expand the effective token budget by running subagents in parallel.

Multi-Dimensional Rubric

| Dimension | Excellent | Good | Acceptable | Failed |
|---|---|---|---|---|
| Factual accuracy | All correct | Minor errors | Some errors | Wrong |
| Completeness | All aspects | Most aspects | Key aspects | Missing |
| Citation accuracy | All match | Most match | Some match | Wrong |
| Tool efficiency | Optimal | Good | Adequate | Wasteful |
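One way to turn a rubric like this into numbers is a fixed level-to-score mapping. A minimal sketch; the 1.0/0.75/0.5/0.0 values are an assumption for illustration, not prescribed by the source:

```python
# Hypothetical mapping from rubric levels to numeric scores.
LEVEL_SCORES = {"excellent": 1.0, "good": 0.75, "acceptable": 0.5, "failed": 0.0}

def rubric_score(ratings):
    """ratings: dict of dimension -> level, e.g. {'factual_accuracy': 'good'}.

    Returns per-dimension scores plus an unweighted overall average.
    """
    scores = {dim: LEVEL_SCORES[level] for dim, level in ratings.items()}
    scores["overall"] = sum(scores.values()) / len(ratings)
    return scores

scores = rubric_score({"factual_accuracy": "excellent", "completeness": "good",
                       "citation_accuracy": "good", "tool_efficiency": "acceptable"})
```

An unweighted average is the simplest aggregation; in practice you may want to weight factual accuracy more heavily than tool efficiency.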

LLM-as-Judge

```python
evaluation_prompt = """
Task: {task_description}
Agent Output: {agent_output}
Ground Truth: {ground_truth}

Evaluate on:
1. Factual accuracy (0-1)
2. Completeness (0-1)
3. Citation accuracy (0-1)
4. Tool efficiency (0-1)

Provide scores and reasoning.
"""
```
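The judge's free-text reply has to be parsed back into scores. A minimal sketch, assuming the judge emits one `Dimension: score` pair per line (the reply shown is a made-up example, not real model output):

```python
import re

def parse_judge_response(text):
    """Extract 'dimension: score' pairs from a judge reply and add an average."""
    scores = {}
    for match in re.finditer(r"(\w[\w ]*?):\s*([01](?:\.\d+)?)", text):
        scores[match.group(1).strip().lower()] = float(match.group(2))
    if scores:
        scores["overall"] = sum(scores.values()) / len(scores)
    return scores

# Hypothetical judge reply following the prompt's four dimensions.
reply = """Factual accuracy: 0.9
Completeness: 0.8
Citation accuracy: 1.0
Tool efficiency: 0.7"""
scores = parse_judge_response(reply)
```

In production, asking the judge for structured (e.g. JSON) output is more robust than regex parsing of free text.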

Test Set Design

```python
test_set = [
    {"name": "simple", "complexity": "simple",
     "input": "What is the capital of France?"},
    {"name": "medium", "complexity": "medium",
     "input": "Compare Apple and Microsoft revenue"},
    {"name": "complex", "complexity": "complex",
     "input": "Analyze Q1-Q4 sales trends"},
    {"name": "very_complex", "complexity": "very_complex",
     "input": "Research AI tech, evaluate impact, recommend strategy"},
]
```

Evaluation Pipeline

```python
def evaluate_agent(agent, test_set):
    results = []
    for test in test_set:
        output = agent.run(test["input"])
        scores = llm_judge(output, test)
        results.append({
            "test": test["name"],
            "scores": scores,
            "passed": scores["overall"] >= 0.7,
        })
    return results
```
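The pipeline can be exercised end to end with stand-ins before wiring in a real model. A sketch where both `StubAgent` and the `llm_judge` stub are hypothetical placeholders:

```python
class StubAgent:
    """Stand-in for a real agent; echoes the prompt."""
    def run(self, prompt):
        return f"Answer to: {prompt}"

def llm_judge(output, test):
    # Stand-in: a real judge would send the rubric prompt to an LLM.
    # Here we pretend simple tasks score well and others fall short.
    return {"overall": 0.9 if test["complexity"] == "simple" else 0.6}

def evaluate_agent(agent, test_set):
    results = []
    for test in test_set:
        output = agent.run(test["input"])
        scores = llm_judge(output, test)
        results.append({"test": test["name"], "scores": scores,
                        "passed": scores["overall"] >= 0.7})
    return results

tests = [
    {"name": "simple", "complexity": "simple",
     "input": "What is the capital of France?"},
    {"name": "medium", "complexity": "medium",
     "input": "Compare Apple and Microsoft revenue"},
]
results = evaluate_agent(StubAgent(), tests)
```

Swapping the stubs for a real agent and judge changes nothing in the pipeline itself, which keeps the harness testable.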

Complexity Stratification

| Level | Characteristics |
|---|---|
| Simple | Single tool call |
| Medium | Multiple tool calls |
| Complex | Many calls, ambiguity |
| Very Complex | Extended interaction, deep reasoning |

Context Engineering Evaluation

Test context strategies systematically:

  • Run agents with different strategies on same tests

  • Compare quality scores, token usage, efficiency

  • Identify degradation cliffs at different context sizes
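The comparison above can be sketched as a grid run over strategies and context sizes. Everything here is hypothetical: the strategy names, the `run_with_strategy` stand-in, and the 100k-token cliff it models:

```python
def run_with_strategy(strategy, context_tokens):
    """Stand-in for a real agent run; returns (quality_score, tokens_used).

    Models a degradation cliff: quality drops sharply past 100k context tokens.
    """
    base = {"full_history": 0.9, "summarize": 0.85, "retrieve": 0.8}[strategy]
    penalty = 0.3 if context_tokens > 100_000 else 0.0
    return base - penalty, context_tokens

# Same tests, every strategy, several context sizes -> comparable grid.
grid = {}
for strategy in ("full_history", "summarize", "retrieve"):
    for size in (10_000, 50_000, 150_000):
        quality, tokens = run_with_strategy(strategy, size)
        grid[(strategy, size)] = quality
```

Scanning the grid for the size at which quality collapses per strategy is what "identify degradation cliffs" means in practice.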

Continuous Evaluation

  • Run evaluations on all agent changes

  • Track metrics over time

  • Set alerts for quality drops

  • Sample production interactions
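A simple alerting rule over tracked scores can be sketched as a windowed comparison; the window size, drop threshold, and score history below are illustrative assumptions:

```python
def quality_alert(history, window=5, drop_threshold=0.05):
    """Alert when the recent average overall score falls more than
    drop_threshold below the preceding window's average."""
    if len(history) < 2 * window:
        return False  # not enough data to compare two windows
    prior = sum(history[-2 * window:-window]) / window
    recent = sum(history[-window:]) / window
    return prior - recent > drop_threshold

# Hypothetical per-run overall scores from continuous evaluation.
history = [0.82, 0.84, 0.83, 0.85, 0.84,   # prior window
           0.80, 0.78, 0.77, 0.76, 0.75]   # recent window
alert = quality_alert(history)
```

Windowed averages smooth out single-run noise, so an alert indicates a sustained drop rather than one bad evaluation.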

Avoiding Pitfalls

| Pitfall | Solution |
|---|---|
| Path overfitting | Evaluate outcomes, not steps |
| Ignoring edge cases | Include diverse scenarios |
| Single metric | Multi-dimensional rubrics |
| Ignoring context | Test realistic context sizes |
| No human review | Supplement automated eval |

Best Practices

  • Use multi-dimensional rubrics

  • Evaluate outcomes, not specific paths

  • Cover complexity levels

  • Test with realistic context sizes

  • Run evaluations continuously

  • Supplement LLM with human review

  • Track metrics for trends

  • Set clear pass/fail thresholds
