reddi-agent-evaluation

reddi.tech fork of agent-evaluation. Testing and benchmarking LLM agents including behavioral testing, capability assessment, reliability metrics, and production monitoring. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "reddi-agent-evaluation" with this command: npx skills add nissan/reddi-agent-evaluation

Agent Evaluation

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

Capabilities

  • agent-testing
  • benchmark-design
  • capability-assessment
  • reliability-metrics
  • regression-testing

Requirements

  • testing-fundamentals
  • llm-fundamentals

Patterns

Statistical Test Evaluation

Run tests multiple times and analyze result distributions

Behavioral Contract Testing

Define and test agent behavioral invariants

Adversarial Testing

Actively try to break agent behavior

Anti-Patterns

❌ Single-Run Testing

❌ Only Happy Path Tests

❌ Output String Matching

⚠️ Sharp Edges

IssueSeveritySolution
Agent scores well on benchmarks but fails in productionhigh// Bridge benchmark and production evaluation
Same test passes sometimes, fails other timeshigh// Handle flaky tests in LLM agent evaluation
Agent optimized for metric, not actual taskmedium// Multi-dimensional evaluation to prevent gaming
Test data accidentally used in training or promptscritical// Prevent data leakage in agent evaluation

Related Skills

Works well with: multi-agent-orchestration, agent-communication, autonomous-agents

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Automation

Google Maps Reviews Api Skill

This skill is designed to help users automatically extract reviews from Google Maps via the Google Maps Reviews API. Agent should proactively apply this skil...

Registry SourceRecently Updated
1.3K2phheng
Automation

Telegram Topic Rename

Rename Telegram forum topics and change icons via Bot API. Use when user asks to name/rename a topic, change topic title, update topic icon, or says "命名这个topic", "给话题起个名", "换个图标". Requires TELEGRAM_BOT_TOKEN environment variable.

Registry SourceRecently Updated
Automation

Mission Control

macOS-native web dashboard for monitoring and controlling your OpenClaw agent. Live chat, cron management, task workshop, scout engine, cost tracking, and more.

Registry SourceRecently Updated
Automation

AI Remote Viewing

Guide an AI agent through a full blind Remote Viewing session using the Resonant Contact Protocol (AI IS-BE) and a compact Field Perception Lexicon.

Registry SourceRecently Updated