
Evaluation & Monitoring

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.

To install this skill, run:

npx skills add lauraflorentin/skills-marketplace/lauraflorentin-skills-marketplace-evaluation-monitoring


Evaluation measures how well an agent performs (correctness, helpfulness, safety), usually against a held-out test dataset. Monitoring tracks how the system behaves in a live environment (latency, errors, cost). Both are essential to the lifecycle management of AI systems.

When to Use

  • CI/CD: Rejecting code changes if they drop accuracy below a threshold.

  • A/B Testing: Comparing Prompt A vs. Prompt B to see which users prefer.

  • Cost Auditing: Understanding which agents or tools are driving up the bill.

  • Drift Detection: Noticing if the model starts hallucinating more often on new data.
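The CI/CD case above can be sketched as a simple threshold gate. This is a minimal sketch, not a prescribed implementation: `run_eval_suite` is a hypothetical entry point standing in for whatever harness runs your agent over a golden test set, and the threshold value is illustrative.

```python
import sys

ACCURACY_THRESHOLD = 0.90  # reject code changes below this bar (illustrative value)

def run_eval_suite() -> float:
    # Hypothetical placeholder: a real suite would run the agent over a
    # golden test set and return accuracy in [0, 1]. Stubbed so the gate
    # logic is runnable.
    return 0.91

def ci_gate() -> int:
    accuracy = run_eval_suite()
    if accuracy < ACCURACY_THRESHOLD:
        print(f"FAIL: accuracy {accuracy:.2%} below threshold {ACCURACY_THRESHOLD:.2%}")
        return 1  # non-zero exit code fails the CI job
    print(f"PASS: accuracy {accuracy:.2%}")
    return 0

if __name__ == "__main__":
    sys.exit(ci_gate())
```

The gate returns a process exit code, so any CI system (GitHub Actions, Jenkins, etc.) can reject the change without knowing anything about the evaluation internals.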

Use Cases

  • LLM-as-a-Judge: Using GPT-4 to grade the answers of a smaller model.

  • Latency Tracking: Measuring the time-to-first-token (TTFT) and total generation time.

  • Topic Clustering: Analyzing user queries to see what topics are trending or failing.
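The latency-tracking case can be illustrated by timing a streaming response: TTFT is the delay until the first token arrives, while total generation time spans the whole stream. In this sketch, `stream_tokens` is a stand-in generator for a real streaming LLM API.

```python
import time

def stream_tokens():
    # Stand-in for a streaming LLM API: yields tokens with artificial delay.
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)
        yield token

def measure_latency(stream):
    start = time.perf_counter()
    ttft = None
    tokens = []
    for token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # time-to-first-token
        tokens.append(token)
    total = time.perf_counter() - start  # total generation time
    return ttft, total, "".join(tokens)

ttft, total, text = measure_latency(stream_tokens())
print(f"TTFT: {ttft * 1000:.1f} ms, total: {total * 1000:.1f} ms")
```

`time.perf_counter` is used rather than `time.time` because it is a monotonic high-resolution clock, which is what you want for interval measurements.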

Implementation Pattern

def evaluate_agent(agent, test_set):
    score = 0
    total = len(test_set)

    for case in test_set:
        # Run the agent on the test input
        prediction = agent.run(case.input)

        # Compare against the golden answer (simple exact or fuzzy match)
        if is_correct(prediction, case.expected):
            score += 1
        else:
            # Fall back to semantic evaluation with an LLM judge,
            # which can award partial credit
            judge_score = llm_judge.evaluate(prediction, case.expected)
            score += judge_score

    return score / total
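The pattern above can be exercised end to end with stubbed dependencies. Everything here is a placeholder for illustration: `EchoAgent`, `is_correct`, and `StubJudge` are hypothetical stand-ins for your real agent, matcher, and LLM judge.

```python
from dataclasses import dataclass

@dataclass
class Case:
    input: str
    expected: str

class EchoAgent:
    # Trivial stand-in agent with canned answers.
    def run(self, prompt):
        return {"2+2?": "4"}.get(prompt, "unknown")

def is_correct(prediction, expected):
    # Simple exact match, case-insensitive.
    return prediction.strip().lower() == expected.strip().lower()

class StubJudge:
    # Stand-in LLM judge: awards partial credit in [0, 1].
    def evaluate(self, prediction, expected):
        return 0.5 if expected.lower() in prediction.lower() else 0.0

llm_judge = StubJudge()

def evaluate_agent(agent, test_set):
    score = 0
    for case in test_set:
        prediction = agent.run(case.input)
        if is_correct(prediction, case.expected):
            score += 1
        else:
            score += llm_judge.evaluate(prediction, case.expected)
    return score / len(test_set)

tests = [Case("2+2?", "4"), Case("capital of France?", "Paris")]
print(evaluate_agent(EchoAgent(), tests))  # 0.5: one exact match, one miss
```

Mixing exact match with judge-based partial credit, as here, keeps the cheap path fast while still scoring answers that are correct but phrased differently.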

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals:

  • human-in-the-loop (General — Repository Source, Needs Review; no summary provided by upstream source)

  • planning (General — Repository Source, Needs Review; no summary provided by upstream source)

  • reflection (General — Repository Source, Needs Review; no summary provided by upstream source)

  • parallelization (General — Repository Source, Needs Review; no summary provided by upstream source)