azure-ai-evaluation-py

Azure AI Evaluation SDK for Python. Use for evaluating generative AI applications with quality, safety, and custom evaluators. Triggers: "azure-ai-evaluation", "evaluators", "GroundednessEvaluator", "evaluate", "AI quality metrics".

Safety Notice

This listing is from the official public ClawHub registry. Review SKILL.md and referenced scripts before running.

Copy this and send it to your AI assistant to learn

Install skill "azure-ai-evaluation-py" with this command: npx skills add thegovind/azure-ai-evaluation-py

Azure AI Evaluation SDK for Python

Assess generative AI application performance with built-in and custom evaluators.

Installation

pip install azure-ai-evaluation

# With remote evaluation support
pip install azure-ai-evaluation[remote]

Environment Variables

# For AI-assisted evaluators
AZURE_OPENAI_ENDPOINT=https://<resource>.openai.azure.com
AZURE_OPENAI_API_KEY=<your-api-key>
AZURE_OPENAI_DEPLOYMENT=gpt-4o-mini

# For Foundry project integration
AIPROJECT_CONNECTION_STRING=<your-connection-string>

Built-in Evaluators

Quality Evaluators (AI-Assisted)

from azure.ai.evaluation import (
    GroundednessEvaluator,
    RelevanceEvaluator,
    CoherenceEvaluator,
    FluencyEvaluator,
    SimilarityEvaluator,
    RetrievalEvaluator
)

# Initialize with Azure OpenAI model config
model_config = {
    "azure_endpoint": os.environ["AZURE_OPENAI_ENDPOINT"],
    "api_key": os.environ["AZURE_OPENAI_API_KEY"],
    "azure_deployment": os.environ["AZURE_OPENAI_DEPLOYMENT"]
}

groundedness = GroundednessEvaluator(model_config)
relevance = RelevanceEvaluator(model_config)
coherence = CoherenceEvaluator(model_config)

Quality Evaluators (NLP-based)

from azure.ai.evaluation import (
    F1ScoreEvaluator,
    RougeScoreEvaluator,
    BleuScoreEvaluator,
    GleuScoreEvaluator,
    MeteorScoreEvaluator
)

f1 = F1ScoreEvaluator()
rouge = RougeScoreEvaluator()
bleu = BleuScoreEvaluator()

Safety Evaluators

from azure.ai.evaluation import (
    ViolenceEvaluator,
    SexualEvaluator,
    SelfHarmEvaluator,
    HateUnfairnessEvaluator,
    IndirectAttackEvaluator,
    ProtectedMaterialEvaluator
)

violence = ViolenceEvaluator(azure_ai_project=project_scope)
sexual = SexualEvaluator(azure_ai_project=project_scope)

Single Row Evaluation

from azure.ai.evaluation import GroundednessEvaluator

groundedness = GroundednessEvaluator(model_config)

result = groundedness(
    query="What is Azure AI?",
    context="Azure AI is Microsoft's AI platform...",
    response="Azure AI provides AI services and tools."
)

print(f"Groundedness score: {result['groundedness']}")
print(f"Reason: {result['groundedness_reason']}")

Batch Evaluation with evaluate()

from azure.ai.evaluation import evaluate

result = evaluate(
    data="test_data.jsonl",
    evaluators={
        "groundedness": groundedness,
        "relevance": relevance,
        "coherence": coherence
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${data.context}",
                "response": "${data.response}"
            }
        }
    }
)

print(result["metrics"])

Composite Evaluators

from azure.ai.evaluation import QAEvaluator, ContentSafetyEvaluator

# All quality metrics in one
qa_evaluator = QAEvaluator(model_config)

# All safety metrics in one
safety_evaluator = ContentSafetyEvaluator(azure_ai_project=project_scope)

result = evaluate(
    data="data.jsonl",
    evaluators={
        "qa": qa_evaluator,
        "content_safety": safety_evaluator
    }
)

Evaluate Application Target

from azure.ai.evaluation import evaluate
from my_app import chat_app  # Your application

result = evaluate(
    data="queries.jsonl",
    target=chat_app,  # Callable that takes query, returns response
    evaluators={
        "groundedness": groundedness
    },
    evaluator_config={
        "default": {
            "column_mapping": {
                "query": "${data.query}",
                "context": "${outputs.context}",
                "response": "${outputs.response}"
            }
        }
    }
)

Custom Evaluators

Code-Based

from azure.ai.evaluation import evaluator

@evaluator
def word_count_evaluator(response: str) -> dict:
    return {"word_count": len(response.split())}

# Use in evaluate()
result = evaluate(
    data="data.jsonl",
    evaluators={"word_count": word_count_evaluator}
)

Prompt-Based

from azure.ai.evaluation import PromptChatTarget

class CustomEvaluator:
    def __init__(self, model_config):
        self.model = PromptChatTarget(model_config)
    
    def __call__(self, query: str, response: str) -> dict:
        prompt = f"Rate this response 1-5: Query: {query}, Response: {response}"
        result = self.model.send_prompt(prompt)
        return {"custom_score": int(result)}

Log to Foundry Project

from azure.ai.projects import AIProjectClient
from azure.identity import DefaultAzureCredential

project = AIProjectClient.from_connection_string(
    conn_str=os.environ["AIPROJECT_CONNECTION_STRING"],
    credential=DefaultAzureCredential()
)

result = evaluate(
    data="data.jsonl",
    evaluators={"groundedness": groundedness},
    azure_ai_project=project.scope  # Logs results to Foundry
)

print(f"View results: {result['studio_url']}")

Evaluator Reference

EvaluatorTypeMetrics
GroundednessEvaluatorAIgroundedness (1-5)
RelevanceEvaluatorAIrelevance (1-5)
CoherenceEvaluatorAIcoherence (1-5)
FluencyEvaluatorAIfluency (1-5)
SimilarityEvaluatorAIsimilarity (1-5)
RetrievalEvaluatorAIretrieval (1-5)
F1ScoreEvaluatorNLPf1_score (0-1)
RougeScoreEvaluatorNLProuge scores
ViolenceEvaluatorSafetyviolence (0-7)
SexualEvaluatorSafetysexual (0-7)
SelfHarmEvaluatorSafetyself_harm (0-7)
HateUnfairnessEvaluatorSafetyhate_unfairness (0-7)
QAEvaluatorCompositeAll quality metrics
ContentSafetyEvaluatorCompositeAll safety metrics

Best Practices

  1. Use composite evaluators for comprehensive assessment
  2. Map columns correctly — mismatched columns cause silent failures
  3. Log to Foundry for tracking and comparison across runs
  4. Create custom evaluators for domain-specific metrics
  5. Use NLP evaluators when you have ground truth answers
  6. Safety evaluators require Azure AI project scope
  7. Batch evaluation is more efficient than single-row loops

Reference Files

FileContents
references/built-in-evaluators.mdDetailed patterns for AI-assisted, NLP-based, and Safety evaluators with configuration tables
references/custom-evaluators.mdCreating code-based and prompt-based custom evaluators, testing patterns
scripts/run_batch_evaluation.pyCLI tool for running batch evaluations with quality, safety, and custom evaluators

Source Transparency

This detail page is rendered from real SKILL.md content. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

Coding

GitHub Monitor

Monitor one or more GitHub repositories and send low-noise alerts with configurable policy modes (major_only, balanced, verbose). Use when setting up recurri...

Registry SourceRecently Updated
Coding

DevOps Bridge

Unified developer operations bridge connecting GitHub, CI/CD (GitHub Actions), Slack, Discord, and issue trackers (Linear, Jira, GitHub Issues) into cross-to...

Registry SourceRecently Updated
Coding

Google Keep

Read, create, edit, search, and manage Google Keep notes and lists via CLI.

Registry SourceRecently Updated
Coding

Task Panner Validator for Agents

Provides secure task planning, validation, approval, and execution for AI agents with safety checks, rollback, dry runs, and error handling using pure Python.

Registry SourceRecently Updated