langsmith-evaluator

Use this skill for ANY question about CREATING evaluators. Covers creating custom metrics, LLM as Judge evaluators, code-based evaluators, and uploading evaluation logic to LangSmith. Does NOT cover RUNNING evaluations.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "langsmith-evaluator" with this command: npx skills add jackjin1997/clawforge/jackjin1997-clawforge-langsmith-evaluator

LangSmith Evaluator

Create evaluators to measure agent performance on your datasets. LangSmith supports two types: LLM as Judge (uses LLM to grade outputs) and Custom Code (deterministic logic).

Setup

Environment Variables

LANGSMITH_API_KEY=lsv2_pt_your_api_key_here          # Required
LANGSMITH_WORKSPACE_ID=your-workspace-id              # Optional: for org-scoped keys
OPENAI_API_KEY=your_openai_key                        # For LLM as Judge

Dependencies

pip install langsmith langchain-openai python-dotenv

Evaluator Format

Evaluators support two function signatures:

Method 1: Dict Parameters (For running evaluations locally):

def evaluator_name(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Evaluate a single prediction."""
    user_query = inputs.get("query", "")
    agent_response = outputs.get("expected_response", "")
    expected = reference_outputs.get("expected_response", "") if reference_outputs else None

    return {
        "key": "metric_name",    # Metric identifier
        "score": 0.85,           # Number or boolean
        "comment": "Reason..."   # Optional explanation
    }
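A minimal, runnable sketch of the dict signature. The `length_ok` metric and the 500-character threshold are illustrative assumptions, not part of LangSmith:

```python
def response_length_evaluator(inputs: dict, outputs: dict, reference_outputs: dict = None) -> dict:
    """Hypothetical metric: pass if the response is non-empty and at most 500 chars."""
    response = outputs.get("expected_response", "")
    ok = 0 < len(response) <= 500
    return {
        "key": "length_ok",
        "score": 1 if ok else 0,
        "comment": f"length={len(response)}",
    }

result = response_length_evaluator(
    inputs={"query": "What is LangSmith?"},
    outputs={"expected_response": "A platform for tracing and evaluating LLM apps."},
)
```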

Method 2: Run/Example Parameters (For uploading to LangSmith):

def evaluator_name(run, example):
    """Evaluate using run/example dicts.

    Args:
        run: Dict with run["outputs"] containing agent outputs
        example: Dict with example["outputs"] containing expected outputs
    """
    agent_response = run["outputs"].get("expected_response", "")
    expected = example["outputs"].get("expected_response", "")

    return {
        "metric_name": 0.85,      # Metric name as key directly
        "comment": "Reason..."    # Optional explanation
    }
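If you prototype with the dict signature and later want to upload, a small adapter can bridge the two signatures. `to_run_example` is a hypothetical helper sketched here, not a LangSmith API:

```python
def to_run_example(evaluator):
    """Wrap a Method 1 (dict-signature) evaluator so it exposes the
    Method 2 (run, example) signature expected for upload."""
    def wrapped(run, example):
        result = evaluator(run.get("inputs", {}), run["outputs"], example["outputs"])
        out = {result["key"]: result["score"]}  # metric name becomes the key directly
        if "comment" in result:
            out["comment"] = result["comment"]
        return out
    return wrapped

# Example: adapt a trivial dict-signature evaluator
def always_pass(inputs, outputs, reference_outputs=None):
    return {"key": "always_pass", "score": 1, "comment": "ok"}

adapted = to_run_example(always_pass)
result = adapted({"outputs": {}}, {"outputs": {}})
```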

LLM as Judge Evaluators

Use structured output for reliable grading:

from typing import TypedDict, Annotated
from langchain_openai import ChatOpenAI

class AccuracyGrade(TypedDict):
    """Structured evaluation output."""
    reasoning: Annotated[str, ..., "Explain your reasoning"]
    is_accurate: Annotated[bool, ..., "True if response is accurate"]
    confidence: Annotated[float, ..., "Confidence 0.0-1.0"]

# Configure model with structured output
judge = ChatOpenAI(model="gpt-4o-mini", temperature=0).with_structured_output(
    AccuracyGrade, method="json_schema", strict=True
)

async def accuracy_evaluator(run, example):
    """Evaluate factual accuracy for LangSmith upload."""
    expected = example["outputs"].get("expected_response", "")
    agent_output = run["outputs"].get("expected_response", "")

    prompt = f"""Expected: {expected}
Agent Output: {agent_output}
Evaluate accuracy:"""

    grade = await judge.ainvoke([{"role": "user", "content": prompt}])

    return {
        "accuracy": 1 if grade["is_accurate"] else 0,
        "comment": f"{grade['reasoning']} (confidence: {grade['confidence']})"
    }

Common Metrics: Completeness, correctness, helpfulness, professionalism
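For metrics like completeness, a deterministic code-based counterpart can complement the LLM judge. This sketch assumes a hypothetical `required_keywords` field on the example outputs; it is not a standard dataset field:

```python
def keyword_completeness_evaluator(run, example):
    """Hypothetical deterministic completeness metric: fraction of required
    keywords (stored on the example) that appear in the agent response."""
    response = run["outputs"].get("expected_response", "").lower()
    keywords = example["outputs"].get("required_keywords", [])
    if not keywords:
        return {"keyword_completeness": 1.0, "comment": "no required keywords"}
    hits = [k for k in keywords if k.lower() in response]
    return {
        "keyword_completeness": len(hits) / len(keywords),
        "comment": f"found {len(hits)}/{len(keywords)} keywords",
    }

result = keyword_completeness_evaluator(
    {"outputs": {"expected_response": "LangSmith supports tracing and datasets."}},
    {"outputs": {"required_keywords": ["tracing", "datasets", "evaluators", "prompts"]}},
)
```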

Custom Code Evaluators

Exact Match

def exact_match_evaluator(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()

    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
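Exact match is brittle against punctuation and spacing differences. A looser, normalized variant (my own sketch, following the same return convention) can reduce false negatives:

```python
import re

def normalized_match_evaluator(run, example):
    """Looser variant of exact match: compare after lowercasing and
    collapsing all non-alphanumeric characters to single spaces."""
    def norm(s: str) -> str:
        return re.sub(r"[^a-z0-9]+", " ", s.lower()).strip()

    output = norm(run["outputs"].get("expected_response", ""))
    expected = norm(example["outputs"].get("expected_response", ""))
    match = output == expected
    return {
        "normalized_match": 1 if match else 0,
        "comment": f"Match: {match}",
    }

result = normalized_match_evaluator(
    {"outputs": {"expected_response": "Hello,   World!"}},
    {"outputs": {"expected_response": "hello world"}},
)
```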

Trajectory Validation

def trajectory_evaluator(run, example):
    """Evaluate tool call sequence."""
    trajectory = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Exact sequence match
    exact = trajectory == expected

    # All required tools used (order-agnostic)
    all_tools = set(expected).issubset(set(trajectory))

    # Efficiency: count extra steps
    extra_steps = len(trajectory) - len(expected)

    return {
        "trajectory_match": 1 if exact else 0,
        "comment": f"Exact: {exact}, All tools: {all_tools}, Extra: {extra_steps}"
    }
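The trajectory evaluator above returns a binary score; when partial credit is useful, a recall-style metric (a hypothetical variant, not from the source) scores the fraction of expected tools actually called, regardless of order:

```python
def tool_recall_evaluator(run, example):
    """Hypothetical partial-credit metric: fraction of expected tools
    that appear anywhere in the agent's trajectory."""
    called = set(run["outputs"].get("expected_trajectory", []))
    expected = set(example["outputs"].get("expected_trajectory", []))
    recall = len(called & expected) / len(expected) if expected else 1.0
    return {
        "tool_recall": recall,
        "comment": f"{len(called & expected)}/{len(expected)} expected tools called",
    }

result = tool_recall_evaluator(
    {"outputs": {"expected_trajectory": ["search", "fetch"]}},
    {"outputs": {"expected_trajectory": ["search", "fetch", "summarize", "cite"]}},
)
```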

Single Step Validation

def single_step_evaluator(run, example):
    """Evaluate single node output."""
    output = run["outputs"].get("output", {})
    expected = example["outputs"].get("expected_output", {})
    node_name = run["outputs"].get("node_name", "")

    # For classification nodes
    if "classification" in node_name:
        classification = output.get("classification", "")
        expected_class = expected.get("classification", "")
        match = classification.lower() == expected_class.lower()

        return {
            "classification_correct": 1 if match else 0,
            "comment": f"Output: {classification}, Expected: {expected_class}"
        }

    # For other nodes
    match = output == expected
    return {
        "output_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }

Running Evaluations

from langsmith import Client

client = Client()

# Define your agent function
def run_agent(inputs: dict) -> dict:
    """Your agent invocation logic."""
    result = your_agent.invoke(inputs)
    # Key must match what the evaluators read from run["outputs"]
    return {"expected_response": result}

# Run evaluation
results = await client.aevaluate(
    run_agent,
    data="Skills: Final Response",              # Dataset name
    evaluators=[
        exact_match_evaluator,
        accuracy_evaluator,
        trajectory_evaluator
    ],
    experiment_prefix="skills-eval-v1",
    max_concurrency=4
)

Upload Evaluators to LangSmith

The upload script is a utility for deploying your custom evaluators to LangSmith. Write evaluators specific to your use case, then upload them.

Navigate to skills/langsmith-evaluator/scripts/ to upload evaluators.

Important: LangSmith API requires evaluators to use (run, example) signature where:

  • run: dict with run["outputs"] containing agent outputs
  • example: dict with example["outputs"] containing expected outputs

Create Evaluator File

# my_project/evaluators/custom_evals.py

def my_custom_evaluator(run, example):
    """Your custom evaluation logic.

    Args:
        run: Dict with run["outputs"] - agent outputs
        example: Dict with example["outputs"] - expected outputs

    Returns:
        Dict with metric_name as key, score as value, optional comment
    """
    # Extract relevant data
    agent_output = run["outputs"].get("expected_trajectory", [])
    expected = example["outputs"].get("expected_trajectory", [])

    # Your custom logic here
    match = agent_output == expected

    return {
        "my_metric": 1 if match else 0,
        "comment": "Custom reasoning here"
    }

Upload

# List existing evaluators
python upload_evaluators.py list

# Upload evaluator
python upload_evaluators.py upload my_project/evaluators/custom_evals.py \
  --name "My Custom Evaluator" \
  --function my_custom_evaluator \
  --dataset "Skills: Trajectory" \
  --replace

# Delete evaluator (will prompt for confirmation)
python upload_evaluators.py delete "My Custom Evaluator"

# Skip confirmation prompts (use with caution)
python upload_evaluators.py delete "My Custom Evaluator" --yes
python upload_evaluators.py upload my_project/evaluators/custom_evals.py \
  --name "My Custom Evaluator" \
  --function my_custom_evaluator \
  --replace --yes

Options:

  • --name - Display name in LangSmith
  • --function - Function name to extract
  • --dataset - Target dataset name
  • --project - Target project name
  • --sample-rate - Sampling rate (0.0-1.0)
  • --replace - Replace if exists (will prompt for confirmation)
  • --yes - Skip confirmation prompts for replace/delete operations

IMPORTANT - Safety Prompts:

  • The script prompts for confirmation before any destructive operations (delete, replace)
  • ALWAYS respect these prompts - wait for user input before proceeding
  • NEVER use --yes flag unless the user explicitly requests it
  • The --yes flag skips all safety prompts and should only be used in automated workflows when explicitly authorized by the user

Best Practices

  1. Use structured output for LLM judges - More reliable than parsing free-text
  2. Match evaluator to dataset type
    • Final Response → LLM as Judge for quality, Custom Code for format
    • Single Step → Custom Code for exact match
    • Trajectory → Custom Code for sequence/efficiency
  3. Combine multiple evaluators - Run both subjective (LLM) and objective (code)
  4. Use async for LLM judges - Enables parallel evaluation, much faster
  5. Test evaluators independently - Validate on known good/bad examples first
  6. Upload to LangSmith - Automatic evaluation on new runs
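Point 5 above is cheap to follow: call the evaluator directly on hand-built run/example dicts before uploading. A minimal self-check for the exact-match evaluator from earlier:

```python
def exact_match(run, example):
    """Evaluator under test (same logic as the exact-match example above)."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    return {"exact_match": 1 if output == expected else 0}

# Known-good and known-bad cases, run locally before uploading
good = exact_match({"outputs": {"expected_response": "Paris"}},
                   {"outputs": {"expected_response": " paris "}})
bad = exact_match({"outputs": {"expected_response": "London"}},
                  {"outputs": {"expected_response": "Paris"}})
assert good["exact_match"] == 1
assert bad["exact_match"] == 0
```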

Example Workflow

# 1. Create evaluators file
cat > evaluators.py <<'EOF'
def exact_match(run, example):
    """Check if output exactly matches expected."""
    output = run["outputs"].get("expected_response", "").strip().lower()
    expected = example["outputs"].get("expected_response", "").strip().lower()
    match = output == expected
    return {
        "exact_match": 1 if match else 0,
        "comment": f"Match: {match}"
    }
EOF

# 2. Upload to LangSmith
python upload_evaluators.py upload evaluators.py \
  --name "Exact Match" \
  --function exact_match \
  --dataset "Skills: Final Response" \
  --replace

# 3. Evaluator runs automatically on new dataset runs

Resources

Related Skills

  • Use langsmith-trace skill to query and export traces
  • Use langsmith-dataset skill to generate evaluation datasets from traces
