scorable-integration

Integrate Scorable LLM-as-a-Judge evaluators into applications with LLM interactions. Use when users want to add evaluation, guardrails, or quality monitoring to their LLM-powered applications. Also use when users mention Scorable, judges, LLM evaluation, or safeguarding applications.

Safety Notice

This listing is imported from skills.sh public index metadata. Review upstream SKILL.md and repository scripts before running.


Install skill "scorable-integration" with this command: npx skills add root-signals/scorable-skills/root-signals-scorable-skills-scorable-integration

Add Scorable LLM-as-a-Judge to Your Application

These instructions guide you through creating LLM evaluation judges with Scorable and integrating them into your codebase. Scorable is a tool for creating LLM-as-a-Judge evaluators that safeguard applications. A judge is Scorable's term for a group of evaluations from different metrics (Helpfulness, Policy Adherence, etc.).

Overview

Your role is to:

  1. Analyze the codebase to identify LLM interactions
  2. Create judges via Scorable API to evaluate those interactions (or use an existing judge ID if provided)
  3. Integrate judge execution into the code at appropriate points
  4. Provide usage documentation for the evaluation setup

Note: These instructions work for both creating new judges from scratch and integrating existing judges. If the user provides a judge ID, you can skip the judge creation step (Step 3) and proceed directly to integration (Step 4).

Step 0: Explain the process

Before performing any analysis or technical steps, pause and clearly brief the user on what is about to happen. Explain that you will:

  • Analyze the codebase to identify LLM interactions
  • Create judges via Scorable API to evaluate those interactions
  • Integrate judge execution into the code at appropriate points
  • Provide usage documentation for the evaluation setup

Step 1: Analyze the Application

Examine the codebase to understand:

  • What LLM interactions exist (prompts, completions, agent calls)
  • What the application does at each interaction point
  • Which interactions are most critical to evaluate

If multiple LLM interactions exist, help the user prioritize. Recommend starting with the most critical one first.


Step 2: Get Scorable API Key

Ask the user which option they prefer:

Option A: Permanent API Key (Recommended)

Direct them to: https://scorable.ai/api-key-setup

  1. Sign in with SSO or email/password
  2. Click "Create API Key"
  3. Copy the key and store it in a .env file or as an environment variable

Security: Instruct the user to use environment variables or the project's secret management. Use an existing .env file if available, or ask the user to save the key as an environment variable. Do not ask the user to paste the key into this session.
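
For example, a minimal Python helper (the function name is illustrative) that reads the key from the environment and fails loudly when it is missing:

```python
import os

def scorable_api_key() -> str:
    """Read the Scorable API key from the environment; fail loudly if it is missing."""
    key = os.environ.get("SCORABLE_API_KEY")
    if not key:
        raise RuntimeError(
            "SCORABLE_API_KEY is not set; add it to .env or export it in your shell"
        )
    return key
```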


Option B: Temporary API Key (Testing Only)

curl --request POST \
  --url https://api.scorable.ai/create-demo-user/ \
  --header 'accept: application/json' \
  --header 'content-type: application/json'

Response includes api_key field. Warn the user appropriately that:

  • The judge will be public and visible to everyone
  • The key only works for a limited time
  • For private judges, they should create a permanent key

Also note the api_token field: it is used only as a URL parameter in the judge URL, not in any other context.


Option C: Existing API Key

If they have an account: https://scorable.ai/settings/api-keys


Step 3: Generate a Judge

Note: If the user has already provided a judge ID (e.g., in their message), you can skip this step and proceed directly to Step 4 (Integration).

Call the /v1/judges/generate/ endpoint with a detailed intent string.

Intent String Guidelines:

  • Describe the application context and what you're evaluating
  • Mention the specific execution point (stage name)
  • Include critical quality dimensions you care about
  • Add examples, documentation links, mandatory tool calls, or policies if relevant
  • Be specific and detailed (multiple sentences/paragraphs are good)
  • Code-level details (frameworks, libraries, etc.) do not need to be mentioned

Example with all required fields filled:

curl --request POST \
  --url https://api.scorable.ai/v1/judges/generate/ \
  --header 'accept: application/json' \
  --header 'content-type: application/json' \
  --header 'Authorization: Api-Key <SCORABLE_API_KEY>' \
  --data '{
    "visibility": "unlisted",
    "intent": "An email automation system that creates summary emails using an LLM based on database query results and user input. Evaluate the LLM output for: accuracy in summarizing data, appropriate tone for the audience, inclusion of all key information from queries, proper formatting, and absence of hallucinations. The system is used for customer-facing communications.",
    "generating_model_params": {
      "temperature": 0.2,
      "reasoning_effort": "medium"
    }
  }'

Set "visibility" to "public" instead of "unlisted" when using a temporary key.

Optional fields:

  • enable_context_aware_evaluators: Set to true if the application interaction uses RAG (document chunks) whose context is relevant and can be passed to the evaluation (hallucinations, context drift, etc.).

Note that this can take up to 2 minutes to complete.
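
The same request can be sketched in Python with only the standard library; the endpoint, headers, and body fields are taken from the curl example above, while the helper names are illustrative:

```python
import json
import os
import urllib.request

SCORABLE_API = "https://api.scorable.ai"

def build_generate_payload(intent: str, visibility: str = "unlisted") -> dict:
    """Assemble the /v1/judges/generate/ request body shown above."""
    return {
        "visibility": visibility,
        "intent": intent,
        "generating_model_params": {"temperature": 0.2, "reasoning_effort": "medium"},
    }

def generate_judge(intent: str) -> dict:
    """POST the payload; generation can take up to two minutes, hence the long timeout."""
    req = urllib.request.Request(
        f"{SCORABLE_API}/v1/judges/generate/",
        data=json.dumps(build_generate_payload(intent)).encode("utf-8"),
        headers={
            "accept": "application/json",
            "content-type": "application/json",
            "Authorization": f"Api-Key {os.environ['SCORABLE_API_KEY']}",
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=180) as resp:
        return json.load(resp)
```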

Handling API Responses:

The API may return:

1. missing_context_from_system_goal - Additional context needed:

{
  "missing_context_from_system_goal": [
    {
      "form_field_name": "target_audience",
      "form_field_description": "The intended audience for the content"
    }
  ]
}

→ Ask the user for these details (if not evident from the code base), then call /v1/judges/generate/ again with:

{
  "judge_id": "existing-judge-id",
  "stage": "Stage name",
  "extra_contexts": {
    "target_audience": "Enterprise customers"
  },
  ...other fields...
}

2. multiple_stages - Judge detected multiple evaluation points:

{
  "error_code": "multiple_stages",
  "stages": ["Stage 1", "Stage 2", "Stage 3"]
}

→ Ask the user which stage to focus on, or if they have a custom stage name. Each judge evaluates one stage. You can create additional judges later for other stages.

3. Success - Judge created:

{
  "judge_id": "abc123...",
  "evaluator_details": [...]
}

→ Proceed to integration.


Step 4: Integrate Judge Execution

Add code to evaluate LLM outputs at the appropriate execution point(s).

Language-Specific Integration

Choose the appropriate integration guide based on the codebase language.

Integration Points

  • Insert evaluation code where LLM outputs are generated (for example, after an OpenAI Responses API call)
  • response parameter: The text you want to evaluate (required)
  • request parameter: The input that prompted the response (optional but recommended)
  • Use actual variables from your code, not static strings
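
The pattern looks like this in Python; `evaluate_with_judge` is a hypothetical stand-in for the real Scorable SDK or REST call from the language-specific guide, and here it only records what would be sent:

```python
from typing import Optional

evaluation_log: list = []  # stand-in for the real Scorable call's side effect

def evaluate_with_judge(judge_id: str, response: str, request: Optional[str] = None) -> None:
    """Hypothetical helper: replace the body with the actual SDK/REST call."""
    evaluation_log.append({"judge_id": judge_id, "response": response, "request": request})

def answer_question(question: str, llm) -> str:
    completion = llm(question)  # your existing LLM call
    # Pass the actual variables from your code, not static strings.
    evaluate_with_judge("<JUDGE_ID>", response=completion, request=question)
    return completion
```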

Multi-Turn / Agent + Tool Calls evaluation

If a multi-turn conversation is detected, use the multi-turn format to evaluate the entire conversation flow, which may also include tool calls. Confirm with the user whether multi-turn evaluation suits their needs. See the language-specific guides for details.
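
As an illustration only (the exact field names come from the language-specific Scorable guides), a multi-turn payload typically captures every message in the conversation, tool calls included; this sketch assumes an OpenAI-style message list:

```python
# Hypothetical shape of a multi-turn conversation to evaluate; the real
# format is defined in the language-specific Scorable guides.
conversation = [
    {"role": "user", "content": "What's the weather in Oslo?"},
    {"role": "assistant", "tool_calls": [{"name": "get_weather", "arguments": {"city": "Oslo"}}]},
    {"role": "tool", "name": "get_weather", "content": "3 degrees C, overcast"},
    {"role": "assistant", "content": "It's 3 degrees C and overcast in Oslo."},
]
```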


Step 5: Provide Next Steps

After integration:

  1. Ask about additional judges: If multiple stages were identified, ask if the user wants to create judges for other stages
  2. Discuss evaluation strategy:
    • Should every LLM call be evaluated or sampled (e.g., 10%)?
    • Should scores be stored in a database for analysis?
    • Should specific actions trigger based on scores (e.g., alerts for low scores)?
    • Batch evaluation vs real-time evaluation?
  3. Provide judge details:
    • Judge URL: https://scorable.ai/judge/{judge_id}
    • How to view results in the Scorable dashboard (https://scorable.ai/dashboard)
    • If a temporary key was used, a note that it expires after a limited time and that they should create an account with a permanent key
  4. Link to docs: https://docs.scorable.ai

Key Implementation Notes

  • Install SDK first: Check which dependency management system is used and install the appropriate package.
  • Store API keys securely: Use environment variables, not hardcoded strings
  • Handle errors gracefully: Evaluation failures shouldn't break your application
  • Start simple: Evaluate one stage first, then expand
  • Sampling for production: 5-10% sampling reduces costs while maintaining visibility
  • Non-blocking: The evaluation should not block the main thread and not slow down the application
  • Common patterns: See language-specific reference files for integration patterns (development, production sampling, batch evaluation)
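
The sampling, error-handling, and non-blocking notes above can be combined into one small wrapper; this is a sketch, with `evaluate` standing in for whatever Scorable call your integration uses:

```python
import random
import threading

def maybe_evaluate(judge_id: str, response: str, request: str,
                   evaluate, sample_rate: float = 0.10) -> bool:
    """Fire-and-forget evaluation for a sampled fraction of calls.

    Errors are swallowed so a failed evaluation never breaks the application;
    returns True if this call was sampled for evaluation.
    """
    if random.random() >= sample_rate:
        return False

    def _run() -> None:
        try:
            evaluate(judge_id, response=response, request=request)
        except Exception:
            pass  # log this in a real application; never propagate

    # A daemon thread keeps evaluation off the request path.
    threading.Thread(target=_run, daemon=True).start()
    return True
```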

