validate-evaluator

Calibrate an LLM judge against human labels using data splits, TPR/TNR, and bias correction. Use after writing a judge prompt (write-judge-prompt) to verify alignment before trusting its outputs in production.
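The calibration the blurb describes can be sketched in a few lines: measure the judge's TPR/TNR against human Pass/Fail labels on a held-out split, then bias-correct the judge's observed pass rate on unlabeled traces (the standard Rogan-Gladen correction). This is a minimal illustrative sketch, not the skill's actual implementation; all function names and the sample labels are hypothetical.

```python
def tpr_tnr(human_labels, judge_labels):
    """True positive rate and true negative rate of the judge,
    measured against human Pass(True)/Fail(False) labels."""
    tp = sum(h and j for h, j in zip(human_labels, judge_labels))
    tn = sum((not h) and (not j) for h, j in zip(human_labels, judge_labels))
    pos = sum(human_labels)
    neg = len(human_labels) - pos
    return tp / pos, tn / neg

def corrected_pass_rate(observed_rate, tpr, tnr):
    """Rogan-Gladen correction: estimate the true pass rate from
    the judge's observed pass rate and its measured error rates."""
    return (observed_rate + tnr - 1) / (tpr + tnr - 1)

# Hypothetical held-out split: human labels vs judge verdicts on the same traces
human = [True, True, True, False, False, True, False, True]
judge = [True, True, False, False, True, True, False, True]
tpr, tnr = tpr_tnr(human, judge)

# If the judge passes 60% of unlabeled production traces,
# the bias-corrected estimate of the true pass rate is:
theta = corrected_pass_rate(0.60, tpr, tnr)
```

Note the correction is only stable when TPR + TNR is well above 1, which is one reason the judge needs to be validated before its outputs are trusted.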

Safety Notice

This listing is imported from SkillsMP metadata and should be treated as untrusted until upstream source review is completed.

Installation

Install the skill "validate-evaluator" with this command: npx skills add majidraza1228/skillsmp-majidraza1228-majidraza1228-validate-evaluator

No markdown body

This source entry does not include full markdown content beyond metadata.

Source Transparency

This detail page is rendered from SKILL.md metadata. Trust labels are metadata-based hints, not a safety guarantee.

Related Skills

Related by shared tags or category signals.

General

build-review-interface

Build a custom browser-based annotation interface for reviewing LLM traces and collecting human labels. Use when reviewers are working with raw JSON files, when you need to collect Pass/Fail labels at scale, or when trace data needs domain-specific formatting to be readable.

Repository Source · Needs Review
General

generate-synthetic-data

Create diverse synthetic test inputs for LLM pipeline evaluation using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces.

Repository Source · Needs Review
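The "dimension-based tuple generation" mentioned above can be sketched as follows: define a few input dimensions, take their cross product, and render each tuple into a prompt for an LLM to expand into a full synthetic test input. The dimension names and prompt template here are illustrative assumptions, not the skill's actual schema.

```python
from itertools import product

# Hypothetical dimensions for a customer-support eval dataset
dimensions = {
    "persona": ["new user", "power user"],
    "intent": ["refund request", "bug report", "feature question"],
    "tone": ["polite", "frustrated"],
}

# Cross product of all dimension values -> 2 * 3 * 2 = 12 tuples
tuples = [dict(zip(dimensions, combo)) for combo in product(*dimensions.values())]

# Each tuple becomes a generation prompt handed to an LLM
prompts = [
    f"Write a support message from a {t['persona']} with a "
    f"{t['tone']} tone about a {t['intent']}."
    for t in tuples
]
```

Enumerating the tuples first keeps coverage explicit and auditable, which is why the blurb warns against using this once you already have plenty of representative real traces.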
Research

error-analysis

Systematically identify and categorize failure modes in an LLM pipeline by reading traces. Use when starting a new eval project, after significant pipeline changes (new features, model switches, prompt rewrites), when production metrics drop, or after incidents.

Repository Source · Needs Review
Security

eval-audit

Audit an LLM eval pipeline and surface problems: missing error analysis, unvalidated judges, vanity metrics, etc. Use when inheriting an eval system, when unsure whether evals are trustworthy, or as a starting point when no eval infrastructure exists.

Repository Source · Needs Review