llm-eval-router
Set up a production-quality shadow evaluation pipeline that automatically promotes local Ollama models when they statistically prove they match cloud model quality — reducing inference costs with evidence, not hope.
The core idea
Run every task through your best local model (shadow) in parallel with your cloud baseline (ground truth). A lightweight judge ensemble scores the local output. After 200+ runs, if the local model sustains a mean score of at least 0.95, promote it to handle that task type in production. Demote it automatically if quality drops.
When to use
- You're paying for Claude/GPT API calls on tasks that don't need that quality
- You have Ollama running locally with capable models (qwen2.5, phi4, mistral, etc.)
- You want evidence-based cost reduction, not blind routing
- You have defined task types: summarize, classify, extract, format, analyze, RAG
When NOT to use
- Tasks that require real-time web knowledge (use cloud)
- Tasks with strict latency requirements < 2 seconds (local models on CPU are slow)
- Tasks with high safety stakes (always use cloud with safety filters)
- You don't have Ollama or a Mac/Linux machine with enough RAM (8GB+ per model)
Prerequisites
- Ollama installed and running (ollama.com)
- At least one capable model: `ollama pull qwen2.5` or `ollama pull phi4`
- Python 3.10+
- API keys: Anthropic (ground truth) + OpenAI (judge) — Gemini optional (tiebreaker)
- Langfuse for observability (self-hosted or cloud) — optional but strongly recommended
Network & Privacy
This skill makes outbound API calls to:
- Anthropic API — to generate ground truth baseline responses (every accumulation cycle)
- OpenAI API — for judge scoring (sampled at 15% of runs)
- Google Gemini API — tiebreaker judge only (when primary judges disagree by ≥0.20)
What stays local:
- All Ollama model inference runs entirely on your device
- Scored run data is stored on disk in `data/scores/*.json`
- No telemetry, analytics, or data collection of any kind
- No data is sent anywhere other than the explicit API calls above
Langfuse (optional) can be self-hosted or cloud. If self-hosted, all observability data stays on your network.
Core concepts
6-Dimension Evaluation
Every response is scored on:
| Dimension | Default weight | Analyze weight | What it measures |
|---|---|---|---|
| Structural | 25% | 10% | Format compliance, required keys present |
| Semantic | 25% | 40% | Meaning equivalence to ground truth |
| Factual | 20% | 25% | No hallucinated facts/numbers/entities |
| Completion | 15% | 18% | Task fully addressed |
| Tool use | 10% | 4% | Correct tool/format selection |
| Latency | 5% | 3% | Within acceptable bounds |
Important: Use per-task weight overrides. The default 25/25 split treats structural
accuracy equally with semantic similarity — which works for extract/classify/format tasks
(where exact format matters) but is wrong for open-ended analysis. difflib.SequenceMatcher
on two prose analyses of the same question scores ~0.29 even when they're semantically
identical. With structural weight at 25%, this alone caps analyze scores at ~0.59.
```python
# src/evaluator.py — per-task weight profiles
TASK_WEIGHT_OVERRIDES = {
    "analyze": {
        "structural_accuracy": 0.10,  # difflib is NOT meaningful for prose
        "semantic_similarity": 0.40,  # cosine over embeddings captures meaning
        "factual_drift": 0.25,
        "task_completion": 0.18,
        "tool_correctness": 0.04,
        "latency_score": 0.03,
    },
    "code_transform": {
        "structural_accuracy": 0.15,
        "semantic_similarity": 0.35,
        "factual_drift": 0.20,
        "task_completion": 0.20,
        "tool_correctness": 0.07,
        "latency_score": 0.03,
    },
}
```
Also: For analyze tasks, constrain output structure via system_prompt so GT and candidates produce comparably-formatted responses (Finding/Recommendation/Confidence/Reasoning). This reduces Layer 2 drift and improves difflib scores even at reduced weight.
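As a concrete illustration, one such constraint could look like the prompt below. Only the Finding/Recommendation/Confidence/Reasoning structure comes from the text above; the exact wording is an assumption.

```python
# Hypothetical analyze-task system prompt. The four-section structure is from
# the skill text; the wording is illustrative only.
ANALYZE_SYSTEM_PROMPT = """You are an analyst. Answer using exactly this structure:

Finding: <one-sentence core finding>
Recommendation: <one actionable recommendation>
Confidence: <high | medium | low>
Reasoning: <2-4 sentences supporting the finding>

Do not add any other sections, preamble, or closing remarks."""
```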
Judge ensemble
- Primary judges (15% sampling rate): Claude Sonnet + gpt-4o-mini score independently
- Tiebreaker (only when |score_A - score_B| ≥ 0.20): Gemini 2.5-flash
- Unsampled runs (85%): Layer 1+2 validators only (deterministic, free)
- Promotion gates always trigger full judge evaluation regardless of sampling rate
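For reference, a single judge call might look like the sketch below, using the async OpenAI client. The rubric wording, the helper name, and the convention of asking for a bare 0.0-1.0 number are assumptions; only "gpt-4o-mini scores the candidate against the ground truth" comes from the text.

```python
# Hypothetical gpt-4o-mini judge. Rubric wording and score parsing are assumptions.
from openai import AsyncOpenAI

_openai = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment

async def judge_gpt4o_mini(prompt: str, candidate: str, ground_truth: str) -> float:
    rubric = (
        "Score how well CANDIDATE answers the task compared to GROUND_TRUTH.\n"
        "Consider semantic equivalence, factual agreement, and completeness.\n"
        "Reply with a single number between 0.0 and 1.0 and nothing else.\n\n"
        f"TASK:\n{prompt}\n\nGROUND_TRUTH:\n{ground_truth}\n\nCANDIDATE:\n{candidate}"
    )
    resp = await _openai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": rubric}],
        temperature=0,
    )
    try:
        return max(0.0, min(1.0, float(resp.choices[0].message.content.strip())))
    except ValueError:
        return 0.0  # unparseable judge output is treated as a failed sample
```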
Layer 1+2 validators (free, deterministic)
- Layer 1: JSON validity, required key presence, forbidden pattern check
- Layer 2: Drift detection — novel entities/numbers/URLs not in ground truth
These run on every response at zero cost. Judges only run when L1+L2 pass and the sampling rate triggers.
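A minimal sketch of the two layers follows, assuming the `validators.layer1` / `validators.layer2` names used in Step 3. The forbidden patterns, the required-key table, and the 0.1-per-novel-fact penalty are illustrative assumptions; only "JSON/key/pattern checks" and "novel entities/numbers/URLs not in ground truth" come from the skill text.

```python
# src/validators.py — minimal sketch under the assumptions stated above.
import json
import re

FORBIDDEN_PATTERNS = [r"(?i)as an ai language model", r"(?i)i cannot help with"]
REQUIRED_KEYS = {"classify": {"label"}, "extract": {"fields"}}  # hypothetical

def layer1(response: str, task_type: str, require_json: bool = False) -> float:
    """JSON validity, required keys, forbidden patterns. 0.0 is a hard fail."""
    if any(re.search(p, response) for p in FORBIDDEN_PATTERNS):
        return 0.0
    if require_json:
        try:
            parsed = json.loads(response)
        except json.JSONDecodeError:
            return 0.0
        if not isinstance(parsed, dict) or not REQUIRED_KEYS.get(task_type, set()) <= parsed.keys():
            return 0.0
    return 1.0

def layer2(response: str, gt_response: str) -> float:
    """Penalize numbers/URLs in the candidate that the ground truth never mentions."""
    def facts(text: str) -> set[str]:
        return set(re.findall(r"https?://\S+|\b\d+(?:\.\d+)?\b", text))
    novel = facts(response) - facts(gt_response)
    return max(0.0, 1.0 - 0.1 * len(novel))  # each unsupported fact costs 0.1
```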
Promotion / Demotion
- Promote: 200+ runs, rolling mean ≥ 0.95 for a model/task pair
- Demote: rolling 7-day pass rate < 0.92
- Control floor: one model (phi4, granite4, or similar) serves as the measured floor — any model scoring below it should be flagged, not promoted
Implementation steps
Step 1 — Define your task types
Create config/task_types.yaml:
```yaml
tasks:
  - id: summarize
    description: "Summarize a document in N sentences"
    require_json: false
    judge_dimensions: [semantic, factual, completion]
  - id: classify
    description: "Classify text into one of N categories"
    require_json: true   # response must be valid JSON
    judge_dimensions: [structural, semantic, completion]
  - id: extract
    description: "Extract structured data from unstructured text"
    require_json: true
    judge_dimensions: [structural, factual, completion]
  - id: format
    description: "Reformat content to match a template"
    require_json: false
    judge_dimensions: [structural, semantic, completion]
```
Step 2 — Set up the router
The router assigns each task to a model using a round-robin strategy during burn-in (building sample size n evenly across candidates), then switches to confidence-weighted routing after promotion.
```python
# src/router.py — simplified version
from collections import defaultdict

class Router:
    def __init__(self, candidates: list[str], control_floor: str):
        self.candidates = candidates
        self.control_floor = control_floor
        self._rr_counters = defaultdict(int)

    def route(self, task_type: str, confidence_tracker: ConfidenceTracker) -> str:
        """Return the best model for this task type."""
        promoted = confidence_tracker.get_promoted(task_type)
        if promoted:
            return promoted  # use promoted model directly

        # Round-robin during burn-in for fair exposure
        idx = self._rr_counters[task_type] % len(self.candidates)
        self._rr_counters[task_type] += 1
        return self.candidates[idx]
```
Step 3 — Ground truth comparison
For each task, run it through BOTH the local model (candidate) and the cloud baseline (ground truth). Never use the ground truth response in production — it's only for evaluation.
```python
import random
from statistics import median

async def evaluate_pair(prompt: str, local_response: str, gt_response: str,
                        task_type: str) -> float:
    # Layer 1: deterministic
    l1_score = validators.layer1(local_response, task_type)
    if l1_score == 0.0:
        return 0.0  # hard fail — safety or format violation

    # Layer 2: heuristic drift
    l2_score = validators.layer2(local_response, gt_response)

    # Sample judges (15%)
    if random.random() < JUDGE_SAMPLE_RATE:
        sonnet_score = await judge_sonnet(prompt, local_response, gt_response)
        mini_score = await judge_gpt4o_mini(prompt, local_response, gt_response)
        if abs(sonnet_score - mini_score) >= 0.20:
            gemini_score = await judge_gemini(prompt, local_response, gt_response)
            final = median([sonnet_score, mini_score, gemini_score])
        else:
            final = (sonnet_score + mini_score) / 2
        return weighted_score(l1_score, l2_score, final)
    else:
        return weighted_score(l1_score, l2_score, judge_score=None)
```
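The `weighted_score` helper called above is not shown in this skill. A minimal sketch is below: the weight values come from the dimension table, but the mapping of layer outputs onto the six dimensions is an assumption. Missing dimensions (an unsampled judge, or latency in this sketch) are excluded and the remaining weights renormalized, which is how a `None` judge score avoids dragging down unsampled runs.

```python
# Hypothetical composite scorer for the call above; dimension mapping assumed.
DEFAULT_WEIGHTS = {
    "structural_accuracy": 0.25,
    "semantic_similarity": 0.25,
    "factual_drift": 0.20,
    "task_completion": 0.15,
    "tool_correctness": 0.10,
    "latency_score": 0.05,
}

def weighted_score(l1_score: float, l2_score: float, judge_score: float | None = None,
                   task_type: str | None = None) -> float:
    weights = TASK_WEIGHT_OVERRIDES.get(task_type, DEFAULT_WEIGHTS)
    dims = {
        "structural_accuracy": l1_score,     # format/key compliance (Layer 1)
        "tool_correctness": l1_score,        # assumed: also covered by Layer 1
        "factual_drift": l2_score,           # drift heuristic (Layer 2)
        "semantic_similarity": judge_score,  # judge-only dimensions
        "task_completion": judge_score,
        "latency_score": None,               # not computed in this sketch
    }
    scored = {d: v for d, v in dims.items() if v is not None}
    total = sum(weights[d] for d in scored)
    return sum(weights[d] * v for d, v in scored.items()) / total if total else 0.0
```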
Step 4 — Confidence tracker
Track scores per model/task pair on disk (so restarts don't lose data):
```python
# src/scoring/confidence.py — simplified
from dataclasses import dataclass

@dataclass
class ModelStats:
    model_id: str
    task_type: str
    scores: list[float]  # all scores (None excluded)
    promoted: bool = False
    demoted: bool = False

    @property
    def mean(self) -> float:
        return sum(self.scores) / len(self.scores) if self.scores else 0.0

    @property
    def n(self) -> int:
        return len(self.scores)

    def should_promote(self) -> bool:
        return self.n >= 200 and self.mean >= 0.95 and not self.promoted

    def should_demote(self) -> bool:
        recent = self.scores[-50:]  # last 50 scored runs
        if not recent:
            return False
        pass_rate = sum(1 for s in recent if s >= 0.85) / len(recent)
        return pass_rate < 0.92 and not self.demoted
```
Step 5 — Accumulator loop
Run this on a cron (every 10-20 minutes via launchd/systemd):
```python
# run_accumulate.py
async def accumulate():
    task_type = pick_next_task()                    # round-robin across task types
    prompt, gt_response = generate_task(task_type)  # call cloud baseline
    for candidate in router.get_candidates(task_type):
        local_response = await ollama_client.complete(candidate, prompt)
        score = await evaluate_pair(prompt, local_response, gt_response, task_type)
        confidence_tracker.record(candidate, task_type, score)
        if confidence_tracker.should_promote(candidate, task_type):
            router.promote(candidate, task_type)
            langfuse.log_promotion(candidate, task_type,
                                   confidence_tracker.stats(candidate, task_type))
```
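The two model calls this loop depends on, the cloud baseline behind `generate_task` and `ollama_client.complete`, might look roughly like the sketch below. The client usage is standard for the `anthropic` and `ollama` Python packages, but the class name, the Sonnet model id, and the error handling are assumptions.

```python
# Hypothetical clients behind generate_task() and ollama_client.
from anthropic import AsyncAnthropic
from ollama import AsyncClient

GT_MODEL = "claude-3-5-sonnet-20241022"  # pick whichever Sonnet you baseline on

async def ground_truth(prompt: str) -> str:
    """Cloud baseline response, used only for evaluation, never served in production."""
    client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment
    msg = await client.messages.create(
        model=GT_MODEL, max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text

class OllamaClient:
    def __init__(self, host: str = "http://localhost:11434"):
        self._client = AsyncClient(host=host)

    async def complete(self, model: str, prompt: str) -> str:
        resp = await self._client.chat(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
        return resp["message"]["content"]
```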
Step 6 — Routing policy
```yaml
# config/routing_policy.yaml
control_floor_model: phi4:latest   # never promote below this model's score
task_policies:
  policy_check_high_risk:
    never_local: true              # these tasks always use cloud model
  summarize:
    min_score_for_routing: 0.85
    fallback_chain: [qwen2.5, llama3.1, phi4]
  classify:
    min_score_for_routing: 0.90    # higher bar for classification
    fallback_chain: [qwen2.5, granite4, llama3.1]
```
Step 7 — API
Expose a simple HTTP API (FastAPI):
- `POST /run` — route a task through the best available model
- `GET /health` — service status + promoted models + Ollama connectivity
- `GET /status` — full scoreboard (model × task × mean × n)
- `GET /report` — cost heatmap + efficiency analysis
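A minimal FastAPI sketch of these routes is below. The request/response shapes, the module-level `router` / `confidence_tracker` / `ollama_client` / `reports` singletons, and helper methods such as `promoted_pairs()`, `scoreboard()`, and `cost_heatmap()` are assumptions; only the four endpoints and their purpose come from the list above.

```python
# src/api.py — hypothetical sketch under the assumptions stated above.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="llm-eval-router")

class RunRequest(BaseModel):
    task_type: str
    prompt: str

@app.post("/run")
async def run(req: RunRequest):
    model = router.route(req.task_type, confidence_tracker)
    text = await ollama_client.complete(model, req.prompt)
    return {"model": model, "response": text}

@app.get("/health")
async def health():
    return {"ok": True, "promoted": confidence_tracker.promoted_pairs()}

@app.get("/status")
async def status():
    return confidence_tracker.scoreboard()  # model × task × mean × n

@app.get("/report")
async def report():
    return reports.cost_heatmap()           # cost heatmap + efficiency analysis
```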
Key lessons learned (from 900+ production runs)
What worked:
- phi4 as control floor: a measured floor model prevents "promoted because everyone else is also bad" errors. If the floor model beats a candidate, flag it — don't promote.
- Thinking token stripping: CoT models (deepseek-r1, qwen2.5-coder with reasoning) must have `<think>...</think>` blocks stripped before evaluation. Otherwise Layer 2 drift detection flags the reasoning chain as hallucinated content (a regex sketch follows this list).
- `None` ≠ 0.0 for unsampled runs: a run where no judge scored is not a failing run. Store `None` and exclude it from the mean. Mixing None with 0.0 poisons the mean.
- `require_json: false` for plain-text tasks: classify and extract tasks that return formatted text (not JSON objects) will fail Layer 1 if you require JSON. Separate the "is the format correct" check from "is it valid JSON."
- Per-task weight overrides: do not use one weight profile for all task types. Structural accuracy (difflib) is wrong for prose analysis — use semantic similarity as the primary signal for open-ended tasks. This lifted the analyze mean from 0.44–0.59 to 0.70.
- Structured output prompts for analyze tasks: add a `system_prompt` that specifies an exact output format (Finding/Recommendation/Confidence/Reasoning). Both GT and candidates follow the same template, improving structural alignment and reducing drift penalty. Without this, Layer 2 drift fires on differently-phrased but correct analyses.
- MCP server for agentic access: expose CP as MCP tools (`run_task`, `get_status`, `get_champions`, `get_promotion_timeline`, `get_cost_heatmap`). Lets an LLM agent query evaluation state without bespoke integration work.
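A minimal version of the stripping step from the first bullet, assuming the reasoning is wrapped in literal `<think>` tags:

```python
# Strip chain-of-thought before Layer 2 / judging. re.DOTALL lets the block
# span newlines; assumes deepseek-r1-style literal <think> tags.
import re

THINK_BLOCK = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(response: str) -> str:
    return THINK_BLOCK.sub("", response).strip()
```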
What didn't work:
- Large models (>9GB): gpt-oss:20b and similar required 39+ second inference — the latency dimension alone tanks the composite score. Practical ceiling is ~9GB models on 24GB unified memory to avoid GPU memory swapping.
- 100% judge sampling: running every evaluation through the full Claude+GPT+Gemini panel costs more in judge API fees than local routing saves. Sample at 15%.
- Chroma 1.5.1 with Python 3.14: Pydantic V1 BaseSettings incompatibility. Use Qdrant or a numpy cosine store instead.
- One-size-fits-all weight profiles: defining global weights at system init and never overriding per task type led to all analyze evals silently failing for 112+ runs. Lesson: evaluate your evaluator's scores by task type early — if a whole task type caps at a suspicious ceiling (e.g. 0.59), the metric is wrong, not the models.
Expected timeline
With a 20-minute accumulator cadence and 9 candidates × 7 task types:
- First 50 runs per model: ~5 hours
- First promotions (200 runs): ~1-2 days per model/task pair
- Stable routing layer: 1-2 weeks
Cost estimate
Per accumulation cycle (one task, one model):
- Ground truth: ~$0.002 (Claude Sonnet, ~500 input + 200 output tokens)
- Judge sample (15%): ~$0.003 (Sonnet + GPT-4o-mini)
- Local model: $0 (Ollama, on-device)
At 6 runs/hour × 24 hours: ~$0.70/day during burn-in. After first promotions: drops to ~$0.10/day (90%+ of task volume local).