To build great AI products, you must transition from subjective "vibe checks" to systematic measurement. This process identifies exactly where an LLM is failing and creates a feedback loop for continuous improvement.
Phase 1: Open Coding (The "Benevolent Dictator" Phase)
Before automating, you must manually ground yourself in the data. Appoint one "Benevolent Dictator" (typically the Product Manager or a domain expert) whose personal taste defines what "good" looks like.
- Sample the Data: Extract 50–100 "traces" (logs of full LLM interactions) from your observability tool (e.g., Braintrust, LangSmith, Phoenix). A sampling sketch follows this list.
- Note the Upstream Error: Read each trace. If something is wrong, write a brief, informal note (an "Open Code") describing the first thing that went wrong.
  - Rule: Don't overthink it. Use specific language (e.g., "hallucinated virtual tour," "didn't confirm call transfer") rather than just "bad."
- Stop at Saturation: Continue until you stop learning new ways the system fails (Theoretical Saturation).
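A minimal sketch of the sampling step, assuming your traces were exported as a JSON Lines file; the file name and the `trace_id` field are hypothetical placeholders:

```python
import json
import random

def sample_traces(path: str, n: int = 100, seed: int = 42) -> list[dict]:
    """Draw a random sample of traces for manual open coding."""
    with open(path) as f:
        traces = [json.loads(line) for line in f]
    random.seed(seed)  # fixed seed so a second reviewer can audit the same sample
    return random.sample(traces, min(n, len(traces)))

for trace in sample_traces("traces.jsonl"):
    print(trace.get("trace_id"))  # read each trace in full, then write an open code
```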
Phase 2: Axial Coding (Categorization)
Synthesize your mess of notes into actionable categories using an LLM.
- Export Notes: Put your open codes into a CSV or spreadsheet.
- Synthesize Failure Modes: Use an LLM (e.g., Claude or ChatGPT) to group your notes into 5–7 "Axial Codes" (failure categories).
  - Prompt Pattern: "Analyze these manual notes from AI traces and group them into actionable failure categories (Axial Codes). Each category should represent a specific product problem."
- Map Back: Use a spreadsheet formula or LLM to categorize every trace into one of these buckets.
- Prioritize: Create a pivot table to count the frequency of each category. Focus your engineering efforts on the highest-frequency or highest-risk buckets. See the sketch after this list.
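A sketch of this phase in code, assuming the OpenAI Python SDK and a CSV with hypothetical `open_code` and `axial_code` columns; the model name is a placeholder:

```python
import pandas as pd
from openai import OpenAI

client = OpenAI()
notes = pd.read_csv("open_codes.csv")  # one row per trace, with an "open_code" column

# Step 1: ask an LLM to synthesize the axial codes from the raw notes.
prompt = (
    "Analyze these manual notes from AI traces and group them into 5-7 "
    "actionable failure categories (Axial Codes). Each category should "
    "represent a specific product problem.\n\n"
    + "\n".join(notes["open_code"].dropna())
)
response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whatever model you have access to
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)  # review and edit the categories by hand

# Step 2: once each trace is mapped to an axial code (formula or LLM),
# count frequencies -- the code equivalent of the pivot table.
print(notes["axial_code"].value_counts())
```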
Phase 3: Build the "LLM as Judge"
For complex, subjective failures (like "human handoff quality"), create an automated evaluator.
- Write the Judge Prompt: Create a separate prompt for an LLM whose only job is to evaluate one specific failure mode.
- Enforce Binary Scoring: Require the judge to output only True or False.
  - Note: Avoid 1–5 or 1–10 scales. They result in "weasel" metrics (e.g., a score of 3.7) that provide no clear direction for improvement.
- Define Rules: Include specific criteria from your "Benevolent Dictator" notes.
  - Example: "Output True if the user explicitly asked for a human and the assistant responded with a tool call without acknowledging the request." A runnable sketch follows this list.
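A minimal judge sketch for the handoff example above, assuming the OpenAI Python SDK; the model name, prompt wording, and transcript format are illustrative, not a fixed recipe:

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating exactly one failure mode: missed human handoff.
Output True if the user explicitly asked for a human and the assistant
responded with a tool call without acknowledging the request.
Otherwise output False. Respond with the single word True or False.

Transcript:
{transcript}"""

def judge_handoff(transcript: str) -> bool:
    """Return True if the judge flags the failure mode, False otherwise."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(transcript=transcript)}],
        temperature=0,  # keep scoring as deterministic as the API allows
    )
    return response.choices[0].message.content.strip().lower() == "true"
```

Parsing a single True/False token is the point of binary scoring: the output maps directly onto your manual labels, which makes the Phase 4 agreement check trivial.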
Phase 4: Alignment & Validation
Never ship an eval until you know the judge matches human judgment.
- Create an Agreement Matrix: Compare the Judge's True/False labels against your manual labels from Phase 1.
- Review Mismatches: Specifically look at:
  - False Positives: Judge said error, Human said no error.
  - False Negatives: Human said error, Judge said no error.
- Iterate: Refine the Judge's prompt until it aligns with the "Benevolent Dictator" at least 80–90% of the time. See the agreement sketch after this list.
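A sketch of the agreement check in plain Python; `human` and `judge` are parallel lists of True (error) / False (no error) labels per trace:

```python
def agreement_report(human: list[bool], judge: list[bool]) -> None:
    """Compare judge labels to the Benevolent Dictator's labels (True = error)."""
    fp = sum(1 for h, j in zip(human, judge) if j and not h)  # judge said error, human said no error
    fn = sum(1 for h, j in zip(human, judge) if h and not j)  # human said error, judge said no error
    agree = sum(1 for h, j in zip(human, judge) if h == j)
    print(f"Agreement: {agree / len(human):.0%}")
    print(f"False positives: {fp}  False negatives: {fn}")

# Ship the judge only once agreement clears your 80-90% bar.
agreement_report(
    human=[True, False, False, True, False],
    judge=[True, False, True, True, False],
)  # prints "Agreement: 80%" with one false positive
```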
Examples
Example 1: Real Estate AI Assistant
- Context: AI is supposed to book apartment tours.
- Open Code: "AI told the user a virtual tour was available when the property only offers in-person tours."
- Axial Code: "Capability Misrepresentation."
- Judge Logic: "Check the 'Property Context' tool output. If 'virtual_tour' is False, but the LLM response contains 'virtual tour,' output True (Error)." A deterministic version appears below.
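Because this signal lives in structured tool output, you don't need an LLM judge at all. A deterministic sketch, assuming a hypothetical trace dict with a `property_context` tool output and the final `response` text:

```python
def virtual_tour_error(trace: dict) -> bool:
    """True (error) if the response offers a virtual tour the property doesn't have."""
    offers_virtual = trace["property_context"].get("virtual_tour", False)
    mentions_virtual = "virtual tour" in trace["response"].lower()
    return mentions_virtual and not offers_virtual

trace = {
    "property_context": {"virtual_tour": False},  # tool output captured in the trace
    "response": "Great news! A virtual tour is available tomorrow at 2pm.",
}
print(virtual_tour_error(trace))  # True -> Capability Misrepresentation
```

Plain code checks like this are cheaper and more reliable than an LLM judge; reserve the judge for failures that can't be reduced to a field comparison.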
Example 2: Customer Support Handoff
- Context: AI should hand off to a human for sensitive issues.
- Open Code: "User said they were frustrated with a leak; the AI just gave a generic maintenance link."
- Axial Code: "Handoff Protocol Violation."
- Judge Logic: "Search for sentiment indicating frustration or emergency. If found, did the AI offer a human transfer? If not, output True (Error)." A prompt sketch follows.
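The same binary-judge pattern from Phase 3 covers this case; only the criteria change. A prompt sketch, with illustrative wording that you would plug into a judge function like the one sketched in Phase 3:

```python
# Plug this prompt into the judge_handoff-style function from Phase 3;
# the wording and step structure are illustrative, not a fixed recipe.
ESCALATION_JUDGE_PROMPT = """You are evaluating exactly one failure mode: missed escalation.
Step 1: Does the user express frustration or describe an emergency (e.g., a leak)?
Step 2: If yes, did the assistant offer a transfer to a human?
Output True (Error) only if Step 1 is yes and Step 2 is no.
Respond with the single word True or False.

Transcript:
{transcript}"""
```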
Common Pitfalls
- Likert Scales: Using 1–5 scales makes it impossible to know if a change in score is meaningful. Use binary True/False.
- Automating Too Early: Do not let an LLM do the initial "Open Coding." It lacks the product context to know what "janky" looks like for your specific business.
- Committee Judging: Don't use a committee to define "good." Appoint one person with the best domain taste to be the final arbiter (the Benevolent Dictator).
- Chasing Generic Metrics: Don't rely on generic evals like "hallucination score" or "cosine similarity." They rarely correlate with product-specific success.