# LLM Testing Patterns

Test AI applications with deterministic patterns using DeepEval and RAGAS.
## Quick Reference
### Mock LLM Responses
```python
import pytest
from unittest.mock import AsyncMock, patch


@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock


@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    # synthesize_findings and sample_findings come from the application under test
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
        assert result["summary"] is not None
```
### DeepEval Quality Testing
```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
]

assert_test(test_case, metrics)
```
### Timeout Testing
```python
import asyncio

import pytest


@pytest.mark.asyncio
async def test_respects_timeout():
    with pytest.raises(asyncio.TimeoutError):
        async with asyncio.timeout(0.1):  # Python 3.11+
            await slow_llm_call()
```
## Quality Metrics
| Metric | Threshold | Purpose |
| --- | --- | --- |
| Answer Relevancy | ≥ 0.7 | Response addresses the question |
| Faithfulness | ≥ 0.8 | Output matches the retrieval context |
| Hallucination | ≤ 0.3 | No fabricated facts |
| Context Precision | ≥ 0.7 | Retrieved contexts are relevant |
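A minimal sketch combining all four thresholds in one DeepEval test. The extra test-case fields are an assumption based on DeepEval's documented metric inputs (`HallucinationMetric` reads `context`; `ContextualPrecisionMetric` also needs `expected_output`); verify against your installed version:

```python
from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    ContextualPrecisionMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris.",                      # required by ContextualPrecisionMetric
    context=["Paris is the capital of France."],   # read by HallucinationMetric
    retrieval_context=["Paris is the capital of France."],
)

assert_test(
    test_case,
    [
        AnswerRelevancyMetric(threshold=0.7),      # passes at >= 0.7
        FaithfulnessMetric(threshold=0.8),         # passes at >= 0.8
        HallucinationMetric(threshold=0.3),        # lower is better; passes at <= 0.3
        ContextualPrecisionMetric(threshold=0.7),  # passes at >= 0.7
    ],
)
```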
## Anti-Patterns (FORBIDDEN)
❌ NEVER test against live LLM APIs in CI

```python
response = await openai.chat.completions.create(...)
```

❌ NEVER use random seeds (non-deterministic)

```python
model.generate(seed=random.randint(0, 100))
```

❌ NEVER skip timeout handling

```python
await llm_call()  # No timeout!
```
✅ ALWAYS mock the LLM in unit tests

```python
with patch("app.llm", mock_llm):
    result = await function_under_test()
```

✅ ALWAYS use VCR.py for integration tests

```python
@pytest.mark.vcr()
async def test_llm_integration(): ...
```
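A configuration sketch for the VCR.py pattern, assuming the pytest-recording plugin (which supplies `@pytest.mark.vcr` and picks up a `vcr_config` fixture); `call_llm` is a hypothetical helper:

```python
import pytest


@pytest.fixture(scope="module")
def vcr_config():
    # Scrub credentials so they never land in committed cassettes.
    return {
        "filter_headers": ["authorization", "x-api-key"],
        "record_mode": "once",  # record on first run, replay afterwards
    }


@pytest.mark.vcr()
@pytest.mark.asyncio
async def test_llm_integration():
    result = await call_llm("What is the capital of France?")  # hypothetical helper
    assert "Paris" in result
```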
## Key Decisions
| Decision | Recommendation |
| --- | --- |
| Mock vs VCR | VCR for integration tests, mocks for unit tests |
| Timeout | Always test with a timeout under 1 s |
| Schema validation | Test both valid and invalid outputs |
| Edge cases | Test all null/empty paths (see the sketch below) |
| Quality metrics | Use multiple dimensions (3-5) |
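For the edge-case row, a parametrized sketch; it reuses `synthesize_findings` and `mock_llm` from Quick Reference and assumes null/empty input yields a degraded-but-valid result rather than an exception:

```python
import pytest
from unittest.mock import patch


@pytest.mark.asyncio
@pytest.mark.parametrize("findings", [None, [], [{}], [{"text": ""}]])
async def test_handles_null_and_empty_findings(findings, mock_llm):
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(findings)  # assumed not to raise
    assert result is not None
```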
## Detailed Documentation
| Resource | Description |
| --- | --- |
| `references/deepeval-ragas-api.md` | DeepEval & RAGAS API reference |
| `examples/test-patterns.md` | Complete test examples |
| `checklists/llm-test-checklist.md` | Setup and review checklists |
| `scripts/llm-test-template.py` | Starter test template |
## Related Skills
- `vcr-http-recording`: Record LLM responses
- `llm-evaluation`: Quality assessment
- `unit-testing`: Test fundamentals
## Capability Details
### llm-response-mocking

Keywords: mock LLM, fake response, stub LLM, mock AI

Solves:

- Mock LLM responses in tests
- Create deterministic AI test fixtures
- Avoid live API calls in CI
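One way to extend the Quick Reference fixture to multi-call flows, a sketch using `side_effect` to return canned responses in order (the response shape mirrors the fixture above and is otherwise an assumption):

```python
from unittest.mock import AsyncMock


def make_mock_llm(responses):
    """AsyncMock that returns one canned response per call, in order."""
    mock = AsyncMock()
    mock.side_effect = [
        {"content": text, "confidence": 0.9} for text in responses
    ]
    return mock


# First await returns "draft", second returns "refined"; a third call
# fails loudly (StopAsyncIteration), catching unexpected extra calls.
mock_llm = make_mock_llm(["draft", "refined"])
```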
### async-timeout-testing

Keywords: timeout, async test, wait for, polling

Solves:

- Test async LLM operations
- Handle timeout scenarios
- Implement polling assertions
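A polling-assertion sketch built on `asyncio.timeout` (Python 3.11+); `start_llm_job` and its `status`/`result` attributes are hypothetical:

```python
import asyncio

import pytest


async def wait_until(predicate, timeout=1.0, interval=0.05):
    """Poll `predicate` until it returns truthy or `timeout` elapses."""
    async with asyncio.timeout(timeout):
        while not predicate():
            await asyncio.sleep(interval)


@pytest.mark.asyncio
async def test_job_eventually_completes():
    job = await start_llm_job()  # hypothetical async job API
    await wait_until(lambda: job.status == "done", timeout=0.5)
    assert job.result is not None
```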
### structured-output-validation

Keywords: structured output, JSON validation, schema validation, output format

Solves:

- Validate structured LLM output
- Test JSON schema compliance
- Assert output structure
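A sketch using Pydantic v2 to assert schema compliance on both valid and invalid replies; the `Finding` model is illustrative:

```python
import pytest
from pydantic import BaseModel, ValidationError


class Finding(BaseModel):  # illustrative schema
    summary: str
    confidence: float


def test_valid_output_parses():
    raw = '{"summary": "ok", "confidence": 0.9}'  # stands in for a mocked LLM reply
    finding = Finding.model_validate_json(raw)
    assert 0.0 <= finding.confidence <= 1.0


def test_invalid_output_is_rejected():
    with pytest.raises(ValidationError):
        Finding.model_validate_json('{"summary": 42}')  # wrong type, missing field
```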
### deepeval-assertions

Keywords: DeepEval, assert_test, LLMTestCase, metric assertion

Solves:

- Use DeepEval for LLM assertions
- Implement metric-based tests
- Configure quality thresholds
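Beyond `assert_test`, a metric can be measured directly for finer-grained assertions. Note that DeepEval metrics invoke a judge model, so this belongs in an evaluation suite rather than mocked CI unit tests (a sketch):

```python
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_relevancy_score_directly():
    metric = AnswerRelevancyMetric(threshold=0.7)
    case = LLMTestCase(
        input="What is the capital of France?",
        actual_output="The capital of France is Paris.",
    )
    metric.measure(case)  # populates metric.score and metric.reason
    assert metric.score >= 0.7, metric.reason
```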
### golden-dataset-testing

Keywords: golden dataset, golden test, reference output, expected output

Solves:

- Test against golden datasets
- Compare with reference outputs
- Implement regression testing
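A regression-test sketch over a golden dataset; the file path and record fields (`id`, `input`, `llm_reply`, `expected_summary`) are assumptions, and `synthesize_findings`/`mock_llm` come from Quick Reference:

```python
import json
import pathlib
from unittest.mock import patch

import pytest

# Assumed layout: a JSON list of {"id", "input", "llm_reply", "expected_summary"}.
GOLDEN = json.loads(pathlib.Path("tests/golden/summaries.json").read_text())


@pytest.mark.parametrize("case", GOLDEN, ids=lambda c: c["id"])
@pytest.mark.asyncio
async def test_against_golden_dataset(case, mock_llm):
    mock_llm.return_value = {"content": case["llm_reply"], "confidence": 0.9}
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(case["input"])
    assert result["summary"] == case["expected_summary"]
```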
### vcr-recording

Keywords: VCR, cassette, record, replay, HTTP recording

Solves:

- Record LLM API responses
- Replay recordings in tests
- Create deterministic test suites
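A sketch using plain vcrpy (without the pytest plugin) for explicit cassette control; `call_llm` is a hypothetical helper, and whether async calls are captured depends on which HTTP client vcrpy can patch in your stack:

```python
import pytest
import vcr

llm_vcr = vcr.VCR(
    cassette_library_dir="tests/cassettes",
    record_mode="once",                # use "none" in CI: replay only, never record
    filter_headers=["authorization"],  # keep API keys out of cassettes
)


@pytest.mark.asyncio
async def test_llm_integration_recorded():
    with llm_vcr.use_cassette("chat_completion.yaml"):
        result = await call_llm("What is the capital of France?")  # hypothetical helper
    assert "Paris" in result
```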