llm-testing

Test AI applications with deterministic patterns using DeepEval and RAGAS.

Safety Notice

This listing is imported from the skills.sh public index metadata. Review the upstream SKILL.md and the repository scripts before running anything.

Install the skill with:

npx skills add yonatangross/orchestkit/yonatangross-orchestkit-llm-testing

LLM Testing Patterns

Test AI applications with deterministic patterns using DeepEval and RAGAS.

Quick Reference

Mock LLM Responses

import pytest
from unittest.mock import AsyncMock, patch

@pytest.fixture
def mock_llm():
    mock = AsyncMock()
    mock.return_value = {"content": "Mocked response", "confidence": 0.85}
    return mock

@pytest.mark.asyncio
async def test_with_mocked_llm(mock_llm):
    # Patch the model factory so no live API is hit.
    with patch("app.core.model_factory.get_model", return_value=mock_llm):
        result = await synthesize_findings(sample_findings)
        assert result["summary"] is not None

DeepEval Quality Testing

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric

test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    retrieval_context=["Paris is the capital of France."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
]

assert_test(test_case, metrics)

Timeout Testing

import asyncio
import pytest

@pytest.mark.asyncio
async def test_respects_timeout():
    # asyncio.timeout requires Python 3.11+.
    with pytest.raises(asyncio.TimeoutError):
        async with asyncio.timeout(0.1):
            await slow_llm_call()

Quality Metrics

Metric             Threshold  Purpose
Answer Relevancy   ≥ 0.7      Response addresses the question
Faithfulness       ≥ 0.8      Output is grounded in the context
Hallucination      ≤ 0.3      No fabricated facts
Context Precision  ≥ 0.7      Retrieved contexts are relevant
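
The two-metric example above can be extended to cover all four rows of the table. A minimal sketch, assuming DeepEval's HallucinationMetric and ContextualPrecisionMetric, which additionally need `context` and `expected_output` fields on the test case:

from deepeval import assert_test
from deepeval.test_case import LLMTestCase
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
    ContextualPrecisionMetric,
)

# Thresholds mirror the table above; HallucinationMetric scores the rate of
# fabrication, so its threshold is an upper bound rather than a floor.
test_case = LLMTestCase(
    input="What is the capital of France?",
    actual_output="The capital of France is Paris.",
    expected_output="Paris.",
    context=["Paris is the capital of France."],
    retrieval_context=["Paris is the capital of France."],
)

metrics = [
    AnswerRelevancyMetric(threshold=0.7),
    FaithfulnessMetric(threshold=0.8),
    HallucinationMetric(threshold=0.3),
    ContextualPrecisionMetric(threshold=0.7),
]

assert_test(test_case, metrics)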

Anti-Patterns (FORBIDDEN)

❌ NEVER test against live LLM APIs in CI

response = await openai.chat.completions.create(...)

❌ NEVER use random seeds (non-deterministic)

model.generate(seed=random.randint(0, 100))

❌ NEVER skip timeout handling

await llm_call() # No timeout!

✅ ALWAYS mock LLM in unit tests

with patch("app.llm", mock_llm):
    result = await function_under_test()

✅ ALWAYS use VCR.py for integration tests

@pytest.mark.vcr()
async def test_llm_integration():
    ...

Key Decisions

Decision           Recommendation
Mock vs VCR        VCR for integration tests, mocks for unit tests
Timeout            Always test with a timeout under 1 s
Schema validation  Test both valid and invalid payloads
Edge cases         Test all null/empty paths
Quality metrics    Use multiple dimensions (3-5)

Detailed Documentation

Resource                          Description
references/deepeval-ragas-api.md  DeepEval & RAGAS API reference
examples/test-patterns.md         Complete test examples
checklists/llm-test-checklist.md  Setup and review checklists
scripts/llm-test-template.py      Starter test template

Related Skills

  • vcr-http-recording: Record LLM responses

  • llm-evaluation: Quality assessment

  • unit-testing: Test fundamentals

Capability Details

llm-response-mocking

Keywords: mock LLM, fake response, stub LLM, mock AI

Solves:

  • Mock LLM responses in tests

  • Create deterministic AI test fixtures

  • Avoid live API calls in CI

async-timeout-testing

Keywords: timeout, async test, wait for, polling

Solves:

  • Test async LLM operations

  • Handle timeout scenarios

  • Implement polling assertions
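
Polling assertions are not shown in the Quick Reference. A minimal sketch, where `get_job_status` is a hypothetical coroutine standing in for the system under test:

import asyncio
import pytest

async def poll_until(predicate, timeout: float = 1.0, interval: float = 0.05):
    # Re-evaluate an async predicate until it passes or the deadline expires
    # (asyncio.timeout requires Python 3.11+; expiry raises TimeoutError).
    async with asyncio.timeout(timeout):
        while not await predicate():
            await asyncio.sleep(interval)

@pytest.mark.asyncio
async def test_job_eventually_completes():
    async def job_done():
        # get_job_status is hypothetical; substitute your own status check.
        return await get_job_status("job-1") == "done"

    await poll_until(job_done, timeout=0.5)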

structured-output-validation

Keywords: structured output, JSON validation, schema validation, output format

Solves:

  • Validate structured LLM output

  • Test JSON schema compliance

  • Assert output structure
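
A minimal sketch using Pydantic (an assumption; any schema library works), covering both the valid and the invalid path per the Key Decisions table:

import pytest
from pydantic import BaseModel, Field, ValidationError

class Finding(BaseModel):
    summary: str
    confidence: float = Field(ge=0.0, le=1.0)

def test_valid_output_parses():
    raw = '{"summary": "Paris is the capital.", "confidence": 0.9}'
    finding = Finding.model_validate_json(raw)  # Pydantic v2 API
    assert finding.summary

def test_invalid_output_rejected():
    # A missing required field and an out-of-range confidence must both fail.
    with pytest.raises(ValidationError):
        Finding.model_validate_json('{"confidence": 1.7}')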

deepeval-assertions

Keywords: DeepEval, assert_test, LLMTestCase, metric assertion

Solves:

  • Use DeepEval for LLM assertions

  • Implement metric-based tests

  • Configure quality thresholds

golden-dataset-testing

Keywords: golden dataset, golden test, reference output, expected output

Solves:

  • Test against golden datasets

  • Compare with reference outputs

  • Implement regression testing
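
One way to wire a golden dataset into pytest; the file path and JSON layout (an "id", an "input", and an "expected" object per case) are assumptions for illustration:

import json
import pathlib
import pytest

GOLDEN_CASES = json.loads(pathlib.Path("tests/golden/synthesize.json").read_text())

@pytest.mark.parametrize("case", GOLDEN_CASES, ids=lambda c: c["id"])
@pytest.mark.asyncio
async def test_against_golden(case):
    # Each case pairs an input with a reference output captured earlier;
    # a diff here signals a regression, not necessarily a wrong answer.
    result = await synthesize_findings(case["input"])
    assert result["summary"] == case["expected"]["summary"]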

vcr-recording

Keywords: VCR, cassette, record, replay, HTTP recording

Solves:

  • Record LLM API responses

  • Replay recordings in tests

  • Create deterministic test suites
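
A sketch of the recording setup, assuming pytest-recording (which supplies @pytest.mark.vcr and reads the vcr_config fixture); `call_llm` is a placeholder for the code under test:

# conftest.py
import pytest

@pytest.fixture(scope="module")
def vcr_config():
    # Scrub credentials so cassettes are safe to commit; the record mode is
    # usually chosen on the command line via pytest's --record-mode flag.
    return {"filter_headers": ["authorization", "x-api-key"]}

# test_llm_integration.py
@pytest.mark.asyncio
@pytest.mark.vcr()
async def test_llm_integration():
    response = await call_llm("What is the capital of France?")
    assert "Paris" in response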

