RAG Observability and Evaluations
Run retrieval-augmented generation like a measurable production system, not a black box.
What to Measure
Retrieval Quality
- Recall@k and MRR for top-k chunks
- Citation coverage and source freshness
- Embedding drift and index staleness
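The two rank metrics above can be computed directly from per-query relevance labels. A minimal sketch, assuming each query has a set of gold-relevant chunk IDs and an ordered retrieval result (function and variable names here are illustrative, not from any specific library):

```python
def recall_at_k(relevant_ids, retrieved_ids, k):
    """Fraction of gold-relevant chunks that appear in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & set(relevant_ids)) / len(relevant_ids)

def mrr(relevant_ids, retrieved_ids):
    """Reciprocal rank of the first relevant chunk; 0.0 if none retrieved."""
    relevant = set(relevant_ids)
    for rank, chunk_id in enumerate(retrieved_ids, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0
```

Averaging these over the benchmark set gives the dashboard-level Recall@k and MRR figures.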
Generation Quality
- Groundedness score (answer supported by retrieved context)
- Hallucination rate by route/use case
- Instruction adherence and format validity
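Groundedness can be approximated cheaply before reaching for an LLM judge. The sketch below is a crude lexical proxy (assumed approach, not a standard algorithm): it counts an answer sentence as "supported" when most of its tokens appear in the retrieved context. Production systems typically replace this with an NLI model or LLM-as-judge.

```python
def groundedness_score(answer_sentences, context, support_threshold=0.6):
    """Fraction of answer sentences whose tokens mostly occur in the context.

    Lexical overlap only -- a cheap first-pass proxy, not a semantic check.
    """
    context_tokens = set(context.lower().split())
    if not answer_sentences:
        return 1.0
    supported = 0
    for sentence in answer_sentences:
        tokens = set(sentence.lower().split())
        if tokens and len(tokens & context_tokens) / len(tokens) >= support_threshold:
            supported += 1
    return supported / len(answer_sentences)
```

Tracking this per route makes the "hallucination rate by route/use case" metric actionable.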
Reliability and Cost
- p50/p95 latency split by retrieval vs generation
- Token usage per stage
- Cache hit rate and cost per successful answer
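Splitting latency percentiles by stage only requires tagging each timing span with its stage name. A minimal sketch using the standard library (the span dict shape here is an assumption, not a fixed schema):

```python
import statistics

def stage_latency_summary(spans):
    """spans: list of dicts like {"stage": "retrieval", "ms": 42.0}.

    Returns {stage: {"p50": ..., "p95": ...}} so retrieval and generation
    latency regressions can be alerted on independently.
    """
    by_stage = {}
    for span in spans:
        by_stage.setdefault(span["stage"], []).append(span["ms"])
    summary = {}
    for stage, values in by_stage.items():
        cuts = statistics.quantiles(values, n=100, method="inclusive")
        summary[stage] = {"p50": statistics.median(values), "p95": cuts[94]}
    return summary
```

In practice these numbers usually come from a tracing backend; the point is that retrieval and generation must be separate spans, not one combined request timer.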
Evaluation Pipeline
- Curate a benchmark set with gold answers and source docs.
- Run nightly offline evals for every retriever/model configuration.
- Execute online shadow evals on sampled production traffic.
- Gate releases on minimum quality + safety + latency thresholds.
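The release gate in the last step can be a small pure function run in CI. A sketch, assuming eval metrics arrive as a flat dict and each threshold declares whether it is a floor or a ceiling (the threshold format is illustrative):

```python
def gate_release(metrics, thresholds):
    """Compare eval metrics against thresholds; return (passed, failures).

    thresholds maps metric name -> ("min" | "max", limit). A "min" entry
    fails when the metric falls below the limit (quality floors); a "max"
    entry fails when it rises above it (latency/cost ceilings).
    """
    failures = []
    for name, (direction, limit) in thresholds.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing from eval results")
        elif direction == "min" and value < limit:
            failures.append(f"{name}: {value} < {limit}")
        elif direction == "max" and value > limit:
            failures.append(f"{name}: {value} > {limit}")
    return (len(failures) == 0, failures)
```

Treating a missing metric as a failure (rather than a pass) keeps a broken eval job from silently waving releases through.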
Alerting Strategy
Page on:
- sharp decline in groundedness,
- spike in unanswered or fallback responses,
- index freshness SLA breach,
- cost-per-answer anomaly.
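"Anomaly" pages like the cost-per-answer one are often implemented as a z-score against a recent baseline rather than a fixed threshold. A minimal sketch of that idea (window size and threshold are illustrative defaults):

```python
import statistics

def should_page(history, current, z_threshold=3.0):
    """Page when the current value sits more than z_threshold standard
    deviations above the mean of recent history (e.g. hourly cost per answer).
    """
    if len(history) < 2:
        return False  # not enough baseline to judge
    mean = statistics.fmean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return current != mean  # flat baseline: any change is anomalous
    return (current - mean) / stdev > z_threshold
```

The same shape works for the groundedness-decline page with the comparison inverted.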
Practical Guardrails
- Force citations for high-risk domains.
- Return abstain/fallback when confidence is below threshold.
- Re-rank retrieved chunks before final generation.
- Use query rewriting only with strict regression tests.
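The abstain/fallback guardrail can be sketched as a thin wrapper around the generation step, assuming retrieval similarity scores are available per query (the threshold value and fallback text are illustrative):

```python
FALLBACK = "I don't have enough grounded context to answer that reliably."

def answer_or_abstain(answer, retrieval_scores, min_top_score=0.75):
    """Serve the generated answer only when the best retrieval score clears
    the confidence threshold; otherwise return a fixed fallback response.
    """
    if not retrieval_scores or max(retrieval_scores) < min_top_score:
        return FALLBACK
    return answer
```

Logging every fallback served also feeds the "spike in unanswered or fallback responses" alert directly.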
Incident Triage Checklist
- Did the embedding model change?
- Did chunking/indexing logic change?
- Did source corpus ingestion fail?
- Did the gateway route to an unintended model tier?
Related Skills
- rag-infrastructure - Deploy robust RAG backends
- agent-observability - Instrument requests, traces, and costs
- agent-evals - Build repeatable eval suites