# Drift Detection

Monitor LLM quality degradation and input/output distribution shifts in production.
## Overview

- Detecting input distribution drift (data drift)
- Monitoring output quality degradation (concept drift)
- Implementing statistical methods (PSI, KS, KL divergence)
- Setting up dynamic thresholds with moving averages
- Integrating Langfuse scores with drift analysis
## Quick Reference

### Population Stability Index (PSI)

```python
import numpy as np

def calculate_psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Calculate Population Stability Index.

    Thresholds:
    - PSI < 0.1: No significant drift
    - 0.1 <= PSI < 0.25: Moderate drift, investigate
    - PSI >= 0.25: Significant drift, action needed
    """
    # Bin both samples with edges derived from the baseline so buckets line up
    bin_edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=bin_edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=bin_edges)[0] / len(actual)
    # Avoid division by zero
    expected_pct = np.clip(expected_pct, 0.0001, None)
    actual_pct = np.clip(actual_pct, 0.0001, None)
    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return float(psi)
```
Usage:

```python
psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")
```
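The Quick Reference covers PSI; for the small-sample case, a two-sample Kolmogorov-Smirnov check can be sketched directly in NumPy (in production you would typically reach for `scipy.stats.ks_2samp`). The `ks_statistic` and `ks_drift_detected` helpers below are illustrative, and the 1.358 coefficient is the standard large-sample critical value at alpha = 0.05:

```python
import numpy as np

def ks_statistic(expected: np.ndarray, actual: np.ndarray) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the two ECDFs."""
    all_values = np.sort(np.concatenate([expected, actual]))
    # ECDF of each sample evaluated at every observed value
    ecdf_expected = np.searchsorted(np.sort(expected), all_values, side="right") / len(expected)
    ecdf_actual = np.searchsorted(np.sort(actual), all_values, side="right") / len(actual)
    return float(np.max(np.abs(ecdf_expected - ecdf_actual)))

def ks_drift_detected(expected, actual, alpha_coeff: float = 1.358) -> bool:
    """Compare D against the large-sample critical value (1.358 ~ alpha of 0.05)."""
    expected, actual = np.asarray(expected), np.asarray(actual)
    n, m = len(expected), len(actual)
    critical = alpha_coeff * np.sqrt((n + m) / (n * m))
    return ks_statistic(expected, actual) > critical
```

Unlike PSI, the KS statistic needs no binning choice, which is why it behaves better on small windows.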
### EWMA Dynamic Threshold

```python
import numpy as np

class EWMADriftDetector:
    """Exponentially weighted moving average for drift detection."""

    def __init__(self, lambda_param: float = 0.2, L: float = 3.0):
        self.lambda_param = lambda_param  # Smoothing factor
        self.L = L  # Control limit multiplier
        self.ewma = None

    def update(self, value: float, baseline_mean: float, baseline_std: float) -> dict:
        if self.ewma is None:
            self.ewma = value
        else:
            self.ewma = self.lambda_param * value + (1 - self.lambda_param) * self.ewma
        # Calculate control limits (asymptotic form)
        factor = np.sqrt(self.lambda_param / (2 - self.lambda_param))
        ucl = baseline_mean + self.L * baseline_std * factor
        lcl = baseline_mean - self.L * baseline_std * factor
        return {
            "ewma": self.ewma,
            "ucl": ucl,
            "lcl": lcl,
            "drift_detected": self.ewma > ucl or self.ewma < lcl,
        }
```
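To see the detector in action, here is a self-contained run with the update rule and control limits inlined. The baseline statistics (mean 0.80, std 0.05) and the degrading score series are made-up numbers for illustration:

```python
import numpy as np

lambda_param, L = 0.2, 3.0
baseline_mean, baseline_std = 0.80, 0.05

# Control limits are constant once lambda and the baseline are fixed
factor = np.sqrt(lambda_param / (2 - lambda_param))  # sqrt(0.2 / 1.8) = 1/3
ucl = baseline_mean + L * baseline_std * factor      # 0.85
lcl = baseline_mean - L * baseline_std * factor      # 0.75

scores = [0.81, 0.79, 0.80, 0.72, 0.70, 0.68, 0.65]  # quality slowly degrades
ewma = None
alarms = []
for value in scores:
    ewma = value if ewma is None else lambda_param * value + (1 - lambda_param) * ewma
    alarms.append(ewma > ucl or ewma < lcl)
```

Note how the smoothing delays the alarm until the drop is sustained: isolated dips barely move the EWMA, which is exactly what keeps alert noise down.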
### Langfuse Score Trend Monitoring

```python
from datetime import datetime, timedelta

import numpy as np
from langfuse import Langfuse

langfuse = Langfuse()

def check_quality_drift(days: int = 7, threshold_drop: float = 0.1) -> dict:
    """Compare the last day's quality scores against a rolling baseline."""
    # Fetch recent scores (last 24 hours); fetch_scores returns a paginated
    # response whose .data holds the score objects
    current_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=1),
    ).data
    # Fetch baseline scores (the `days`-long window before that)
    baseline_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=days),
        to_timestamp=datetime.now() - timedelta(days=1),
    ).data
    current_mean = np.mean([s.value for s in current_scores])
    baseline_mean = np.mean([s.value for s in baseline_scores])
    # Relative drop versus baseline (positive = degradation)
    drift_pct = (baseline_mean - current_mean) / baseline_mean
    return {"drift": drift_pct > threshold_drop, "drop_pct": drift_pct}
```
## Key Decisions

| Decision | Recommendation |
|---|---|
| Statistical method | PSI for production (stable), KS for small samples |
| Threshold strategy | Dynamic (95th percentile of historical) over static |
| Baseline window | 7-30 day rolling window |
| Alert priority | Performance metrics > distribution metrics |
| Tool stack | Langfuse (traces) + Evidently/Phoenix (drift analysis) |
### PSI Threshold Guidelines

| PSI Value | Interpretation | Action |
|---|---|---|
| < 0.1 | No significant drift | Monitor |
| 0.1 - 0.25 | Moderate drift | Investigate |
| >= 0.25 | Significant drift | Alert + Action |
## Anti-Patterns

❌ NEVER use static thresholds without context

```python
if psi > 0.2:  # May cause alert fatigue
    alert()
```

❌ NEVER retrain on a time schedule alone

```python
schedule.every(7).days.do(retrain)  # Wasteful if no drift occurred
```

✅ ALWAYS use dynamic thresholds

```python
threshold = np.percentile(historical_psi, 95)
if psi > threshold:
    alert()
```

✅ ALWAYS correlate with performance metrics

```python
if psi > threshold and quality_score < baseline:
    trigger_evaluation()
```
## Detailed Documentation

| Resource | Description |
|---|---|
| references/statistical-methods.md | PSI, KS, KL divergence, Wasserstein comparison |
| references/embedding-drift.md | Arize Phoenix, cluster monitoring, semantic drift |
| references/ewma-baselines.md | Moving averages, dynamic thresholds, control charts |
| references/langfuse-evidently-integration.md | Combined pipeline pattern |
| checklists/drift-detection-setup-checklist.md | Implementation checklist |
## Related Skills

- langfuse-observability - Score tracking for drift analysis
- llm-evaluation - Quality metrics that feed drift detection
- quality-gates - Threshold enforcement
- observability-monitoring - General monitoring patterns
## Capability Details

### psi-drift

Keywords: psi, population stability index, distribution drift, histogram comparison

Solves:

- Detect distribution shifts in LLM inputs/outputs
- Production-grade drift monitoring
- Stable drift metric for large datasets
### embedding-drift

Keywords: embedding drift, semantic drift, cluster, centroid, arize phoenix

Solves:

- Detect semantic changes in text data
- Monitor RAG retrieval quality
- Track embedding space shifts
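Short of a full Phoenix/Evidently setup, one minimal way to quantify embedding drift is the cosine distance between the centroids of a baseline and a current window. This NumPy sketch assumes each row of the input arrays is one text's embedding vector:

```python
import numpy as np

def centroid_cosine_drift(baseline_emb: np.ndarray, current_emb: np.ndarray) -> float:
    """Cosine distance between the mean embeddings of two windows.

    0 = centroids point the same way; larger values = semantic shift.
    """
    b = baseline_emb.mean(axis=0)
    c = current_emb.mean(axis=0)
    cos_sim = np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c))
    return float(1.0 - cos_sim)
```

Centroid distance is coarse (it misses shifts that preserve the mean, e.g. growing variance), so cluster-level monitoring as in Phoenix is the stronger follow-up.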
### quality-regression

Keywords: quality drift, score degradation, trend, moving average

Solves:

- Detect LLM quality degradation over time
- Compare against historical baselines
- Early warning for model issues

### dynamic-thresholds

Keywords: ewma, dynamic threshold, adaptive, control chart

Solves:

- Reduce alert fatigue with adaptive thresholds
- Statistical process control for LLMs
- Context-aware drift alerting
### canary-monitoring

Keywords: canary prompt, fixed test, regression test, behavioral drift

Solves:

- Track consistency with fixed test inputs
- Detect behavioral changes in LLMs
- Regression testing for model updates
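A canary check can be as simple as diffing the current outputs of fixed prompts against recorded baselines. This sketch uses stdlib `difflib` string similarity; the `canary_drift` helper and the 0.8 cutoff are assumed starting points (an embedding similarity would catch paraphrases this misses):

```python
from difflib import SequenceMatcher

def canary_drift(baseline_outputs: dict, current_outputs: dict,
                 min_similarity: float = 0.8) -> list:
    """Flag canary prompts whose current output diverged from the recorded baseline."""
    drifted = []
    for prompt_id, baseline in baseline_outputs.items():
        current = current_outputs.get(prompt_id, "")  # missing output counts as drift
        similarity = SequenceMatcher(None, baseline, current).ratio()
        if similarity < min_similarity:
            drifted.append((prompt_id, round(similarity, 3)))
    return drifted
```

Run it on every deploy and model-version bump; a non-empty result is a cheap early signal before the statistical detectors above accumulate enough data.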