# Drift Detection

Monitor LLM quality degradation and input/output distribution shifts in production.
## Overview

- Detecting input distribution drift (data drift)
- Monitoring output quality degradation (concept drift)
- Implementing statistical methods (PSI, KS, KL divergence)
- Setting up dynamic thresholds with moving averages
- Integrating Langfuse scores with drift analysis
## Quick Reference

### Population Stability Index (PSI)

```python
import numpy as np

def calculate_psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Calculate Population Stability Index.

    Thresholds:
    - PSI < 0.1: No significant drift
    - 0.1 <= PSI < 0.25: Moderate drift, investigate
    - PSI >= 0.25: Significant drift, action needed
    """
    # Bin both samples with edges derived from the baseline so buckets line up
    bin_edges = np.histogram_bin_edges(expected, bins=bins)
    expected_pct = np.histogram(expected, bins=bin_edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=bin_edges)[0] / len(actual)
    # Avoid division by zero
    expected_pct = np.clip(expected_pct, 0.0001, None)
    actual_pct = np.clip(actual_pct, 0.0001, None)
    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return float(psi)
```
Usage:

```python
psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")
```
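The Quick Reference covers PSI; for the small-sample case, a two-sample Kolmogorov-Smirnov check can be sketched directly in NumPy (in production you would typically reach for `scipy.stats.ks_2samp`). The `ks_statistic` and `ks_drift_detected` helpers below are illustrative, and the 1.358 coefficient is the standard large-sample critical value at alpha = 0.05:

```python
import numpy as np

def ks_statistic(expected: np.ndarray, actual: np.ndarray) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the two ECDFs."""
    all_values = np.sort(np.concatenate([expected, actual]))
    # ECDF of each sample evaluated at every observed value
    ecdf_expected = np.searchsorted(np.sort(expected), all_values, side="right") / len(expected)
    ecdf_actual = np.searchsorted(np.sort(actual), all_values, side="right") / len(actual)
    return float(np.max(np.abs(ecdf_expected - ecdf_actual)))

def ks_drift_detected(expected, actual, alpha_coeff: float = 1.358) -> bool:
    """Compare D against the large-sample critical value (1.358 ~ alpha of 0.05)."""
    expected, actual = np.asarray(expected), np.asarray(actual)
    n, m = len(expected), len(actual)
    critical = alpha_coeff * np.sqrt((n + m) / (n * m))
    return ks_statistic(expected, actual) > critical
```

Unlike PSI, the KS statistic needs no binning choice, which is why it behaves better on small windows.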
### EWMA Dynamic Threshold

```python
import numpy as np

class EWMADriftDetector:
    """Exponentially weighted moving average for drift detection."""

    def __init__(self, lambda_param: float = 0.2, L: float = 3.0):
        self.lambda_param = lambda_param  # Smoothing factor
        self.L = L  # Control limit multiplier
        self.ewma = None

    def update(self, value: float, baseline_mean: float, baseline_std: float) -> dict:
        if self.ewma is None:
            self.ewma = value
        else:
            self.ewma = self.lambda_param * value + (1 - self.lambda_param) * self.ewma
        # Calculate control limits (asymptotic form)
        factor = np.sqrt(self.lambda_param / (2 - self.lambda_param))
        ucl = baseline_mean + self.L * baseline_std * factor
        lcl = baseline_mean - self.L * baseline_std * factor
        return {
            "ewma": self.ewma,
            "ucl": ucl,
            "lcl": lcl,
            "drift_detected": self.ewma > ucl or self.ewma < lcl,
        }
```
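To see the detector in action, here is a self-contained run with the update rule and control limits inlined. The baseline statistics (mean 0.80, std 0.05) and the degrading score series are made-up numbers for illustration:

```python
import numpy as np

lambda_param, L = 0.2, 3.0
baseline_mean, baseline_std = 0.80, 0.05

# Control limits are constant once lambda and the baseline are fixed
factor = np.sqrt(lambda_param / (2 - lambda_param))  # sqrt(0.2 / 1.8) = 1/3
ucl = baseline_mean + L * baseline_std * factor      # 0.85
lcl = baseline_mean - L * baseline_std * factor      # 0.75

scores = [0.81, 0.79, 0.80, 0.72, 0.70, 0.68, 0.65]  # quality slowly degrades
ewma = None
alarms = []
for value in scores:
    ewma = value if ewma is None else lambda_param * value + (1 - lambda_param) * ewma
    alarms.append(ewma > ucl or ewma < lcl)
```

Note how the smoothing delays the alarm until the drop is sustained: isolated dips barely move the EWMA, which is exactly what keeps alert noise down.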
### Langfuse Score Trend Monitoring

```python
from datetime import datetime, timedelta

import numpy as np
from langfuse import Langfuse

langfuse = Langfuse()

def check_quality_drift(days: int = 7, threshold_drop: float = 0.1) -> dict:
    """Compare the last day's quality scores against a rolling baseline."""
    # Fetch recent scores (last 24 hours); fetch_scores returns a paginated
    # response whose .data holds the score objects
    current_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=1),
    ).data
    # Fetch baseline scores (the `days`-long window before that)
    baseline_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=datetime.now() - timedelta(days=days),
        to_timestamp=datetime.now() - timedelta(days=1),
    ).data
    current_mean = np.mean([s.value for s in current_scores])
    baseline_mean = np.mean([s.value for s in baseline_scores])
    # Relative drop versus baseline (positive = degradation)
    drift_pct = (baseline_mean - current_mean) / baseline_mean
    return {"drift": drift_pct > threshold_drop, "drop_pct": drift_pct}
```
## Key Decisions

| Decision | Recommendation |
|---|---|
| Statistical method | PSI for production (stable), KS for small samples |
| Threshold strategy | Dynamic (95th percentile of historical) over static |
| Baseline window | 7-30 day rolling window |
| Alert priority | Performance metrics > distribution metrics |
| Tool stack | Langfuse (traces) + Evidently/Phoenix (drift analysis) |
### PSI Threshold Guidelines

| PSI Value | Interpretation | Action |
|---|---|---|
| < 0.1 | No significant drift | Monitor |
| 0.1 - 0.25 | Moderate drift | Investigate |
| >= 0.25 | Significant drift | Alert + Action |
## Anti-Patterns

❌ NEVER use static thresholds without context

```python
if psi > 0.2:  # May cause alert fatigue
    alert()
```

❌ NEVER retrain on a time schedule alone

```python
schedule.every(7).days.do(retrain)  # Wasteful if no drift occurred
```

✅ ALWAYS use dynamic thresholds

```python
threshold = np.percentile(historical_psi, 95)
if psi > threshold:
    alert()
```

✅ ALWAYS correlate with performance metrics

```python
if psi > threshold and quality_score < baseline:
    trigger_evaluation()
```
## Detailed Documentation

| Resource | Description |
|---|---|
| references/statistical-methods.md | PSI, KS, KL divergence, Wasserstein comparison |
| references/embedding-drift.md | Arize Phoenix, cluster monitoring, semantic drift |
| references/ewma-baselines.md | Moving averages, dynamic thresholds, control charts |
| references/langfuse-evidently-integration.md | Combined pipeline pattern |
| checklists/drift-detection-setup-checklist.md | Implementation checklist |
## Related Skills

- langfuse-observability - Score tracking for drift analysis
- llm-evaluation - Quality metrics that feed drift detection
- quality-gates - Threshold enforcement
- observability-monitoring - General monitoring patterns
## Capability Details

### psi-drift

Keywords: psi, population stability index, distribution drift, histogram comparison

Solves:

- Detect distribution shifts in LLM inputs/outputs
- Production-grade drift monitoring
- Stable drift metric for large datasets
### embedding-drift

Keywords: embedding drift, semantic drift, cluster, centroid, arize phoenix

Solves:

- Detect semantic changes in text data
- Monitor RAG retrieval quality
- Track embedding space shifts
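Short of a full Phoenix/Evidently setup, one minimal way to quantify embedding drift is the cosine distance between the centroids of a baseline and a current window. This NumPy sketch assumes each row of the input arrays is one text's embedding vector:

```python
import numpy as np

def centroid_cosine_drift(baseline_emb: np.ndarray, current_emb: np.ndarray) -> float:
    """Cosine distance between the mean embeddings of two windows.

    0 = centroids point the same way; larger values = semantic shift.
    """
    b = baseline_emb.mean(axis=0)
    c = current_emb.mean(axis=0)
    cos_sim = np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c))
    return float(1.0 - cos_sim)
```

Centroid distance is coarse (it misses shifts that preserve the mean, e.g. growing variance), so cluster-level monitoring as in Phoenix is the stronger follow-up.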
### quality-regression

Keywords: quality drift, score degradation, trend, moving average

Solves:

- Detect LLM quality degradation over time
- Compare against historical baselines
- Early warning for model issues

### dynamic-thresholds

Keywords: ewma, dynamic threshold, adaptive, control chart

Solves:

- Reduce alert fatigue with adaptive thresholds
- Statistical process control for LLMs
- Context-aware drift alerting
### canary-monitoring

Keywords: canary prompt, fixed test, regression test, behavioral drift

Solves:

- Track consistency with fixed test inputs
- Detect behavioral changes in LLMs
- Regression testing for model updates
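A canary check can be as simple as diffing the current outputs of fixed prompts against recorded baselines. This sketch uses stdlib `difflib` string similarity; the `canary_drift` helper and the 0.8 cutoff are assumed starting points (an embedding similarity would catch paraphrases this misses):

```python
from difflib import SequenceMatcher

def canary_drift(baseline_outputs: dict, current_outputs: dict,
                 min_similarity: float = 0.8) -> list:
    """Flag canary prompts whose current output diverged from the recorded baseline."""
    drifted = []
    for prompt_id, baseline in baseline_outputs.items():
        current = current_outputs.get(prompt_id, "")  # missing output counts as drift
        similarity = SequenceMatcher(None, baseline, current).ratio()
        if similarity < min_similarity:
            drifted.append((prompt_id, round(similarity, 3)))
    return drifted
```

Run it on every deploy and model-version bump; a non-empty result is a cheap early signal before the statistical detectors above accumulate enough data.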