drift-detection

Monitor LLM quality degradation and input/output distribution shifts in production.

Install skill "drift-detection" with this command: npx skills add yonatangross/orchestkit/yonatangross-orchestkit-drift-detection

Drift Detection

Overview

  • Detecting input distribution drift (data drift)

  • Monitoring output quality degradation (concept drift)

  • Implementing statistical methods (PSI, KS, KL divergence)

  • Setting up dynamic thresholds with moving averages

  • Integrating Langfuse scores with drift analysis

Quick Reference

Population Stability Index (PSI)

import numpy as np

def calculate_psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """
    Calculate Population Stability Index.

    Thresholds:
    - PSI < 0.1: No significant drift
    - 0.1 <= PSI < 0.25: Moderate drift, investigate
    - PSI >= 0.25: Significant drift, action needed
    """
    # Bin both samples on the same edges; independently binned histograms are not comparable
    edges = np.histogram_bin_edges(np.concatenate([expected, actual]), bins=bins)
    expected_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    actual_pct = np.histogram(actual, bins=edges)[0] / len(actual)

    # Avoid division by zero and log(0) for empty bins
    expected_pct = np.clip(expected_pct, 0.0001, None)
    actual_pct = np.clip(actual_pct, 0.0001, None)

    psi = np.sum((actual_pct - expected_pct) * np.log(actual_pct / expected_pct))
    return float(psi)

Usage

psi_score = calculate_psi(baseline_scores, current_scores)
if psi_score >= 0.25:
    alert("Significant quality drift detected!")
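PSI is closely related to KL divergence, which the overview also lists: adding the KL divergence in both directions over the same bins gives the PSI sum used in `calculate_psi`. The sketch below checks that identity on made-up bin proportions. The helper names and the example arrays are illustrative assumptions, not part of the skill.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-4) -> float:
    """KL(p || q) for two discrete distributions defined over the same bins."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def psi_from_proportions(expected_pct: np.ndarray, actual_pct: np.ndarray,
                         eps: float = 1e-4) -> float:
    """PSI computed directly from per-bin proportions."""
    e = np.clip(expected_pct, eps, None)
    a = np.clip(actual_pct, eps, None)
    return float(np.sum((a - e) * np.log(a / e)))

# Hypothetical per-bin proportions (each array sums to 1)
expected_pct = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
actual_pct = np.array([0.05, 0.15, 0.40, 0.25, 0.15])

psi = psi_from_proportions(expected_pct, actual_pct)
symmetric_kl = (kl_divergence(actual_pct, expected_pct)
                + kl_divergence(expected_pct, actual_pct))
print(f"PSI={psi:.4f}  KL(a||e)+KL(e||a)={symmetric_kl:.4f}")
```

Because PSI is symmetric, it does not matter which sample is called the baseline. A single KL term is directional and will generally differ when the arguments are swapped.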

EWMA Dynamic Threshold

import numpy as np

class EWMADriftDetector:
    """Exponential Weighted Moving Average for drift detection."""

    def __init__(self, lambda_param: float = 0.2, L: float = 3.0):
        self.lambda_param = lambda_param  # Smoothing factor
        self.L = L  # Control limit multiplier
        self.ewma = None

    def update(self, value: float, baseline_mean: float, baseline_std: float) -> dict:
        if self.ewma is None:
            # Start from the baseline mean so a noisy first value is smoothed
            # before it is compared against the narrowed control limits
            self.ewma = baseline_mean
        self.ewma = self.lambda_param * value + (1 - self.lambda_param) * self.ewma

        # Calculate control limits (asymptotic form)
        factor = np.sqrt(self.lambda_param / (2 - self.lambda_param))
        ucl = baseline_mean + self.L * baseline_std * factor
        lcl = baseline_mean - self.L * baseline_std * factor

        return {
            "ewma": self.ewma,
            "ucl": ucl,
            "lcl": lcl,
            "drift_detected": self.ewma > ucl or self.ewma < lcl
        }
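The control limits in the class use the asymptotic factor sqrt(lambda / (2 - lambda)). For the first few observations the exact EWMA variance is smaller, so a textbook control chart uses limits that widen with the step count t. A minimal sketch of that variant follows. The function name `ewma_chart` and the score series are hypothetical, and the EWMA is assumed to start at the baseline mean.

```python
import math

def ewma_chart(values, baseline_mean, baseline_std, lam=0.2, L=3.0):
    """Return one dict per observation with the EWMA and its time-varying limits."""
    ewma = baseline_mean  # z_0 starts at the baseline mean (assumption)
    results = []
    for t, value in enumerate(values, start=1):
        ewma = lam * value + (1 - lam) * ewma
        # Exact variance factor; approaches lam / (2 - lam) as t grows
        factor = math.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
        ucl = baseline_mean + L * baseline_std * factor
        lcl = baseline_mean - L * baseline_std * factor
        results.append({
            "t": t,
            "ewma": ewma,
            "ucl": ucl,
            "lcl": lcl,
            "drift_detected": ewma > ucl or ewma < lcl,
        })
    return results

# Hypothetical daily quality scores: stable at first, then a sustained drop
scores = [0.80, 0.81, 0.79, 0.80, 0.70, 0.69, 0.68, 0.67]
chart = ewma_chart(scores, baseline_mean=0.80, baseline_std=0.02)
first_alarm = next((r["t"] for r in chart if r["drift_detected"]), None)
print("first alarm at step:", first_alarm)
```

A smaller `lam` smooths more and reacts more slowly. A larger `lam` behaves closer to a plain per-observation threshold.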

Langfuse Score Trend Monitoring

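from datetime import datetime, timedelta

import numpy as np
from langfuse import Langfuse

langfuse = Langfuse()

def check_quality_drift(days: int = 7, threshold_drop: float = 0.1):
    """Compare recent quality scores against baseline."""
    now = datetime.now()

    # Fetch recent scores (last 24 hours)
    current_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=now - timedelta(days=1)
    )

    # Fetch baseline scores (the window before the last 24 hours)
    baseline_scores = langfuse.fetch_scores(
        name="quality_overall",
        from_timestamp=now - timedelta(days=days),
        to_timestamp=now - timedelta(days=1)
    )

    # Assumes each result iterates over score objects with a .value attribute;
    # verify the response shape against your Langfuse SDK version
    current_values = [s.value for s in current_scores]
    baseline_values = [s.value for s in baseline_scores]
    if not current_values or not baseline_values:
        return {"drift": False, "drop_pct": None}  # Not enough data to compare

    current_mean = np.mean(current_values)
    baseline_mean = np.mean(baseline_values)
    if baseline_mean == 0:
        return {"drift": False, "drop_pct": None}  # Relative drop is undefined

    # Relative drop; positive means quality fell below the baseline
    drift_pct = (baseline_mean - current_mean) / baseline_mean

    return {"drift": bool(drift_pct > threshold_drop), "drop_pct": float(drift_pct)}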

Key Decisions

Decision             Recommendation
Statistical method   PSI for production (stable), KS for small samples
Threshold strategy   Dynamic (95th percentile of historical) over static
Baseline window      7-30 days rolling window
Alert priority       Performance metrics > distribution metrics
Tool stack           Langfuse (traces) + Evidently/Phoenix (drift analysis)
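For the small-sample case the table recommends KS. A two-sample Kolmogorov-Smirnov check can be sketched with NumPy alone: the statistic is the largest gap between the two empirical CDFs. The function names, the synthetic scores, and the use of the large-sample critical value (`c_alpha = 1.358`, roughly a 5% significance level) are assumptions for illustration. For very small samples an exact test from a statistics library is the safer choice.

```python
import numpy as np

def ks_statistic(baseline: np.ndarray, current: np.ndarray) -> float:
    """Largest absolute gap between the two empirical CDFs."""
    baseline = np.sort(baseline)
    current = np.sort(current)
    pooled = np.concatenate([baseline, current])
    cdf_baseline = np.searchsorted(baseline, pooled, side="right") / len(baseline)
    cdf_current = np.searchsorted(current, pooled, side="right") / len(current)
    return float(np.max(np.abs(cdf_baseline - cdf_current)))

def ks_drift(baseline, current, c_alpha: float = 1.358) -> dict:
    """Flag drift when the statistic exceeds the asymptotic critical value."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    n, m = len(baseline), len(current)
    d = ks_statistic(baseline, current)
    critical = c_alpha * np.sqrt((n + m) / (n * m))
    return {"statistic": d, "critical": float(critical),
            "drift_detected": bool(d > critical)}

# Synthetic quality scores: the current window has shifted downward
rng = np.random.default_rng(0)
baseline_scores = rng.normal(0.80, 0.05, size=60)
shifted_scores = rng.normal(0.65, 0.05, size=60)
shifted = ks_drift(baseline_scores, shifted_scores)
print(shifted)
```

Unlike PSI, this needs no binning, which is one reason it suits windows with only a few dozen observations.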

PSI Threshold Guidelines

PSI Value     Interpretation         Action
< 0.1         No significant drift   Monitor
0.1 - 0.25    Moderate drift         Investigate
>= 0.25       Significant drift      Alert + Action
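The bands in this table map directly onto a small helper. This is a sketch only: `psi_action` is a hypothetical name, and the boundaries follow the thresholds in the `calculate_psi` docstring (0.1 inclusive for "Investigate", 0.25 inclusive for "Alert + Action").

```python
def psi_action(psi: float) -> str:
    """Map a PSI value to the action from the threshold guidelines."""
    if psi < 0.1:
        return "Monitor"
    if psi < 0.25:
        return "Investigate"
    return "Alert + Action"

for value in (0.03, 0.18, 0.40):
    print(value, "->", psi_action(value))
```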

Anti-Patterns

❌ NEVER use static thresholds without context

if psi > 0.2:  # May cause alert fatigue
    alert()

❌ NEVER retrain on time schedule alone

schedule.every(7).days.do(retrain) # Wasteful if no drift

✅ ALWAYS use dynamic thresholds

threshold = np.percentile(historical_psi, 95)
if psi > threshold:
    alert()

✅ ALWAYS correlate with performance metrics

if psi > threshold and quality_score < baseline:
    trigger_evaluation()
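The two "ALWAYS" patterns can be combined into one decision function: derive the PSI threshold from history, then act only when a quality metric has also dropped. The sketch below uses hypothetical names (`should_trigger_evaluation`, `historical_psi`, `baseline_quality`) and invented numbers. It is one way to wire the two checks together, not the skill's own implementation.

```python
import numpy as np

def should_trigger_evaluation(psi: float, historical_psi, quality_score: float,
                              baseline_quality: float, percentile: float = 95.0) -> dict:
    """Trigger only when the distribution shifted AND quality fell below baseline."""
    threshold = float(np.percentile(historical_psi, percentile))
    distribution_shifted = psi > threshold
    quality_dropped = quality_score < baseline_quality
    return {
        "threshold": threshold,
        "distribution_shifted": bool(distribution_shifted),
        "quality_dropped": bool(quality_dropped),
        "trigger": bool(distribution_shifted and quality_dropped),
    }

# Hypothetical PSI history from previous monitoring windows
historical_psi = [0.02, 0.03, 0.05, 0.04, 0.06, 0.03, 0.05, 0.04, 0.07, 0.05]

shift_and_drop = should_trigger_evaluation(0.12, historical_psi, 0.70, 0.80)
shift_only = should_trigger_evaluation(0.12, historical_psi, 0.85, 0.80)
print(shift_and_drop["trigger"], shift_only["trigger"])
```

A distribution shift without a quality drop is still worth logging, since inputs can change before output quality visibly degrades.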

Detailed Documentation

Resource Description

references/statistical-methods.md PSI, KS, KL divergence, Wasserstein comparison

references/embedding-drift.md Arize Phoenix, cluster monitoring, semantic drift

references/ewma-baselines.md Moving averages, dynamic thresholds, control charts

references/langfuse-evidently-integration.md Combined pipeline pattern

checklists/drift-detection-setup-checklist.md Implementation checklist

Related Skills

  • langfuse-observability: Score tracking for drift analysis

  • llm-evaluation: Quality metrics that feed drift detection

  • quality-gates: Threshold enforcement

  • observability-monitoring: General monitoring patterns

Capability Details

psi-drift

Keywords: psi, population stability index, distribution drift, histogram comparison

Solves:

  • Detect distribution shifts in LLM inputs/outputs

  • Production-grade drift monitoring

  • Stable drift metric for large datasets

embedding-drift

Keywords: embedding drift, semantic drift, cluster, centroid, arize phoenix

Solves:

  • Detect semantic changes in text data

  • Monitor RAG retrieval quality

  • Track embedding space shifts
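One simple way to track a shift in embedding space is to compare the centroid of a baseline window with the centroid of the current window. The sketch below uses cosine distance between centroids on synthetic 4-dimensional vectors. The function name, the data, and the choice of centroid distance are assumptions. Production setups often add cluster-level monitoring with a tool such as Arize Phoenix, which this sketch does not cover.

```python
import numpy as np

def centroid_cosine_distance(baseline_embeddings, current_embeddings) -> float:
    """Cosine distance (1 - cosine similarity) between the two window centroids."""
    b = np.asarray(baseline_embeddings, dtype=float).mean(axis=0)
    c = np.asarray(current_embeddings, dtype=float).mean(axis=0)
    cosine = np.dot(b, c) / (np.linalg.norm(b) * np.linalg.norm(c))
    return float(1.0 - cosine)

# Synthetic embeddings clustered around two different "topics"
rng = np.random.default_rng(42)
topic_a = np.array([1.0, 0.0, 0.0, 0.0])
topic_b = np.array([0.0, 1.0, 0.0, 0.0])
baseline = topic_a + rng.normal(0, 0.1, size=(200, 4))
same_topic = topic_a + rng.normal(0, 0.1, size=(200, 4))
new_topic = topic_b + rng.normal(0, 0.1, size=(200, 4))

stable_distance = centroid_cosine_distance(baseline, same_topic)
drifted_distance = centroid_cosine_distance(baseline, new_topic)
print(f"stable={stable_distance:.4f}  drifted={drifted_distance:.4f}")
```

A centroid can stay put while the spread or cluster structure changes, so this works best as a cheap first signal and not as the only check.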

quality-regression

Keywords: quality drift, score degradation, trend, moving average

Solves:

  • Detect LLM quality degradation over time

  • Compare against historical baselines

  • Early warning for model issues

dynamic-thresholds

Keywords: ewma, dynamic threshold, adaptive, control chart

Solves:

  • Reduce alert fatigue with adaptive thresholds

  • Statistical process control for LLMs

  • Context-aware drift alerting

canary-monitoring

Keywords: canary prompt, fixed test, regression test, behavioral drift

Solves:

  • Track consistency with fixed test inputs

  • Detect behavioral changes in LLMs

  • Regression testing for model updates
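A canary check can be sketched as a fixed set of prompts whose outputs are compared against stored reference outputs. The example below scores each pair with `difflib.SequenceMatcher` from the standard library. The prompt IDs, texts, and the 0.8 similarity cutoff are hypothetical. Character-level similarity is a crude stand-in for a semantic comparison, and the approach assumes outputs are generated deterministically (for example at temperature 0).

```python
from difflib import SequenceMatcher

def canary_report(reference_outputs: dict, current_outputs: dict,
                  min_similarity: float = 0.8) -> dict:
    """Compare current outputs for fixed prompts against stored references."""
    report = {}
    for prompt_id, reference in reference_outputs.items():
        current = current_outputs.get(prompt_id, "")  # A missing output counts as changed
        similarity = SequenceMatcher(None, reference, current).ratio()
        report[prompt_id] = {
            "similarity": similarity,
            "changed": similarity < min_similarity,
        }
    return report

# Hypothetical stored references and a fresh run of the same canary prompts
reference_outputs = {
    "capital": "The capital of France is Paris.",
    "sum": "2 + 2 equals 4.",
}
current_outputs = {
    "capital": "The capital of France is Paris.",
    "sum": "I am unable to help with arithmetic questions.",
}

report = canary_report(reference_outputs, current_outputs)
changed = [prompt_id for prompt_id, row in report.items() if row["changed"]]
print("changed canaries:", changed)
```

Running the same canary set before and after a model or prompt update turns it into a lightweight regression test.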
