# Langfuse Observability

## Overview

Langfuse is the open-source LLM observability platform that OrchestKit uses for tracing, monitoring, evaluation, and prompt management. Unlike LangSmith (which OrchestKit has since deprecated), Langfuse is self-hosted, free, and designed for production LLM applications.
**When to use this skill:**

- Setting up LLM observability from scratch
- Debugging slow or incorrect LLM responses
- Tracking token usage and costs
- Managing prompts in production
- Evaluating LLM output quality
- Migrating from LangSmith to Langfuse
**OrchestKit Integration:**

- Status: Migrated from LangSmith (Dec 2025)
- Location: `backend/app/shared/services/langfuse/`
- MCP Server: `orchestkit-langfuse` (optional)
## Quick Start

### Setup

`backend/app/shared/services/langfuse/client.py`:

```python
from langfuse import Langfuse

from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST,  # Self-hosted or cloud
)
```
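The settings referenced above map to environment variables; an example `.env` fragment with placeholder values (keys shown are not real):

```shell
# .env — placeholder values; use your deployment's actual keys
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted URL
```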
### Basic Tracing with `@observe`

```python
from langfuse.decorators import observe, langfuse_context

@observe()  # Automatic tracing
async def analyze_content(content: str):
    langfuse_context.update_current_observation(
        metadata={"content_length": len(content)}
    )
    return await llm.generate(content)
```
### Session & User Tracking

```python
langfuse.trace(
    name="analysis",
    user_id="user_123",
    session_id="session_abc",
    metadata={"content_type": "article", "agent_count": 8},
    tags=["production", "orchestkit"],
)
```
## Core Features Summary

| Feature | Description | Reference |
|---------|-------------|-----------|
| Distributed Tracing | Track LLM calls with parent-child spans | `references/tracing-setup.md` |
| Cost Tracking | Automatic token & cost calculation | `references/cost-tracking.md` |
| Prompt Management | Version control for prompts | `references/prompt-management.md` |
| LLM Evaluation | Custom scoring with G-Eval | `references/evaluation-scores.md` |
| Session Tracking | Group related traces | `references/session-tracking.md` |
| Experiments API | A/B testing & benchmarks | `references/experiments-api.md` |
| Multi-Judge Eval | Ensemble LLM evaluation | `references/multi-judge-evaluation.md` |
## References

### Tracing Setup

See: `references/tracing-setup.md`

Key topics covered:

- Initializing the Langfuse client with the `@observe` decorator
- Creating nested traces and spans
- Tracking LLM generations with metadata
- LangChain/LangGraph `CallbackHandler` integration
- Workflow integration patterns
### Cost Tracking

See: `references/cost-tracking.md`

Key topics covered:

- Automatic cost calculation from token usage
- Custom model pricing configuration
- Monitoring dashboard SQL queries
- Cost tracking per analysis/user
- Daily cost trend analysis
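The automatic cost calculation can be approximated locally when sanity-checking dashboards; a minimal sketch, assuming hypothetical per-million-token prices (real prices come from Langfuse's model registry, with custom overrides possible):

```python
# Hypothetical per-million-token USD prices; Langfuse resolves actual
# prices from its model registry at ingestion time.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost from token usage, mirroring Langfuse's calculation."""
    prices = PRICING[model]
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

cost = estimate_cost("gpt-4o", input_tokens=12_000, output_tokens=3_000)
```

Useful for reconciling a dashboard total against raw token counts before trusting a custom pricing override.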
### Prompt Management

See: `references/prompt-management.md`

Key topics covered:

- Prompt versioning and labels (production/staging/draft)
- Template variables with Jinja2 syntax
- A/B testing prompt versions
- OrchestKit 4-level caching architecture (L1-L4)
- Linking prompts to generation spans
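The variable-substitution step can be illustrated without a server; a minimal local sketch of what `langfuse.get_prompt(...).compile(...)` does with `{{variable}}` placeholders (the SDK itself also supports richer Jinja2 templates):

```python
import re

def compile_template(template: str, **variables: str) -> str:
    """Substitute {{name}} placeholders, mimicking prompt.compile()."""
    def replace(match: re.Match) -> str:
        return variables[match.group(1).strip()]
    return re.sub(r"\{\{(.*?)\}\}", replace, template)

template = "Summarize the following {{content_type}} in a {{tone}} tone:\n{{content}}"
prompt = compile_template(template, content_type="article", tone="neutral", content="...")
```

In production you would fetch the template by name and label (e.g. `production`) instead of hardcoding it, which is exactly the point of prompt management.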
### LLM Evaluation

See: `references/evaluation-scores.md`

Key topics covered:

- Custom scoring with numeric/categorical values
- G-Eval automated quality assessment
- Score trends and comparisons
- Filtering traces by score thresholds
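Threshold filtering is a simple client-side pattern once traces are fetched; a minimal sketch over trace records (the dict field names here are illustrative, not the exact API response shape):

```python
from typing import Iterable

def below_threshold(traces: Iterable[dict], score_name: str, threshold: float) -> list[dict]:
    """Return traces whose named score falls below the threshold."""
    flagged = []
    for trace in traces:
        scores = {s["name"]: s["value"] for s in trace.get("scores", [])}
        if score_name in scores and scores[score_name] < threshold:
            flagged.append(trace)
    return flagged

traces = [
    {"id": "t1", "scores": [{"name": "accuracy", "value": 0.92}]},
    {"id": "t2", "scores": [{"name": "accuracy", "value": 0.41}]},
]
low_quality = below_threshold(traces, "accuracy", threshold=0.7)
```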
### Session Tracking

See: `references/session-tracking.md`

Key topics covered:

- Grouping traces by `session_id`
- Multi-turn conversation tracking
- User and metadata analytics
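Grouping fetched traces into conversations is a few lines of client-side code; a sketch assuming trace dicts that carry a `session_id` field:

```python
from collections import defaultdict

def group_by_session(traces: list[dict]) -> dict[str, list[dict]]:
    """Group traces into conversations keyed by session_id."""
    sessions: dict[str, list[dict]] = defaultdict(list)
    for trace in traces:
        sessions[trace["session_id"]].append(trace)
    return dict(sessions)

turns = group_by_session([
    {"id": "t1", "session_id": "session_abc"},
    {"id": "t2", "session_id": "session_abc"},
    {"id": "t3", "session_id": "session_xyz"},
])
```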
### Experiments API

See: `references/experiments-api.md`

Key topics covered:

- Creating test datasets in Langfuse
- Running automated evaluations
- Regression testing for LLMs
- Benchmarking prompt versions
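Once per-item scores for two experiment runs are pulled down, a regression check reduces to comparing aggregates; a minimal sketch (the tolerance value is an illustrative choice, not a Langfuse default):

```python
from statistics import mean

def detect_regression(baseline: list[float], candidate: list[float],
                      tolerance: float = 0.02) -> bool:
    """True if the candidate run's mean score drops more than `tolerance` below baseline."""
    return mean(candidate) < mean(baseline) - tolerance

baseline_scores = [0.90, 0.85, 0.88]
candidate_scores = [0.70, 0.72, 0.69]
regressed = detect_regression(baseline_scores, candidate_scores)  # large drop -> regression
```

Wiring this into CI gives cheap regression gating on top of the datasets and runs Langfuse already stores.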
### Multi-Judge Evaluation

See: `references/multi-judge-evaluation.md`

Key topics covered:

- Multiple LLM judges for quality assessment
- Weighted scoring across judges
- OrchestKit `langfuse_evaluators.py` integration
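The weighted-scoring step is plain aggregation; a minimal sketch (judge names and weights are illustrative, not OrchestKit's actual `langfuse_evaluators.py` configuration):

```python
def weighted_ensemble_score(judge_scores: dict[str, float],
                            weights: dict[str, float]) -> float:
    """Combine per-judge scores into one weighted score, normalizing the weights."""
    total_weight = sum(weights[j] for j in judge_scores)
    return sum(judge_scores[j] * weights[j] for j in judge_scores) / total_weight

score = weighted_ensemble_score(
    {"gpt-4o": 0.9, "claude": 0.8, "gemini": 0.7},
    weights={"gpt-4o": 0.5, "claude": 0.3, "gemini": 0.2},
)  # weighted mean = 0.83
```

The resulting ensemble score can then be attached to the trace as a single numeric score alongside the individual judge scores.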
## Best Practices

- Always use the `@observe` decorator for automatic tracing
- Set `user_id` and `session_id` for better analytics
- Add meaningful metadata (`content_type`, `analysis_id`, etc.)
- Score all production traces for quality monitoring
- Use prompt management instead of hardcoded prompts
- Monitor costs daily to catch spikes early
- Create datasets for regression testing
- Tag production vs. staging traces
## LangSmith Migration Notes

**Key differences:**

| Aspect | Langfuse | LangSmith |
|--------|----------|-----------|
| Hosting | Self-hosted, open-source | Cloud-only, proprietary |
| Cost | Free | Paid |
| Prompts | Built-in management | External storage needed |
| Decorator | `@observe` | `@traceable` |
## External References

- Langfuse Docs
- Python SDK
- Decorators Guide
- Prompt Management
- Self-Hosting
## Related Skills

- `observability-monitoring`: General observability patterns for metrics, logging, and alerting
- `llm-evaluation`: Evaluation patterns that integrate with Langfuse scoring
- `llm-streaming`: Streaming response patterns with trace instrumentation
- `prompt-caching`: Caching strategies that reduce costs tracked by Langfuse
## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Observability platform | Langfuse (not LangSmith) | Open-source, self-hosted, free, built-in prompt management |
| Tracing approach | `@observe` decorator | Automatic, low-overhead instrumentation |
| Cost tracking | Automatic token counting | Built-in model pricing with custom overrides |
| Prompt management | Langfuse native | Version control, A/B testing, labels in one place |
## Capability Details

### distributed-tracing

Keywords: trace, tracing, observability, span, nested, parent-child, observe

Solves:

- How do I trace LLM calls across my application?
- How to debug slow LLM responses?
- Track execution flow in multi-agent workflows
- Create nested trace spans

### cost-tracking

Keywords: cost, token usage, pricing, budget, spend, expense

Solves:

- How do I track LLM costs?
- Calculate token usage and pricing
- Monitor AI budget and spending
- Track cost per user or session

### prompt-management

Keywords: prompt version, prompt template, prompt control, prompt registry

Solves:

- How do I version control prompts?
- Manage prompts in production
- A/B test different prompt versions
- Link prompts to traces
### llm-evaluation

Keywords: score, quality, evaluation, rating, assessment, g-eval

Solves:

- How do I evaluate LLM output quality?
- Score responses with custom metrics
- Track quality trends over time
- Compare prompt versions by quality

### session-tracking

Keywords: session, user tracking, conversation, group traces

Solves:

- How do I group related traces?
- Track multi-turn conversations
- Monitor per-user performance
- Organize traces by session

### langchain-integration

Keywords: langchain, callback, handler, langgraph integration

Solves:

- How do I integrate Langfuse with LangChain?
- Use `CallbackHandler` for tracing
- Automatic LangGraph workflow tracing
- LangChain observability setup
### datasets-evaluation

Keywords: dataset, test set, evaluation dataset, benchmark

Solves:

- How do I create test datasets in Langfuse?
- Run automated evaluations
- Regression testing for LLMs
- Benchmark prompt versions

### ab-testing

Keywords: a/b test, experiment, compare prompts, variant testing

Solves:

- How do I A/B test prompts?
- Compare two prompt versions
- Experimental prompt evaluation
- Statistical prompt testing

### monitoring-dashboard

Keywords: dashboard, analytics, metrics, monitoring, queries

Solves:

- What are the most expensive traces?
- Average cost by agent type
- Quality score trends
- Custom monitoring queries
### orchestkit-integration

Keywords: orchestkit, migration, setup, workflow integration

Solves:

- How does OrchestKit use Langfuse?
- Migrate from LangSmith to Langfuse
- OrchestKit workflow tracing patterns
- Cost tracking per analysis

### multi-judge-evaluation

Keywords: multi judge, g-eval, multiple evaluators, ensemble evaluation, weighted scoring

Solves:

- How do I use multiple LLM judges to evaluate quality?
- Set up G-Eval criteria evaluation
- Configure weighted scoring across judges
- Wire OrchestKit's existing `langfuse_evaluators.py`

### experiments-api

Keywords: experiment, dataset, benchmark, regression test, prompt testing

Solves:

- How do I run experiments across datasets?
- A/B test models and prompts systematically
- Track quality regression over time
- Compare experiment results