# Langfuse Observability

## Overview

Langfuse is the open-source LLM observability platform that OrchestKit uses for tracing, monitoring, evaluation, and prompt management. Unlike LangSmith (which OrchestKit has since deprecated), Langfuse is self-hosted, free, and designed for production LLM applications.
**When to use this skill:**

- Setting up LLM observability from scratch
- Debugging slow or incorrect LLM responses
- Tracking token usage and costs
- Managing prompts in production
- Evaluating LLM output quality
- Migrating from LangSmith to Langfuse
**OrchestKit Integration:**

- Status: Migrated from LangSmith (Dec 2025)
- Location: `backend/app/shared/services/langfuse/`
- MCP Server: `orchestkit-langfuse` (optional)
## Quick Start

### Setup

`backend/app/shared/services/langfuse/client.py`:

```python
from langfuse import Langfuse

from app.core.config import settings

langfuse_client = Langfuse(
    public_key=settings.LANGFUSE_PUBLIC_KEY,
    secret_key=settings.LANGFUSE_SECRET_KEY,
    host=settings.LANGFUSE_HOST,  # Self-hosted or cloud
)
```
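The settings referenced above map to environment variables; an example `.env` fragment with placeholder values (keys shown are not real):

```shell
# .env — placeholder values; use your deployment's actual keys
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com  # or your self-hosted URL
```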
### Basic Tracing with `@observe`

```python
from langfuse.decorators import observe, langfuse_context

@observe()  # Automatic tracing
async def analyze_content(content: str):
    langfuse_context.update_current_observation(
        metadata={"content_length": len(content)}
    )
    return await llm.generate(content)
```
### Session & User Tracking

```python
langfuse.trace(
    name="analysis",
    user_id="user_123",
    session_id="session_abc",
    metadata={"content_type": "article", "agent_count": 8},
    tags=["production", "orchestkit"],
)
```
## Core Features Summary

| Feature | Description | Reference |
|---------|-------------|-----------|
| Distributed Tracing | Track LLM calls with parent-child spans | `references/tracing-setup.md` |
| Cost Tracking | Automatic token & cost calculation | `references/cost-tracking.md` |
| Prompt Management | Version control for prompts | `references/prompt-management.md` |
| LLM Evaluation | Custom scoring with G-Eval | `references/evaluation-scores.md` |
| Session Tracking | Group related traces | `references/session-tracking.md` |
| Experiments API | A/B testing & benchmarks | `references/experiments-api.md` |
| Multi-Judge Eval | Ensemble LLM evaluation | `references/multi-judge-evaluation.md` |
## References

### Tracing Setup

See: `references/tracing-setup.md`

Key topics covered:

- Initializing the Langfuse client with the `@observe` decorator
- Creating nested traces and spans
- Tracking LLM generations with metadata
- LangChain/LangGraph `CallbackHandler` integration
- Workflow integration patterns
### Cost Tracking

See: `references/cost-tracking.md`

Key topics covered:

- Automatic cost calculation from token usage
- Custom model pricing configuration
- Monitoring dashboard SQL queries
- Cost tracking per analysis/user
- Daily cost trend analysis
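The automatic cost calculation can be approximated locally when sanity-checking dashboards; a minimal sketch, assuming hypothetical per-million-token prices (real prices come from Langfuse's model registry, with custom overrides possible):

```python
# Hypothetical per-million-token USD prices; Langfuse resolves actual
# prices from its model registry at ingestion time.
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet": {"input": 3.00, "output": 15.00},
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost from token usage, mirroring Langfuse's calculation."""
    prices = PRICING[model]
    return (input_tokens * prices["input"] + output_tokens * prices["output"]) / 1_000_000

cost = estimate_cost("gpt-4o", input_tokens=12_000, output_tokens=3_000)
```

Useful for reconciling a dashboard total against raw token counts before trusting a custom pricing override.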
### Prompt Management

See: `references/prompt-management.md`

Key topics covered:

- Prompt versioning and labels (production/staging/draft)
- Template variables with Jinja2 syntax
- A/B testing prompt versions
- OrchestKit 4-level caching architecture (L1-L4)
- Linking prompts to generation spans
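The variable-substitution step can be illustrated without a server; a minimal local sketch of what `langfuse.get_prompt(...).compile(...)` does with `{{variable}}` placeholders (the SDK itself also supports richer Jinja2 templates):

```python
import re

def compile_template(template: str, **variables: str) -> str:
    """Substitute {{name}} placeholders, mimicking prompt.compile()."""
    def replace(match: re.Match) -> str:
        return variables[match.group(1).strip()]
    return re.sub(r"\{\{(.*?)\}\}", replace, template)

template = "Summarize the following {{content_type}} in a {{tone}} tone:\n{{content}}"
prompt = compile_template(template, content_type="article", tone="neutral", content="...")
```

In production you would fetch the template by name and label (e.g. `production`) instead of hardcoding it, which is exactly the point of prompt management.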
### LLM Evaluation

See: `references/evaluation-scores.md`

Key topics covered:

- Custom scoring with numeric/categorical values
- G-Eval automated quality assessment
- Score trends and comparisons
- Filtering traces by score thresholds
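Threshold filtering is a simple client-side pattern once traces are fetched; a minimal sketch over trace records (the dict field names here are illustrative, not the exact API response shape):

```python
from typing import Iterable

def below_threshold(traces: Iterable[dict], score_name: str, threshold: float) -> list[dict]:
    """Return traces whose named score falls below the threshold."""
    flagged = []
    for trace in traces:
        scores = {s["name"]: s["value"] for s in trace.get("scores", [])}
        if score_name in scores and scores[score_name] < threshold:
            flagged.append(trace)
    return flagged

traces = [
    {"id": "t1", "scores": [{"name": "accuracy", "value": 0.92}]},
    {"id": "t2", "scores": [{"name": "accuracy", "value": 0.41}]},
]
low_quality = below_threshold(traces, "accuracy", threshold=0.7)
```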
### Session Tracking

See: `references/session-tracking.md`

Key topics covered:

- Grouping traces by `session_id`
- Multi-turn conversation tracking
- User and metadata analytics
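Grouping fetched traces into conversations is a few lines of client-side code; a sketch assuming trace dicts that carry a `session_id` field:

```python
from collections import defaultdict

def group_by_session(traces: list[dict]) -> dict[str, list[dict]]:
    """Group traces into conversations keyed by session_id."""
    sessions: dict[str, list[dict]] = defaultdict(list)
    for trace in traces:
        sessions[trace["session_id"]].append(trace)
    return dict(sessions)

turns = group_by_session([
    {"id": "t1", "session_id": "session_abc"},
    {"id": "t2", "session_id": "session_abc"},
    {"id": "t3", "session_id": "session_xyz"},
])
```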
### Experiments API

See: `references/experiments-api.md`

Key topics covered:

- Creating test datasets in Langfuse
- Running automated evaluations
- Regression testing for LLMs
- Benchmarking prompt versions
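Once per-item scores for two experiment runs are pulled down, a regression check reduces to comparing aggregates; a minimal sketch (the tolerance value is an illustrative choice, not a Langfuse default):

```python
from statistics import mean

def detect_regression(baseline: list[float], candidate: list[float],
                      tolerance: float = 0.02) -> bool:
    """True if the candidate run's mean score drops more than `tolerance` below baseline."""
    return mean(candidate) < mean(baseline) - tolerance

baseline_scores = [0.90, 0.85, 0.88]
candidate_scores = [0.70, 0.72, 0.69]
regressed = detect_regression(baseline_scores, candidate_scores)  # large drop -> regression
```

Wiring this into CI gives cheap regression gating on top of the datasets and runs Langfuse already stores.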
### Multi-Judge Evaluation

See: `references/multi-judge-evaluation.md`

Key topics covered:

- Multiple LLM judges for quality assessment
- Weighted scoring across judges
- OrchestKit `langfuse_evaluators.py` integration
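The weighted-scoring step is plain aggregation; a minimal sketch (judge names and weights are illustrative, not OrchestKit's actual `langfuse_evaluators.py` configuration):

```python
def weighted_ensemble_score(judge_scores: dict[str, float],
                            weights: dict[str, float]) -> float:
    """Combine per-judge scores into one weighted score, normalizing the weights."""
    total_weight = sum(weights[j] for j in judge_scores)
    return sum(judge_scores[j] * weights[j] for j in judge_scores) / total_weight

score = weighted_ensemble_score(
    {"gpt-4o": 0.9, "claude": 0.8, "gemini": 0.7},
    weights={"gpt-4o": 0.5, "claude": 0.3, "gemini": 0.2},
)  # weighted mean = 0.83
```

The resulting ensemble score can then be attached to the trace as a single numeric score alongside the individual judge scores.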
## Best Practices

- Always use the `@observe` decorator for automatic tracing
- Set `user_id` and `session_id` for better analytics
- Add meaningful metadata (`content_type`, `analysis_id`, etc.)
- Score all production traces for quality monitoring
- Use prompt management instead of hardcoded prompts
- Monitor costs daily to catch spikes early
- Create datasets for regression testing
- Tag production vs. staging traces
## LangSmith Migration Notes

**Key differences:**

| Aspect | Langfuse | LangSmith |
|--------|----------|-----------|
| Hosting | Self-hosted, open-source | Cloud-only, proprietary |
| Cost | Free | Paid |
| Prompts | Built-in management | External storage needed |
| Decorator | `@observe` | `@traceable` |
## External References

- Langfuse Docs
- Python SDK
- Decorators Guide
- Prompt Management
- Self-Hosting
## Related Skills

- `observability-monitoring`: General observability patterns for metrics, logging, and alerting
- `llm-evaluation`: Evaluation patterns that integrate with Langfuse scoring
- `llm-streaming`: Streaming response patterns with trace instrumentation
- `prompt-caching`: Caching strategies that reduce costs tracked by Langfuse
## Key Decisions

| Decision | Choice | Rationale |
|----------|--------|-----------|
| Observability platform | Langfuse (not LangSmith) | Open-source, self-hosted, free, built-in prompt management |
| Tracing approach | `@observe` decorator | Automatic, low-overhead instrumentation |
| Cost tracking | Automatic token counting | Built-in model pricing with custom overrides |
| Prompt management | Langfuse native | Version control, A/B testing, labels in one place |
## Capability Details

### distributed-tracing

Keywords: trace, tracing, observability, span, nested, parent-child, observe

Solves:

- How do I trace LLM calls across my application?
- How to debug slow LLM responses?
- Track execution flow in multi-agent workflows
- Create nested trace spans

### cost-tracking

Keywords: cost, token usage, pricing, budget, spend, expense

Solves:

- How do I track LLM costs?
- Calculate token usage and pricing
- Monitor AI budget and spending
- Track cost per user or session

### prompt-management

Keywords: prompt version, prompt template, prompt control, prompt registry

Solves:

- How do I version control prompts?
- Manage prompts in production
- A/B test different prompt versions
- Link prompts to traces
### llm-evaluation

Keywords: score, quality, evaluation, rating, assessment, g-eval

Solves:

- How do I evaluate LLM output quality?
- Score responses with custom metrics
- Track quality trends over time
- Compare prompt versions by quality

### session-tracking

Keywords: session, user tracking, conversation, group traces

Solves:

- How do I group related traces?
- Track multi-turn conversations
- Monitor per-user performance
- Organize traces by session

### langchain-integration

Keywords: langchain, callback, handler, langgraph integration

Solves:

- How do I integrate Langfuse with LangChain?
- Use `CallbackHandler` for tracing
- Automatic LangGraph workflow tracing
- LangChain observability setup
### datasets-evaluation

Keywords: dataset, test set, evaluation dataset, benchmark

Solves:

- How do I create test datasets in Langfuse?
- Run automated evaluations
- Regression testing for LLMs
- Benchmark prompt versions

### ab-testing

Keywords: a/b test, experiment, compare prompts, variant testing

Solves:

- How do I A/B test prompts?
- Compare two prompt versions
- Experimental prompt evaluation
- Statistical prompt testing

### monitoring-dashboard

Keywords: dashboard, analytics, metrics, monitoring, queries

Solves:

- What are the most expensive traces?
- Average cost by agent type
- Quality score trends
- Custom monitoring queries
### orchestkit-integration

Keywords: orchestkit, migration, setup, workflow integration

Solves:

- How does OrchestKit use Langfuse?
- Migrate from LangSmith to Langfuse
- OrchestKit workflow tracing patterns
- Cost tracking per analysis

### multi-judge-evaluation

Keywords: multi judge, g-eval, multiple evaluators, ensemble evaluation, weighted scoring

Solves:

- How do I use multiple LLM judges to evaluate quality?
- Set up G-Eval criteria evaluation
- Configure weighted scoring across judges
- Wire OrchestKit's existing `langfuse_evaluators.py`

### experiments-api

Keywords: experiment, dataset, benchmark, regression test, prompt testing

Solves:

- How do I run experiments across datasets?
- A/B test models and prompts systematically
- Track quality regression over time
- Compare experiment results