# langsmith — LLM Observability, Evaluation & Prompt Management

Keywords: langsmith · llm tracing · llm evaluation · @traceable · langsmith evaluate
LangSmith is a framework-agnostic platform for developing, debugging, and deploying LLM applications. It provides end-to-end tracing, quality evaluation, prompt versioning, and production monitoring.
## When to use this skill

- Add tracing to any LLM pipeline (OpenAI, Anthropic, LangChain, custom models)
- Run offline evaluations with `evaluate()` against a curated dataset
- Set up production monitoring and online evaluation
- Manage and version prompts in the Prompt Hub
- Create datasets for regression testing and benchmarking
- Attach human or automated feedback to traces
- Use LLM-as-judge scoring with openevals
- Debug agent failures with end-to-end trace inspection
## Instructions

- Install the SDK: `pip install -U langsmith` (Python) or `npm install langsmith` (TypeScript)
- Set environment variables: `LANGSMITH_TRACING=true`, `LANGSMITH_API_KEY=lsv2_...`
- Instrument with the `@traceable` decorator or the `wrap_openai()` wrapper
- View traces at smith.langchain.com
- For evaluation setup, see references/python-sdk.md
- For CLI commands, see references/cli.md
- Run `bash scripts/setup.sh` to auto-configure the environment

API Key: get one from smith.langchain.com → Settings → API Keys
Docs: https://docs.langchain.com/langsmith
## Quick Start

### Python

```shell
pip install -U langsmith openai
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="lsv2_..."
export OPENAI_API_KEY="sk-..."
```

```python
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())

@traceable
def rag_pipeline(question: str) -> str:
    """Automatically traced in LangSmith."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

result = rag_pipeline("What is LangSmith?")
```
### TypeScript

```shell
npm install langsmith openai
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="lsv2_..."
```

```typescript
import { traceable } from "langsmith/traceable";
import { wrapOpenAI } from "langsmith/wrappers";
import { OpenAI } from "openai";

const client = wrapOpenAI(new OpenAI());

const pipeline = traceable(
  async (question: string): Promise<string> => {
    const res = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: question }],
    });
    return res.choices[0].message.content ?? "";
  },
  { name: "RAG Pipeline" }
);

await pipeline("What is LangSmith?");
```
## Core Concepts

| Concept | Description |
| --- | --- |
| Run | Individual operation (LLM call, tool call, retrieval). The fundamental unit. |
| Trace | All runs from a single user request, linked by `trace_id`. |
| Thread | Multiple traces in a conversation, linked by `session_id` or `thread_id`. |
| Project | Container grouping related traces (set via `LANGSMITH_PROJECT`). |
| Dataset | Collection of `{inputs, outputs}` examples for offline evaluation. |
| Experiment | Result set from running `evaluate()` against a dataset. |
| Feedback | Score/label attached to a run — numeric, categorical, or freeform. |
## Tracing

### `@traceable` decorator (Python)

```python
from langsmith import traceable

@traceable(
    run_type="chain",  # llm | chain | tool | retriever | embedding
    name="My Pipeline",
    tags=["production", "v2"],
    metadata={"version": "2.1", "env": "prod"},
    project_name="my-project",
)
def pipeline(question: str) -> str:
    return generate_answer(question)
```

### Selective tracing context

```python
import langsmith as ls

# Enable tracing for this block only
with ls.tracing_context(enabled=True, project_name="debug"):
    result = chain.invoke({"input": "..."})

# Disable tracing despite LANGSMITH_TRACING=true
with ls.tracing_context(enabled=False):
    result = chain.invoke({"input": "..."})
```

### Wrap provider clients

```python
from langsmith.wrappers import wrap_openai, wrap_anthropic
from openai import OpenAI
import anthropic

openai_client = wrap_openai(OpenAI())  # All calls auto-traced
anthropic_client = wrap_anthropic(anthropic.Anthropic())
```

### Distributed tracing (microservices)

```python
import langsmith
from langsmith.run_helpers import get_current_run_tree

@langsmith.traceable
def service_a(inputs):
    rt = get_current_run_tree()
    headers = rt.to_headers()  # Pass to child service
    return call_service_b(headers=headers)

@langsmith.traceable
def service_b(x, headers):
    with langsmith.tracing_context(parent=headers):
        return process(x)
```
## Evaluation

### Basic evaluation with `evaluate()`

```python
from langsmith import Client
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = Client()
oai = wrap_openai(OpenAI())

# 1. Create dataset
dataset = client.create_dataset("Geography QA")
client.create_examples(
    dataset_id=dataset.id,
    examples=[
        {"inputs": {"q": "Capital of France?"}, "outputs": {"a": "Paris"}},
        {"inputs": {"q": "Capital of Germany?"}, "outputs": {"a": "Berlin"}},
    ],
)

# 2. Target function
def target(inputs: dict) -> dict:
    res = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": inputs["q"]}],
    )
    return {"a": res.choices[0].message.content}

# 3. Evaluator
def exact_match(inputs, outputs, reference_outputs):
    return outputs["a"].strip().lower() == reference_outputs["a"].strip().lower()

# 4. Run experiment
results = client.evaluate(
    target,
    data="Geography QA",
    evaluators=[exact_match],
    experiment_prefix="gpt-4o-mini-v1",
    max_concurrency=4,
)
```

### LLM-as-judge with openevals

```shell
pip install -U openevals
```

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

judge = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:o3-mini",
    feedback_key="correctness",
)

results = client.evaluate(target, data="my-dataset", evaluators=[judge])
```
### Evaluation types

| Type | When to use |
| --- | --- |
| Code/Heuristic | Exact match, format checks, rule-based |
| LLM-as-judge | Subjective quality, safety, reference-free |
| Human | Annotation queues, pairwise comparison |
| Pairwise | Compare two app versions |
| Online | Production traces, real traffic |
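A code/heuristic evaluator from the first row can be a plain function returning a `{"key", "score"}` dict, the same shape `evaluate()` accepts for custom evaluators. A sketch — `is_concise` and its 100-word threshold are arbitrary examples, not part of the SDK:

```python
def is_concise(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    # Rule-based check: pass if the answer stays under 100 words
    word_count = len(outputs["a"].split())
    return {"key": "concise", "score": word_count <= 100}

# Runs locally on one example, no API call needed
print(is_concise({"q": "Capital of France?"}, {"a": "Paris"}, {"a": "Paris"}))
# {'key': 'concise', 'score': True}
```

Like `exact_match` above, it would be passed as `evaluators=[is_concise]` to `client.evaluate()`.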
## Prompt Hub

```python
from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate

client = Client()

# Push a prompt
prompt = ChatPromptTemplate([
    ("system", "You are a helpful assistant."),
    ("user", "{question}"),
])
client.push_prompt("my-assistant-prompt", object=prompt)

# Pull and use
prompt = client.pull_prompt("my-assistant-prompt")

# Pull a specific version
prompt = client.pull_prompt("my-assistant-prompt:abc123")
```
## Feedback

```python
import uuid
from langsmith import Client

client = Client()

# Custom run ID for later feedback linking
my_run_id = str(uuid.uuid4())
result = chain.invoke({"input": "..."}, {"run_id": my_run_id})

# Attach feedback
client.create_feedback(
    key="correctness",
    score=1,  # 0-1 numeric or categorical
    run_id=my_run_id,
    comment="Accurate and concise",
)
```
## References

- Python SDK Reference — full Client API, `@traceable` signature, `evaluate()`
- TypeScript SDK Reference — Client, traceable, wrappers, evaluate
- CLI Reference — langsmith CLI commands
- Official Docs — langchain.com/langsmith
- SDK GitHub — MIT License, v0.7.17
- openevals — Prebuilt LLM evaluators