# langsmith — LLM Observability, Evaluation & Prompt Management

Keywords: langsmith · llm tracing · llm evaluation · @traceable · langsmith evaluate
LangSmith is a framework-agnostic platform for developing, debugging, and deploying LLM applications. It provides end-to-end tracing, quality evaluation, prompt versioning, and production monitoring.
## When to use this skill

- Add tracing to any LLM pipeline (OpenAI, Anthropic, LangChain, custom models)
- Run offline evaluations with `evaluate()` against a curated dataset
- Set up production monitoring and online evaluation
- Manage and version prompts in the Prompt Hub
- Create datasets for regression testing and benchmarking
- Attach human or automated feedback to traces
- Use LLM-as-judge scoring with openevals
- Debug agent failures with end-to-end trace inspection
## Instructions

- Install the SDK: `pip install -U langsmith` (Python) or `npm install langsmith` (TypeScript)
- Set environment variables: `LANGSMITH_TRACING=true`, `LANGSMITH_API_KEY=lsv2_...`
- Instrument with the `@traceable` decorator or the `wrap_openai()` wrapper
- View traces at smith.langchain.com
- For evaluation setup, see references/python-sdk.md
- For CLI commands, see references/cli.md
- Run `bash scripts/setup.sh` to auto-configure the environment

API Key: get one from smith.langchain.com → Settings → API Keys
Docs: https://docs.langchain.com/langsmith
## Quick Start

### Python

```shell
pip install -U langsmith openai
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="lsv2_..."
export OPENAI_API_KEY="sk-..."
```

```python
from langsmith import traceable
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = wrap_openai(OpenAI())

@traceable
def rag_pipeline(question: str) -> str:
    """Automatically traced in LangSmith."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

result = rag_pipeline("What is LangSmith?")
```
### TypeScript

```shell
npm install langsmith openai
export LANGSMITH_TRACING=true
export LANGSMITH_API_KEY="lsv2_..."
```

```typescript
import { traceable } from "langsmith/traceable";
import { wrapOpenAI } from "langsmith/wrappers";
import { OpenAI } from "openai";

const client = wrapOpenAI(new OpenAI());

const pipeline = traceable(
  async (question: string): Promise<string> => {
    const res = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: question }],
    });
    return res.choices[0].message.content ?? "";
  },
  { name: "RAG Pipeline" }
);

await pipeline("What is LangSmith?");
```
## Core Concepts

| Concept | Description |
| --- | --- |
| Run | Individual operation (LLM call, tool call, retrieval). The fundamental unit. |
| Trace | All runs from a single user request, linked by `trace_id`. |
| Thread | Multiple traces in a conversation, linked by `session_id` or `thread_id`. |
| Project | Container grouping related traces (set via `LANGSMITH_PROJECT`). |
| Dataset | Collection of `{inputs, outputs}` examples for offline evaluation. |
| Experiment | Result set from running `evaluate()` against a dataset. |
| Feedback | Score/label attached to a run — numeric, categorical, or freeform. |
## Tracing

### `@traceable` decorator (Python)

```python
from langsmith import traceable

@traceable(
    run_type="chain",  # llm | chain | tool | retriever | embedding
    name="My Pipeline",
    tags=["production", "v2"],
    metadata={"version": "2.1", "env": "prod"},
    project_name="my-project",
)
def pipeline(question: str) -> str:
    return generate_answer(question)
```

### Selective tracing context

```python
import langsmith as ls

# Enable tracing for this block only
with ls.tracing_context(enabled=True, project_name="debug"):
    result = chain.invoke({"input": "..."})

# Disable tracing despite LANGSMITH_TRACING=true
with ls.tracing_context(enabled=False):
    result = chain.invoke({"input": "..."})
```

### Wrap provider clients

```python
from langsmith.wrappers import wrap_openai, wrap_anthropic
from openai import OpenAI
import anthropic

openai_client = wrap_openai(OpenAI())  # All calls auto-traced
anthropic_client = wrap_anthropic(anthropic.Anthropic())
```

### Distributed tracing (microservices)

```python
import langsmith
from langsmith.run_helpers import get_current_run_tree

@langsmith.traceable
def service_a(inputs):
    rt = get_current_run_tree()
    headers = rt.to_headers()  # Pass to child service
    return call_service_b(headers=headers)

@langsmith.traceable
def service_b(x, headers):
    with langsmith.tracing_context(parent=headers):
        return process(x)
```
## Evaluation

### Basic evaluation with `evaluate()`

```python
from langsmith import Client
from langsmith.wrappers import wrap_openai
from openai import OpenAI

client = Client()
oai = wrap_openai(OpenAI())

# 1. Create dataset
dataset = client.create_dataset("Geography QA")
client.create_examples(
    dataset_id=dataset.id,
    examples=[
        {"inputs": {"q": "Capital of France?"}, "outputs": {"a": "Paris"}},
        {"inputs": {"q": "Capital of Germany?"}, "outputs": {"a": "Berlin"}},
    ],
)

# 2. Target function
def target(inputs: dict) -> dict:
    res = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": inputs["q"]}],
    )
    return {"a": res.choices[0].message.content}

# 3. Evaluator
def exact_match(inputs, outputs, reference_outputs):
    return outputs["a"].strip().lower() == reference_outputs["a"].strip().lower()

# 4. Run experiment
results = client.evaluate(
    target,
    data="Geography QA",
    evaluators=[exact_match],
    experiment_prefix="gpt-4o-mini-v1",
    max_concurrency=4,
)
```

### LLM-as-judge with openevals

```shell
pip install -U openevals
```

```python
from openevals.llm import create_llm_as_judge
from openevals.prompts import CORRECTNESS_PROMPT

judge = create_llm_as_judge(
    prompt=CORRECTNESS_PROMPT,
    model="openai:o3-mini",
    feedback_key="correctness",
)

results = client.evaluate(target, data="my-dataset", evaluators=[judge])
```
### Evaluation types

| Type | When to use |
| --- | --- |
| Code/Heuristic | Exact match, format checks, rule-based |
| LLM-as-judge | Subjective quality, safety, reference-free |
| Human | Annotation queues, pairwise comparison |
| Pairwise | Compare two app versions |
| Online | Production traces, real traffic |
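A code/heuristic evaluator from the first row can be a plain function returning a `{"key", "score"}` dict, the same shape `evaluate()` accepts for custom evaluators. A sketch — `is_concise` and its 100-word threshold are arbitrary examples, not part of the SDK:

```python
def is_concise(inputs: dict, outputs: dict, reference_outputs: dict) -> dict:
    # Rule-based check: pass if the answer stays under 100 words
    word_count = len(outputs["a"].split())
    return {"key": "concise", "score": word_count <= 100}

# Runs locally on one example, no API call needed
print(is_concise({"q": "Capital of France?"}, {"a": "Paris"}, {"a": "Paris"}))
# {'key': 'concise', 'score': True}
```

Like `exact_match` above, it would be passed as `evaluators=[is_concise]` to `client.evaluate()`.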
## Prompt Hub

```python
from langsmith import Client
from langchain_core.prompts import ChatPromptTemplate

client = Client()

# Push a prompt
prompt = ChatPromptTemplate([
    ("system", "You are a helpful assistant."),
    ("user", "{question}"),
])
client.push_prompt("my-assistant-prompt", object=prompt)

# Pull and use
prompt = client.pull_prompt("my-assistant-prompt")

# Pull a specific version
prompt = client.pull_prompt("my-assistant-prompt:abc123")
```
## Feedback

```python
import uuid
from langsmith import Client

client = Client()

# Custom run ID for later feedback linking
my_run_id = str(uuid.uuid4())
result = chain.invoke({"input": "..."}, {"run_id": my_run_id})

# Attach feedback
client.create_feedback(
    key="correctness",
    score=1,  # 0-1 numeric or categorical
    run_id=my_run_id,
    comment="Accurate and concise",
)
```
## References

- Python SDK Reference — full Client API, `@traceable` signature, `evaluate()`
- TypeScript SDK Reference — Client, traceable, wrappers, evaluate
- CLI Reference — langsmith CLI commands
- Official Docs — langchain.com/langsmith
- SDK GitHub — MIT License, v0.7.17
- openevals — Prebuilt LLM evaluators