# Phoenix - AI Observability Platform

Open-source AI observability and evaluation platform for LLM applications with tracing, evaluation, datasets, experiments, and real-time monitoring.
## When to use Phoenix

Use Phoenix when:

- Debugging LLM application issues with detailed traces
- Running systematic evaluations on datasets
- Monitoring production LLM systems in real time
- Building experiment pipelines for prompt/model comparison
- Self-hosting observability without vendor lock-in
Key features:

- **Tracing**: OpenTelemetry-based trace collection for any LLM framework
- **Evaluation**: LLM-as-judge evaluators for quality assessment
- **Datasets**: Versioned test sets for regression testing
- **Experiments**: Compare prompts, models, and configurations
- **Playground**: Interactive prompt testing with multiple models
- **Open-source**: Self-hosted with PostgreSQL or SQLite
Use alternatives instead:

- **LangSmith**: Managed platform with LangChain-first integration
- **Weights & Biases**: Deep learning experiment tracking focus
- **Arize Cloud**: Managed Phoenix with enterprise features
- **MLflow**: General ML lifecycle and model registry focus
## Quick start

### Installation

```bash
pip install arize-phoenix

# With specific backends
pip install arize-phoenix[embeddings]   # Embedding analysis
pip install arize-phoenix-otel          # OpenTelemetry config
pip install arize-phoenix-evals         # Evaluation framework
pip install arize-phoenix-client        # Lightweight REST client
```
### Launch the Phoenix server

```python
import phoenix as px

# Launch in a notebook (ThreadServer mode)
session = px.launch_app()

# View the UI
session.view()      # Embedded iframe
print(session.url)  # http://localhost:6006
```
### Command-line server (production)

```bash
# Start the Phoenix server
phoenix serve

# With PostgreSQL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host/db"
phoenix serve --port 6006
```
### Basic tracing

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

# Configure OpenTelemetry with Phoenix
tracer_provider = register(
    project_name="my-llm-app",
    endpoint="http://localhost:6006/v1/traces",
)

# Instrument the OpenAI SDK
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# All OpenAI calls are now traced
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}],
)
```
## Core concepts

### Traces and spans
A trace represents a complete execution flow, while spans are individual operations within that trace.
```python
from phoenix.otel import register
from opentelemetry import trace

# Set up tracing
tracer_provider = register(project_name="my-app")
tracer = trace.get_tracer(__name__)

# Create custom spans
with tracer.start_as_current_span("process_query") as span:
    span.set_attribute("input.value", query)

    # Child spans are automatically nested
    with tracer.start_as_current_span("retrieve_context"):
        context = retriever.search(query)

    with tracer.start_as_current_span("generate_response"):
        response = llm.generate(query, context)

    span.set_attribute("output.value", response)
```
### Projects
Projects organize related traces:
```python
import os

os.environ["PHOENIX_PROJECT_NAME"] = "production-chatbot"

# Or set it when registering the tracer provider
from phoenix.otel import register

tracer_provider = register(project_name="experiment-v2")
```
## Framework instrumentation

### OpenAI

```python
from phoenix.otel import register
from openinference.instrumentation.openai import OpenAIInstrumentor

tracer_provider = register()
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)
```
### LangChain

```python
from phoenix.otel import register
from openinference.instrumentation.langchain import LangChainInstrumentor

tracer_provider = register()
LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

# All LangChain operations are now traced
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o")
response = llm.invoke("Hello!")
```
### LlamaIndex

```python
from phoenix.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

tracer_provider = register()
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)
```
### Anthropic

```python
from phoenix.otel import register
from openinference.instrumentation.anthropic import AnthropicInstrumentor

tracer_provider = register()
AnthropicInstrumentor().instrument(tracer_provider=tracer_provider)
```
## Evaluation framework

### Built-in evaluators

```python
from phoenix.evals import (
    OpenAIModel,
    HallucinationEvaluator,
    RelevanceEvaluator,
    ToxicityEvaluator,
    llm_classify,
)

# Set up the model used as the judge
eval_model = OpenAIModel(model="gpt-4o")

# Evaluate hallucination
hallucination_eval = HallucinationEvaluator(eval_model)
results = hallucination_eval.evaluate(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    reference="Paris is the capital of France.",
)
```
### Custom evaluators

```python
from phoenix.evals import llm_classify

# Define a custom evaluation
def evaluate_helpfulness(input_text, output_text):
    template = """
    Evaluate if the response is helpful for the given question.

    Question: {input}
    Response: {output}

    Is this response helpful? Answer 'helpful' or 'not_helpful'.
    """
    result = llm_classify(
        model=eval_model,
        template=template,
        input=input_text,
        output=output_text,
        rails=["helpful", "not_helpful"],
    )
    return result
```
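A quick usage sketch for the custom evaluator above; the question and answer strings are hypothetical:

```python
# Classify a single question/answer pair with the custom evaluator
verdict = evaluate_helpfulness(
    input_text="How do I reverse a list in Python?",
    output_text="Use list.reverse() for in-place reversal or reversed() for an iterator.",
)
print(verdict)  # Expected to land within the rails: 'helpful' or 'not_helpful'
```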
### Run evaluations on a dataset

```python
from phoenix import Client
from phoenix.evals import run_evals

client = Client()

# Get spans to evaluate
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
)

# Run evaluations
eval_results = run_evals(
    dataframe=spans_df,
    evaluators=[
        HallucinationEvaluator(eval_model),
        RelevanceEvaluator(eval_model),
    ],
    provide_explanation=True,
)

# Log results back to Phoenix
client.log_evaluations(eval_results)
```
## Datasets and experiments

### Create a dataset

```python
from phoenix import Client

client = Client()

# Create a dataset
dataset = client.create_dataset(
    name="qa-test-set",
    description="QA evaluation dataset",
)

# Add examples
client.add_examples_to_dataset(
    dataset_name="qa-test-set",
    examples=[
        {
            "input": {"question": "What is Python?"},
            "output": {"answer": "A programming language"},
        },
        {
            "input": {"question": "What is ML?"},
            "output": {"answer": "Machine learning"},
        },
    ],
)
```
### Run an experiment

```python
from phoenix import Client
from phoenix.experiments import run_experiment

client = Client()


def my_model(input_data):
    """Your model function."""
    question = input_data["question"]
    return {"answer": generate_answer(question)}


def accuracy_evaluator(input_data, output, expected):
    """Custom evaluator."""
    correct = expected["answer"].lower() in output["answer"].lower()
    return {
        "score": 1.0 if correct else 0.0,
        "label": "correct" if correct else "incorrect",
    }


# Run the experiment
results = run_experiment(
    dataset_name="qa-test-set",
    task=my_model,
    evaluators=[accuracy_evaluator],
    experiment_name="baseline-v1",
)

print(f"Average accuracy: {results.aggregate_metrics['accuracy']}")
```
## Client API

### Query traces and spans

```python
from phoenix import Client

client = Client(endpoint="http://localhost:6006")

# Get spans as a DataFrame
spans_df = client.get_spans_dataframe(
    project_name="my-app",
    filter_condition="span_kind == 'LLM'",
    limit=1000,
)

# Get a specific span
span = client.get_span(span_id="abc123")

# Get a trace
trace = client.get_trace(trace_id="xyz789")
```
### Log feedback

```python
from phoenix import Client

client = Client()

# Log user feedback
client.log_annotation(
    span_id="abc123",
    name="user_rating",
    annotator_kind="HUMAN",
    score=0.8,
    label="helpful",
    metadata={"comment": "Good response"},
)
```
### Export data

```python
# Export to pandas
df = client.get_spans_dataframe(project_name="my-app")

# Export traces
traces = client.list_traces(project_name="my-app")
```
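The exported spans are a regular pandas DataFrame, so they can be persisted for offline analysis; a minimal sketch, assuming pyarrow is installed for Parquet support (file names are arbitrary):

```python
from phoenix import Client

client = Client()

# Persist exported spans for offline analysis or archival
df = client.get_spans_dataframe(project_name="my-app")
df.to_parquet("spans_my-app.parquet")        # Compact, preserves dtypes (needs pyarrow)
df.to_csv("spans_my-app.csv", index=False)   # Human-readable alternative
```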
## Production deployment

### Docker

```bash
docker run -p 6006:6006 arizephoenix/phoenix:latest
```

### With PostgreSQL

```bash
# Set the database URL
export PHOENIX_SQL_DATABASE_URL="postgresql://user:pass@host:5432/phoenix"

# Start the server
phoenix serve --host 0.0.0.0 --port 6006
```
### Environment variables

| Variable | Description | Default |
|----------|-------------|---------|
| `PHOENIX_PORT` | HTTP server port | `6006` |
| `PHOENIX_HOST` | Server bind address | `127.0.0.1` |
| `PHOENIX_GRPC_PORT` | gRPC/OTLP port | `4317` |
| `PHOENIX_SQL_DATABASE_URL` | Database connection | Temporary SQLite |
| `PHOENIX_WORKING_DIR` | Data storage directory | OS temp directory |
| `PHOENIX_ENABLE_AUTH` | Enable authentication | `false` |
| `PHOENIX_SECRET` | JWT signing secret | Required if auth is enabled |
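These variables can also be set from Python before launching an in-process server, which is convenient in notebooks; a minimal sketch with placeholder values:

```python
import os

# Configure Phoenix before it starts (placeholder values)
os.environ["PHOENIX_PORT"] = "6007"
os.environ["PHOENIX_WORKING_DIR"] = "/tmp/phoenix-data"

import phoenix as px

session = px.launch_app()
print(session.url)  # Should reflect the configured port
```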
### With authentication

```bash
export PHOENIX_ENABLE_AUTH=true
export PHOENIX_SECRET="your-secret-key-min-32-chars"
export PHOENIX_ADMIN_SECRET="admin-bootstrap-token"

phoenix serve
```
## Best practices

- **Use projects**: Separate traces by environment (dev/staging/prod)
- **Add metadata**: Include user IDs and session IDs for debugging (see the sketch after this list)
- **Evaluate regularly**: Run automated evaluations in CI/CD
- **Version datasets**: Track test set changes over time
- **Monitor costs**: Track token usage via Phoenix dashboards
- **Self-host**: Use PostgreSQL for production deployments
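A minimal sketch of the metadata practice, attaching user and session identifiers as span attributes with the OpenTelemetry API; the attribute keys and the `run_pipeline` helper are illustrative, not a fixed Phoenix schema:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def handle_request(user_id: str, session_id: str, query: str) -> str:
    # Tag the span so traces can be filtered by user or session in Phoenix
    with tracer.start_as_current_span("handle_request") as span:
        span.set_attribute("user.id", user_id)
        span.set_attribute("session.id", session_id)
        span.set_attribute("input.value", query)
        answer = run_pipeline(query)  # hypothetical application function
        span.set_attribute("output.value", answer)
        return answer
```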
## Common issues

**Traces not appearing:**

```python
from phoenix.otel import register

# Verify the endpoint
tracer_provider = register(
    project_name="my-app",
    endpoint="http://localhost:6006/v1/traces",  # Correct endpoint
)

# Force flush pending spans
from opentelemetry import trace

trace.get_tracer_provider().force_flush()
```
**High memory in notebook:**

```python
# Close the session when done
session = px.launch_app()
# ... do work ...
session.close()
px.close_app()
```
**Database connection issues:**

```bash
# Verify the PostgreSQL connection
psql $PHOENIX_SQL_DATABASE_URL -c "SELECT 1"

# Check Phoenix logs
phoenix serve --log-level debug
```
## References

- Advanced Usage - Custom evaluators, experiments, production setup
- Troubleshooting - Common issues, debugging, performance
## Resources

- Documentation: https://docs.arize.com/phoenix
- Repository: https://github.com/Arize-ai/phoenix
- Docker Hub: https://hub.docker.com/r/arizephoenix/phoenix
- Version: 12.0.0+
- License: Apache 2.0