📊 Observability & Monitoring
Tools for tracking, debugging, and optimizing LLM applications and AI systems.
Overview
Observability tools provide visibility into AI systems through:
- LLM Tracing: Track every model call and its token usage
- Cost Monitoring: Measure and optimize API spend
- Performance: Latency, throughput, error rates
- Debugging: Prompt engineering and chain analysis
- Analytics: User patterns and model behavior
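To make the list above concrete, here is a minimal, tool-agnostic sketch of what a single trace record might capture. The names `TraceRecord`, `traced`, and `fake_llm` are hypothetical and do not correspond to any listed tool's API; the model call is a stand-in that needs no network access.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TraceRecord:
    """One observed model call (hypothetical schema, not any tool's format)."""
    name: str
    latency_ms: float
    prompt_tokens: int
    completion_tokens: int
    error: Optional[str] = None


TRACES: List[TraceRecord] = []


def traced(name: str) -> Callable:
    """Decorator that records latency, token counts, and errors per call."""
    def decorator(fn):
        def wrapper(prompt: str) -> str:
            start = time.perf_counter()
            try:
                text, prompt_tokens, completion_tokens = fn(prompt)
            except Exception as exc:
                elapsed = (time.perf_counter() - start) * 1000
                TRACES.append(TraceRecord(name, elapsed, 0, 0, repr(exc)))
                raise
            elapsed = (time.perf_counter() - start) * 1000
            TRACES.append(TraceRecord(name, elapsed, prompt_tokens, completion_tokens))
            return text
        return wrapper
    return decorator


@traced("echo-model")
def fake_llm(prompt: str):
    # Stand-in for a real model call.
    # Returns (text, prompt_tokens, completion_tokens), counting words as tokens.
    words = prompt.split()
    return " ".join(reversed(words)), len(words), len(words)
```

Calling `fake_llm("what is ai")` appends one `TraceRecord` to `TRACES`. The tools below do the same job with persistent storage, dashboards, and cost models on top.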
Tools List
| Repo | Description | Stars |
|---|---|---|
| langfuse/langfuse | Open-source LLM engineering platform | |
| Helicone/helicone | Open-source LLM observability for developers | |
| wandb/wandb | Experiment tracking for ML and LLMs | |
| openlit/openlit | OpenTelemetry-native LLM observability | |
| whylabs/whylogs | Data and ML monitoring library | |
| Arize-ai/phoenix | ML observability in a notebook | |
| traceloop/openllmetry | OpenTelemetry for LLMs | |
Note: LangSmith (from LangChain) is also popular but partially proprietary.
Selection Guide
By Use Case
🔍 LLM Application Debugging
- Langfuse - Best traces, prompt management
- Helicone - Simple integration, fast
- OpenLit - OpenTelemetry compatible
- LangSmith - If using LangChain

💰 Cost Optimization
- Langfuse - Detailed cost tracking per user
- Helicone - Real-time cost alerts
- OpenLit - Token usage analytics

📈 Experiment Tracking
- Weights & Biases - Gold standard for ML
- Langfuse - Prompt versioning
- Phoenix - Notebook-based analysis

🏢 Enterprise Compliance
- Langfuse - Self-hosted, GDPR compliant
- OpenLit - Open standards, portable
- whylogs - Data privacy focused

By Team Size
Solo Developer
- Helicone - Fastest setup
- OpenLit - Free, comprehensive
- Langfuse Cloud - Free tier

Small Team (2-10)
- Langfuse - Team collaboration
- W&B - Experiment management
- Helicone - Cost tracking

Enterprise
- Langfuse Self-Hosted - Full control
- W&B Teams - Advanced features
- OpenLit - Custom infrastructure
Quick Start
Langfuse - Comprehensive Observability
from langfuse import Langfuse
from openai import OpenAI

# Initialize the Langfuse client.
# Note: trace()/generation() below follow the v2 Python SDK;
# later major versions changed this interface.
langfuse = Langfuse(
    public_key="pk-...",
    secret_key="sk-...",
    host="https://cloud.langfuse.com"  # or self-hosted URL
)

# Create the OpenAI client
client = OpenAI()

# Create trace
trace = langfuse.trace(name="chat-completion")

# Track generation
generation = trace.generation(
    name="openai-call",
    model="gpt-4",
    input={"prompt": "What is AI?"},
)
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is AI?"}]
)

# Update with response
generation.end(
    output=response.choices[0].message.content,
    usage={
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "total_tokens": response.usage.total_tokens
    }
)
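LangChain Integration:
from langchain.chains import LLMChain
from langchain.llms import OpenAI  # LangChain's wrapper, not the openai client
from langchain.prompts import PromptTemplate
from langfuse.callback import CallbackHandler

# Create the Langfuse callback handler
handler = CallbackHandler(
    public_key="pk-...",
    secret_key="sk-..."
)

# Placeholder prompt; substitute your own template
prompt = PromptTemplate.from_template("Answer concisely: {question}")

# Attach the handler to both the LLM and the chain
llm = OpenAI(callbacks=[handler])
chain = LLMChain(llm=llm, prompt=prompt, callbacks=[handler])
chain.run("query")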
Helicone - Simple Proxy
from openai import OpenAI

# Just change the base URL!
client = OpenAI(
    api_key="your-openai-key",
    base_url="https://oai.helicone.ai/v1",
    default_headers={
        "Helicone-Auth": "Bearer YOUR_HELICONE_API_KEY"
    }
)

# Use normally - automatically tracked
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],
    # Optional: Add metadata
    extra_headers={
        "Helicone-User-Id": "user123",
        "Helicone-Session-Id": "session456",
        "Helicone-Property-Environment": "production"
    }
)
Cost Tracking:
# Helicone automatically calculates costs
# View in dashboard: helicone.ai/dashboard
# Set budget alerts
# Settings → Alerts → Budget threshold
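Weights & Biases - Experiment Tracking
import wandb
from openai import OpenAI

# Initialize
wandb.init(
    project="llm-experiments",
    config={
        "model": "gpt-4",
        "temperature": 0.7,
        "max_tokens": 1000
    }
)
client = OpenAI()

# Example prompts (placeholders; substitute your own test set)
test_prompts = ["What is AI?", "Explain observability in one sentence."]

# Run experiment
for prompt in test_prompts:
    response = client.chat.completions.create(
        model=wandb.config.model,
        messages=[{"role": "user", "content": prompt}],
        temperature=wandb.config.temperature,
        max_tokens=wandb.config.max_tokens
    )
    usage = response.usage
    # Log metrics
    wandb.log({
        "prompt_length": len(prompt),
        "response_length": len(response.choices[0].message.content),
        "tokens_used": usage.total_tokens,
        # Assumes GPT-4 (8K) list prices of $0.03/1K input and $0.06/1K output
        # tokens; check current pricing before relying on this figure
        "cost": usage.prompt_tokens * 0.00003 + usage.completion_tokens * 0.00006
    })

# Finish
wandb.finish()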
OpenLit - OpenTelemetry Native
import openlit
from openai import OpenAI

# Auto-instrument (monitors all OpenAI calls)
openlit.init(
    otlp_endpoint="http://localhost:4318"  # Your OTLP collector
)

# Use OpenAI normally - automatically traced
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}]
)

# Metrics sent to your observability stack:
# - Prometheus
# - Grafana
# - Jaeger
# - Any OTLP-compatible backend

Supported Frameworks:
- OpenAI, Anthropic, Cohere, HuggingFace
- LangChain, LlamaIndex
- ChromaDB, Pinecone, Qdrant
Phoenix - Notebook Analysis
import phoenix as px
from openai import OpenAI
from phoenix.trace.openai import OpenAIInstrumentor

# Launch Phoenix
session = px.launch_app()

# Instrument OpenAI
OpenAIInstrumentor().instrument()

# Run your LLM code
client = OpenAI()
response = client.chat.completions.create(...)

# View traces in notebook
px.active_session().view()

# Or open UI
# http://localhost:6006
Feature Comparison
| Feature | Langfuse | Helicone | W&B | OpenLit | Phoenix |
|---|---|---|---|---|---|
| LLM Tracing | ✅ | ✅ | ⚠️ | ✅ | ✅ |
| Cost Tracking | ✅ | ✅ | ⚠️ | ✅ | ⚠️ |
| Prompt Management | ✅ | ⚠️ | ❌ | ❌ | ❌ |
| A/B Testing | ✅ | ⚠️ | ✅ | ❌ | ❌ |
| User Analytics | ✅ | ✅ | ⚠️ | ⚠️ | ⚠️ |
| Self-Hosted | ✅ | ✅ | ⚠️ | ✅ | ✅ |
| Cloud Option | ✅ | ✅ | ✅ | ❌ | ❌ |
| Real-time | ✅ | ✅ | ✅ | ✅ | ✅ |
| Annotations | ✅ | ⚠️ | ✅ | ❌ | ✅ |
| OpenTelemetry | ⚠️ | ❌ | ❌ | ✅ | ✅ |
✅ Full support | ⚠️ Partial support | ❌ Not supported
Metrics to Track
1. Cost Metrics
# Track per request
cost_per_request = (prompt_tokens * input_price +
                    completion_tokens * output_price)
# Track per user (pseudocode): sum of request costs grouped by user_id
# Track per feature (pseudocode): sum of request costs grouped by feature_tag
# Set alerts
if monthly_cost > budget:
    send_alert()
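The cost metrics above can be sketched as runnable code. This is an illustration under stated assumptions: the `PRICES` table uses made-up per-1K-token rates for a hypothetical `model-a`, and the request records are plain dictionaries rather than any tool's export format.

```python
from collections import defaultdict

# Hypothetical per-1K-token prices in USD; replace with your provider's rates.
PRICES = {"model-a": {"input": 0.03, "output": 0.06}}


def request_cost(model, prompt_tokens, completion_tokens):
    """Cost of one request: input and output tokens are priced separately."""
    price = PRICES[model]
    return (prompt_tokens * price["input"]
            + completion_tokens * price["output"]) / 1000


def aggregate_costs(requests, key):
    """Sum request costs grouped by `key` (e.g. 'user_id' or 'feature_tag')."""
    totals = defaultdict(float)
    for r in requests:
        totals[r[key]] += request_cost(
            r["model"], r["prompt_tokens"], r["completion_tokens"]
        )
    return dict(totals)


def over_budget(totals, budget):
    """True when the combined spend across all groups exceeds the budget."""
    return sum(totals.values()) > budget


requests = [
    {"user_id": "u1", "feature_tag": "summarize", "model": "model-a",
     "prompt_tokens": 1000, "completion_tokens": 500},
    {"user_id": "u2", "feature_tag": "chat", "model": "model-a",
     "prompt_tokens": 2000, "completion_tokens": 0},
    {"user_id": "u1", "feature_tag": "chat", "model": "model-a",
     "prompt_tokens": 0, "completion_tokens": 1000},
]

by_user = aggregate_costs(requests, "user_id")
by_feature = aggregate_costs(requests, "feature_tag")
```

Grouping by a key passed at call time keeps one function serving both the per-user and per-feature views.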
2. Performance Metrics
# Latency
time_to_first_token # Streaming
total_response_time
# Throughput
requests_per_second
tokens_per_second
# Error rate
error_rate = errors / total_requests
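As a rough sketch of how these performance figures might be computed from raw request logs, the example below uses nearest-rank percentiles over a fixed time window. The record fields (`latency_ms`, `error`, `completion_tokens`) and the function names are assumptions made for illustration.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(requests, window_seconds):
    """Aggregate latency, throughput, and error rate over one time window."""
    latencies = [r["latency_ms"] for r in requests]
    errors = sum(1 for r in requests if r["error"])
    tokens = sum(r["completion_tokens"] for r in requests)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "requests_per_second": len(requests) / window_seconds,
        "tokens_per_second": tokens / window_seconds,
        "error_rate": errors / len(requests),
    }
```

Percentiles are usually more informative than averages here, because a handful of very slow requests can hide behind a healthy mean.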
3. Quality Metrics
# User feedback
user_rating # 1-5 stars
thumbs_up_down
# Model behavior
hallucination_rate
refusal_rate # Safety refusals
average_response_length
4. Token Metrics
# Usage
total_tokens = prompt_tokens + completion_tokens
tokens_per_user
tokens_per_session
# Efficiency
output_input_ratio = completion_tokens / prompt_tokens
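A small sketch of the token metrics above, assuming each call is logged as a dictionary with `session_id`, `prompt_tokens`, and `completion_tokens` fields (a hypothetical shape chosen for this example):

```python
from collections import defaultdict


def token_stats(calls):
    """Total tokens, tokens per session, and the output/input ratio."""
    per_session = defaultdict(int)
    prompt_total = 0
    completion_total = 0
    for c in calls:
        per_session[c["session_id"]] += c["prompt_tokens"] + c["completion_tokens"]
        prompt_total += c["prompt_tokens"]
        completion_total += c["completion_tokens"]
    # Guard against division by zero when no prompt tokens were recorded.
    ratio = completion_total / prompt_total if prompt_total else None
    return {
        "total_tokens": prompt_total + completion_total,
        "tokens_per_session": dict(per_session),
        "output_input_ratio": ratio,
    }
```

The ratio is computed over totals rather than averaged per call, so a few tiny prompts do not skew it.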
Advanced Patterns
1. Prompt Versioning (Langfuse)
from langfuse import Langfuse

langfuse = Langfuse()

# Create the prompt; Langfuse assigns the version number (1 for a new name)
langfuse.create_prompt(
    name="summarization",
    prompt="Summarize this text in {{max_words}} words:\n\n{{text}}"
)

# Use in production, pinned to version 1
prompt = langfuse.get_prompt("summarization", version=1)
formatted = prompt.compile(max_words=50, text=document)

# Later: creating a prompt with the same name adds version 2
langfuse.create_prompt(
    name="summarization",
    prompt="Provide a {{length}} summary of:\n\n{{text}}"
)

# A/B test by fetching version=1 or version=2 per request
2. Session Tracking
# Langfuse example
trace = langfuse.trace(
    name="customer-support-chat",
    user_id="user123",
    session_id="session456",
    metadata={
        "channel": "web",
        "product": "premium"
    }
)

# Track entire conversation
for turn in conversation:
    generation = trace.generation(
        name=f"turn-{turn.id}",
        input=turn.user_message,
        output=turn.assistant_message
    )
3. Cost Alerts (Helicone)
# Set up webhooks for budget alerts
# Dashboard → Settings → Webhooks
# Receive POST when threshold crossed
{
"alert_type": "budget_exceeded",
"threshold": 1000,
"current_spend": 1250,
"period": "monthly"
}
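A receiving service could turn such a POST body into a readable alert roughly as follows. This is a sketch that assumes the payload has exactly the shape shown above; the field names are taken from that example and have not been checked against Helicone's actual webhook schema, and `handle_budget_alert` is a hypothetical helper.

```python
import json


def handle_budget_alert(raw_body: bytes) -> str:
    """Parse a budget webhook body and return a one-line summary."""
    payload = json.loads(raw_body)
    if payload.get("alert_type") != "budget_exceeded":
        return "ignored"
    overspend = payload["current_spend"] - payload["threshold"]
    pct_over = 100 * overspend / payload["threshold"]
    return (
        f"{payload['period']} budget exceeded by "
        f"${overspend:,.2f} ({pct_over:.0f}% over)"
    )


body = json.dumps({
    "alert_type": "budget_exceeded",
    "threshold": 1000,
    "current_spend": 1250,
    "period": "monthly",
}).encode("utf-8")
message = handle_budget_alert(body)
```

In a real deployment this function would sit behind an HTTP endpoint and forward `message` to a chat channel or pager.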
4. Custom Metrics (W&B)
import wandb
# Log custom evaluation metrics
wandb.log({
"accuracy": evaluate_accuracy(predictions),
"coherence": evaluate_coherence(responses),
"toxicity": evaluate_toxicity(responses),
"custom_metric": your_metric_function()
})
# Compare across runs
wandb.log({"metric": value}, step=iteration)
Self-Hosting Guide
Langfuse (Docker)
# Clone
git clone https://github.com/langfuse/langfuse
cd langfuse
# Configure
cp .env.example .env
# Edit .env with database credentials
# Run
docker-compose up -d
# Access at http://localhost:3000
Helicone (Docker)
docker run -d \
-p 3000:3000 \
-e DATABASE_URL=postgresql://... \
-e OPENAI_API_KEY=your-key \
helicone/helicone:latest
OpenLit (with Grafana)
# docker-compose.yml
version: '3'
services:
  otel-collector:
    image: otel/opentelemetry-collector
    # Point the collector at the mounted config file
    command: ["--config=/etc/otel-collector-config.yaml"]
    ports:
      - "4318:4318"
    volumes:
      - ./otel-config.yaml:/etc/otel-collector-config.yaml
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3001:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
Dashboard Examples
Cost Dashboard (Langfuse)
┌─────────────────────────────────┐
│ Monthly Cost: $2,450 │
│ Budget: $3,000 (82% used) │
├─────────────────────────────────┤
│ Top Users by Cost: │
│ 1. user@company.com $450 │
│ 2. team@company.com $320 │
│ 3. dev@company.com $180 │
├─────────────────────────────────┤
│ Cost by Model: │
│ GPT-4: $1,800 (73%) │
│ Claude Sonnet: $500 (20%) │
│ GPT-3.5: $150 (7%) │
└─────────────────────────────────┘
Performance Dashboard
┌─────────────────────────────────┐
│ Latency (p50/p95/p99) │
│ 450ms / 1.2s / 2.5s │
├─────────────────────────────────┤
│ Throughput: 150 req/min │
│ Error Rate: 0.3% │
├─────────────────────────────────┤
│ Token Usage: │
│ Input: 1.2M tokens │
│ Output: 800K tokens │
│ Total: 2.0M tokens │
└─────────────────────────────────┘
Best Practices
1. Tag Everything
# User context
trace.update(
    user_id="user123",
    metadata={
        "plan": "enterprise",
        "org": "acme-corp",
        "region": "us-west"
    }
)

# Feature tracking
generation.update(
    tags=["feature:summarization", "version:v2", "experiment:prompt-A"]
)
2. Sample Strategically
import random

SAMPLE_RATE = 0.1  # Log 10% of ordinary requests in production

# Always log errors and slow requests; sample the rest.
# A single condition avoids logging the same trace twice.
if error or response_time > threshold or random.random() < SAMPLE_RATE:
    log_to_observability(trace)
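One variation on this sampling rule is to make the decision deterministic per trace, so that every service handling the same trace ID reaches the same verdict. The sketch below swaps random sampling for a hash of the ID. `should_log` and its default thresholds are hypothetical and not part of any listed SDK.

```python
import hashlib


def should_log(trace_id: str, latency_ms: float, error: bool,
               sample_rate: float = 0.1, slow_ms: float = 5000) -> bool:
    """Keep all errors and slow requests; keep a stable fraction of the rest."""
    if error or latency_ms > slow_ms:
        return True
    # Map the trace ID to a number in [0, 1) so the decision is repeatable.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Because the bucket depends only on the trace ID, raising `sample_rate` later keeps every previously sampled trace in the sample.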
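3. Annotate for Quality
# Collect user feedback
trace.score(
    name="user_rating",
    value=4,  # 1-5 stars
    comment="Helpful but verbose"
)

# Review low-rated traces, e.g. to build fine-tuning or evaluation sets
# (the filter expression is illustrative; check your SDK version for the exact query API)
low_rated_traces = langfuse.get_traces(
    filter="user_rating < 3"
)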
4. Monitor Costs in Real-Time
import time
import schedule  # third-party: pip install schedule

# Set up alerts
def check_budget():
    today_spend = get_daily_cost()
    if today_spend > DAILY_BUDGET:
        alert_team()
        # Optional: Pause non-critical features
        disable_feature("summarization")

# Run hourly
schedule.every().hour.do(check_budget)
while True:
    schedule.run_pending()
    time.sleep(60)
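The budget check can also be tiered, so that the team is warned before anything gets paused. A minimal sketch, assuming a hypothetical `budget_action` helper and an 80% warning threshold picked for illustration:

```python
def budget_action(spend: float, budget: float, warn_ratio: float = 0.8) -> str:
    """Return 'ok', 'warn', or 'pause' for the current spend level."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    if spend > budget:
        return "pause"   # over budget: pause non-critical features
    if spend >= warn_ratio * budget:
        return "warn"    # approaching the budget: alert the team
    return "ok"
```

Returning an action name rather than calling the alerting code directly keeps the decision easy to unit test.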
Integration Examples
LangChain + Langfuse
from langchain.chat_models import ChatOpenAI
from langchain.chains import LLMChain
from langchain.prompts import PromptTemplate
from langfuse.callback import CallbackHandler

handler = CallbackHandler(
    public_key="pk-...",
    secret_key="sk-...",
    session_id="session123",
    user_id="user456"
)

# Placeholder prompt; substitute your own template
prompt = PromptTemplate.from_template("Answer concisely: {question}")
llm = ChatOpenAI(callbacks=[handler])
chain = LLMChain(llm=llm, prompt=prompt, callbacks=[handler])
result = chain.run("query")
LlamaIndex + Helicone
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI

# Configure OpenAI with Helicone
llm = OpenAI(
    api_base="https://oai.helicone.ai/v1",
    additional_kwargs={
        "headers": {
            "Helicone-Auth": "Bearer YOUR_KEY"
        }
    }
)
service_context = ServiceContext.from_defaults(llm=llm)
index = VectorStoreIndex.from_documents(docs, service_context=service_context)
OpenAI + W&B
import wandb
from openai import OpenAI
from wandb.integration.openai import autolog

# Auto-log all OpenAI calls
autolog({"project": "gpt-4-analysis"})

# Now just use OpenAI normally
client = OpenAI()
response = client.chat.completions.create(...)
# Metrics automatically logged to W&B
Troubleshooting
High Costs
# Identify expensive operations
# (the filter syntax is illustrative; check your SDK version for the exact query API)
expensive_traces = langfuse.get_traces(
    filter="total_cost > 1.0",  # > $1 per request
    order_by="total_cost DESC"
)
# Common causes:
# 1. Long context windows → Use RAG to reduce
# 2. Unbounded output length → Set max_tokens
# 3. Unnecessary calls → Cache results
# 4. Oversized model → Use a smaller model (e.g. GPT-3.5) where quality allows
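For the "cache results" remedy, a minimal in-memory sketch might look like the following. `ResponseCache` is a hypothetical helper, and `call_model` stands for whatever function actually contacts the provider. Caching like this is only appropriate when identical requests should return identical answers, for example at temperature 0.

```python
import hashlib
import json


class ResponseCache:
    """Cache model responses keyed by model, messages, and temperature."""

    def __init__(self, call_model):
        self._call_model = call_model  # function(model, messages, temperature)
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, messages, temperature):
        # Serialize deterministically so equal requests map to the same key.
        blob = json.dumps(
            {"model": model, "messages": messages, "temperature": temperature},
            sort_keys=True,
        )
        return hashlib.sha256(blob.encode("utf-8")).hexdigest()

    def complete(self, model, messages, temperature=0.0):
        key = self._key(model, messages, temperature)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._call_model(model, messages, temperature)
        self._store[key] = result
        return result
```

The `hits` and `misses` counters can be exported as metrics to show how much spend the cache avoids. A production version would also need eviction and expiry.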
Slow Responses
# Find slow traces
# (the filter syntax is illustrative; check your SDK version for the exact query API)
slow_traces = langfuse.get_traces(
    filter="latency > 5000",  # > 5 seconds
    order_by="latency DESC"
)
# Optimize:
# 1. Use streaming for better UX
# 2. Cache frequent queries
# 3. Use faster models (Claude Haiku, GPT-3.5)
# 4. Parallelize independent LLM calls
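Parallelizing independent calls can be sketched with a thread pool, since LLM requests are I/O-bound. Here `slow_model_call` is a stand-in that sleeps instead of calling a real API, and `run_parallel` is a hypothetical helper name.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_model_call(prompt: str) -> str:
    # Stand-in for a network-bound LLM request.
    time.sleep(0.2)
    return prompt.upper()


def run_parallel(prompts, max_workers=4):
    """Run independent calls concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(slow_model_call, prompts))


start = time.perf_counter()
results = run_parallel(["a", "b", "c", "d"])
elapsed = time.perf_counter() - start
```

With four workers, the four calls take roughly as long as one call instead of four. Keep `max_workers` within your provider's rate limits.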
Missing Data
# Check credentials and connectivity
assert langfuse.auth_check()
# Force send buffered traces (important in short-lived scripts)
langfuse.flush()
# Check sampling rate: set it when constructing the client
langfuse = Langfuse(sample_rate=1.0)  # 100% for debugging
Related Resources
- Agents & Orchestration - Monitor agent behavior
- Vector Databases - Track vector search performance
- Developer Tools - Debug AI coding assistants
- Production AI - Production deployment patterns