

Agent Observability with LangSmith, Langfuse, and Arize: A Hands-On Comparison

Riya Bansal Last Updated : 03 Jun, 2026


Your AI agent works great in testing. Then you ship it, and something breaks. A tool call loops forever. A retrieval step returns garbage and costs spike. You have no idea why.

That’s the agent observability problem. And if you’re building with LLMs, you need to solve it before production, not after. This post breaks down three of the most-used observability tools: LangSmith, Langfuse, and Arize. We’ll set each one up, trace the same agent, and compare what you actually get.

Table of contents

What is Agent Observability?
Setting Up the Test Agent
LangSmith: Native LangChain Tracing
Langfuse: Open Source and Framework-Agnostic
Arize: Production-Grade ML Observability
Which Should You Pick for Agent Observability?
Conclusion

What is Agent Observability?

Traditional application monitoring tracks requests, errors, and latency, but that is not enough for agents.

An agent may call multiple tools in sequence, and each LLM step has its own prompt, token usage, latency, and potential failure point. A single failed retrieval or tool call can lead to an incorrect final response.

Agent observability captures the full execution graph: every step, decision, LLM input and output, tool call, arguments, results, token usage, latency, and evaluation score. Without this visibility, debugging agent behavior becomes guesswork.
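To make the "execution graph" idea concrete, here is a minimal, library-free sketch of what a trace might look like as a data structure. The `Span` class, its field names, and the sample values are all hypothetical and chosen for illustration; none of the three tools exposes this exact shape.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class Span:
    """One step in an agent run: the run itself, an LLM call, or a tool call."""

    name: str
    kind: str  # hypothetical labels: "agent", "llm", or "tool"
    inputs: Dict[str, Any] = field(default_factory=dict)
    output: Optional[str] = None
    latency_ms: float = 0.0
    tokens: int = 0
    error: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        """Sum token usage over this span and all nested spans."""
        return self.tokens + sum(c.total_tokens() for c in self.children)

    def failed_steps(self) -> List[str]:
        """Return the names of every span in the tree that recorded an error."""
        failed = [self.name] if self.error else []
        for child in self.children:
            failed.extend(child.failed_steps())
        return failed


# Made-up sample run: two LLM steps and two tool calls, one of which fails.
run = Span(
    name="support_agent",
    kind="agent",
    inputs={"input": "What is the status of order ORD-002?"},
    latency_ms=1840.0,
    children=[
        Span(name="plan", kind="llm", tokens=212, latency_ms=650.0),
        Span(
            name="get_order_status",
            kind="tool",
            inputs={"order_id": "ORD-002"},
            output="Processing",
            latency_ms=12.0,
        ),
        Span(name="search_docs", kind="tool", error="timeout", latency_ms=900.0),
        Span(name="answer", kind="llm", tokens=148, latency_ms=270.0),
    ],
)

print(run.total_tokens())  # 360
print(run.failed_steps())  # ['search_docs']
```

With a tree like this, a question such as "which step failed, and how many tokens did the run burn?" becomes a traversal rather than guesswork.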

Setting Up the Test Agent

We will use a simple LangChain agent to compare the three tools. The agent receives a question from the user and calls one or more tools to retrieve relevant context before answering.

First, install the required libraries, then create the test agent.

Dependencies list

Let’s look at the base agent and its two tools (search_docs and get_order_status). This agent is the common baseline for all three observability demos.

"""
Base agent used across all three observability demos.

Swap the OPENAI_API_KEY env var or call build_agent() from any demo file.
"""

import os

from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain.tools import tool
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

load_dotenv()


@tool
def search_docs(query: str) -> str:
    """Search internal docs for relevant information."""
    # Simulated retrieval — swap with your actual vector store
    docs = {
        "refund": (
            "Refunds are processed within 5-7 business days. "
            "Items must be returned within 30 days."
        ),
        "shipping": (
            "Standard shipping takes 3-5 business days. "
            "Express is 1-2 days."
        ),
        "account": (
            "You can reset your password via the login page. "
            "Contact support for account issues."
        ),
    }

    for keyword, content in docs.items():
        if keyword in query.lower():
            return content

    return f"Found general docs related to: {query}"


@tool
def get_order_status(order_id: str) -> str:
    """Look up the status of an order by ID."""
    # Simulated order lookup
    statuses = {
        "ORD-001": "Shipped — expected delivery 2026-05-30",
        "ORD-002": "Processing — not yet shipped",
        "ORD-003": "Delivered on 2026-05-25",
    }

    return statuses.get(
        order_id,
        f"Order {order_id} not found in the system.",
    )


def build_agent() -> AgentExecutor:
    llm = ChatOpenAI(
        model="gpt-4o",
        temperature=0,
        api_key=os.environ["OPENAI_API_KEY"],
    )

    tools = [search_docs, get_order_status]

    prompt = ChatPromptTemplate.from_messages(
        [
            (
                "system",
                "You are a helpful customer support assistant. "
                "Use tools when needed.",
            ),
            ("user", "{input}"),
            MessagesPlaceholder(variable_name="agent_scratchpad"),
        ]
    )

    agent = create_openai_tools_agent(llm, tools, prompt)

    return AgentExecutor(
        agent=agent,
        tools=tools,
        verbose=False,
    )


TEST_QUESTIONS = [
    "What are the refund policies?",
    "What is the status of order ORD-002?",
    "How long does shipping take?",
]


if __name__ == "__main__":
    executor = build_agent()

    for question in TEST_QUESTIONS:
        print(f"\nQ: {question}")

        result = executor.invoke({"input": question})

        print(f"A: {result['output']}")

This gives us a baseline agent that each observability tool can trace. We will start with LangSmith.

LangSmith: Native LangChain Tracing

LangSmith is built by the LangChain team. If you are already using LangChain, integration is quick and easy.

"""
LangSmith observability demo.

Setup:

pip install langsmith

Set LANGCHAIN_API_KEY in your .env file.

How it works:

LangSmith hooks into LangChain's callback system via env vars, so no code
changes are needed beyond the two os.environ lines below.
"""

import os

from dotenv import load_dotenv

from agent_base import TEST_QUESTIONS, build_agent

load_dotenv()

# Enable LangSmith tracing. These two vars are all you need.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "agent-observability-demo"

# LANGCHAIN_API_KEY must be set in your .env or environment.


def run_with_metadata(
    executor,
    question: str,
    user_id: str = "demo-user",
):
    """Run the agent and attach per-run metadata via config."""
    return executor.invoke(
        {"input": question},
        config={
            "metadata": {
                "user_id": user_id,
                "source": "langsmith_demo",
            },
            # Optional: tag runs for filtering in the dashboard.
            "tags": ["observability-blog", "demo"],
        },
    )


def main():
    print("=== LangSmith Demo ===")
    print("Traces will appear at: https://smith.langchain.com")
    print(f"Project: {os.environ['LANGCHAIN_PROJECT']}\n")

    executor = build_agent()

    for question in TEST_QUESTIONS:
        print(f"Q: {question}")

        result = run_with_metadata(executor, question)

        print(f"A: {result['output']}\n")

    print("Done. Open LangSmith to inspect the full trace tree for each run.")


if __name__ == "__main__":
    main()

LangSmith hooks into LangChain’s callback system automatically. No decorators or wrappers are needed, and each run appears in your project dashboard.
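For readers who have not worked with callback-based tracing, the rough idea can be sketched without any library. The `RecordingHandler` class and `traced_call` helper below are hypothetical stand-ins written for this post, not LangChain's or LangSmith's actual API; they only illustrate how a handler gets notified before and after each step.

```python
import time
from typing import Callable, Dict, List


class RecordingHandler:
    """Hypothetical tracing handler that records start and end events."""

    def __init__(self) -> None:
        self.events: List[Dict[str, object]] = []

    def on_start(self, name: str, payload: str) -> None:
        self.events.append({"event": "start", "name": name, "payload": payload})

    def on_end(self, name: str, output: str, elapsed_ms: float) -> None:
        self.events.append(
            {"event": "end", "name": name, "output": output, "elapsed_ms": elapsed_ms}
        )


def traced_call(
    name: str,
    fn: Callable[[str], str],
    arg: str,
    handlers: List[RecordingHandler],
) -> str:
    """Run fn(arg), notifying every handler before and after the call."""
    for handler in handlers:
        handler.on_start(name, arg)
    started = time.perf_counter()
    output = fn(arg)
    elapsed_ms = (time.perf_counter() - started) * 1000
    for handler in handlers:
        handler.on_end(name, output, elapsed_ms)
    return output


def lookup_order(order_id: str) -> str:
    # Simulated tool, loosely modeled on get_order_status from the base agent.
    return {"ORD-002": "Processing"}.get(order_id, "not found")


handler = RecordingHandler()
answer = traced_call("get_order_status", lookup_order, "ORD-002", [handler])

print(answer)                                # Processing
print([e["event"] for e in handler.events])  # ['start', 'end']
```

An observability SDK does essentially this for every LLM and tool step, then ships the recorded events to its backend instead of keeping them in a list.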

What you’ll see on the dashboard:

LangSmith’s trace view shows the full agent execution tree, from the initial call to tool use, LLM responses, and final output. Each node includes inputs, outputs, and latency.

You can tag runs, add metadata, filter by outcome, save runs as datasets, and run evaluations. This is useful when improving prompts or retrieval logic.

The prompt playground is another strong feature. You can open any trace, edit the prompt inline, and rerun it to debug poor LLM performance.

LangSmith’s limitations appear at scale. The free tier has caps, and integration takes more effort if you are not using LangChain, though OpenTelemetry is supported.

Langfuse: Open Source and Framework-Agnostic

Langfuse is the open-source alternative here. You can self-host it or use the managed cloud service. It integrates with many frameworks, including LangChain, LlamaIndex, and the raw OpenAI API.

# Read this docstring for dependency installation and setup.
"""
Langfuse observability demo.

Setup:

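pip install 'langfuse<3'

(This demo uses the v2 SDK interface, e.g. langfuse.callback.CallbackHandler;
later major versions changed these imports.)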

Set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY in your .env file.

LANGFUSE_HOST defaults to https://cloud.langfuse.com; override for self-hosted.

Key differences from LangSmith:

- Callback handler is passed per-invoke for more explicit control.
- Native session grouping for multi-turn conversations.
- You can score any trace after the fact via the Langfuse client.
"""

import os

from dotenv import load_dotenv
from langfuse import Langfuse
from langfuse.callback import CallbackHandler

from agent_base import TEST_QUESTIONS, build_agent

load_dotenv()


def build_handler(
    session_id: str,
    user_id: str = "demo-user",
) -> CallbackHandler:
    return CallbackHandler(
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
        secret_key=os.environ["LANGFUSE_SECRET_KEY"],
        host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"),
        session_id=session_id,
        user_id=user_id,
        metadata={"source": "langfuse_demo"},
        tags=["observability-blog", "demo"],
    )


def score_trace(
    trace_id: str,
    score: float,
    comment: str = "",
):
    """Add a correctness score to a trace after reviewing the output."""
    lf = Langfuse(
        public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
        secret_key=os.environ["LANGFUSE_SECRET_KEY"],
        host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"),
    )

    lf.score(
        trace_id=trace_id,
        name="correctness",
        value=score,
        comment=comment,
    )

    lf.flush()

    print(f"Scored trace {trace_id}: {score}")


def run_single_session(
    executor,
    session_id: str,
):
    """Run all test questions in a single session so they're linked in the UI."""
    handler = build_handler(session_id=session_id)
    trace_ids = []

    for question in TEST_QUESTIONS:
        print(f"Q: {question}")

        result = executor.invoke(
            {"input": question},
            config={"callbacks": [handler]},
        )

        print(f"A: {result['output']}\n")

        # handler.get_trace_id() returns the trace ID for the last run.
        trace_ids.append(handler.get_trace_id())

    # Flush ensures traces are sent before the process exits.
    # This is critical in batch jobs.
    handler.flush()

    return trace_ids


def main():
    print("=== Langfuse Demo ===")
    print(f"Dashboard: {os.getenv('LANGFUSE_HOST', 'https://cloud.langfuse.com')}\n")

    executor = build_agent()
    session_id = "demo-session-001"

    trace_ids = run_single_session(executor, session_id)

    # Example: programmatically score the first trace.
    if trace_ids and trace_ids[0]:
        print("\nScoring first trace as an example:")
        score_trace(trace_ids[0], score=0.9, comment="Answer was accurate")

    print(f"\nDone. Find all runs under session '{session_id}' in your Langfuse dashboard.")


if __name__ == "__main__":
    main()

You pass the callback handler on every run. That is slightly more explicit than LangSmith’s approach, but more flexible: you can assign user IDs, session IDs, and custom metadata per invocation.

Evaluation Workflow

Langfuse also has a solid evaluation workflow: you can add scores to a trace after it has completed.

from langfuse import Langfuse

lf = Langfuse()

# Score a specific trace by ID.
lf.score(
    trace_id="trace-abc123",
    name="correctness",
    value=0.9,
    comment="Answer was accurate but slightly verbose",
)

Programmatic scores sit alongside the scores your team assigns during human review, so you can track aggregated evaluation metrics over time.
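As a rough illustration of what "aggregated over time" means, the sketch below averages scores per day in plain Python. The records are made-up sample data and the `daily_average` helper is hypothetical; in practice the Langfuse dashboard does this aggregation for you.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List

# Made-up score records, shaped loosely like the fields passed to lf.score().
scores: List[Dict[str, object]] = [
    {"trace_id": "t1", "name": "correctness", "value": 0.9, "day": "2026-05-28"},
    {"trace_id": "t2", "name": "correctness", "value": 0.5, "day": "2026-05-28"},
    {"trace_id": "t3", "name": "correctness", "value": 1.0, "day": "2026-05-29"},
    {"trace_id": "t3", "name": "helpfulness", "value": 0.8, "day": "2026-05-29"},
]


def daily_average(records: List[Dict[str, object]], score_name: str) -> Dict[str, float]:
    """Average one named score per day, ignoring scores with other names."""
    by_day = defaultdict(list)
    for record in records:
        if record["name"] == score_name:
            by_day[record["day"]].append(record["value"])
    return {day: round(mean(values), 3) for day, values in sorted(by_day.items())}


print(daily_average(scores, "correctness"))
# {'2026-05-28': 0.7, '2026-05-29': 1.0}
```

A falling daily average is an early signal that a prompt or retrieval change made answers worse.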

Langfuse also groups traces into sessions. All traces that share a session ID are linked in the UI, so you can follow an entire multi-turn conversation in one place.

Arize: Production-Grade ML Observability

Arize started as a platform for monitoring conventional machine learning models and now covers LLMs and agents as well. Its original focus on helping teams run models in production at scale remains intact.

Using OpenInference

Arize uses the OpenInference standard for its trace semantics and OpenTelemetry for instrumentation. Setup involves more boilerplate than the other two tools.

# Read this docstring for dependency installation and setup.
"""
Arize observability demo.

Setup:

pip install arize-otel openinference-instrumentation-langchain

Set ARIZE_SPACE_ID and ARIZE_API_KEY in your .env file.

Key differences from the others:

- Uses OpenTelemetry under the hood, so it integrates with existing OTel stacks.
- Instrumentation is global like LangSmith, not per-invoke like Langfuse.
- Best-in-class production monitoring: drift detection, cohort analysis, alerting.
- Phoenix, arize-phoenix, is the free local sibling for development use.
"""

import os

from arize.otel import register
from dotenv import load_dotenv
from openinference.instrumentation.langchain import LangChainInstrumentor

from agent_base import TEST_QUESTIONS, build_agent

load_dotenv()


def setup_arize_tracing():
    """Register Arize as the OTel tracer provider and instrument LangChain globally."""
    tracer_provider = register(
        space_id=os.environ["ARIZE_SPACE_ID"],
        api_key=os.environ["ARIZE_API_KEY"],
        project_name="agent-observability-demo",
    )

    LangChainInstrumentor().instrument(tracer_provider=tracer_provider)

    return tracer_provider


def run_with_attributes(
    executor,
    question: str,
    user_segment: str = "standard",
):
    """Run the agent and attach span attributes for cohort analysis in Arize."""
    from opentelemetry import trace

    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("agent_run") as span:
        span.set_attribute("user.segment", user_segment)
        span.set_attribute("query.text", question)
        span.set_attribute("demo.source", "arize_demo")

        result = executor.invoke({"input": question})

        span.set_attribute("response.text", result["output"])

        return result


def main():
    print("=== Arize Demo ===")
    print("Traces will appear at: https://app.arize.com")
    print("Project: agent-observability-demo\n")

    setup_arize_tracing()

    executor = build_agent()

    # Simulate two user segments to demonstrate cohort analysis in Arize.
    segments = ["premium", "standard", "standard"]

    for question, segment in zip(TEST_QUESTIONS, segments):
        print(f"Q: {question} [segment={segment}]")

        result = run_with_attributes(
            executor,
            question,
            user_segment=segment,
        )

        print(f"A: {result['output']}\n")

    print("Done. In Arize, use the cohort filter to compare premium vs standard responses.")
    print("Set up monitors on the Arize dashboard to alert on response quality drift.")


if __name__ == "__main__":
    main()

Instrumentation is global, as with LangSmith, but it is part of the broader OpenTelemetry ecosystem. Arize can therefore fit into your organization’s existing observability stack (e.g., Jaeger or Grafana), regardless of which agent framework you use.

Which Should You Pick for Agent Observability?

To be candid, there is no single right tool for every use case; the choice depends on where you are in the development cycle and what your team needs.

Feature	LangSmith	Langfuse	Arize
Setup complexity	Minimal (2 env vars)	Low (callback handler)	Most boilerplate
Framework support	LangChain-native; others via OTel	Any framework	Any framework via OTel
Self-hosting	Limited	First-class (Docker Compose)	Phoenix only (local dev)
Trace visualization	Excellent tree view	Good, session-linked	Good, OTel-standard
Evaluation / scoring	Dataset + playground	Session-level human scores	Rubric-based evals
Production monitoring	Basic	Basic	Drift, alerting, cohorts
Multi-turn / sessions	Thread-level	Native session grouping	Trace-level only
Open source	Proprietary	Fully open source	Phoenix is OSS; platform isn’t
Free tier	Limited traces/month	Generous (self-host = unlimited)	Limited
Best for	LangChain dev & iteration	Data ownership + any framework	Production-scale monitoring

Use LangSmith if you are building with LangChain and want the fastest setup for prompt debugging and iteration.
Use Langfuse if you need self-hosting, stronger data ownership, multi-framework support, or session-level tracking for conversational agents.
Use Arize when your agent is moving into production and you need monitoring, drift detection, cohorts, and alerts.
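The three recommendations above can be written down as a small decision helper. This is only a sketch: the function name, the three boolean inputs, and especially the order in which the rules are checked are assumptions made for illustration, and real decisions will weigh more factors.

```python
def suggest_tool(
    uses_langchain: bool,
    needs_self_hosting: bool,
    in_production: bool,
) -> str:
    """Map a few coarse project traits to one of the three tools.

    The precedence (self-hosting, then production, then LangChain) is an
    assumption of this sketch, not a rule stated by any vendor.
    """
    if needs_self_hosting:
        return "Langfuse"
    if in_production:
        return "Arize"
    if uses_langchain:
        return "LangSmith"
    # No LangChain, no hard constraints: fall back to the framework-agnostic option.
    return "Langfuse"


print(suggest_tool(uses_langchain=True, needs_self_hosting=False, in_production=False))
# LangSmith
print(suggest_tool(uses_langchain=True, needs_self_hosting=True, in_production=True))
# Langfuse
```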

Conclusion

Agent observability is one of those things you only regret skipping after something goes wrong in production. Tracing an agent run after the fact, without any instrumentation, is like debugging a distributed system with print statements.

All three tools covered here are production-ready. They each have a free path in. And they each take under 30 minutes to integrate with a LangChain agent. There’s no good reason to ship an unobservable agent anymore.

Pick the tool that fits your current stage. Add scoring early, even informally. And when your agent starts doing something weird at 2am, you’ll be glad you did.

Data Science Trainee at Analytics Vidhya
I am currently working as a Data Science Trainee at Analytics Vidhya, where I focus on building data-driven solutions and applying AI/ML techniques to solve real-world business problems. My work allows me to explore advanced analytics, machine learning, and AI applications that empower organizations to make smarter, evidence-based decisions.
With a strong foundation in computer science, software development, and data analytics, I am passionate about leveraging AI to create impactful, scalable solutions that bridge the gap between technology and business.
