Building an LLM prototype is quick. A few lines of Python, a prompt, and it works. But production is a different game altogether. You start seeing vague answers, hallucinations, latency spikes, and strange failures where the model clearly “knows” something but still gets it wrong. Since everything runs on probabilities, debugging becomes tricky. Why did a search for “boots” turn into “shoes”? The system made a choice, but you can’t easily trace the reasoning.
To tackle this, we’ll build FuseCommerce, an advanced e-commerce support system designed for visibility and control. Using Langfuse, we’ll create an agentic workflow with semantic search and intent classification, while keeping every decision transparent. In this article, we’ll turn a fragile prototype into an observable, production-ready LLM system.
Langfuse is an open-source platform for LLM engineering that lets teams collaboratively debug, analyse, and iterate on their LLM applications. Think of it as DevTools for AI agents.

The platform offers three main capabilities: tracing (observability), evaluations, and prompt management.
In this project, Langfuse serves as our main logging and tracing layer, and the feedback it captures helps us build a system that improves its own performance over time.
We will be developing a smart customer support representative for a technology retail business named “FuseCommerce.”
In contrast to a standard LLM wrapper, our agent will include intent classification, semantic search over product embeddings, session-level tracing, and user feedback scoring.

Langfuse is the backbone of this agent. It lets us follow each individual step (intent classification, retrieval, generation) and see how they fit together, so we can pinpoint exactly where something went wrong when an answer is incorrect.
We also tag every trace with a `session_id`, so the Langfuse dashboard can replay an entire user interaction for context. Let’s set up the agent.
First, install the dependencies: the Langfuse SDK, Google’s Generative AI SDK, and a few supporting libraries.
pip install langfuse streamlit google-generativeai python-dotenv numpy scikit-learn

After installing the libraries, create a .env file where your credentials are stored securely.
GOOGLE_API_KEY=your_gemini_key
LANGFUSE_PUBLIC_KEY=pk-lf-...
LANGFUSE_SECRET_KEY=sk-lf-...
LANGFUSE_HOST=https://cloud.langfuse.com
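With the credentials in place, wiring them up takes only a few lines. Here is a minimal sketch, assuming the Langfuse Python SDK v3 (where `get_client()` reads the environment variables automatically); the file name `config.py` is our own choice:
# config.py
import os

import google.generativeai as genai
from dotenv import load_dotenv
from langfuse import get_client

load_dotenv()  # reads the .env file created above

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# get_client() picks up LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY,
# and LANGFUSE_HOST from the environment.
lf_client = get_client()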
A traditional keyword search breaks down when a user phrases things differently, e.g., with synonyms. So we’ll use Vector Embeddings to build a semantic search engine.
We will create a “meaning vector” for each of our products and compare them using pure math: Cosine Similarity.
# db.py
import google.generativeai as genai
from sklearn.metrics.pairwise import cosine_similarity

def semantic_search(query):
    # Create a vector representation of the query
    query_embedding = genai.embed_content(
        model="models/text-embedding-004",
        content=query,
    )["embedding"]
    # Using math, find the product meanings nearest to the query
    similarities = cosine_similarity([query_embedding], product_vectors)
    return get_top_matches(similarities)
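The function above assumes `product_vectors` and `get_top_matches` already exist. Here is a minimal sketch of how they might be built; the three-item catalog is made up for illustration and stands in for the real product database:
# db.py (continued)
import numpy as np

PRODUCTS = [  # illustrative catalog, not real data
    {"name": "Mechanical Gaming Keyboard", "description": "RGB, hot-swappable switches"},
    {"name": "Wireless Gaming Headset", "description": "Low-latency 2.4 GHz audio"},
    {"name": "4K Streaming Webcam", "description": "Autofocus, dual microphones"},
]

def embed(text):
    return genai.embed_content(
        model="models/text-embedding-004",
        content=text,
    )["embedding"]

# Embed every product once at startup so each query costs a single API call
product_vectors = np.array(
    [embed(f"{p['name']}. {p['description']}") for p in PRODUCTS]
)

def get_top_matches(similarities, k=3):
    # similarities has shape (1, n_products); rank by score, best first
    ranked = np.argsort(similarities[0])[::-1][:k]
    return [PRODUCTS[i] for i in ranked]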
When a user simply says “Hello,” there is no need to search the database. An intent classifier routes each request first, so greetings skip the retrieval step.
Notice that the `@langfuse.observe` decorator automatically captures the input, output, and latency of this call. Like magic!
# intent.py
import langfuse
import google.generativeai as genai

VALID_INTENTS = {"PRODUCT_SEARCH", "ORDER_STATUS", "GENERAL_CHAT"}

@langfuse.observe(as_type="generation")
def classify_user_intent(user_input):
    prompt = f"""
    Classify the user's intent into exactly one of three categories:
    1. PRODUCT_SEARCH
    2. ORDER_STATUS
    3. GENERAL_CHAT
    Reply with the category name only.
    Input: {user_input}
    """
    # gemini-1.5-flash is our illustrative model choice
    model = genai.GenerativeModel("gemini-1.5-flash")
    intent = model.generate_content(prompt).text.strip()
    # Guard against the model replying with extra prose
    return intent if intent in VALID_INTENTS else "GENERAL_CHAT"
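A quick sanity check (the inputs are made up, and the label ultimately depends on the model’s reply):
print(classify_user_intent("where is my order?"))          # expected: ORDER_STATUS
print(classify_user_intent("any keyboards for gaming?"))   # expected: PRODUCT_SEARCH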
Step 3: The Agent’s Workflow
Now we stitch the process together. The agent will Perceive (get the input), Think (classify it), and then Act (route it).
We use `lf_client.update_current_trace` to tag the conversation with metadata such as the `session_id`.
@langfuse.observe()  # Root trace for the whole conversation turn
def handle_customer_user_input(user_input, session_id):
    # Tag the trace with the session so the dashboard can replay it
    lf_client.update_current_trace(session_id=session_id)

    # Think: classify what the user wants
    intent = classify_user_intent(user_input)

    # Act: route based on the classified intent
    if intent == "PRODUCT_SEARCH":
        context = semantic_search(user_input)
    elif intent == "ORDER_STATUS":
        context = check_order_status(user_input)  # order lookup defined elsewhere
    else:
        context = None  # Fallback for GENERAL_CHAT or unknown intents

    # Respond using the gathered context
    response = generate_ai_response(context, intent)
    return response
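The workflow calls `generate_ai_response`, which isn’t shown above. A minimal sketch of what it could look like, assuming Gemini writes the final answer (the prompt and model name are our own choices):
@langfuse.observe(as_type="generation")
def generate_ai_response(context, intent):
    # Ground the reply in whatever the routing step retrieved
    prompt = f"""
    You are a friendly support agent for FuseCommerce, a technology retailer.
    Detected intent: {intent}
    Context (may be empty): {context}
    Answer the customer helpfully and concisely.
    """
    model = genai.GenerativeModel("gemini-1.5-flash")
    return model.generate_content(prompt).text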
Finally, we build a Streamlit user interface. The key addition: feedback buttons that send a score back to Langfuse, tied to the trace ID of that specific user conversation.
# app.py
col1, col2 = st.columns(2)
if col1.button("👍"):
    lf_client.score(trace_id=trace_id, name="user-satisfaction", value=1)
if col2.button("👎"):
    lf_client.score(trace_id=trace_id, name="user-satisfaction", value=0)
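One detail glossed over here is where `trace_id` comes from. In the Langfuse Python SDK v3, the client can report the active trace id while an `@langfuse.observe()`-decorated function is running, so the agent can hand it back to the UI. A sketch, where returning the id alongside the response is our own convention:
@langfuse.observe()
def handle_customer_user_input(user_input, session_id):
    lf_client.update_current_trace(session_id=session_id)
    intent = classify_user_intent(user_input)
    response = generate_ai_response(None, intent)  # full routing omitted here
    # get_current_trace_id() only returns a value while the trace is
    # active, so capture it before the function returns.
    return response, lf_client.get_current_trace_id()
In `app.py`, store the returned id in `st.session_state` so it survives Streamlit reruns before the feedback buttons read it.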
Let’s take a closer look at a user’s inquiry: “Do you sell any accessories for gaming systems?”


Langfuse will render a trace view that visualizes the nested hierarchy, roughly like this (the child spans mirror our function names):
TRACE: agent-conversation (1.5 seconds)
├── GENERATION: classify_user_intent → PRODUCT_SEARCH
├── SPAN: semantic_search → top matching products
└── GENERATION: generate_ai_response → final answer
Once the user clicks thumbs-up, Langfuse records a score of 1. Aggregated day by day, these scores give you an average daily satisfaction rate, and the Langfuse dashboard visualizes scores, latency, and cost across all traces.
With Langfuse, we transformed an opaque chatbot into a fully observable, operational system, and that visibility is what earns user trust.
Our agent “thinks” through Intent Classification, “understands” through Semantic Search, and “learns” through user feedback scores. This architecture is the foundation of modern AI systems operating in real-world environments.
Frequently Asked Questions
Q1. What does Langfuse provide for LLM applications?
A. Langfuse provides tracing, metrics, and evaluation tools to debug, monitor, and improve LLM agents in production.
Q2. How does the FuseCommerce agent handle different kinds of queries?
A. It uses intent classification to detect the query type, then routes to semantic search, order lookup, or general chat logic.
Q3. How does user feedback improve the system?
A. User feedback is logged per trace, enabling performance monitoring and iterative optimization of prompts, retrieval, and routing.