Build Reliable Agentic AI Systems: Evaluation, Observability, and Safety

About the Workshop

This intensive 8-hour workshop takes engineering and data science teams from agent fundamentals to a fully deployed, evaluated, and guarded agentic system — using only open-source tools. The day is structured around six modules that build sequentially: the foundation modules (M1–M2) cover architecture and memory, and the enterprise-grade modules (M3–M6) cover evaluation with DeepEval, safety enforcement with NeMo Guardrails and Llama Guard, observability with LangSmith, and production deployment on Hugging Face Spaces with a Streamlit UI. By the end of the day, every participant leaves with working code, configured tooling, and a deployed application.

Prerequisites

  • Solid Python skills — classes, decorators, basic async

  • Familiarity with LLM APIs (OpenAI or equivalent)

  • Hugging Face account created before the day — instructions sent in advance

  • LangSmith and DeepEval accounts pre-created (both free tier)

  • Basic understanding of what an agent is — deep knowledge not required

Workshop Modules

Module 1: Agent Fundamentals and Architecture

  •  What makes a system genuinely agentic vs LLM-with-tools
  •  The agent loop — Perceive → Think → Act → Observe — and where it breaks in production 
  •  Component model: LLM, tools, memory, state, router, orchestration layer 
  •  Demo: Tool calling with OpenAI function-calling API — bare metal, no frameworks 
  •  LangGraph graph model — nodes, edges, state, conditional routing 
  •  Hands-on: Build a ReAct agent, convert it to a stateful LangGraph graph 
  •  Preview: why evaluation, observability, and guardrails are non-negotiable at enterprise scale 
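The agent loop above can be sketched in a few lines of plain Python, no frameworks. This is a hedged illustration, not the workshop's reference code: `think` stands in for the LLM call, and the single `search` tool is hypothetical.

```python
# Minimal sketch of the Perceive -> Think -> Act -> Observe loop in plain
# Python. `think` stands in for an LLM call; the tool is a canned function.

def search_tool(query: str) -> str:
    return f"results for '{query}'"       # a real tool would call an API

TOOLS = {"search": search_tool}

def think(state: dict) -> dict:
    """Stand-in for the LLM: pick the next action from the current state."""
    if not state["observations"]:
        return {"action": "search", "input": state["task"]}
    return {"action": "finish", "input": state["observations"][-1]}

def run_agent(task: str, max_steps: int = 5) -> str:
    state = {"task": task, "observations": []}    # Perceive: initial state
    for _ in range(max_steps):
        decision = think(state)                   # Think
        if decision["action"] == "finish":
            return decision["input"]
        result = TOOLS[decision["action"]](decision["input"])   # Act
        state["observations"].append(result)      # Observe
    return "max steps reached"

print(run_agent("quarterly revenue"))  # -> results for 'quarterly revenue'
```

A ReAct agent replaces `think` with an LLM prompt that interleaves reasoning text with tool calls; LangGraph then lifts this implicit loop into explicit nodes, edges, and shared state.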

Module 2: Memory and the Model Context Protocol (MCP)

  •  Short-term vs long-term memory — threads, snapshots, vector stores
  •  Context window management — summarisation, trimming, token budgeting 
  •  Hands-on: Multi-turn financial analyst agent with persistent memory (LangGraph + FAISS) 
  •  MCP (Model Context Protocol) deep dive — when it adds value, when it's unnecessary overhead
  •  Hands-on: Connect agent to pre-built MCP servers (search, weather) 
  •  Hands-on: Build a custom MCP server and client from scratch 
  •  Adaptive memory: agents that update their own knowledge base across sessions 
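One of the context-window techniques listed above, trimming to a token budget, can be sketched as follows. The 4-characters-per-token estimate is a crude stand-in for a real tokenizer, and the message shape is illustrative.

```python
# Illustrative token-budgeting sketch for context-window management: keep the
# newest messages that fit a budget, always preserving the system message.

def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)          # rough heuristic, not a tokenizer

def trim_history(messages: list[dict], budget: int) -> list[dict]:
    """Assumes messages[0] is the system message; trims the rest, newest-first."""
    system, rest = messages[0], messages[1:]
    kept: list[dict] = []
    used = approx_tokens(system["content"])
    for msg in reversed(rest):             # walk from newest to oldest
        cost = approx_tokens(msg["content"])
        if used + cost > budget:
            break                          # budget exhausted; drop older turns
        kept.append(msg)
        used += cost
    return [system] + list(reversed(kept))  # restore chronological order
```

Summarisation takes the complementary approach: instead of discarding the dropped turns, it replaces them with a single compressed summary message.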

Module 3: Evaluation with DeepEval

  •  Why LLM evaluation is different from traditional ML metrics — and why it is harder
  •  The evaluation taxonomy: faithfulness, answer relevancy, contextual precision, hallucination, toxicity 
  •  Hands-on: Set up DeepEval test suite — write your first evaluation cases from scratch 
  •  Hands-on: G-Eval with custom criteria — define what 'correct' means for your specific agent 
  •  Hands-on: Evaluate a RAG agent — retrieval quality, faithfulness, and groundedness metrics 
  •  Regression testing: pin evaluation scores and automatically fail CI/CD when they drop 
  •  Hands-on: Integrate DeepEval into a GitHub Actions pipeline — evaluation on every commit 
  •  Comparative evaluation: score multiple agent configurations against the same golden test set 
  •  Dataset management: building and versioning golden test sets that grow with the system 
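The score-pinning idea behind regression testing can be sketched without any framework. Metric names, scores, and the tolerance here are illustrative, not DeepEval output.

```python
# Sketch of score pinning for CI: compare a fresh evaluation run against a
# pinned baseline and fail the build when any metric regresses beyond a
# tolerance.

BASELINE = {"faithfulness": 0.90, "answer_relevancy": 0.85}

def check_regression(current: dict, baseline: dict,
                     tolerance: float = 0.02) -> list[str]:
    """Return human-readable failures; an empty list means the build passes."""
    failures = []
    for metric, pinned in baseline.items():
        score = current.get(metric, 0.0)
        if score < pinned - tolerance:
            failures.append(f"{metric}: {score:.2f} < pinned {pinned:.2f}")
    return failures

# In a CI step you would sys.exit(1) when this list is non-empty.
failures = check_regression({"faithfulness": 0.91, "answer_relevancy": 0.80},
                            BASELINE)
print(failures)   # the 0.05 relevancy drop exceeds the 0.02 tolerance
```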

Module 4: Guardrails and Safety

  •  The guardrail taxonomy: input validation, output validation, semantic filters, policy enforcement
  •  NVIDIA NeMo Guardrails — architecture overview, Colang policy language, rail types 
  •  Hands-on: Write a Colang policy — block off-topic queries, enforce escalation paths, define allowed topics 
  •  Hands-on: Apply NeMo input and output rails to a live LangGraph agent 
  •  Meta Llama Guard — how it works as a content safety classifier (open source, self-hosted) 
  •  Hands-on: Integrate Llama Guard as a pre-call and post-call validator in the agent pipeline 
  •  Combining NeMo + Llama Guard: layered defence — policy layer + ML safety classifier 
  •  PII detection and redaction using open-source tools — scrub sensitive data before it reaches the LLM 
  •  Red teaming your own agent — prompt injection patterns, jailbreak attempts, adversarial inputs 
  •  Hands-on: Run PromptBench adversarial suite against the guarded agent — measure what gets through 
  •  Guardrail performance tradeoffs — latency cost, false positive rates, graceful degradation strategies 
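A toy sketch of the regex-based PII redaction step mentioned above. Real pipelines typically use a dedicated open-source tool such as Microsoft Presidio; these two patterns are deliberately simple and miss many real-world formats.

```python
import re

# Toy PII scrubber: substitute placeholder labels for emails and US-style
# phone numbers before the text reaches the LLM.

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(redact("Reach me at jane@example.com or 555-867-5309"))
# -> Reach me at [EMAIL] or [PHONE]
```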

Module 5: Observability with LangSmith

  •  What observability means for agents — beyond logs, beyond latency
  •  The three observability layers: tracing, evaluation runs, feedback loops 
  •  Hands-on: LangSmith full setup — trace a multi-step agent, inspect every tool call and LLM call 
  •  Hands-on: Attach DeepEval scores to LangSmith traces — close the evaluation-observability loop 
  •  Hands-on: LangSmith evaluation runs — build a feedback loop from production traces 
  •  Token cost tracking — break down spend by agent component, tool, and session 
  •  Latency analysis — identify the slowest nodes in your agent graph 
  •  Hands-on: Instrument a guarded agent with structured logging — what to capture and why 
  •  Session replay: reproduce any agent failure from trace data — the production debugging workflow 
  •  Human-in-the-loop review queues: route low-confidence outputs to human review based on trace signals 
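Per-component cost and latency tracking can be sketched with a decorator. In practice these numbers come from traces (e.g. LangSmith) rather than hand instrumentation; the component name and token counts here are illustrative.

```python
import functools
import time
from collections import defaultdict

# Sketch of per-component latency and token tracking: a decorator records
# call count, wall-clock time, and a caller-reported token cost per component.

STATS = defaultdict(lambda: {"calls": 0, "seconds": 0.0, "tokens": 0})

def tracked(component: str, tokens_per_call: int = 0):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:                      # record even when the call raises
                entry = STATS[component]
                entry["calls"] += 1
                entry["seconds"] += time.perf_counter() - start
                entry["tokens"] += tokens_per_call
        return wrapper
    return decorator

@tracked("retriever", tokens_per_call=150)
def retrieve(query: str) -> list[str]:
    return ["doc-1", "doc-2"]             # stand-in for a vector-store lookup

retrieve("revenue")
retrieve("costs")
# STATS["retriever"] now shows 2 calls and 300 tokens
```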

Module 6: Deployment with Streamlit and Hugging Face Spaces

  •  Streamlit fundamentals for agent UIs — chat interface, tool call display, trace viewer
  •  Hands-on: Wrap the full LangGraph agent (with DeepEval + NeMo + LangSmith) in a Streamlit app 
  •  UI design for agentic systems — showing reasoning steps, tool calls, and guardrail decisions to users 
  •  Hugging Face Spaces — how it works, free tier limits, public vs private spaces 
  •  Hands-on: Push the Streamlit app to a Hugging Face Space — live deployment end-to-end 
  •  Wiring LangSmith observability into the deployed Hugging Face Space 
  •  Environment secrets management on Hugging Face Spaces — API keys, tokens, no .env committed 
  •  Hands-on: Expose an API endpoint alongside the Streamlit UI using FastAPI + uvicorn inside Spaces 
  •  Updating a deployed Space — push new guardrail policies, evaluation thresholds, model changes 
  •  Discussion: production failure modes, responsible AI governance, what to build next
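The Space configuration referenced above lives in the repo's README.md front matter; the `sdk` and `app_file` fields tell Spaces which runtime to build and where the entry point is. A minimal sketch for a Streamlit Space, with illustrative title and version values:

```yaml
---
title: Guarded Analyst Agent   # illustrative name
emoji: 🛡️
colorFrom: blue
colorTo: green
sdk: streamlit                 # selects the Streamlit runtime
sdk_version: "1.35.0"          # pinned Streamlit version; illustrative
app_file: app.py               # entry point inside the repo
pinned: false
---
```

API keys and tokens go into the Space's settings under variables and secrets, never into this file or any committed `.env`.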