Building Real-World LLM Agents: Evaluation, Optimization & Monitoring
23 August 2025 | 09:30 AM - 05:30 PM
About the workshop
A full-day, hands-on workshop designed to equip participants with the end-to-end knowledge and practical skills needed to build, evaluate, monitor, and improve agentic AI systems. Through seven comprehensive modules, participants will explore LLM fundamentals, prompt optimization, observability, evaluation, and system productionization, grounded in real-world use cases. You will spend the entire day focusing on the following key areas:
- Understand the architecture, components, and unique challenges of LLM agents versus traditional applications.
- Learn how to implement observability and tracing to monitor and debug complex LLM systems effectively.
- Explore robust evaluation techniques, including human feedback, LLM-based scoring, and agent-specific metrics.
- Master advanced prompt engineering, tool use strategies, and optimization techniques for building capable agents.
- Apply concepts in a real-world capstone project with hands-on experience in deploying, monitoring, and improving agentic AI systems.
In this workshop, participants will work with various tools for building and optimizing LLM applications and agents. These include:
- Tracing and observability platforms for prompt tracking, token usage, and latency monitoring
- Evaluation frameworks using A/B testing, LLM-as-judge methods, and benchmark datasets
- Prompt engineering utilities for few-shot learning and chain-of-thought techniques
- Vector databases and RAG tools for context retrieval
- Caching and model optimization tools for performance tuning
- Agent architecture components like memory and planning modules
- Production monitoring dashboards for KPIs and degradation tracking
- A/B testing infrastructure for continuous improvement
Prerequisites:
- A solid understanding of Python and GenAI applications
- Familiarity with Google Colab or local Python development environments
*Note: These details are tentative and subject to change.

Modules
- Defining LLM applications vs. agents
- Key components of LLM systems
- Current landscape of LLM agent architectures
- The evaluation challenge: why traditional ML metrics fall short
- Common failure modes in production LLM systems
- Importance of observability in LLM systems
- Core observability components:
  - Request/response logging
  - Prompt tracking
  - Token usage monitoring
  - Latency measurement
- Tracing techniques for complex LLM chains
- LLM-specific observability tools and platforms
- Hands-on exercise: Setting up basic tracing in an LLM application (a starter tracing sketch follows below)
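
Basic tracing needs no dedicated platform to get started. Below is a minimal, illustrative sketch (the names `trace_llm_call`, `call_model`, and the whitespace token count are placeholders, not any particular vendor's API) that records the prompt, response, a rough token count, and latency for each call; in practice you would ship these records to an observability backend rather than keep them in memory.

```python
import time
import uuid
from dataclasses import dataclass, asdict

# In-memory trace store; a real system would send these records to an
# observability backend instead of keeping them in a Python list.
TRACES: list[dict] = []

@dataclass
class TraceRecord:
    trace_id: str
    prompt: str
    response: str
    prompt_tokens: int
    response_tokens: int
    latency_ms: float

def rough_token_count(text: str) -> int:
    # Crude whitespace approximation; swap in a real tokenizer if available.
    return len(text.split())

def trace_llm_call(call_model, prompt: str) -> str:
    """Wrap any `call_model(prompt) -> str` function with basic tracing."""
    start = time.perf_counter()
    response = call_model(prompt)
    latency_ms = (time.perf_counter() - start) * 1000
    TRACES.append(asdict(TraceRecord(
        trace_id=str(uuid.uuid4()),
        prompt=prompt,
        response=response,
        prompt_tokens=rough_token_count(prompt),
        response_tokens=rough_token_count(response),
        latency_ms=latency_ms,
    )))
    return response

if __name__ == "__main__":
    fake_model = lambda p: f"Echo: {p}"  # stub so the sketch runs end to end
    trace_llm_call(fake_model, "Summarize today's agenda in one line.")
    print(TRACES[-1]["prompt_tokens"], TRACES[-1]["latency_ms"])
```
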
- Holistic evaluation framework for LLM applications
- Human evaluation approaches:
  - A/B testing methodologies
  - Structured feedback collection
- Automated evaluation techniques:
  - LLM-as-judge evaluation
  - Code-based evaluation
  - Benchmark datasets for specific capabilities
  - Reference-based evaluation
- Agent-specific evaluation considerations:
  - Planning capability assessment
  - Tool use effectiveness
  - Multi-step reasoning evaluation
- Real-world workshop exercise: Designing evaluation protocols for participant use cases (see the LLM-as-judge sketch below)
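
To make the automated techniques above concrete, here is one common shape of an LLM-as-judge evaluator: a rubric prompt plus a small parser for a 1-5 score. The `judge_model` callable is a stand-in for whatever client you bring to the workshop, and the rubric wording and score scale are illustrative assumptions to iterate on, not a prescribed standard.

```python
import re

JUDGE_RUBRIC = """You are grading an assistant's answer.
Question: {question}
Answer: {answer}

Score the answer from 1 (poor) to 5 (excellent) for correctness and helpfulness.
Reply with a single line in the form: SCORE: <number>
"""

def llm_as_judge(judge_model, question: str, answer: str) -> int | None:
    """Ask a judge model to grade an answer; return the parsed score, or None."""
    verdict = judge_model(JUDGE_RUBRIC.format(question=question, answer=answer))
    match = re.search(r"SCORE:\s*([1-5])", verdict)
    return int(match.group(1)) if match else None

if __name__ == "__main__":
    fake_judge = lambda p: "SCORE: 4"  # stub judge; replace with a real model call
    print(llm_as_judge(fake_judge, "What is RAG?", "Retrieval-augmented generation."))
```
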
- Advanced prompt engineering techniques
- Systematic prompt iteration methodology
- Context window optimization
- Few-shot learning strategies
- Chain-of-thought and reasoning techniques
- Agent-specific prompting:
  - Task decomposition
  - Tool use instruction
  - Reflection and self-correction
- Workshop activity: Collaborative prompt optimization for common agent tasks (see the prompt-assembly sketch below)
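
Few-shot and chain-of-thought prompting largely come down to disciplined prompt assembly. The helper below (`build_prompt` and the Q/A layout are hypothetical choices, not a prescribed format) stitches a system instruction, worked examples, and an optional "think step by step" cue into a single prompt string, which is the kind of artifact the activity above iterates on.

```python
from dataclasses import dataclass

@dataclass
class Example:
    question: str
    answer: str

def build_prompt(system: str, examples: list[Example], question: str,
                 chain_of_thought: bool = True) -> str:
    """Assemble a few-shot prompt, optionally ending with a chain-of-thought cue."""
    parts = [system.strip(), ""]
    for ex in examples:  # worked examples teach the format and reasoning style
        parts += [f"Q: {ex.question}", f"A: {ex.answer}", ""]
    parts.append(f"Q: {question}")
    parts.append("A: Let's think step by step." if chain_of_thought else "A:")
    return "\n".join(parts)

if __name__ == "__main__":
    demos = [Example("2 + 2?", "4"), Example("Capital of France?", "Paris")]
    print(build_prompt("Answer concisely.", demos, "Largest planet in our solar system?"))
```
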
- Retrieval-augmented generation optimization
- Vector database tuning for context retrieval
- Caching strategies for LLM applications (see the caching sketch below)
- Model selection and quantization trade-offs
- Architectural optimizations for agents:
  - Memory mechanisms
  - Planning frameworks
  - Tool integration patterns
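
Of the optimization levers above, caching is the quickest to prototype. The sketch below is a minimal exact-match cache keyed on a hash of the model name and prompt (all names are illustrative); it only saves repeated identical calls, and production setups usually add semantic matching, TTLs, and a shared store such as Redis.

```python
import hashlib

# Simple in-process cache; production systems typically use an external store
# with an eviction/TTL policy.
_CACHE: dict[str, str] = {}

def _cache_key(model_name: str, prompt: str) -> str:
    return hashlib.sha256(f"{model_name}::{prompt}".encode()).hexdigest()

def cached_llm_call(call_model, model_name: str, prompt: str) -> str:
    """Return a cached response for identical (model, prompt) pairs."""
    key = _cache_key(model_name, prompt)
    if key not in _CACHE:
        _CACHE[key] = call_model(prompt)  # cache miss: pay for one real call
    return _CACHE[key]

if __name__ == "__main__":
    calls = {"n": 0}
    def fake_model(p):
        calls["n"] += 1
        return f"Echo: {p}"
    cached_llm_call(fake_model, "demo-model", "Define observability.")
    cached_llm_call(fake_model, "demo-model", "Define observability.")
    print("model invocations:", calls["n"])  # 1: the second call was a cache hit
```
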
- Comprehensive monitoring dashboard development
- Key performance indicators for LLM applications
- Detecting and responding to performance degradation
- Feedback loops for continuous improvement
- A/B testing infrastructure for LLM applications
- Agent-specific monitoring considerations:
  - Tool usage patterns
  - Task completion rates
  - Autonomous recovery mechanisms
- Workshop exercise: Create a production HUD-style dashboard (a KPI and degradation-detection sketch follows below)
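
A monitoring dashboard ultimately sits on a handful of KPIs. The sketch below computes two of the indicators named above, task completion rate and p95 latency, from trace-like records and flags drift against a fixed baseline; the record shape, baseline, and 20% tolerance are illustrative assumptions, not recommended values.

```python
from statistics import quantiles

def p95(values: list[float]) -> float:
    # statistics.quantiles with n=20 returns 19 cut points; index 18 is the 95th percentile.
    return quantiles(values, n=20)[18] if len(values) >= 2 else values[0]

def compute_kpis(records: list[dict]) -> dict:
    """records: [{'latency_ms': float, 'completed': bool}, ...] (assumed shape)."""
    latencies = [r["latency_ms"] for r in records]
    completed = sum(1 for r in records if r["completed"])
    return {
        "task_completion_rate": completed / len(records),
        "p95_latency_ms": p95(latencies),
    }

def detect_degradation(kpis: dict, baseline: dict, tolerance: float = 0.2) -> list[str]:
    """Flag KPIs that drift more than `tolerance` from the baseline."""
    alerts = []
    if kpis["task_completion_rate"] < baseline["task_completion_rate"] * (1 - tolerance):
        alerts.append("task completion rate dropped")
    if kpis["p95_latency_ms"] > baseline["p95_latency_ms"] * (1 + tolerance):
        alerts.append("p95 latency regressed")
    return alerts

if __name__ == "__main__":
    window = [{"latency_ms": 900 + 10 * i, "completed": i % 5 != 0} for i in range(40)]
    kpis = compute_kpis(window)
    print(kpis, detect_degradation(kpis, {"task_completion_rate": 0.95, "p95_latency_ms": 800}))
```
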
- Putting it all together: The complete evaluation and improvement lifecycle
- Collaborative problem-solving session:
  - Small groups work on real-world LLM application/agent challenges
  - Apply workshop concepts to design evaluation and improvement plans
  - Group presentations and feedback
- Resources for continued learning and implementation
Certificate of Participation
Receive a digital (blockchain-enabled) and a physical certificate to showcase your accomplishment to the world.
- Earn your certificate
- Share your achievement
