Keeping Eyes on Your Agents

Hack Session

About the session

As agentic AI systems evolve from simple prompt-response pipelines into complex, multi-step, tool-using architectures, traditional evaluation approaches fall short. End-to-end benchmarks alone cannot explain why an agent fails, while component-level metrics often miss emergent behaviours across the system. This talk introduces a multi-layered evaluation and observability framework designed to make agentic systems measurable, debuggable, and production-ready.
 
We begin with end-to-end evaluation strategies, including the design of high-quality golden datasets tailored to agent workflows, enabling reliable measurement of real-world task success. We then zoom in on component-level evaluation, breaking agent pipelines down into planning, tool selection, memory usage, and reasoning to pinpoint failure modes with precision.
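
To make the component-level idea concrete, here is a minimal sketch of evaluating one component, tool selection, against a small golden set. `GoldenCase`, `GOLDEN_SET`, and `select_tool` are illustrative names rather than part of any framework; `select_tool` stands in for the agent's real routing step, so only that step is exercised.

```python
from dataclasses import dataclass

@dataclass
class GoldenCase:
    query: str
    expected_tool: str

# A tiny golden set: each case pairs a user query with the tool
# a correct agent would pick.
GOLDEN_SET = [
    GoldenCase("What's the weather in Berlin tomorrow?", "weather_api"),
    GoldenCase("Summarise the attached quarterly report.", "document_reader"),
    GoldenCase("Book a table for two at 7pm.", "reservations"),
]

def select_tool(query: str) -> str:
    """Stand-in for the agent's real tool-selection step; replace this
    with a call into your agent so only this component is tested."""
    q = query.lower()
    if "weather" in q:
        return "weather_api"
    if "report" in q or "summarise" in q:
        return "document_reader"
    return "reservations"

def evaluate_tool_selection(cases: list[GoldenCase]) -> float:
    """Score tool selection in isolation from planning, memory, and
    downstream execution, printing each mismatch for debugging."""
    correct = 0
    for case in cases:
        predicted = select_tool(case.query)
        if predicted == case.expected_tool:
            correct += 1
        else:
            print(f"FAIL: {case.query!r} -> expected "
                  f"{case.expected_tool}, got {predicted}")
    return correct / len(cases)

if __name__ == "__main__":
    print(f"tool-selection accuracy: {evaluate_tool_selection(GOLDEN_SET):.0%}")
```

The same pattern applies to the other components: a per-component golden set and a scorer that calls only that stage, so a drop in end-to-end task success can be traced to the stage that caused it.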
 
The session further explores observability patterns for modern AI systems, including tracing, structured logging, and instrumentation using platforms like LangFuse, as well as codeless observability via reverse proxy gateways for MCP-based services.
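
As an illustration of the instrumentation pattern, the sketch below uses the Langfuse Python SDK's `@observe` decorator, assuming `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY`, and `LANGFUSE_HOST` are set in the environment; in SDK v3 `observe` is imported from `langfuse`, while older v2 SDKs expose it via `langfuse.decorators`. The agent functions themselves are placeholders.

```python
from langfuse import observe  # v2 SDKs: from langfuse.decorators import observe

@observe()  # records this call as a span within the current trace
def plan(task: str) -> list[str]:
    # Placeholder planner: a real agent would call an LLM here.
    return [f"step 1 for {task}", f"step 2 for {task}"]

@observe()  # nested decorated calls appear as child spans
def run_tool(step: str) -> str:
    # Placeholder tool call: a real agent would invoke an external tool here.
    return f"result of {step}"

@observe()  # the outermost decorated function becomes the trace root
def run_agent(task: str) -> list[str]:
    return [run_tool(step) for step in plan(task)]

if __name__ == "__main__":
    print(run_agent("draft a project update"))
```

Each decorated call shows up as a span in the trace, so planning and tool use can be inspected step by step without restructuring the agent's code.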
 
By connecting evaluation with observability, this session provides a practical blueprint for moving from opaque, brittle agents to transparent, reliable, and continuously improving AI systems. Attendees will leave with concrete techniques, architectural patterns, and mental models to evaluate and operate agentic systems at scale.
