The Missing Piece of AI Apps: Evaluation

About

In this hack session, we will learn techniques for building, optimizing, and scaling LLM-as-a-judge evaluators with minimal human input. We will look at the biases inherent to these evaluators, how to mitigate them, and, most importantly, how to align them with human preferences.

This fast-paced, hands-on hack session shows practitioners how to turn “it works on my prompt” demos into production-ready AI systems they can trust. Drawing on material from LLM Apps: Evaluation (created with Google AI and Open Hands), the session walks you through the complete evaluation lifecycle:

  1. Why evaluation is different for LLM-powered software – and why copying traditional unit-testing patterns fails.
  2. Bootstrapping your first evaluation set from real or synthetic user data and annotating it with pass/fail signals that map directly to business goals.
  3. Programmatic & heuristic checks that you can drop straight into CI/CD or guardrails to catch regressions instantly.
  4. LLM-as-a-Judge evaluators – designing prompts, scoring rubrics and structured outputs so that one model grades another’s outputs (a minimal judge sketch follows this list).
  5. Alignment & bias analysis – measuring how well automated evaluators track human judgment (Cohen’s κ, Kendall τ) and mitigating position, verbosity and misinformation biases (see the alignment sketch after this list).
  6. Scaling to advanced capabilities – evaluating tool-use chains, image/video generation (Imagen, Veo) and fully agentic systems with Open Hands – all tracked and visualised in Weave dashboards.
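
To make step 4 concrete, here is a minimal LLM-as-a-judge sketch: a rubric embedded in the system prompt, deterministic decoding, and a structured JSON verdict. It assumes an OpenAI-compatible client; the rubric, model name and pass/fail criterion are illustrative, not the course notebooks.

```python
# Minimal LLM-as-a-judge sketch: rubric-driven prompt, structured JSON verdict.
# Assumes an OpenAI-compatible client; rubric and model name are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You are grading a customer-support answer.\n"
    "Return a JSON object with two keys:\n"
    '  "verdict": "pass" or "fail" -- does the answer resolve the question factually and politely?\n'
    '  "reason": one short sentence explaining the verdict.'
)

def judge(question: str, answer: str) -> dict:
    """Ask the judge model for a binary pass/fail verdict plus a short rationale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # any capable judge model will do
        temperature=0,                            # deterministic grading
        response_format={"type": "json_object"},  # force parseable, structured output
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(judge("How do I reset my password?",
            "Click 'Forgot password' on the sign-in page and follow the email link."))
```

Binary verdicts plus a one-sentence reason keep the judge’s output easy to aggregate and easy to audit.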
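
And for step 5, a sketch of validating the validator: Cohen’s κ (scikit-learn) and Kendall τ (SciPy) measure agreement between the judge’s verdicts and human ground truth. The label arrays below are made-up placeholders.

```python
# Validate the validator: compare automated verdicts with human labels.
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

human_labels = [1, 0, 1, 1, 0, 1, 0, 0]  # 1 = pass, 0 = fail (annotator ground truth)
judge_labels = [1, 0, 1, 0, 0, 1, 0, 1]  # verdicts from the automated judge

kappa = cohen_kappa_score(human_labels, judge_labels)  # chance-corrected agreement
tau, p_value = kendalltau(human_labels, judge_labels)  # rank correlation, useful for graded scores

print(f"Cohen's kappa = {kappa:.2f}, Kendall's tau = {tau:.2f} (p = {p_value:.2f})")
```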

Attendees leave with an evaluation playbook, starter notebooks, and an intuition for when to combine humans, rules and LLM judges to hit reliability targets without slowing iteration.

Key Takeaways:

  • Evaluation is a product feature, not an afterthought – you ship confidence, not just code.
  • Start binary, iterate later – pass/fail signals are easier to collect, act on and align than 5-point Likert scales.
  • Programmatic checks aren’t outdated – regexes, word limits and PII rules catch cheap bugs before you pay for a single LLM call (a sketch follows this list).
  • LLM-as-Judge scales domain expertise – with the right prompt + rubric, models grade hundreds of examples in minutes.
  • Always validate the validator – use alignment metrics (κ, τ) to track drift between automated scores and human ground truth.
  • Mind the biases – position, verbosity and misinformation-oversight biases can skew scores; controlled studies and prompt tweaks mitigate them.
  • Visualise everything – dashboards in Weave surface failure modes, token costs and latency at a glance (a minimal logging sketch follows this list).
  • Iterate, don’t stagnate – log user feedback in production and feed it back into new evaluation data and criteria.
  • Online vs. offline evaluation – offline suites gate releases before they ship, while online evaluation of live traffic catches the issues that only appear in production.
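
As a sketch of the programmatic-checks takeaway above, the snippet below turns a couple of regexes and a word limit into named pass/fail signals; the patterns and limit are illustrative, and real PII detection deserves a dedicated library.

```python
# Cheap heuristic guardrails that run before (or instead of) any LLM-judge call.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")      # crude email detector
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")           # crude US SSN detector
MAX_WORDS = 150                                          # illustrative response budget

def heuristic_checks(text: str) -> dict:
    """Return named pass/fail signals that can gate a CI/CD run or a guardrail."""
    return {
        "no_email_leak": EMAIL_RE.search(text) is None,
        "no_ssn_leak": SSN_RE.search(text) is None,
        "within_word_limit": len(text.split()) <= MAX_WORDS,
    }

assert all(heuristic_checks("Your order ships tomorrow.").values())
```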
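
And a minimal sketch of the Weave logging mentioned above, assuming the W&B Weave SDK (`pip install weave`): `weave.init` plus the `@weave.op` decorator trace each call to a dashboard. The project name and scorer are hypothetical.

```python
# Trace evaluator calls to a Weave dashboard (assumes the W&B Weave SDK).
import weave

weave.init("llm-eval-demo")  # hypothetical project name

@weave.op()
def word_limit_check(text: str) -> dict:
    """A trivial scorer; each call's inputs, outputs and latency are traced to the dashboard."""
    return {"within_limit": len(text.split()) <= 150}

word_limit_check("Short answers are usually safe.")
```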
