The Missing Piece of AI Apps: Evaluation

About

In this hack session, we will learn techniques for building, optimizing, and scaling LLM-as-a-judge evaluators with minimal human input. We will look at the biases inherent to these evaluators, how to mitigate them, and, most importantly, how to align them with human preferences.

This fast-paced, hands-on hack session shows practitioners how to turn “it works on my prompt” demos into production-ready AI systems they can trust. Drawing on material from LLM Apps: Evaluation (created with Google AI and Open Hands), the session walks you through the complete evaluation lifecycle:

  1. Why evaluation is different for LLM-powered software – and why copying traditional unit-testing patterns fails.
  2. Bootstrapping your first evaluation set from real or synthetic user data and annotating it with pass/fail signals that map directly to business goals.
  3. Programmatic & heuristic checks that you can drop straight into CI/CD or guardrails to catch regressions instantly.
  4. LLM-as-a-Judge evaluators – designing prompts, scoring rubrics and structured outputs so that one model grades another’s outputs (a minimal judge sketch follows this list).
  5. Alignment & bias analysis – measuring how well automated evaluators track human judgment (Cohen’s κ, Kendall τ) and mitigating position, verbosity and misinformation biases (see the alignment sketch after this list).
  6. Scaling to advanced capabilities – evaluating tool-use chains, image/video generation (Imagen, Veo) and fully agentic systems with Open Hands – all tracked and visualised in Weave dashboards.
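
To make step 4 concrete, here is a minimal LLM-as-a-judge sketch: a rubric embedded in the system prompt, deterministic decoding, and a structured JSON verdict. It assumes an OpenAI-compatible client; the rubric, model name and pass/fail criterion are illustrative, not the course notebooks.

```python
# Minimal LLM-as-a-judge sketch: rubric-driven prompt, structured JSON verdict.
# Assumes an OpenAI-compatible client; rubric and model name are illustrative.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You are grading a customer-support answer.\n"
    "Return a JSON object with two keys:\n"
    '  "verdict": "pass" or "fail" -- does the answer resolve the question factually and politely?\n'
    '  "reason": one short sentence explaining the verdict.'
)

def judge(question: str, answer: str) -> dict:
    """Ask the judge model for a binary pass/fail verdict plus a short rationale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",                      # any capable judge model will do
        temperature=0,                            # deterministic grading
        response_format={"type": "json_object"},  # force parseable, structured output
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(judge("How do I reset my password?",
            "Click 'Forgot password' on the sign-in page and follow the email link."))
```

Binary verdicts plus a one-sentence reason keep the judge’s output easy to aggregate and easy to audit.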
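
And for step 5, a sketch of validating the validator: Cohen’s κ (scikit-learn) and Kendall τ (SciPy) measure agreement between the judge’s verdicts and human ground truth. The label arrays below are made-up placeholders.

```python
# Validate the validator: compare automated verdicts with human labels.
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

human_labels = [1, 0, 1, 1, 0, 1, 0, 0]  # 1 = pass, 0 = fail (annotator ground truth)
judge_labels = [1, 0, 1, 0, 0, 1, 0, 1]  # verdicts from the automated judge

kappa = cohen_kappa_score(human_labels, judge_labels)  # chance-corrected agreement
tau, p_value = kendalltau(human_labels, judge_labels)  # rank correlation, useful for graded scores

print(f"Cohen's kappa = {kappa:.2f}, Kendall's tau = {tau:.2f} (p = {p_value:.2f})")
```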

Attendees leave with an evaluation playbook, starter notebooks, and an intuition for when to combine humans, rules and LLM judges to hit reliability targets without slowing iteration.

Key Takeaways:

  • Evaluation is a product feature, not an afterthought – you ship confidence, not just code.
  • Start binary, iterate later – pass/fail signals are easier to collect, act on and align than 5-point Likert scales.
  • Programmatic checks aren’t outdated – regexes, word limits and PII rules catch cheap bugs before you pay for a single LLM call (a sketch follows this list).
  • LLM-as-Judge scales domain expertise – with the right prompt + rubric, models grade hundreds of examples in minutes.
  • Always validate the validator – use alignment metrics (κ, τ) to track drift between automated scores and human ground truth.
  • Mind the biases – position, verbosity and misinformation-oversight biases can skew scores; controlled studies and prompt tweaks mitigate them.
  • Visualise everything – dashboards in Weave surface failure modes, token costs and latency at a glance (a minimal logging sketch follows this list).
  • Iterate, don’t stagnate – log user feedback in production and feed it back into new evaluation data and criteria.
  • Online vs. offline evaluation – offline suites gate releases before they ship, while online evaluation of live traffic catches the issues that only appear in production.
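
As a sketch of the programmatic-checks takeaway above, the snippet below turns a couple of regexes and a word limit into named pass/fail signals; the patterns and limit are illustrative, and real PII detection deserves a dedicated library.

```python
# Cheap heuristic guardrails that run before (or instead of) any LLM-judge call.
import re

EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")      # crude email detector
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")           # crude US SSN detector
MAX_WORDS = 150                                          # illustrative response budget

def heuristic_checks(text: str) -> dict:
    """Return named pass/fail signals that can gate a CI/CD run or a guardrail."""
    return {
        "no_email_leak": EMAIL_RE.search(text) is None,
        "no_ssn_leak": SSN_RE.search(text) is None,
        "within_word_limit": len(text.split()) <= MAX_WORDS,
    }

assert all(heuristic_checks("Your order ships tomorrow.").values())
```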
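
And a minimal sketch of the Weave logging mentioned above, assuming the W&B Weave SDK (`pip install weave`): `weave.init` plus the `@weave.op` decorator trace each call to a dashboard. The project name and scorer are hypothetical.

```python
# Trace evaluator calls to a Weave dashboard (assumes the W&B Weave SDK).
import weave

weave.init("llm-eval-demo")  # hypothetical project name

@weave.op()
def word_limit_check(text: str) -> dict:
    """A trivial scorer; each call's inputs, outputs and latency are traced to the dashboard."""
    return {"within_limit": len(text.split()) <= 150}

word_limit_check("Short answers are usually safe.")
```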
