Ayush Thakur

Machine Learning Engineer

Weights & Biases

Ayush Thakur is an AI Engineering Manager at Weights & Biases and a Google Developer Expert in Machine Learning. He leads open-source integrations at W&B to give developers access to industry-standard MLOps and LLMOps tools. Passionate about large language models, he spends his time exploring best practices and evaluation methods, and building real-world LLM applications.

In this hack session, we will cover techniques for building, optimizing, and scaling LLM-as-a-judge evaluators with minimal human input. We will examine the biases inherent in these evaluators, how to mitigate them, and, most importantly, how to align the evaluators with human preferences.

This is a fast-paced, hands-on session that shows practitioners how to turn “it works on my prompt” demos into production-ready AI systems they can trust. Drawing on material from LLM Apps: Evaluation (created with Google AI and Open Hands), the hack session walks you through the complete evaluation lifecycle:

  1. Why evaluation is different for LLM-powered software – and why copying traditional unit-testing patterns fails.
  2. Bootstrapping your first evaluation set from real or synthetic user data and annotating it with pass-fail signals that map directly to business goals (see the first sketch after this list).
  3. Programmatic & heuristic checks that you can drop straight into CI/CD or guardrails to catch regressions instantly (second sketch below).
  4. LLM-as-a-Judge evaluators – designing prompts, scoring rubrics, and structured outputs so the model grades itself (third sketch below).
  5. Alignment & bias analysis – measuring how well automated evaluators track human judgment (Cohen’s κ, Kendall τ) and mitigating position, verbosity, and misinformation biases (fourth sketch below).
  6. Scaling to advanced capabilities – evaluating tool-use chains, image/video generation (Imagen, Veo), and fully agentic systems with Open Hands – all tracked and visualised in Weave dashboards (fifth sketch below).
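
To give a flavour of step 2, here is a minimal sketch of turning a handful of annotated user interactions into a versioned evaluation set with W&B Weave. The project name, row fields, and examples are illustrative assumptions, not taken from the course material:

```python
import weave

weave.init("llm-evals-demo")  # hypothetical project name

# Each row pairs a real or synthetic user input with a pass-fail label
# that maps to a business goal (here: "did the answer resolve the ticket?").
rows = [
    {"question": "How do I reset my password?",
     "answer": "Click 'Forgot password' on the login page.",
     "passes": True},
    {"question": "How do I reset my password?",
     "answer": "Please contact our CEO directly.",
     "passes": False},
]

dataset = weave.Dataset(name="support-evals-v0", rows=rows)
weave.publish(dataset)  # versioned and browsable in the Weave UI
```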
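For step 3, programmatic checks are ordinary functions, so they slot straight into pytest or any CI job. The specific heuristics below (placeholder leaks, a word budget, source citation) are illustrative assumptions:

```python
import re

def contains_no_placeholder(text: str) -> bool:
    """Heuristic: fail if the model leaked template placeholders."""
    return not re.search(r"\{\{.*?\}\}", text)

def within_length_budget(text: str, max_words: int = 120) -> bool:
    """Heuristic: fail if the answer exceeds the word budget."""
    return len(text.split()) <= max_words

def cites_source(text: str) -> bool:
    """Heuristic: pass only if the answer references a source URL."""
    return "http" in text

def run_checks(text: str) -> dict:
    """Return a pass-fail map that a CI job or guardrail can assert on."""
    return {
        "no_placeholder": contains_no_placeholder(text),
        "length_ok": within_length_budget(text),
        "cites_source": cites_source(text),
    }
```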
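For step 4, a sketch of the judge pattern: a rubric-bearing prompt plus a JSON verdict, so the grade is machine-readable. The rubric is an illustrative assumption, and `call_llm` is a hypothetical stand-in for whichever client (Gemini, OpenAI, ...) you actually use:

```python
import json

JUDGE_PROMPT = """You are grading a customer-support answer against a rubric.
Rubric:
  1 = factually wrong or off-topic
  2 = partially correct but incomplete
  3 = correct and complete
Respond with JSON only: {{"score": <1-3>, "rationale": "<one sentence>"}}

Question: {question}
Answer: {answer}
"""

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: route the prompt to your real model client."""
    raise NotImplementedError

def judge(question: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # structured output, e.g. {"score": 3, "rationale": "..."}
```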
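For step 5, measuring evaluator-human alignment needs no custom math; standard library calls suffice. The scores below are made-up illustrations:

```python
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

human = [3, 1, 2, 3, 2, 1]      # hypothetical human rubric scores
llm_judge = [3, 2, 2, 3, 1, 1]  # hypothetical LLM-judge scores, same items

# Cohen's kappa: chance-corrected agreement on categorical labels.
kappa = cohen_kappa_score(human, llm_judge)

# Kendall's tau: do the judge and the human rank outputs the same way?
tau, p = kendalltau(human, llm_judge)

print(f"kappa={kappa:.2f}  tau={tau:.2f} (p={p:.3f})")

# Position bias is usually probed separately: re-run every pairwise
# comparison with the candidate order swapped and count verdict flips.
```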
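And for step 6, a sketch of the Weave evaluation pattern that the dashboards in the session are built on, assuming the current `weave` Python SDK; the toy model and scorer are placeholders for a real tool-use chain or agent:

```python
import asyncio
import weave

weave.init("llm-evals-demo")  # hypothetical project name

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    """Toy scorer; swap in an LLM judge or a tool-trace checker here."""
    return {"correct": expected.strip() == output.strip()}

@weave.op()
def toy_model(question: str) -> str:
    """Placeholder for your real application, tool chain, or agent call."""
    return "42"

evaluation = weave.Evaluation(
    dataset=[{"question": "What is 6 * 7?", "expected": "42"}],
    scorers=[exact_match],
)

# Every call, score, and trace lands in the Weave dashboard.
asyncio.run(evaluation.evaluate(toy_model))
```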

Attendees leave with an evaluation playbook, starter notebooks, and an intuition for when to combine humans, rules, and LLM judges to hit reliability targets without slowing iteration.

Managing and scaling ML workloads has never been a bigger challenge. Data scientists want to collaborate while building, training, and iterating on thousands of AI experiments. On the flip side, ML engineers need distributed training, artifact management, and automated deployment for high performance.
