Ayush Thakur

Machine Learning Engineer

Weights & Biases

Ayush Thakur is an AI Engineering Manager at Weights & Biases and a Google Developer Expert in Machine Learning. He leads open-source integrations at W&B to give developers access to industry-standard MLOps and LLMOps tools. Passionate about large language models, he spends his time exploring best practices and evaluation methods, and building real-world LLM applications.

In this hack session, we will cover techniques for building, optimizing, and scaling LLM-as-a-judge evaluators with minimal human input. We will examine the biases inherent in these evaluators, how to mitigate them, and, most importantly, how to align the evaluators with human preferences.

This is a fast-paced, hands-on session that shows practitioners how to turn “it works on my prompt” demos into production-ready AI systems they can trust. Drawing on material from LLM Apps: Evaluation (created with Google AI and Open Hands), the hack session walks you through the complete evaluation lifecycle:

  1. Why evaluation is different for LLM-powered software – and why copying traditional unit-testing patterns fails.
  2. Bootstrapping your first evaluation set from real or synthetic user data and annotating it with pass-fail signals that map directly to business goals (see the first sketch after this list).
  3. Programmatic & heuristic checks that you can drop straight into CI/CD or guardrails to catch regressions instantly (second sketch below).
  4. LLM-as-a-Judge evaluators – designing prompts, scoring rubrics, and structured outputs so the model grades itself (third sketch below).
  5. Alignment & bias analysis – measuring how well automated evaluators track human judgment (Cohen’s κ, Kendall τ) and mitigating position, verbosity, and misinformation biases (fourth sketch below).
  6. Scaling to advanced capabilities – evaluating tool-use chains, image/video generation (Imagen, Veo), and fully agentic systems with Open Hands – all tracked and visualised in Weave dashboards (fifth sketch below).
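
To give a flavour of step 2, here is a minimal sketch of turning a handful of annotated user interactions into a versioned evaluation set with W&B Weave. The project name, row fields, and examples are illustrative assumptions, not taken from the course material:

```python
import weave

weave.init("llm-evals-demo")  # hypothetical project name

# Each row pairs a real or synthetic user input with a pass-fail label
# that maps to a business goal (here: "did the answer resolve the ticket?").
rows = [
    {"question": "How do I reset my password?",
     "answer": "Click 'Forgot password' on the login page.",
     "passes": True},
    {"question": "How do I reset my password?",
     "answer": "Please contact our CEO directly.",
     "passes": False},
]

dataset = weave.Dataset(name="support-evals-v0", rows=rows)
weave.publish(dataset)  # versioned and browsable in the Weave UI
```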
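For step 3, programmatic checks are ordinary functions, so they slot straight into pytest or any CI job. The specific heuristics below (placeholder leaks, a word budget, source citation) are illustrative assumptions:

```python
import re

def contains_no_placeholder(text: str) -> bool:
    """Heuristic: fail if the model leaked template placeholders."""
    return not re.search(r"\{\{.*?\}\}", text)

def within_length_budget(text: str, max_words: int = 120) -> bool:
    """Heuristic: fail if the answer exceeds the word budget."""
    return len(text.split()) <= max_words

def cites_source(text: str) -> bool:
    """Heuristic: pass only if the answer references a source URL."""
    return "http" in text

def run_checks(text: str) -> dict:
    """Return a pass-fail map that a CI job or guardrail can assert on."""
    return {
        "no_placeholder": contains_no_placeholder(text),
        "length_ok": within_length_budget(text),
        "cites_source": cites_source(text),
    }
```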
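For step 4, a sketch of the judge pattern: a rubric-bearing prompt plus a JSON verdict, so the grade is machine-readable. The rubric is an illustrative assumption, and `call_llm` is a hypothetical stand-in for whichever client (Gemini, OpenAI, ...) you actually use:

```python
import json

JUDGE_PROMPT = """You are grading a customer-support answer against a rubric.
Rubric:
  1 = factually wrong or off-topic
  2 = partially correct but incomplete
  3 = correct and complete
Respond with JSON only: {{"score": <1-3>, "rationale": "<one sentence>"}}

Question: {question}
Answer: {answer}
"""

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in: route the prompt to your real model client."""
    raise NotImplementedError

def judge(question: str, answer: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(question=question, answer=answer))
    return json.loads(raw)  # structured output, e.g. {"score": 3, "rationale": "..."}
```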
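For step 5, measuring evaluator-human alignment needs no custom math; standard library calls suffice. The scores below are made-up illustrations:

```python
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

human = [3, 1, 2, 3, 2, 1]      # hypothetical human rubric scores
llm_judge = [3, 2, 2, 3, 1, 1]  # hypothetical LLM-judge scores, same items

# Cohen's kappa: chance-corrected agreement on categorical labels.
kappa = cohen_kappa_score(human, llm_judge)

# Kendall's tau: do the judge and the human rank outputs the same way?
tau, p = kendalltau(human, llm_judge)

print(f"kappa={kappa:.2f}  tau={tau:.2f} (p={p:.3f})")

# Position bias is usually probed separately: re-run every pairwise
# comparison with the candidate order swapped and count verdict flips.
```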
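And for step 6, a sketch of the Weave evaluation pattern that the dashboards in the session are built on, assuming the current `weave` Python SDK; the toy model and scorer are placeholders for a real tool-use chain or agent:

```python
import asyncio
import weave

weave.init("llm-evals-demo")  # hypothetical project name

@weave.op()
def exact_match(expected: str, output: str) -> dict:
    """Toy scorer; swap in an LLM judge or a tool-trace checker here."""
    return {"correct": expected.strip() == output.strip()}

@weave.op()
def toy_model(question: str) -> str:
    """Placeholder for your real application, tool chain, or agent call."""
    return "42"

evaluation = weave.Evaluation(
    dataset=[{"question": "What is 6 * 7?", "expected": "42"}],
    scorers=[exact_match],
)

# Every call, score, and trace lands in the Weave dashboard.
asyncio.run(evaluation.evaluate(toy_model))
```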

Attendees leave with an evaluation playbook, starter notebooks, and an intuition for when to combine humans, rules, and LLM judges to hit reliability targets without slowing iteration.

Managing and scaling ML workloads has never been a bigger challenge. Data scientists want to collaborate while building, training, and iterating on thousands of AI experiments. On the flip side, ML engineers need distributed training, artifact management, and automated deployment for high performance.
