Hardik Meisheri

Senior Applied Scientist

About

Hardik Meisheri is a Senior Applied Scientist at Microsoft AI, with over a decade of experience in reinforcement learning and machine learning. His work spans the full arc of RL's evolution: from classical control problems like supply chain optimization and multi-agent coordination at TCS Research, to fine-tuning LLMs with RLHF for ad policy understanding at Amazon Advertising, to building large-scale foundational models at Microsoft AI today. Across these roles, he has tackled many of the core challenges that underpin today's LLM-based agents: designing effective reward signals, improving sample efficiency in complex environments, scaling multi-agent systems under real-world constraints, and aligning model behavior with production-level objectives.

Why optimizing for edge cases quietly degrades general intelligence, and how to prevent it
 
Fine-tuning and reinforcement learning have become the default tools for making LLMs safer and more useful. But these methods introduce a hidden cost: optimizing for rare, high-stakes signals often degrades the model’s broader capabilities in ways that are rarely measured before deployment.
 
This talk explores what we call the "alignment tax", and why it grows disproportionately when training signals are sparse. Safety violations, edge-case behaviors, and domain-specific exceptions are often the most critical examples in a dataset, but their rarity creates a structural imbalance: the updates meant to fix them can overwrite the very capabilities that make the model useful.
 
We will unpack the mechanism behind this effect, showing how rare signals produce outsized gradient updates that distort the model’s internal representations and erode its "logic scaffolding", the general reasoning ability underlying tasks like coding, mathematics, and structured problem solving.
 
Key areas we will cover:
 
  • Why rarity amplifies impact: the gradient dynamics that cause small slices of data to disproportionately shape model behavior
  • Three production failure modes: capability forgetting, distribution brittleness, and reward hacking masked by evaluation metrics
  • The logic scaffolding problem: why degrading general reasoning is an early warning sign of deeper system failure
  • A measurement framework: how to detect alignment tax before deployment
  • Practical mitigation strategies: including gradient isolation, model averaging, and training-time tradeoff design
  • Cross-domain examples: systems tuned for rare, high-stakes signals often become brittle in the common case, from anomaly-sensitive decision pipelines that over-flag legitimate inputs, to compressed detectors in scientific systems that miss the very rare events they seek, to medical models that maintain headline accuracy while drifting under real-world distribution shifts

If you are building, fine-tuning, or deploying LLM systems, this talk offers a more precise lens for understanding the tradeoffs you are already making, whether you realize it or not.
