Opening the Black Box of AI Agents: Mechanistic Interpretability

Hack Session

About the session

We are entering the era of agentic AI, where language models no longer just answer questions but autonomously plan, use tools, execute multi-step workflows, and make consequential decisions. At the same time, enterprises are moving beyond monolithic frontier models toward small, custom language models (SLMs) fine-tuned for domain-specific tasks in healthcare, finance, legal, and supply chain operations. Yet a critical gap persists: we deploy these systems with almost no understanding of how they internally store, retrieve, or reason over knowledge. When an autonomous agent hallucinates a drug interaction, or a custom SLM trained on proprietary data produces a confidently wrong answer, the cost is no longer an inconvenient chatbot response: it is a regulatory violation, a clinical risk, or a material business loss.
Mechanistic interpretability is an emerging research discipline that directly addresses this gap. Rather than treating models as opaque black boxes and relying on post-hoc explanations or behavioral benchmarks, mechanistic interpretability reverse-engineers the internal computations of neural networks to identify the specific circuits, attention heads, and MLP layers responsible for distinct behaviors. This matters more now than ever: as organizations deploy agentic systems that chain multiple model calls with tool use and memory, understanding what each component represents and why it makes a particular decision becomes foundational to building trustworthy, auditable AI that complies with regulation. For SLMs and custom models, where fine-tuning can inadvertently introduce biases, overwrite safety-critical knowledge, or create unexpected failure modes, interpretability is not a research luxury. It is an operational necessity.
In this 60–90 minute hack session, we take a hands-on, code-driven journey into the internals of transformer models (GPT-2 / Pythia) using TransformerLens and PyTorch. Attendees will learn to apply the Logit Lens technique to visualize how predictions evolve layer by layer, use activation patching to isolate which components are causally responsible for a model’s output, and perform knowledge editing: surgically modifying a model’s factual recall by intervening on a single MLP layer, without any fine-tuning. Each technique is demonstrated on curated datasets with complete code walkthroughs, bridging the gap between cutting-edge research papers and practical, production-relevant skills.
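To give a feel for the mechanics of these three techniques before the session, here is a minimal sketch on a hand-built toy model in pure Python. It is not the session's code and does not use TransformerLens, GPT-2, or Pythia. The 4-dimensional residual stream, the three ReLU "MLP" layers, all weights, and the names `run`, `logit_diff`, `lens`, and `recovery` are hypothetical and chosen only so the results are easy to follow. The knowledge edit is a plain rank-one weight update, a simplified stand-in for the editing methods in the research literature.

```python
import math

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(z):
    top = max(z)
    exps = [math.exp(x - top) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical residual stream layout: [subject_A, subject_B, answer_0, answer_1]
W0 = [[0, 0, 0, 0], [0, 0, 0, 0], [0.1, 0.1, 0, 0], [0, 0, 0, 0]]  # same small write for both subjects
W1 = [[0, 0, 0, 0], [0, 0, 0, 0], [2, 0, 0, 0], [0, 2, 0, 0]]      # "fact recall": A -> answer_0, B -> answer_1
W2 = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0.5, 0], [0, 0, 0, 0.5]]  # amplifies whatever answer is present
LAYERS = [W0, W1, W2]
W_U = [[0, 0, 1, 0], [0, 0, 0, 1]]  # toy unembedding: answer dims become logits

def run(x, layers=LAYERS, patch=None):
    """Forward pass; returns (residual states, per-layer outputs).
    patch=(i, vec) overwrites layer i's output with vec."""
    resid, states, outs = list(x), [list(x)], []
    for i, W in enumerate(layers):
        out = [max(0.0, h) for h in matvec(W, resid)]  # toy MLP: ReLU(W @ resid)
        if patch is not None and patch[0] == i:
            out = list(patch[1])
        resid = [r + o for r, o in zip(resid, out)]    # residual connection
        states.append(resid)
        outs.append(out)
    return states, outs

def logit_diff(states):
    logits = matvec(W_U, states[-1])
    return logits[0] - logits[1]

clean, corrupted = [1, 0, 0, 0], [0, 1, 0, 0]  # prompts about subject A and subject B
clean_states, clean_outs = run(clean)
corr_states, _ = run(corrupted)

# 1) Logit lens: unembed every intermediate residual state, track P(answer_0).
lens = [softmax(matvec(W_U, s))[0] for s in clean_states]

# 2) Activation patching: copy one clean layer output into the corrupted run
#    and measure how much of the clean logit difference is restored.
base, target = logit_diff(corr_states), logit_diff(clean_states)
recovery = []
for i in range(len(LAYERS)):
    patched_states, _ = run(corrupted, patch=(i, clean_outs[i]))
    recovery.append((logit_diff(patched_states) - base) / (target - base))

# 3) Knowledge editing: rank-one update so layer 1 maps subject A's key
#    to a new value (answer_1), leaving the other layers untouched.
key = clean_states[1]        # residual stream entering layer 1 on the clean prompt
new_value = [0, 0, 0, 2]     # desired pre-activation output for that key
delta = [v - h for v, h in zip(new_value, matvec(W1, key))]
norm = sum(k * k for k in key)
W1_edited = [[w + d * k / norm for w, k in zip(row, key)] for row, d in zip(W1, delta)]
edited_states, _ = run(clean, layers=[W0, W1_edited, W2])

print("logit lens P(answer_0) per layer:", [round(p, 3) for p in lens])
print("patching recovery per layer:", [round(r, 3) for r in recovery])
print("logit diff before / after edit:", round(target, 3), round(logit_diff(edited_states), 3))
```

In this toy, the lens probability rises most after layer 1, patching layer 1 alone restores the full clean logit difference, and the rank-one edit flips the answer for subject A. All three point at the same layer, which is the kind of convergent evidence the real techniques aim for on actual transformer activations.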
The business implications are immediate and tangible. Mechanistic interpretability enables teams to debug hallucinations at the source rather than layering expensive guardrails on top, to audit custom models before deployment for regulatory compliance (EU AI Act, FDA guidance on AI/ML in healthcare), and to perform targeted model patching, correcting a single factual error or bias in production without the cost and latency of full retraining. For organizations building agentic workflows, it provides a principled methodology to verify that each model in the chain is reasoning correctly, not just producing plausible outputs. Attendees will walk away with an intuitive understanding of how transformers process information, a practical toolkit for probing model internals, and a clear roadmap for applying these techniques to their own production systems, custom models, and agentic architectures.

Speaker
