Efficient LLM Inference: Bridging Practice and Research

Hack Session

About the session

As generative AI becomes integral to modern life, the cost of token generation rises. This session addresses the growing demand for higher tokens-per-watt efficiency. We will discuss emerging workloads (e.g., RAG), techniques (e.g., token pruning, speculative decoding, and quantization), and the role of hardware, as well as the challenges and opportunities in bridging research and practical deployment.

Speaker
