Engineering Efficient LLM Inference: From Model Optimization to Scalable Systems
28 Apr 2025, 1:04 PM - 2:04 PM
About the Event
Large Language Models (LLMs) have set new benchmarks in AI—but turning their power into real-world products is no small feat. The true bottleneck? Inference. Running LLMs at scale demands fleets of GPUs, deep pockets, and serious engineering chops. In this session, we’ll go under the hood of how leading AI teams are slashing inference costs and boosting performance with smart model tweaks, system-level magic, and infrastructure hacks. Whether you're building AI products or scaling existing ones, this talk will equip you with practical insights to deploy LLMs efficiently—without burning through your cloud budget.
Key Takeaways:
- LLM inference is the hidden bottleneck in scaling AI applications efficiently and affordably.
- Deploying LLMs at scale requires system-level innovations and model optimizations to reduce cost and latency.
- Top AI companies are leveraging engineering strategies to make LLMs leaner, faster, and production-ready.
- The talk demystifies real-world deployment challenges and offers insights into building sustainable, scalable AI.
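To make the "model optimizations" takeaway above concrete, here is a minimal, illustrative sketch (not taken from the talk) of one common cost-cutting technique: loading a causal LM with 4-bit weight quantization using Hugging Face Transformers and bitsandbytes. The model ID is a placeholder; any causal LM you have access to works the same way.

```python
# Illustrative only: 4-bit weight quantization with Hugging Face Transformers
# + bitsandbytes, one common model-level optimization for cheaper inference.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder; swap in any causal LM

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights as 4-bit NF4
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # run matmuls in bf16 for quality
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",                      # place layers across available GPUs
)

prompt = "Explain why KV caching speeds up LLM inference, in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Quantizing weights this way roughly quarters the memory footprint of the model relative to fp16, which is one of the levers the session covers for fitting LLM serving into a smaller, cheaper GPU fleet.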
Who is this DataHour for?
- Engineers and practitioners building AI products or scaling existing ones
- Anyone looking for practical insights into deploying LLMs efficiently and affordably
About the Speaker
Rishit is currently a Machine Learning Engineer at Cohere and holds a Master’s in Computer Science from NYU Courant with a focus on ML and NLP. Passionate about building innovative AI products, he brings experience across retail, banking, and content creation. From developing intelligent systems for Fortune 500 clients to optimizing trading strategies with meta-ML platforms, his work blends research and real-world impact. He is driven by the elegance of the mathematical algorithms that power meaningful AI solutions. You can reach him on LinkedIn.
Become a Speaker
Share your vision, inspire change, and leave a mark on the industry. We're calling for innovators and thought leaders to speak at our event.
- Professional Exposure
- Networking Opportunities
- Thought Leadership
- Knowledge Exchange
- Leading-Edge Insights
- Community Contribution
