Atul Deshpande is a distinguished technology leader with over two decades of experience spanning telecommunications, IT, and AI engineering. As Principal Chief Architect at Red Hat’s Field CTO Office (Global), he spearheads digital transformation initiatives for leading telecom operators, working closely with CIOs and CTOs to drive innovation in AI/ML, Gen AI, Automation, and Hybrid Cloud adoption powered by open source technologies. Currently, he is leading efforts to architect fully autonomous networks for telcos using AI/ML, Gen AI, and Agentic AI.
Previously, Atul was Principal Architect at Quantiphi, where he led the design and deployment of cutting-edge AI solutions in collaboration with Google, delivering transformative projects for global enterprises including Disney, Orange, and Telecom Argentina. As a founding member of the Rakuten Cloud Platform (now Rakuten Symphony), he played a pivotal role in creating the Cloud-Native BSS (Project Kokoro) and the SixthSense Observability Platform, key enablers of Rakuten Mobile’s fully virtualized network. Earlier in his career, he advanced hybrid cloud adoption during his tenure at NTT Data Centers.
Atul is an inventor and patent holder for a pioneering LTE network fault detection technique: a Random Forest-based algorithm that identifies and predicts sleeping cells with over 80% accuracy and is currently in production at Rakuten Mobile. He holds an M.Tech in Microwave Engineering and is an alumnus of IIM Calcutta. An active member of IEEE, the IEEE Computer Society, the IEEE Communications Society, and ACM, Atul is passionate about collaborating with industry peers, academia, and research organizations to advance the future of AI-powered autonomous networks.
Large Language Models (LLMs) have rapidly transitioned from a nascent concept to a pervasive force in machine learning, enabling sophisticated applications from conversational AI to content generation. However, the deployment of these powerful models at scale presents significant challenges, notably high latency and inefficient resource utilization. Traditional inference pipelines often struggle with the immense computational and memory demands of LLMs, leading to slow response times and prohibitive operational costs. This is where vLLM, standing for Virtual Large Language Model, emerges as a transformative solution. Developed initially at the Sky Computing Lab at UC Berkeley, vLLM is an open-source library designed to optimize and accelerate LLM inference and serving, ensuring faster and more cost-effective deployment.
vLLM primarily tackles these scaling challenges through innovative memory management and parallelization techniques. Its flagship feature, PagedAttention, revolutionizes management of the attention key-value (KV) cache. Unlike traditional methods that require contiguous memory, PagedAttention stores the KV cache in fixed-size, non-contiguous blocks, akin to paging in operating-system virtual memory. This dramatically reduces memory fragmentation and allows far more efficient memory reuse, improving throughput by some estimates up to 24 times over other popular open-source libraries such as Hugging Face Transformers. Complementing PagedAttention is continuous batching, which dynamically admits incoming requests into in-flight batches, keeping the GPU highly utilized and minimizing idle compute time. Together, these innovations address the memory constraints, token-by-token generation overhead, and inefficient batching that plague traditional LLM inference systems. Additional optimizations include asynchronous prefetching, optimized CUDA kernels, quantization support, and speculative decoding.
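To make these ideas concrete, here is a minimal sketch using vLLM's offline inference API; the model name, sampling settings, and memory fraction are illustrative choices, not recommendations from the session. PagedAttention and continuous batching happen inside the engine, so the caller simply submits prompts.

```python
# Minimal vLLM offline-inference sketch; model and parameters are arbitrary
# illustrative choices.
from vllm import LLM, SamplingParams

prompts = [
    "Explain PagedAttention in one sentence.",
    "Why does continuous batching improve GPU utilization?",
]

# Sampling settings; max_tokens bounds the decode phase for each request.
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=64)

# The engine reserves GPU memory for model weights plus KV-cache blocks
# (PagedAttention allocates the cache in fixed-size, non-contiguous blocks).
llm = LLM(model="facebook/opt-125m", gpu_memory_utilization=0.90)

# generate() schedules all prompts through the engine, which batches work
# across their prefill and decode steps rather than running them one by one.
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt)
    print(output.outputs[0].text)
```

For online serving, the same engine is typically exposed through vLLM's OpenAI-compatible server (for example via the `vllm serve` command), where continuous batching lets new requests join in-flight batches instead of waiting for a full batch to drain.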
Yet as LLM use cases grow, inference needs to be not just fast but also distributed and production-ready in cloud-native environments. Enter llm-d, a Kubernetes-native distributed LLM inference framework launched in collaboration with Red Hat, Google, CoreWeave, IBM, and others. llm-d is designed for:
- Disaggregated serving: splitting the prefill (prompt processing) and decode (token generation) phases across specialized workloads to optimize GPU/accelerator use and minimize latency.
- KV-cache aware routing: enabling request scheduling that exploits existing cache hits, reducing redundant computation and response time (see the sketch after this list).
- Scalable, modular clusters: orchestrated seamlessly within Kubernetes, llm-d empowers enterprises to deploy LLM inference clouds that meet strict SLAs while adapting to any infrastructure, model, or accelerator.
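As a rough illustration of the KV-cache aware routing idea, the sketch below scores candidate decode replicas by how much of an incoming prompt's prefix they already hold in cache. This is not llm-d's actual scheduling interface, which is configured through Kubernetes resources; the `Endpoint` class, the scoring weights, and the token lists are all hypothetical.

```python
# Hypothetical sketch of KV-cache aware routing: prefer the replica whose
# cached prefix overlaps most with the incoming prompt, penalizing load.
# Illustrates the idea only; not llm-d's real API.
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    name: str
    # Token-ID prefixes this replica already holds in its KV cache.
    cached_prefixes: list[list[int]] = field(default_factory=list)
    queue_depth: int = 0  # pending requests, used as a simple load signal


def prefix_overlap(prompt: list[int], prefix: list[int]) -> int:
    """Length of the shared leading token sequence."""
    n = 0
    for a, b in zip(prompt, prefix):
        if a != b:
            break
        n += 1
    return n


def pick_endpoint(prompt: list[int], endpoints: list[Endpoint]) -> Endpoint:
    """Score each replica by reusable cached tokens minus a load penalty."""
    def score(ep: Endpoint) -> int:
        best_hit = max(
            (prefix_overlap(prompt, p) for p in ep.cached_prefixes), default=0
        )
        return best_hit - 8 * ep.queue_depth  # weight is arbitrary for the sketch
    return max(endpoints, key=score)


# Example: "decode-1" already cached the shared system prompt, so it wins
# even though it is slightly more loaded.
system_prompt = list(range(100))            # stand-in token IDs
request = system_prompt + [101, 102, 103]
replicas = [
    Endpoint("decode-0", queue_depth=1),
    Endpoint("decode-1", cached_prefixes=[system_prompt], queue_depth=2),
]
print(pick_endpoint(request, replicas).name)  # -> decode-1
```

In llm-d itself, routing decisions of this kind live in the platform's scheduler rather than in application code, and prefill and decode run as separately scaled Kubernetes workloads, which is what lets cache-aware placement and disaggregated serving work together.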
Join this session for a technical deep dive into what enables the efficiency of vLLM, how llm-d operationalizes LLM inference at scale, and how this ecosystem is driving the next frontier of accessible, production-grade AI.