Building Intelligent Multimodal Agents: Integrating Vision, Speech & Language
23 August 2025 | 09:30 AM – 05:30 PM
About the workshop
In this workshop, we’ll build a fully functional multimodal Telegram agent, putting into practice a wide range of concepts from the world of Agentic AI. This isn’t just another PoC — it's designed for those who are ready to level up and build complex, production-ready agentic applications.
Throughout the session, you’ll learn how to build a Telegram agent you can chat with directly from your phone, master the creation and management of workflows with LangGraph, and set up a long-term memory system using Qdrant as a vector database. We’ll also leverage the fast LLMs served by Groq to power the agent’s responses, implement Speech-to-Text capabilities with Whisper, and integrate Text-to-Speech using ElevenLabs. Beyond language, you’ll learn to generate high-quality images using diffusion models, and process visual inputs with Vision-Language Models such as Llama 3.2 Vision.
Finally, we’ll bring it all together by connecting the complete agentic application directly to Telegram, enabling a rich, multimodal user experience. Throughout the day, you will focus on the following key areas:
- Understand the full architecture and stack for building production-grade multimodal agents.
- Learn to build and debug agent workflows using LangGraph and LangGraph Studio.
- Implement short-term (SQLite) and long-term memory (Qdrant) systems for your agent.
- Enable speech interactions using Whisper (STT) and ElevenLabs (TTS).
- Integrate vision-language understanding with Llama 3.2 Vision and generate images via diffusion models.
- Connect your agent to Telegram for real-time, mobile-accessible interactions.
In this workshop, participants will work hands-on with a cutting-edge stack of tools and technologies tailored for building multimodal, production-ready agentic applications. LangGraph serves as the backbone for orchestrating agent workflows, with LangGraph Studio enabling easy debugging and visualization. SQLite powers short-term memory within the agent, while Qdrant, a high-performance vector database, handles long-term memory for contextual awareness. Fast and efficient responses are delivered using Groq LLMs, complemented by natural voice interactions through Whisper for speech-to-text and ElevenLabs for text-to-speech synthesis. For visual intelligence, Llama 3.2 Vision interprets image inputs, and diffusion models are used to generate high-quality visuals. Finally, the complete system is integrated with the Telegram Bot API, allowing users to interact with the agent in real time via chat, voice, or image directly from their mobile devices.
Prerequisites:
- Basic Python programming skills
- Familiarity with LangChain or LangGraph
- Basic understanding of multimodal AI concepts
*Note: These are tentative details and are subject to change.

Modules
We'll start by reviewing the architecture and tech stack, setting up the repository, installing dependencies, and configuring environment variables.
We'll dive into the basics of LangGraph — nodes, edges, conditional edges, state — and break down how the agent’s "brain" works. You’ll also learn how to debug and test workflows using LangGraph Studio.
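The node/edge/state idea behind LangGraph can be previewed with a minimal, library-free sketch: a workflow is a set of named nodes (functions that update a shared state dict) plus edges, some conditional, that decide which node runs next. The node names and routing logic below are illustrative only, not the actual workshop code or the LangGraph API.

```python
# Library-free sketch of a graph workflow: nodes are functions over a shared
# state dict; each node returns the name of the next node ("END" to stop).

def router(state):
    # Conditional edge: pick the next node based on the incoming message type.
    return "transcribe" if state["input_type"] == "voice" else "respond"

def transcribe(state):
    # Placeholder for a speech-to-text step.
    state["text"] = f"<transcript of {state['payload']}>"
    return "respond"

def respond(state):
    # Placeholder for an LLM call producing the agent's reply.
    state["reply"] = f"Echo: {state.get('text', state['payload'])}"
    return "END"

NODES = {"router": router, "transcribe": transcribe, "respond": respond}

def run(state, entry="router"):
    node = entry
    while node != "END":
        node = NODES[node](state)
    return state

state = run({"input_type": "text", "payload": "hello"})
print(state["reply"])  # Echo: hello
```

In LangGraph proper, the same shape is expressed with `StateGraph`, `add_node`, `add_edge`, and `add_conditional_edges`, and LangGraph Studio lets you step through the graph visually.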
A deep dive into agent memory systems: using SQLite for short-term memory (LangGraph state) and Qdrant for long-term memory storage.
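The core of long-term memory is similarity search over embeddings: store (text, vector) pairs and retrieve the entries closest to a query vector. Here is a toy, pure-Python version of that idea; in the workshop this role is played by Qdrant, and the tiny 3-d "embeddings" below are made up for illustration.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class MemoryStore:
    """Toy stand-in for a vector database such as Qdrant."""

    def __init__(self):
        self.entries = []  # list of (text, vector) pairs

    def add(self, text, vector):
        self.entries.append((text, vector))

    def search(self, query_vector, top_k=1):
        # Return the top_k stored texts ranked by cosine similarity.
        ranked = sorted(self.entries,
                        key=lambda e: cosine(e[1], query_vector),
                        reverse=True)
        return [text for text, _ in ranked[:top_k]]

memory = MemoryStore()
memory.add("User likes sci-fi movies", [0.9, 0.1, 0.0])
memory.add("User lives in Berlin", [0.0, 0.2, 0.9])
print(memory.search([0.8, 0.2, 0.1]))  # ['User likes sci-fi movies']
```

Short-term memory is different in kind: it is just the conversation state the workflow carries between turns, which is why a lightweight store like SQLite suffices there.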
We'll implement Text-to-Speech (with ElevenLabs) and Speech-to-Text (with Whisper), giving your agent the ability to listen and speak naturally.
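The voice loop has a simple shape: transcribe audio to text, generate a reply, synthesize the reply back to audio. The stub functions below stand in for the real Whisper, Groq, and ElevenLabs calls, which require API keys and audio I/O; only the pipeline structure is shown, and the fake transcription logic is purely illustrative.

```python
def speech_to_text(audio_bytes: bytes) -> str:
    # Stand-in for a Whisper transcription call.
    # For the sketch, pretend the audio bytes "contain" their transcript.
    return audio_bytes.decode("utf-8")

def generate_reply(text: str) -> str:
    # Stand-in for an LLM call (a Groq-served model in the workshop).
    return f"You said: {text}"

def text_to_speech(text: str) -> bytes:
    # Stand-in for an ElevenLabs synthesis call returning audio bytes.
    return text.encode("utf-8")

def handle_voice_message(audio_bytes: bytes) -> bytes:
    # Listen -> think -> speak.
    return text_to_speech(generate_reply(speech_to_text(audio_bytes)))

print(handle_voice_message(b"what time is it?"))
```

Keeping the three stages behind their own functions makes it easy to swap providers later, e.g. a different TTS voice, without touching the agent logic.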
We’ll integrate a Vision-Language Model to interpret images and a Diffusion Model to generate realistic, high-quality images.
Finally, we'll connect the full agent backend to a Telegram Bot — enabling real-time conversations, image processing, and voice interactions directly on your phone.
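A key piece of the Telegram integration is routing each incoming update by modality. The dicts below mirror the documented JSON shape of a Telegram Bot API update (`message.text`, `message.voice`, `message.photo`); actually fetching updates and sending replies through the API is omitted from this sketch.

```python
def classify_update(update: dict) -> str:
    # Decide which branch of the agent should handle a Telegram update.
    message = update.get("message", {})
    if "voice" in message:
        return "voice"
    if "photo" in message:
        return "image"
    if "text" in message:
        return "text"
    return "unsupported"

print(classify_update({"message": {"text": "hi"}}))               # text
print(classify_update({"message": {"voice": {"file_id": "x"}}}))  # voice
```

In the full agent, this classification feeds the workflow's conditional edges, so voice notes go through transcription and photos through the vision model before the LLM responds.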
By the end of this final module, I'll also share practical tips on how to improve the system further and specialize it for different business use cases.
Instructor
Certificate of Participation
Receive a digital (blockchain-enabled) and physical certificate to showcase your accomplishment to the world.
- Earn your certificate
- Share your achievement
