Building Real-World AI Systems With Small Language Models

  • Aug 08, 2026
  • 09:30AM – 05:30PM

About the Workshop

This full-day, hands-on workshop equips participants to design, build, and optimize real-world AI systems powered by Small Language Models (SLMs). Rather than relying on large, compute-heavy LLMs, it focuses on cost-efficient, production-aware architectures that run entirely within Google Colab’s free tier, making advanced AI engineering accessible without expensive compute infrastructure.

Participants will progress through six tightly integrated modules, building toward a complete multi-agent, RAG-enabled AI system that solves a real-world problem. Every module includes end-to-end hands-on demos with pre-configured notebooks that participants take home after the session. 

Key Learning Outcomes 

By the end of this workshop, participants will be able to: 

  • Deploy and run inference with SLMs (Phi-3 Mini, Gemma 2B, TinyLlama) within Google Colab free-tier limits 
  • Apply quantization techniques and parameter-efficient fine-tuning with QLoRA for domain-specific tasks
  • Build a lightweight RAG pipeline with vector search, connecting fine-tuned SLMs to external knowledge bases 
  • Design and orchestrate multi-agent workflows using role-specialized SLMs with shared state and minimal memory overhead 
  • Architect an end-to-end Agentic RAG system combining retrieval, reasoning, and generation under constrained compute 
  • Simulate edge deployment using llama.cpp with GGUF models, profiling latency and optimizing for CPU-first execution 

Prerequisites

  • Proficiency in Python (intermediate level; ability to read and write functions, classes, and scripts)

  • Foundational familiarity with machine learning concepts or prior exposure to LLMs/NLP is helpful but not required

  • A Google account for Colab and Drive access

Workshop Modules

Module 1

  • What are SLMs? Architecture overview and key design philosophies (Phi-3 Mini, Gemma 2B, TinyLlama, SmolLM)
  • SLM vs. LLM trade-offs in latency, cost, and accuracy; tokenization and context-window limits in resource-constrained environments
  • Hands-on: Load and run inference with Phi-3 Mini using Hugging Face Transformers on text summarization and classification tasks
  • Hands-on: Benchmark multiple SLMs on the same prompt, comparing output quality, token throughput, and memory usage
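
The memory side of the SLM vs. LLM trade-off can be approximated with simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below is illustrative only. The parameter counts are rounded approximations assumed for this example, `weight_memory_gib` is a hypothetical helper, and the estimate ignores activations, the KV cache, and framework overhead.

```python
# Rough weight-only memory estimate; ignores activations, KV cache, and overhead.
# Parameter counts are rounded approximations (assumptions for illustration).
APPROX_PARAMS = {
    "Phi-3 Mini": 3.8e9,
    "Gemma 2B": 2.5e9,
    "TinyLlama": 1.1e9,
}

BITS_PER_PARAM = {"FP16": 16, "INT8": 8, "INT4": 4}


def weight_memory_gib(num_params: float, bits: int) -> float:
    """Return approximate weight memory in GiB for a given precision."""
    return num_params * bits / 8 / 2**30


if __name__ == "__main__":
    for name, params in APPROX_PARAMS.items():
        row = ", ".join(
            f"{fmt}: {weight_memory_gib(params, bits):.2f} GiB"
            for fmt, bits in BITS_PER_PARAM.items()
        )
        print(f"{name:<11} {row}")
```

Halving the bits per parameter halves the estimate, which is why lower-precision formats matter so much when working inside a fixed memory budget.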

Module 2

  • When to fine-tune vs. when to rely on prompt engineering
  • Core principles of quantization: INT4 and INT8 methods, and the GGUF format for CPU-based inference
  • Parameter-efficient fine-tuning with QLoRA
  • Hands-on: Fine-tune Phi-3 Mini or Gemma 2B with QLoRA on a domain-specific dataset
  • Hands-on: Evaluate fine-tuned vs. base model performance
  • Hands-on: Export and save fine-tuned adapters; merge and reload for downstream use in Modules 3–5
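
To make the quantization idea concrete, here is a minimal sketch of symmetric (absmax) INT8 quantization using only the standard library. It is not the 4-bit scheme QLoRA uses, nor any GGUF block format; it uses a single shared scale for the whole vector, and the function names are hypothetical.

```python
# Symmetric (absmax) INT8 quantization of a small weight vector, stdlib only.
# Real toolchains quantize per block or per channel; this is a single-scale sketch.

def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integer codes."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [0.12, -0.53, 0.98, -1.27, 0.0]
    codes, scale = quantize_int8(weights)
    restored = dequantize_int8(codes, scale)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print("codes:", codes)
    print(f"scale: {scale:.5f}, worst-case error: {worst:.5f}")
```

The round-trip error per weight is bounded by about half the scale, so a vector with one large outlier loses more precision on its small values. That is one reason practical schemes use a separate scale per block.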

Module 3

  • Core RAG components: chunking methods, embeddings, and vector database options including ChromaDB and FAISS
  • Hands-on: Build a streamlined RAG pipeline that processes a document set and answers queries using the fine-tuned SLM from Module 2
  • Hands-on: Implement a Document Q&A system to extract information from research papers and technical manuals
  • Evaluation: measuring RAG quality through relevance, faithfulness, and hallucination mitigation
  • Comparison: RAG integration with an LLM vs. a fine-tuned SLM
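
The chunk, embed, and retrieve steps listed above can be sketched without any external dependency. This is a toy stand-in under stated assumptions: word counts replace a real embedding model, a sorted list replaces a vector store such as ChromaDB or FAISS, and all function names are hypothetical.

```python
# Toy retrieval step: overlapping word chunks, bag-of-words vectors, cosine
# similarity. A real pipeline would use an embedding model and a vector store.
import math
import re
from collections import Counter


def chunk_words(text: str, size: int = 12, overlap: int = 4) -> list[str]:
    """Split text into chunks of `size` words that share `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Stand-in 'embedding': lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


if __name__ == "__main__":
    doc = ("Quantization stores weights in fewer bits to cut memory use. "
           "Retrieval finds passages related to a query using vector search. "
           "Agents coordinate through shared state to split up a task.")
    chunks = chunk_words(doc, size=10, overlap=2)
    for hit in retrieve("How does vector search find passages?", chunks, k=1):
        print(hit)
```

The retrieved chunks would then be placed into the SLM prompt as context. Chunk size and overlap are the main knobs: smaller chunks fit a short context window more easily but can split an answer across chunk boundaries.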

Module 4

  • Agentic AI design patterns: tool use, role specialization, shared state, and inter-agent communication
  • Hands-on: Build a multi-agent pipeline for a scenario such as automated research summarization
  • Comparison: agents built on an LLM vs. a fine-tuned SLM
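
Role specialization with shared state can be illustrated with plain functions that read from and write to one dictionary. This is a minimal sketch, not the workshop's implementation: `fake_slm` is a hypothetical placeholder for a real model call, and the agent names and state keys are assumptions chosen for the research-summarization scenario.

```python
# Minimal multi-agent pipeline: each "agent" is a function that reads and
# updates one shared state dict. `fake_slm` is a hypothetical stand-in for
# a real model call so the sketch runs without any model download.
from typing import Callable

State = dict
Agent = Callable[[State], State]


def fake_slm(role: str, prompt: str) -> str:
    """Placeholder for an SLM call; echoes the role and a truncated prompt."""
    return f"[{role}] {prompt[:60]}"


def researcher(state: State) -> State:
    """Role 1: gather notes on the topic and store them in shared state."""
    state["notes"] = fake_slm("researcher", f"Collect key points on: {state['topic']}")
    return state


def summarizer(state: State) -> State:
    """Role 2: condense the researcher's notes into a summary."""
    state["summary"] = fake_slm("summarizer", f"Summarize: {state['notes']}")
    return state


def run_pipeline(agents: list[Agent], state: State) -> State:
    """Run agents in order, passing the same shared state through each."""
    for agent in agents:
        state = agent(state)
        state.setdefault("trace", []).append(agent.__name__)
    return state


if __name__ == "__main__":
    result = run_pipeline([researcher, summarizer], {"topic": "small language models"})
    print(result["trace"])
    print(result["summary"])
```

Because every agent works on the same dictionary, nothing is copied between steps, which keeps memory overhead low. The `trace` list records which agent ran at each step.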

Module 5

  • Agentic RAG architecture: combining retrieval grounding with multi-step agentic reasoning
  • Use case: Intelligent Document Assistant, a system that accepts user queries, retrieves relevant passages, reasons over them using specialized agents, and returns structured, cited answers
  • Hands-on: Assemble the full pipeline end-to-end (RAG retrieval agent + reasoning agent + response synthesis agent) using fine-tuned SLMs
  • Hands-on: Add adaptive routing, using query intent classification to dynamically select a retrieval strategy or direct generation
  • Hands-on: Evaluate and compare Agentic RAG with an LLM vs. a fine-tuned SLM
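
Adaptive routing can be sketched as a small classifier in front of two handlers. The version below is a deliberately simple assumption: it uses a keyword list instead of an SLM-based intent classifier, and the cue words, route names, and handler strings are all illustrative.

```python
# Keyword-based intent routing: decide whether a query needs retrieval or can
# be answered by direct generation. The cue list and route names are
# illustrative assumptions; a real system might use an SLM classifier instead.
import re

RETRIEVAL_CUES = {"according", "document", "paper", "manual", "section", "cite", "source"}


def classify_intent(query: str) -> str:
    """Return 'retrieve' if the query refers to document content, else 'direct'."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    return "retrieve" if tokens & RETRIEVAL_CUES else "direct"


def route(query: str) -> str:
    """Dispatch the query to the handler chosen by the intent classifier."""
    handlers = {
        "retrieve": lambda q: f"RAG path: retrieve passages, then answer '{q}'",
        "direct": lambda q: f"Direct path: generate answer for '{q}'",
    }
    return handlers[classify_intent(query)](query)


if __name__ == "__main__":
    for q in ["What does the paper say about latency?", "Write a haiku about autumn"]:
        print(route(q))
```

Skipping retrieval for queries that do not need it saves an embedding lookup and shortens the prompt, which matters under constrained compute.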

Module 6

  • Introduction to Edge AI and TinyML concepts
  • Hands-on: Simulate edge deployment in a CPU-only environment
  • Hands-on: Export the fine-tuned SLM from Module 2 and serve it locally via llama.cpp
  • Measuring latency, throughput, and efficiency
  • Exporting and adapting models for edge scenarios
  • TinyML outlook: deploying to microcontrollers and mobile devices with frameworks like ONNX Runtime Mobile and TensorFlow Lite
  • Discussion: privacy, real-time inference, and deployment trade-offs
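
Latency and throughput measurement can be sketched as a timing wrapper around any generation function. This is a minimal example under stated assumptions: `fake_generate` is a hypothetical stand-in that sleeps instead of running a model, and `profile` expects a callable that takes a prompt and returns a list of tokens.

```python
# Latency and throughput profiling around any text-generation callable.
# `fake_generate` is a hypothetical stand-in; in practice you would wrap your
# own model call with the same (prompt) -> list-of-tokens shape.
import statistics
import time
from typing import Callable


def fake_generate(prompt: str) -> list[str]:
    """Pretend to generate tokens; sleeps briefly to simulate compute."""
    time.sleep(0.01)
    return prompt.split() * 2


def profile(generate: Callable[[str], list[str]], prompt: str, runs: int = 5) -> dict:
    """Time repeated calls and report median latency and tokens per second."""
    latencies, tokens = [], 0
    for _ in range(runs):
        start = time.perf_counter()
        output = generate(prompt)
        latencies.append(time.perf_counter() - start)
        tokens += len(output)
    total = sum(latencies)
    return {
        "median_latency_s": statistics.median(latencies),
        "max_latency_s": max(latencies),
        "tokens_per_s": tokens / total if total > 0 else float("inf"),
    }


if __name__ == "__main__":
    stats = profile(fake_generate, "edge inference on cpu")
    for key, value in stats.items():
        print(f"{key}: {value:.4f}")
```

Reporting the median alongside the maximum helps separate typical latency from one-off stalls such as a cold first call.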

Instructor


Workshop Details