A Comprehensive Guide to Vision Language Models

About

This talk comprehensively introduces Vision-Language Models (VLMs), their importance, and their wide range of applications. It delves into the technical aspects of pre-training VLMs, covering both common techniques and recent advancements. Attendees will gain hands-on experience through live demonstrations using open-source VLMs and minimal reproducible Colab notebooks. The talk will also provide a step-by-step guide to fine-tuning PaliGemma, Google's latest VLM, for specific tasks.

Key Takeaways:

  • In-depth Understanding of VLMs: Participants will learn the fundamentals of VLMs, their significance, and their diverse use cases across domains.
  • Technical Know-how: The talk will equip attendees with knowledge of pre-training techniques, spanning both established methods and cutting-edge research directions.
  • Practical Skills: Through live code demonstrations using open-source VLMs and Colab notebooks, participants will gain hands-on experience working with these models effectively (a minimal inference sketch follows this list).
  • Fine-tuning Expertise: The talk will provide a detailed walkthrough of fine-tuning PaliGemma, Google's latest VLM, enabling attendees to adapt the model to their own tasks (see the fine-tuning sketch below).
  • Combination of Theory and Practice: The session will balance conceptual depth with hands-on technique, ensuring participants grasp both the theoretical underpinnings and the real-world applications of VLMs.
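
As a taste of the hands-on portion, the following is a minimal sketch of loading an open-source VLM and generating a caption. It assumes the Hugging Face transformers integration of PaliGemma and a hypothetical local image file example.jpg; treat it as an illustration of the workflow, not the talk's exact demo code.

```python
# Minimal sketch: zero-shot captioning with an open-source VLM.
# Assumes the Hugging Face `transformers` PaliGemma integration and a
# hypothetical local image file `example.jpg`.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

image = Image.open("example.jpg")
prompt = "caption en"  # PaliGemma is prompted with short task prefixes

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=30)

# Strip the prompt tokens and decode only the newly generated caption.
caption = processor.decode(
    generated[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
print(caption)
```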

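Likewise, the fine-tuning walkthrough can be previewed with a compact sketch of one plausible recipe: parameter-efficient LoRA tuning via the peft library. The toy in-memory dataset, target modules, and hyperparameters below are illustrative assumptions, not the talk's actual material.

```python
# Sketch: LoRA fine-tuning of PaliGemma on a captioning task.
# Assumes Hugging Face `transformers` + `peft`; the toy dataset and all
# hyperparameters are illustrative placeholders.
import torch
from PIL import Image
from peft import LoraConfig, get_peft_model
from transformers import (AutoProcessor, PaliGemmaForConditionalGeneration,
                          Trainer, TrainingArguments)

model_id = "google/paligemma-3b-pt-224"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

# Train only low-rank adapters on the attention projections, leaving
# the base weights frozen.
model = get_peft_model(model, LoraConfig(
    r=8,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
))

# Toy stand-in dataset: (image, task prompt, target caption) records.
train_data = [{"image": Image.new("RGB", (224, 224), "gray"),
               "prompt": "caption en",
               "caption": "a plain gray square"}] * 16

def collate(examples):
    # The processor's `suffix` argument builds labels from the target text.
    return processor(text=[e["prompt"] for e in examples],
                     images=[e["image"] for e in examples],
                     suffix=[e["caption"] for e in examples],
                     return_tensors="pt", padding="longest")

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="paligemma-ft",
                           per_device_train_batch_size=4,
                           num_train_epochs=1, learning_rate=2e-5,
                           bf16=True,  # assumes bfloat16-capable hardware
                           remove_unused_columns=False),
    train_dataset=train_data,
    data_collator=collate,
)
trainer.train()
```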