Multimodal Large Language Models (MLLMs) have lately become the talk of the AI universe, dynamically reshaping how AI systems understand and interact with our complex, multi-sensory world. These multi-sensory inputs are what we call modalities (text, images, audio, etc.). From Google’s latest Veo 3 generating state-of-the-art videos to ElevenLabs creating incredibly realistic AI voiceovers, these systems are demonstrating capabilities that were once considered science fiction.
This comprehensive guide is the first part of a two-part series exploring the intricate world of multimodal LLMs. The second part of this series will explore how these models understand audio-based multimodal content and their practical applications across various industries.
Multimodality is definitely one of the greatest capabilities and advancements in AI models. However, when we deal with several modalities, certain challenges arise that need to be addressed. Here are two major challenges we face in this regard:
Let’s understand this with a small example:
Since we need to represent the concept “cat” as similarly as possible whether it appears as text, an image, or speech, we should also make sure that unrelated concepts like “dog” stay well away from “cat” in that space. In other words, the embeddings from the various modalities need to be correctly aligned across a shared embedding space.
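To make this concrete, here is a tiny, purely illustrative PyTorch sketch. The vectors are hand-made toy values, not real model outputs; the point is only that, in a well-aligned shared space, “cat” embeddings from different modalities score high on cosine similarity while “dog” scores lower.

```python
import torch
import torch.nn.functional as F

# Toy embeddings in a shared 4-dimensional space (values made up for illustration).
cat_text  = torch.tensor([0.9, 0.1, 0.0, 0.2])
cat_image = torch.tensor([0.8, 0.2, 0.1, 0.1])
dog_text  = torch.tensor([0.1, 0.9, 0.3, 0.0])

# In a well-aligned space, the same concept across modalities scores high,
# while different concepts score low.
print(F.cosine_similarity(cat_text, cat_image, dim=0))  # high: "cat" text vs "cat" image
print(F.cosine_similarity(cat_text, dog_text, dim=0))   # lower: "cat" vs "dog"
```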
The first problem, how to represent information, can be solved by representation learning. There are two types of representation learning through which multimodal models can understand multimodal information: Joint Representation and Coordinated Representation.
Joint representation can be defined as a single unified representation of different types of information, whether text, image, video, audio, etc. We combine the embeddings of each modality into a single shared embedding space.
In this approach, we pass each modality through its respective encoder: text goes through a text encoder (e.g. BERT), images through an image encoder (e.g. ViT), and likewise for the other modalities.
This gives us an embedding for each modality. These embeddings are then merged, typically by concatenation, after which a projection layer or multimodal attention mechanism assigns importance to particular features. The resulting joint embedding contains the combined semantics of all the input modalities.
This entire system is trained end to end: the individual modality encoders, the fusion mechanism, and the final task-specific layers are all optimized together using a single loss function. This unified training setup allows the model to learn cross-modal correlations more effectively, especially when the modalities are strongly interdependent (e.g. an image and its caption, as in the COCO dataset).
These joint embeddings are particularly useful when the input modalities are closely aligned or when the available training data is limited, as shared representations help in regularizing the learning process and extracting richer, semantically meaningful features from the combined input.
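Here is a minimal PyTorch sketch of the joint-representation idea described above. The linear layers are placeholders standing in for real encoders such as BERT or ViT, and the dimensions, names, and classification head are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class JointFusionModel(nn.Module):
    """Sketch of joint representation: encode each modality, concatenate the
    embeddings, and project them into one shared (joint) embedding space."""
    def __init__(self, text_dim=768, image_dim=1024, joint_dim=512, num_classes=10):
        super().__init__()
        self.text_encoder = nn.Linear(text_dim, 256)    # placeholder for a text encoder (e.g. BERT)
        self.image_encoder = nn.Linear(image_dim, 256)  # placeholder for an image encoder (e.g. ViT)
        self.projection = nn.Sequential(                # fuses the concatenated features
            nn.Linear(256 + 256, joint_dim),
            nn.ReLU(),
        )
        self.classifier = nn.Linear(joint_dim, num_classes)

    def forward(self, text_feats, image_feats):
        t = self.text_encoder(text_feats)
        v = self.image_encoder(image_feats)
        joint = self.projection(torch.cat([t, v], dim=-1))  # single joint embedding
        return self.classifier(joint)

# Everything (encoders, fusion, task head) is optimized with one loss, as described above.
model = JointFusionModel()
logits = model(torch.randn(4, 768), torch.randn(4, 1024))
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (4,)))
loss.backward()
```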
Read more about the Evolution of Embeddings.
Coordinated Representation learning, on the other hand, takes a completely different approach. Here, we learn independent representations first and then coordinate (or align) them in the fusion stage. In this approach, each modality (text, image, audio, etc.) is handled by its own dedicated model, which is trained separately and may have its own loss function and objective.
Once these models are trained, their individual output embeddings are combined using a coordinated fusion mechanism like late fusion (simple concatenation), cross-modal attention, or statistical alignment methods such as Canonical Correlation Analysis (CCA). The coordination phase focuses on ensuring that the separate embeddings are semantically aligned with each other so that they can jointly contribute to the final prediction. Unlike joint embeddings, coordinated embeddings allow each modality to preserve its own feature structure without being forced into a shared representation space prematurely.
This method is highly effective when modalities are somewhat independent or loosely coupled, when there is abundant modality-specific data, or when computational resources allow for more extensive pre-training. Coordinated embeddings also offer greater flexibility in model architecture and training pipelines, as each modality can be improved independently before coordination.
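A minimal sketch of the coordinated approach, assuming two separately pre-trained encoders (the linear layers below are stand-ins) whose outputs are aligned afterwards with a simple contrastive-style objective; every dimension and name here is illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Each modality keeps its own (separately pre-trained) encoder; a lightweight
# coordination step then aligns the two embedding spaces.
text_encoder  = nn.Linear(768, 512)    # stand-in for a pre-trained text model
image_encoder = nn.Linear(1024, 512)   # stand-in for a pre-trained image model

text_emb  = F.normalize(text_encoder(torch.randn(8, 768)), dim=-1)
image_emb = F.normalize(image_encoder(torch.randn(8, 1024)), dim=-1)

# Coordination: pull matched text-image pairs together, push mismatched pairs apart.
similarity = text_emb @ image_emb.T      # (8, 8) similarity matrix
targets = torch.arange(8)                # the i-th text matches the i-th image
align_loss = F.cross_entropy(similarity, targets)
```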
Let’s try to tabulate our understanding here:
| Feature | Explicit Alignment | Implicit Alignment |
| --- | --- | --- |
| Nature | Supervised / Annotated | Unsupervised / Learned during training |
| Need for Labels | Requires aligned or annotated data | Does not require explicit alignments |
| Approach | Manual or rule-based mapping | Learned via attention or contrastive loss |
| Example Tasks | Image captioning with bounding boxes | CLIP, VQA with unsupervised attention |
| Advantages | High precision, interpretable | Scalable, flexible, learns fine-grained links |
| Challenges | Expensive to label, less flexible | Can be less interpretable, data-hungry |
Next, we will look at another important term used in the section above: fusion.
If you want to understand how implicit alignment can be done, read this. In this research paper, the model embeds fragments of images (objects in the image) and fragments of sentences (typed dependency tree relations) into a common space.
Let’s dive a little deeper into this.
The cornerstone of multimodal learning lies in understanding how different types of data can be combined effectively. In other words, it serves as a way to accurately align our different modalities across a unified dimensional space. Fusion strategies determine when and how information from different modalities is integrated, fundamentally shaping the model’s ability to understand complex multimodal inputs.
Fusion refers to the integration of information from multiple modalities, such as text, image, and audio, into a unified representation. It plays a critical role in enabling models to leverage complementary information from each modality. The goal is to combine features in such a way that the model can make more informed predictions, much like feature fusion elsewhere in deep learning.
There are two widely used strategies for fusion: Early Fusion and Late Fusion.
There is also a third category, mid fusion, which I will explain shortly.
Early Fusion is the simplest approach to multimodal integration: the raw data from different modalities is combined at the input level itself, before any processing occurs. In early fusion systems, data from various sources, such as pixel values from images and tokenized text, is concatenated or combined through simple operations at the very beginning of the processing pipeline. This approach allows for comprehensive interaction between modalities from the earliest stages of computation, enabling the model to capture subtle correlations and dependencies that might be lost in later-stage fusion approaches.
Example: Earlier attempts might involve flattening an image and concatenating it with text embeddings before feeding the result into a neural network. This is less common in modern, sophisticated multimodal LLMs due to its limitations.
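A minimal sketch of early fusion, assuming a tiny image and an already-embedded sentence (all shapes are illustrative):

```python
import torch
import torch.nn as nn

# Early fusion: raw-ish inputs from both modalities are combined at the input
# and processed by a single network.
image = torch.randn(1, 3, 32, 32)        # a small RGB image
text_embedding = torch.randn(1, 128)     # an already-embedded sentence

fused_input = torch.cat([image.flatten(1), text_embedding], dim=-1)  # (1, 3*32*32 + 128)

model = nn.Sequential(
    nn.Linear(3 * 32 * 32 + 128, 256),
    nn.ReLU(),
    nn.Linear(256, 10),                  # e.g. a 10-way classification head
)
logits = model(fused_input)
```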
Late Fusion takes the opposite approach, processing each modality independently through specialized networks before combining the results at the decision level. Here, separate neural networks process each data type using architectures optimized for that specific modality, such as convolutional neural networks or Vision Transformers (ViTs) for images and transformer architectures for text. The outputs from these specialized processors are then combined using techniques such as weighted averaging, concatenation, or more sophisticated fusion modules.
Example: An image classifier identifies objects in an image, and a text classifier analyzes a caption. A separate module then combines/fuses these classifications to say if the caption accurately describes the image.
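A minimal sketch of late fusion with illustrative shapes, combining the two classifiers’ predictions at the decision level with a simple weighted average (the weights are arbitrary):

```python
import torch
import torch.nn as nn

# Late fusion: each modality has its own specialised model; only their outputs
# (here, class probabilities) are combined.
image_classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
text_classifier  = nn.Linear(128, 10)

image_probs = image_classifier(torch.randn(1, 3, 32, 32)).softmax(dim=-1)
text_probs  = text_classifier(torch.randn(1, 128)).softmax(dim=-1)

# Decision-level combination, e.g. a weighted average of the two predictions.
fused_probs = 0.6 * image_probs + 0.4 * text_probs
prediction = fused_probs.argmax(dim=-1)
```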
Mid Fusion, or intermediate fusion, strikes a balance between the early and late approaches by integrating multimodal information at intermediate layers of the network. This strategy enables the model to capture both low-level cross-modal interactions and high-level semantic relationships. Mid-fusion architectures often employ attention mechanisms or specialized transfer modules that allow information to flow between modality-specific processing streams at multiple points throughout the network. The Multimodal Transfer Module (MMTM) takes this approach, using squeeze-and-excitation operations to recalibrate channel-wise features in each CNN stream based on information from the other modalities.
Example: Most modern vision-language models (like LLaVA) use this. An image encoder processes the image into a set of feature vectors, and a text encoder processes the text into token embeddings. These are then projected and combined in a way that allows a central LLM to attend to both.
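A minimal sketch of mid fusion using cross-attention (one common choice, not the only one); the patch and token counts are assumptions:

```python
import torch
import torch.nn as nn

# Mid fusion: modality-specific features interact part-way through the network,
# here via cross-attention, rather than at the input or at the final decision.
image_feats = torch.randn(1, 196, 512)   # e.g. 14x14 ViT patch features
text_feats  = torch.randn(1, 12, 512)    # e.g. 12 text token embeddings

cross_attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

# Text tokens query the image features, letting language attend to vision
# at an intermediate layer of the model.
fused_text, _ = cross_attn(query=text_feats, key=image_feats, value=image_feats)
```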
Let’s now get a high-level understanding of some widely used encoders in vision-language models (VLMs).
If you would like to learn more about various Large Vision Language model architectures click here.
CLIP represents a foundational breakthrough in multimodal learning, introducing a simple yet powerful approach to learning joint representations of images and text through contrastive pre-training. The architecture consists of two separate encoders: a vision encoder that processes images and a text encoder that processes natural language descriptions. These encoders are trained jointly using a contrastive objective that encourages the model to associate images with their corresponding textual descriptions while distinguishing them from unrelated text-image pairs.
The training process for CLIP involves presenting the model with batches of n image-caption pairs (for the sake of the illustration above, say n = 5), where each image is paired with its correct textual description. The model computes embeddings for all images and texts in the batch, creating two sets of n embedding vectors.
The contrastive loss function encourages high similarity between correct image-text pairs while penalizing high similarity between incorrect pairs. As we can see in the image above, the diagonal entries of the similarity matrix are maximized while the rest are penalized. Mathematically, this is expressed as a symmetric cross-entropy loss over the similarity scores, where a temperature parameter controls the sharpness of the distribution.
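A simplified sketch of that symmetric loss, in the spirit of CLIP’s objective rather than its exact implementation; the embeddings below are random stand-ins and the temperature is fixed instead of learned:

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature   # (n, n) similarity scores
    targets = torch.arange(len(logits))             # the diagonal holds the correct pairs
    loss_i = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t = F.cross_entropy(logits.T, targets)     # text -> image direction
    return (loss_i + loss_t) / 2

# n = 5 image-caption pairs, as in the example above (random stand-in embeddings).
loss = clip_style_loss(torch.randn(5, 512), torch.randn(5, 512))
```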
CLIP’s effectiveness comes from its ability to learn from naturally occurring image-text pairs found on the internet (around 400 million pairs scraped from the web), eliminating the need for manually annotated datasets. This approach enables the model to learn rich semantic relationships that generalize well to downstream tasks. The learned representations demonstrate remarkable zero-shot capabilities, allowing the model to perform image classification and retrieval on categories it has never seen during training. The success of CLIP has inspired numerous follow-up works and established contrastive pre-training as a dominant methodology in multimodal learning.
Also, do consider reading about ViT here.
SigLIP represents an evolution of the CLIP architecture that addresses some of the computational limitations of the original contrastive approach. While CLIP requires computing similarities between all pairs of images and texts in a batch, SigLIP employs a pairwise sigmoid loss that operates on individual image-text pairs independently. This modification eliminates the need for a global view of all pairwise similarities within a batch, enabling more efficient scaling to larger batch sizes while maintaining or improving performance.
The sigmoid loss function used in SigLIP offers several advantages over the traditional contrastive loss. It provides a more stable training mechanism and better performance with smaller batch sizes, making the approach more accessible with limited computational resources. The pairwise nature of the loss enables more flexible training configurations and better handling of datasets with varying numbers of positive examples per sample.
SigLIP’s architecture maintains the dual-encoder structure of CLIP but incorporates architectural improvements and training optimizations that enhance both efficiency and effectiveness. The model uses separate image and text encoders to generate representations for both modalities, with the sigmoid loss encouraging similarity between matched pairs and dissimilarity between unmatched pairs. This approach has demonstrated superior performance across various image-text tasks while offering improved computational efficiency compared to traditional contrastive methods.
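A simplified sketch of the pairwise sigmoid objective, in the spirit of SigLIP; the temperature and bias below are illustrative constants, whereas the paper treats them as learned parameters:

```python
import torch
import torch.nn.functional as F

def siglip_style_loss(image_emb, text_emb, temperature=10.0, bias=-10.0):
    """Every image-text pair is scored independently as match (+1) or
    non-match (-1), so no batch-wide softmax normalisation is needed."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T * temperature + bias
    labels = 2 * torch.eye(len(logits)) - 1      # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).mean()

loss = siglip_style_loss(torch.randn(8, 512), torch.randn(8, 512))
```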
Although RoPE is not an encoder model, it is a positional embedding strategy widely used in large language models.
Rotary Position Embedding (RoPE) represents a sophisticated approach to encoding positional information in transformer-based architectures. RoPE encodes the absolute positional information using rotation matrices while naturally including the explicit relative position dependencies in self-attention formulations. This approach provides valuable properties including flexibility to expand to any sequence length, decaying inter-token dependency with increasing relative distances, and the capability to equip linear self-attention with relative position encoding.
The mathematical foundation of RoPE involves applying rotation matrices to embedding vectors based on their positions in the sequence. This rotation-based approach ensures that the dot product between embeddings captures both content similarity and relative positional relationships. The decay property of RoPE means that tokens that are farther apart in the sequence have naturally reduced attention weights, which aligns well with many natural language and multimodal tasks where local context is typically more important than distant context.
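A minimal sketch of that rotation, assuming the interleaved-pair formulation; real implementations differ in layout details and caching:

```python
import torch

def rotary_embedding(x, base=10000):
    """Rotate each pair of feature dimensions by an angle that depends on the
    token's position. x has shape (seq_len, dim) with an even dim."""
    seq_len, dim = x.shape
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)          # (seq, 1)
    freqs = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim)) # (dim/2,)
    angles = positions * freqs                                                   # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                              # split into pairs
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)    # 2D rotation per pair
    return rotated.flatten(1)

# Queries and keys are both rotated this way; their dot product then depends
# on content plus the *relative* position between tokens.
q = rotary_embedding(torch.randn(16, 64))
k = rotary_embedding(torch.randn(16, 64))
```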
In multimodal applications, RoPE enables models to handle variable-length sequences more effectively, which is crucial when processing multimodal data where different modalities may have different temporal or spatial characteristics. The ability to extrapolate to longer sequences than those seen during training makes RoPE particularly valuable for multimodal models that need to handle diverse input formats and lengths.
Now, let’s see how these concepts and components come together in some open-sourced influential multimodal LLMs, particularly focusing on how they “see.”
LLaVA’s core idea is to demonstrate that a remarkably simple architecture can achieve impressive visual reasoning capabilities by efficiently connecting a pre-trained vision encoder (from CLIP) to a pre-trained Large Language Model (Vicuna) using a single, trainable linear projection layer. It leverages the strong existing capabilities of these unimodal models for multimodal understanding.
LLaVA utilizes pre-trained Vicuna LLM and CLIP vision encoder components. The training is a 2-stage instruction-tuning procedure:
Stage 1: Visual Feature Alignment (Pre-training)
Stage 2: Instruction Fine-tuning (End-to-End)
The LLaVA model processes inputs which can be text, an image, or a combination. Here’s how it works:
LLaVA looks at an image and turns it into visual features using the CLIP vision encoder. A special translator (the projection layer) maps these features into the embedding space the Vicuna LLM understands. The Vicuna brain then reads both the projected image tokens and any actual text (like your question). Finally, it uses all this information to give you an answer in text.
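A rough sketch of that flow, with small placeholder tensors and modules; in the actual model the vision encoder is CLIP’s ViT, the LLM is Vicuna, and the patch count and hidden sizes differ:

```python
import torch
import torch.nn as nn

vision_features = torch.randn(1, 256, 1024)   # stand-in for CLIP patch features of one image
projection = nn.Linear(1024, 4096)            # the trainable "translator" to the LLM's embedding size
text_embeddings = torch.randn(1, 20, 4096)    # stand-in for the LLM's embeddings of the user prompt

image_tokens = projection(vision_features)                      # "translated" visual tokens
llm_input = torch.cat([image_tokens, text_embeddings], dim=1)   # visual tokens prepended to the prompt
# llm_input is then fed to the LLM, which generates the textual answer.
```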
While not a traditional encoder-decoder in the sequence-to-sequence translation sense, LLaVA uses components that serve these roles:
Llama 3 Vision aims to build state-of-the-art open-source multimodal models by integrating a powerful vision encoder with the strong base of Llama 3 LLMs. The core idea is to leverage Meta’s advancements in LLMs, vision models, and large-scale training methodologies to create models that can perform complex visual reasoning, understand nuanced visual details, and follow intricate instructions involving images and text.
Llama 3 Vision models leverage pre-trained Llama 3 LLMs and powerful pre-trained vision encoders (e.g., CLIP ViT). The training strategy typically involves:
Stage 1: Large-Scale Multimodal Pre-training
Stage 2: Instruction Fine-tuning (End-to-End)
Llama 3 Vision processes image and text inputs to generate textual outputs.
Llama 3 Vision uses a high-resolution ViT variant to look at an image, breaking it down into many detailed “picture words” (patch features). A projector makes these image features ready for the super-smart Llama 3 LLM. The Llama 3 brain reads them along with any text questions you ask. Because the Llama 3 brain is so large and well trained, it can understand complex things in the picture and give you detailed, intelligent answers in text.
Similar to LLaVA, it’s a vision encoder + projector + LLM architecture:
While specific, verified details for Llama 4 are still emerging, discussions around its advancements often center on tackling the inherent challenges of large-scale multimodal learning, particularly through architectural innovations like Mixture-of-Experts (MoE).
A key conceptual advancement for Llama 4 is the effective implementation of MoE. This architecture significantly reduces computational cost by activating only a relevant subset of experts for each token. This allows model capacity to grow while keeping the computational load of training and inference manageable.
Such efficiency is crucial for handling increasingly large, high-resolution multimodal datasets and long sequence lengths, which would otherwise be bottlenecked by the cost of running every token through a fully dense model. It also enables broader scalability, allowing the model to learn from more extensive and diverse data.
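A minimal sketch of the routing idea behind a sparse MoE layer; the expert count, top-k value, and sizes below are illustrative and not Llama 4’s actual configuration:

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """A router picks the top-k experts for each token, so only a small
    fraction of the layer's parameters is active per token."""
    def __init__(self, dim=512, num_experts=8, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.k = k

    def forward(self, x):                              # x: (num_tokens, dim)
        scores = self.router(x)                        # (num_tokens, num_experts)
        top_vals, top_idx = scores.topk(self.k, dim=-1)
        weights = top_vals.softmax(dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # send each token to its chosen experts
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

tokens = torch.randn(16, 512)                          # 16 tokens (text and/or image)
output = SparseMoE()(tokens)
```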
With the capacity afforded by MoE and advancements in training strategies, Llama 4 would aim for a more sophisticated alignment of diverse modalities like images and text. This involves developing more robust representations that can capture modality-specific characteristics (e.g., spatial correlations in vision, semantic rules in text) while enabling deeper cross-modal understanding and interaction.
The Llama 4 architecture also mentions the use of an early-fusion mechanism to align the embeddings into a unified representation space. While not its primary purpose, the increased capacity and specialization within an MoE framework could indirectly help handle statistical and even temporal discrepancies between modalities if trained on appropriate data.
Models like Llama 4 are expected to incorporate more advanced strategies to address inherited biases and improve overall robustness. Llama 4 would aim to:
The evolution of multimodal LLMs represents one of the most significant advances in artificial intelligence, fundamentally changing how machines perceive and interact with the world around us. From the foundational concepts of early and late fusion to the sophisticated architectures of modern systems like Llama 4, we have traced the technical journey that has enabled AI systems to understand and process multiple modalities with human-like sophistication. The technical foundations we explored including contrastive learning principles, joint embedding spaces, and alignment mechanisms provide the theoretical framework that makes multimodal understanding possible.
Our case studies of LLaVA, Llama 3.2 Vision, and Llama 4 illustrate the rapid progression of multimodal capabilities. LLaVA demonstrated that elegant simplicity could achieve remarkable results through visual instruction tuning. Llama 3.2 Vision showed how sophisticated cross-attention mechanisms could enable robust multimodal reasoning. Llama 4 represents the current state-of-the-art, introducing mixture-of-experts architectures and unprecedented context lengths that open entirely new categories of applications. In the second part of this series, we will explore how these Multimodal LLMs are able to understand audio.