All You Need to Know About Foundation Models

Aravindpai Pai 03 Jul, 2023 • 9 min read

Have you ever used groundbreaking technologies such as ChatGPT and Midjourney in your professional or personal work? I am sure you have. These technologies have taken the world by storm and have become part of our lives. Wondering how they work and what makes them so effective? The answer lies in the power of Foundation Models.

Welcome to the world of Foundation Models – the backbone of modern artificial intelligence! From ChatGPT to Midjourney, these powerful models have revolutionized the way machines understand and process information. But what exactly are Foundation Models, and how do they work?

Foundation models are the driving force behind the next generation of intelligent machines, empowering them to see, hear, and think in ways that were once only possible for humans

– Demis Hassabis, CEO of DeepMind

In this blog post, we’ll learn about Foundation models in detail. We will discuss the different types of Foundation models, understand how they are trained, and finally look at how we can use them on downstream tasks. Whether you’re an AI researcher or just getting started with Generative AI, this post has the information you need to take your understanding of Foundation Models to the next level. So buckle up, and get ready to embark on a fascinating journey into the heart of artificial intelligence!


History of Foundation Models

Most of us have worked with pretrained models to solve a variety of tasks related to text and images. But do you know why we use pretrained models in the first place? The reason is their ability to generalize well to other tasks. Traditionally, these pretrained models were trained on huge labeled datasets, which prevented them from using the much larger volumes of unlabeled data available. This led to research into models that can learn from unlabeled datasets, i.e. Foundation models.

The term Foundation Model was coined in 2021 by a team at Stanford, the Center for Research on Foundation Models. Before that, most of us knew these models as Large Language Models, a.k.a. LLMs.

The Transformer, introduced in 2017, was the first breakthrough in the domain of Natural Language Processing. It is a neural network architecture that relies on attention mechanisms. Models built on it were trained on massive datasets and were observed to generalize well to other tasks, even when adapted with limited data.

This opened up research into LLMs and their capabilities. In 2018, 2 more popular LLMs were introduced by Google and OpenAI: BERT and GPT.

Next, an intriguing research question emerged about scaling up transformer models: does performance improve more from increasing the model’s size and complexity, or from increasing the amount of training data?

In 2019, OpenAI released GPT-2, an LLM with 1.5 billion parameters. In 2020 came GPT-3, which scaled GPT-2 up by roughly 116x to 175 billion parameters. That’s huge!

But the concern with these LLMs is that they can produce harmful outputs as well. There must be a way to control the outputs an LLM generates. This led to work on aligning language models with instructions. Finally, instruction-aligned models like InstructGPT and ChatGPT, and applications built on top of such models like AutoGPT, have been a sensation worldwide.


What are Foundation Models?

Foundation models are AI models trained on huge amounts of unlabeled data that can then be used to solve multiple downstream tasks.


For example, Foundation models trained on the text data can be used to solve problems related to text like Question Answering, Named Entity Recognition, Information Extraction, etc. Similarly, Foundation models trained on images can solve problems related to images like image captioning, object recognition, image search, etc.

They are not limited to text and images. They can be trained on other types of data as well, such as audio, video, and 3D signals.

Foundation models lay a strong base for solving other tasks, which is why the Stanford team introduced the term “Foundation Models”. The best thing about them is that they can be trained without depending on a labeled dataset.

But why do we need Foundation Models in the first place? Let’s figure it out.


Why Foundation Models?

There are 3 primary reasons why we need Foundation models.

All in One

Foundation models are extremely powerful. They remove the need to train a different model for each task, which would otherwise have been the case. Now, a single model can serve many problems.

Easy to Train

Foundation models are easy to train because there is no dependency on labeled data, and little effort is required to adapt them to our specific task.


Without Foundation models, we would need hundreds of thousands of labeled data points to build a high-performance model for our task. With Foundation models, just a couple of examples are enough to adapt the model to our task. We will discuss how to use Foundation models for our tasks in detail in a while.

High Performance

Foundation models help us build very high-performance models for our tasks. The state-of-the-art (SOTA) architectures for various tasks in Natural Language Processing and Computer Vision are built on top of Foundation models.

What are the Different Foundation Models?

Foundation models are classified into different types based on the domain they are trained on. Broadly, they can be classified into 2 types.

  1. Foundation Models for Natural Language Processing
  2. Foundation Models for Computer Vision

Foundation Models for Natural Language Processing

Large Language Models (LLMs) are the Foundation models for Natural Language Processing. They are trained on massive amounts of text to learn the patterns and relationships present in textual data. The ultimate goal of LLMs is to learn how to represent text data accurately.

The powerful AI technologies in today’s world rely on LLMs. For example, ChatGPT uses GPT-3.5 as its Foundation model, and AutoGPT, a recent AI experiment, is based on GPT-4.

Here is a list of Foundation models for NLP: BERT, RoBERTa, and the GPT family (GPT, GPT-2, GPT-3, GPT-3.5, GPT-4), all built on the Transformer architecture.

We will discuss how these Large Language Models are trained in a later section.

Foundation Models for Computer Vision

Diffusion models are popular examples of Foundation models for computer vision. They have emerged as a powerful new family of deep generative models with state-of-the-art performance in use cases like image synthesis, image search, and video generation. With their imaginative and generative capabilities, they have outperformed autoencoders, variational autoencoders, and GANs.

The most powerful text-to-image models, like DALL·E 2 and Midjourney, use diffusion models under the hood. Diffusion models can also act as Foundation models for NLP and for different multimodal generation tasks, like text-to-video and text-to-image.

Now, we will discuss how to use Foundation models for downstream tasks.

How Can We Use the Foundation Models?

So far, we have seen what Foundation models are and the different types of them. Now, we will see how to use these models on downstream tasks once they are trained.

There are 2 ways to do it: finetuning and prompting.

The fundamental difference between the two is that finetuning changes the model itself, whereas prompting changes the way we use it.


The first technique is to finetune the Foundation model on a custom dataset for our task. For this, we need a dataset consisting of hundreds or thousands of examples along with their target labels.

The Foundation model is trained on a generic domain dataset. By finetuning, we adapt the model parameters to our specific dataset. This reduces the need for labeled data to some extent, but does not remove it.

This is where prompting comes in!
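To make the idea of "adapting the model parameters" concrete, here is a toy sketch. It is not how an LLM is finetuned in practice: the "model" is a hypothetical two-parameter linear function, the pretrained values and the task data are made up, and plain gradient descent stands in for a real training loop. What it does show is the defining property of finetuning: we start from pretrained parameters and end with different ones.

```python
def mse(params, data):
    """Mean squared error of the linear model y = w * x + b on (x, y) pairs."""
    w, b = params["w"], params["b"]
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def finetune(pretrained, data, lr=0.05, steps=200):
    """Return a copy of the parameters after gradient descent on the task data."""
    params = dict(pretrained)  # start from the pretrained values, not from scratch
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (params["w"] * x + params["b"] - y) * x for x, y in data) / n
        grad_b = sum(2 * (params["w"] * x + params["b"] - y) for x, y in data) / n
        params["w"] -= lr * grad_w
        params["b"] -= lr * grad_b
    return params


# Hypothetical "generic" parameters and a small labeled task dataset (y = 2x + 1).
pretrained = {"w": 1.0, "b": 0.0}
task_data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]

tuned = finetune(pretrained, task_data)
print("loss before:", mse(pretrained, task_data))
print("loss after: ", mse(tuned, task_data))
```

Note that the labeled pairs in `task_data` are still required, which is exactly the dependency that finetuning cannot fully remove.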


Prompting allows us to use the Foundation model to solve a particular task with just a handful of examples. It doesn’t involve any model training. All you need to do is prompt the model to solve the task. This is known as in-context learning, because the model learns from the context, i.e. from the set of examples it is given.

Cool, right?

Prompting involves providing cues or conditions to a network through a few examples in order to solve a specific task. When dealing with LLMs, a set of task-related examples is provided, and the model is expected to predict the target by filling in the blank space or the next token in the sentence.
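As a rough illustration, a few-shot prompt is just text: some solved examples followed by one unsolved example that ends where the answer should go. The sketch below only builds that text. The function name, the "Review/Sentiment" layout, and the sample reviews are assumptions made for illustration, and sending the prompt to an actual model is left out.

```python
def build_few_shot_prompt(examples, query, instruction=""):
    """Assemble a few-shot prompt: solved examples followed by an unsolved query."""
    blocks = []
    if instruction:
        blocks.append(instruction)
    for text, label in examples:
        blocks.append(f"Review: {text}\nSentiment: {label}")
    # The last entry leaves the label blank; the model's next tokens fill it in.
    blocks.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(blocks)


examples = [
    ("The battery lasts all day.", "positive"),
    ("The screen cracked within a week.", "negative"),
]
prompt = build_few_shot_prompt(
    examples,
    "Setup was quick and painless.",
    instruction="Classify the sentiment of each review.",
)
print(prompt)
```

No parameters are updated anywhere in this process. The examples live only in the input text, which is why the same model can be pointed at a different task simply by swapping the examples.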

Observing language models learn from examples is a fascinating experience because they are not explicitly trained to do so. Rather, their training revolves around predicting the next token in a given sentence. This seems to be almost magical.


How are Foundation Models Trained?

As discussed, Foundation models are trained on unlabeled datasets in a self-supervised manner.

In self-supervised learning, there is no explicitly labeled dataset. The labels are created automatically from the data itself, and the model is then trained in a supervised manner on those labels. That’s the fundamental difference between supervised learning and self-supervised learning.
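Here is a minimal sketch of what "labels created automatically from the dataset itself" can look like for text. It assumes whitespace tokenization, which is a simplification (real LLMs use subword tokenizers), and the function name is made up for illustration.

```python
def next_token_pairs(text):
    """Turn raw text into (context, target) pairs; every target comes from the text itself."""
    tokens = text.split()  # whitespace tokenization, a deliberate simplification
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


for context, target in next_token_pairs("the cat sat on the mat"):
    print(" ".join(context), "->", target)
```

Nobody annotated anything here. A single unlabeled sentence of six words yields five supervised training examples.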

There are different foundation models in NLP and computer vision but the underlying principle of training these models is similar.  Let’s understand the training process now.

Large Language Models

As discussed, Large Language Models, a.k.a. LLMs, are the Foundation models for NLP. All of them are trained in a similar way but differ in model architecture. The common learning objective is to predict the missing tokens in a sentence. The missing token can be the next token or a token anywhere in the text.

Hence, LLMs can be classified into 2 types based on learning objective: Causal LLM and Masked LLM. For example, GPT is a causal LLM that is trained to predict the next token in the text whereas BERT is a Masked LLM that is trained to predict the missing tokens present anywhere in the text.

[Figure: Causal Language Model vs Masked Language Model]
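The two objectives differ only in which tokens are hidden and which are visible. The sketch below builds one training example of each kind. The function names, the `[MASK]` placeholder string, and the default masking probability are illustrative assumptions, and the tokenizer is again plain whitespace splitting.

```python
import random


def causal_example(tokens, position):
    """Causal objective: see only the tokens before `position`, predict the one at it."""
    return tokens[:position], tokens[position]


def masked_example(tokens, mask_prob=0.15, seed=0, mask_token="[MASK]"):
    """Masked objective: hide random tokens, predict them using context on both sides."""
    rng = random.Random(seed)
    inputs, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            inputs.append(mask_token)
            targets[i] = tok  # the label is recovered from the original text
        else:
            inputs.append(tok)
    return inputs, targets


tokens = "the quick brown fox jumps over the lazy dog".split()
print(causal_example(tokens, 3))
print(masked_example(tokens, mask_prob=0.3, seed=1))
```

In the causal case the model never sees anything to the right of the target. In the masked case it sees the whole sentence except the hidden positions.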

Diffusion Models

Suppose we have an image of a dog. If we apply Gaussian noise to it, we end up with a blurry, unclear image. If we repeat this process and apply Gaussian noise several more times, we end up with pure noise, and the image is unrecognizable.

Is there a way to turn the unrecognizable image back into the actual image? That’s exactly what diffusion models do.

Diffusion models learn to undo this process, converting the noisy image back into the original image.

In simple terms, diffusion models learn to denoise the image. Diffusion models are trained in a 2-step process:

  1. Forward Diffusion
  2. Reverse Diffusion

Forward Diffusion

In the forward diffusion step, the training image is converted into a completely unrecognizable image. This process is fixed and, unlike in Variational Autoencoders (VAEs), does not require a network to be learned. In a VAE, the encoder and decoder are 2 different networks, jointly trained to convert the image into a latent space and back into the original image.

Imagine the process to be similar to ink diffusion in the water. Once the ink diffuses in the water, it’s completely gone. We will not be able to track it.
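The forward step can be sketched in a few lines. This is a simplified version under stated assumptions: the "image" is a flat list of pixel values, the noise strength `beta` is held constant, and each step uses the common update of slightly shrinking the signal and adding fresh Gaussian noise. Real diffusion models typically vary the noise strength across steps.

```python
import math
import random


def forward_diffusion(pixels, steps, beta=0.02, seed=0):
    """Apply `steps` rounds of Gaussian noise to a list of pixel values."""
    rng = random.Random(seed)
    x = list(pixels)
    for _ in range(steps):
        # Shrink the signal a little and add fresh noise; nothing here is learned.
        x = [math.sqrt(1 - beta) * v + math.sqrt(beta) * rng.gauss(0.0, 1.0) for v in x]
    return x


def remaining_signal(steps, beta=0.02):
    """Factor by which the original pixel values have been scaled after `steps` rounds."""
    return math.sqrt(1 - beta) ** steps


image = [0.9, 0.8, -0.7, -0.9]  # a made-up 4-pixel "image"
print(forward_diffusion(image, steps=10))
print(forward_diffusion(image, steps=1000))
print(remaining_signal(10), remaining_signal(1000))
```

Like the ink in water, the share of the original image that survives shrinks with every step, while the function itself has no parameters to train.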

Reverse Diffusion

Here is the interesting step, known as Reverse Diffusion. This is where the actual learning happens.

In reverse diffusion, the unrecognizable image is converted back into the original image. A single network is trained to convert the noise back into the image. Isn’t it interesting?
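To see what "training a network to convert noise back into the image" means in miniature, consider the toy below. It is a heavily simplified stand-in, not a real diffusion model: the "image" is a single value that is either +1 or -1, there is only one noise level, and the "network" is a single weight `w` fitted by gradient descent. All names and values are assumptions for illustration.

```python
import math
import random


def make_pairs(n, noise_level=0.5, seed=0):
    """Create (noisy, clean) training pairs from clean values of +1 or -1."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        clean = rng.choice([-1.0, 1.0])
        noise = rng.gauss(0.0, 1.0)
        noisy = math.sqrt(1 - noise_level) * clean + math.sqrt(noise_level) * noise
        pairs.append((noisy, clean))
    return pairs


def loss(w, pairs):
    """Mean squared error between the denoised estimate w * noisy and the clean value."""
    return sum((w * noisy - clean) ** 2 for noisy, clean in pairs) / len(pairs)


def train_denoiser(pairs, lr=0.1, steps=100):
    """Fit the single weight of a linear denoiser by gradient descent."""
    w = 0.0
    for _ in range(steps):
        grad = sum(2 * (w * noisy - clean) * noisy for noisy, clean in pairs) / len(pairs)
        w -= lr * grad
    return w


pairs = make_pairs(500)
w = train_denoiser(pairs)
print("loss before training:", loss(0.0, pairs))
print("loss after training: ", loss(w, pairs))
```

The training pairs come for free, because we produced the noisy inputs ourselves in the forward step. A real diffusion model replaces the single weight with a deep network and removes the noise gradually over many steps.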

What Next?

By now, we have a clear understanding of Foundation Models. The next step is to build deeper knowledge and proficiency in either Foundation Models for NLP or Foundation Models for Computer Vision, depending on your area of interest.

For Foundation models for NLP, you need to build in-depth knowledge of LLMs. This includes building LLMs from scratch, training and finetuning LLMs on your dataset, and learning the most effective methods for deploying them in production.

Similarly, for Foundation Models for Computer Vision, you need a thorough understanding of diffusion models, including building them from the ground up, training and finetuning them on your datasets, and deploying them effectively.

That’s all for today! See you soon in the next article.

