Ganeshi Shreya — Published On December 11, 2022 and Last Modified On April 17th, 2023


‘Hey, Siri,’ ‘Hey, Google,’ and ‘Alexa’ are wake words for voice assistants many of us use every day. These fascinating conversational bots use Natural Language Understanding (NLU) to understand their inputs. NLU is a subset of Natural Language Processing (NLP) that enables a machine to understand natural language (text or audio). NLU is a critical component in most NLP applications, such as machine translation, speech recognition, and chatbots. The foundation of NLU is the language model.

In this article, we will discuss OpenAI's state-of-the-art language models, GPT and its variants, and how they led to the breakthrough of ChatGPT. Some of the points covered in this article include:

  • Learn about ChatGPT and its model training process.
  • Understand the brief history of GPT architectures: GPT-1, GPT-2, GPT-3 and InstructGPT.
  • In-depth understanding of Reinforcement Learning from Human Feedback (RLHF).

Let’s get started!

Overview of GPT Family

The state-of-the-art architecture for language models is the transformer. The working of a transformer is no less than magic. OpenAI came up with one such transformer, the Generative Pre-trained Transformer, popularly known as GPT.

GPT is trained in a self-supervised fashion. The model is trained over a massive dataset to predict the next word in a sequence. This is known as causal language modeling. The language model is then fine-tuned on a supervised dataset for downstream tasks.
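
To make the next-word objective concrete, here is a minimal sketch in Python. The helper name `next_word_pairs` and the whitespace tokenization are assumptions made for illustration; real GPT models use subword tokenizers and a neural network, not raw word lists.

```python
def next_word_pairs(text):
    """Turn a sentence into (context, next word) training examples."""
    words = text.split()  # toy whitespace tokenizer (assumption)
    pairs = []
    for i in range(1, len(words)):
        context = words[:i]   # everything seen so far
        target = words[i]     # the word the model must predict
        pairs.append((context, target))
    return pairs


for context, target in next_word_pairs("the cat sat on the mat"):
    print(" ".join(context), "->", target)
```

No human labels are needed here: the targets come from the text itself, which is why this kind of training is called self-supervised.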

GPT family

OpenAI released three versions of GPT, i.e., GPT-1, GPT-2, and GPT-3, to generate human-like conversations. The three versions differ mainly in size: each new version was trained by scaling up the data and parameters.


GPT-3 is an autoregressive model: it is trained to make each prediction by looking only at past values. GPT-3 can be used to build a wide range of applications, such as search engines, content creation tools, and many more. But why did GPT-3 fail to achieve human-like conversations? Let’s find out.
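
The autoregressive idea can be sketched with a toy model. The bigram counter below is only a stand-in chosen for illustration (the names `train_bigram` and `generate` and the tiny corpus are hypothetical): it looks at just the last word, whereas GPT-3 conditions on the whole preceding context. The generation loop has the same shape, though: each predicted word is appended to the output and becomes input for the next prediction.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count how often each word follows another (a toy stand-in for GPT)."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def generate(counts, start, max_words=5):
    """Greedy autoregressive decoding: predict, append, repeat."""
    output = [start]
    for _ in range(max_words):
        followers = counts.get(output[-1])
        if not followers:          # no known continuation, stop early
            break
        next_word = followers.most_common(1)[0][0]
        output.append(next_word)   # the prediction becomes part of the input
    return " ".join(output)


corpus = ["the moon is bright", "the moon is far", "the moon is bright tonight"]
model = train_bigram(corpus)
print(generate(model, "the"))
```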

Why InstructGPT?

There are 2 primary reasons why GPT-3 failed.

One of the problems with GPT-3 is that the model output is not aligned with the user's instructions or prompts. In short, GPT-3 is not trained to generate the response a user prefers.

For example, given the prompt “Explain the moon landing to a 6-year-old in a few sentences”, GPT-3 generated the unwanted response shown in the figure below. The main reason behind such responses is that the model is trained only to predict the next word in a sentence. GPT-3 is not trained to generate human-preferred responses.


Another problem is that GPT-3 can generate unsafe and harmful text, because nothing in its training constrains what it outputs.

To resolve both of these problems, alignment and harmful output, a new language model was trained. We will learn more about it in the next section.


What is InstructGPT?

InstructGPT is a language model that generates user-preferred responses with the intent of safe communication. Hence, it is known as a language model aligned to follow instructions. It uses a learning algorithm called Reinforcement Learning from Human Feedback (RLHF) to generate safer responses.

Reinforcement Learning from Human Feedback is a deep reinforcement learning technique that takes human feedback into account during learning. Human labelers guide the learning algorithm by choosing the responses they prefer from a list of responses generated by the model. This way, the agent learns to produce safe and truthful responses.

But why Reinforcement Learning from Human Feedback? Why not traditional Reinforcement Learning systems?

Traditional Reinforcement Learning systems require a reward function to be defined up front. The agent uses it to tell whether it is moving in the right direction and aims to maximize the cumulative reward. But specifying a reward function by hand is very challenging in modern Reinforcement Learning environments. Hence, instead of defining the reward function for the agent, we train a model to learn the reward function from human feedback. This way, the agent can learn the reward function and understand the environment’s complex behaviors.
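
As a rough illustration of learning a reward function from feedback, the sketch below fits a tiny linear reward model from pairwise human preferences using a pairwise logistic (Bradley-Terry style) loss. Everything here is an assumption made for illustration: the three hand-made features, the function names, and the learning rate. A real reward model is a large neural network that scores raw text.

```python
import math


def reward(weights, features):
    """Linear reward model: a weighted sum of response features."""
    return sum(w * f for w, f in zip(weights, features))


def train_reward_model(preferences, n_features, lr=0.5, epochs=200):
    """Fit weights so preferred responses score higher than rejected ones.

    Each preference is (features_of_preferred, features_of_rejected).
    Minimizes the pairwise loss -log(sigmoid(r_preferred - r_rejected)).
    """
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in preferences:
            margin = reward(weights, preferred) - reward(weights, rejected)
            p_correct = 1.0 / (1.0 + math.exp(-margin))
            # Gradient step: push the preferred response's score up
            # and the rejected one's down, more strongly when p_correct is low.
            for i in range(n_features):
                weights[i] += lr * (1.0 - p_correct) * (preferred[i] - rejected[i])
    return weights


# Hypothetical features: [is_polite, is_on_topic, contains_insult]
human_preferences = [
    ([1, 1, 0], [0, 1, 1]),
    ([1, 1, 0], [1, 0, 0]),
    ([0, 1, 0], [0, 0, 1]),
]
weights = train_reward_model(human_preferences, n_features=3)
print(reward(weights, [1, 1, 0]) > reward(weights, [0, 0, 1]))
```

Nobody wrote down a rule such as "insults are bad"; the negative weight on that feature emerges from the comparisons alone.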

In the next section, we will learn about one of the most trending topics in the field of AI – ChatGPT.


Introduction to ChatGPT

ChatGPT is now the buzz of the data science field. ChatGPT is simply a chatbot that mimics human conversations. It can answer a wide range of questions and remembers what was said earlier in the conversation. For example, given the prompt ‘code for decision tree’, ChatGPT responded with an implementation of a decision tree in Python, as shown in the figure below. That’s the power of ChatGPT. We will look at more hilarious examples at the end.



According to OpenAI, ChatGPT is a sibling model to InstructGPT, which is trained to follow an instruction in a prompt and provide a detailed response. It is a modified version of InstructGPT with a change in the model training process. It can remember earlier turns of a conversation and respond accordingly.

Now let’s see how InstructGPT and ChatGPT differ. Even though Reinforcement Learning from Human Feedback is incorporated, InstructGPT is not fully aligned and can still produce toxic output. This led to the breakthrough of ChatGPT, which changes the data collection setup.


How is ChatGPT built?

ChatGPT is trained similarly to InstructGPT, with a change in the data collection. Training has three phases; let’s understand how each one works.

In the first step, we fine-tune the pretrained GPT model (GPT-3 for InstructGPT; according to OpenAI, ChatGPT starts from a model in the GPT-3.5 series) on a dataset of prompts paired with relevant answers. This is a supervised fine-tuning task. The relevant answers are written by expert labelers.
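
A minimal sketch of what one supervised fine-tuning example might look like is shown below. The function names, the whitespace tokenization, and the choice to compute the loss only on the labeler's answer (not on the prompt) are all assumptions for illustration; the source does not describe these details. The per-token probabilities passed to `masked_nll` are made-up numbers standing in for a model's output.

```python
import math


def build_sft_example(prompt, answer):
    """Join prompt and answer into one sequence and mark the trained tokens."""
    prompt_tokens = prompt.split()   # toy whitespace tokenizer (assumption)
    answer_tokens = answer.split()
    tokens = prompt_tokens + answer_tokens
    # 0 = prompt token (context only), 1 = answer token (counts toward the loss)
    loss_mask = [0] * len(prompt_tokens) + [1] * len(answer_tokens)
    return tokens, loss_mask


def masked_nll(token_probs, loss_mask):
    """Average negative log-likelihood over the answer tokens only."""
    losses = [-math.log(p) for p, m in zip(token_probs, loss_mask) if m == 1]
    return sum(losses) / len(losses)


tokens, mask = build_sft_example("Explain gravity", "Things fall down")
print(tokens)
print(mask)

# Hypothetical probabilities the model assigned to each correct token.
print(masked_nll([0.9, 0.9, 0.5, 0.5, 0.5], mask))
```

Fine-tuning lowers this loss, which means the model assigns higher probability to the answers the labelers wrote.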

In the next step, we learn a reward function that helps the agent decide what is right and wrong and move toward the goal. The reward function is learned from human feedback, which steers the model toward safe and truthful responses.

Here is the list of steps involved in the reward modeling task:

  1. Multiple responses are generated for the given prompt.
  2. The human labeler compares the responses generated by the model and ranks them from best to worst.
  3. This ranking data is then used to train the reward model.
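
One way to turn a labeler's ranking into training data is to expand it into pairwise comparisons, so that K ranked responses yield K(K-1)/2 pairs. The sketch below shows that expansion; the function name and the example responses are hypothetical.

```python
from itertools import combinations


def ranking_to_pairs(ranked_responses):
    """Expand a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first item of each pair ranks higher.
    return list(combinations(ranked_responses, 2))


ranked = ["response B", "response A", "response C"]  # labeler's order, best first
for preferred, rejected in ranking_to_pairs(ranked):
    print(f"{preferred!r} preferred over {rejected!r}")
```

Each pair then tells the reward model which of two responses should receive the higher score.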

In the final step, we learn the optimal policy against the reward function using the Proximal Policy Optimization (PPO) algorithm. PPO is a family of reinforcement learning algorithms introduced by OpenAI. The idea behind PPO is to stabilize agent training by avoiding policy updates that are too large.
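
The "no overly large updates" idea can be sketched with the clipped surrogate objective, one common PPO variant. This is a single-action illustration under stated assumptions: the function name is hypothetical, `clip_eps=0.2` is just an illustrative setting, and the full training setup used for ChatGPT is not described in this article.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Clipped surrogate objective for a single action (to be maximized).

    ratio     = new_policy_prob / old_policy_prob for the action taken
    advantage = how much better the action was than expected
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# Moving far from the old policy earns no extra credit beyond the clip range:
print(ppo_clipped_objective(ratio=1.5, advantage=2.0))
print(ppo_clipped_objective(ratio=1.2, advantage=2.0))
# A move toward a bad action (negative advantage) is still fully penalized:
print(ppo_clipped_objective(ratio=1.5, advantage=-2.0))
```

Because the objective stops improving once the ratio leaves the range [1 - clip_eps, 1 + clip_eps], the optimizer has no incentive to push the new policy far away from the old one in a single update.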


Steps involved in model training



Hilarious Prompts of ChatGPT

Now, we will look at some hilarious prompts and the responses ChatGPT generated for them.

Prompt 1:


Prompt 2:


Prompt 3:




This brings us to the end of the article. In this article, we discussed ChatGPT and how it is trained using Deep Reinforcement Learning techniques. We also covered a brief history of GPT variants and how they led to ChatGPT.

ChatGPT is an absolute sensation in the history of AI, but it still has a long way to go before it achieves human-level intelligence. You can try ChatGPT here.

Hope you liked the article. Please let me know your thoughts and views on ChatGPT in the comments below.
