Vikram M — Published On May 17, 2023 and Last Modified On June 28th, 2023

Welcome to the future of AI: Generative AI! Have you ever wondered how machines learn to understand human language and respond accordingly? Let’s take a look at ChatGPT, the language model developed by OpenAI. Built on the GPT-3.5 model, ChatGPT has taken the world by storm, transforming how we communicate with machines and opening up new possibilities for human-machine interaction. The race has officially begun with the recent launch of ChatGPT’s rival, Google Bard, powered by PaLM 2. In this article, we will look at how ChatGPT works, the training steps involved, such as pretraining and RLHF, and how it learns to comprehend and generate human-like text.

“Generative AI opens up new creative possibilities that we never thought were possible before.”

Douglas Eck, Research Scientist at Google Brain

Let us explore the inner workings of ChatGPT and the technology behind this powerful language model.

The key objectives of the article are:

  1. Discuss the steps involved in the model training of ChatGPT.
  2. Find out the advantages of using Reinforcement Learning from Human Feedback (RLHF).
  3. Understand how humans are involved in making models like ChatGPT better.


Overview of ChatGPT Training

ChatGPT is a Large Language Model (LLM) optimized for dialogue. It is built on top of GPT-3.5, which was pretrained on massive volumes of internet text, and then refined using Reinforcement Learning from Human Feedback (RLHF).

There are three main steps involved in building ChatGPT:

  1. Pretraining LLM
  2. Supervised Finetuning of LLM (SFT)
  3. Reinforcement Learning from Human Feedback (RLHF)

The first step is to pretrain the LLM (GPT-3.5) on unlabeled text to predict the next word in a sentence. This lets the LLM learn representations and the various nuances of the text.

In the next step, we finetune the LLM on the demonstration data: a dataset of questions and answers. This optimizes the LLM for dialogue.

In the final step, we use RLHF to control the responses generated by the LLM. RLHF teaches the model to prefer the better responses among those it can generate.

Now, we will discuss each step in detail.

Pretraining LLM

Language models are statistical models that predict the next word in a sequence. Large language models are deep learning models trained on billions of words. The training data is scraped from multiple sources such as Reddit, StackOverflow, Wikipedia, books, arXiv, GitHub, etc.
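To make "predicting the next word" concrete, here is a minimal sketch of a statistical language model that simply counts which word follows which. It is a toy bigram model, not how GPT-style models are built; the corpus and the function names (train_bigram, next_word_probs, complete) are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = sentence.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, word):
    """Turn the raw follow-up counts for `word` into probabilities."""
    followers = counts[word.lower()]
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()} if total else {}

def complete(counts, prompt, max_new_words=3):
    """Greedily append the most likely next word, one word at a time."""
    tokens = prompt.lower().split()
    for _ in range(max_new_words):
        probs = next_word_probs(counts, tokens[-1])
        if not probs:
            break  # the last word was never seen with a follower
        tokens.append(max(probs, key=probs.get))
    return " ".join(tokens)

# Hypothetical two-sentence "training set".
toy_corpus = [
    "roses are red and violets are blue",
    "violets are blue sugar is sweet",
]
model = train_bigram(toy_corpus)
print(complete(model, "Roses are red and"))  # roses are red and violets are blue
```

A real LLM replaces the count table with a deep neural network and conditions on the whole preceding context rather than only the last word, but the training objective is the same idea: assign high probability to the word that actually comes next.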

Dataset size and parameter counts of different LLMs. ChatGPT is built on the GPT-3 family (GPT-3.5).

The above image gives an idea of the size of the dataset and the number of parameters. Pretraining an LLM is computationally expensive, as it requires massive hardware and a vast dataset. At the end of pretraining, we obtain an LLM that can predict the next word in a sentence when prompted. For example, if we prompt it with “Roses are red and”, it might respond with “violets are blue.” The image below depicts what GPT-3 can do at the end of pretraining:

What GPT-3 can do at the end of pretraining.

We can see that the model is trying to complete the sentence rather than answering it. But we need to know the answer rather than the next sentence. What could be the next step to achieve it? Let us see this in the next section.


Supervised Finetuning of LLM

So, how do we make the LLM answer the question rather than predict the next word? Supervised Finetuning of the model would help us solve this problem. We can tell the model the desired response for a given prompt and fine-tune it. For this, we can create a dataset of multiple types of questions to ask a conversational model. Human labelers can provide the appropriate responses to make the model understand the expected output. This dataset consisting of pairs of prompts and responses is called Demonstration Data. Now, let us see a sample dataset of prompts and their responses in the demonstration data.

Supervised Finetuning of LLM
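As a rough sketch of how one prompt-response pair from the demonstration data could be turned into a training example, the snippet below concatenates the two and marks which tokens count toward the loss. The article does not describe this detail, so treat it as an assumption: scoring only the response tokens is a common convention, and the whitespace tokenizer, the "<eos>" marker, the sample pair, and the build_sft_example name are all hypothetical.

```python
def build_sft_example(prompt, response, end_token="<eos>"):
    """Join a prompt and its labeled response into one token sequence.

    Returns (tokens, loss_mask). loss_mask is 1 for tokens the model is
    trained to produce (the response) and 0 for the prompt tokens, which
    only serve as context.
    """
    prompt_tokens = prompt.split()                    # toy whitespace tokenizer
    response_tokens = response.split() + [end_token]  # mark where the answer ends
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

# A made-up demonstration pair, written the way a human labeler might.
demonstration_data = [
    {
        "prompt": "What is the capital of France?",
        "response": "The capital of France is Paris.",
    },
]

for pair in demonstration_data:
    tokens, loss_mask = build_sft_example(pair["prompt"], pair["response"])
    for token, flag in zip(tokens, loss_mask):
        print(f"{token:10s} {'train' if flag else 'context'}")
```

The model is still trained with next-word prediction here. What changes is the data: it now sees questions followed by answers, so answering becomes the most likely continuation.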

Reinforcement Learning from Human Feedback (RLHF)

Now, we are going to learn about RLHF. Before understanding RLHF, let us first see the benefits of using RLHF.


After supervised finetuning, our model should give us the appropriate responses for the given prompts, right? Unfortunately, no! Our model might still not properly answer every question that we ask it. It might still be unable to evaluate which response is good and which is not. It may also have overfit the demonstration data. Let us see what could happen if it overfits the data. While writing this article, I asked Bard this:

what RLHF is important in making model like GPT

I did not give it any link, article, or sentence to summarize. But it just summarized something and gave it to me, which was unexpected.

One more problem that might arise is toxicity. Though an answer might be factually right, it might not be right ethically and morally. For example, look at the image below, which you might have seen before. When asked for websites to download movies, the model first responds that it is not ethical, but in the next prompt we can easily manipulate it, as shown.

Fine tuning with RLHF

Ok, now go ahead to ChatGPT and try the same example. Did it give you the same result?

Why are we not getting the same answer? Did they retrain the entire network? Probably not! There might have been a small fine-tuning with RLHF. You can refer to this beautiful gist for more reasons.

Reward Model

The first step in RLHF is to train a reward model. The model should take a prompt and its response as input and output a scalar value that indicates how good the response is. To teach the machine what a good response is, could we ask annotators to score each response with a reward directly? If we did, different annotators would score the same response differently, and this inconsistency would make it hard for the model to learn how to reward responses. Instead, the annotators can rank the responses from the model, which reduces the bias in the annotations to a great extent. The image below shows a chosen response and a rejected response for a given prompt from Anthropic’s hh-rlhf dataset.

Chosen response from hh-rlhf dataset
Rejected response from hh-rlhf dataset

From this data, the model tries to distinguish between a good and bad response.
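One common way to learn from chosen/rejected pairs is a pairwise ranking loss, which is small when the reward model scores the chosen response above the rejected one. The article does not state which loss is used, so the sketch below is an assumption; the function name and the example scores are made up, and the scalar rewards stand in for the output of a real reward model.

```python
import math

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Compute -log(sigmoid(reward_chosen - reward_rejected)).

    Uses the identity -log(sigmoid(d)) = log(1 + exp(-d)), split into
    two branches so that exp() never receives a large positive argument.
    """
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

# Hypothetical scalar scores from a reward model for one prompt.
print(pairwise_ranking_loss(2.0, -1.0))  # chosen ranked higher: small loss
print(pairwise_ranking_loss(-1.0, 2.0))  # chosen ranked lower: large loss
```

Only the difference between the two scores matters, which is why ranking data is enough: annotators never have to agree on an absolute reward value.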

Finetuning LLM with Reward Model Using RL

Now, we finetune the LLM with Proximal Policy Optimization (PPO). In this approach, we generate a response with both the initial language model and the current iteration of the fine-tuned model, and score the fine-tuned model’s response with the reward model. We compare the current language model with the initial language model so that it does not deviate too much from the right answer while generating neat, clean, and readable output. KL-divergence is used to measure the difference between the two models, and it acts as a penalty while finetuning the LLM.
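The interplay between the reward model score and the KL-divergence term can be sketched as a single number that PPO then tries to maximize. This is a simplified illustration under assumptions: the coefficient beta, the function name, and the per-token log-probabilities are hypothetical, and the KL term is estimated from one sampled response rather than computed exactly.

```python
def kl_penalized_reward(reward_score, logprobs_policy, logprobs_reference, beta=0.1):
    """Reward model score minus beta times a sample-based KL estimate.

    logprobs_policy and logprobs_reference hold the log-probability that
    the fine-tuned model and the initial model, respectively, assigned to
    each token of the same generated response.
    """
    if len(logprobs_policy) != len(logprobs_reference):
        raise ValueError("both models must score the same tokens")
    # Summed log-ratio over the response tokens: grows as the fine-tuned
    # model makes this response more likely than the initial model did.
    kl_estimate = sum(p - r for p, r in zip(logprobs_policy, logprobs_reference))
    return reward_score - beta * kl_estimate

# Same reward model score, different amounts of drift from the initial model.
print(kl_penalized_reward(1.0, [-0.5, -0.7], [-0.5, -0.7]))  # no drift, no penalty
print(kl_penalized_reward(1.0, [-0.1, -0.2], [-1.0, -1.2]))  # drifted, penalized
```

A larger beta keeps the fine-tuned model closer to the initial model, while a smaller beta lets it chase the reward model score more aggressively.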

Model Evaluation

The models were evaluated at the end of each training step and at different parameter counts. You can see the methods and their respective scores in the image below:

Different methods of Model Evaluation

In the above figure, we can compare the performance of the LLMs at different training stages across different model sizes. As you can see, the results improve significantly after each training phase.

We can also replace the human in RLHF with an AI model, an approach known as Reinforcement Learning from AI Feedback (RLAIF). This significantly reduces the cost of labeling and has the potential to perform better than RLHF. Let’s discuss that in the next article.


In this article, we saw how conversational LLMs like ChatGPT are trained. We saw the three phases of training ChatGPT and how reinforcement learning from human feedback has helped the model improve its performance. We also understood the importance of each step, without which the LLM would be inaccurate.

Hope you enjoyed reading it. Feel free to leave comments below in case of any query/feedback. Happy Learning 🙂


Frequently Asked Questions

Q1. How does ChatGPT get its data?

A. ChatGPT gets its data from multiple websites like Reddit, StackOverflow, Wikipedia, Books, ArXiv, Github, etc. It uses this data to learn patterns, grammar, and facts.

Q2. How to earn money using ChatGPT?

A. ChatGPT itself does not provide a direct way to earn money. However, individuals or organizations can utilize the capabilities of ChatGPT to develop applications or services that can generate revenue, such as blogging, virtual assistants, customer support bots, or content generation tools.

Q3. How does ChatGPT actually work?

A. ChatGPT is a Large Language Model optimized for dialogue. It accepts prompts as an input and returns the response/answer. It uses GPT 3.5 and Reinforcement Learning from Human Feedback (RLHF) as the core working principles.

Q4. What algorithm does ChatGPT use?

A. ChatGPT uses Deep Learning and Reinforcement Learning behind the scenes. It is developed in 3 phases: Pretraining Large Language Model (GPT 3.5), Supervised Finetuning, Reinforcement Learning from Human Feedback (RLHF).
