Beginner’s Guide to Building Large Language Models from Scratch

Aravindpai Pai 02 Apr, 2024 • 15 min read

Whether on X or LinkedIn, I come across numerous posts about Large Language Models (LLMs) every day. I often wondered why so much research and development is devoted to these intriguing models. From ChatGPT to Gemini, Falcon, and countless others, their names are everywhere, and I was eager to understand what they really are. That curiosity pushed me to dive headfirst into the world of LLMs.

Join me as we discuss the current state of the art in LLMs for beginners. Together, we’ll look at how these models are developed, what they are capable of, and how they have changed language processing.

Learning Objectives

Learn about LLMs and their current state of the art.
Understand the different LLMs available and the approaches to training them from scratch.
Explore best practices to train and evaluate LLMs.

This article was published as a part of the Data Science Blogathon.

A Brief History of Large Language Models
What are Large Language Models?
Why Large Language Models?
Different Kinds of LLMs
What are the Challenges of Training LLMs for Beginners?
Understanding the Scaling Laws
How Do You Train LLMs from Scratch?
How Do You Evaluate LLMs?
Frequently Asked Questions

A Brief History of Large Language Models

The history of Large Language Models goes back to the 1960s. In 1966, MIT professor Joseph Weizenbaum built ELIZA, one of the first NLP programs. It used pattern matching and substitution techniques to simulate understanding and converse with humans. Later, around 1970, another program known as SHRDLU was built at MIT to understand and interact with humans.

In the 1980s, the Recurrent Neural Network (RNN) architecture was introduced to capture the sequential information present in text data. However, RNNs worked well only on shorter sentences, not long ones. Hence, the Long Short-Term Memory (LSTM) architecture was proposed in 1997. During this period, LSTM-based applications developed rapidly. Later on, research into attention mechanisms began as well.

Two Major Concerns With LSTM

LSTMs solved the problem of long sentences to some extent, but they still struggled with really long sentences. In addition, LSTM training cannot be parallelized across time steps, so training these models took a long time.

In 2017, there was a breakthrough in NLP research through the paper Attention Is All You Need. This paper revolutionized the entire NLP landscape. The researchers introduced a new architecture known as the Transformer to overcome the challenges with LSTMs. Transformers can be trained in parallel and scale to a huge number of parameters, which made them the foundation on which LLMs were built. Even today, LLM development remains based on the Transformer.

Over the next five years, significant research focused on building better LLMs on top of the Transformer. The size of LLMs increased exponentially over time. Experiments showed that increasing the size of models and datasets improved the knowledge of LLMs. Hence, GPT variants like GPT-2, GPT-3, GPT-3.5, and GPT-4 were introduced with increases in the number of parameters and the size of training datasets.

In 2022, there was another breakthrough in NLP: ChatGPT. ChatGPT is a dialogue-optimized LLM that can respond to a very wide range of questions. A few months later, Google introduced Bard, since renamed Gemini, as a competitor to ChatGPT.

In the last year, hundreds of Large Language Models have been developed. You can find a ranked list of open-source LLMs on the Hugging Face Open LLM leaderboard. At the time of writing, the top-ranked open-source LLM there was Falcon 40B Instruct.

What are Large Language Models?

Simply put, Large Language Models are deep learning models trained on huge datasets to understand human languages. Their core objective is to learn and understand human languages precisely. Large Language Models enable machines to interpret languages much as we humans do.

Large Language Models learn the patterns and relationships between the words in a language. For example, they learn the syntactic and semantic structure of the language, such as grammar, word order, and the meaning of words and phrases. In this way they gain the capability to handle the language as a whole.

But how exactly are language models different from Large Language Models?

Both language models and Large Language Models learn and understand human language; the primary difference lies in how these models are developed.

Language models are generally statistical models built with Hidden Markov Models (HMMs) or other probabilistic methods, whereas Large Language Models are deep learning models with billions of parameters trained on very large datasets.

Why Large Language Models?

The answer to this question is simple. LLMs are task-agnostic models: a single model can be applied to a very wide range of tasks. ChatGPT is a classic example of this. Whatever you ask ChatGPT, it can usually produce a useful response.

One more astonishing feature of LLMs is that you often don’t have to fine-tune the model for your task, as you would with other pretrained models. All you need to do is prompt the model, and it does the job for you. Hence, LLMs can provide instant solutions to many of the problems you are working on. Moreover, it’s just one model for all your problems and tasks. Hence, these models are known as foundation models in NLP.

Different Kinds of LLMs

LLMs can be broadly classified into 2 types depending on their task:

Continuing the text
Dialogue optimized

Continuing the Text

These LLMs are trained to predict the next sequence of words in the input text. Their task at hand is to continue the text.

For example, given the text “How are you”, these LLMs might complete the sentence with “How are you doing?” or “How are you? I am fine.”

LLMs falling under this category include XLNet and the GPT family (GPT, GPT-2, GPT-3, GPT-4, etc.). BERT is often listed alongside them, although it is pretrained with masked-token prediction rather than next-token prediction.

Now, the problem with these LLMs is that they are very good at completing text but not necessarily at answering. Sometimes, we expect an answer rather than a completion.

As discussed above, given “How are you?” as an input, the LLM tries to continue the text with “doing?” or “I am fine.” The response can be either a completion or an answer. This is exactly why dialogue-optimized LLMs were introduced.

Dialogue Optimized

These LLMs respond with an answer rather than a completion. Given the input “How are you?”, these LLMs might respond with the answer “I am doing fine.” rather than completing the sentence.

Examples of dialogue-optimized LLMs include InstructGPT, ChatGPT, Gemini, and Falcon-40B-Instruct.

Now, we will see the challenges involved in training LLMs from scratch.

What are the Challenges of Training LLMs for Beginners?

Training LLMs from scratch is really challenging because of 2 main factors: Infrastructure and Cost.

Infrastructure

LLMs are trained on a massive text corpus, typically at least 1,000 GB in size. The models trained on these datasets are very large, containing billions of parameters. To train such large models on a massive text corpus, we need to set up infrastructure that supports multiple GPUs. Can you guess how long it would take to train GPT-3, a 175-billion-parameter model, on a single GPU?

It would take an estimated 288 years to train GPT-3 on a single NVIDIA Tesla V100 GPU.

This clearly shows that training an LLM of this size on a single GPU is impractical. It requires distributed and parallel computing with hundreds or thousands of GPUs.

Just to give you an idea, here is the hardware used for training popular LLMs:

Falcon-40B was trained on 384 A100 40GB GPUs, using a 3D parallelism strategy (TP=8, PP=4, DP=12) combined with ZeRO.
Researchers calculated that OpenAI could have trained GPT-3 in as little as 34 days on 1,024 A100 GPUs.
PaLM (540B, Google): 6144 TPU v4 chips used in total.
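
The single-GPU figure quoted above can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes the common rule of thumb that training needs roughly 6 x parameters x tokens floating-point operations, takes GPT-3's training set as about 300 billion tokens, and assumes a sustained throughput of 35 TFLOP/s on one V100. The throughput value and the function name are illustrative assumptions, not measurements from this article. It also checks that the Falcon-40B parallelism degrees multiply out to the GPU count.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def training_years(n_params, n_tokens, flops_per_second):
    """Rough wall-clock training time in years at a given sustained throughput."""
    # Rule of thumb: about 6 FLOPs per parameter per training token.
    total_flops = 6 * n_params * n_tokens
    return total_flops / flops_per_second / SECONDS_PER_YEAR


# GPT-3: 175B parameters, ~300B training tokens.
# 35e12 FLOP/s is an ASSUMED sustained throughput for a single V100.
gpt3_years = training_years(175e9, 300e9, 35e12)
print(f"GPT-3 on one V100: about {gpt3_years:.0f} years")

# Falcon-40B: tensor x pipeline x data parallel degrees give the GPU count.
tp, pp, dp = 8, 4, 12
print("Falcon-40B GPUs:", tp * pp * dp)
```

Under these assumptions the estimate lands within a few years of the 288-year figure; a different throughput assumption would shift it proportionally.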

Cost

It’s clear from the above that GPU infrastructure is essential for training LLMs from scratch. Setting up infrastructure of this size is highly expensive. Companies and research institutions invest millions of dollars to set it up and train LLMs from scratch.

It is estimated that GPT-3 cost around $4.6 million to train from scratch.

On average, a 7B-parameter model would cost roughly $25,000 to train from scratch.
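
To see how such cost figures come about, here is a minimal sketch that turns a GPU count and a training duration into GPU-hours and a rental cost. It reuses the 1,024 A100 GPUs for 34 days mentioned in the hardware list; the price of $2 per GPU-hour is a made-up placeholder, and the function name is hypothetical. It is not meant to reproduce the $4.6 million estimate, which comes from a different analysis.

```python
def training_cost_usd(num_gpus, days, usd_per_gpu_hour):
    """Return (GPU-hours, rental cost in USD) for a training run."""
    gpu_hours = num_gpus * days * 24
    return gpu_hours, gpu_hours * usd_per_gpu_hour


# 1,024 GPUs for 34 days; the hourly price is an ASSUMED placeholder.
gpu_hours, cost = training_cost_usd(num_gpus=1024, days=34, usd_per_gpu_hour=2.0)
print(f"{gpu_hours:,} GPU-hours, about ${cost:,.0f}")
```

Plugging in your own cloud provider's price gives a first-order budget before you commit to a training run.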

Now, we will see the scaling laws of LLMs.

Understanding the Scaling Laws

Recently, we have seen a trend of ever larger language models being developed. They are large in terms of both dataset scale and model size.

When you are training LLMs from scratch, it’s really important to ask these questions before the experiment:

How much data do I need to train LLMs from scratch?
What should be the size of the model?

The answer to these questions lies in scaling laws.

Scaling laws describe how much data is optimal for training a model of a particular size.

In 2022, DeepMind proposed scaling laws for training LLMs with the optimal model size and dataset size (number of tokens) in the paper Training Compute-Optimal Large Language Models. These scaling laws are popularly known as the Chinchilla or Hoffmann scaling laws. They state that:

The number of tokens used to train an LLM should be about 20 times the number of parameters of the model.

For example, 1,400B (1.4T) tokens should be used to train a compute-optimal LLM with 70B parameters. In other words, we need around 20 text tokens per parameter.
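
The 20-tokens-per-parameter rule is easy to apply to any model size. The snippet below is a small sketch of that arithmetic; the function name and the list of model sizes are just illustrative choices.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb quoted above


def compute_optimal_tokens(n_params):
    """Approximate number of training tokens for a model with n_params parameters."""
    return TOKENS_PER_PARAM * n_params


for n_params in (1_000_000_000, 7_000_000_000, 70_000_000_000):
    tokens = compute_optimal_tokens(n_params)
    print(f"{n_params // 10**9}B parameters -> {tokens // 10**9:,}B tokens")
```

For the 70B case this reproduces the 1.4T tokens mentioned above.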

Next, we will see how to train LLMs from scratch.

How Do You Train LLMs from Scratch?

The training process differs depending on the kind of LLM you want to build, whether it continues text or is dialogue optimized. The performance of LLMs mainly depends on 2 factors: dataset and model architecture. These 2 are the key driving factors behind the performance of LLMs.

Let’s now discuss the different steps involved in training LLMs.

Continuing the Text Tutorial

The training process of LLMs that continue the text is known as pretraining. These LLMs are trained with self-supervised learning to predict the next word in the text. We will now look at the different steps involved in training such LLMs from scratch.

a. Dataset Collection

The first step in training LLMs is collecting a massive corpus of text data. The dataset plays the most significant role in the performance of LLMs. A recent example is OpenChat, a dialogue-optimized large language model inspired by LLaMA-13B. It achieves 105.7% of the ChatGPT score on the Vicuna GPT-4 evaluation. Do you know the reason behind its success? It’s high-quality data: the model was fine-tuned on only ~6K examples.

The training data is created by scraping the internet: websites, social media platforms, academic sources, and so on. Make sure the training data is as diverse as possible.

Recent work has demonstrated that increased training dataset diversity improves general cross-domain knowledge and downstream generalization capability for large-scale language models.

What does it say?

You might have come across headlines like “ChatGPT failed at JEE” or “ChatGPT fails to clear the UPSC”. What could be the reason? The model lacked the knowledge and reasoning ability these exams require, which depends heavily on the dataset used for training. Hence, the demand for diverse datasets continues to rise, as a high-quality cross-domain dataset has a direct impact on how well the model generalizes across different tasks.

Unlock the potential of LLMs with high-quality data!

Previously, Common Crawl was the go-to dataset for training LLMs. Common Crawl contains raw web page data, extracted metadata, and text extractions collected since 2008. The size of the dataset is in petabytes (1 petabyte = 1e6 GB). Large Language Models trained on this dataset have shown effective results but failed to generalize well across other tasks. Hence, a new dataset called The Pile was created from 22 diverse high-quality datasets. It combines existing data sources and new datasets and totals about 825 GB. More recently, a refined version of Common Crawl was released under the name RefinedWeb.

Note: The datasets used for GPT-3 and GPT-4 have not been open-sourced, in order to maintain a competitive advantage over others.

b. Dataset Preprocessing

The next step is to preprocess and clean the dataset. As the dataset is crawled from multiple web pages and different sources, it often contains noise and inconsistencies. We must eliminate these and prepare a high-quality dataset for model training.

The specific preprocessing steps actually depend on the dataset you are working with. Some of the common preprocessing steps include removing HTML Code, fixing spelling mistakes, eliminating toxic/biased data, converting emoji into their text equivalent, and data deduplication. Data deduplication is one of the most significant preprocessing steps while training LLMs. Data deduplication refers to the process of removing duplicate content from the training corpus.
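
To make a couple of these steps concrete, here is a minimal cleaning sketch using only the Python standard library. It strips HTML tags with a crude regular expression, decodes HTML entities, and collapses whitespace. The function name and the sample string are illustrative assumptions; a real pipeline would more likely use a proper HTML parser and add steps such as toxicity filtering.

```python
import html
import re

SCRIPT_STYLE_RE = re.compile(r"(?is)<(script|style).*?>.*?</\1>")
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(raw):
    """Crude cleanup: drop script/style blocks and tags, decode entities, tidy spaces."""
    text = SCRIPT_STYLE_RE.sub(" ", raw)   # remove embedded scripts and styles
    text = TAG_RE.sub(" ", text)           # remove the remaining HTML tags
    text = html.unescape(text)             # turn entities such as &amp; into characters
    return WHITESPACE_RE.sub(" ", text).strip()


sample = "<p>LLMs &amp; <b>transformers</b>   are   everywhere.</p>"
print(clean_text(sample))
```

Regex-based tag stripping is only a rough approximation of HTML parsing, so treat this as a starting point rather than a production cleaner.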

It’s natural for the training data to contain duplicate or nearly identical sentences, since it’s collected from various data sources. We need data deduplication for 2 primary reasons. First, it keeps the model from memorizing the same data again and again. Second, it helps us evaluate LLMs better, because the training and test data no longer share duplicated content. If they do, there is a high chance that text seen in the training set is simply reproduced as output on the test set, so the reported numbers may be misleading. You can read more about data deduplication techniques in the paper Deduplicating Training Data Makes Language Models Better.
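
As a rough illustration of deduplication, the sketch below removes exact duplicates by hashing a normalized copy of each document, and measures near-duplication with Jaccard similarity over word 3-grams. These are simple stand-ins chosen for readability, not the techniques from the paper mentioned above; all function names are hypothetical.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def deduplicate(documents):
    """Keep the first occurrence of each document, compared after normalization."""
    seen = set()
    unique = []
    for doc in documents:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def shingles(text, n=3):
    """Set of word n-grams of the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}


def jaccard(a, b, n=3):
    """Jaccard similarity of the two shingle sets; values near 1.0 suggest near-duplicates."""
    sa, sb = shingles(a, n), shingles(b, n)
    return len(sa & sb) / len(sa | sb)


corpus = [
    "I am a DHS Chatbot.",
    "i am  a DHS chatbot.",
    "DHS stands for DataHack Summit.",
]
print(deduplicate(corpus))
print(jaccard("the cat sat on the mat today", "the cat sat on the mat yesterday"))
```

Pairwise Jaccard comparison does not scale to web-sized corpora, which is why large pipelines rely on approximate methods; the idea of comparing overlapping n-grams is the same.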

c. Dataset Preparation

During the pretraining phase, the next step involves creating the input and output pairs for training the model. LLMs are trained to predict the next token in the text, so input and output pairs are generated accordingly. While this demonstration considers each word as a token for simplicity, in practice, tokenization algorithms like Byte Pair Encoding (BPE) further break down each word into subwords. The model is then trained with the tokens of input and output pairs.

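For example, let’s take a simple corpus:

Example 1: I am a DHS Chatbot.
Example 2: DHS stands for DataHack Summit.
Example 3: I can provide you with information about DHS.

In the case of example 1, we can create the input-output pairs as follows:

Input: I | Output: am
Input: I am | Output: a
Input: I am a | Output: DHS
Input: I am a DHS | Output: Chatbot

Similarly, in the case of example 2, the input and output pairs are:

Input: DHS | Output: stands
Input: DHS stands | Output: for
Input: DHS stands for | Output: DataHack
Input: DHS stands for DataHack | Output: Summit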

Each input and output pair is passed on to the model for training.
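
The pair-generation step can be sketched in a few lines of Python. As in the text, each whitespace-separated word is treated as a token for simplicity; a real pipeline would run a subword tokenizer such as BPE first. The function name is a hypothetical choice.

```python
def next_token_pairs(sentence):
    """Build (context, next word) training pairs from one sentence."""
    tokens = sentence.split()  # simplification: one word = one token
    return [(" ".join(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


corpus = [
    "I am a DHS Chatbot.",
    "DHS stands for DataHack Summit.",
    "I can provide you with information about DHS",
]

for sentence in corpus:
    for context, target in next_token_pairs(sentence):
        print(f"{context!r} -> {target!r}")
```

Every prefix of a sentence becomes an input and the word that follows it becomes the label, which is why no manual annotation is needed for pretraining.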

Now, what next? Let’s define the model architecture.

d. Model Architecture

The next step is to define the model architecture and train the LLM.

As of today, a huge number of LLMs are being developed. You can get an overview of different LLMs on the Hugging Face Open LLM leaderboard. Researchers follow a fairly standard process when building LLMs: most start with an existing Large Language Model architecture, such as GPT-3, along with its actual hyperparameters, and then tweak the architecture, hyperparameters, or dataset to come up with a new LLM.

For example, Falcon, a state-of-the-art LLM that ranked first on the open-source LLM leaderboard at the time of writing, is inspired by the GPT-3 architecture with a couple of tweaks.
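
To get a feel for how architecture hyperparameters translate into model size, the sketch below uses a widely cited approximation for GPT-style decoders: the non-embedding parameter count is about 12 x layers x d_model squared (roughly 4 x d_model squared for attention plus 8 x d_model squared for the feed-forward block in each layer). The GPT-3 values of 96 layers and a hidden size of 12,288 come from the GPT-3 paper, not from this article, and the function name is hypothetical.

```python
def approx_decoder_params(n_layers, d_model):
    """Approximate non-embedding parameter count of a GPT-style decoder."""
    # Per layer: ~4 * d_model^2 (attention) + ~8 * d_model^2 (feed-forward).
    return 12 * n_layers * d_model ** 2


gpt3_params = approx_decoder_params(n_layers=96, d_model=12288)
print(f"Approximate GPT-3 size: {gpt3_params / 1e9:.0f}B parameters")
```

The estimate comes out close to the 175 billion parameters quoted for GPT-3, which makes the formula a handy check when you tweak the number of layers or the hidden size.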

e. Hyperparameter Search

Hyperparameter tuning is a very expensive process in terms of both time and cost. Just imagine running such experiments for a billion-parameter model. It’s not feasible, right? Hence, the ideal approach is to start from the hyperparameters of existing research work (for example, use the hyperparameters of GPT-3 when working with the corresponding architecture), then find the optimal hyperparameters at a small scale and extrapolate them to the final model.

The experiments can involve any or all of the following: weight initialization, positional embeddings, optimizer, activation, learning rate, weight decay, loss function, sequence length, number of layers, number of attention heads, number of parameters, dense vs. sparse layers, batch size, and dropout.

Let’s discuss the best practices for popular hyperparameters now:

Batch size: Ideally, choose the largest batch size that fits in GPU memory.
Learning Rate Scheduler: A good approach is to decrease the learning rate as training progresses. This helps the model settle into a good minimum and improves training stability. Some commonly used learning rate schedulers are Step Decay and Exponential Decay.
Weight Initialization: Model convergence depends heavily on the weights initialized before training. Proper weight initialization leads to faster convergence. One initialization scheme proposed for transformers is T-Fixup. Use custom weight initialization techniques only when you are defining your own LLM architecture.
Regularization: LLMs can be prone to overfitting. Hence, it’s necessary to use techniques like Dropout and L1/L2 regularization (weight decay) to help the model overcome overfitting. Note that transformers typically use Layer Normalization rather than Batch Normalization.
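
The two learning rate schedulers named above can be written as plain functions of the training step. This is a minimal sketch with made-up hyperparameter values; deep learning frameworks ship their own scheduler classes, which you would normally use instead.

```python
import math


def step_decay(initial_lr, step, drop_every, factor=0.5):
    """Multiply the learning rate by `factor` once every `drop_every` steps."""
    return initial_lr * factor ** (step // drop_every)


def exponential_decay(initial_lr, step, decay_rate):
    """Shrink the learning rate smoothly: lr = initial_lr * exp(-decay_rate * step)."""
    return initial_lr * math.exp(-decay_rate * step)


# Illustrative values only; real runs tune these per model.
for step in (0, 1000, 2500, 5000):
    print(
        step,
        step_decay(1e-3, step, drop_every=1000),
        exponential_decay(1e-3, step, decay_rate=5e-4),
    )
```

Step decay changes the rate in discrete jumps, while exponential decay lowers it a little at every step; both follow the advice of reducing the learning rate as training progresses.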

Dialogue-optimized LLMs

Dialogue-optimized Large Language Models (LLMs) begin their journey with a pretraining phase, similar to other LLMs. Post-pretraining, these models are capable of text completion. To generate specific answers to questions, these LLMs undergo fine-tuning on a supervised dataset comprising question-answer pairs. This process equips the model with the ability to generate answers to specific questions.

ChatGPT, a dialogue-optimized LLM, follows a similar training method. However, after pretraining and supervised fine-tuning, it incorporates an additional step known as Reinforcement Learning from Human Feedback (RLHF).

Interestingly, a recent paper titled “LIMA: Less Is More for Alignment” suggests that RLHF might not be necessary. The paper posits that pretraining on a large dataset and supervised fine-tuning on high-quality data (around 1,000 examples) can suffice.

OpenChat is a recent dialogue-optimized LLM inspired by LLaMA-13B. Having been fine-tuned on merely 6k high-quality examples, it achieves 105.7% of ChatGPT’s score on the Vicuna GPT-4 evaluation. This achievement underscores the importance of data quality and training methods in the development of dialogue-optimized LLMs.
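
To illustrate the supervised fine-tuning step described earlier in this section, here is a sketch of how one question-answer pair might be turned into a training example. The prompt template is entirely hypothetical, words stand in for tokens, and masking the prompt out of the loss is a common choice rather than something this article prescribes.

```python
PROMPT_TEMPLATE = "Question: {question}\nAnswer:"  # hypothetical template


def build_sft_example(question, answer):
    """Return (tokens, loss_mask) for one question-answer pair."""
    prompt_tokens = PROMPT_TEMPLATE.format(question=question).split()
    answer_tokens = answer.split()
    tokens = prompt_tokens + answer_tokens
    # 0 = ignore in the loss (prompt), 1 = train on this token (answer).
    loss_mask = [0] * len(prompt_tokens) + [1] * len(answer_tokens)
    return tokens, loss_mask


tokens, loss_mask = build_sft_example("How are you?", "I am doing fine.")
for token, flag in zip(tokens, loss_mask):
    print(flag, token)
```

With this setup the model still sees the question as context, but only the answer tokens contribute to the training loss, which nudges it toward answering rather than merely continuing the text.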

How Do You Evaluate LLMs?

The evaluation of LLMs cannot be subjective. It has to be a systematic process.

In classification or regression problems, we have true labels and predicted labels, and we compare the two to understand how well the model is performing. For classification, we look at the confusion matrix, right? But what about large language models? They just generate text.

There are 2 ways to evaluate LLMs: Intrinsic and extrinsic methods.

Intrinsic Methods

Researchers evaluated traditional language models using intrinsic methods like perplexity, bits per character, etc. These metrics track the performance on the language front i.e. how well the model is able to predict the next word.
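
Perplexity is simple to compute once you have the probability the model assigned to each actual next token. The sketch below uses made-up probabilities; in practice they would come from the model's output distribution. A bits-per-token helper is included as the token-level analogue of bits per character.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative log-likelihood of the observed tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


def bits_per_token(token_probs):
    """Average number of bits needed to encode each observed token."""
    return -sum(math.log2(p) for p in token_probs) / len(token_probs)


# Hypothetical probabilities assigned to the correct next token at each position.
probs = [0.25, 0.25, 0.25, 0.25]
print(perplexity(probs), bits_per_token(probs))
```

A model that always assigns probability 0.25 to the right token is as uncertain as a uniform choice among four options, so its perplexity is about 4 and it needs 2 bits per token. Lower values mean better next-word prediction.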

Extrinsic Methods

With the advancements in LLMs today, researchers and practitioners prefer extrinsic methods for evaluating performance. The recommended way to evaluate LLMs is to look at how well they perform on different tasks such as problem-solving, reasoning, mathematics, computer science, and exams like those from MIT or the JEE.

EleutherAI released a framework called the Language Model Evaluation Harness to compare and evaluate the performance of LLMs. Hugging Face integrated this evaluation framework to evaluate open-source LLMs developed by the community.

The proposed framework evaluates LLMs across 4 different datasets. The final score is an aggregation of scores from each dataset.

AI2 Reasoning Challenge: A collection of science questions designed for elementary school students.
HellaSwag: A test that challenges state-of-the-art models to make common-sense inferences, which are relatively easy for humans (about 95% accuracy).
MMLU: A comprehensive test that evaluates the multitask accuracy of a text model. It includes 57 different tasks covering subjects like basic math, U.S. history, computer science, law, and more.
TruthfulQA: A test specifically created to assess a model’s tendency to generate accurate answers and avoid reproducing false information commonly found online.
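
As a small illustration of the aggregation step, the sketch below averages per-benchmark scores into one number. It assumes the aggregate is a plain unweighted mean, and the scores are invented placeholders rather than results for any real model.

```python
from statistics import mean


def aggregate_score(scores):
    """Unweighted mean of per-benchmark scores (assumed aggregation rule)."""
    return mean(list(scores.values()))


# Hypothetical scores in percent; not measurements of any real model.
example_scores = {
    "ARC": 60.0,
    "HellaSwag": 80.0,
    "MMLU": 50.0,
    "TruthfulQA": 50.0,
}
print(aggregate_score(example_scores))
```

Because a single average can hide a weak result on one benchmark, it is worth looking at the individual scores as well as the aggregate.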


Conclusion

Large Language Models (LLMs) have revolutionized the field of machine learning. They have a wide range of applications, from continuing text to creating dialogue-optimized models. Libraries like TensorFlow and PyTorch have made it easier to build and train these models.

However, training LLMs is not without its challenges. It requires substantial infrastructure and can be costly. Understanding the scaling laws is crucial to optimize the training process and manage costs effectively. Despite these challenges, the benefits of LLMs, such as their ability to understand and generate human-like text, make them a valuable tool in today’s data-driven world.

The process of training an LLM involves feeding the model a large dataset and adjusting the model’s parameters to minimize the difference between its predictions and the actual data. Typically, this is done with a decoder-only transformer architecture trained to predict the next token.

Evaluating the performance of LLMs is as important as training them. It helps us understand how well the model has learned from the training data and how well it can generalize to new data.

LLMs have opened up new possibilities in the field of machine learning. They are a testament to how far we’ve come since the early days of AI and a glimpse into what the future might hold. As we continue to explore and push the boundaries of what’s possible with LLMs, who knows what incredible discoveries we’ll make next?

Key Takeaways

Large Language Models (LLMs) like GPT-3, Falcon, and others have revolutionized natural language processing by enabling machines to understand and generate human-like text.
Training LLMs from scratch involves collecting massive datasets, preprocessing, defining model architecture, hyperparameter tuning, and evaluation.
Challenges in training LLMs include infrastructure requirements, such as powerful GPUs and substantial costs, as well as understanding scaling laws to optimize model size and dataset.
One can evaluate LLMs through intrinsic methods like perplexity and extrinsic methods like evaluating task-specific performance on datasets such as AI2 Reasoning Challenge and TruthfulQA.

Frequently Asked Questions

Q1. What is a large language model?

A. A large language model is a type of artificial intelligence that can understand and generate human-like text. It’s typically trained on vast amounts of text data and learns to predict and generate coherent sentences based on the input it receives.

Q2. What is the difference between NLP and large language models?

A. Natural Language Processing (NLP) is a field of artificial intelligence that focuses on the interaction between computers and humans through natural language. Large language models are a subset of NLP, specifically referring to models that are exceptionally large and powerful, capable of understanding and generating human-like text with high fidelity.

Q3. Is ChatGPT a large language model?

A. Yes, ChatGPT is a large language model. It’s based on OpenAI’s GPT (Generative Pre-trained Transformer) architecture, which is known for its ability to generate high-quality text across various domains.

Q4. What is the difference between LLM and AI?

A. The main difference between a Large Language Model (LLM) and Artificial Intelligence (AI) lies in their scope and capabilities. AI is a broad field encompassing various technologies and approaches aimed at creating machines capable of performing tasks that typically require human intelligence. LLMs, on the other hand, are a specific type of AI focused on understanding and generating human-like text. While LLMs are a subset of AI, they specialize in natural language understanding and generation tasks.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
