Finetuning Llama 3 with Odds Ratio Preference Optimization

Ajay Kumar Reddy 02 May, 2024
12 min read


Large Language Models are not ready to use after a single training run; they go through several training stages before they perform well. Stages such as Supervised Fine-Tuning (SFT) and Preference Alignment teach the model new tasks and align its responses with human preferences. However, each stage takes a significant amount of time and computing resources. One solution is Odds Ratio Preference Optimization (ORPO), which combines SFT and preference tuning in a single step. This guide explores ORPO and its potential to reduce the time taken to train Large Language Models.

Learning Objectives

  • Understand the typical flow of training a Large Language Model (LLM), including pretraining, supervised fine-tuning, and preference alignment.
  • Identify different training and fine-tuning methods for LLMs, such as supervised fine-tuning and preference optimization (e.g., PPO, DPO, ORPO).
  • Explain the concept of Odds Ratio Preference Optimization (ORPO) and its role in reducing training time and computational resources by combining supervised fine-tuning and preference optimization in a single step.
  • Describe the key components of ORPO, including the odds ratio term in the training loss and its integration with supervised fine-tuning.
  • Learn how to prepare data for finetuning an LLM with ORPO, including data formatting and preprocessing steps.
  • Understand the process of loading and training an LLM with ORPO, including model loading, patching the DPOTrainer, and initiating the training process.
  • Evaluate the effectiveness of ORPO in improving the efficiency and coherence of LLMs by aligning them more closely with human preferences.

This article was published as a part of the Data Science Blogathon.

Typical Flow of LLM Training

  • Pretraining:
    • Large Language Models are pretrained on a large corpus of text data like Wikipedia.
    • This is unsupervised training where the model learns about word sequences and their probabilities.
  • Instruction Tuning:
    • The model is trained to follow instructions provided in the data.
    • Data includes instructions and their corresponding answers.
    • This training enables the model to respond appropriately to user prompts, acting like a chat model.
  • Supervised Fine-Tuning:
    • LLM is trained on domain-specific or task-specific data.
    • Example: fine-tuning to mask Personally Identifiable Information (PII) data.
    • Data contains both masked and unmasked versions of text, allowing the model to learn the task.
  • Alignment-Tuning or Preference Alignment:
    • Aimed at aligning model responses to generate responsible and clean answers.
    • Preference optimization methods include PPO (Proximal Policy Optimization), DPO (Direct Preference Optimization), and ORPO (Odds Ratio Preference Optimization).

So an LLM goes through several fine-tuning stages. Each stage consumes a lot of time, and the larger the dataset, the longer the training. Supervised Fine-Tuning and Preference Alignment in particular, when performed as separate steps, account for a large share of the training time.

Introduction to ORPO

ORPO, short for Odds Ratio Preference Optimization, aims to reduce both the training time and the resources required for preference optimization. It does this by combining Supervised Fine-Tuning and preference optimization in a single step. ORPO needs neither the separate reward model used in PPO-based training nor the frozen reference model used in DPO. The idea behind ORPO is that SFT, given a small extra penalty on rejected responses, is powerful enough to steer the model toward the chosen responses and away from the rejected ones. The new loss therefore adds an odds ratio term to the usual SFT loss.


The odds ratio term in ORPO is built from the odds of the model generating an output sequence y given an input sequence x, that is, the probability of generating y divided by the probability of not generating it. An odds value of n means the model is n times more likely to generate the sequence y than not. The odds ratio of the chosen response over the rejected response then measures how much more strongly the model favours the chosen response than the rejected one.
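To make the odds and the odds ratio concrete, here is a small sketch in plain Python. The probabilities are made-up illustration values, not outputs of a real model, and the helper names are our own.

```python
def odds(p):
    """Odds of an event with probability p: how many times more likely it is to happen than not."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return p / (1.0 - p)


def odds_ratio(p_chosen, p_rejected):
    """Odds of the chosen response divided by the odds of the rejected response."""
    return odds(p_chosen) / odds(p_rejected)


# Illustrative probabilities only (assumed values)
p_chosen, p_rejected = 0.75, 0.5

print(odds(p_chosen))                    # 3.0: three times more likely to be generated than not
print(odds(p_rejected))                  # 1.0: equally likely to be generated or not
print(odds_ratio(p_chosen, p_rejected))  # 3.0
```

An odds ratio above 1 means the model favours the chosen response; a value below 1 means it favours the rejected one.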

The log of this odds ratio is used rather than the raw ratio, since sequence probabilities are very small numbers and are easier to work with on a log scale. A sigmoid is then applied to the log odds ratio, and the negative log of the result gives the odds ratio loss. This term is called the ORPO loss here and is added to the SFT loss, weighted by a tunable hyperparameter lambda.
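For reference, the objective described above can be written out as follows. This is our transcription of the loss as we understand it from the ORPO paper, not a reproduction of the article's original figure. Here y_w is the chosen response, y_l the rejected response, sigma the sigmoid function, and lambda the weighting hyperparameter.

```latex
\begin{aligned}
\mathrm{odds}_\theta(y \mid x) &= \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)} \\
\mathcal{L}_{OR} &= -\log \sigma\!\left( \log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)} \right) \\
\mathcal{L}_{ORPO} &= \mathbb{E}_{(x,\, y_w,\, y_l)} \left[ \mathcal{L}_{SFT} + \lambda \cdot \mathcal{L}_{OR} \right]
\end{aligned}
```

The first line defines the odds, the second is the odds ratio term, and the third adds it to the SFT (negative log-likelihood) loss on the chosen response.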


The ORPOTrainer fine-tunes the Large Language Model to minimize this combined loss: the negative log-likelihood (SFT) loss plus the ORPO loss. Training pulls the model toward the chosen response and pushes it away from the rejected one, without an additional reward model. Because preference alignment no longer needs a separate stage, the compute and time required to tune a Large Language Model go down.
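The combined loss can be sketched for a single (chosen, rejected) pair in plain Python. This is a simplified illustration, not the trl implementation: it assumes we already have the length-normalised (mean per-token) log-probability of each response, and all function names are our own.

```python
import math


def log_odds(avg_logp):
    """log(p / (1 - p)) computed from log p, for a probability p strictly below 1."""
    return avg_logp - math.log1p(-math.exp(avg_logp))


def softplus(t):
    """Numerically stable log(1 + exp(t))."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def odds_ratio_loss(logp_chosen, logp_rejected):
    """-log(sigmoid(log odds ratio)) for one (chosen, rejected) pair."""
    log_or = log_odds(logp_chosen) - log_odds(logp_rejected)
    return softplus(-log_or)  # -log(sigmoid(z)) == log(1 + exp(-z))


def combined_loss(logp_chosen, logp_rejected, lam=0.1):
    """SFT loss on the chosen response plus the weighted odds ratio term."""
    nll_chosen = -logp_chosen  # negative log-likelihood of the chosen response
    return nll_chosen + lam * odds_ratio_loss(logp_chosen, logp_rejected)


# Assumed mean log-probabilities, for illustration only
print(combined_loss(logp_chosen=-0.5, logp_rejected=-1.5))
```

When the model already prefers the chosen response, the odds ratio term is small and the loss is dominated by the SFT term; when it prefers the rejected response, the extra term grows.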

Finetuning Llama 3 with ORPO – Data Preparation

We will now proceed with the steps of fine-tuning Llama 3 with ORPO.

Step1: Installing Libraries

In this section, we will finetune the newly launched Llama 3 with ORPO. We will work in a Kaggle Notebook and start by installing the following libraries.

!pip install -U -q xformers --index-url
!pip install -q "unsloth[kaggle-new] @ git+"
!pip install -q datasets trl transformers accelerate huggingface-cli wandb
  • xformers: A library from Meta that provides optimized, composable Transformer building blocks, such as memory-efficient attention.
  • unsloth: The library we will use to train Llama 3. Unsloth is known to speed up the training of Large Language Models and reduce GPU memory consumption.
  • datasets: A library from HuggingFace that we will use to download the dataset to finetune on.
  • trl: A library from HuggingFace for training Large Language Models, including preference-based training.
  • transformers: We will work with this library to download the tokenizer and model from HuggingFace.
  • accelerate: Handles device placement and mixed-precision or multi-GPU execution for training and inference.
  • huggingface-cli: The command-line tool we use to log in to HuggingFace, because downloading Llama 3 requires authentication.
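After the installation cell has run, a quick check like the one below can confirm that the packages are importable before continuing. It is a minimal sketch using only the standard library; the list of module names is our assumption about what the install commands above provide.

```python
import importlib.util

# Import names we expect after installation (assumed; adjust to your environment)
REQUIRED_MODULES = ["xformers", "unsloth", "datasets", "trl",
                    "transformers", "accelerate", "wandb"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
print("Missing:", missing if missing else "none")
```

If anything is reported as missing, rerun the corresponding pip command before moving on.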

Step2: Sign in to Your HuggingFace Account

To work with the Meta model, we first need to accept its terms and conditions. Go to this link, sign in with your HuggingFace account, and accept the agreement policy. After this, we will log in to our HuggingFace account through the huggingface-cli command.

Step3: Dataset Loading and Data Preprocessing

Before loading the dataset, we need to log in with our HuggingFace account so we can access and download Meta’s Llama 3 8B model and tokenizer. The command for this is:

!huggingface-cli login --token $your_api_key

Here in the above command, provide your HuggingFace token. This token can be obtained from the HuggingFace website. Running this command will log us into our HuggingFace account and we see the following output:

[Image: output of the huggingface-cli login command]
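To avoid pasting the token directly into a notebook cell, it can be read from an environment variable instead. The sketch below is one way to do that; the variable name HF_TOKEN and the helper name are assumptions for illustration, and the login call shown in the comments requires the huggingface_hub package.

```python
import os


def resolve_hf_token(env_var="HF_TOKEN"):
    """Read the HuggingFace token from an environment variable (name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your HuggingFace token")
    return token


# Usage in the notebook (requires the huggingface_hub package):
# from huggingface_hub import login
# login(token=resolve_hf_token())
```

This keeps the token out of the saved notebook, which matters if the notebook is shared.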

Step4: Download the Tokenizer

Next, we will download the tokenizer of the model. The code for this will be:

from transformers import AutoTokenizer

base_model = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_model)
  • We import the AutoTokenizer class from the transformers library.
  • We first define the model name in the variable base_model.
  • Then we call the AutoTokenizer.from_pretrained() function and pass it the base_model variable.

Running the code will download the Llama3 Tokenizer from the Meta HuggingFace repository. This tokenizer is necessary to apply the chat format of Llama 3 for the dataset that we will be working with and to tokenize them.

Step5: Download the Dataset

Now we will download the dataset that we will finetune our Llama 3 on. The code for this will be:

from datasets import load_dataset

dataset_name = "jondurbin/truthy-dpo-v0.1"
dataset = load_dataset(dataset_name)
  • Here we import the load_dataset function from the datasets library.
  • Then we provide the path for our dataset to the dataset_name variable.
  • This dataset_name variable is given to the load_dataset() function, which downloads the dataset from the HuggingFace hub.

Running this code will download the “truthy-dpo-v0.1” dataset from the HuggingFace hub and store it in the variable dataset. A few rows from the dataset can be seen below:

[Image: sample rows from the truthy-dpo-v0.1 dataset]

We will be working with the four columns in the dataset. These are the system, prompt, chosen, and rejected columns. The system and the prompt columns contain the system message and the user prompt. The chosen column contains the chosen response and the rejected column contains the rejected response.
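Before formatting, it can help to confirm that each row really carries these four columns. The sketch below does this on a toy row; the row contents and the helper name are invented for illustration and are not taken from the dataset.

```python
REQUIRED_COLUMNS = ("system", "prompt", "chosen", "rejected")


def missing_columns(row, required=REQUIRED_COLUMNS):
    """Return the required columns that are absent or empty in a row (assumes string values)."""
    return [col for col in required if not (row.get(col) or "").strip()]


# Toy row invented for illustration
toy_row = {
    "system": "You are an unbiased, honest assistant.",
    "prompt": "Is the Great Wall of China visible from space?",
    "chosen": "Not with the naked eye; it is too narrow to pick out from orbit.",
    "rejected": "Yes, it is clearly visible from the Moon.",
}

print(missing_columns(toy_row))  # []
```

With the real data, the same check could be applied to a few rows of dataset["train"] before mapping the formatting function over it.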

Step6: Creating Columns

We need to create new chosen and rejected columns, each containing the system message, the user prompt, and the chosen or the rejected response respectively. The code for this can be seen below:

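def format_chat_template(row):
    # Build two conversations that differ only in the assistant reply
    message_chosen = [{"role":"system","content":row['system']},
                      {"role":"user","content":row['prompt']},
                      {"role":"assistant","content":row['chosen']}]
    message_rejected = [{"role":"system","content":row['system']},
                        {"role":"user","content":row['prompt']},
                        {"role":"assistant","content":row['rejected']}]
    prompt = row['system'] + '\n' + row['prompt']
    row["chosen"] = tokenizer.apply_chat_template(message_chosen, tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(message_rejected, tokenize=False)
    row['prompt'] = prompt
    return row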

The provided code defines a function called format_chat_template that takes a row of data as input and returns a modified version of that row.

Inside the function, two lists are created:

  • message_chosen: This list represents a chat conversation with the “chosen” response as the assistant message. It contains three dictionaries, one each for the system, user, and assistant messages.
  • message_rejected: This list represents a chat conversation with the “rejected” response as the assistant message. Like message_chosen, it contains three dictionaries for the system, user, and assistant messages.
  • The next line creates a string called prompt by concatenating the system and prompt columns from the input row, separated by a newline. This string represents the system’s message followed by the user’s prompt.
  • The function then calls the apply_chat_template method of the tokenizer object on the message_chosen and message_rejected lists. This method takes the messages and formats them according to the chat format that Llama 3 expects.
  • Here we pass tokenize=False because we want the formatted text back, not the tokens.
  • Finally, the modified row is returned as output.
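To give a feel for what the formatted text looks like, the sketch below hand-writes an approximation of the Llama 3 chat layout. It is not the tokenizer's own template: the real string comes from tokenizer.apply_chat_template and may differ in details, and the function name and toy messages are our own.

```python
def render_llama3_chat(messages):
    """Approximate the Llama 3 chat layout for a list of {"role", "content"} dicts."""
    text = "<|begin_of_text|>"
    for message in messages:
        # Each turn starts with a role header and ends with an end-of-turn token
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        text += message["content"].strip() + "<|eot_id|>"
    return text


# Toy conversation invented for illustration
toy_messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]

rendered = render_llama3_chat(toy_messages)
print(rendered)
```

The chosen and rejected columns produced by format_chat_template are strings of this general shape, identical up to the assistant turn.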

Step7: Applying Function to Dataset

Now, we will apply this function to the Dataset that we have just downloaded. For this, we work with the following code:

import os

dataset = dataset.map(
    format_chat_template,       # applied to every row
    num_proc= os.cpu_count(),   # run in parallel on all CPU cores
)

Here, we map the function that we have just defined, to the dataset that we have just downloaded from HuggingFace. To map it, we call the map function of the dataset object and pass it the function for formatting and the CPU count, so that execution can be done in parallel. Running this code will modify the data within the dataset with the required formatting for the training process.
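After mapping, a simple sanity check is that the chosen and rejected texts of a row share the same system message and prompt and only diverge at the response. The sketch below shows one way to test this on toy strings; the strings and helper names are invented for illustration.

```python
import os


def shared_prefix(chosen, rejected):
    """Longest common leading substring of the two formatted conversations."""
    return os.path.commonprefix([chosen, rejected])


def same_context(chosen, rejected, prompt_text):
    """True when the prompt sits inside the part both conversations have in common."""
    return prompt_text in shared_prefix(chosen, rejected)


# Toy strings standing in for one formatted row
toy_chosen = "system: Be truthful.\nuser: Is the Earth flat?\nassistant: No, it is roughly spherical."
toy_rejected = "system: Be truthful.\nuser: Is the Earth flat?\nassistant: Yes, it is flat."

print(same_context(toy_chosen, toy_rejected, "Is the Earth flat?"))  # True
```

On the real dataset, the same check could be run against row["chosen"], row["rejected"], and the original user prompt for a handful of rows.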

Finally, we are done with the data pre-processing part. Next, we will download the Llama-3 8 Billion model and train it with this dataset.

Model Loading and Training

In this section, we will download the model and start the training process.

Step1: Downloading the Model

First, we will begin with downloading the model. The code for this will be:

from unsloth import FastLanguageModel
import torch

max_seq_length = 2048
dtype = None 
load_in_4bit = True 

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit", 
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = secret_value_0,  # your HuggingFace access token
)
  • We start by importing FastLanguageModel from the unsloth library, along with PyTorch.
  • Then we define 3 variables: max_seq_length, the maximum sequence length (in tokens) the model will handle; dtype, which we set to None for auto-detection; and load_in_4bit, where True means we load the model quantized to 4-bit.
  • Now, we call .from_pretrained() on FastLanguageModel and pass it the model name, these three variables, and our HuggingFace token.

Step2: Quantization

Running the above code will download the Llama 3 8B model in 4-bit quantized form and also fetch the matching tokenizer.
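A rough back-of-the-envelope calculation shows why 4-bit loading matters on a Kaggle GPU. The sketch below estimates weight memory only; it assumes a round 8 billion parameters and ignores quantization overhead, activations, LoRA adapters, and optimizer state.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS = 8_000_000_000  # assumed round figure for an 8B model

print(weight_memory_gb(N_PARAMS, 16))  # 16.0 GB in 16-bit precision
print(weight_memory_gb(N_PARAMS, 4))   # 4.0 GB in 4-bit precision
```

The 4-bit weights take roughly a quarter of the memory of 16-bit weights, which leaves room for the adapters and training state.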

model = FastLanguageModel.get_peft_model(
    model,
    r = 16, 
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, 
    bias = "none",    
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,  
    loftq_config = None, 
)

Now we wrap the model with LoRA adapters using PEFT. For this, we call the .get_peft_model() function of the FastLanguageModel class and pass it the following parameters:

  • model: This is the model that we have just downloaded.
  • r: The rank of the LoRA matrices. We provide a value of 16.
  • target_modules: The list of modules to attach LoRA adapters to. We target all the attention projection layers and the feed-forward (MLP) projection layers.
  • lora_alpha: The LoRA scaling factor. We set it to 16; it is usually equal to or double the rank.
  • lora_dropout: The dropout probability applied inside the LoRA layers. It is set to 0, the value Unsloth is optimized for.
  • bias: Whether to train bias terms. It is set to "none", again the setting Unsloth is optimized for.
  • use_gradient_checkpointing: Set to "unsloth" to use Unsloth’s gradient checkpointing, which lowers memory use.
  • random_state: The random seed, fixed for reproducibility.
  • use_rslora: Whether to enable Rank-Stabilized LoRA. Set to False.
  • loftq_config: Set to None because we do not use a LoftQ configuration.

Running this code will create the LoRA Adapters, which we will be training with the dataset that we have downloaded.
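To see how small the adapters are compared with the full model, we can estimate the number of trainable LoRA parameters. A rank-r adapter on a weight of shape (d_in, d_out) adds r * (d_in + d_out) parameters. The layer shapes and layer count below are our assumptions about the Llama 3 8B architecture, not values stated in this article.

```python
# Assumed Llama 3 8B shapes (input_dim, output_dim) for each targeted projection
PROJECTION_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32  # assumed number of transformer blocks


def lora_params(shapes, rank, num_layers):
    """Trainable parameters added by rank-`rank` LoRA adapters on every listed projection."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_layer * num_layers


total = lora_params(PROJECTION_SHAPES, rank=16, num_layers=NUM_LAYERS)
scaling = 16 / 16  # lora_alpha / r, the factor applied to the adapter output

print(f"{total:,} trainable parameters")  # 41,943,040 trainable parameters
print(scaling)                            # 1.0
```

Under these assumptions the adapters hold about 42 million parameters, roughly half a percent of an 8 billion parameter model.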

Step3: Patching DPOTrainer

Let’s start by patching the DPOTrainer.

from unsloth import PatchDPOTrainer
PatchDPOTrainer()

At the time of writing, the unsloth library has not released its own ORPO trainer. Instead, the PatchDPOTrainer function is imported; calling it patches the existing DPOTrainer and ORPOTrainer from the HuggingFace trl library to make them faster and more memory efficient.

from trl import ORPOConfig, ORPOTrainer

orpo_trainer = ORPOTrainer(
    model = model,
    args = ORPOConfig(
        # training arguments go here; they are described in the list below
    ),
    train_dataset = dataset["train"],
    tokenizer = tokenizer,
)

We start by importing the ORPOTrainer and ORPOConfig from the trl library. Then we set the parameters inside the ORPOTrainer.

These include:

  • output_dir: The output directory in which to store the LoRA adapters.
  • max_prompt_length: The maximum prompt length. This is set to 512.
  • max_length: The maximum length of the full sequence. It is set to 1024.
  • logging_steps: We set this to 1, so we can see logs such as the training loss at every single step.
  • per_device_train_batch_size: The number of examples in each batch per GPU. We set this to 2.
  • gradient_accumulation_steps: We set this to 2, so gradients are accumulated over 2 batches before each weight update.
  • remove_unused_columns: Controls whether dataset columns that the model does not use are dropped before training.
  • optim: The optimizer to use while training. We will work with the paged_adamw_8bit optimizer.
  • lr_scheduler_type: The type of learning rate scheduler to work with. We go with cosine.
  • beta: The weight of the odds ratio term in the ORPO loss (the lambda discussed earlier). 0.1 is the recommended value.
  • We set gradient_checkpointing to True.
  • We set fp16 to True, because the GPU we are working on supports it. Because we do not have any evaluation data, we set do_eval=False, and we train for 1 full epoch.

So, we pass this ORPOConfig, which holds the training arguments, to the ORPOTrainer along with the dataset and the tokenizer. Running this code creates the ORPOTrainer, which is then ready to start training.
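The arguments described above can be collected as follows. This is a sketch assembled from the values stated in the list, not the author's original cell: the output_dir path is a hypothetical placeholder, and remove_unused_columns is left out because its value is not given in the text.

```python
# Keyword arguments for ORPOConfig, taken from the descriptions above
orpo_kwargs = dict(
    output_dir="./orpo_llama3_outputs",  # hypothetical path, choose your own
    max_prompt_length=512,
    max_length=1024,
    logging_steps=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=2,
    optim="paged_adamw_8bit",
    lr_scheduler_type="cosine",
    beta=0.1,                            # weight of the odds ratio term
    gradient_checkpointing=True,
    fp16=True,
    do_eval=False,
    num_train_epochs=1,
)

# Number of examples that contribute to each weight update on one GPU
effective_batch_size = (orpo_kwargs["per_device_train_batch_size"]
                        * orpo_kwargs["gradient_accumulation_steps"])
print(effective_batch_size)  # 4

# In the notebook, with trl installed:
# from trl import ORPOConfig
# args = ORPOConfig(**orpo_kwargs)
```

With a batch size of 2 and 2 accumulation steps, each optimizer update sees 4 examples per GPU.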

Step4: Initiate Training

We will initiate the training with the following code.

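orpo_trainer.train()

[Image: training log showing the loss and reward metrics for each step]
[Image: plot of the training loss against training steps]
[Image: plot of the odds ratio against training steps]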

Calling .train() on the orpo_trainer starts the training process. In the first picture, we can see training metrics such as the training loss, rewards/chosen, rewards/rejected, and so on. A total of 247 steps were taken to complete one epoch of training on the entire dataset. In the second picture, we can see that the training loss comes down as the number of steps increases.

The odds_ratio in the third picture fluctuates but increases overall with the number of steps. This indicates that the model is becoming more likely to generate the chosen responses than the rejected ones, which is the goal of alignment tuning with ORPO.


Conclusion

Odds Ratio Preference Optimization (ORPO) presents a promising approach to efficiently fine-tune large language models like Llama 3 by combining Supervised Fine-Tuning and Preference Optimization in a single step. By introducing an odds ratio term in the training loss, ORPO effectively balances the selection of preferred outputs over rejected ones, all while eliminating the need for a separate reward model. This streamlined approach not only reduces the training time and computational resources required but also leads to a more coherent and efficient model. ORPO demonstrates its potential in aligning language models more closely with human preferences, optimizing their ability to generate high-quality, relevant responses in various applications.

Key Takeaways

  • ORPO combines Supervised Fine-Tuning and Preference Optimization into a single training step, significantly reducing the time and resources required to train large language models.
  • By incorporating an odds ratio term in the training loss, ORPO guides the model towards preferred responses while avoiding rejected ones, thus enhancing the quality of generated text.
  • ORPO can be applied to various large language models, such as Llama 3, showcasing its potential to enhance the training process for a range of NLP tasks and applications.
  • Integrating ORPO into existing training workflows becomes easy using libraries such as unsloth and trl, thereby streamlining the training process.
  • The combination of negative log-likelihood and ORPO loss allows the model to converge toward more suitable responses based on the chosen and rejected sequences.

Frequently Asked Questions

Q1. What is ORPO in the context of Large Language Models?

A. ORPO stands for Odds Ratio Preference Optimization, a method that combines supervised fine-tuning and preference optimization in a single step for efficient training.

Q2. Why is ORPO beneficial for training Large Language Models (LLMs)?

A. ORPO reduces both training time and computing resources by combining two fine-tuning steps, which streamlines the process and eliminates the need for a separate reward model.

Q3. How does ORPO differ from other preference optimization methods like PPO or DPO?

A. Unlike PPO-based training, ORPO needs no separate reward model, and unlike DPO, it needs no frozen reference model. It integrates an odds ratio term in the training loss to steer models toward chosen responses and away from rejected ones.

Q4. What is the main advantage of using ORPO for LLMs?

A. The main advantage is the reduction in training time and computational resources needed, allowing more efficient fine-tuning of large language models.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
