Parameter-Efficient Fine-Tuning of Large Language Models with LoRA and QLoRA

Shalini Dhote 27 Aug, 2023 • 6 min read

Overview

As we delve deeper into the world of Parameter-Efficient Fine-Tuning (PEFT), it becomes essential to understand the driving forces and methodologies behind this transformative approach. In this article, we will explore how PEFT methods optimize the adaptation of Large Language Models (LLMs) to specific tasks. We will unravel the advantages and disadvantages of PEFT, delve into the intricate categories of PEFT techniques, and decipher the inner workings of two remarkable techniques: Low-Rank Adaptation (LoRA) and Quantized Low-Rank Adaptation (QLoRA). This journey aims to equip you with a comprehensive understanding of these techniques, enabling you to harness their power for your language processing endeavors.


Learning Objectives:

  • Understand the concept of pretrained language models and fine-tuning in NLP.
  • Explore the challenges posed by computational and memory requirements in fine-tuning large models.
  • Learn about Parameter-Efficient Fine-Tuning (PEFT) techniques such as LoRA and QLoRA.
  • Discover the advantages and disadvantages of PEFT methods.
  • Explore various PEFT methods, including T-Few, AdaMix, and MEFT.
  • Understand the working principles of LoRA and QLoRA.
  • Learn how QLoRA introduces quantization to enhance parameter efficiency.
  • Explore practical examples of fine-tuning using LoRA and QLoRA.
  • Gain insights into the applicability and benefits of PEFT techniques.
  • Understand the future prospects of parameter-efficient fine-tuning in NLP.

Introduction

In the exciting world of natural language processing, large-scale pretrained language models (LLMs) have revolutionized the field. However, fine-tuning such enormous models on specific tasks has proven challenging because of their high computational costs and storage requirements. To address this, researchers have turned to Parameter-Efficient Fine-Tuning (PEFT) techniques, which aim to reach high task performance while training far fewer parameters.

Pretrained LLMs and Fine-Tuning

Pretrained LLMs are language models trained on vast amounts of general-domain data, making them adept at capturing rich linguistic patterns and knowledge. Fine-tuning adapts these pretrained models to specific downstream tasks by continuing training on a task-specific dataset, which is typically much smaller and more focused than the original training data. During fine-tuning, the model's parameters are adjusted to optimize its performance for the target task.
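
As a point of reference, below is a minimal, illustrative sketch of conventional full fine-tuning, in which every weight of the pretrained model receives gradients; the model name and toy data are assumptions for illustration only. Storing gradients and optimizer state for all of these parameters is exactly the cost that PEFT methods try to avoid.

import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Illustrative base model and toy sentiment data
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])

# Conventional fine-tuning: the optimizer tracks every one of the ~110M parameters
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()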

Parameter Efficient Fine-Tuning (PEFT)

PEFT methods have emerged as an efficient approach to fine-tune pretrained LLMs while significantly reducing the number of trainable parameters. These techniques balance computational efficiency and task performance, making it feasible to fine-tune even the largest LLMs without compromising on quality.


Advantages and Disadvantages of PEFT

PEFT brings several practical benefits, such as reduced memory usage, storage cost, and inference latency. It allows multiple tasks to share the same pre-trained model, minimizing the need for maintaining independent instances. However, PEFT might introduce additional training time compared to traditional fine-tuning methods, and its performance could be sensitive to hyperparameter choices.

Types of PEFT

Various PEFT methods have been developed to cater to different requirements and trade-offs. Notable examples include T-Few, which attains higher accuracy at lower computational cost, and AdaMix, a general method that tunes a mixture of adaptation modules for better performance across different tasks.

Exploring Different PEFT Methods

Let’s delve into the details of some prominent PEFT methods:

  • T-Few: This method uses (IA)3, a PEFT approach that rescales inner activations with learned vectors (see the illustrative sketch after this list). It achieves super-human performance on some benchmarks while using significantly fewer FLOPs during inference than traditional fine-tuning.
  • AdaMix: A general PEFT method that tunes a mixture of adaptation modules, like Houlsby or LoRA, to improve downstream task performance for fully supervised and few-shot tasks.
  • MEFT: A memory-efficient fine-tuning approach that makes LLMs reversible, avoiding caching intermediate activations during training and significantly reducing memory footprint.
  • QLoRA: An efficient fine-tuning technique that quantizes the pretrained LLM to 4-bit precision and injects trainable low-rank adapters into each layer, greatly reducing both the number of trainable parameters and the GPU memory requirement.
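
To make the (IA)3 idea behind T-Few concrete, here is a minimal, illustrative sketch of rescaling a frozen layer's output with a learned vector. The class name and dimensions are assumptions for illustration, not code from the T-Few release.

import torch
import torch.nn as nn

class IA3ScaledLinear(nn.Module):
    """(IA)3-style adapter: a frozen linear layer whose output is rescaled
    elementwise by a small learned vector."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False                              # pretrained weights stay frozen
        self.scale = nn.Parameter(torch.ones(linear.out_features))  # the only trainable parameters

    def forward(self, x):
        return self.linear(x) * self.scale                       # learned elementwise rescaling

layer = IA3ScaledLinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 768 trainable values for a 768x768 layer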

Low-Rank Adaptation (LoRA)

LoRA is an innovative technique designed to efficiently fine-tune pre-trained language models by injecting trainable low-rank matrices into each layer of the Transformer architecture. LoRA aims to reduce the number of trainable parameters and the computational burden while maintaining or improving the model’s performance on downstream tasks.

How Does LoRA Work?

  1. Starting Point Preservation: In LoRA, the starting point hypothesis is crucial. It assumes that the pretrained model’s weights are already close to the optimal solution for the downstream tasks. Thus, LoRA freezes the pretrained model’s weights and focuses on optimizing trainable low-rank matrices instead.
  2. Low-Rank Matrices: LoRA introduces a pair of low-rank matrices, A and B, into the self-attention module of each layer, so that the weight update is represented by their product, ∆W = BA. These low-rank matrices act as adapters, allowing the model to adapt and specialize for specific tasks while minimizing the number of additional parameters needed (a minimal PyTorch sketch follows this list).
  3. Rank-Deficiency: An essential insight behind LoRA is the rank-deficiency of weight changes (∆W) observed during adaptation. This suggests that the model’s adaptation involves changes that can be effectively represented with a much lower rank than the original weight matrices. LoRA leverages this observation to achieve parameter efficiency.
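
The following minimal PyTorch sketch shows how these pieces fit together: the pretrained weight W stays frozen, and only the low-rank matrices A and B are trained, with their product BA acting as the weight update ∆W. The class is illustrative, not the reference LoRA implementation.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen pretrained linear layer plus a trainable low-rank update BA."""
    def __init__(self, linear: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = linear
        for p in self.linear.parameters():
            p.requires_grad = False                                    # freeze W and its bias
        self.A = nn.Parameter(torch.randn(r, linear.in_features) * 0.01)  # r x k, small random init
        self.B = nn.Parameter(torch.zeros(linear.out_features, r))        # d x r, zero init so BA = 0 at the start
        self.scaling = alpha / r

    def forward(self, x):
        # Wx plus the low-rank correction (BA)x, scaled by alpha / r
        return self.linear(x) + (x @ self.A.T @ self.B.T) * self.scaling

lora_layer = LoRALinear(nn.Linear(768, 768), r=8)
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
print(trainable)  # 12,288 trainable values versus 589,824 in the full 768x768 weight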

Advantages of LoRA

  1. Reduced Parameter Overhead: Using low-rank matrices instead of fine-tuning all parameters, LoRA significantly reduces the number of trainable parameters, making it much more memory-efficient and computationally cheaper.
  2. Efficient Task-Switching: LoRA allows the pretrained model to be shared across multiple tasks, reducing the need to maintain separate fine-tuned instances for each task. This facilitates quick and seamless task-switching during deployment, reducing storage and switching costs.
  3. No Inference Latency: LoRA’s linear design ensures no additional inference latency compared to fully fine-tuned models, making it suitable for real-time applications.
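
The no-added-latency property comes from the fact that, once training is finished, the low-rank update can be folded back into the frozen weight. A short sketch, reusing the illustrative LoRALinear class defined above:

import torch

def merge_lora(lora_layer):
    """Fold the update into the base weight: W_merged = W + scaling * (B @ A).
    The result is an ordinary nn.Linear with no extra matmuls at inference time."""
    merged = torch.nn.Linear(lora_layer.linear.in_features, lora_layer.linear.out_features)
    with torch.no_grad():
        merged.weight.copy_(lora_layer.linear.weight + lora_layer.scaling * (lora_layer.B @ lora_layer.A))
        merged.bias.copy_(lora_layer.linear.bias)
    return merged

plain_layer = merge_lora(lora_layer)
x = torch.randn(1, 768)
print(torch.allclose(plain_layer(x), lora_layer(x), atol=1e-5))  # True: same outputs, standard latency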

Quantized Low-Rank Adaptation (QLoRA)

QLoRA is an extension of LoRA that further introduces quantization to enhance parameter efficiency during fine-tuning. It builds on the principles of LoRA while introducing 4-bit NormalFloat (NF4) quantization and Double Quantization techniques.

  • NF4 Quantization: NF4 quantization leverages the inherent distribution of pretrained neural network weights, which are usually zero-centered normal distributions with specific standard deviations. By rescaling all weights into the fixed range NF4 covers (-1 to 1), NF4 quantization quantizes the weights effectively without the need for expensive quantile estimation algorithms.
  • Double Quantization: Double Quantization addresses the memory overhead of the quantization constants. By quantizing the quantization constants themselves, it significantly reduces the memory footprint without compromising performance. The process uses 8-bit floats with a block size of 256 for the second quantization step, resulting in substantial memory savings.
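
A simplified numeric sketch of these two steps is shown below; the block sizes and the plain symmetric rounding used for the second step are illustrative simplifications, not the exact QLoRA kernels.

import torch

weights = torch.randn(65536)                 # pretend layer weights, roughly zero-centered

# First quantization step: split the weights into blocks and keep one absmax constant per block
block_size = 64
blocks = weights.view(-1, block_size)
absmax = blocks.abs().max(dim=1).values      # 1,024 fp32 constants, one per block
normalized = blocks / absmax.unsqueeze(1)    # every block now lies in [-1, 1], the range NF4 covers

# Double Quantization: the absmax constants are themselves quantized in blocks of 256,
# shrinking the per-weight overhead of storing them (QLoRA uses 8-bit floats here;
# plain symmetric rounding is used below just to show the idea)
const_blocks = absmax.view(-1, 256)
const_absmax = const_blocks.abs().max(dim=1).values
quantized_constants = torch.round(const_blocks / const_absmax.unsqueeze(1) * 127)

print(normalized.min().item() >= -1.0, normalized.max().item() <= 1.0)  # True True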

Advantages of QLoRA

  1. Further Memory Reduction: QLoRA achieves even higher memory efficiency by introducing quantization, making it particularly valuable for deploying large models on resource-constrained devices.
  2. Preserving Performance: Despite its parameter-efficient nature, QLoRA retains high model quality, performing on par or even better than fully fine-tuned models on various downstream tasks.
  3. Applicability to Various LLMs: QLoRA is a versatile technique applicable to different language models, including RoBERTa, DeBERTa, GPT-2, and GPT-3, enabling researchers to explore parameter-efficient fine-tuning for various LLM architectures.

Fine-Tuning Large Language Models Using PEFT

Let’s put these concepts into practice with a minimal, illustrative sketch of fine-tuning a masked language model with QLoRA, using the Hugging Face transformers, peft, and bitsandbytes integrations. Treat it as a sketch rather than a production recipe: 4-bit loading needs a CUDA GPU, and the exact API surface may differ slightly between library versions.

# Step 1: Load the pre-trained model and tokenizer, quantizing the base weights to 4-bit NF4
# (assumes a CUDA GPU with the bitsandbytes, accelerate, and peft libraries installed)
import torch
from transformers import BertTokenizer, BertForMaskedLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_name = "bert-base-uncased"
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NF4 quantization
    bnb_4bit_use_double_quant=True,       # Double Quantization
    bnb_4bit_compute_dtype=torch.float16,
)
pretrained_model = BertForMaskedLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)
tokenizer = BertTokenizer.from_pretrained(model_name)

# Step 2: Prepare a toy dataset (reconstruction objective: labels are the input ids,
# with padding positions excluded from the loss)
texts = ["Hello, how are you?", "I am doing well."]
train_encodings = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
labels = train_encodings["input_ids"].clone()
labels[train_encodings["attention_mask"] == 0] = -100
train_encodings["labels"] = labels

# Step 3: Attach trainable LoRA adapters to the frozen, quantized model
pretrained_model = prepare_model_for_kbit_training(pretrained_model)
lora_config = LoraConfig(r=64, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["query", "value"])
model = get_peft_model(pretrained_model, lora_config)
model.print_trainable_parameters()        # only the adapter weights require gradients

# Step 4: Fine-tuning the model (only the adapters are updated)
device = next(model.parameters()).device
optimizer = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-5)
model.train()
for epoch in range(10):
    optimizer.zero_grad()
    outputs = model(**{k: v.to(device) for k, v in train_encodings.items()})
    loss = outputs.loss                   # masked-LM loss computed from the labels
    loss.backward()
    optimizer.step()

# Step 5: Inference with the fine-tuned model
model.eval()
test_input = tokenizer("How are you doing [MASK]?", return_tensors="pt").to(device)
with torch.no_grad():
    logits = model(**test_input).logits
predicted_ids = torch.argmax(logits, dim=-1)
print("Predicted text:", tokenizer.decode(predicted_ids[0]))

Conclusion

Parameter-efficient fine-tuning of LLMs is a rapidly evolving field that addresses the challenges posed by computational and memory requirements. Techniques like LoRA and QLoRA demonstrate innovative strategies to optimize fine-tuning efficiency without sacrificing task performance. These methods offer a promising avenue for deploying large language models in real-world applications, making NLP more accessible and practical than ever before.

Frequently Asked Questions

Q1: What is the goal of parameter-efficient fine-tuning?

A: The goal of parameter-efficient fine-tuning is to adapt pretrained language models to specific tasks while minimizing the computational and memory burden of traditional fine-tuning methods.

Q2: How does Quantized Low-Rank Adaptation (QLoRA) enhance parameter efficiency?

A: QLoRA introduces quantization into the low-rank adaptation process: the frozen base weights are stored in 4-bit NF4 precision and the quantization constants are themselves quantized. This improves memory efficiency while preserving model performance.

Q3: What are the advantages of Low-Rank Adaptation (LoRA)?

A: LoRA reduces parameter overhead, supports efficient task-switching, and adds no extra inference latency, making it a practical solution for parameter-efficient fine-tuning.

Q4: How can researchers benefit from PEFT techniques?

A: PEFT techniques enable researchers to fine-tune large language models efficiently, making the best use of limited computational resources across various downstream tasks.

Q5: Which language models can benefit from QLoRA?

A: QLoRA applies to various language models, including RoBERTa, DeBERTa, GPT-2, and GPT-3, providing parameter-efficient fine-tuning options for different architectures.
As the field of NLP continues to evolve. The parameter-efficient fine-tuning techniques like LORA and QLORA pave the way for more accessible and practical deployment of LLMs across diverse applications.

