Fine-tuning A Tiny-Llama Model with Unsloth

Sunil Kumar Dash 02 Feb, 2024 • 8 min read

Introduction

After the Llama and Mistral models were released, open-source LLMs took much of the limelight away from OpenAI. Since then, multiple models based on the Llama and Mistral architectures have been released, performing on par with proprietary models like GPT-3.5 Turbo, Claude, Gemini, etc. However, these models are too large to run on consumer hardware.

But lately, a new class of LLMs has emerged: models in the sub-7B parameter category. Fewer parameters make them compact enough to run on consumer hardware while keeping efficiency comparable to 7B models. Models like Tiny-Llama-1B, Microsoft’s Phi-2, and Alibaba’s Qwen-3b can be great substitutes for larger models to run locally or deploy on the edge. At the same time, fine-tuning is crucial to bring the best out of any base model for downstream tasks.

Here, we will explore how to fine-tune a base Tiny-Llama model on a cleaned Alpaca dataset.


Learning Objectives

  • Understand fine-tuning and its different methods.
  • Learn about tools and techniques for efficient fine-tuning.
  • Learn about WandB for logging training logs.
  • Fine-tune Tiny-Llama on the Alpaca dataset in Colab.

This article was published as a part of the Data Science Blogathon.

What is LLM Fine-Tuning?

Fine-tuning is the process of making a pre-trained model learn new knowledge. A pre-trained model is a general-purpose model trained on a large amount of data. However, in many cases it fails to perform as intended, and fine-tuning is the most effective way to adapt the model to specific use cases. For example, base LLMs do well at text generation for single-turn QA but struggle with multi-turn conversations the way chat models handle them.

Base models need to be trained on transcripts of dialogues to be able to hold multi-turn conversations. Fine-tuning is essential to mold pre-trained models into different avatars. The quality of fine-tuned models depends on the quality of the data and the capabilities of the base model. There are multiple approaches to fine-tuning, like LoRA, QLoRA, etc.

Let’s briefly go through these concepts.

LoRA

LoRA stands for Low-Rank Adaptation, a popular fine-tuning technique in which, instead of updating all the parameters, we train a small set of additional parameters that form a low-rank approximation of the updates to the original weight matrices. As a result, a LoRA model can be fine-tuned faster and on less compute-intensive hardware.
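To make the idea concrete, here is a minimal, illustrative PyTorch sketch of a LoRA-style linear layer (a simplified sketch, not Unsloth’s or PEFT’s actual implementation). The pre-trained weight stays frozen, and only the two small low-rank matrices A and B receive gradients:

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, r=8, alpha=16):
        super().__init__()
        # Frozen pre-trained weight: excluded from gradient updates
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        # Trainable low-rank factors: A is (r x in), B is (out x r)
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # Output = frozen projection + scaled low-rank update (B @ A)
        return x @ self.weight.T + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

Because B starts at zero, the layer initially behaves exactly like the frozen base layer, and training only has to learn the small low-rank update.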

QLoRA

QLoRA, or Quantized LoRA, goes a step further than LoRA. Instead of a full-precision model, it quantizes the model weights to a lower precision before applying LoRA. Quantization is the process of downcasting higher-bit values to lower-bit values. For example, 4-bit quantization maps the 16-bit weights to 4-bit values.

Quantizing the model leads to a substantial reduction in model size with accuracy comparable to the original model. In QLoRA, we take a quantized model and apply LoRA to it. Models can be quantized in multiple ways, such as through llama.cpp, AWQ, bitsandbytes, etc.
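As an illustration of the quantization step, here is how a causal LM can typically be loaded in 4-bit with Hugging Face transformers and bitsandbytes. This is a generic example (the exact configuration Unsloth applies internally may differ):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.float16,  # compute dtype (bfloat16 on Ampere+)
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "unsloth/tinyllama",                   # any causal LM checkpoint works here
    quantization_config=bnb_config,
    device_map="auto",
)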

Fine-Tuning with Unsloth

Unsloth is an open-source platform for fine-tuning popular Large Language Models faster. It supports popular LLMs, including Llama-2 and Mistral, and their derivatives like Yi, OpenHermes, etc. It implements custom Triton kernels and a manual back-propagation engine to improve the speed of model training.

Here, we will use Unsloth to fine-tune a base 4-bit quantized Tiny-Llama model on the Alpaca dataset. The model is quantized with bitsandbytes, and the kernels are optimized with OpenAI’s Triton.


Logging with WandB

In machine learning, it is crucial to log training and evaluation metrics. This gives us a complete picture of the training run. Weights and Biases (WandB) is an open-source library for visualizing and tracking machine learning experiments. It has a dedicated web app for visualizing training metrics in real time. It also lets us manage production models centrally. We will use WandB only to track our Tiny-Llama fine-tuning run.

To use WandB, sign up for a free account and create an API key.

Now, let’s start fine-tuning our model.

How to Fine-tune Tiny-Llama?

Fine-tuning is a compute-heavy task. It requires a machine with 10-15 GB of VRAM, or you can use Colab’s free Tesla T4 GPU runtime.

Now, install Unsloth and WandB:

%%capture
import torch
major_version, minor_version = torch.cuda.get_device_capability()
!pip install wandb
if major_version >= 8:
    # Use this for new GPUs like Ampere, Hopper GPUs (RTX 30xx, RTX 40xx, A100, H100, L40)
    !pip install "unsloth[colab_ampere] @ git+https://github.com/unslothai/unsloth.git"
else:
    # Use this for older GPUs (V100, Tesla T4, RTX 20xx)
    !pip install "unsloth[colab] @ git+https://github.com/unslothai/unsloth.git"
pass

The next thing is to load the 4-bit quantized pre-trained model with Unsloth.

from unsloth import FastLanguageModel
import torch
max_seq_length = 4096 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/tinyllama-bnb-4bit", # "unsloth/tinyllama" for 16bit loading
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)

This will download the model locally. The 4-bit model is around 760 MB in size.
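As a quick, optional sanity check (not part of the original walkthrough), you can print the parameter count and memory footprint of the loaded model. get_memory_footprint is a standard transformers helper; if the Unsloth-wrapped model does not expose it, the parameter count alone still works:

# Optional: verify the quantized model fits comfortably in memory
num_params = sum(p.numel() for p in model.parameters())
print(f"Parameters: {num_params / 1e9:.2f}B")
print(f"Memory footprint: {model.get_memory_footprint() / 1e6:.0f} MB")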

Now apply PEFT to the 4-bit Tiny-Llama model.

model = FastLanguageModel.get_peft_model(
    model,
    r = 32, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,
    lora_dropout = 0, # Currently only supports dropout = 0
    bias = "none",    # Currently only supports bias = "none"
    use_gradient_checkpointing = True, # @@@ IF YOU GET OUT OF MEMORY - set to True @@@
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)
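With LoRA applied, only a small fraction of the weights is trainable. Since Unsloth builds on PEFT, the returned model should expose PEFT’s helper for confirming this (if not, you can sum p.numel() over parameters with requires_grad=True instead):

# Show how many parameters LoRA actually trains vs. the full model
model.print_trainable_parameters()
# With r=32 on all seven projection modules, expect roughly ~25M trainable
# parameters out of ~1.1B total, i.e. around 2%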

Prepare Data

The next step is to prepare the dataset for fine-tuning. As mentioned earlier, we will use the cleaned Alpaca dataset, a cleaned-up version of the original Alpaca dataset. It follows the instruction-input-response format. Here is an example of an Alpaca record.

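An illustrative record in that format (a made-up example for clarity, not quoted verbatim from the dataset) looks like this:

{
    "instruction": "Give three tips for staying healthy.",
    "input": "",
    "output": "1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep."
}

The input field is often empty, in which case the prompt template below simply leaves the Input section blank.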

Now, let’s prepare our data.

# @title prepare data

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

from datasets import load_dataset
dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

Now, split the data into train and eval sets. I have kept the eval split small, as a larger eval set slows down training.

dataset_dict = dataset.train_test_split(test_size=0.004)
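Optionally, print the resulting splits to confirm the eval set is small. The exact counts depend on the dataset version you downloaded, but with test_size=0.004 the eval split is only a few hundred examples:

print(dataset_dict)
# DatasetDict with a large "train" split and a small "test" split (~0.4% of the data)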

Configure WandB

Now, configure Weights and Biases in your current runtime.

# @title wandb init
import wandb
wandb.login()

Provide the API key to log in to WandB when prompted.

Set up the environment variables. WANDB_WATCH=all logs both gradients and parameters, and WANDB_SILENT=true suppresses WandB’s console output.

%env WANDB_WATCH=all
%env WANDB_SILENT=true

Train Model

So far, we have loaded the 4-bit model, created the LoRA configuration, prepared the dataset, and configured WandB. The next step is to train the model on the data. For that, we need a trainer from the TRL library; we will use the SFTTrainer. But before that, initialize WandB and define appropriate training arguments.

import os

from trl import SFTTrainer
from transformers import TrainingArguments
from transformers.utils import logging
import wandb

logging.set_verbosity_info()
project_name = "tiny-llama" 
entity = "wandb"
# os.environ["WANDB_LOG_MODEL"] = "checkpoint"

wandb.init(project=project_name, name = "tiny-llama-unsloth-sft")

Training Arguments

args = TrainingArguments(
        per_device_train_batch_size = 2,
        per_device_eval_batch_size=2,
        gradient_accumulation_steps = 4,
        evaluation_strategy="steps",
        warmup_ratio = 0.1,
        num_train_epochs = 1,
        learning_rate = 2e-5,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        optim = "adamw_8bit",
        weight_decay = 0.1,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
        report_to="wandb",  # enable logging to W&B
        # run_name="tiny-llama-alpaca-run",  # name of the W&B run (optional)
        logging_steps=1,  # how often to log to W&B
        logging_strategy = 'steps',
        save_total_limit=2,
    )

These arguments matter for training. To keep GPU memory usage low, keep the train and eval batch sizes and the gradient accumulation steps small. The logging_steps parameter is the number of steps between metric logs to WandB.

Now, initialize the SFTTrainer.

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset_dict["train"],
    eval_dataset=dataset_dict["test"],
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = True, # Packs short sequences together to save time!
    args = args,
)

Now, start the training.

trainer_stats = trainer.train()
wandb.finish()

During the training run, WandB will track the training and eval metrics. Visit the dashboard link printed at the start of the run to see them in real time.

This is a screenshot from my run on a Colab notebook.


The training speed will depend on multiple factors, including the training and eval data sizes, the train and eval batch sizes, and the number of epochs. If you run into GPU memory issues, try reducing the batch sizes and gradient accumulation steps. The effective train batch size = per_device_train_batch_size * gradient_accumulation_steps, and the number of optimization steps = total training examples / effective batch size. You can play with the parameters and see what works best.
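As a rough back-of-the-envelope example with the settings above (assuming roughly 51,500 training examples after the split; the exact count depends on the dataset version):

per_device_train_batch_size = 2
gradient_accumulation_steps = 4
num_train_examples = 51_500  # approximate size of the train split

effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps  # 8
optimization_steps = num_train_examples // effective_batch_size                   # 6437
print(effective_batch_size, optimization_steps)

Note that with packing=True the real step count will be lower, since short examples are packed together into single sequences.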

You can visualize the training and evaluation loss of your training on the WandB dashboard.

Train Loss

Eval Loss

Inferencing

You can save the LoRA adapters locally or push them to the Hugging Face Hub.

model.save_pretrained("lora_model") # Local saving
# model.push_to_hub("your_name/lora_model", token = "...") # Online saving

You can also load the saved model from disk and use it for inference.

if False:  # change to True to reload the saved LoRA adapters from disk
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )

inputs = tokenizer(
[
    alpaca_prompt.format(
        "capital of France?", # instruction
        "", # input
        "", # output - leave this blank for a generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

For streaming model responses:

from transformers import TextStreamer

text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 64)

So, this was all about fine-tuning a Tiny-Llama model with WandB logging.

Here is the Colab Notebook for the same.

Conclusion

Small LLMs can be beneficial for deployment on compute-constrained hardware, such as personal computers, mobile phones, and wearables. Fine-tuning allows these models to perform better on downstream tasks. In this article, we learned how to fine-tune a base language model on a dataset.

Key Takeaways

  • Fine-tuning is the process of making a pre-trained model adapt to a specific new task.
  • Tiny-Llama is an LLM with only 1.1 billion parameters and is trained on 3 trillion tokens.
  • There are different ways to fine-tune LLMs, like LoRA and QLoRA.
  • Unsloth is an open-source platform that speeds up LLM fine-tuning with optimized Triton kernels.
  • Weights and Biases (WandB) is a tool for tracking and storing ML experiments.

Frequently Asked Questions

Q1. What is LLM fine-tuning?

A. Fine-tuning, in the context of machine learning, especially deep learning, is a technique where you take a pre-trained model and adapt it to a new, specific task.

Q2. Can I Fine-tune LLMs for free?

A. It is possible to Fine-tune smaller LLMs for free on Colab over the Tesla T4 GPU with QLoRA.

Q3. What are the benefits of Fine-tuning LLM?

A. Fine-tuning vastly enhances an LLM’s capability to perform downstream tasks, like role play, code generation, etc.

Q4. What is Tiny-Llama?

A. Tiny-Llama is an LLM with 1.1B parameters, trained on 3 trillion tokens. The model adopts the original Llama-2 architecture.

Q5. What is Unsloth used for?

A. Unsloth is an open-source tool that provides faster and more efficient LLM fine-tuning by optimizing GPU kernels with Triton.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

