TinyLlama 1.1B – Size Doesn’t Matter

Ajay Last Updated : 16 Feb, 2024

10 min read

Introduction

In the quickly growing landscape of artificial intelligence and machine learning, TinyLlama 1.1B emerges as a noteworthy development. In an era where computational constraints pose challenges for running more complex models, TinyLlama stands out by defying expectations. It showcases the remarkable performance of compact models.

This article aims to provide an analysis of TinyLlama 1.1B, a compact large language model. We will delve into its core aspects, like how it was trained in performance benchmarks and practical implementation using the Hugging Face platform. We will even run this model on the free Google Colab and test its maths and reasoning abilities.

Learning Objectives

Gain a comprehensive understanding of TinyLlama 1.1B
Explore the intricate training process that the model has gone through
Analyze the performance and benchmark results to assess its efficacy
Learn the practical steps to implement TinyLlama 1.1B using coding examples

This article was published as a part of the Data Science Blogathon.

Learning Objectives
What is TinyLlama 1.1B?
Training Process of TinyLlama 1.1B
Performance and Benchmark Results
TinyLlama – Getting Started
Potential Use Cases and Applications
The Key Takeaways From this Article Include
Frequently Asked Questions

What is TinyLlama 1.1B?

TinyLlama 1.1B, a part of the broader Llama project, is a testament to language modeling advancements. It’s a model with 1.1 billion parameters, trained on a staggering 3 trillion tokens, which puts it in a unique position in the AI landscape. Unlike its larger counterparts, TinyLlama 1.1B is designed to be more efficient and manageable, making it a good choice for applications with limited computational resources.

This open-source model democratizes access to state-of-the-art AI technology, allowing many developers and researchers to explore and innovate in the field of natural language processing. It is a model known for its ability to balance performance with resource consumption, a critical consideration in today’s diverse computational environments.

Training Process of TinyLlama 1.1B

The training process of TinyLlama 1.1B is fascinating, like the model itself. The training of TinyLlama took place just for 90 days, trained on the 16 A100-40G GPUs.The pretraining was done on 3 Trillion Tokens, and the TinyLlama Team has published the intermediate model between each half a trillion.

As for the data, Slimpajama and Starcoderdata were taken with a combined dataset size of 950 Billion Tokens. The natural language-to-code ratio was kept at 7:3, i.e. 70% of the data was natural language, and 30% was code. Thus, to achieve the 3 Trillion Tokens mark for fine-tuning, the TinyLlama underwent 3 epochs of training for this dataset.

There is even a chat version of TinyLlama called the TinyLlama-Chat released. Initially, this model underwent fine-tuning on the UltraChat dataset, which contains diverse synthetic conversations generated by ChatGPT. This step was crucial in making the model to handle different conversational contexts and styles.

Further refinement was achieved using the DPOTrainer on the UltraFeedback dataset. This training phase focused on aligning the model’s responses to align with human-like conversational patterns. The result is a model that not just grasps information on different topics but even interacts in a natural and engaging way.

You can also read: Getting Started with LlaMA 2: A Beginner’s Guide

Performance and Benchmark Results

Evaluating the performance of TinyLlama 1.1B reveals its capability to deliver high-quality responses swiftly. Its training has endowed it with the ability to cater to multilingual applications, an important feature in our globalized world. Despite its smaller size, TinyLlama 1.1B is still catching up to its larger counterparts regarding response quality and speed, making it a potent tool in different AI applications.

The benchmarks for TinyLlama 1.1B, while less extensive than those for larger models, still demonstrate its proficiency in handling complex language tasks. Its ability to generate coherent and contextually relevant responses in multiple languages is particularly impressive. The model was tested on different benchmarks like HellaSwag, WinoGrande, ARC, MMLU, and others. The combined average score came out to be 52.99. This is way better than the other 1 Billion Parameter Model, i.e. the Pythia 1B, which achieved an average score of 48.3. The table depicts the individual scores of each benchmark

Benchmark	TinyLlama 1.1B Score
HellaSwag	59.2
Obqa	36.0
WinoGrande	59.12
ARC_c	30.12
ARC_e	55.25
boolq	57.83
piqa	73.29
avg	52.9

TinyLlama – Getting Started

Here, in this section, we will download the quantized version of TinyLlama Chat and run it in Google Colab. Before downloading the model, we have to download and install the following Python Packages

!CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip3 install llama-cpp-python 
!pip3 install huggingface-hub

The CMAKE_ARGS=”-DLLAMA_CUBLAS=on” and FORCE_CMAKE=1, will allow the llama_cpp_python to utilize the Nvidia GPU available in the free colab version.
Then we install the llama_cpp_python package through the pip3
We even download the huggingface-hub, with which we will be downloading the quantized TinyLlama 1.1B Chat

To test the TinyLlama 1.1B Chat model, we need first to download the quantized version of it. To download it, we will run the following code

from huggingface_hub import hf_hub_download

# specifying the model name
model_name = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
# specifying the type of quantization of the model
model_file = "tinyllama-1.1b-chat-v1.0.Q8_0.gguf"

# download the model by specifying the model name and quantized model name
model_path = hf_hub_download(model_name, filename=model_file)

Here, the hugging_face_hub library will take care of the process of downloading the quantized model. For this, we import the hf_hub_download that takes in the following parameters:

model_name: To this variable, we pass the model that we wish to download. Here we wish to download the TinyLlama 1.1B Chat GGUF model.
model_file: Here we specify the type of quantized model we want to download. Here we will download the 8-bit quantized version of the TinyLlama 1.1B Chat.
Finally, we pass these parameters to the hf_hub_download, which takes in these parameters and downloads the specified model. After downloading, it returns the path where the model is downloaded.
This path returned is being saved in the model_path variable.

Now, we can load this model through the llama_cpp_python library. The code for loading the model will be like the one below.

from llama_cpp import Llama
llm = Llama(
    model_path=model_path,
    n_ctx=512,  # the number of i/p tokens the model can take
    n_threads=8, # the number of threads to use
    n_gpu_layers=40# how many layers of the model to offload to the GPU
)

We import the Llama class from the llama_cpp, which takes in the following parameters

model_path: This variable takes in the path where our model is stored. We have obtained the path from the previous step, which we will be providing here
n_ctx: Here, we give the context length for the model. For now, we are providing 512 tokens as the context length
n_threads: Here we mention the number of threads to be used by the Llama class
n_gpu_layers: We specify this if we have a running GPU, which we do in case of the free colab. To this, we pass 40, which implies that we want to offload the entire model into the GPU and do not want any part of it to run in the system RAM
Finally, we create an object from this Llama class and give it to the variable llm

Running this code will load the TinyLlama 1.1B Chat quantized model onto the GPU and set the appropriate context length. Now, it’s time to perform some inferences on this model. For this, we work with the below code

output = llm(
  "<|im_start|>user\nWho are you?<|im_end|>\n<|im_start|>assistant\n", # User Prompt
  max_tokens=512,  # Number of output tokens generated
  stop=["</s>"],   # Token which tells the LLM to stop
)
print(output['choices'][0]['text']) # Model generated text

To infer the model, we pass the following parameters to the LLM:

prompt/chat template: This is the Prompt Template needed to chat with the model. The above-mentioned template(i.e. <im_end>, <im_start>) is the one that works for the TinyLlama 1.1B Chat model. In the template, the sentence after the User is the User Prompt, and the generation will be generated after the Assistant.
max_tokens: To this variable, we pass a value that defines the maximum number of tokens a Large Language Model can output when a Prompt is given. For now, we are limiting it to 512 tokens.
stop: To this variable, we pass the stop token. The stop token tells the Large Language Model to stop generating further tokens. For TinyLlama 1.1B Chat, the stop token is <s>

The generated text is stored in the output variable when we run this. The result is generated in a format similar to the OpenAI API call. Hence, we can access the generation through the given print statement, similar to how we access the generation from the OpenAI responses. The output generated can be seen below

For a model of this size, its generated response is top-notch. This is unexpected from a model of this size; the grammar and tone look perfectly fine, and there is no sign of repetition of sentences. Let’s try testing the model’s reasoning capabilities

output = llm(
  "<|im_start|>user\nIf all students who study hard get good grades, \
  and John got good grades, can we conclude that John studied hard?\
  <|im_end|>\n<|im_start|>assistant\n",
  max_tokens=512,
  stop=["</s>"],
)


print(output['choices'][0]['text'])

output = llm(
  "<|im_start|>user\nHow fast can a snake fly?\n<|im_end|>\n<|im_start|>assistant\n",
  max_tokens=512,
  stop=["</s>"],
)


print(output['choices'][0]['text'])

So far, so good. From the examples we have seen, the model generates good answers. But this may not be true in all cases because we only test it on a limited number of questions. Let’s even test the model on its math reasoning capabilities

output = llm(
  "<|im_start|>user\nJohn is twice as old as Sarah, and Sarah is three years \
  older than Mary. If Mary is 10 years old, how old is John?\n<|im_end|>\n<|im_start|>assistant\n",
  max_tokens=512,
  stop=["</s>"],
)


print(output['choices'][0]['text'])

output = llm(
  "<|im_start|>user\nWhat is the missing number in this pattern: \
  1, 4, 9, 16, __, 36?\n<|im_end|>\n<|im_start|>assistant\n",
  max_tokens=512,
  stop=["</s>"],
)


print(output['choices'][0]['text'])

From the examples we have seen, it is clear that the TinyLlamaChat performs extremely poorly in answering simple aptitude questions in math. This is expected because the model was not pretrained on any maths dataset. The quality of the generation can be improved by fine-tuning it on the math dataset

Coming to fine-tuning, the TinyLlama is a go-to choice for those who are restricted with limited hardware and wish to fine-tune large language models on their specific dataset

Potential Use Cases and Applications

Given the compact size of TinyLlama, which boasts 1.1 billion parameters, its applications are mainly suited to environments where larger models might not be as feasible due to hardware limitations or greater efficiency. Here are some specific use cases keeping its size in consideration:

Mobile Applications: TinyLlama’s smaller size makes it a good choice for integrating into mobile apps where on-device processing is necessary. This includes language translation apps, personal assistant features, and chatbots that can operate efficiently on smartphones.

Embedded Systems in IoT Devices: In the Internet of Things (IoT) field, the computing resources are often limited; TinyLlama can be used to add intelligent language processing capabilities to different equipment like smart home assistants, wearable tech, and other such connected equipment.

Edge Computing: For applications that benefit from processing data closer to the source rather than in a centralized cloud environment, TinyLlama can be employed effectively. This includes real-time language processing in automotive systems, manufacturing equipment, and other edge devices.

Low-Resource Language Research: Due to its smaller size and lower computational requirements, TinyLlama can be a valuable tool in linguistic research, especially for under-resourced languages where large-scale model training isn’t feasible.

Educational Tools: In educational settings, especially those with limited access to high-end computing resources, TinyLlama can be used to develop language learning apps, interactive educational tools, and other learning aids.

Content Generation for Small Businesses: Small businesses with limited resources can use TinyLlama for generating content, like product descriptions, marketing copy, and customer correspondence, without the need for extensive computing power.

Prototyping and Experimentation: Developers and researchers who wish to experiment with language models but lack access to high-powered computing resources can use TinyLlama to prototype and develop new NLP applications.

Efficient Data Analysis: TinyLlama can be used for text analysis and data extraction in scenarios where quick and efficient processing is needed, like analyzing customer feedback, survey responses, or social media interactions.

Conclusion

TinyLlama 1.1B is a testament to the advancements in the field of AI and natural language processing. Its development and widespread availability are vital to creating more efficient, small, and quick inference language models. By balancing a smaller parameter footprint with robust performance, TinyLlama 1.1B addresses the critical need for powerful and practical models for a wide array of applications. Its ability to understand and generate language in a human-like manner while being light enough for different computing environments makes it a go-to choice for people struggling to run Large Language Models on their machines. The model can be fine-tuned easily on a dataset and can be trained with limited computing resources.

The Key Takeaways From this Article Include

Designed for efficiency, TinyLlama 1.1B is available to a wider audience, including those with limited computational resources, making it suitable for several applications.
The model underwent an extensive training process, including training on 3 trillion tokens over 90 days using 16 A100-40G GPUs.
Despite its smaller size, TinyLlama 1.1B delivers high-quality, contextually relevant responses in multiple languages, making it a model to consider.
It is a good choice for mobile applications, IoT equipment, educational tools, and more, its compact size and efficiency allow for broad applications.
Its lower computational requirements make it a valuable tool in linguistic research, especially for under-resourced languages.
The model is a good choice for those experimenting with language models or developing new NLP Apps, mainly in settings with limited computational power.

Frequently Asked Questions

Q1. What is TinyLlama 1.1B?

A. TinyLlama 1.1B is a compact, efficient large language model with 1.1 billion parameters, trained on 3 trillion tokens, suitable for applications with limited computational resources.

Q2. How was TinyLlama 1.1B trained?

A. It was trained over 90 days using 16 A100-40G GPUs on datasets including Slimpajama and Starcoderdata, with a natural language to code ratio of 7:3.

Q3. What are the performance benchmarks of TinyLlama 1.1B?

A. TinyLlama 1.1B shows its skills in handling complex language tasks, scoring an average of 52.99 across benchmarks like HellaSwag, MMLU, and WinoGrande.

Q4. What are some potential use cases of TinyLlama 1.1B?

A. It’s suitable for applications where size and speed are an important issue. These include mobile apps, IoT equipment like home automation devices, content generation for small businesses, and efficient data analysis.

Q5. Is TinyLlama 1.1B suitable for developers with limited resources?

A. Absolutely, it’s a perfect choice for developers and researchers who lack access to high-powered computing resources for prototyping and developing new NLP applications. The TinyLlama model can be even run on a Raspberry Pi machine.

Q6. How does TinyLlama 1.1B perform in mathematical reasoning tasks?

A. While it really excels in different language tasks, it shows limitations in mathematical reasoning, which can be improved by fine-tuning relevant datasets.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Ajay

I work as a Developer in the field of Data Science. I constantly spend time learning new things be it related to AI, DataSceine, and CyberSecurity. Deep learning and machine learning are two topics that I find particularly fascinating, and Python is my preferred language for programming. Cyber Security is another field that I'm touching upon recently. I have experience with large-scale data analysis, and I have a solid grasp of a variety of deep learning and machine learning approaches, including neural networks, regression models, and natural language processing. I'm eager to take on new challenges and make a meaningful contribution to the industry, so I'm constantly seeking for ways to enlarge and deepen my knowledge and skills in the subject.

Artificial Intelligence Data Analysis Guide LLMs NLP

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Introduction to Generative AI

Introduction to Generative AI applications

No-code Generative AI app development

Code-focused Generative AI App Development

Introduction to Responsible AI

LLMS

Prompt Engineering

Finetuning LLMs

Training LLMs from Scratch

Langchain

RAG

LlamaIndex

Stable Diffusion

TinyLlama 1.1B – Size Doesn’t Matter

Introduction

Learning Objectives

Table of contents

What is TinyLlama 1.1B?

Training Process of TinyLlama 1.1B

Performance and Benchmark Results

TinyLlama – Getting Started

Potential Use Cases and Applications

Conclusion

The Key Takeaways From this Article Include

Frequently Asked Questions

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics

liap

visit

li_at

s_plt

lang

s_tp

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

s_pltp

s_tslv

li_theme

li_theme_set

Google (11)