How to Run Llama 3 Locally?

Sunil Kumar Last Updated : 02 Apr, 2025

Meta’s Llama 3 models bring improvements such as a larger vocabulary and better benchmark performance. This article explains what is new in them, compares them with other models, and shows how to run them on your own hardware with tools like Hugging Face Transformers and Ollama. You will also learn about their open-source design, architectural features such as Grouped Query Attention, and how to use the models for text generation and other tasks.

This article was published as a part of the Data Science Blogathon.

Meta’s Llama 3
What is Ollama?
Introduction of Llama 3
Running Llama 3 Locally
- Using HuggingFace
- Using Llama 3 With Ollama
Conclusion
Frequently Asked Questions

Meta’s Llama 3

Llama 3 is a family of large language models (LLMs) that Meta released in April 2024. Here’s a summary of what makes it special:

- Most capable open-source LLM: Meta claims Llama 3 outperforms other open-source models of similar size on benchmarks. [1]
- Powers Meta AI Assistant: The assistant is integrated into Facebook, Messenger, WhatsApp, and Instagram and can help with tasks, learning, and content creation.
- Easy to access: You can try Llama 3 through Meta AI or through platforms like Hugging Face.

What is Ollama?

Ollama is an open-source framework designed to make working with Large Language Models (LLMs) easier. It allows you to run these powerful AI models directly on your own computer.

Here are some key features of Ollama:

- Run LLMs locally: Ollama lets you bypass cloud-based services and run LLMs on your local machine. This can be beneficial for privacy reasons and when dealing with sensitive data.
- Simple API: Ollama provides an easy-to-use interface for creating, running, and managing LLMs.
- Pre-built models: Ollama comes with a library of pre-built models that you can use right away for various tasks.
- Customization: Although it offers pre-built models, Ollama also allows you to import your own custom models for even greater flexibility.

Overall, Ollama is a valuable tool for developers, data scientists, and researchers who want to work with LLMs on their local machines. It simplifies the process and offers a secure environment for experimentation and development.

Introduction of Llama 3

The Llama 3 family comes as pre-trained base models and instruction-tuned chat models in 8B and 70B sizes. The vocabulary has grown to 128k tokens, which encodes text more efficiently and improves multilingual text generation. Both sizes also use Grouped Query Attention (GQA), in which several query heads share one set of key and value heads; this shrinks the key-value cache and makes inference on long sequences cheaper.

Meta trained the models on over 15 trillion tokens. With multi-modal models and larger 400B+ models planned, the Llama 3 series is positioned to support a wide range of applications across industries.
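To make the benefit of Grouped Query Attention concrete, the sketch below estimates the size of the key-value cache for one sequence, once with a key/value head for every query head and once with shared key/value heads. The layer count, head counts, head size, and context length are assumptions based on the commonly published Llama 3 8B configuration, and kv_cache_bytes is a hypothetical helper written for this illustration.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence (fp16 by default)."""
    # Factor 2: every layer stores one tensor for keys and one for values.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Assumed Llama 3 8B shape: 32 layers, 32 query heads, 8 key/value heads, head size 128.
N_LAYERS, N_QUERY_HEADS, N_KV_HEADS, HEAD_DIM = 32, 32, 8, 128
SEQ_LEN = 8192  # assumed context length

# Standard multi-head attention would need one key/value head per query head.
mha = kv_cache_bytes(N_LAYERS, N_QUERY_HEADS, HEAD_DIM, SEQ_LEN)
# With GQA, groups of query heads share the 8 key/value heads.
gqa = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"MHA: {mha / 2**30:.1f} GiB, GQA: {gqa / 2**30:.1f} GiB")
```

Under these assumptions the cache is four times smaller with GQA, which is memory that matters when the whole model has to fit on a single consumer GPU.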

Performance Highlights

- Llama 3 models do well at tasks such as creative writing, coding, and brainstorming.
- The 8B Llama 3 model outperforms earlier models of similar size by significant margins and approaches the performance of the Llama 2 70B model.
- According to Meta’s reported results, the Llama 3 70B model surpasses closed models such as Gemini Pro 1.5 and Claude 3 Sonnet on several benchmarks.
- The weights are openly available under a license that permits fine-tuning and commercial use.

Running Llama 3 Locally

Given these results, Llama 3 is a strong candidate for running locally. Thanks to advances in model quantization, LLMs can now run on consumer hardware. There are different ways to run these models locally, depending on your hardware. If your system has enough GPU memory (~48 GB), you can comfortably run the 8B model at full precision or a 4-bit quantized 70B model, although output may be on the slower side. You can also use cloud instances for inference. Here, we will use the free-tier Colab with a 16 GB T4 GPU to run a quantized 8B model. The 4-bit quantized model requires ~5.7 GB of GPU memory, which fits comfortably on a T4.
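The memory figures above can be sanity-checked with simple arithmetic: parameters times bits per weight, divided by eight. The sketch below is a rough estimate under the assumption that only the weights count; weight_memory_gb is a hypothetical helper, and it ignores activations, the key-value cache, and quantization overhead, which is presumably why the measured figure for the 4-bit 8B model is higher than the estimate.

```python
def weight_memory_gb(n_params_billion, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10**9 bytes)."""
    # billions of parameters * bits per parameter / 8 bits per byte = gigabytes
    return n_params_billion * bits_per_weight / 8


for size in (8, 70):
    for bits in (16, 4):
        print(f"{size}B at {bits}-bit: ~{weight_memory_gb(size, bits):.0f} GB")
```

By this estimate the 8B model needs about 16 GB at 16-bit precision and about 4 GB at 4-bit, while the 70B model needs about 35 GB at 4-bit.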

Several open-source tools can run these models locally. Here are two of them.

Using HuggingFace

Hugging Face already supports the Llama 3 models. We can pull the models from the Hugging Face Hub with the Transformers library, either at full precision or as 4-bit quantized versions. This example runs on the Colab free tier.

Step1: Install Libraries

Install accelerate and bitsandbytes libraries and upgrade the transformers library.

!pip install -U "transformers==4.40.0"
!pip install accelerate bitsandbytes
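
Before loading the model, it can help to confirm that the libraries are actually installed in the current runtime. This is a minimal sketch that uses only the Python standard library; the helper name installed_versions is made up for this example.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


required = ["transformers", "accelerate", "bitsandbytes", "torch"]
for name, version in installed_versions(required).items():
    print(f"{name}: {version or 'not installed'}")
```

If any package shows as not installed, rerun the install step (and restart the Colab runtime if an upgrade does not seem to take effect).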

Step2: Install Model

Now we will download and load the model so that we can start querying it.

import transformers
import torch

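# Community-published checkpoint of Llama 3 8B Instruct, already quantized to 4 bits
model_id = "unsloth/llama-3-8b-Instruct-bnb-4bit"

# The pipeline downloads the weights on first use and loads the model
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    model_kwargs={
        "torch_dtype": torch.float16,  # dtype for the parts that are not quantized
        "quantization_config": {"load_in_4bit": True},  # load in 4-bit via bitsandbytes
        "low_cpu_mem_usage": True,  # reduce peak CPU RAM while loading
    },
)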

Step3: Send Queries

Now send a query to the model. The max_new_tokens argument caps the length of the reply, so raise it if long answers get cut off.

messages = [
    {"role": "system", "content": "You are a helpful assistant!"},
    {"role": "user", "content": """Generate an approximately fifteen-word sentence 
                                   that describes all this data:
                                   Midsummer House eatType restaurant; 
                                   Midsummer House food Chinese; 
                                   Midsummer House priceRange moderate; 
                                   Midsummer House customer rating 3 out of 5; 
                                   Midsummer House near All Bar One"""},
]

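# Convert the message list into the prompt string the model was trained on
prompt = pipeline.tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

# Stop at the regular end-of-sequence token or at Llama 3's end-of-turn token
terminators = [
    pipeline.tokenizer.eos_token_id,
    pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
]

outputs = pipeline(
    prompt,
    max_new_tokens=256,  # upper bound on the length of the reply
    eos_token_id=terminators,
    do_sample=True,
    temperature=0.6,
    top_p=0.9,
)

# The pipeline returns prompt + completion, so strip the prompt before printing
print(outputs[0]["generated_text"][len(prompt):])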

Output of the query: “Here is a 15-word sentence that summarizes the data:

Midsummer House is a moderate-priced Chinese eatery with a 3-star rating near All Bar One.”
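
To see roughly what apply_chat_template produces, and why <|eot_id|> appears among the terminators, here is a hand-written approximation of the Llama 3 Instruct prompt layout. The function format_llama3_prompt is a hypothetical helper, not part of Transformers, and the template shipped with the tokenizer remains the authoritative version.

```python
def format_llama3_prompt(messages, add_generation_prompt=True):
    """Approximate the Llama 3 Instruct chat layout for a list of role/content dicts."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn starts with a role header and ends with the end-of-turn token.
        prompt += f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
        prompt += f"{message['content'].strip()}<|eot_id|>"
    if add_generation_prompt:
        # An open assistant header tells the model that it is its turn to write.
        prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


demo_messages = [
    {"role": "system", "content": "You are a helpful assistant!"},
    {"role": "user", "content": "Why is the sky blue?"},
]
print(format_llama3_prompt(demo_messages))
```

Because every turn ends with <|eot_id|>, the model emits that token when it finishes its own reply, so generation should stop there rather than only at the generic end-of-sequence token.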

Step4: Install Gradio and Run Code

You can wrap this in a Gradio app to get an interactive chat interface. Install Gradio and run the code below.

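# Install Gradio first (in Colab: !pip install gradio)
import gradio as gr

messages = []  # full conversation in chat-template format, shared by both callbacks


def add_text(history, text):
    """Record the user's turn and clear the textbox."""
    global messages
    # Use a list, not a tuple, so the reply can be filled in later.
    history = history + [[text, ""]]
    messages = messages + [{"role": "user", "content": text}]
    return history, ""  # one value per output component: chatbot, txt


def generate(history):
    """Generate a reply to the conversation so far and stream it to the chatbot."""
    global messages
    prompt = pipeline.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
    )

    terminators = [
        pipeline.tokenizer.eos_token_id,
        pipeline.tokenizer.convert_tokens_to_ids("<|eot_id|>"),
    ]

    outputs = pipeline(
        prompt,
        max_new_tokens=256,
        eos_token_id=terminators,
        do_sample=True,
        temperature=0.6,
        top_p=0.9,
    )
    response_msg = outputs[0]["generated_text"][len(prompt):]
    # Keep the assistant's reply so the next turn sees the whole conversation.
    messages = messages + [{"role": "assistant", "content": response_msg}]
    # The reply is already complete; yielding per character only animates it.
    for char in response_msg:
        history[-1][1] += char
        yield history


with gr.Blocks() as demo:
    # History is kept as [user, assistant] pairs; newer Gradio releases
    # may expect a different history format.
    chatbot = gr.Chatbot(value=[], elem_id="chatbot")
    with gr.Row():
        txt = gr.Textbox(
            show_label=False,
            placeholder="Enter text and press enter",
        )

    txt.submit(add_text, [chatbot, txt], [chatbot, txt], queue=False).then(
        generate, inputs=[chatbot], outputs=chatbot
    )

demo.queue()
demo.launch(debug=True)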

Using Llama 3 With Ollama

Ollama is another open-source tool for running LLMs locally. To use it, you first have to download and install it.

Step1: Starting Local Server

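Once Ollama is installed, use one of these commands to download a model (on first use) and chat with it from the terminal. The API used in the next step is served by the Ollama background server; if it is not already running, start it with ollama serve.

ollama run llama3:instruct  #for 8B instruct model (same as llama3)

ollama run llama3:70b-instruct #for 70B instruct model (same as llama3:70b)

ollama run llama3:text  #for 8B pre-trained model

ollama run llama3:70b-text #for 70B pre-trained model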

Step2: Query Through API

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
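
The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above; build_payload and generate are hypothetical helper names, and the code assumes an Ollama server is listening on localhost:11434 with the llama3 model already pulled.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # endpoint from the curl example


def build_payload(prompt, model="llama3", stream=False):
    """Return the JSON request body as bytes."""
    body = {"model": model, "prompt": prompt, "stream": stream}
    return json.dumps(body).encode("utf-8")


def generate(prompt, model="llama3", url=OLLAMA_URL, timeout=120):
    """Send one non-streaming request and return the generated text."""
    request = urllib.request.Request(
        url,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]


# Requires a running Ollama server, so the call is left commented out:
# print(generate("Why is the sky blue?"))
```

Setting stream to false makes the server return a single JSON object instead of a sequence of partial responses, which keeps the parsing to one json.loads call.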

Step3: JSON Response

You will receive a JSON response.

{
  "model": "llama3",
  "created_at": "2024-04-19T19:22:45.499127Z",
  "response": "The sky is blue because it is the color of the sky.",
  "done": true,
  "context": [1, 2, 3],
  "total_duration": 5043500667,
  "load_duration": 5025959,
  "prompt_eval_count": 26,
  "prompt_eval_duration": 325953000,
  "eval_count": 290,
  "eval_duration": 4709213000
}
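
The timing fields in the response are reported in nanoseconds, so generation speed can be derived from eval_count and eval_duration. A minimal sketch using the values from the sample response above; tokens_per_second is a hypothetical helper name.

```python
def tokens_per_second(response):
    """Generation speed from an Ollama response dict (durations are in nanoseconds)."""
    return response["eval_count"] / response["eval_duration"] * 1e9


# Values copied from the sample JSON response.
sample = {
    "eval_count": 290,
    "eval_duration": 4709213000,
    "total_duration": 5043500667,
}

print(f"speed: {tokens_per_second(sample):.1f} tokens/s")
print(f"total time: {sample['total_duration'] / 1e9:.2f} s")
```

For the sample response this works out to roughly 62 tokens per second over about five seconds of total request time.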

Conclusion

We have looked at both the advances in Llama 3 and practical ways to run it. Running Llama 3 locally is now possible thanks to tools like Hugging Face Transformers and Ollama, which opens up a wide range of applications across industries. Looking ahead, Llama 3’s open-source design encourages innovation and puts advanced language models within reach of developers everywhere.

Key Takeaways

- Meta has released the Llama 3 family with four models: pre-trained and instruction-tuned versions at 8B and 70B.
- The models perform strongly across multiple benchmarks in their respective size classes.
- Llama 3 uses a different tokenizer from Llama 2, with a larger vocabulary, and all models now use Grouped Query Attention (GQA) for more efficient inference.
- Although the models are large, quantization makes it possible to run them on consumer hardware with open-source tools like Ollama and Hugging Face Transformers.

Frequently Asked Questions

Q1. What is Llama 3?

A. Llama 3 is a family of large language models from Meta AI. It comes in two sizes, 8B and 70B, each available as a pre-trained base model and an instruction-tuned model for chat applications.

Q2. Is Llama 3 open-source?

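A. Yes, in the sense that the model weights are openly available under the Meta Llama 3 Community License. The models can be deployed commercially, subject to the license terms, and fine-tuned on custom datasets.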

Q3. Is Llama 3 multi-modal?

A. The first batch of these models is not multi-modal but Meta has confirmed the future release of multi-modal models.

Q4. Is Llama 3 better than ChatGPT?

A. The Llama 3 70B model is generally regarded as stronger than GPT-3.5, but it does not match GPT-4.

Q5. Is Llama 3 better than GPT-4?

A. GPT-4 is generally considered more capable overall, though Llama 3 can be competitive on specific tasks such as coding and summarization. Choose based on your needs and constraints.

Sunil Kumar

Sunil Kumar Dash is a developer and writer with diverse interests in tech, pop culture, wellness, philosophy, and anime. He enjoys exploring underrated music and doom-scrolling Twitter when bored.

Responses From Readers

Sujal Luhar

Hello sunil, thanks for this. But its giving incomplete answers for long texts.

EXAMPLE Prompt:

messages = [
    {"role": "system", "content": "You are a digital marketer who is expert in writing awesome enthusiastic blogs!"},
    {"role": "user", "content": """Generate an 15 points long blog that describes about solid pathway to quickly become Machine Learning Engineer """},
]

Response:

**Unlock the Power of Machine Learning: A 15-Point Path to Becoming a Machine Learning Engineer**

Are you ready to unlock the secrets of machine learning and become a sought-after expert in this high-demand field? Look no further! In this blog, we'll outline a solid pathway to quickly become a machine learning engineer, covering the essential skills, tools, and best practices to get you started.

**Point 1: Start with the Basics** Begin by learning the fundamentals of machine learning, including supervised and unsupervised learning, regression, classification, and clustering.

**Point 2: Get Familiar with Python** Python is the de facto language for machine learning. Learn the basics of Python programming and get comfortable with popular libraries like NumPy, Pandas, and scikit-learn.

**Point 3: Learn Linear Algebra and Calculus** Linear algebra and calculus are crucial for understanding machine learning concepts. Brush up on your math skills and learn to apply them to real-world problems.

**Point 4: Dive into Machine Learning Fundamentals** Study the basics of machine learning, including regression, classification, clustering, and dimensionality reduction. Practice implementing these concepts using Python libraries.

**Point 5: Experiment with Real-World Datasets** Work with real .....

Can you help me with it?
