Unlock the Power of GenAI LLMs Right on Your Local Machine!

Ajay Kumar Reddy 06 Sep, 2023


Introduction

Since the release of GenAI LLMs, most of us have started using them in one way or another. The most common way is through websites such as OpenAI’s ChatGPT, through APIs such as OpenAI’s GPT-3.5 API or Google’s PaLM API, or through other sites such as Hugging Face and Perplexity.ai that let us interact with these Large Language Models.

In all these approaches, our data is sent outside our computer, where it may be exposed to cyber-attacks (all these websites promise the highest security, but we cannot know what might happen). Sometimes we want to run these Large Language Models locally and, if possible, tune them locally. In this article, we will do exactly that: set up LLMs locally with Oobabooga.

Learning Objectives

Understand the significance and challenges of deploying large language models on local systems.
Learn to create a setup locally to run large language models.
Explore what models can be run with given CPU, RAM, and GPU VRAM specifications.
Learn to download any large language model from Hugging Face to use locally.
Check how to allocate GPU memory for the large language model to run.

This article was published as a part of the Data Science Blogathon.

What is Oobabooga?

Oobabooga is a text-generation web interface for Large Language Models. It is built with Gradio, a Python library widely used by Machine Learning enthusiasts to build web applications. Oobabooga abstracts away the complicated setup needed to run a large language model locally, and it comes with a load of extensions for integrating other features.

With Oobabooga, you provide the model’s path on Hugging Face, the UI downloads it, and you can start running inference right away. Oobabooga has many functionalities and supports different model formats and backends, such as GGML, GPTQ, ExLlama, and llama.cpp. You can even load a LoRA (Low-Rank Adaptation) on top of an LLM with this UI. Oobabooga also lets you train a large language model to create chatbots or LoRAs. In this article, we will go through the installation of this software with Conda.

Setting Up the Environment

In this section, we will create a virtual environment using conda. To create a new environment, open Anaconda Prompt and type the following:

conda create -n textgenui python=3.10.9
conda activate textgenui

The first command creates a new conda/Python environment named textgenui. The README in Oobabooga’s GitHub repository recommends Python 3.10.9, so the command creates the environment with that version.
The second command activates the newly created environment and makes it the primary environment (so we can work in it).
The next step is to install the PyTorch library. PyTorch comes in different flavors, such as a CPU-only version and a CPU+GPU version. In this article, we will use the CPU+GPU version, which we install with the command below.

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117

PyTorch GPU Python Library

The above command installs the GPU build of the PyTorch Python library. Note that the CUDA (GPU) version we are installing is cu117. This changes over time, so it is advisable to visit the official PyTorch page to get the install command for the latest version. If you have no access to a GPU, you can go ahead with the CPU version.
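Before moving on, it can help to confirm which PyTorch build actually ended up in the environment. The snippet below is a minimal sketch, not part of Oobabooga: the function name describe_torch_backend is made up for illustration, and it only relies on the standard torch.cuda.is_available(), torch.cuda.get_device_name() and torch.version.cuda calls.

```python
def describe_torch_backend() -> str:
    """Return a one-line summary of the PyTorch install, if any."""
    try:
        import torch  # imported lazily so the check itself never crashes
    except ImportError:
        return "PyTorch is not installed in this environment"

    if torch.cuda.is_available():
        # Index 0 is the first GPU visible to PyTorch.
        name = torch.cuda.get_device_name(0)
        return f"PyTorch {torch.__version__} with CUDA {torch.version.cuda} on {name}"
    return f"PyTorch {torch.__version__} (CPU only, no CUDA device visible)"


if __name__ == "__main__":
    print(describe_torch_backend())
```

If the summary reports CPU only on a machine that has a GPU, the usual suspects are a CPU-only wheel or a GPU driver that does not match the installed CUDA build.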

Now change the directory within Anaconda Prompt to the directory where you want to download the code. You can either download the code from GitHub as an archive or use git clone. Here I will use git clone to clone Oobabooga’s repository into the directory I want, with the command below.

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui

The first command pulls Oobabooga’s repository into the folder from which we run the command. All the files will be present in a folder called text-generation-webui.
The second command changes the directory to text-generation-webui. This directory contains a requirements.txt file, which lists all the packages needed for the large language models and the UI to work, so we install them with pip:

pip install -r requirements.txt

The above command installs all the required packages/libraries, such as Hugging Face transformers, bitsandbytes, and gradio, needed to run the large language model. We are now ready to launch the web UI, which we can do with the command below.

python server.py

Anaconda Prompt will now show a URL, http://localhost:7860 or http://127.0.0.1:7860. Open this URL in your browser, and the UI will appear, looking as follows:

[Image: Setting up the environment]
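If the page does not open in the browser, a quick way to tell whether the server is actually listening is to request the URL from Python. This is a minimal sketch using only the standard library; the helper name is_ui_reachable is hypothetical, and the default URL is simply the address printed by the server above.

```python
import http.client
import urllib.request


def is_ui_reachable(url: str = "http://127.0.0.1:7860", timeout: float = 3.0) -> bool:
    """Return True if an HTTP server answers at `url` with a success status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError, refused connections and timeouts derive from OSError.
        return False


if __name__ == "__main__":
    print("UI reachable:", is_ui_reachable())
```

A False result while server.py is still running usually means the UI was started on a different port or address than the default assumed here.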

We have now installed all the libraries needed to start working with text-generation-webui. Our next step is to download a large language model.
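To double-check that the install step really put the key libraries into the active environment, we can ask Python for their versions. The sketch below uses the standard importlib.metadata module; the function name installed_versions and the short package list are illustrative assumptions, and requirements.txt remains the authoritative list.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # A few of the libraries named in this article, not the full requirements list.
    wanted = ["torch", "transformers", "bitsandbytes", "gradio"]
    for name, version in installed_versions(wanted).items():
        print(f"{name:15s} {version or 'NOT INSTALLED'}")
```

Anything reported as NOT INSTALLED typically means pip ran outside the textgenui environment, so it is worth re-activating the environment and repeating the install.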

Downloading and Inferencing Models

In this section, we will download a large language model from Hugging Face, run inference with it, and chat with the LLM. To do this, navigate to the Model section in the top bar of the UI. This opens the model page, which looks as follows:

[Image: Inferencing models]

Download Custom Model

Here on the right side, we see “Download custom model or LoRA”; below it is a text field with a Download button. In this text field, we must provide the model’s path from the Hugging Face website, which the UI will then download. Let’s try this with an example: I will download the Nous-Hermes model based on the newly released Llama 2. So, I will go to that model’s card on Hugging Face.

I will download a 13B GPTQ model, which is a quantized version of the Nous-Hermes 13B model that is itself based on Llama 2. (GPTQ models require a GPU to run; if you only have a CPU, you can go with GGML models.) To copy the path, click the copy button on the model card. Then scroll down to see the different quantized versions of the Nous-Hermes 13B model.

Here, for example, we will choose the gptq-4bit-32g-actorder_True version of the Nous-Hermes-GPTQ model. The path for this model is then “TheBloke/Nous-Hermes-Llama2-GPTQ:gptq-4bit-32g-actorder_True”, where the part before the “:” is the model name and the part after the “:” is the quantized version of the model. Now, we paste this into the text box we saw earlier.
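The “name:version” convention is easy to take apart in code. The sketch below splits such a path into its two parts and, as an optional extra, shows how the same files could be fetched with the huggingface_hub package. This is an assumption for illustration, not a description of how the UI downloads models internally; ModelSpec, parse_model_spec and download are hypothetical names, while snapshot_download with repo_id, revision and local_dir is the library’s documented call.

```python
from typing import NamedTuple, Optional


class ModelSpec(NamedTuple):
    repo_id: str            # e.g. "TheBloke/Nous-Hermes-Llama2-GPTQ"
    branch: Optional[str]   # e.g. "gptq-4bit-32g-actorder_True"; None = default branch


def parse_model_spec(spec: str) -> ModelSpec:
    """Split 'user/model[:branch]' into its repository and branch parts."""
    repo_id, sep, branch = spec.strip().partition(":")
    if repo_id.count("/") != 1 or not all(repo_id.split("/")):
        raise ValueError(f"expected 'user/model[:branch]', got {spec!r}")
    return ModelSpec(repo_id, branch if sep and branch else None)


def download(spec: str, target_dir: str) -> str:
    """Fetch the files with huggingface_hub (assumed installed; not the UI's own code)."""
    from huggingface_hub import snapshot_download

    parsed = parse_model_spec(spec)
    return snapshot_download(
        repo_id=parsed.repo_id,
        revision=parsed.branch,  # None selects the repository's default branch
        local_dir=target_dir,
    )


if __name__ == "__main__":
    print(parse_model_spec("TheBloke/Nous-Hermes-Llama2-GPTQ:gptq-4bit-32g-actorder_True"))
```

Each quantized version lives on its own branch of the model repository, which is why the part after the colon maps onto the revision argument.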

Now, click the Download button to download the model. This will take some time, as the file size is 8 GB. After the model is downloaded, click the refresh button to the left of the Load button. Then select the model you want to use from the drop-down. If the model is a CPU version, you can click the Load button right away.

GPU VRAM Model

If you use a GPU-type model, like the GPTQ one we downloaded here, you must allocate GPU VRAM for the model. As the model size is around 8 GB, we will allocate around 10 GB of memory to it (I have sufficient GPU VRAM, so I am providing 10 GB). Then, we click the Load button.
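The arithmetic behind these numbers can be sketched in a few lines. The weights of a 13B-parameter model at 4 bits each come to about 6.5 GB; the downloaded file is larger, likely because quantization metadata and some higher-precision layers are stored alongside them. The 25% headroom used below is only a rule-of-thumb assumption chosen to reproduce the 8 GB to 10 GB step in this article, and both function names are hypothetical.

```python
def weight_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Size of the raw weights in GB (10**9 bytes), ignoring quantization metadata."""
    return n_params * bits_per_weight / 8 / 1e9


def suggested_vram_gb(file_size_gb: float, headroom_fraction: float = 0.25) -> float:
    """Model file size plus headroom for activations and the KV cache (rule of thumb)."""
    return file_size_gb * (1 + headroom_fraction)


if __name__ == "__main__":
    print(f"13B weights at 4 bits:  {weight_size_gb(13e9, 4):.1f} GB")
    print(f"13B weights at 16 bits: {weight_size_gb(13e9, 16):.1f} GB")
    print(f"Allocation for an 8 GB file: {suggested_vram_gb(8):.1f} GB")
```

The real requirement also grows with context length and batch size, so treat the result as a starting point and adjust if the model fails to load or runs out of memory during generation.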

After we click the Load button, we go to the Session tab and change the mode from default to chat. Then, we click Apply and restart.

Now, we are ready to run inference with our model, i.e., we can start interacting with the model that we have downloaded. Go to the Text Generation tab to begin.

So, it’s time to test our Nous-Hermes-13B Large Language Model that we downloaded from Hugging Face through the Text Generation UI. Let’s start the conversation.

[Image: Hugging Face model through the Text Generation UI]

We can see from the above that the model is indeed working fine. It did not hallucinate, and it answered my questions correctly. We asked the large language model to generate Python code for the Fibonacci series, and it wrote workable Python code that matches the input I gave, along with an explanation of how it works. This way, you can download and run any model through the Text Generation UI, all of it locally, ensuring the privacy of your data.

Conclusion

In this article, we have gone through a step-by-step process of setting up text-generation-webui, which allows us to interact with large language models directly within our local environment, without a network connection once the model is downloaded. We have looked at how to download a specific version of a model from Hugging Face and have learned which quantization methods the application currently supports. This way, anyone can access a large language model locally, even one based on the newly released Llama 2, as we have seen in this article.

Key Takeaways

Some of the key takeaways from this article include:

The text-generation-webui from Oobabooga can be used on any OS, be it Mac, Windows, or Linux.
This UI lets us directly access different large language models, even newly released ones, from Hugging Face.
Even the quantized versions of different large language models are supported by this UI.
CPU-only large language models can also be loaded with text-generation-webui, which lets users with no access to a GPU use LLMs.
Finally, as we run the UI locally, the data / the chat we have with the model stays within the local system itself.

Frequently Asked Questions

Q1. What is the Oobabooga Text Generation UI?

A. It is a UI created with the Gradio package in Python that allows anyone to download and run large language models locally.

Q2. How do we download the models with this UI?

A. We can download any model with this UI by simply providing the model’s path. We can obtain this path from the Hugging Face website, which hosts thousands of large language models.

Q3. Will my data be at risk while using these applications?

A. No. Here, we are running the large language model completely on our local machine. We only need the internet when downloading the model; after that, we can run inference without the internet, so everything happens locally within our computer. The data you use in the chat is not stored or sent anywhere on the internet.

Q4. Can I train a Large Language Model with this UI?

A. Yes, absolutely. You can either fully train any model that you download or create a LoRA out of it. We can download a vanilla large language model like Llama or Llama 2, train it from scratch with our custom data for any application, and then run inference with the resulting model.

Q5. Can we run quantized models on it?

A. Yes, we can run quantized models, such as 2-bit, 4-bit, 6-bit, and 8-bit quantized models, on it. It supports models quantized with GPTQ and GGML, as well as backends like ExLlama and llama.cpp. If you have a larger GPU, you can run the whole model without quantization.
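To get a feel for what the bit width means, here is a toy illustration of quantization in plain Python. It uses simple symmetric round-to-nearest rounding, which is an assumption made for clarity and is not the actual GPTQ or GGML algorithm; the function names and the sample weights are made up.

```python
def quantize(weights, bits):
    """Symmetric round-to-nearest quantization of a list of floats."""
    if bits < 2:
        raise ValueError("this toy scheme needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit signed integers
    scale = max(abs(w) for w in weights) / qmax or 1.0  # 1.0 if all weights are zero
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map the integer codes back to approximate float weights."""
    return [c * scale for c in codes]


if __name__ == "__main__":
    weights = [0.12, -0.5, 0.33, 0.9]
    for bits in (8, 4, 2):
        codes, scale = quantize(weights, bits)
        restored = dequantize(codes, scale)
        worst = max(abs(a - b) for a, b in zip(weights, restored))
        print(f"{bits}-bit codes {codes}, worst error {worst:.4f}")
```

Fewer bits mean fewer distinct values per weight, so the file shrinks while the rounding error grows, which is the trade-off behind choosing between the 2-bit, 4-bit, 6-bit, and 8-bit variants.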

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.