A Beginner's Guide to LLMOps for Machine Learning Engineering

Sai Battula 22 Jan, 2024

Introduction

The release of OpenAI’s ChatGPT has inspired a lot of interest in large language models (LLMs), and everyone is now talking about artificial intelligence. But it’s not just friendly conversations; the machine learning (ML) community has introduced a new term called LLMOps. We have all heard of MLOps, but what is LLMOps? Well, it’s all about how we treat and manage these powerful language models throughout their lifecycle.

LLMs are changing the way we create and maintain AI-driven products, and this shift is creating a need for new tools and best practices. In this article, we'll break down LLMOps and its background. We'll also examine how building AI products with LLMs differs from building them with traditional ML models, and how MLOps (Machine Learning Operations) differs from LLMOps as a result. Finally, we'll discuss what developments we can expect in the LLMOps space in the near future.

Learning Objectives:

Gain a sound understanding of LLMOps and its development.
Learn to build a model using LLMOps through examples.
Know the differences between LLMOps and MLOps.
Get a sneak peek into the future of LLMOps.

This article was published as a part of the Data Science Blogathon.

What is LLMOps?

LLMOps stands for Large Language Model Operations. It is similar to MLOps but specifically designed for Large Language Models (LLMs). It requires new tools and best practices to handle everything related to LLM-powered applications, from development to deployment and ongoing maintenance.

To understand this better, let’s break down what LLMs and MLOps mean:

LLMs are large language models that can generate human language. They have billions of parameters and are trained on billions of words of text.
MLOps (Machine Learning Operations) is a set of tools and practices used to manage the lifecycle of applications powered by machine learning.

Now that we’ve explained the basics, let’s dive into this topic more deeply.

What’s the Hype around LLMOps?

LLMs like BERT and GPT-2 have been around since 2018 and 2019. Yet it is only now, almost five years later, that we are seeing a meteoric rise of the idea of LLMOps. The main reason is that LLMs gained a lot of media attention with the release of ChatGPT at the end of November 2022.

Hype of LLMOps due to LLMs and AI models

Since then, we have seen many different types of applications exploiting the power of LLMs. These include chatbots ranging from familiar examples like ChatGPT to more personal writing assistants for editing or summarization (e.g., Notion AI) and specialized ones for copywriting (e.g., Jasper and copy.ai). They also include programming assistants for writing and debugging code (e.g., GitHub Copilot), testing code (e.g., Codium AI), and identifying security threats (e.g., Socket AI).

With many people developing LLM-powered applications and bringing them to production, people are sharing their experiences.

“It’s easy to make something cool with LLMs, but very hard to make something production-ready with them.” - Chip Huyen

It is clear that building production-ready LLM-powered applications comes with its own set of difficulties, distinct from building AI products with classical ML models. We must develop new tools and best practices to deal with these challenges and manage the LLM application lifecycle. Thus, we see an expanded use of the term “LLMOps.”

What are the Steps Involved in LLMOps?

The steps involved in LLMOps are, in some ways, similar to those in MLOps. However, the steps of building an LLM-powered application differ because of the emergence of foundation models. Instead of training LLMs from scratch, the focus lies on adapting pre-trained LLMs to downstream tasks.

Over a year ago, Andrej Karpathy already described how the process of building AI products will change in the future:

“But the most important trend is that the whole setting of training a neural network from scratch on some target task is quickly becoming outdated due to finetuning, especially with the emergence of base models like GPT. These base models are trained by only a few institutions with substantial computing resources, and most applications are achieved via lightweight finetuning of part of the network, prompt engineering, or an optional step of data or model processing into smaller, special-purpose inference networks.” - Andrej Karpathy.

This quote may be overwhelming the first time you read it. But it accurately summarizes everything that has been going on lately, so let’s go through it step by step in the following subsections.

Step 1: Selection of a Base Model

Foundation models or base models are LLMs pre-trained on large amounts of data that can be used for a wide range of tasks. Because training a base model from scratch is difficult, time-consuming, and extremely expensive, only a few institutions have the required training resources.

To put it into perspective, according to a study from Lambda Labs in 2020, training OpenAI’s GPT-3 (with 175 billion parameters) would require 355 years and $4.6 million using a Tesla V100 cloud instance.
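As a rough sanity check on these figures, the short sketch below derives the price per GPU hour that they imply. The constant names are our own, the only inputs are the two numbers quoted above, and a 365-day year is assumed.

```python
# Figures quoted above: a single V100, 355 years of compute, $4.6 million.
GPU_YEARS = 355
TOTAL_COST_USD = 4_600_000
HOURS_PER_YEAR = 24 * 365  # assumption: ignore leap years

gpu_hours = GPU_YEARS * HOURS_PER_YEAR
implied_hourly_rate = TOTAL_COST_USD / gpu_hours

print(f"GPU hours: {gpu_hours:,}")                              # GPU hours: 3,109,800
print(f"Implied price per GPU hour: ${implied_hourly_rate:.2f}")  # $1.48
```

In other words, the estimate corresponds to roughly three million GPU hours at about $1.50 each, which is why only a few institutions train base models from scratch.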

AI is currently going through what the community calls its “Linux Moment.” Currently, developers have to choose between two types of base models, proprietary models or open-source models, based on a trade-off between performance, cost, ease of use, and flexibility.

Proprietary vs open-source foundation models

Proprietary models are closed-source foundation models owned by companies with large expert teams and big AI budgets. They are usually larger than open-source models and perform better. They are also off-the-shelf and generally rather easy to use. The main downside of proprietary models is their expensive APIs (application programming interfaces). Additionally, closed-source foundation models offer developers little or no flexibility for adaptation.

Examples of proprietary model providers are:

OpenAI (GPT-3, GPT-4)
co:here
Anthropic (Claude)
AI21 Labs (Jurassic-2)

Open-source models are frequently organized and hosted on Hugging Face as a community hub. Usually, they are smaller models with lower capabilities than proprietary models. On the upside, they are more cost-effective than proprietary models and offer more flexibility for developers.

Examples of open-source models are:

Stable Diffusion by Stability AI
BLOOM by BigScience
LLaMA or OPT by Meta AI
Flan-T5 by Google

Code:

This step imports the required library and loads a pre-trained model together with its tokenizer.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained GPT-2 model and tokenizer
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

Step 2: Adapting to Downstream Tasks

Once you have chosen your base model, you can access the LLM through its API. If you usually work with other APIs, working with LLM APIs will initially feel a little strange because it is not always clear beforehand what input will cause what output. Given any text prompt, the API will return a text completion, attempting to match your pattern.

Here is an example of how you would use the OpenAI API. You give the API input as a prompt, e.g., prompt = “Correct this to standard English:\n\nHe no went to the market.”

import openai

# Note: this uses the legacy (pre-1.0) interface of the openai Python package
openai.api_key = ...
response = openai.Completion.create(
    engine="text-davinci-003",
    prompt="Correct this to standard English:\n\nHe no went to the market.",
    # ...
)

The API will output a reply containing the completion: response['choices'][0]['text'] = "He did not go to the market."

The main challenge is that LLMs aren’t almighty despite being powerful, and thus, the key question is: How do you get an LLM to give the output you want?

One concern respondents mentioned in the LLM in-production survey was model accuracy and hallucination. That means getting the output from the LLM API in your desired format might take some iterations, and LLMs can hallucinate if they don’t have the required specific knowledge. To deal with these concerns, you can adapt the base models to downstream tasks in the following ways:

Prompt engineering is a technique to improve the input so that the output matches your expectations. You can use different tricks to improve your prompt (see OpenAI Cookbook). One method is to provide some examples of the expected output format. This is similar to zero-shot learning or few-shot learning. Tools like LangChain or HoneyHive are already available to help you manage and version your prompt templates.

Prompt engineering for LLMs

Fine-tuning pre-trained models is a well-known technique in ML. It can help improve your model’s performance and accuracy on your specific task. Although this will increase the training effort, it can decrease the cost of inference. The cost of LLM APIs depends on input and output sequence length. Thus, decreasing the number of input tokens reduces API costs because you no longer have to give examples in the prompt.

External Data: Base models often lack contextual information (e.g., access to some specific documents) and can become outdated rapidly. For example, GPT-4 was trained on data until September 2021. Because LLMs can hallucinate if they don’t have sufficient information, we need to be able to give them access to relevant external data.
Embeddings: A slightly more complex way is to extract information in the form of embeddings from LLM APIs (e.g., product descriptions) and build applications on top of them (e.g., search, comparison, recommendations).
Alternatives: As this field is quickly evolving, there are many more ways to use LLMs in AI products. Some examples are instruction tuning/prompt tuning and model distillation.
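To make the embeddings approach a little more concrete, here is a minimal sketch of a similarity search over product descriptions. The three-dimensional vectors are hand-made placeholders, not real model output; in practice the vectors would come from an embedding model or API and have far more dimensions. The function names `cosine_similarity` and `search` are our own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings (assumption: real ones come from an embedding model)
product_embeddings = {
    "running shoes": [0.9, 0.1, 0.0],
    "trail sneakers": [0.8, 0.2, 0.1],
    "coffee maker": [0.0, 0.1, 0.9],
}


def search(query_embedding, embeddings, top_k=2):
    """Return the names of the top_k items most similar to the query."""
    scored = [
        (cosine_similarity(query_embedding, vector), name)
        for name, vector in embeddings.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:top_k]]


# Pretend this is the embedding of a footwear-related query
print(search([1.0, 0.0, 0.0], product_embeddings))
# ['running shoes', 'trail sneakers']
```

The same ranking step underlies comparison and recommendation features; only the source of the query vector changes.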

Code:

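from transformers import (
    DataCollatorForLanguageModeling,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    TextDataset,
    Trainer,
    TrainingArguments,
)

# Load the base model and tokenizer (same as in Step 1)
model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Load your dataset; block_size is the number of tokens per training example.
# Note: TextDataset is deprecated in recent transformers releases.
dataset = TextDataset(tokenizer=tokenizer, file_path="your_dataset.txt", block_size=128)

# Build batches for causal language modeling (mlm=False disables masking)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# Fine-tune the model
training_args = TrainingArguments(
    output_dir="./your_fine_tuned_model",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
trainer.save_model()  # Saves the model weights to output_dir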

Step 3: Model Evaluation

In classical MLOps, ML models are validated on a hold-out validation set with a metric that indicates the model's performance. But how do you evaluate the performance of an LLM? How do you decide whether an output is good or bad? Currently, it seems like organizations are A/B testing their models.

To help evaluate LLMs, tools like HoneyHive or HumanLoop have emerged.
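As an illustration of what an A/B test can look like in practice, the sketch below compares two model or prompt variants by the share of responses that users rated positively, using a standard two-proportion z-test. The counts are invented for the example, and the function name `ab_test` is our own; this is one common way to analyze such a test, not a method prescribed by any of the tools mentioned.

```python
import math


def ab_test(successes_a, total_a, successes_b, total_b):
    """Two-proportion z-test. Returns (rate_a, rate_b, z, two_sided_p_value)."""
    rate_a = successes_a / total_a
    rate_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided p-value under the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_a, rate_b, z, p_value


# Hypothetical feedback counts: positively rated responses out of all responses
rate_a, rate_b, z, p_value = ab_test(420, 1000, 465, 1000)
print(f"A: {rate_a:.1%}  B: {rate_b:.1%}  z = {z:.2f}  p = {p_value:.3f}")
```

A small p-value suggests the difference between the variants is unlikely to be noise. The function assumes both groups are non-empty and that the pooled rate is strictly between 0 and 1.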

Code:

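from transformers import pipeline

# Create a text generation pipeline from the fine-tuned model.
# The tokenizer is loaded from the base model because only the model weights were saved.
generator = pipeline("text-generation", model="./your_fine_tuned_model", tokenizer="gpt2")

# Generate text and inspect it
generated_text = generator("Prompt text")
print(generated_text[0]["generated_text"])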

Step 4: Deployment and Monitoring

The performance of LLMs can change drastically between releases. For example, OpenAI has updated its models to mitigate inappropriate content generation, e.g., hate speech. As a result, scanning for the phrase “as an AI language model” on Twitter now reveals countless bots.

Tools for monitoring LLMs are already emerging, such as WhyLabs or HumanLoop.
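To illustrate the kind of signal such monitoring can track, here is a minimal sketch that watches a rolling window of completions for the boilerplate phrase mentioned above and raises a flag when its share exceeds a threshold. The class name, window size, and threshold are assumptions made for this example and are not taken from any of the tools named here.

```python
from collections import deque

# Phrases that indicate a canned refusal (lower-cased for matching)
REFUSAL_MARKERS = ("as an ai language model",)


class RefusalMonitor:
    """Tracks how often recent completions contain a refusal marker."""

    def __init__(self, window_size=100, alert_threshold=0.2):
        self.window = deque(maxlen=window_size)  # keeps only the latest results
        self.alert_threshold = alert_threshold

    def record(self, completion):
        text = completion.lower()
        self.window.append(any(marker in text for marker in REFUSAL_MARKERS))

    @property
    def refusal_rate(self):
        return sum(self.window) / len(self.window) if self.window else 0.0

    def should_alert(self):
        return self.refusal_rate > self.alert_threshold


monitor = RefusalMonitor(window_size=4, alert_threshold=0.25)
for output in [
    "He did not go to the market.",
    "As an AI language model, I cannot help with that.",
    "Here is the summary you asked for.",
    "Sorry, as an AI language model I have no opinion.",
]:
    monitor.record(output)

print(monitor.refusal_rate, monitor.should_alert())  # 0.5 True
```

A sudden jump in this rate after a provider updates its model is one simple way to notice that behavior has changed between releases.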

Code:

# Import the necessary libraries
from flask import Flask, request, jsonify
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import logging

# Initialize Flask app
app = Flask(__name__)

# Load the fine-tuned GPT-2 model and the matching tokenizer
model = GPT2LMHeadModel.from_pretrained("./your_fine_tuned_model")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Set up logging
logging.basicConfig(filename='app.log', level=logging.INFO)

# Define a route for text generation
@app.route('/generate_text', methods=['POST'])
def generate_text():
    try:
        data = request.get_json()
        prompt = data['prompt']

        # Generate text (do_sample=True is needed for top_k/top_p to take effect)
        generated_text = model.generate(
            tokenizer.encode(prompt, return_tensors='pt'),
            max_length=100,  # Adjust max length as needed
            do_sample=True,
            num_return_sequences=1,
            no_repeat_ngram_size=2,
            top_k=50,
            top_p=0.95,
        )[0]

        generated_text = tokenizer.decode(generated_text, skip_special_tokens=True)
        
        # Log the request and response
        logging.info(f"Generated text for prompt: {prompt}")
        logging.info(f"Generated text: {generated_text}")

        return jsonify({'generated_text': generated_text})

    except Exception as e:
        # Log any exceptions
        logging.error(f"Error: {str(e)}")
        return jsonify({'error': 'An error occurred'}), 500

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

How the above code works:

Import the necessary libraries: Flask is used to build the web application, transformers is used to load and run the GPT-2 model, and logging is used to record information.
Initialize the Flask app.
Load the model: This loads the fine-tuned GPT-2 model and the corresponding tokenizer. Replace ./your_fine_tuned_model with the path to your actual fine-tuned GPT-2 model.
Set up logging: This configures logging for the application. It sets the log file name to app.log and the logging level to INFO.
Define a route using Flask: It specifies that when a POST request is made to the /generate_text endpoint, the generate_text function should be called.
Reading the prompt: This code extracts JSON data from the incoming POST request. It assumes that the JSON data includes a “prompt” field, which is the text that will be used to generate additional text.
Text generation using GPT-2: This section uses the loaded GPT-2 model and tokenizer to generate text based on the provided prompt. It sets different generation parameters, such as the maximum length of the generated text, the number of sequences to generate, and the sampling parameters.
Decoding and returning generated text: After generating the text, it decodes the generated sequence and removes special tokens. Then, it returns the generated text as a JSON response.
Logging the request and response: It logs the request’s prompt and the generated text in the log file.
Handling exceptions: If any exceptions occur during the text generation process, they are caught and logged as errors. A JSON response with an error message is returned along with a 500 status code to indicate a server error.
Running the Flask app: The final check ensures that the Flask app only runs when the script is executed directly. It runs the app on host ‘0.0.0.0’ and port 5000, making it reachable from any IP address.

Output of the above code:

Input (request body):

{
    "prompt": "Once upon a time"
}

Output (response body):

{
    "generated_text": "Once upon a time, in a faraway land, there lived a..."
}

How is LLMOps Different from MLOps?

The differences between MLOps and LLMOps arise from the differences in how we build AI products with classical ML models versus LLMs. The differences mostly affect data management, experimentation, evaluation, cost, and latency.

Data Management

In standard MLOps, we are used to data-hungry ML models. Training a neural network from scratch needs a lot of labeled data, and even fine-tuning a pre-trained model involves at least a few hundred samples. Data cleaning is essential to the ML development process, but we know and accept that large datasets have imperfections.

In LLMOps, fine-tuning is similar to MLOps. But prompt engineering is a zero-shot or few-shot learning setting. That means we have few but hand-picked samples.
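A few hand-picked samples typically end up directly in the prompt. The sketch below shows one plain way to assemble such a few-shot prompt; the `Input:`/`Output:` layout, the function name, and the two example pairs are illustrative assumptions, while the final query is the sentence used in the API example earlier in this article.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, worked examples, and the new query into one prompt."""
    lines = [instruction, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # left open for the model to complete
    return "\n".join(lines)


# Hypothetical hand-picked examples
examples = [
    ("She don't like apples.", "She doesn't like apples."),
    ("Me and him goes to school.", "He and I go to school."),
]

prompt = build_few_shot_prompt(
    "Correct this to standard English:",
    examples,
    "He no went to the market.",
)
print(prompt)
```

Because every example is sent with every request, the quality of these few samples matters far more than their quantity.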

Experimentation

In MLOps, experimentation looks similar whether you train a model from scratch or fine-tune a pre-trained one. In both cases, you will track inputs, such as model architecture, hyperparameters, and data augmentations, and outputs, such as metrics.

But in LLMOps, the question is whether to engineer prompts or to fine-tune. Fine-tuning in LLMOps will look similar to MLOps, while prompt engineering requires a different experimentation setup that includes the management of prompts.
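To give an idea of what managing prompts can involve, here is a minimal in-memory sketch that versions prompt templates by a hash of their text, so that an experiment can record exactly which prompt produced which output. The `PromptRegistry` class and its methods are hypothetical and are not the API of LangChain, HoneyHive, or any other tool.

```python
import hashlib
from string import Template


class PromptRegistry:
    """Keeps every version of each named prompt template."""

    def __init__(self):
        self._versions = {}  # name -> list of (version_id, template_text)

    def register(self, name, template_text):
        """Store a template and return a short content-based version id."""
        version_id = hashlib.sha256(template_text.encode("utf-8")).hexdigest()[:8]
        history = self._versions.setdefault(name, [])
        if not history or history[-1][0] != version_id:
            history.append((version_id, template_text))
        return version_id

    def render(self, name, version_id=None, **variables):
        """Fill in a template; uses the latest version unless one is given."""
        history = self._versions[name]
        text = history[-1][1] if version_id is None else dict(history)[version_id]
        return Template(text).substitute(**variables)


registry = PromptRegistry()
v1 = registry.register("fix_grammar", "Correct this to standard English:\n\n$text")
v2 = registry.register("fix_grammar", "Rewrite this in standard English:\n\n$text")

print(v1 != v2)  # True: different text gives a different version id
print(registry.render("fix_grammar", text="He no went to the market."))
print(registry.render("fix_grammar", version_id=v1, text="He no went to the market."))
```

Logging the version id next to each model output makes prompt experiments reproducible in the same way that logging hyperparameters does for training runs.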

Evaluation

In classical MLOps, a model’s performance is evaluated on a hold-out validation set with an evaluation metric. Because the performance of LLMs is more difficult to evaluate, organizations currently seem to be using A/B testing.

Cost

While the cost of traditional MLOps usually lies in data collection and model training, the cost of LLMOps lies in inference. Although we can expect some costs from using expensive APIs during experimentation, Chip Huyen shows that the cost of long prompts is in inference.
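The following back-of-the-envelope sketch shows why prompt length matters for inference cost. All token counts, request volumes, and prices are made-up placeholders, and for simplicity the same per-token prices are assumed for both scenarios; check your provider's current pricing before relying on numbers like these.

```python
def estimate_monthly_cost(prompt_tokens, completion_tokens, requests_per_day,
                          price_per_1k_prompt, price_per_1k_completion, days=30):
    """Estimated API cost in dollars for a fixed request shape and volume."""
    cost_per_request = (
        prompt_tokens / 1000 * price_per_1k_prompt
        + completion_tokens / 1000 * price_per_1k_completion
    )
    return cost_per_request * requests_per_day * days


# Hypothetical scenario: 10,000 requests per day, 200-token completions
few_shot_cost = estimate_monthly_cost(
    prompt_tokens=1500, completion_tokens=200, requests_per_day=10_000,
    price_per_1k_prompt=0.001, price_per_1k_completion=0.002,
)
short_prompt_cost = estimate_monthly_cost(
    prompt_tokens=300, completion_tokens=200, requests_per_day=10_000,
    price_per_1k_prompt=0.001, price_per_1k_completion=0.002,
)

print(f"Long few-shot prompt: ${few_shot_cost:,.2f} per month")
print(f"Short prompt:         ${short_prompt_cost:,.2f} per month")
```

With these placeholder numbers, the long few-shot prompt costs about $570 per month and the short one about $210, which is the effect described in the fine-tuning discussion in Step 2.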

Speed

Another concern respondents mentioned in the LLM in-production survey was latency. The completion length of an LLM significantly affects latency. Although latency concerns exist in MLOps as well, they are much more prominent in LLMOps because latency strongly affects both experimentation velocity during development and the user experience in production.
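A simple way to keep an eye on latency is to time repeated calls and look at the median and the worst case. In the sketch below, `fake_generate` is a stand-in that merely sleeps in proportion to the requested completion length; it is an assumption made so the example runs without a model, and you would pass your real generation function instead.

```python
import statistics
import time


def measure_latency(generate, prompt, runs=5):
    """Call generate(prompt) several times and summarize wall-clock latency."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    return {"median_s": statistics.median(timings), "max_s": max(timings)}


def fake_generate(prompt, n_tokens=20, seconds_per_token=0.001):
    """Stand-in for an LLM call: longer completions take longer."""
    time.sleep(n_tokens * seconds_per_token)
    return "token " * n_tokens


short = measure_latency(lambda p: fake_generate(p, n_tokens=5), "Hello", runs=3)
long = measure_latency(lambda p: fake_generate(p, n_tokens=50), "Hello", runs=3)

print(f"5 tokens:  median {short['median_s'] * 1000:.1f} ms")
print(f"50 tokens: median {long['median_s'] * 1000:.1f} ms")
```

Measuring with several completion lengths shows how much of the response time comes from generation itself, which helps when deciding on a maximum length for production.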

The Future of LLMOps

LLMOps is an emerging field. With the speed at which this space is evolving, making any predictions is difficult. It is even uncertain whether the term “LLMOps” is here to stay. We are only sure that we will see a lot of new use cases for LLMs, as well as new tools and best practices to manage the LLM lifecycle.

The field of AI is evolving rapidly, potentially making anything we write now outdated in a month. We’re still in the early stages of bringing LLM-powered applications to production. There are many questions we don’t have the answers to, and only time will tell how things will play out:

Is the term “LLMOps” here to stay?
How will LLMOps evolve in light of MLOps? Will they merge, or will they become separate sets of operations?
How will AI’s “Linux Moment” play out?

We can say with certainty that we will soon see many developments, new tools, and new best practices. We are also already seeing efforts toward cost and latency reduction for base models. These are definitely interesting times!

Conclusion

Since the release of OpenAI’s ChatGPT, LLMs have become a hot topic in the field of AI. These deep learning models can generate outputs in human language, making them a powerful tool for tasks such as conversational AI, programming assistants, and writing assistants.

However, bringing LLM-powered applications to production presents its own set of challenges, which has led to the emergence of a new term, “LLMOps”. It refers to the set of tools and best practices used to manage the lifecycle of LLM-powered applications, including development, deployment, and maintenance.

LLMOps can be seen as a subcategory of MLOps. However, the steps involved in building an LLM-powered application are different from those in building applications with classical ML models. Instead of training an LLM from scratch, the focus is on adapting pre-trained LLMs to downstream tasks. This involves selecting a foundation model, adapting it to downstream tasks, evaluating it, and deploying and monitoring the model. While LLMOps is still a relatively new field, it is sure to continue to develop and evolve as LLMs become more popular in the AI industry.

Key Takeaways:

LLMOps (Large Language Model Operations) is a discipline that focuses on managing the lifecycle of powerful language models like ChatGPT, which are transforming the creation and maintenance of AI-driven products.
The increase in applications utilizing Large Language Models (LLMs) like GPT-3, GPT-3.5, and GPT-4 has led to the rise of LLMOps.
The process of LLMOps includes selecting a base model, adapting it to particular tasks, evaluating model performance through A/B testing, and addressing the cost and latency concerns associated with LLM-powered applications.
LLMOps differs from traditional MLOps in terms of data management (few-shot learning), experimentation (prompt engineering), evaluation (A/B testing), cost (inference-related costs), and speed (latency considerations).

Overall, the rise of LLMs and LLMOps represents a significant shift in building and maintaining AI-powered products. I hope you liked this article. You can connect with me here on LinkedIn.

Frequently Asked Questions

Q1. What are large language models (LLMs)?

Ans. Large language models (LLMs) are a recent advance in deep learning models for working with human language. A large language model is a trained deep-learning model that understands and generates text in a human-like fashion. Behind the scenes, a large transformer model does all the magic.

Q2. What are the key steps in LLMOps?

Ans. The key steps followed in LLMOps are:
1. Select a pre-trained Large Language Model as the base for your application.
2. Adapt the LLM to particular tasks using techniques like prompt engineering and fine-tuning.
3. Regularly evaluate the LLM’s performance through A/B testing and tools like HoneyHive.
4. Deploy the LLM-powered application, continuously monitor its performance, and optimize it.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.