How to Build Large Language Model (LLM) Applications Using a Vector Database

Apurva Kumar 12 Mar, 2024


Introduction

Vector databases are changing how applications store and retrieve data. Instead of matching exact values or keywords, they index numerical representations of text, images, and other content, so an application can search large collections for similar items within milliseconds. In this article, we explore vector databases and their role in building Large Language Model (LLM) applications: what they are, why LLMs need them, and how to combine the two in working code.

For example, if you ask the Amazon customer service app, “How do I change my language in the Android app?”, the underlying model might not have been trained on this exact text and hence might be unable to answer. This is where a vector database comes to the rescue. A vector database stores the domain texts (in this case, help docs) and past user queries and related data such as order history as numerical embeddings, and provides a lookup of similar vectors in real time. In this case, the application encodes the query into a numerical vector and uses it to perform a similarity search in the database of vectors to find its closest neighbors. With this result, the chatbot can guide the user correctly to the “Change your language preference” section on the Amazon app.

Learning Objectives

How do LLMs work, what are their limitations, and why do they need vector databases?
Introduction to embedding models and how to encode and use them in applications.
Learn what a vector database is and how it fits into LLM application architecture.
Learn how to code LLM/Generative AI applications using vector databases and deep learning libraries such as transformers and torch.

This article was published as a part of the Data Science Blogathon.

What are LLMs?
How do LLMs work?
Limitations of LLMs
LLMs and Vector Databases
A Quick Tutorial on Embeddings
LLM Application Architecture
LLM Applications Using Vector Databases
Building a Chatbot App
Building an Image Generator App
Building a Movie Recommendation LLM Application
Real-world Use Cases of LLMs Apps Using Vector Search/Database

Frequently Asked Questions

What are LLMs?

Large Language Models (LLMs) are foundational machine learning models that use deep learning algorithms to process and understand natural language. These models are trained on massive amounts of text data to learn patterns and entity relationships in the language. LLMs can perform many types of language tasks, such as translating languages, analyzing sentiments, chatbot conversations, and more. They can understand complex textual data, identify entities and relationships between them, and generate new text that is coherent and grammatically accurate.

How do LLMs work?

LLMs are trained on large amounts of data, often terabytes or even petabytes, and contain billions or trillions of parameters, enabling them to predict and generate relevant responses based on the user’s prompts or queries. They process input data through word embeddings, self-attention layers, and feedforward networks to generate meaningful text. You can read more about LLM architectures here.

Limitations of LLMs

While LLMs seem to generate responses with quite a high accuracy, even better than humans in many standardized tests, these models still have limitations. Firstly, they solely rely on their training data to build their reasoning and hence may lack specific or current information in the data. This leads to the model generating incorrect or unusual responses, AKA “hallucinations.” There has been an ongoing effort to mitigate this. Secondly, the model may not behave or respond in a manner that aligns with the user’s expectations.

To address this, vector databases and embedding models extend the knowledge of LLMs/Generative AI by looking up content similar to what the user is asking about, across modalities (text, image, video, etc.). Here is an example where the LLM does not have the response the user asks for and instead relies on a vector database to find that information.

LLMs and Vector Databases

Large Language Models (LLMs) are being used or integrated in many parts of industry, such as e-commerce, travel, search, content creation, and finance. Applications built on these models often rely on a relatively new type of database, known as a vector database, which stores numerical representations of text, images, videos, and other data. These representations are called embeddings. This section highlights the fundamentals of vector databases and embeddings and, more significantly, focuses on how to integrate them with LLM applications.

A vector database is a database that stores embeddings and searches over them in a high-dimensional space. These vectors are numerical representations of the features or attributes of a piece of data. Using algorithms that calculate the distance or similarity between vectors, vector databases can quickly and efficiently retrieve similar data. Traditional scalar-based databases store data in rows or columns and use exact matching or keyword-based search. Vector databases instead build specialized indexes to search and compare a large collection of vectors in a very short amount of time (on the order of milliseconds) using techniques such as Approximate Nearest Neighbor (ANN) search.
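To make the idea of similarity search concrete, here is a minimal sketch of an exact (brute-force) nearest-neighbor lookup using cosine similarity. The function name `cosine_top_k` and the toy three-dimensional vectors are illustrative assumptions, not part of any vector database API. Real systems replace the full scan with an ANN index to stay fast on millions of vectors.

```python
import numpy as np

def cosine_top_k(query, vectors, k=3):
    """Return (index, similarity) pairs for the k stored vectors closest to query."""
    query = np.asarray(query, dtype=float)
    vectors = np.asarray(vectors, dtype=float)
    # Cosine similarity = dot product divided by the product of the norms
    sims = vectors @ query / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(query))
    # Sort by descending similarity and keep the first k
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in order]

# Toy 3-dimensional "embeddings" (made-up numbers, for illustration only)
stored = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
neighbours = cosine_top_k([1.0, 0.0, 0.0], stored, k=2)
print(neighbours)
```

The scan touches every stored vector, so its cost grows linearly with the collection. ANN techniques trade a small amount of accuracy for much faster lookups.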

A Quick Tutorial on Embeddings

AI models generate embeddings by passing raw data such as text, video, or images through an embedding model such as word2vec, which outputs a vector of numerical features. In the context of AI and machine learning, these features represent different dimensions of the data that are essential for understanding patterns, relationships, and underlying structures.

Here is an example of how to generate word embeddings using word2vec.

1. Generate the model using your custom corpus of data or use a sample prebuilt model from Google or FastText. If you generate your own, you can save it to your file system as a “word2vec.model” file.

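from gensim.models import Word2Vec

# A corpus is a list of tokenized sentences (toy data; use your own text)
corpus = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
]

# Create a word2vec model (min_count=1 keeps rare words in this tiny corpus)
model = Word2Vec(corpus, min_count=1)

# Save the model file
model.save('word2vec.model')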

2. Load the model, generate a vector embedding for an input word, and use it to get similar words in the vector embedding space.

from gensim.models import Word2Vec

# Load the word2vec model saved in step 1
model = Word2Vec.load('word2vec.model')

# Get the vector for the word "king" (word vectors live under model.wv)
king_vector = model.wv['king']

# Get the nearest vectors; the query word itself usually ranks first,
# so request one extra result and skip "king" when printing
similar_vectors = model.wv.similar_by_vector(king_vector, topn=6)

# Print each neighbouring word with its cosine similarity
for word, score in similar_vectors:
    if word != 'king':
        print(word, round(score, 2))

3. Here are the top 5 words close to the input word.

Output:

man 0.85
prince 0.78
queen 0.75
lord 0.74
emperor 0.72

LLM Application Architecture

At a high level, vector databases rely on embedding models for both creating and querying embeddings. On the ingestion path, the corpus content is encoded into vectors using the embedding model and stored in a vector database such as Pinecone, ChromaDB, or Weaviate. On the read path, the application issues a query as sentences or words, the same embedding model encodes it into a vector, and that vector is used to query the vector database and fetch the results.
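The two paths can be sketched without any external service. The example below is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model (which would return a dense vector), and `InMemoryVectorStore` is a hypothetical class that plays the role of the vector database. The point is only that ingestion and querying share the same encoder.

```python
import math
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

class InMemoryVectorStore:
    """Hypothetical stand-in for a vector database."""

    def __init__(self):
        self.records = []  # list of (doc_id, vector) pairs

    def add(self, doc_id, text):
        # Ingestion path: encode the document and store its vector
        self.records.append((doc_id, embed(text)))

    def query(self, text, top_k=1):
        # Read path: encode the query with the same model, then rank by similarity
        query_vector = embed(text)
        scored = [(doc_id, cosine(query_vector, vec)) for doc_id, vec in self.records]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:top_k]

store = InMemoryVectorStore()
store.add("lang", "change your language preference in the app settings")
store.add("refund", "request a refund for a cancelled order")
print(store.query("how do I change my language in the app"))
```

Because both paths call the same `embed` function, a query lands near the documents it shares meaning with. Swapping in a different encoder on only one path would break that property.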

LLM Applications Using Vector Databases

LLMs help with language tasks and belong to a broader class of models, Generative AI, that can generate images and videos as well as text. In this section, we will learn how to build practical LLM/Generative AI applications using vector databases. I used the transformers and torch libraries for language models and Pinecone as the vector database. You can choose any language model for LLM apps and embeddings, and any vector database for storage and search.

Building a Chatbot App

To build a chatbot using a vector database, you can follow these steps:

Choose a vector database such as Pinecone, Chroma, Weaviate, AWS Kendra, etc.
Create a vector index for your chatbot.
Train a language model using a large text corpus of your choice. For example, for a news chatbot, you can feed in news data.
Integrate the vector database and the language model.

Here is a simple example of a chatbot application that uses a vector database and a language model:

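from pinecone import Pinecone
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer

# Connect to an existing index (assumes the v3+ Pinecone Python client and an
# index named "my-index" whose records keep their source text in metadata["text"])
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("my-index")

# Load the embedding model (an assumption; it must be the same model that
# produced the vectors stored in the index)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Load the language model and its tokenizer ("gpt2" is a small placeholder;
# swap in any causal language model)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Define a function to generate text
def generate_text(prompt):
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs, max_new_tokens=100, pad_token_id=tokenizer.eos_token_id
    )
    # Decode only the newly generated tokens, not the prompt itself
    new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

# Define a function to retrieve the texts most similar to the user's query
def retrieve_similar_texts(query, top_k=3):
    # Encode the query into a vector before searching the index
    query_vector = embedder.encode(query).tolist()
    results = index.query(vector=query_vector, top_k=top_k, include_metadata=True)
    return [match.metadata["text"] for match in results.matches]

# Define a function to generate a response to the user's query
def generate_response(query):
    # Retrieve the stored texts closest to the user's query
    context = "\n".join(retrieve_similar_texts(query))

    # Generate text based on the retrieved context and the question
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return generate_text(prompt)

# Start the chatbot
while True:
    # Get the user's query
    query = input("What is your question? ")

    # Generate a response to the user's query
    response = generate_response(query)

    # Print the response
    print(response)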

This chatbot application will retrieve the most similar vectors to the user’s query vector from the vector database and then generate text using the language model based on the retrieved vectors.

ChatBot > What is your question?
User_A> How tall is the Eiffel Tower?
ChatBot>The height of the Eiffel Tower measures 324 meters (1,063 feet) 
from its base to the top of its antenna.
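The step that turns retrieved results into model input deserves a closer look. Below is a minimal sketch of one way to assemble a prompt from retrieved passages. The function name `build_prompt`, the `max_chars` budget, and the prompt wording are all illustrative assumptions; the right template depends on the language model you use.

```python
def build_prompt(question, passages, max_chars=1000):
    """Combine retrieved passages and the user's question into one prompt."""
    context_lines = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        # Stop adding passages once the context budget is spent
        if used + len(passage) > max_chars:
            break
        context_lines.append(f"[{i}] {passage}")
        used += len(passage)
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Example passages, as they might come back from a vector search
passages = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Including its antenna, the Eiffel Tower is about 324 meters tall.",
]
prompt = build_prompt("How tall is the Eiffel Tower?", passages)
print(prompt)
```

Capping the context keeps the prompt within the model's input limit. Passages arrive ranked by similarity, so the ones dropped are the least relevant.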

Building an Image Generator App

Let’s explore how to build an Image Generator app that uses both Generative AI and LLM libraries.

Create a vector database to store your image vectors.
Extract image vectors from your training data.
Insert the image vectors into the vector database.
Train a generative adversarial network (GAN). Read here if you need an introduction to GAN.
Integrate the vector database and the GAN.

Here is a simple example of a program that integrates a vector database and a GAN to generate images:

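import torch
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer
from torchvision import transforms

# Connect to an existing index of image vectors
# (assumes the v3+ Pinecone Python client and an index named "my-index")
pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("my-index")

# Load a CLIP model so text queries and image vectors share one embedding space
# (an assumption; use whichever model produced the stored image vectors)
embedder = SentenceTransformer("clip-ViT-B-32")

# Load the GAN generator saved as a whole module; weights_only=False unpickles
# arbitrary objects, so only load files you trust
generator = torch.load("generator.pt", weights_only=False)
generator.eval()

# Define a function to generate an image from a vector
def generate_image(vector):
    # Convert the vector to a float tensor with a batch dimension of one
    # (assumes the generator's input size matches the stored vector size)
    tensor = torch.tensor(vector, dtype=torch.float32).unsqueeze(0)

    # Generate the image without tracking gradients
    with torch.no_grad():
        image = generator(tensor)[0]

    # Rescale from [-1, 1] to [0, 1] (assumes a tanh output layer)
    image = ((image + 1) / 2).clamp(0, 1)

    # Transform the image to a PIL image
    return transforms.ToPILImage()(image)

# Start the image generator
while True:
    # Get the user's query
    query = input("What kind of image would you like to generate? ")

    # Encode the query and retrieve the most similar stored vector
    query_vector = embedder.encode(query).tolist()
    results = index.query(vector=query_vector, top_k=1, include_values=True)

    # Generate an image from the retrieved vector
    image = generate_image(results.matches[0].values)

    # Display the image
    image.show()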

This program will retrieve the most similar vector to the user’s query vector from the vector database and then generate an image using the GAN based on the retrieved vector.

ImageBot>What kind of image would you like to generate?
Me>An idyllic image of a mountain with a flowing river.
ImageBot> Wait a minute! Here you go...

You can customize this program to meet your specific needs. For example, you can train a GAN specialized in generating a particular type of image, such as portraits or landscapes.

Building a Movie Recommendation LLM Application

Let’s explore how to build a movie recommendation app from a movie corpus. You can use a similar idea to build a recommendation system for products or other entities.

Create a vector database to store your movie vectors.
Extract movie vectors from your movie metadata.
Insert the movie vectors into the vector database.
Recommend movies to users.

Here is an example of how to use the Pinecone API to recommend movies to users:

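from pinecone import Pinecone

# Create an API client (assumes the v3+ Pinecone Python client)
pc = Pinecone(api_key="YOUR_API_KEY")
user_index = pc.Index("user-index")
movie_index = pc.Index("movie-index")

# Get the user's vector (user_id is a placeholder for a real record ID)
user_id = "YOUR_USER_ID"
user_vector = user_index.fetch(ids=[user_id]).vectors[user_id].values

# Recommend movies to the user
results = movie_index.query(vector=user_vector, top_k=5, include_metadata=True)

# Print the results (assumes each movie record stores its title in metadata)
for match in results.matches:
    print(match.metadata["title"])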

Here is a sample recommendation for a user:

The Shawshank Redemption
The Dark Knight
Inception
The Godfather
Pulp Fiction
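The recommendation lookup presumes that movie vectors were inserted earlier. As a rough sketch of that insertion step, the helpers below shape movies into id/values/metadata records and split them into batches. The names `to_records` and `batched`, the batch size, and the two-dimensional vectors are assumptions made for illustration. The final upsert call is left as a comment because it needs a live index.

```python
def to_records(movies, vectors):
    """Pair each movie with its vector in an id/values/metadata record."""
    return [
        {
            "id": str(movie["id"]),
            "values": list(vector),
            "metadata": {"title": movie["title"]},
        }
        for movie, vector in zip(movies, vectors)
    ]

def batched(records, batch_size=100):
    """Yield successive fixed-size batches of records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

movies = [
    {"id": 1, "title": "Inception"},
    {"id": 2, "title": "The Godfather"},
    {"id": 3, "title": "Pulp Fiction"},
]
vectors = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.3]]  # made-up 2-d vectors

batches = list(batched(to_records(movies, vectors), batch_size=2))
print([len(batch) for batch in batches])

# With a live index, each batch would then be sent with a call along the
# lines of movie_index.upsert(vectors=batch)
```

Sending records in batches keeps each request small, which matters once the catalog grows beyond a few hundred items.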

Real-world Use Cases of LLMs Apps Using Vector Search/Database

Microsoft and TikTok use vector databases such as Pinecone for long-term memory and faster lookups. This is something LLM applications cannot do alone without a vector database. It helps users save their past questions and responses and resume their session. For example, users can ask, “Tell me more about the pasta recipe we discussed last week.” Read here.

Flipkart’s Decision Assistant recommends products to users by first encoding the query as a vector embedding and then doing a lookup against vectors that represent relevant products in a high-dimensional space. For example, if you search for “Wrangler leather jacket brown men medium,” it recommends relevant products to the user using a vector similarity search. Otherwise, the LLM application would not have any recommendations, as no product catalog would contain such titles or product details. You can read it here.

Chipper Cash, a fintech in Africa, uses a vector database to reduce fraudulent user signups by 10x. It does this by storing all the images of previous user signups as vector embeddings. Then, when a new user signs up, it encodes the new signup image as a vector and compares it against the existing users to detect fraud. You can read it here.

Vector Database in LLM — Source: Chipper Cash

Facebook has been using its vector search library called FAISS (blog) in many products internally, including Instagram Reels and Facebook Stories, to do a quick lookup of any multimedia and find similar candidates for better suggestions to be shown to the user.

Conclusion

Vector databases are useful for building various LLM applications, such as image generation, movie or product recommendations, and chatbots. They provide LLM apps with additional or similar information that the underlying models have not been trained on. They store vector embeddings efficiently in a high-dimensional space and use nearest-neighbor search to find similar embeddings with high accuracy.

Key Takeaways

The key takeaways from this article are that vector databases are highly suitable for LLM applications and offer the following significant features for users to integrate with:

Performance: Vector databases are specifically designed to efficiently store and retrieve vector data, which is important for developing high-performance LLM applications.
Precision: Vector databases can accurately match similar vectors, even if they exhibit slight variations. They use nearest-neighbor algorithms to compute similar vectors.
Multi-Modal: Vector databases can accommodate various multi-modal data, including text, images, and sound. This versatility makes them an ideal choice for LLM/Generative AI apps that need to work with diverse data types.
Developer-friendly: Vector databases are relatively user-friendly, even for developers who may not possess extensive knowledge of machine learning techniques.

In addition, I would like to highlight that many existing SQL/NoSQL solutions already add vector embedding storage, indexing, and similarity search features, e.g., PostgreSQL and Redis.

Frequently Asked Questions

Q1. What are LLMs?

A. LLMs are advanced Artificial Intelligence (AI) programs trained on a large corpus of text data using neural networks to mimic human-like responses with context. They can predict, answer, and generate textual data in the domain they have been trained on.

Q2. What are embeddings?

A. Embeddings are numerical representations of text, images, video, or other data formats. They make colocating and finding semantically similar objects easier in a high-dimensional space.

Q3. What is a vector database? Why do LLM apps need them?

A. A vector database stores and queries high-dimensional vector embeddings to find similar vectors using nearest-neighbour algorithms such as locality-sensitive hashing. LLM apps and Generative AI use it to look up additional, similar content at query time instead of fine-tuning the LLM itself.
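As a rough illustration of locality-sensitive hashing, here is a minimal sketch of the random-hyperplane variant, which targets cosine similarity. The helper name `make_hasher`, the bit count, and the sample vectors are assumptions for illustration. Production indexes use many hash tables and tuned parameters.

```python
import numpy as np

def make_hasher(dim, n_bits=8, seed=0):
    """Build a random-hyperplane LSH function for vectors of length dim."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((n_bits, dim))

    def lsh_hash(vector):
        # One bit per hyperplane: which side of the plane the vector falls on
        bits = planes @ np.asarray(vector, dtype=float) > 0
        return "".join("1" if bit else "0" for bit in bits)

    return lsh_hash

lsh_hash = make_hasher(dim=4)
a = [0.9, 0.1, 0.0, 0.2]
b = [1.8, 0.2, 0.0, 0.4]      # same direction as a, twice the length
c = [-0.9, -0.1, 0.0, -0.2]   # opposite direction to a

print(lsh_hash(a), lsh_hash(b), lsh_hash(c))
```

Vectors pointing the same way share a hash and land in the same bucket, so a query only needs to be compared against its bucket's contents rather than the whole collection.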

Q4. What is the future of vector databases?

A. Vector databases are specialized databases that help index and search vector embeddings. They are widely popular in the open-source community, and many organizations and apps are integrating with them. However, many existing SQL/NoSQL databases are adding similar capabilities, so the developer community will have many options in the near future.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.