Build Semantic Search Applications Using Open Source Vector Database ChromaDB

Avikumar Talaviya 18 Jul, 2023

Introduction

With the rise of AI applications and use cases, a steady stream of tools and technologies has emerged to help AI developers build real-world applications. Among these tools, today we will learn about the workings and functions of ChromaDB, an open-source vector database for storing embeddings from AI models such as GPT-3.5, GPT-4, or any open-source model. Embeddings are a crucial component of any AI application pipeline. Because models compare text numerically rather than as raw strings, the data must first be converted into vectors (embeddings) before it can be used in semantic search applications.

So let’s dive deeper into the workings of ChromaDB with hands-on code examples!

This article was published as a part of the Data Science Blogathon.

Fundamentals of ChromaDB and Installing Library

ChromaDB is an open-source vector database designed to store vector embeddings to develop and build large language model applications. The database makes it simpler to store knowledge, skills, and facts for LLM applications.

The above diagram shows the workings of ChromaDB when integrated with any LLM application. ChromaDB gives us tools to perform the following functions:

Store embeddings and their metadata with ids
Embed documents and queries
Search embeddings
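
To make these three functions concrete, here is a minimal sketch of what a vector store does conceptually. It is a toy written for illustration, not ChromaDB's implementation: the class name ToyVectorStore, its methods, and the brute-force cosine search are all assumptions made for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


class ToyVectorStore:
    """Hypothetical stand-in for a vector database (not ChromaDB itself)."""

    def __init__(self):
        self._rows = {}  # id -> (embedding, metadata)

    def add(self, ids, embeddings, metadatas):
        # Store each embedding with its metadata under the given id.
        for id_, emb, meta in zip(ids, embeddings, metadatas):
            self._rows[id_] = (list(emb), meta)

    def query(self, query_embedding, n_results=1):
        # Brute force: score every stored vector, then keep the closest ones.
        scored = [
            (cosine_distance(query_embedding, emb), id_, meta)
            for id_, (emb, meta) in self._rows.items()
        ]
        scored.sort(key=lambda row: row[0])
        return scored[:n_results]


store = ToyVectorStore()
store.add(
    ids=["id1", "id2"],
    embeddings=[[1.0, 0.0], [0.0, 1.0]],  # made-up 2-d vectors
    metadatas=[{"category": "animal"}, {"category": "vehicle"}],
)
nearest = store.query([0.1, 0.9], n_results=1)
print(nearest[0][1], nearest[0][2])
```

A real vector database replaces the brute-force loop with an approximate nearest-neighbour index and produces the embeddings with a model, but the store, embed, and search roles stay the same.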

ChromaDB is super simple to use and set up with any LLM-powered application. It is designed to boost developer productivity, making it a developer-friendly tool.

Now, let’s install ChromaDB in the Python and JavaScript environments. It can also run in a Jupyter Notebook, allowing data scientists and machine learning engineers to experiment with LLMs.

Python Installation

# install chromadb in the Python environment
pip install chromadb

JavaScript Installation

# install chromadb in JS environment
npm install --save chromadb # yarn add chromadb

With the library installed, we will explore its main functions in the next sections.

Functions and Workings of ChromaDB

We can use a Jupyter Notebook environment such as Google Colab for our demo. You can run the following hands-on exercises in Google Colab, Kaggle, or a local notebook environment.

Creating ChromaDB Collection

# import chromadb and create a client
import chromadb

client = chromadb.Client()
collection = client.create_collection("my-collection")

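In the above code, we have instantiated the client object and used it to create a collection named “my-collection”. With its default settings the client keeps data in memory, so the collection does not outlive the Python session.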

The collection is where embeddings, documents, and any additional metadata are stored to query later for various applications.

Add Documents to the Collection

# add the documents in the db
collection.add(
    documents=[
        "This is a document about cat",
        "This is a document about car",
        "This is a document about bike",
    ],
    metadatas=[{"category": "animal"}, {"category": "vehicle"}, {"category": "vehicle"}],
    ids=["id1", "id2", "id3"],
)

Now we have added a few sample documents, along with their metadata and ids, so that they are stored in a structured manner.

ChromaDB will store the text documents and handle tokenization, vectorization, and indexing automatically without any extra commands.

Query the Collection Database

# query the collection to retrieve the closest document
results = collection.query(
    query_texts=["vehicle"],
    n_results=1
)

------------------------------[Results]-------------------------------------
{'ids': [['id2']],
 'embeddings': None,
 'documents': [['This is a document about car']],
 'metadatas': [[{'category': 'vehicle'}]],
 'distances': [[0.8069301247596741]]}

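Calling the ‘query()’ function on the collection returns the documents most similar to the input query, along with their metadata and ids. In our example, the query “vehicle” returns the document about a car, whose metadata category is ‘vehicle’. Note that the match is made against the document text through its embedding; the metadata is returned with the hit but does not drive the ranking here.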
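
Each field in the result is a list of lists: the outer list has one entry per query text, and the inner list has one entry per hit. A small sketch of flattening that structure into one record per hit follows. The helper name flatten_hits is hypothetical, and the dictionary is copied from the output shown above so the snippet runs without a database.

```python
# Result dict copied from the query output shown above.
results = {
    "ids": [["id2"]],
    "embeddings": None,
    "documents": [["This is a document about car"]],
    "metadatas": [[{"category": "vehicle"}]],
    "distances": [[0.8069301247596741]],
}


def flatten_hits(results, query_index=0):
    """Return one dict per hit for the query at position query_index."""
    return [
        {"id": id_, "document": doc, "metadata": meta, "distance": dist}
        for id_, doc, meta, dist in zip(
            results["ids"][query_index],
            results["documents"][query_index],
            results["metadatas"][query_index],
            results["distances"][query_index],
        )
    ]


hits = flatten_hits(results)
for hit in hits:
    print(f'{hit["id"]}: {hit["document"]} (distance {hit["distance"]:.3f})')
```

With several query texts, pass a different query_index to read the hits for each one.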

Semantic Search Application with Sample Documents

Semantic search is one of the most popular applications in the technology industry and is used in web searches by Google, Baidu, etc. With language models and embeddings, individuals and businesses can now build such applications over large amounts of their own data.

We will use the “pets” folder, which contains a few sample text documents, to build a semantic search application in ChromaDB. We have the following files in a local folder:

[Image: Semantic search application with sample documents | ChromaDB]

Let’s import files from the local folder and store them in “file_data”.

# import files from the pets folder to store in VectorDB
import os

def read_files_from_folder(folder_path):
    file_data = []

    for file_name in os.listdir(folder_path):
        if file_name.endswith(".txt"):
            with open(os.path.join(folder_path, file_name), 'r') as file:
                content = file.read()
                file_data.append({"file_name": file_name, "content": content})

    return file_data

folder_path = "/content/pets"
file_data = read_files_from_folder(folder_path)

The above code reads every .txt file in the “pets” folder and appends its name and content to the “file_data” list. We will store these files in ChromaDB as embeddings for querying.

# get the data from file_data and create chromadb collection
documents = []
metadatas = []
ids = []

for index, data in enumerate(file_data):
    documents.append(data['content'])
    metadatas.append({'source': data['file_name']})
    ids.append(str(index + 1))

# create a collection of pet files 
pet_collection = client.create_collection("pet_collection")

# Add files to the chromadb collection
pet_collection.add(
    documents=documents,
    metadatas=metadatas,
    ids=ids
)

The above code takes the content and metadata of each file in the list and adds them to a ChromaDB collection called “pet_collection”.

Note that by default ChromaDB uses the “all-MiniLM-L6-v2” embedding model from Sentence Transformers to convert text documents into vectors. Now, let’s query the collection to see the results.
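
One detail worth keeping in mind: os.listdir returns entries in arbitrary order, so ids built from the loop index can point at different files from one run to the next. If repeatable ids matter, one option is to sort the file names and derive each id from the name itself. The sketch below shows that idea; the helper stable_id and the file names are made up for illustration.

```python
import hashlib


def stable_id(file_name):
    """Derive a short, repeatable id from a file name (hypothetical helper)."""
    return hashlib.sha1(file_name.encode("utf-8")).hexdigest()[:12]


# Hypothetical file names, standing in for whatever os.listdir returns.
file_names = ["b_notes.txt", "a_notes.txt", "c_notes.txt"]

# Sort first so the document order no longer depends on the file system.
ordered = sorted(file_names)
ids = [stable_id(name) for name in ordered]

for name, id_ in zip(ordered, ids):
    print(name, id_)
```

The same file name always maps to the same id, which makes it easier to update or delete a specific document later.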

# query the database to get the answer from vectorized data
results = pet_collection.query(
    query_texts=["What is the Nutrition needs of the pet animals?"],
    n_results=1
)

results

When we query the collection, it automatically finds the document most similar to our query among the embedded documents and returns it. The output also includes a distance value, which shows how close that document is to our query; smaller values mean a closer match.
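
To get a feel for what such a distance value means, the sketch below computes two common distances on small made-up vectors: squared Euclidean (L2) distance and cosine distance. My understanding is that ChromaDB's default is squared L2, but treat that as an assumption and check the documentation for your version. The vectors here are invented and far smaller than real embeddings.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Made-up 3-d vectors standing in for a query and two documents.
query = normalize([1.0, 2.0, 2.0])
doc_near = normalize([1.0, 2.0, 1.5])
doc_far = normalize([-2.0, 0.5, 0.0])

for name, doc in [("near", doc_near), ("far", doc_far)]:
    print(name, round(squared_l2(query, doc), 4), round(cosine_distance(query, doc), 4))
```

For unit-length vectors the two measures rank documents identically, because squared L2 distance equals twice the cosine distance. In both cases a smaller number means a closer match.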

Using Different Embedding Models

So far we have used the default embedding model to vectorize the input texts, but ChromaDB also works with other models from the Sentence Transformers library. We will use the “paraphrase-MiniLM-L3-v2” model to embed the same pets documents for our semantic search application.

(Note: If you haven’t already, install the library with “pip install sentence-transformers” before executing the code below.)

# import the sentence transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('paraphrase-MiniLM-L3-v2')

documents = []
embeddings = []
metadatas = []
ids = []

# enumerate through file_data to collect each document, embedding, and metadata
for index, data in enumerate(file_data):
    documents.append(data['content'])
    embedding = model.encode(data['content']).tolist()
    embeddings.append(embedding)
    metadatas.append({'source': data['file_name']})
    ids.append(str(index + 1))

# create a new collection that will hold the precomputed embeddings
pet_collection_emb = client.create_collection("pet_collection_emb")

# add the pets files into the pet_collection_emb database
pet_collection_emb.add(
    documents=documents,
    embeddings=embeddings,
    metadatas=metadatas,
    ids=ids
)

The above code uses the “paraphrase-MiniLM-L3-v2” model to encode the input files and passes the resulting embeddings explicitly when adding them to the new collection.

Now, we can query the database again to get the most similar results.

# write text query and submit to the collection 
query = "What are the different kinds of pets people commonly own?"
input_em = model.encode(query).tolist()

results = pet_collection_emb.query(
    query_embeddings=[input_em],
    n_results=1
)
results
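
When you supply your own embeddings, the query has to be encoded with the same model as the documents so that all vectors share one dimension and one vector space. A small pre-flight check along these lines can catch mismatched batches early. The function check_batch is a hypothetical helper, not part of ChromaDB, and the vectors are made up.

```python
def check_batch(ids, documents, embeddings, metadatas):
    """Hypothetical pre-flight check; returns the shared embedding dimension."""
    sizes = {len(ids), len(documents), len(embeddings), len(metadatas)}
    if len(sizes) != 1:
        raise ValueError("ids, documents, embeddings and metadatas must have equal length")
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    dims = {len(vec) for vec in embeddings}
    if len(dims) > 1:
        raise ValueError(f"embeddings have mixed dimensions: {sorted(dims)}")
    return dims.pop() if dims else 0


# Made-up 3-d vectors; real sentence embeddings have hundreds of dimensions.
dim = check_batch(
    ids=["1", "2"],
    documents=["first text", "second text"],
    embeddings=[[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    metadatas=[{"source": "a.txt"}, {"source": "b.txt"}],
)

# A query vector from the same model must have the same dimension.
query_embedding = [0.7, 0.8, 0.9]
query_matches = len(query_embedding) == dim
print(dim, query_matches)
```

A matching dimension is necessary but not sufficient: two different models can produce vectors of the same size that are still not comparable, so keep the model name alongside the collection.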

Embeddings Supported in ChromaDB

Embeddings are the native way to store all kinds of data for AI applications. They can represent text, images, audio, and video data as per the requirements of the applications.

ChromaDB supports many AI models from different embedding providers, such as OpenAI, Sentence transformers, Cohere, and the Google PaLM API. Let’s look at some of them here.

Sentence Transformer Embeddings

# load any model from the Sentence Transformers library
from chromadb.utils import embedding_functions

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2")

With the above code, we can load any of the available Sentence Transformers models. You can find the list of models in the Sentence Transformers documentation.

OpenAI Models

ChromaDB provides a wrapper function for calling OpenAI’s embedding model APIs from AI applications.

# embedding function that calls the OpenAI embeddings API
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_API_KEY",
    model_name="text-embedding-ada-002"
)

For more detailed information on ChromaDB functions, please refer to the official ChromaDB documentation.

Conclusion

In conclusion, vector databases are key building blocks for generative AI applications. ChromaDB is one such vector database, and it is increasingly used in a wide range of LLM-based applications. In this blog, we learned about ChromaDB’s various functions and workings using code examples.

Key Takeaways

We learned various functions of ChromaDB with code examples.
We learned about ChromaDB’s use cases in a semantic search application.
Finally, we saw types of embeddings such as OpenAI, Cohere, and sentence transformers that are supported by ChromaDB.

Frequently Asked Questions

Q1. What is ChromaDB used for?

A. ChromaDB is an AI-native open-source database designed for LLM-based applications. It makes knowledge and skills pluggable for LLMs.

Q2. Is ChromaDB free?

A. Yes, ChromaDB is free to use for any personal or commercial purpose under the Apache 2.0 license.

Q3. Is ChromaDB in memory?

A. ChromaDB is flexible. It can run in memory for quick experiments, persist data to disk in an embedded configuration, or run as a separate server, depending on the needs of the LLM-based application.

Q4. What is the difference between ChromaDB and LangChain?

A. ChromaDB is a vector database that stores data as embeddings, while LangChain is a framework for building LLM applications that connects components such as models, prompts, data loaders, and vector stores like ChromaDB.

Q5. What are the embeddings supported by ChromaDB?

A. ChromaDB supports Sentence Transformers models, OpenAI APIs, Cohere, and any other open-source model for generating the embeddings it stores.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.