How to Build a Multi-Modal Search App with Chroma?

Sunil Kumar Last Updated : 17 Nov, 2023


Introduction

Have you ever wondered how our intricate brains process the world? While the brain’s inner workings remain a mystery, we can liken it to a versatile neural network. Thanks to electrochemical signals, it handles various data types – audio, visuals, smells, tastes, and touch. As AI advances, multi-modal models emerge, revolutionizing search capabilities. This innovation opens up possibilities, enhancing search accuracy and relevance. Discover the fascinating realm of multi-modal search.

Learning Objectives

Understand the term “Multi-modality in AI”.
Gain insights into OpenAI’s image-text model, CLIP.
Learn what a vector database is and understand vector indexing in brief.
Use CLIP and Chroma vector database to build a food recommender with a Gradio interface.
Explore other real-world use cases of a multi-modal search.

This article was published as a part of the Data Science Blogathon.

What is Multi-modality in AI?
Contrastive Language-Image Pre-Training (CLIP)
Why are Vector Databases Required?
What is Gradio?
Building the App
CLIP Embeddings
Load Embeddings
Gradio App
Real-life Use cases
Frequently Asked Questions

What is Multi-modality in AI?

In general usage, multi-modal means involving multiple modes or methods in a process. In Artificial Intelligence, multi-modal models are neural networks that can process and understand different data types. For example, GPT-4 and Bard are LLMs that can understand both text and images. Other examples include Tesla’s self-driving cars, which combine visual and sensor data to make sense of their surroundings, and Midjourney or DALL-E, which generate pictures from text descriptions.

Contrastive Language-Image Pre-Training (CLIP)

CLIP is an open-source multi-modal neural network from OpenAI trained on a large dataset of image-text pairs. This training teaches CLIP to associate visual concepts in images with their text descriptions. The model can then be instructed in natural language to classify a wide range of images without task-specific training.

The zero-shot capability of CLIP is comparable to that of GPT-3: CLIP can classify images into any set of categories without being trained on those categories specifically. For example, to classify images of dogs vs. cats, we only need to compare the logit scores of the image against the text descriptions “an image of a dog” and “an image of a cat”. A photo of a cat or dog is likely to have a higher logit score with its respective text description.

This is known as zero-shot classification because CLIP does not need to be trained on a dataset of images of dogs and cats to be able to classify them. Here’s a visual presentation of how CLIP works.

CLIP uses an image encoder, such as a Vision Transformer (ViT), for images and a text Transformer for text features. Both encodings are projected into a shared vector space with identical dimensions. The dot product of the two normalized vectors (their cosine similarity) serves as a similarity score between the text snippet and the image. In other words, CLIP can classify images into any set of categories without being optimized for them. In this article, we will use CLIP programmatically.
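
To make the scoring step concrete, here is a minimal sketch of zero-shot classification using only the Python standard library. The three-dimensional vectors are made-up stand-ins for real CLIP embeddings, and the scale of 100 is an assumed temperature used only to sharpen the softmax; the function names are hypothetical and not part of the CLIP package.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_classify(image_embedding, label_embeddings, scale=100.0):
    """Return (best_label, {label: probability}) for one image embedding.

    label_embeddings maps each text prompt to its embedding.
    scale is an assumed temperature, not a value read from a real model.
    """
    labels = list(label_embeddings)
    logits = [
        scale * cosine_similarity(image_embedding, label_embeddings[label])
        for label in labels
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], dict(zip(labels, probs))

# Toy vectors standing in for real CLIP embeddings (assumption)
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "an image of a dog": [0.8, 0.2, 0.1],
    "an image of a cat": [0.1, 0.9, 0.3],
}

label, probs = zero_shot_classify(image_embedding, label_embeddings)
print(label)
```

No training on the two categories is involved: adding a new category only means adding another text prompt and its embedding.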

Why are Vector Databases Required?

Machine learning algorithms do not understand data in its raw format, so we first need to transform the data into numerical form. Vectors, or embeddings, are numerical representations of various data types such as text, images, audio, and video. Traditional databases, however, are not designed to query high-dimensional vector data. To build an application that uses millions of vector embeddings, we need a database that can store, search, and query them efficiently. Vector databases are purpose-built for exactly this.
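
As a rough illustration of what "store, search, and query" means, the sketch below implements a tiny in-memory vector store with exact search. TinyVectorStore is a hypothetical class written for this article, not part of Chroma or any other library, and the two-dimensional embeddings are toy values.

```python
import math

class TinyVectorStore:
    """Hypothetical in-memory store keeping ids, embeddings and metadata side by side."""

    def __init__(self):
        self.ids = []
        self.embeddings = []
        self.metadatas = []

    def add(self, ids, embeddings, metadatas):
        """Append records; the three lists must line up one-to-one."""
        if not (len(ids) == len(embeddings) == len(metadatas)):
            raise ValueError("ids, embeddings and metadatas must have the same length")
        self.ids.extend(ids)
        self.embeddings.extend(embeddings)
        self.metadatas.extend(metadatas)

    def query(self, query_embedding, n_results=2):
        """Exact search: measure the distance to every stored vector, keep the closest."""
        distances = [math.dist(query_embedding, e) for e in self.embeddings]
        order = sorted(range(len(distances)), key=distances.__getitem__)[:n_results]
        return [(self.ids[i], self.metadatas[i], distances[i]) for i in order]

store = TinyVectorStore()
store.add(
    ids=["0", "1", "2"],
    embeddings=[[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]],  # toy embeddings
    metadatas=[{"dish": "pizza"}, {"dish": "salad"}, {"dish": "flatbread"}],
)

hits = store.query([1.0, 0.05], n_results=2)
print([hit[0] for hit in hits])
```

Scanning every vector like this is fine for a handful of records but scales linearly with the collection size, which is the cost that dedicated vector databases are designed to avoid.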

The following picture illustrates a simplified workflow of a vector database.

We need specialized embedding models capable of capturing the underlying semantic meaning of the data. The models differ by data type: image models such as ResNet or Vision Transformers are used for image data, and text models such as Ada and SentenceTransformers are used for text. For cross-modal interaction, multi-modal models such as Tortoise (text-to-speech) and CLIP (image-text) are used. These models produce the embeddings of the input data. Vector databases usually ship with built-in embedding model integrations, but we can also define our own models to get embeddings and store them in vector stores.

Indexing

Embeddings are usually high-dimensional, and querying high-dimensional vectors is often time and compute-intensive. Hence, vector databases employ various indexing methods for efficient querying. Indexing refers to organizing high-dimensional vectors in a way that provides efficient querying of nearest-neighbor vectors.

Some popular indexing algorithms are HNSW (Hierarchical Navigable Small World), Product Quantization, Inverted File Index (IVF), and Scalar Quantization. Of these, HNSW is the most widely used across vector databases.
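
To give a feel for how an index avoids scanning every vector, here is a simplified sketch of the inverted-file idea in plain Python. It is not how Chroma works internally (Chroma relies on HNSW), and it cuts a corner that real IVF implementations do not: centroids are sampled at random instead of being learned with k-means. All function names are hypothetical.

```python
import math
import random

def nearest(point, candidates):
    """Index of the candidate vector closest to point (Euclidean distance)."""
    return min(range(len(candidates)), key=lambda i: math.dist(point, candidates[i]))

def build_ivf(vectors, n_lists, seed=0):
    """Sample n_lists vectors as centroids and bucket every vector under its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(vectors, n_lists)
    buckets = [[] for _ in range(n_lists)]
    for idx, vec in enumerate(vectors):
        buckets[nearest(vec, centroids)].append(idx)
    return centroids, buckets

def ivf_search(query, vectors, centroids, buckets, k=3, n_probe=1):
    """Scan only the n_probe buckets whose centroids are closest to the query."""
    by_distance = sorted(range(len(centroids)), key=lambda c: math.dist(query, centroids[c]))
    candidates = [idx for c in by_distance[:n_probe] for idx in buckets[c]]
    candidates.sort(key=lambda idx: math.dist(query, vectors[idx]))
    return candidates[:k]

def brute_force_search(query, vectors, k=3):
    """Exact baseline: rank every vector by distance to the query."""
    order = sorted(range(len(vectors)), key=lambda idx: math.dist(query, vectors[idx]))
    return order[:k]

# Random 8-dimensional vectors standing in for real embeddings (assumption)
rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(200)]
query = [rng.random() for _ in range(8)]

centroids, buckets = build_ivf(vectors, n_lists=10)
exact = brute_force_search(query, vectors)
approx = ivf_search(query, vectors, centroids, buckets, n_probe=2)
full = ivf_search(query, vectors, centroids, buckets, n_probe=10)
print(exact, approx)
```

With n_probe equal to the number of buckets the search degenerates to the exact scan; smaller values look at fewer vectors and may miss a true neighbor, which is the accuracy-for-speed trade-off behind approximate search.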

For this application, we will use the Chroma Vector Database. Chroma is an open-source vector database. It lets you quickly set up a client to store and query vectors and associated metadata. There are other such vector stores that you can use, such as Weaviate, Qdrant, Milvus, etc.

What is Gradio?

Gradio is an open-source Python library for quickly building a web interface to share Machine Learning models. It lets us set up a demo web interface entirely in Python and provides the flexibility to create a decent prototype to showcase the backend models.

To know more about building, refer to this article.

Building the App

This section walks through the code to create a simple restaurant dish recommender app using Gradio, Chroma, and CLIP. Chroma doesn’t yet have out-of-the-box support for multi-modal models, so this will be a workaround.

There are two ways to use CLIP in your project: OpenAI’s CLIP implementation or Hugging Face’s implementation. For this project, we will use OpenAI’s CLIP. Make sure you have a virtual environment with the following dependencies installed. Note that OpenAI’s clip package is installed from its GitHub repository (pip install git+https://github.com/openai/CLIP.git) rather than from PyPI, and that the code below also uses Pillow to load images.

clip
torch
chromadb
gradio

This is our directory structure.

├── app.py
├── clip_chroma
├── clip_embeddings.py
├── __init__.py
├── load_data.py

CLIP Embeddings

The first thing we need to do is build a class to extract embeddings of images and texts. As we know, CLIP has two parts to process texts and images. We will use respective models to encode different modalities.

import clip
import torch

from typing import List
from PIL import Image

class ClipEmbeddingsfunction:

    def __init__(self, model_name: str = "ViT-B/32", device: str = "cpu"):
        self.device = device  # Store the specified device for model execution
        # Load the CLIP model and its matching image preprocessing pipeline
        self.model, self.preprocess = clip.load(model_name, self.device)

    def __call__(self, docs: List[str]) -> List[List[float]]:
        # Takes a list of image file paths (docs) and returns one embedding per image
        list_of_embeddings = []  # Create an empty list to store the image embeddings
        for image_path in docs:
            image = Image.open(image_path)  # Open and load an image from the provided path
            image = image.resize((224, 224))
            # Preprocess the image and move it to the specified device
            image_input = self.preprocess(image).unsqueeze(0).to(self.device)
            with torch.no_grad():
                # Compute the image embedding using the CLIP model and convert
                # it to a NumPy array
                embeddings = self.model.encode_image(image_input).cpu().numpy()
            # Store the embedding as a list of plain Python floats
            list_of_embeddings.append(embeddings[0].tolist())
        return list_of_embeddings

    def get_text_embeddings(self, text: str) -> List[float]:
        # Tokenize the input text and move the tokens to the same device as the model
        text_token = clip.tokenize(text).to(self.device)
        with torch.no_grad():
            # Compute the text embedding using the CLIP model and convert it to a NumPy array
            text_embeddings = self.model.encode_text(text_token).cpu().numpy()
        return text_embeddings[0].tolist()

In the above code, we have defined a class to extract embeddings of texts and images. The class takes the model name and device as inputs. If your machine supports CUDA, you can enable it by passing device="cuda". CLIP supports several models, which you can list with clip.available_models():

clip.available_models()

['RN50',
 'RN101',
 'RN50x4',
 'RN50x16',
 'RN50x64',
 'ViT-B/32',
 'ViT-B/16',
 'ViT-L/14',
 'ViT-L/14@336px']

The model name by default is set as “ViT-B/32”. You can pass any other model you wish.

The __call__ method takes a list of image paths and returns a list of embeddings, one per image. The get_text_embeddings method takes a string and returns its embedding.

Load Embeddings

We need to populate our vector database first. I collected a few images of dishes to add to our collection. Create a list of image paths (img_list) and a matching list of descriptions (menu_description), with one metadata dictionary per image. The image paths will be our documents, while the image descriptions will be stored as metadata.
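
One possible way to assemble these lists is sketched below. It assumes the images sit in a single folder and that the descriptions are available as a dictionary keyed by file name; the function name, the "description" metadata key, and the folder layout are illustrative assumptions rather than part of the original project.

```python
from pathlib import Path

# File extensions treated as images (assumption)
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}

def build_records(image_dir, descriptions):
    """Return (ids, documents, metadatas) lists of equal length.

    image_dir: folder containing the dish images.
    descriptions: hypothetical dict mapping file name -> dish description.
    Images without a description fall back to their file stem.
    """
    paths = sorted(
        p for p in Path(image_dir).iterdir()
        if p.suffix.lower() in IMAGE_SUFFIXES
    )
    documents = [str(p) for p in paths]  # image paths act as documents
    metadatas = [{"description": descriptions.get(p.name, p.stem)} for p in paths]
    ids = [str(i) for i in range(len(documents))]
    return ids, documents, metadatas
```

The returned documents and metadatas lists play the roles of img_list and menu_description. Keeping a single key per metadata dictionary matches the retrieval code later in the article, which reads the first metadata value as the caption.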

But first, create a Chroma collection.

import os
from chromadb import Client, Settings
from clip_embeddings import ClipEmbeddingsfunction
from typing import List

ef = ClipEmbeddingsfunction()
client = Client(settings = Settings(is_persistent=True, persist_directory="./clip_chroma"))
coll = client.get_or_create_collection(name = "clip", embedding_function = ef)

We imported the embedding function we defined earlier and passed it as the default embedding function for the collection.

Now, load the data into the database.

coll.add(ids=[str(i) for i in range(len(img_list))],
         documents = img_list, #paths to images
         metadatas = menu_description,# description of dishes
         )

That’s it. Now, you are ready to build the final part.

Gradio App

First, create an app.py file, import the following dependencies, and initialize the embedding function.

import gradio as gr
from chromadb import Client, Settings
from clip_embeddings import ClipEmbeddingsfunction

client = Client(Settings(is_persistent=True, persist_directory="./clip_chroma"))

ef = ClipEmbeddingsfunction()

For the front end, we will use Gradio Blocks to build a simple interface that takes a search query, either text or an image, and shows relevant images.

with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            query = gr.Textbox(placeholder = "Enter query")
            gr.HTML("OR")
            photo = gr.Image()
            button = gr.UploadButton(label = "Upload file", file_types=["image"])
        with gr.Column():
            gallery = gr.Gallery().style(
                                     object_fit='contain', 
                                     height='auto', 
                                     preview=True
                                  )

Now, we will define trigger events for the gradio app.

    query.submit(
        fn = retrieve_image_from_query, 
        inputs=[query], 
        outputs=[gallery]
        )
    button.upload(
        fn = show_img, 
        inputs=[button],
        outputs = [photo]).\
        then(
            fn = retrieve_image_from_image, 
            inputs=[button], 
            outputs=[gallery]
            )

The code above defines the trigger events. A text query is processed by the retrieve_image_from_query function. For an uploaded image, we first render it on the photo object with show_img and then invoke retrieve_image_from_image(), displaying the results in the Gallery object.

Run the app.py file with the gradio command and visit the local address shown in the terminal.

Now, we will define the actual functions.

def retrieve_image_from_image(image):
    # Get a collection named "clip" using the specified embedding function (ef)
    coll = client.get_collection(name="clip", embedding_function=ef)

    # Get the path of the uploaded image file
    image = image.name

    # Query the collection with the image path; the collection's embedding
    # function (ef) treats query_texts as image paths and embeds the image
    result = coll.query(
        query_texts=image,  # Path to the uploaded image
        include=["documents", "metadatas"],  # Include both documents and metadata in the results
        n_results=4  # Specify the number of results to retrieve
    )

    # Get the retrieved documents and their metadata
    docs = result['documents'][0]
    descs = result["metadatas"][0]

    # Create a list to store pairs of documents and their corresponding metadata
    list_of_docs = []

    # Iterate through the retrieved documents and metadata
    for doc, desc in zip(docs, descs):
        # Append a tuple containing the document and its metadata to the list
        list_of_docs.append((doc, list(desc.values())[0]))

    # Return the list of document-metadata pairs
    return list_of_docs

We also have another function to handle text queries.

def retrieve_image_from_query(query: str):
    # Get a collection named "clip" using the specified embedding function (ef)
    coll = client.get_collection(name="clip", embedding_function=ef)

    # Get text embeddings for the input query using the embedding function (ef)
    emb = ef.get_text_embeddings(text=query)

    # Convert the text embeddings to float values
    emb = [float(i) for i in emb]

    # Query the collection using the text embeddings
    result = coll.query(
        query_embeddings=emb,  # Use the text embeddings as the query
        include=["documents", "metadatas"],  # Include both documents and metadata in the results
        n_results=4  # Specify the number of results to retrieve
    )

    # Get the retrieved documents and their metadata
    docs = result['documents'][0]
    descs = result["metadatas"][0]

    # Create a list to store pairs of documents and their corresponding metadata
    list_of_docs = []

    # Iterate through the retrieved documents and metadata
    for doc, desc in zip(docs, descs):
        # Append a tuple containing the document and its metadata to the list
        list_of_docs.append((doc, list(desc.values())[0]))

    # Return the list of document-metadata pairs
    return list_of_docs

Because the collection’s embedding function expects image paths, we do not pass the query text to Chroma directly. Instead, we extract the text embeddings ourselves and pass them to Chroma’s query method.
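
Both retrieval functions end with the same loop that pairs image paths with captions. As an optional refactoring, that step could live in one helper, sketched below. The helper name is hypothetical, and fake_result is a hand-written stand-in shaped like the result of a query with a single query item, not real output.

```python
def to_gallery_items(result):
    """Pair each returned image path with the first value of its metadata dict."""
    docs = result["documents"][0]   # results for the first (only) query item
    descs = result["metadatas"][0]
    # Fall back to an empty caption when a record has no metadata
    return [(doc, next(iter((desc or {}).values()), "")) for doc, desc in zip(docs, descs)]

# Hand-written stand-in for a query result (assumption)
fake_result = {
    "documents": [["images/biryani.jpg", "images/dosa.jpg"]],
    "metadatas": [[{"description": "Chicken biryani"}, {"description": "Masala dosa"}]],
}

items = to_gallery_items(fake_result)
print(items)
```

Each function could then return to_gallery_items(result) directly, and the list of (image path, caption) tuples is what the functions in this article hand to the Gallery component.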

So, here’s the complete code for app.py.

# Import the necessary libraries
import gradio as gr
from chromadb import Client, Settings
from clip_embeddings import ClipEmbeddingsfunction

# Initialize a chromadb client with persistent storage
client = Client(Settings(is_persistent=True, persist_directory="./clip_chroma"))

# Initialize the ClipEmbeddingsfunction
ef = ClipEmbeddingsfunction()

# Function to retrieve images from a text query
def retrieve_image_from_query(query: str):
    # Get the "clip" collection with the specified embedding function
    coll = client.get_collection(name="clip", embedding_function=ef)
    
    # Get the text embeddings for the input query
    emb = ef.get_text_embeddings(text=query)
    emb = [float(i) for i in emb]
    
    # Query the collection for similar documents
    result = coll.query(
        query_embeddings=emb,
        include=["documents", "metadatas"],
        n_results=4
    )
    
    # Extract documents and their metadata
    docs = result['documents'][0]
    descs = result["metadatas"][0]
    list_of_docs = []
    
    # Combine documents and descriptions into a list
    for doc, desc in zip(docs, descs):
        list_of_docs.append((doc, list(desc.values())[0]))
    
    return list_of_docs

# Function to retrieve images from an uploaded image
def retrieve_image_from_image(image):
    # Get the "clip" collection with the specified embedding function
    coll = client.get_collection(name="clip", embedding_function=ef)
    
    # Get the filename of the uploaded image
    image = image.name
    
    # Query the collection with the image filename
    result = coll.query(
        query_texts=image,
        include=["documents", "metadatas"],
        n_results=4
    )
    
    # Extract documents and their metadata
    docs = result['documents'][0]
    descs = result["metadatas"][0]
    list_of_docs = []
    
    # Combine documents and descriptions into a list
    for doc, desc in zip(docs, descs):
        list_of_docs.append((doc, list(desc.values())[0]))
    
    return list_of_docs

# Function to display an image
def show_img(image):
    return image.name

# Create interface using Blocks
with gr.Blocks() as demo:
    with gr.Row():
        with gr.Column():
            # Text input for query
            query = gr.Textbox(placeholder="Enter query")
            gr.HTML("OR")
            # Image input through file upload
            photo = gr.Image()
            button = gr.UploadButton(label="Upload file", file_types=["image"])
        with gr.Column():
            # Display a gallery of images
            gallery = gr.Gallery().style(
                object_fit='contain',
                height='auto',
                preview=True
            )

    # Define the input and output for the query submission
    query.submit(
        fn=retrieve_image_from_query,
        inputs=[query],
        outputs=[gallery]
    )
    
    # Define the input and output for image upload
    button.upload(
        fn=show_img,
        inputs=[button],
        outputs=[photo]).\
        then(
            fn=retrieve_image_from_image,
            inputs=[button],
            outputs=[gallery]
        )

# Launch the Gradio interface if the script is run as the main program
if __name__ == "__main__":
    demo.launch()

Now, launch the app by running gradio app.py in the terminal and visit the local address.

GitHub Repository: https://github.com/sunilkumardash9/multi-modal-search-app

Real-life Use cases

Multi-modal search can have many uses across industries.

E-commerce: Multi-modal search can enhance the customer shopping experience. For example, you can take a photo of a product at a physical store and search for it online to get similar products.
Healthcare: This can help diagnose diseases and find treatments. Doctors could use an image to find clinical research data from a medical database.
Education: Multimodal search-enabled education apps can help students and professors find relevant documents faster. Retrieving texts based on images and vice-versa can save a lot of time.
Customer service: Multimodal search can help streamline searching for relevant answers to customer queries from the knowledge base. These queries may include images or videos of products.

Conclusion

Multi-modal search will be game-changing in the future. Being able to interact in multiple modalities opens up new avenues of growth. This article covered using the Chroma vector database and the multi-modal CLIP model to build a basic search app. As Chroma does not have out-of-the-box support for multi-modal models, we created a custom CLIP embedding class to get embeddings from images and pieced together the different parts to build the food search app.

Key Takeaways

In AI, multi-modality is the ability to work with multiple modes of communication, such as text, image, audio, and video.
CLIP is an image-text model trained on about 400 million image-text pairs, with state-of-the-art zero-shot classification ability.
Vector Databases are purpose-built to store, search, and query high-dimensional vectors.
The engines that empower Vector Stores are ANN algorithms. HNSW is one of the most popular and efficient graph-based ANN algorithms.

Frequently Asked Questions

Q1. What is multi-modal search?

A. Multimodal search is a new approach to search that combines information from multiple modalities, such as text, images, audio, and video, to improve the accuracy and relevance of search results.

Q2. What is multimodal AI?

A. Multimodal AI refers to the Machine Learning models that can process and understand various modalities of data such as image, text, audio, etc.

Q3. What are the different modalities in AI?

A. The modalities most commonly handled by multimodal models are text, image, video, and audio.

Q4. What is Approximate Nearest Neighbour (ANN)?

A. Approximate nearest neighbor (ANN) search is a family of algorithms that find the “n” data points closest to a given point in a vector space, trading a small amount of accuracy for much faster queries than an exact search.

Q5. Why do LLMs need a vector database?

A. LLMs need vector databases to efficiently store and retrieve high-dimensional vector representations of text, which are used for operations such as similarity matching.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Sunil Kumar

Meet your author Sunil kumar Dash, a developer and a writer. Has diverse interests in tech, pop culture, wellness, philosophy and Anime. Exploring underrated music is his hobby. And loves to doom scroll Twitter when bored.
