Search engines like Google Images, Bing Visual Search, and Pinterest Lens make it look effortless: we type a few words or upload a picture, and instantly get back the most relevant similar images from billions of possibilities.
Under the hood, these systems use huge stacks of data and advanced deep learning models to transform both images and text into numerical vectors (called embeddings) that live in the same “semantic space.”
In this article, we’ll build a mini version of that kind of search engine, using a much smaller animal dataset containing images of tigers, lions, elephants, zebras, giraffes, pandas, and penguins.
You can follow the same approach with other datasets like COCO, Unsplash photos, or even your personal image collection.
Our image search engine will:
- Generate a caption for each image using BLIP
- Convert images and text queries into embeddings using CLIP
- Store the embeddings in ChromaDB
- Return the most relevant images for a natural-language query
BLIP is a deep learning model capable of producing textual descriptions for photos (also known as image captioning). If our dataset doesn’t already have a description, BLIP can create one by looking at an image, such as a tiger, and producing something like “a large orange cat with black stripes lying on grass.”
This helps especially where the dataset has no captions or labels of its own, or where we want human-readable text to display alongside each search result.
Read more: Image Captioning Using Deep Learning
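If you want to see captioning in isolation before we wire it into the pipeline, here is a minimal, self-contained sketch using the same BLIP checkpoint we load later; the file name tiger.jpg is a hypothetical placeholder for any local photo.

from PIL import Image
import torch
from transformers import BlipProcessor, BlipForConditionalGeneration

blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.open("tiger.jpg").convert("RGB")  # hypothetical local image file
inputs = blip_processor(image, return_tensors="pt")
with torch.no_grad():
    out = blip_model.generate(**inputs, max_new_tokens=30)
print(blip_processor.decode(out[0], skip_special_tokens=True))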
CLIP, by OpenAI, learns to connect text and images within a shared vector space.
It can:
- Turn an image into an embedding vector
- Turn a piece of text into an embedding vector in the same space
- Compare the two, so that semantically related text and images end up close together

Example: the query “striped horse-like animal” lands near zebra photos, even if the word “zebra” never appears anywhere, as in the sketch below.
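Here is a rough, self-contained sketch of that idea: we score one image against a few candidate descriptions with CLIP and see which one it matches best. The file name zebra.jpg is a hypothetical placeholder; for a zebra photo, the first description should receive the highest score.

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

clip_model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("zebra.jpg").convert("RGB")  # hypothetical local image file
texts = ["striped horse-like animal", "a large wild cat with stripes", "a bird that cannot fly"]

inputs = clip_processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = clip_model(**inputs)

# logits_per_image holds the image-text similarity scores; softmax turns them into probabilities
probs = outputs.logits_per_image.softmax(dim=1)
for text, prob in zip(texts, probs[0].tolist()):
    print(f"{text}: {prob:.3f}")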
We’ll do everything inside Google Colab, so you don’t need any local setup. You can access the notebook from this link: Embedding_Similarity_Animals
1. Install Dependencies
We’ll install PyTorch, Transformers (for BLIP and CLIP), and ChromaDB (a vector database). These are the main dependencies for our mini project.
!pip install transformers torch -q
!pip install chromadb -q
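Colab usually ships with kagglehub preinstalled; if you’re running elsewhere, you may also need it for the dataset download step below:

!pip install kagglehub -q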
2. Download the Dataset
For this demo, we’ll use the Animal Dataset from Kaggle.
import kagglehub
# Download the latest version
path = kagglehub.dataset_download("likhon148/animal-data")
print("Path to dataset files:", path)
Move the downloaded files to the /content directory in Colab:
!mv /root/.cache/kagglehub/datasets/likhon148/animal-data/versions/1 /content/
Check what classes we have:
!ls -l /content/1/animal_data
You’ll see one folder per animal class (elephant, giraffe, lion, panda, penguin, tiger, and zebra).
3. Count Images per Class
Just to get an idea of our dataset.
import os
base_path = "/content/1/animal_data"
for folder in sorted(os.listdir(base_path)):
    folder_path = os.path.join(base_path, folder)
    if os.path.isdir(folder_path):
        count = len([f for f in os.listdir(folder_path) if os.path.isfile(os.path.join(folder_path, f))])
        print(f"{folder}: {count} images")
Output:

4. Load the CLIP Model
We’ll use CLIP for embeddings.
from transformers import CLIPProcessor, CLIPModel
import torch
model_id = "openai/clip-vit-base-patch32"
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id)
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model.to(device)
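As an optional sanity check (assuming the cell above has been run), you can confirm the embedding size; for clip-vit-base-patch32 it should be a 512-dimensional vector:

test_inputs = processor(text=["a tiger"], return_tensors="pt").to(device)
with torch.no_grad():
    test_embedding = model.get_text_features(**test_inputs)
print(test_embedding.shape)  # expected: torch.Size([1, 512])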
5. Load the BLIP Model
BLIP will create a caption for each image.
from transformers import BlipProcessor, BlipForConditionalGeneration
blip_model_id = "Salesforce/blip-image-captioning-base"
caption_processor = BlipProcessor.from_pretrained(blip_model_id)
caption_model = BlipForConditionalGeneration.from_pretrained(blip_model_id).to(device)
6. Gather Image Paths
We’ll gather all image paths from the dataset.
image_paths = []
for root, _, files in os.walk(base_path):
    for f in files:
        if f.lower().endswith((".jpg", ".jpeg", ".png", ".bmp", ".webp")):
            image_paths.append(os.path.join(root, f))
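Before the heavy lifting, it’s worth checking that all seven classes were picked up:

class_folders = {os.path.basename(os.path.dirname(p)) for p in image_paths}
print(f"Found {len(image_paths)} images across {len(class_folders)} classes: {sorted(class_folders)}")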
7. Generate Captions and Embeddings
For each image, we generate a caption with BLIP, compute a CLIP image embedding, and store both alongside the file path. On a CPU runtime this loop can take a while, so a GPU runtime is recommended.
import pandas as pd
from PIL import Image
records = []
for img_path in image_paths:
    image = Image.open(img_path).convert("RGB")

    # BLIP: Generate caption
    caption_inputs = caption_processor(image, return_tensors="pt").to(device)
    with torch.no_grad():
        out = caption_model.generate(**caption_inputs)
    description = caption_processor.decode(out[0], skip_special_tokens=True)

    # CLIP: Get image embeddings
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        image_features = model.get_image_features(**inputs)
    image_features = image_features.cpu().numpy().flatten().tolist()

    records.append({
        "image_path": img_path,
        "image_description": description,
        "image_embeddings": image_features
    })
df = pd.DataFrame(records)
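A quick peek at the resulting dataframe (the exact captions will vary between runs):

print(df.shape)
print(df[["image_path", "image_description"]].head())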
8. Store Embeddings in ChromaDB
We push the embeddings into ChromaDB; chromadb.Client() gives an in-memory instance, which is fine for a demo.
import chromadb
client = chromadb.Client()
collection = client.create_collection(name="animal_images")
for i, row in df.iterrows():
    collection.add(  # add each record to our Chroma collection
        ids=[str(i)],
        documents=[row["image_description"]],
        metadatas=[{"image_path": row["image_path"]}],
        embeddings=[row["image_embeddings"]]
    )
print("✅ All images stored in Chroma")
9. Create a Search Function
Given a text query, we embed it with CLIP’s text encoder, retrieve the closest image embeddings from Chroma, and display the matching images:
import matplotlib.pyplot as plt
def search_images(query, top_k=5):
    inputs = processor(text=[query], return_tensors="pt", truncation=True).to(device)
    with torch.no_grad():
        text_embedding = model.get_text_features(**inputs)
    text_embedding = text_embedding.cpu().numpy().flatten().tolist()

    results = collection.query(
        query_embeddings=[text_embedding],
        n_results=top_k
    )

    print("Top results for:", query)
    for i, meta in enumerate(results["metadatas"][0]):
        img_path = meta["image_path"]
        print(f"{i+1}. {img_path} ({results['documents'][0][i]})")
        img = Image.open(img_path)
        plt.imshow(img)
        plt.axis("off")
        plt.show()

    return results
Try some queries:
search_images("a large wild cat with stripes")

search_images("predator with a mane")

search_images("striped horse-like animal")

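One detail worth experimenting with: by default, a Chroma collection compares embeddings with (squared) L2 distance, while CLIP embeddings are typically compared with cosine similarity. In recent Chroma versions you can request cosine distance when creating the collection; the sketch below is an optional variant (the collection name is arbitrary), and you would re-run the insertion loop against it.

# Optional variant: a collection configured for cosine distance (supported in recent Chroma versions)
cosine_collection = client.create_collection(
    name="animal_images_cosine",
    metadata={"hnsw:space": "cosine"}  # default is "l2"; "ip" is also available
)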
Remember, this search engine would be more effective with a much larger dataset, and richer descriptions for each image would produce better embeddings within our unified representation space.
In summary, creating a miniature image search engine using vector representations of text and images offers exciting opportunities for enhancing image retrieval. By utilising BLIP for captioning and CLIP for embedding, we can build a versatile tool that adapts to various datasets, from personal photos to specialised collections.
Looking ahead, features like image-to-image search can further enrich user experience, allowing for easy discovery of visually similar images. Additionally, leveraging larger CLIP models and fine-tuning them on specific datasets can significantly boost search accuracy.
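As a rough illustration of what image-to-image search could look like with the pieces we already have (the function name and example path below are hypothetical), you can embed a query image with the same CLIP image encoder and query the same collection; note that if the query image is part of the index, it will typically come back as the top hit.

def search_similar_images(query_image_path, top_k=5):
    # Embed the query image with the same CLIP image encoder used during indexing
    image = Image.open(query_image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to(device)
    with torch.no_grad():
        query_embedding = model.get_image_features(**inputs)
    query_embedding = query_embedding.cpu().numpy().flatten().tolist()

    # Nearest neighbours in the collection are the most similar images
    return collection.query(query_embeddings=[query_embedding], n_results=top_k)

# Hypothetical usage: pass any image path from the dataset
# results = search_similar_images("/content/1/animal_data/<class_folder>/<image_file>")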
This project not only serves as a solid foundation for AI-driven image search but also invites further exploration and innovation. Embrace the potential of this technology, and transform the way we engage with images.
Frequently Asked Questions

Q1. What does BLIP do in this project?
A. BLIP generates captions for images, creating textual descriptions that can be embedded and compared with search queries. This is useful when the dataset doesn’t already have labels.
Q2. What is CLIP’s role?
A. CLIP converts both images and text into embeddings within the same vector space, allowing direct comparison between them to find semantic matches.
Q3. Why do we need ChromaDB?
A. ChromaDB stores the embeddings and retrieves the most relevant images by finding the closest matches to a search query’s embedding.