Introducing Moondream2: A Tiny Vision-Language Model

Vikas Verma 30 Mar, 2024 • 7 min read

Vision-language models can process and understand visual and textual input together. They combine techniques from computer vision and natural language processing to understand images and generate text based on the image content and a language instruction.

Many large vision-language models are available, such as OpenAI’s GPT-4V, Salesforce’s BLIP-2, MiniGPT-4, and LLaVA. They perform image-to-text generation tasks like image captioning, visual question answering, visual reasoning, and text recognition. But like other large language models, they require heavy computational resources and have slower inference speed (lower throughput).

On the other hand, Small Language Models (SLMs) use less memory and processing power, which makes them ideal for devices with limited resources. They are generally trained on much smaller and more specialized datasets. In this article, we will explore Moondream2, a small vision-language model, along with its components, capabilities, and limitations.

Learning Objectives

  • Understand the need for small language models in multimodal settings.
  • Explore Moondream2 and its components.
  • Gain hands-on exposure to implementing Moondream2 using Python.
  • Learn about the limitations and performance of Moondream2 on various benchmarks.

This article was published as a part of the Data Science Blogathon.

What is Moondream2?

Moondream2 is an open-source, tiny vision-language model that can run on low-resource devices. It is a 1.86 billion parameter model initialized with weights from SigLIP and Phi-1.5. It is good at answering questions about images, generating captions for them, and performing various other vision-language tasks.
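To get a rough sense of why a model of this size fits on modest hardware, the weight memory can be estimated from the parameter count. The sketch below is back-of-the-envelope arithmetic only: the helper name is made up for illustration, and it counts the weights alone, ignoring activations, framework overhead, and any temporary copies made while loading.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9


MOONDREAM2_PARAMS = 1.86e9  # parameter count quoted in this article

# Compare full precision (4 bytes per weight) with half precision (2 bytes)
for name, nbytes in [("float32", 4), ("float16", 2)]:
    gb = weight_memory_gb(MOONDREAM2_PARAMS, nbytes)
    print(f"{name}: about {gb:.2f} GB for the weights")
```

Actual memory use while loading and running the model will be higher than these figures, since they cover the weights only.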

Components of Moondream2

Moondream2 has two major components:

  • SigLIP
  • Phi-1.5


SigLIP (Sigmoid Loss for Language-Image Pre-training) is similar to CLIP (Contrastive Language–Image Pre-training), but it replaces the softmax-based contrastive loss used in CLIP with a simple pairwise sigmoid loss. The sigmoid loss operates on each image-text pair independently, so it does not need a global view of the pairwise similarities across the whole batch for normalization. This makes it easier to scale up the batch size while also improving performance at smaller batch sizes, and it leads to better results on zero-shot classification and image-text retrieval tasks.
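To make the pairwise idea concrete, here is a minimal pure-Python sketch of a sigmoid loss in the style described in the SigLIP paper. The function names are illustrative, and the temperature and bias defaults (10 and -10) are assumed starting values; in SigLIP they are learnable parameters. This is not the code used to train Moondream2.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(image_embs, text_embs, temperature=10.0, bias=-10.0):
    """Treat every (image, text) pair as its own binary classification.

    Matching pairs (i == j) get label +1, all other pairs get label -1.
    No softmax over the batch is needed, so each term is independent.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    total = 0.0
    for i, img in enumerate(imgs):
        for j, txt in enumerate(txts):
            similarity = sum(a * b for a, b in zip(img, txt))
            logit = temperature * similarity + bias
            label = 1.0 if i == j else -1.0
            total += log_sigmoid(label * logit)
    return -total / len(imgs)


# Toy batch: two images and two captions with 2-D embeddings
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = pairwise_sigmoid_loss(images, [[1.0, 0.0], [0.0, 1.0]])
shuffled = pairwise_sigmoid_loss(images, [[0.0, 1.0], [1.0, 0.0]])
print("aligned captions:", aligned)
print("shuffled captions:", shuffled)
```

The loss is lower when each caption sits next to its own image than when the captions are swapped, which is the behavior the training objective rewards.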


Phi-1.5 is a transformer-based small language model with 1.3 billion parameters. It was introduced by Microsoft researchers in the paper “Textbooks Are All You Need II: phi-1.5 technical report” as the successor to Phi-1. The model performs well across benchmarks covering common sense reasoning, multi-step reasoning, language comprehension, and knowledge understanding, with results reported as comparable to models about 5x larger. Phi-1.5 was trained on a dataset of 30 billion tokens, which included 7 billion tokens from the training data of Phi-1 and approximately 20 billion tokens generated synthetically with GPT-3.5.

Implementation of Moondream2

Let us now see the Python implementation of moondream2 using transformers.


Before using the model, we need to install transformers, timm (PyTorch Image Models), and einops (Einstein Operations). The examples below also use Pillow to open images.

pip install transformers timm einops pillow

Now let’s load the tokenizer and model using the AutoTokenizer and AutoModelForCausalLM classes from transformers. Since the model is updated regularly, it’s recommended to pin the model version to a specific release, as shown below.

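from transformers import AutoModelForCausalLM, AutoTokenizer

# Pin the model to a specific release so later updates don't change behavior
model_id = "vikhyatk/moondream2"
revision = "2024-03-13"

tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
# trust_remote_code=True is needed because the model ships its own modeling code
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, revision=revision
)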

Note: When running the model on a GPU, you can enable Flash Attention on the text model by passing attn_implementation="flash_attention_2" when instantiating the model.
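A hedged sketch of what GPU loading might look like follows. The helper names build_load_kwargs and load_moondream are made up for this example. It assumes a CUDA device and an installed flash-attn package, and it loads the weights in half precision, which Flash Attention generally requires. Nothing is downloaded until load_moondream is actually called.

```python
def build_load_kwargs(revision, use_gpu=False):
    """Collect keyword arguments for from_pretrained (hypothetical helper)."""
    kwargs = {"trust_remote_code": True, "revision": revision}
    if use_gpu:
        # Assumption: flash-attn is installed and the GPU supports it
        kwargs["attn_implementation"] = "flash_attention_2"
    return kwargs


def load_moondream(model_id="vikhyatk/moondream2", revision="2024-03-13", use_gpu=False):
    """Load the model and tokenizer, optionally on the GPU in half precision."""
    # Imported here so the helper above can be used without these packages
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = build_load_kwargs(revision, use_gpu)
    if use_gpu:
        kwargs["torch_dtype"] = torch.float16
    model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
    if use_gpu:
        model = model.to("cuda")
    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    return model, tokenizer


print(build_load_kwargs("2024-03-13", use_gpu=True))
```

On a CPU-only machine, calling load_moondream() with the defaults falls back to the plain loading path shown earlier.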

Now let’s test the model for various vision-language tasks.

1. Image Captioning (Image Description)

As the name suggests, it is the task of describing the content of an image in words. Let’s see with an example.

from PIL import Image

# Open the image and encode it once with the vision encoder
image = Image.open('busy street.jpg')
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "Describe this image in detail", tokenizer)

class color:
   BLUE = '\033[94m'
   BOLD = '\033[1m'
   END = '\033[0m'
print(color.BOLD+color.BLUE+"Input:"+color.END, "Describe this image in detail")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)



So, the model generates a detailed description of the image by identifying the objects (such as clock tower, buildings, buses, people, etc.) and their activities.

Moondream2 can also generate personalized image-to-text descriptions, as shown in the example below.

image = Image.open('cat and dog.jpg')
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "Write a conversation between the two", tokenizer)

print(color.BOLD+color.BLUE+"Input:"+color.END, "Write a conversation between the two")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)



2. Visual Question-Answering (Visual Conversation)

VQA (Visual Question Answering) is about answering open-ended questions about an image. We pass in the image and the question as input to the model.

image = Image.open('girl and cats.jpg')
enc_image = model.encode_image(image)
answer1 = model.answer_question(enc_image, "How many cats is the girl holding?", tokenizer)
answer2 = model.answer_question(enc_image, "What is their color?", tokenizer)

print(color.BOLD+color.BLUE+"Question 1:"+color.END, "How many cats is the girl holding?")
print(color.BOLD+color.BLUE+"Answer 1:"+color.END, answer1)
print(color.BOLD+color.BLUE+"Question 2:"+color.END, "What is their color?")
print(color.BOLD+color.BLUE+"Answer 2:"+color.END, answer2)



The model correctly answers the above two questions regarding the image.
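When several questions concern the same picture, the image only needs to be encoded once. The helper below is a small sketch of that pattern. The name ask_about_image is made up here, and the _StubModel class is only a stand-in exposing the same two methods (encode_image and answer_question) so the snippet runs without downloading any weights. With the real model and tokenizer loaded as above, pass them in place of the stub and None.

```python
def ask_about_image(model, tokenizer, image, questions):
    """Encode the image once, then answer each question against that encoding."""
    enc_image = model.encode_image(image)
    return {q: model.answer_question(enc_image, q, tokenizer) for q in questions}


class _StubModel:
    """Stand-in for demonstration only; it does not look at the image."""

    def __init__(self):
        self.encode_calls = 0

    def encode_image(self, image):
        self.encode_calls += 1
        return ("encoded", image)

    def answer_question(self, enc_image, question, tokenizer):
        return f"(stub answer to: {question})"


stub = _StubModel()
answers = ask_about_image(
    stub,
    None,
    "girl and cats.jpg",
    ["How many cats is the girl holding?", "What is their color?"],
)
for question, answer in answers.items():
    print(question, "->", answer)
```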

3. Visual story-telling/poem-writing

This task involves telling a story or writing a poem based on an image. For example:

image = Image.open('beach sunset.jpg')
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "Write a beautiful poem about this image", tokenizer)

print(color.BOLD+color.BLUE+"Input:"+color.END, "Write a beautiful poem about this image")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)



The model writes a beautiful poem as per the contents of the input image.

4. Visual Knowledge Reasoning

Visual knowledge reasoning involves integrating external knowledge and facts, extending beyond the visible content, to address questions effectively.

image = Image.open('the great wall of China.jpg')
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "Tell about the history of this place", tokenizer)

print(color.BOLD+color.BLUE+"Input:"+color.END, "Tell about the history of this place")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)



The model identifies the image as the Great Wall of China and tells its history.

5. Visual Commonsense Reasoning

This task involves answering questions by leveraging common knowledge and a contextual understanding of the visual world evoked by the image. For example:

image = Image.open('man and dog.jpg')
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "what does the man feel and why?", tokenizer)

print(color.BOLD+color.BLUE+"Input:"+color.END, "what does the man feel and why?")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)



6. Text Recognition

Image text recognition refers to the process of automatically identifying and extracting text content from images, like OCR.

image = Image.open('written quote.jpg')
enc_image = model.encode_image(image)
output = model.answer_question(enc_image, "what's written on this piece of paper?", tokenizer)

print(color.BOLD+color.BLUE+"Input:"+color.END, "what's written on this piece of paper?")
print(color.BOLD+color.BLUE+"Response:"+color.END, output)



Benchmark Results

Having seen the model implementation, now let’s look at the model performance on various standard benchmarks such as VQAv2, GQA, TextVQA, TallyQA, etc.

[Figure: Moondream2 benchmark results]

Limitations of Moondream2

Moondream2 is specifically designed to answer questions about images. It has the following limitations:

  • It may struggle with theoretical or abstract questions that demand multi-step reasoning, such as “Why would a cat do that?”.
  • Because images are downsampled to 378×378, the model may find it challenging to answer queries about very minute details within the image.
  • It has limited ability to perform OCR on images containing textual content.
  • It may struggle with accurately counting items beyond two or three.
  • The model may produce offensive, inappropriate, or hurtful content if prompted to do so.
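The effect of the 378×378 input size on fine details can be estimated with simple arithmetic. The sketch below assumes the whole image is resized to 378 pixels per side; the exact preprocessing is an assumption, and the function name is illustrative.

```python
TARGET_SIDE = 378  # input resolution stated in this article


def size_after_resize(extent_px, original_side_px, target_side_px=TARGET_SIDE):
    """Extent in pixels of a feature along one axis after the image is resized."""
    return extent_px * target_side_px / original_side_px


# Example: handwriting 40 px tall in a photo that is 3024 px high
print(size_after_resize(40, 3024))
# The same 40 px measured along a 4032 px wide axis
print(size_after_resize(40, 4032))
```

A feature that shrinks to only a few pixels is unlikely to be legible to the model, so cropping the image to the region of interest before asking about small details may help.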


Conclusion

This article explored Moondream2, a compact vision-language model tailored for resource-constrained devices. We looked at its components and demonstrated it on various image-to-text tasks that show its utility in real-world applications. However, its limitations, such as difficulty with abstract queries and limited OCR capabilities, underscore the need for continued refinement. Nevertheless, Moondream2 is a promising step toward efficient multimodal understanding and generation, offering practical solutions across diverse domains.

Key Takeaways

  • Moondream2 is a small, open-source vision-language model designed for devices with limited resources.
  • Moondream2 can be run in Python using transformers for tasks like image captioning, visual question answering, storytelling, and more.
  • Moondream2’s compact size makes it suitable for deployment in retail analytics, robotics, security, and other domains with limited resources.
  • It is a promising avenue for efficient multimodal understanding and generation, offering practical solutions in various industries.

Frequently Asked Questions

Q1. What are the benefits of using small language models?

A. Small language models offer several benefits like faster inference, lower resource requirements, cost-effectiveness, scalability, domain-specific applications, interpretability, and ease of deployment.

Q2. What are the major components of moondream2?

A. Moondream2 has two major components: SigLIP and Phi-1.5. SigLIP is a visual encoder, similar to the CLIP model, that can be used for zero-shot image classification. Phi-1.5 is part of the Phi series of small language models introduced by Microsoft and has 1.3 billion parameters.

Q3. How many parameters does moondream2 have and how much space does it consume while loading?

A. Moondream2 has 1.86 billion parameters, and it consumes around 9-10 GB of memory while loading.

Q4. What are real-world applications of moondream2?

A. Due to its compact size, this model can operate across devices with limited resources. For instance, it can be deployed in retail settings to gather data and analyze customer behavior. Similarly, it can be used in drone and robotics applications to survey environments and identify significant activities or objects. Additionally, it serves security purposes by analyzing videos and images to detect and prevent incidents.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
