Step by Step Guide to Build Image Caption Generator using Deep Learning

Shikha Gupta 20 May, 2024
12 min read

Whenever an image appears in front of us, our brain can annotate or label it. But what about computers? How can a machine process and label an image with a highly relevant and accurate caption? It seemed quite impossible a few years back. Still, with the enhancement of Computer Vision and Deep learning algorithms, the availability of relevant datasets, and AI models, it becomes easier to build a relevant caption generator for an image. Even Caption generation is growing worldwide, and many data annotation firms are earning billions. In this guide, we will build one such annotation tool capable of generating relevant captions for the image with the help of datasets. Basic knowledge of two Deep learning techniques, including LSTM and CNN, is required. In this, Article We are talking about how you can convert image to caption and also get some insights on CNN, LSTM . These all information is provided by you so at the end of this article you will get to know about the Image to Caption and full understanding on image caption generator using deep learning.

This article was published as a part of the Data Science Blogathon

What is Image to Caption Generator?

Image caption generator is a process of recognizing the context of an image and annotating it with relevant captions using deep learning and computer vision. It includes labeling an image with English keywords with the help of datasets provided during model training. The imagenet dataset trains the CNN model called Xception. Xception is responsible for image feature extraction. These extracted features will be fed to the LSTM model, which generates the image caption.

What is CNN?

CNN is a subfield of Deep learning and specialized deep neural networks used to recognize and classify images. It processes the data represented as 2D matrix-like images. CNN can deal with scaled, translated, and rotated imagery. It analyzes the visual imagery by scanning them from left to right and top to bottom and extracting relevant features. Finally, it combines all the parts for image classification.

What is LSTM?

Being a type of RNN (recurrent neural network), LSTM (Long short-term memory) is capable of working with sequence prediction problems. It is mostly used for the next word prediction purposes, as in Google search our system is showing the next word based on the previous text. Throughout the processing of inputs, LSTM is used to carry out the relevant information and to discard non-relevant information.

To build an image caption generator model we have to merge CNN with LSTM. We can drive that:

Image Caption Generator Model (CNN-RNN model) = CNN + LSTM

  • CNN – To extract features from the image. A pre-trained model called Xception is used for this.
  • LSTM – To generate a description from the extracted information of the image.

Dataset for Image Caption Generator

The Flickr_8K dataset represents the model training of image caption generators. The dataset is downloaded directly from the below links. The downloading process takes some time due to the dataset’s large size(1GB). In the image below, you can check all the files in the Flickr_8k_text folder. The most important file is Flickr 8k.token, which stores all the image names with captions. 8091 images are stored inside the Flicker8k_Dataset folder and the text files with captions of images are stored in the Flickr_8k_text folder.


We will use Jupyter notebooks to run our caption generator. You can download Jupyter notebooks from here. A good understanding of Python, Deep learning, and NLP is required for the implementation. If you’re not familiar with these techniques. Please refer to the link below first.

Install below libraries, to begin with, the project:

pip install TensorFlow
pip install Keras
pip install pillow
pip install NumPy
Pip install tqdm
Pip install jupyterlab

Building the Image Caption Generator

Let’s start by opening the jupyter notebook to create our Python3 project. Name your python3 file with train_caption_generate.ipynb

Import all the Required Packages

import numpy as np
from PIL import Image
import os
import string
from pickle import dump
from pickle import load
from keras.applications.xception import Xception #to get pre-trained model Xception
from keras.applications.xception import preprocess_input
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.preprocessing.text import Tokenizer #for text tokenization
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers.merge import add
from keras.models import Model, load_model
from keras.layers import Input, Dense#Keras to build our CNN and LSTM
from keras.layers import LSTM, Embedding, Dropout
from tqdm import tqdm_notebook as tqdm #to check loop progress

Perform Data Cleaning

As we see all image captions are available in the Flickr 8k.token file of the Flickr_8k_text folder. If you analyze this file carefully, you can drive the format of image storing, each image and caption separated by a new line and carry 5 captions numbered from 0 to 4 along with.

Now we are going to define 5 functions for cleaning:

  1. load_fp( filename ) – To load the document file and read the contents of the file into a string.
  2. mg_capt( filename ) – To create a description dictionary that will map images with all 5 captions.
  3. txt_cleaning( descriptions) – This method is used to clean the data by taking all descriptions as input. While dealing with textual data we need to perform several types of cleaning including uppercase to lowercase conversion, punctuation removal, and removal of the number containing words.
  4. txt_vocab( descriptions ) – This is used to create a vocabulary from all the unique words extracted out from descriptions.
  5. save_descriptions( descriptions, filename ) – This function is used to store all the preprocessed descriptions into a file.
  6. Screenshot of Description Dictionary

Code :

# Load the document file into memory
def load_fp(filename):
  # Open file to read
   file = open(filename, 'r')
   text =
return text
# get all images with their captions
def img_capt(filename):
   file = load_doc(filename)
   captions = file.split('n')
   descriptions ={}
for caption in captions[:-1]:
       img, caption = caption.split('t')
if img[:-2] not in descriptions:
           descriptions[img[:-2]] = [ caption ]
return descriptions
#Data cleaning function will convert all upper case alphabets to lowercase, removing punctuations and words containing numbers
def txt_clean(captions):
   table = str.maketrans('','',string.punctuation)
for img,caps in captions.items():
for i,img_caption in enumerate(caps):
           img_caption.replace("-"," ")
           descp = img_caption.split()
          #uppercase to lowercase
           descp = [wrd.lower() for wrd in descp]
          #remove punctuation from each token
           descp = [wrd.translate(table) for wrd in descp]
          #remove hanging 's and a
           descp = [wrd for wrd in descp if(len(wrd)>1)]
          #remove words containing numbers with them
           descp = [wrd for wrd in descp if(wrd.isalpha())]
          #converting back to string
           img_caption = ' '.join(desc)
           captions[img][i]= img_caption
return captions
def txt_vocab(descriptions):
  # To build vocab of all unique words
   vocab = set()
for key in descriptions.keys():
       [vocab.update(d.split()) for d in descriptions[key]]
return vocab
#To save all descriptions in one file
def save_descriptions(descriptions, filename):
   lines = list()
for key, desc_list in descriptions.items():
for desc in desc_list:
           lines.append(key + 't' + desc )
   data = "n".join(lines)
   file = open(filename,"w")
# Set these path according to project folder in you system, like i create a folder with my name shikha inside D-drive
dataset_text = "D:shikhaProject - Image Caption GeneratorFlickr_8k_text"
dataset_images = "D:shikhaProject - Image Caption GeneratorFlicker8k_Dataset"
#to prepare our text data
filename = dataset_text + "/" + "Flickr8k.token.txt"
#loading the file that contains all data
#map them into descriptions dictionary 
descriptions = img_capt(filename)
print("Length of descriptions =" ,len(descriptions))
#cleaning the descriptions
clean_descriptions = txt_clean(descriptions)
#to build vocabulary
vocabulary = txt_vocab(clean_descriptions)
print("Length of vocabulary = ", len(vocabulary))
#saving all descriptions in one file
save_descriptions(clean_descriptions, "descriptions.txt")

Extract the Feature Vector

Now we are going to use the pre-trained model called Xception which is already trained with large datasets to extract the features from these models. Xception was trained on an imagenet dataset with 1000 different classes to classify the images. We can use keras.applications to import this model directly. We need to do a few changes to the Xception model to integrate it with our model. The xception model takes 299*299*3 image size as input so we need to delete the last classification layer and extract out the 2048 feature vectors.

model = Xception( include_top=False, pooling=’avg’ )

Extract_features() function is used to extract these features for all images. At the end we will put the features dictionary into a pickle file named “features.p”.

def extract_features(directory):
       model = Xception( include_top=False, pooling='avg' )
       features = {}
for pic in tqdm(os.listdir(dirc)):
           file = dirc + "/" + pic
           image =
           image = image.resize((299,299))
           image = np.expand_dims(image, axis=0)
          #image = preprocess_input(image)
           image = image/127.5
           image = image - 1.0
           feature = model.predict(image)
           features[img] = feature
return features
#2048 feature vector
features = extract_features(dataset_images)
dump(features, open("features.p","wb"))
#to directly load the features from the pickle file.
features = load(open("features.p","rb"))

Loading dataset for model training

A file named “Flickr_8k.trainImages.txt” is present in our Flickr_8k_test folder. This file carries a list of 6000 image names that are used for the sake of training.

Functions required to load the training datasets:

  • load_photos( fname ) – This function will take a file name as a parameter and return the list of image names by loading the text file into a string.
  • load_clean_descriptions( fname, image) – This function stores the captions for every image from the list of photos to a dictionary. For the ease of the LSTM model in identifying the beginning and ending of a caption, we append the and identifier with each caption.
  • load_features(photos) – The extracted feature vectors from the Xception model and the dictionary for photos are returned by this function.

Code :

#load the data
def load_photos(filename):
   file = load_doc(filename)
   photos = file.split("n")[:-1]
return photos
def load_clean_descriptions(filename, photos):
  #loading clean_descriptions
   file = load_doc(filename)
   descriptions = {}
for line in file.split("n"):
       words = line.split()
if len(words)<1 :
       image, image_caption = words[0], words[1:]
if image in photos:
if image not in descriptions:
               descriptions[image] = []
           desc = ' ' + " ".join(image_caption) + ' '
return descriptions
def load_features(photos):
  #loading all features
   all_features = load(open("features.p","rb"))
  #selecting only needed features
   features = {k:all_features[k] for k in photos}
return features
filename = dataset_text + "/" + "Flickr_8k.trainImages.txt"
#train = loading_data(filename)
train_imgs = load_photos(filename)
train_descriptions = load_clean_descriptions("descriptions.txt", train_imgs)
train_features = load_features(train_imgs)

Tokenizing the Vocabulary

Machines are not familiar with complex English words so, to process model’sdata they need a simple numerical representation. That’s why we map every word of the vocabulary with a separate unique index value. An in-built tokenizer function is present in the Keras library to create tokens from our vocabulary. We can save them to a pickle file named “tokenizer.p”.


#convert dictionary to clear list of descriptions
def dict_to_list(descriptions):
   all_desc = []
for key in descriptions.keys():
       [all_desc.append(d) for d in descriptions[key]]
return all_desc
#creating tokenizer class
#this will vectorise text corpus
#each integer will represent token in dictionary
from keras.preprocessing.text import Tokenizer
def create_tokenizer(descriptions):
   desc_list = dict_to_list(descriptions)
   tokenizer = Tokenizer()
return tokenizer
# give each word an index, and store that into tokenizer.p pickle file
tokenizer = create_tokenizer(train_descriptions)
dump(tokenizer, open('tokenizer.p', 'wb'))
vocab_size = len(tokenizer.word_index) + 1
Vocab_size #The size of our vocabulary is 7577 words.
#calculate maximum length of descriptions to decide the model structure parameters.
def max_length(descriptions):
   desc_list = dict_to_list(descriptions)
return max(len(d.split()) for d in desc_list)
max_length = max_length(descriptions)
Max_length #Max_length of description is 32

Create a Data generator

For training the model as a supervised learning task we need to feed it with input and output sequences. Total 6000 images with 2048 length feature vector and the caption represented as numbers are present in our training sets. It’s not possible to hold such a large amount of data into memory so we are going to use a generator method that will yield batches.

For example: [x1, x2] are the input of our model, and y act as output, where x1 shows 2048 feature vectors of the image, x2 shows the input text sequence and y shows the output text sequence that is predicted by the model.

x1(feature vector)x2(Text sequence)y(word to predict)
featurestart, twodogs
featurestart, two, dogsdrink
featurestart, two, dogs, drinkwater
featurestart, two, dogs, drink, waterend
#data generator, used by model.fit_generator()
def data_generator(descriptions, features, tokenizer, max_length):
while 1:
for key, description_list in descriptions.items():
          #retrieve photo features
           feature = features[key][0]
           inp_image, inp_seq, op_word = create_sequences(tokenizer, max_length, description_list, feature)
           yield [[inp_image, inp_sequence], op_word]
def create_sequences(tokenizer, max_length, desc_list, feature):
   x_1, x_2, y = list(), list(), list()
  # move through each description for the image
for desc in desc_list:
      # encode the sequence
       seq = tokenizer.texts_to_sequences([desc])[0]
      # divide one sequence into various X,y pairs
for i in range(1, len(seq)):
          # divide into input and output pair
           in_seq, out_seq = seq[:i], seq[i]
          # pad input sequence
           in_seq = pad_sequences([in_seq], maxlen=max_length)[0]
          # encode output sequence
           out_seq = to_categorical([out_seq], num_classes=vocab_size)[0]
          # store
return np.array(X_1), np.array(X_2), np.array(y)
#To check the shape of the input and output for your model
[a,b],c = next(data_generator(train_descriptions, features, tokenizer, max_length))
a.shape, b.shape, c.shape
#((47, 2048), (47, 32), (47, 7577))

Define the CNN-RNN model

From the Functional API, we will use the Keras Model in order to define the structure of the model. It includes:

  • Feature Extractor –With a dense layer, it will extract the feature from the images of size 2048 and we will decrease the dimensions to 256 nodes.
  • Sequence Processor – Followed by the LSTM layer, the textual input is handled by this embedded layer.
  • Decoder – We will merge the output of the above two layers and process the dense layer to make the final prediction.
from keras.utils import plot_model
# define the captioning model
def define_model(vocab_size, max_length):
  # features from the CNN model compressed from 2048 to 256 nodes
   inputs1 = Input(shape=(2048,))
   fe1 = Dropout(0.5)(inputs1)
   fe2 = Dense(256, activation='relu')(fe1)
  # LSTM sequence model
   inputs2 = Input(shape=(max_length,))
   se1 = Embedding(vocab_size, 256, mask_zero=True)(inputs2)
   se2 = Dropout(0.5)(se1)
   se3 = LSTM(256)(se2)
  # Merging both models
   decoder1 = add([fe2, se3])
   decoder2 = Dense(256, activation='relu')(decoder1)
   outputs = Dense(vocab_size, activation='softmax')(decoder2)
  # merge it [image, seq] [word]
   model = Model(inputs=[inputs1, inputs2], outputs=outputs)
   model.compile(loss='categorical_crossentropy', optimizer='adam')
  # summarize model
   plot_model(model, to_file='model.png', show_shapes=True)
return model

Training the Image Caption Generator model

We will generate the input and output sequences to train our model with 6000 training images. We create a function named model.fit_generator() to fit the batches to the model. At last, we save the model to our models folder.

# train our model
print('Dataset: ', len(train_imgs))
print('Descriptions: train=', len(train_descriptions))
print('Photos: train=', len(train_features))
print('Vocabulary Size:', vocab_size)
print('Description Length: ', max_length)
model = define_model(vocab_size, max_length)
epochs = 10
steps = len(train_descriptions)
# creating a directory named models to save our models
for i in range(epochs):
   generator = data_generator(train_descriptions, train_features, tokenizer, max_length)
   model.fit_generator(generator, epochs=1, steps_per_epoch= steps, verbose=1)"models/model_" + str(i) + ".h5")

Testing the Image Caption Generator model

After successful model training, our task is to test the model accuracy by inputting test image data. Let’s create a python file named to load the model and generate predictions.

import numpy as np
from PIL import Image
import matplotlib.pyplot as plt
import argparse
ap = argparse.ArgumentParser()
ap.add_argument('-i', '--image', required=True, help="Image Path")
args = vars(ap.parse_args())
img_path = args['image']
def extract_features(filename, model):
           image =
           print("ERROR: Can't open image! Ensure that image path and extension is correct")
       image = image.resize((299,299))
       image = np.array(image)
      # for 4 channels images, we need to convert them into 3 channels
if image.shape[2] == 4:
           image = image[..., :3]
       image = np.expand_dims(image, axis=0)
       image = image/127.5
       image = image - 1.0
       feature = model.predict(image)
return feature
def word_for_id(integer, tokenizer):
for word, index in tokenizer.word_index.items():
if index == integer:
return word
return None
def generate_desc(model, tokenizer, photo, max_length):
   in_text = 'start'
for i in range(max_length):
       sequence = tokenizer.texts_to_sequences([in_text])[0]
       sequence = pad_sequences([sequence], maxlen=max_length)
       pred = model.predict([photo,sequence], verbose=0)
       pred = np.argmax(pred)
       word = word_for_id(pred, tokenizer)
if word is None:
       in_text += ' ' + word
if word == 'end':
return in_text
max_length = 32
tokenizer = load(open("tokenizer.p","rb"))
model = load_model('models/model_9.h5')
xception_model = Xception(include_top=False, pooling="avg")
photo = extract_features(img_path, xception_model)
img =
description = generate_desc(model, tokenizer, photo, max_length)


Image Caption Generator 1
Image Caption Generator 2

End Note

In this guide, we build a deep learning model with the help of CNN and LSTM. We used a very small dataset of 8000 images to train our model, but the business level model used larger datasets of more than 100,000 images for better accuracy. The larger the datasets are higher the accuracy. So, if you want to build a more accurate caption generator you can try this model with large datasets. So, we hope you get the full understanding on how to convert image to caption by learning from this article.

Image caption generator using deep learning

Image caption generators are a fascinating application of deep learning that bridge the gap between computer vision and natural language processing (NLP). Here’s a breakdown of how they work:

CNNs and LSTMs

  • Convolutional Neural Networks (CNNs): These are the workhorses for image recognition. They take an image as input and process it through multiple layers, extracting features like shapes, edges, and colors.
  • Long Short-Term Memory (LSTM) networks: These are a special kind of Recurrent Neural Networks (RNNs) adept at handling sequential data like sentences. In image captioning, LSTMs take the image features extracted by the CNN and use them to predict a word at a time, forming a caption that describes the image content.

The Process:

  1. Data Preparation: A large dataset of images with corresponding captions is required. This trains the model to recognize the relationship between visual features and appropriate language descriptions.
  2. Feature Extraction: The CNN takes an image and extracts a set of features that represent its visual content.
  3. Sequence Prediction: The LSTM network receives the extracted features and starts generating a caption word by word. It considers the previously generated words and the image features to predict the next word in the sequence. This continues until a complete caption is formed.


Image caption generators have a range of applications, including:

  • Improving image accessibility: By generating captions for images, visually impaired users can understand the content.
  • Automatic image tagging: Captions can be used to automatically tag images for better organization and search.
  • Social media image descriptions: Captions can be generated for social media posts to improve engagement.

Frequently Asked Questions

Q1. What is an image caption generator?

A. An image caption generator is a system that generates textual descriptions or captions for images automatically. It combines computer vision techniques to understand the visual content of an image and natural language processing (NLP) techniques to generate descriptive captions.

Q2. What is image captioning in the project description?

A. In a project description, image captioning refers to generating textual descriptions or captions for images using machine learning and NLP techniques. It involves analyzing the visual features of an image and generating a coherent and relevant caption that describes the image’s content.

Q3. What is the dataset for image caption generation? 

A. The dataset for image caption generation typically consists of pairs of images and corresponding captions. These datasets are manually annotated, where human annotators provide descriptive captions for a given set of images. Commonly used datasets include MSCOCO, Flickr8K, and Flickr30K.

Q4. What is the use of LSTM in image captioning?

A. Long Short-Term Memory (LSTM) is a recurrent neural network (RNN) architecture commonly used in image captioning. LSTM networks can capture long-term dependencies in sequential data, making them suitable for generating coherent and contextually relevant captions by modeling the sequential nature of language.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.

Shikha Gupta 20 May, 2024

Frequently Asked Questions

Lorem ipsum dolor sit amet, consectetur adipiscing elit,

Responses From Readers


Vishnu 02 Jun, 2023

Vishnu kant sharma thank you