Hack Session: Image Captioning using Attention Models

Nov 14, 2019


Auditorium 3

60 minutes

Computer Vision

Image caption generation is the task of generating a descriptive and appropriate sentence for a given image. For humans, the task looks straightforward: summarize the image in a single sentence that captures the interactions between the various components present in it. Replicating this ability in an artificial system, however, is very challenging.

Inspired by the outstanding results of attention mechanisms in machine translation and other seq2seq tasks, there have been a few advancements in computer vision using attention techniques. In this hack session, we incorporate a visual attention mechanism to generate relevant captions from images using a deep learning framework.


Image captioning has advanced considerably, from non-neural approaches that generate caption templates and fill them in using object detections to recurrent neural network-based approaches. Building on the success of attention in encoder-decoder frameworks, we train an attention-based CNN-LSTM network to generate relevant captions for a given image.
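
As a rough illustration of the attention step in such a CNN-LSTM captioner, the sketch below computes additive (Bahdanau-style) soft attention over a handful of image-region feature vectors. It is not the session's code: the function names, the dimensions, and the random weights standing in for learned parameters are all assumptions made for the example, and plain Python lists are used in place of a deep learning framework.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def additive_attention(features, hidden, W_f, W_h, v):
    """Soft attention over image regions (hypothetical helper).

    features: list of region vectors from the CNN encoder
    hidden:   current decoder (LSTM) hidden state
    W_f, W_h: projections into a shared attention space
    v:        vector that turns each projected region into a scalar score
    """
    proj_h = matvec(W_h, hidden)
    scores = []
    for region in features:
        proj_f = matvec(W_f, region)
        energy = [math.tanh(a + b) for a, b in zip(proj_f, proj_h)]
        scores.append(sum(vi * ei for vi, ei in zip(v, energy)))
    weights = softmax(scores)  # one weight per region, summing to 1
    dim = len(features[0])
    # Context vector: attention-weighted average of the region features.
    context = [
        sum(w * region[d] for w, region in zip(weights, features))
        for d in range(dim)
    ]
    return context, weights


def random_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix (assumption)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
NUM_REGIONS, FEAT_DIM, HIDDEN_DIM, ATTN_DIM = 4, 3, 5, 6  # illustrative sizes
features = random_matrix(NUM_REGIONS, FEAT_DIM, rng)
hidden = [rng.uniform(-1, 1) for _ in range(HIDDEN_DIM)]
W_f = random_matrix(ATTN_DIM, FEAT_DIM, rng)
W_h = random_matrix(ATTN_DIM, HIDDEN_DIM, rng)
v = [rng.uniform(-1, 1) for _ in range(ATTN_DIM)]

context, weights = additive_attention(features, hidden, W_f, W_h, v)
print("attention weights:", [round(w, 3) for w in weights])
```

In a trained model the decoder would receive `context` alongside the previous word embedding at every time step, so each generated word can draw on a different part of the image.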

Key Takeaways for the Audience

  • A drill-down into past approaches to image captioning and the intuitive reasons for their failure
  • An understanding of receptive fields in CNNs
  • The motivation for using attention in an encoder-decoder framework
  • Different variations of, and intuitions behind, the attention mechanism in an encoder-decoder framework
  • A code-level understanding of the entire visual attention-based captioning framework
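
To make the receptive-field takeaway concrete, here is a minimal sketch that computes how many input pixels a single output unit "sees" after a stack of convolution or pooling layers. It uses the standard recurrence (the receptive field grows by `(kernel - 1) * jump`, and the jump multiplies by the stride). The function name and the example layer stacks are assumptions chosen for illustration, not taken from the session.

```python
def receptive_field(layers):
    """Return the receptive field size of one output unit.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    Padding and dilation are ignored in this sketch.
    """
    field = 1  # a single input pixel sees only itself
    jump = 1   # distance in input pixels between adjacent units of the current layer
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Three stacked 3x3 convolutions with stride 1 cover a 7x7 input patch.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7

# Adding a 2x2 stride-2 pooling layer makes later layers grow the field faster.
print(receptive_field([(3, 1), (3, 1), (2, 2), (3, 1)]))  # 10
```

This is one way to see why attention over a coarse CNN feature map is meaningful: each spatial location in the final feature map corresponds to a sizeable region of the original image.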



  • Souradip Chakraborty

    Statistical Research Analyst

    Walmart Labs

  • Rajesh Shreedhar Bhat

    Data Scientist

    Walmart Labs

Copyright 2019 Analytics Vidhya. All rights reserved