Performing Speech and Object Recognition using just One Model with MIT’s ML System

Pranav Dar 07 May, 2019 • 2 min read


  • Researchers from MIT have developed a model that combines speech recognition and object detection
  • The model was trained on 400,000 image-caption pairs, and 1,000 random pairs were used as the test set
  • A convolutional neural network (CNN) is at the heart of the model



State-of-the-art deep learning models can detect objects in images and perform speech recognition, but they do so separately, with a different model for each task. If I told you that there’s a way to build one model that combines these two functions, you’d most likely claim that this is what AGI is supposed to be, and that we’re nowhere near that!

Well, I have news for you: MIT researchers have designed such a model. Given an image and an audio caption, the model highlights the regions of the image that are being described, and it does this in near real-time. Granted, the research is still at a very nascent stage and the model can only recognize a few hundred different words and objects, but this will surely get your excitement levels up if you’re into machine learning.

MIT computer scientists have developed a system that learns to identify objects within an image, based on a spoken description of the image.

Source: MIT News

The researchers built on slightly older research in this area. They used the previous study to associate particular words with specific pixel patches. The model was then trained on a classification database on the Mechanical Turk platform, using a total of 400,000 image-caption pairs. A random 1,000 pairs were held out and used as the test set. It won’t surprise seasoned deep learning users to know that a convolutional neural network (CNN) is at the heart of the model.
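To make the hold-out step concrete, here is a minimal sketch of a random train/test split over pair indices. It assumes the 1,000 test pairs are drawn from the same pool of 400,000 pairs (the article does not spell this out), and the function name `hold_out_split` and the fixed seed are purely illustrative, not anything taken from the MIT work.

```python
import random

def hold_out_split(num_pairs, num_test, seed=0):
    """Randomly pick num_test pair indices for testing; the rest go to training.

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    test_ids = set(rng.sample(range(num_pairs), num_test))
    train_ids = [i for i in range(num_pairs) if i not in test_ids]
    return train_ids, sorted(test_ids)

# Assumption: the 1,000 test pairs come out of the 400,000 total.
train_ids, test_ids = hold_out_split(num_pairs=400_000, num_test=1_000)
print(len(train_ids), len(test_ids))  # 399000 1000
```

Holding the test pairs out before training means the reported results reflect image-caption pairs the model has never seen.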

There are two CNNs at play here: one that analyzes images and one that analyzes audio.
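One way to picture how two such networks could link words to pixel patches is sketched below. It assumes each network outputs embedding vectors (one per image patch, one per audio segment) and that a simple dot product scores how well they match. Both the dot-product scoring and the random placeholder embeddings are assumptions for illustration; the article does not describe the model's actual internals.

```python
import random

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def similarity_map(patch_embeddings, segment_embeddings):
    """Return sim[s][p]: the score of audio segment s against image patch p."""
    return [[dot(seg, patch) for patch in patch_embeddings]
            for seg in segment_embeddings]

def best_patch_per_segment(sim):
    """For each audio segment, return the index of the highest-scoring patch."""
    return [max(range(len(row)), key=row.__getitem__) for row in sim]

# Placeholder embeddings standing in for the outputs of the two CNNs.
rng = random.Random(0)
dim = 8
patches = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(16)]  # e.g. a 4x4 grid
segments = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]  # 5 audio segments

sim = similarity_map(patches, segments)
highlights = best_patch_per_segment(sim)
print(highlights)  # one patch index per audio segment
```

In a trained system the embeddings would come from the image and audio networks, so a segment containing a spoken word would score highest against the patch showing the matching object.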

MIT’s team has written a very detailed blog post explaining the intricacies of this technology, and you can check it out here.


Our take on this

Once the researchers perfect (or at least improve) the model, this system has the capability to save hours upon hours of manual effort. And it will do wonders for speech recognition and object detection, won’t it? Imagine the possibilities!

According to the blog post mentioned above, there are approximately 7,000 spoken languages in the world, and only 100 of them have enough data to be used in a speech recognition model. Once this system is fine-tuned and ready for action, this number could rise significantly.


Pranav Dar is a Senior Editor at Analytics Vidhya and a data visualization practitioner who loves reading and delving deeper into the data science and machine learning arts. Always looking for new ways to improve processes using ML and AI.
