Every once in a while, a machine learning framework or library changes the landscape of the field. Today, Facebook open sourced one such framework – DETR, or DEtection TRansformer.
In this article, we’ll quickly understand the concept of object detection and then dive straight into DETR and what it brings to the table.
Object Detection at a Glance
In Computer Vision, object detection is a task where we want our model to distinguish the foreground objects from the background and predict the locations and the categories for the objects present in the image. Current deep learning approaches attempt to solve the task of object detection either as a classification problem or as a regression problem or both.
For example, in the RCNN algorithm, several regions of interest are identified from the input image. Then these regions are classified as either objects or background and finally, a regression model is used to generate the bounding boxes for the identified objects.
The YOLO (You Only Look Once) framework, on the other hand, deals with object detection in a different way. It takes the entire image in a single pass and predicts the bounding box coordinates and class probabilities for these boxes.
To learn more about object detection refer to these articles:
- A Step-by-Step Introduction to the Basic Object Detection Algorithms
- A Practical Guide to Object Detection using the Popular YOLO Framework
Introducing DEtection TRansformer (DETR) by Facebook AI
As you saw in the previous section, the current deep learning algorithms perform object detection in a multi-step manner. They also tend to produce near-duplicate predictions for the same object, which count as false positives unless they are filtered out in a post-processing step. To simplify this pipeline, the researchers at Facebook AI have come up with DETR, an innovative and efficient approach to solve the object detection problem.
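To make the near-duplicate problem concrete, here is a minimal sketch of the kind of post-processing that conventional detectors typically rely on, commonly known as non-maximum suppression. The article above does not name this step, so treat the function names, the (x1, y1, x2, y2) box format, and the 0.5 threshold as my own assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy suppression: keep the best-scoring box, drop boxes that overlap it."""
    # Visit boxes from the highest score to the lowest.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        # Keep a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in kept):
            kept.append(i)
    return kept


# Hypothetical detections: the first two boxes cover the same object.
boxes = [(10, 10, 110, 110), (12, 12, 112, 112), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
```

In this toy example the second box overlaps the first almost entirely, so only the first and third boxes survive. DETR aims to make this kind of hand-written filtering unnecessary.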
This new model is quite simple and, according to its authors, you don’t need any specialized library to use it. DETR treats object detection as a direct set prediction problem with the help of an encoder-decoder architecture based on transformers. By set, I mean the set of bounding boxes. Transformers are a relatively new class of deep learning models that have performed outstandingly in the NLP domain.
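To see what "set prediction" implies, remember that a set has no order, so each predicted box has to be paired with at most one ground-truth box before any loss can be computed. The sketch below is my own simplified illustration, not the DETR implementation: it tries every one-to-one assignment by brute force and scores pairs with a plain L1 distance between boxes. The DETR paper uses the Hungarian algorithm with a richer cost for this matching step; the function names and sample boxes here are hypothetical.

```python
from itertools import permutations


def l1_cost(pred, target):
    """Sum of absolute differences between two boxes of equal length."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_predictions(preds, targets):
    """Find the cheapest one-to-one assignment of targets to predictions.

    Returns (assignment, cost), where assignment[j] is the index of the
    prediction matched to target j. Brute force, so only for tiny inputs.
    """
    best_cost, best_assignment = float("inf"), None
    # Each permutation picks a distinct prediction index for every target.
    for assignment in permutations(range(len(preds)), len(targets)):
        cost = sum(l1_cost(preds[i], targets[j]) for j, i in enumerate(assignment))
        if cost < best_cost:
            best_cost, best_assignment = cost, assignment
    return list(best_assignment), best_cost


# Hypothetical normalized boxes in (cx, cy, w, h) form.
preds = [(0.8, 0.8, 0.2, 0.2), (0.3, 0.3, 0.4, 0.4), (0.5, 0.5, 0.1, 0.1)]
targets = [(0.3, 0.3, 0.4, 0.5), (0.8, 0.8, 0.2, 0.2)]
assignment, cost = match_predictions(preds, targets)
```

Here the first target is paired with the second prediction and the second target with the first prediction. The third prediction stays unmatched, which is the kind of slot a set-based detector would be trained to label as "no object".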
In the results, DETR achieved performance comparable to a well-tuned Faster R-CNN baseline on the COCO object detection dataset. More precisely, DETR demonstrates significantly better performance on large objects. However, it didn’t perform that well on small objects. I am sure the researchers will work that out pretty soon.
Architecture of DETR
The overall DETR architecture is actually pretty easy to understand. It contains three main components:
- a CNN backbone
- an Encoder-Decoder transformer
- a simple feed-forward network
Here, the CNN backbone generates a feature map from the input image. This feature map is then flattened into a sequence of feature vectors, combined with positional encodings, and passed to the Transformer encoder as input. The encoder returns one refined embedding for each position in that sequence.
The Transformer decoder takes N learned, fixed-length embeddings (called object queries in the paper), where N is the maximum number of objects the model assumes an image can contain. With the help of self-attention and encoder-decoder attention, it transforms them into N output embeddings, one per candidate object.
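It can help to follow the tensor shapes through these stages. The helper below is a hypothetical sketch of that bookkeeping only; it does not run a model. The default values (a backbone that downsamples by 32, an embedding size of 256, and 100 decoder slots) are assumptions that I believe match the paper's default setup, and the rounding at image borders is approximate.

```python
def detr_shapes(height, width, stride=32, d_model=256, num_queries=100):
    """Trace the approximate shapes through a DETR-style pipeline.

    stride, d_model and num_queries are assumed defaults, not values
    stated in this article.
    """
    # The backbone shrinks the image by `stride` in each spatial dimension
    # (ceil division, so partially covered border cells are counted).
    feat_h = -(-height // stride)
    feat_w = -(-width // stride)
    # Flattening the 2D grid gives one token per spatial position.
    seq_len = feat_h * feat_w
    return {
        "feature_map": (d_model, feat_h, feat_w),
        "encoder_input": (seq_len, d_model),
        "encoder_output": (seq_len, d_model),
        "decoder_output": (num_queries, d_model),
    }


shapes = detr_shapes(640, 480)
```

For a 640 by 480 input this gives a 20 by 15 grid, so the encoder works on 300 tokens while the decoder always emits 100 embeddings. The number of decoder outputs depends only on the model, never on the image.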
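Finally, for each decoder output, a feed-forward network predicts the normalized center coordinates, height, and width of the bounding box, and a linear layer predicts the class label using a softmax function. The set of labels includes a special "no object" class for outputs that do not correspond to any object.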
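The prediction step above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than the official code: I assume the box is ordered as (center x, center y, width, height) with values between 0 and 1, and that the last class index stands for "no object". All function names are hypothetical.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def to_pixel_corners(box, image_width, image_height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return (
        (cx - w / 2) * image_width,
        (cy - h / 2) * image_height,
        (cx + w / 2) * image_width,
        (cy + h / 2) * image_height,
    )


def decode_prediction(class_logits, box, image_width, image_height, class_names):
    """Return (label, probability, pixel_box); label is None for 'no object'."""
    probs = softmax(class_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Assumption: the index just past the real classes means "no object".
    label = class_names[best] if best < len(class_names) else None
    return label, probs[best], to_pixel_corners(box, image_width, image_height)


class_names = ["cat", "dog"]
label, prob, pixel_box = decode_prediction(
    [2.0, 0.5, 0.1], (0.5, 0.5, 0.2, 0.4), 640, 480, class_names
)
```

With these made-up logits the output is labeled "cat", and the box maps to roughly (256, 144, 384, 336) in pixels. An output whose highest score falls on the last index would come back with a label of None and can simply be discarded.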
This is a really exciting framework for all deep learning and computer vision enthusiasts. A huge thanks to Facebook for sharing its approach with the community.
Time to buckle up and use this for our next deep learning project!