The Year of Disruption: The Top Computer Vision Trends Shaping 2022
Computer vision is a branch of AI that gives machines a high-level understanding of images and the ability to perform tasks on images that even humans cannot. 2022 has been a boom year for computer vision, and arguably its most productive so far: new technologies were developed, products were launched, new models were introduced, and existing ones received major updates. Out of all the innovations this year, I have shortlisted the top 10 most useful, powerful, and trending topics in computer vision in 2022.
Object Detection and Tracking
Object detection and tracking is a primary research area in computer vision and was among the most active in 2022. The first work in object detection goes back to the early 2000s. Over the following 20 years, the field has improved greatly, and researchers keep striving to build better object detection algorithms. It has a wide range of applications, such as self-driving cars and security and surveillance. Let us first look at what object detection and tracking mean.
As the name suggests, object detection identifies an object and its location in an image. Object tracking, in turn, refers to the ability to identify a particular object and follow its location across the frames of a video. Until 2021, YOLOv5 was widely regarded as the state of the art (SOTA) in object detection. For object tracking, DeepSORT has been one of the most widely used multi-object tracking (MOT) algorithms. In 2022, two technologies related to object detection and tracking blew up.
YOLOv7 is a recent version of the YOLO family. At the time of writing, it is the state of the art for real-time object detection: its authors report that it outperforms the other known object detectors in both speed and accuracy.
Source – github
According to the benchmark chart shown above, YOLOv7 runs up to 120% faster than the previous YOLO versions it is compared against.
ByteTrack is a multi-object tracking system. Multi-object tracking (MOT) is the process of tracking the paths of multiple objects across the frames of a video. Most MOT algorithms apply a threshold to the detector's confidence scores and discard the detections with low scores. ByteTrack does not discard low-score detections. Instead, it first matches the high-score detections to the existing tracklets and then compares the low-score detections with the remaining tracklets: a low-score detection that is similar to a tracklet is kept as a true object, and the rest are treated as background. As a result, ByteTrack performs well even for occluded objects (objects partly hidden behind other objects).
ByteTrack achieves 80.3 MOTA on the MOT17 test set.
Source – github
You can read more about ByteTrack here – https://arxiv.org/abs/2110.06864
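The two-stage matching idea behind ByteTrack can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not the authors' implementation: the names `Detection`, `greedy_match`, and `associate` and the threshold values are hypothetical, matching is greedy by IoU rather than an optimal assignment, and motion prediction and the creation of new tracks are left out.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple      # (x1, y1, x2, y2) in pixels
    score: float    # detector confidence


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, min_iou):
    """Pair tracks with detections, highest IoU first (a simplification).

    tracks maps an integer track id to its last known box.
    Returns (matches, unmatched_tracks).
    """
    pairs = sorted(
        ((iou(box, det.box), tid, di)
         for tid, box in tracks.items()
         for di, det in enumerate(detections)),
        reverse=True,
    )
    matches, used_tracks, used_dets = {}, set(), set()
    for overlap, tid, di in pairs:
        if overlap < min_iou:
            break  # all remaining pairs overlap even less
        if tid in used_tracks or di in used_dets:
            continue
        matches[tid] = detections[di]
        used_tracks.add(tid)
        used_dets.add(di)
    unmatched = {t: b for t, b in tracks.items() if t not in used_tracks}
    return matches, unmatched


def associate(tracks, detections, high_thresh=0.6, min_iou=0.3):
    """Match high-score detections first, then give low-score ones a second chance."""
    high = [d for d in detections if d.score >= high_thresh]
    low = [d for d in detections if d.score < high_thresh]
    first, remaining = greedy_match(tracks, high, min_iou)
    second, lost = greedy_match(remaining, low, min_iou)
    return {**first, **second}, lost


tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [
    Detection((1, 1, 11, 11), 0.9),    # clearly visible object
    Detection((21, 21, 31, 31), 0.2),  # e.g. a partly occluded object
]
matches, lost = associate(tracks, detections)
```

In this toy example, track 2 is kept alive by the low-score detection, which a tracker with a plain confidence threshold would have thrown away.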
Image and Video Generation
Technology has made it possible to generate images and videos from a basic textual description of a situation or image. This is an exciting computer vision technology in 2022, as it can help you visualize what you imagine and show it to others. Soft truncation, a universal training technique for score-based diffusion models, achieved state-of-the-art results in image generation till 2021 on the CIFAR-10, CelebA, CelebA-HQ 256×256, and STL-10 datasets. The field has advanced since then, and many big companies like Meta, Google, and OpenAI came up with different technologies and approaches. Some of the tools and technologies launched in 2022 are described below.
Imagen, developed by Google and launched in 2022, is a text-to-image diffusion model. It takes in a description of an image and produces realistic images. Diffusion models are generative models that can produce high-resolution images. These models work in two steps. In the first step, random Gaussian noise is gradually added to an image; in the second step, the model learns to reverse the process by removing the noise, thereby generating new data.
Imagen encodes the text into encodings and then uses a diffusion model to generate an image from them. A series of diffusion models is used to produce high-resolution images. It is a really interesting technology, as you can visualize your creative thinking just by describing an image and generate whatever you want in moments.
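The first of the two steps above, adding Gaussian noise, can be sketched with the Python standard library. This follows the commonly used DDPM-style formulation with a linear noise schedule, which is an assumption here and not necessarily Imagen's exact setup; the function names and schedule values are illustrative. The second step, learning to remove the noise, requires training a neural network and is not shown.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances that grow linearly (assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """Running product of (1 - beta): the fraction of signal left at each step."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(pixels, t, alpha_bars, rng):
    """Jump straight to step t: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal_scale * p + noise_scale * n for p, n in zip(pixels, noise)]
    # During training, the network would be asked to predict `noise` from `noisy`.
    return noisy, noise


betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)
rng = random.Random(0)

image = [0.5, -0.25, 0.75, 0.0]  # a tiny "image" of four pixel values
slightly_noisy, _ = add_noise(image, 0, alpha_bars, rng)
almost_pure_noise, _ = add_noise(image, 999, alpha_bars, rng)
```

At the first step almost all of the original signal survives, while at the last step the result is close to pure Gaussian noise, which is the starting point for generation.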
Image description – A photo of a Corgi dog riding a bike in Times Square. It is wearing sunglasses and a beach hat.
Source – Google
The text encoder encodes the given description and produces the encodings from which the diffusion models generate high-resolution images. Here is the research paper for your reference: https://imagen.research.google/paper.pdf
What can be more interesting than generating an image from a description? Yes! You guessed it right! Make-A-Video, launched by Meta in 2022, allows you to generate a video from a text description. The model learns what the world looks like from images paired with descriptions, and it uses unlabeled videos to learn how things move.
The concept can be explained in three simple steps: text-to-image generation, learning motion from a set of unlabeled video footage, and an interpolation network that fills in the intermediate frames to form a video. A combination of several models is used to generate high-resolution videos. The link to the research paper is given below if you want to read more about the technology.
Source – makeavideo.studio
You can explore this amazing invention and do many things, like adding motion to a static frame and bringing your imagination to the screen. This technology is exciting and worth exploring. You can read the research paper here: https://arxiv.org/abs/2209.14792
DALL-E 2 is an AI system developed by OpenAI and launched in 2022 that can create realistic images and art from textual descriptions. We have already seen similar technologies, but this system is also well worth exploring and spending some time on. I found DALL-E 2 to be one of the best image generation models available.
The original DALL-E uses a version of GPT-3 modified to generate images and is trained on millions of images from all over the internet. DALL-E combines NLP techniques to understand the meaning of the input text with computer vision techniques to generate the image. It is trained on a large dataset of images and their associated textual descriptions, which allows it to learn the relationships between words and visual features. By learning these relationships, DALL-E can generate images that are coherent with the input text.
Here is the link to the research paper on the original DALL-E if you are interested in reading more: https://arxiv.org/abs/2102.12092
FILM: Frame Interpolation for Large Motion
FILM is another video generation model developed by Google. It turns two nearly identical photos taken a few frames apart into a slow-motion video. By interpolating the frames in between the two input frames, the model makes the result look as if it had been captured by a camera in slow motion. Large motion between the frames is a challenge, but the model's feature extractor takes care of it. The model relies on further concepts that deserve a blog of their own, so this is just an overview for now.
The code and the pre-trained model are available on the google-research GitHub for further reference. The model can produce a slow-motion shot similar to a Live Photo on an iPhone. This can be very interesting and useful, as sometimes we miss a perfect shot, but hey! There is FILM to take care of that.
Source – film-net.github.io
Here is the link to the research paper for your reference: https://arxiv.org/pdf/2202.04901.pdf
Infinite Nature Zero: Generating 3D Flythroughs from Still Photos
I am sure you have seen those crazy shots in movies, and especially in wildlife documentaries, where the camera flies over the scenery to produce a stunning shot of nature or a beautiful time-lapse of the clouds. The Google research team launched the Infinite Nature Zero model in 2022, which can generate those astonishing shots from still images. Yes! You heard it right, from still images. It has "zero" in its name because it requires zero videos for training.
The model uses GANs (Generative adversarial networks) to generate the images. The model uses a self-supervised view generation training pattern by sampling similar views with similar camera angles and trajectories. The model can generate mesmerizing shots without being fed any video.
Source – infinite-nature-zero.github.io
You can look at the research paper for your reference: https://arxiv.org/abs/2207.11148
Applying Transformers to Image Problems
Transformers are neural networks built around self-attention, a mechanism that lends itself well to parallelization. A transformer learns the logic and semantics of sentences and can be used for natural language processing. It differs from earlier NLP techniques such as recurrent networks, which process text one word after another, in that it can process the whole text at once, or in "parallel." The transformer is also effective because words that are distant from each other can be directly compared and analyzed in context, which results in better predictions and a better grasp of the context of the text.
Source – wikipedia
CNNs have long been the standard choice for computer vision tasks, but the use of transformers in this field has been rising: transformer-based models such as the Vision Transformer now match or outperform state-of-the-art CNNs on several benchmarks.
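To make the self-attention idea concrete, here is a minimal sketch of scaled dot-product attention using only the Python standard library. It is an illustration under simplifying assumptions: real transformers first compute queries, keys, and values from the input with learned projection matrices and use several attention heads, both of which are omitted, and the function names are illustrative.

```python
import math


def softmax(xs):
    """Turn raw scores into weights that sum to 1."""
    highest = max(xs)
    exps = [math.exp(x - highest) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors.

    Every query is compared with every key, however far apart the
    tokens are, and the output is a weighted average of the values.
    """
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Three tokens with 2-dimensional embeddings (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = self_attention(tokens, tokens, tokens)
```

Each output row mixes information from all tokens at once. For images, the same computation is applied to a sequence of image patches instead of words.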
Some technologies that blew up in computer vision in 2022:
Scaling Vision Transformers
The Vision Transformer (ViT) has attained state-of-the-art results on many computer vision benchmarks. The scaled-up model achieves a top-1 accuracy of 90.45% on the ImageNet dataset, which was the state of the art when the paper was published. Scaling is the main concept that helps the model reach this level of accuracy: the authors experimented with scaling both the data and the model, analyzed the results, and arrived at a refined final architecture. Here is the link to the research paper if you want to read more about the technology: https://arxiv.org/abs/2106.04560
Pix2Seq: A New Language Interface for Object Detection
Pix2Seq is an object detection framework launched by Google in 2022. The framework treats object detection like predicting the next word in NLP: bounding boxes and class labels are expressed as sequences of tokens, and the model is trained to look at the image and generate these token sequences. The model achieves competitive accuracy on the COCO dataset with few of the task-specific components that other detection algorithms rely on.
Self-supervised learning has been one of the hottest topics in 2022. A self-supervised learning algorithm does not require explicit labels as input. Rather, it learns from a part of the data itself. This addresses our overdependence on labeled data: because the labels are generated automatically from the data, an unsupervised problem is turned into a supervised learning problem. Here are a few techniques launched in computer vision in 2022 for self-supervised learning.
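The idea of turning a bounding box into tokens can be sketched as follows. This is an illustration rather than Pix2Seq's actual code: the function names, the number of bins, and the way the class token is offset past the coordinate bins are assumptions made for the example.

```python
def box_to_tokens(box, class_id, image_size, num_bins=1000):
    """Quantize a box into five integer tokens.

    box is (ymin, xmin, ymax, xmax) in pixels and image_size is
    (height, width). The class token is placed after the coordinate
    bins so that both share one vocabulary (an assumed layout).
    """
    height, width = image_size
    ymin, xmin, ymax, xmax = box

    def quantize(value, extent):
        bin_index = int(value / extent * num_bins)
        return min(num_bins - 1, max(0, bin_index))

    return [
        quantize(ymin, height),
        quantize(xmin, width),
        quantize(ymax, height),
        quantize(xmax, width),
        num_bins + class_id,
    ]


def tokens_to_box(tokens, image_size, num_bins=1000):
    """Map five tokens back to an approximate pixel box and a class id."""
    height, width = image_size
    ymin, xmin, ymax, xmax, class_token = tokens

    def dequantize(token, extent):
        return (token + 0.5) / num_bins * extent  # centre of the bin

    box = (
        dequantize(ymin, height),
        dequantize(xmin, width),
        dequantize(ymax, height),
        dequantize(xmax, width),
    )
    return box, class_token - num_bins


tokens = box_to_tokens((120, 160, 240, 320), class_id=3, image_size=(480, 640))
decoded_box, decoded_class = tokens_to_box(tokens, image_size=(480, 640))
```

Once every object is a short run of integers, a sequence model can generate the detections for an image token by token, just as a language model generates words.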
Data2Vec was launched in January 2022. It learns patterns in the data and uses the same learning method for speech, NLP, or computer vision. The intuition behind Data2Vec is that the model is given a broken/masked view of the input, like the image shown below. The model sees only about 20% of the raw data and is asked to predict the rest by analyzing the patterns and learning an abstract representation of the data, without being fed thousands of labeled images of a cat like traditional algorithms.
The model can reconstruct the image as shown below
The model shows state-of-the-art performance compared to previously existing methods. You can have a look at the research paper for the details.
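The masking step can be illustrated with a short sketch. The names `mask_patches` and `ema_update` are hypothetical, the 20% keep ratio is simply the figure quoted above, and the encoder networks themselves are left out. In Data2Vec, a student network sees the masked input and is trained to predict the representations that a teacher network produces from the full input, and the teacher's weights are a moving average of the student's.

```python
import random


def mask_patches(patches, keep_ratio=0.2, rng=None, mask_value=None):
    """Hide most patches, keeping a random `keep_ratio` fraction visible."""
    rng = rng or random.Random()
    num_keep = max(1, round(len(patches) * keep_ratio))
    kept = set(rng.sample(range(len(patches)), num_keep))
    masked = [p if i in kept else mask_value for i, p in enumerate(patches)]
    return masked, sorted(kept)


def ema_update(teacher_weights, student_weights, decay=0.999):
    """Move the teacher a small step towards the student (moving average)."""
    return [
        decay * t + (1.0 - decay) * s
        for t, s in zip(teacher_weights, student_weights)
    ]


# Ten patches, each reduced to a single number for the sake of the example.
patches = [float(i) for i in range(10)]
student_view, visible = mask_patches(patches, keep_ratio=0.2, rng=random.Random(0))
teacher_view = patches  # the teacher always sees the unmasked input

# After each training step the teacher slowly follows the student.
teacher = ema_update([1.0, 1.0], [0.0, 2.0], decay=0.9)
```

No human-written labels appear anywhere in this setup: the training targets come from the data itself, which is what makes the approach self-supervised.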
Improving Transfer Learning
Transfer learning has made the life of data science enthusiasts easier. Transfer learning is when we use a pre-trained model like VGG-16 to perform a similar task on a custom dataset. The weights learned by the model are reused and retrained on our own data, for a task similar to the one the model was originally trained for. This saves a lot of time and effort, as we don't have to train the model on millions of images from scratch and worry about the model's accuracy and other aspects.
Transfer learning was already fast, and 2022 brought further improvements to it. Here are a few technologies that blew up in 2022.
Robust Fine-Tuning of Zero-Shot Models
Zero-shot models, as the name suggests, are models that are not fine-tuned on a specific dataset. Fine-tuning such a model makes it more accurate on a particular distribution but reduces its robustness under distribution shifts. Robust fine-tuning of zero-shot models improves the accuracy of these models under distribution shifts. The idea is to combine the weights of the zero-shot model and the fine-tuned model (WiSE-FT).
According to the paper, this improves accuracy under distribution shifts by 4 to 6 percentage points over prior work while increasing accuracy on the ImageNet dataset by 1.6 percentage points. Read more about Robust fine-tuning for zero-shot models here: https://arxiv.org/abs/2109.01903
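Combining the two sets of weights amounts to a linear interpolation between them. The sketch below shows this on plain Python lists; the function name `wise_ft`, the parameter names, and the toy weights are illustrative, and a real model would apply the same formula to every weight tensor.

```python
def wise_ft(zero_shot_weights, fine_tuned_weights, alpha=0.5):
    """Interpolate between two models with identical architectures.

    alpha = 0 returns the zero-shot model, alpha = 1 the fine-tuned
    model, and values in between blend the two.
    """
    if zero_shot_weights.keys() != fine_tuned_weights.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [
            (1.0 - alpha) * z + alpha * f
            for z, f in zip(zero_shot_weights[name], fine_tuned_weights[name])
        ]
        for name in zero_shot_weights
    }


# Toy weights for a model with two named parameter groups.
zero_shot = {"encoder": [0.0, 2.0], "head": [1.0]}
fine_tuned = {"encoder": [1.0, 4.0], "head": [3.0]}

blended = wise_ft(zero_shot, fine_tuned, alpha=0.5)
```

The mixing coefficient `alpha` controls the trade-off: values closer to 0 stay near the zero-shot model, and values closer to 1 stay near the fine-tuned one.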
Few-Shot Object Detection With Fully Cross-Transformer
Few-shot object detection refers to the task in which a model detects unseen classes from very few training examples. This method addresses our overdependence on thousands of annotated images. The previous SOTA used a two-branch Siamese network for few-shot object detection. There were a few problems with that network, which the fully cross-transformer-based model takes care of.
The concept behind this approach is that it encodes the small set of images given as training examples. It includes cross-transformers in both the backbone and the detection head and uses SGD to optimize training, reducing the error between the predicted and actual classes. Read more about Few-Shot Object Detection With Fully Cross-Transformer here:
This article gave an overview of the latest technologies that were launched and boomed in computer vision in 2022, and there is still a lot to come. We briefly covered some of the concepts used in these technologies. Computer vision is a vast field that is still working its way up. There is a lot to discover and explore, with lots of opportunities in the near future. I encourage you to research the topics we discussed and dive deep into the glory of computer vision. I hope you liked the article and look forward to exploring further.