YOLOv7- Real-time Object Detection at its Best

Syed Abdul Gaffar Last Updated : 03 Dec, 2023

9 min read

This article was published as a part of the Data Science Blogathon.

Introduction

To achieve complete insight into image or video understanding, we should not only focus on classifying different images but also try to estimate the position of objects in each image precisely. This task is mainly referred to as object detection, which normally consists of different subtasks such as face detection, vehicle detection, person detection, etc. As one of the fundamental computer vision problems, object detection algorithms can provide crucial information for semantic understanding of images and videos and are responsible for many industrial applications, such as image classification, face recognition, autonomous driving, and human behavior analysis. Meanwhile, inheriting these capabilities from the family of neural networks and related learning systems, the progress in these fields greatly impacts the computer vision field and its reach towards different horizons. Let’s have a quick comparison.

Focusing on Object detection models, there are many different object detection models which perform well for certain use cases, but the recent release of YOLOv7, where the researcher claimed that it outperforms all known object detectors in both speed and accuracy and has the highest accuracy 56.8% AP among all known real-time object detectors. The proposed YOLOv7 version E6 performed better than transformer-based detectors like SWINL Cascade-Mask R-CNNR-CNN in terms of speed and accuracy. YOLOv7 outperformed YOLOX, YOLOR, Scaled-YOLOv4, YOLOv5, DETR, Deformable DETR, DINO-5scale-R50, and Vit-Adapter-B. In further reading, we will see what made YOLOv7 outperform these models.

Source: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

Introduction
Model Re-parameterization
Model Architecture
- Extended efficient layer aggregation networks (E-ELAN)
- Model scaling for concatenation-based models
Trainable Bag-of-freebies
Coarse for Auxiliary and Fine for Lead Loss
What are the latest features in YOLOv7?
Training YOLOv7
Conclusion
Frequently Asked Questions

Model Re-parameterization

This technique can significantly increase the performance of the architecture. Model re-parameterization technique merges multiple computational modules into one at the inference stage, thus giving us better inference throughput. It can be regarded as an ensemble technique and can be divided into two categories, i.e., module-level ensemble and model-level ensemble. There are two methods for model-level reparameterization. One is to train multiple identical models with different training data and then average the weights of all the models. The second method is that it performs a weighted average for the weights of models at different iterations. Whereas the Module-level method splits a module into multiple identical or different module branches during training and integrates the branched modules into one during inference.

Model Architecture

Extended efficient layer aggregation networks (E-ELAN)

Source: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object
detectors

Extended efficient layer aggregation networks primarily focus on a model’s number of parameters and computational density. The VovNet (CNN seeks to make DenseNet more efficient by combining all features only once in the last feature map) model and the CSPVoVNet model analyses the influence of the input/output channel ratio and the element-wise operation on the network inference speed. YOLO v7 extended ELAN and called it E-ELAN. The major advantage of ELAN was that by controlling the gradient path, a deeper network can learn and converge more effectively.

E-ELAN majorly changes the architecture in the computational block, and the architecture of the transition layer is entirely unchanged. It uses expand, shuffle, and merge techniques which enhances the learning ability of the network without destroying the original gradient path. The strategy here is to use group convolution to expand the channel and number of computational blocks, which applies the same group parameter and channel multiplier to all the computational blocks of a computational layer. Then, the feature map calculated by each computational block is shuffled and then concatenated together. Hence, the number of channels in each group of the feature maps will be equal to the number of channels in the original architecture. Finally, merge these groups of feature maps. E-ELAN also achieved the capability to learn more diverse features.

Model scaling for concatenation-based models

Source: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

The main motive of model scaling is to regulate a number of the model attributes and generate models comprised of various scales to increase different inference speeds. However, if these methods are applied to the concatenation-based architecture when cutting down or scaling up is performed in-depth, there will be an increase and decrease within the in-degree of a translation layer which is straight away after a concatenation-based computational block.

We can infer that can’t be analyzed different scaling factors separately for a concatenation-based model but must be considered together. Take scaling-up depth as an example, such an action will cause a ratio change between the input channel and output channel of a transition layer, which can cause a decrease in the hardware usage of the model. YOLO v7 compound scaling method can maintain the properties that the model had at the initial design and maintains the optimal structure still, this can be because while scaling the depth factor of a computational block, one should also consider calculating the change of the output channel of that block. Then, perform width factor scaling with the identical amount of change on the transition layers, this maintains the properties that the model had at the initial design and maintains the optimal structure.

Trainable Bag-of-freebies

Planned re-parameterized convolution

Source: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

YOLOv7 proposed planned re-parameterized convolution. In this proposed planned re-parameterized model, authors had found that a layer with residual or concatenation connections, its RepConv should not have an identity connection. Under these circumstances, it can be replaced by RepConvN which contains no identity connections.

RepConv combines 3 × 3 convolutions, 1 × 1 convolution, and identity connection in one convolutional layer. After analyzing the combination and corresponding performance of RepConv and different architectures authors used RepConv without identity connection (RepConvN) to design the architecture of planned re-parameterized convolution. According to the paper, when a convolutional layer with residual or concatenation is replaced by re-parameterized convolution, there should be no identity connections.

Coarse for Auxiliary and Fine for Lead Loss

Source: YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors

The Deep supervision technique is often used in training deep neural networks. It adds an extra auxiliary head in the middle layers of the network, and the shallow network weights with assistant loss as the guide. Here in yolov7 authors call the head responsible for the final output the lead head, and the head used to assist in the training is called the auxiliary head.

Lead head guided label assigner

It is mainly calculated based on the prediction result of the lead head and the ground truth, which generates a soft label through the optimization process. This set of soft labels will be used as the target training model for both the auxiliary head and lead head. According to the paper, the lead head has a relatively strong learning ability, so the soft label generated from it should be more representative of the distribution and correlation between the source data and the target.

Coarse-to-fine lead head guided label assigner

It uses the predicted result of the lead head and the ground truth to generate a soft label. However, in the process, it generates two different sets of soft labels, that is the coarse label and fine label, where the fine label is the same as the soft label generated by the lead head guided label assigner, and the coarse label is generated by allowing more grids to be treated as a positive target. This mechanism allows the importance of fine and coarse labels to be dynamically adjusted during the training process.

What are the latest features in YOLOv7?

Enhanced Features: YOLOv7 has upgraded capabilities for faster and more accurate object detection.
Improved Object Detection: It is better at quickly and accurately recognizing objects in images and videos.
Versatility: The enhancements make YOLOv7 more robust and suitable for various tasks and applications.

Collectively, these improvements make YOLOv7 a more effective and advanced object detection model.

Training YOLOv7

Now let us train yolov7 on a small public dataset.

I suggest you use Google Colab for training. Create a new notebook and change the runtime type to GPU and execute the below code. This will clone the yolov7 repository and install the necessary modules in the colab environment for training.

# Download YOLOv7 repository and install requirements
!git clone https://github.com/WongKinYiu/yolov7
%cd yolov7
!pip install -r requirements.txt

Then, we will download the dataset and extract it into a folder. Execute the below cell to download and extract the dataset. The dataset is also available here. The dataset contains 7 classes of aquatic animals.

!gdown https://drive.google.com/u/1/uc?id=1pmlMhaOw9oUqIH7OZP8d0dBKVOLqAYcy&export=download
!unzip /content/yolov7/data.zip -d dataset

Once the dataset is downloaded, we need to download the YOLOv7 weight file, since training from scratch takes more time, we will finetune already pretrained weight on our dataset. Here we will download yolov7.pt

%cd /content/yolov7
!wget "https://github.com/WongKinYiu/yolov7/releases/download/v0.1/yolov7.pt"

Now we are good to go for the training. We will train the model for 55 epochs with a batch size of 16. For better results, longer training is suggested (more epochs). Make sure the paths are correct in the ‘data.yaml’ file inside the ‘yolov7/dataset’ folder, also make sure paths are correct in the training command and execute the below code. This will start the training.

%cd /content/yolov7
!python train.py --batch 16 --cfg cfg/training/yolov7.yaml --epochs 55 --data /content/yolov7/dataset/data.yaml --weights 'yolov7.pt' --device 0

Source: Author

Once the training is completed, the best weight is saved at the mentioned path, usually inside the ‘runs’ folder. Once we get the weight file, we can be ready for evaluation and inferencing.

Now for testing run the below piece of code. This will perform inference on the test folder. Make sure that the weight is set to trained weights and here we will be using the confidence threshold of 0.5, you can train more and play around with the parameters for better performance.

# Run evaluation
!python detect.py --weights runs/train/exp/weights/best.pt --conf 0.5 --source /content/yolov7/dataset/test/images

To view the inference images, run the below code. This will display the output images one by one.

# display inference on ALL test images
import glob
from IPython.display import Image, display
i = 0
limit = 10000 # max images to print
for imageName in glob.glob('/content/yolov7/runs/detect/exp/*.jpg'): #assuming JPG
    if i < limit:
      display(Image(filename=imageName))
      print("n")
    i = i + 1

Source: Author (output from the custom-trained model)

The Notebook is also available here.

Summary

YOLOv7 Architecture

· Extended Efficient Layer Aggregation Network (E-ELAN)

E-ELAN mainly concentrates on the computational density and parameters of the model architecture. The major advantage of ELAN was that by controlling the gradient path, a deeper network can learn and converge more effectively.

· Model Scaling for Concatenation based Models

The model scaling method for concatenation-based models is performed by scaling depth in a computational block and width scaling in the remaining transmission layers.

Trainable Bag of Freebies

· Planned re-parameterized convolution

In Planned re-parameterized models, a layer with residual or concatenation connections, its RepConv should not have an identity connection, It can be replaced by a RepConvN layer that contains no identity connections.

· Coarse for auxiliary and fine for lead loss

This label assigner is optimized by lead head predictions and ground truth labels to get the labels of the training lead head and auxiliary head at the same time.

Conclusion

Key takeaways

– This is the in-depth analysis of YOLOv7 architecture and the custom training procedure.

– We can use this model for detection as well as tracking.

– The same can be used to count the objects in frames.

Yolov7 is the new state-of-the-art real-time object detection model. You can use it for different industrial applications. Also, you can optimize the model, that is, converting the model to ONNX, TensorRT, etc, which will increase the throughput and run the edge devices. In this blog, we discussed only the basic step for training YoloV7.

Thank you, and Happy Learning.😄

Frequently Asked Questions

Q1.What makes YOLOv6 different from YOLOv7?

YOLOv7 is a newer version with improvements over YOLOv6, offering better features and performance.

Q2.What is the main part of YOLOv7?

The main part, or backbone, of YOLOv7, is the core component that helps it understand images more effectively.

Q3.Is YOLO faster than TensorFlow?

Yes, YOLO is faster than TensorFlow for recognizing objects in images or videos.

Q4.What are the benefits of YOLOv7?

YOLOv7 has advantages such as quick and accurate object recognition, making it valuable for various tasks.

Reference: https://arxiv.org/abs/2207.02696

My LinkedIn

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Syed Abdul Gaffar

I thrive on the thrill of the challenge, tackling complex problems and crafting innovative AI solutions that make a difference. Whether it's optimizing or building sustainable AI ecosystems, I believe in harnessing the power of AI for the greater good. Let's brainstorm, collaborate, and change the world, one byte at a time.

Computer Vision Intermediate Object Detection Python

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Model Deployment

Introduction to Computer Vision

Getting Started with Image Data

Introduction to CNN and Implementation

Introduction to Transfer Learning

CNN Visualization

Overview of Pretrained Models

Inception

ResNets

DenseNets

CSRNet

Introduction to Object Detection

Region Based Convolutional Neural Network

Single Stage Networks

Transformed Based Object Detection Models

Face Detection

Object Tracking

Pose Estimation

Introduction to Image Segmentation

Understanding Deep Learning Architectures for Image Segmentation

Video Classification

Introduction to Image Generation

Zero and Few Shot Learning

YOLOv7- Real-time Object Detection at its Best

Introduction

Table of contents

Model Re-parameterization

Model Architecture

Extended efficient layer aggregation networks (E-ELAN)

Model scaling for concatenation-based models

Trainable Bag-of-freebies

Coarse for Auxiliary and Fine for Lead Loss

What are the latest features in YOLOv7?

Training YOLOv7

Summary

Conclusion

Frequently Asked Questions

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory

lms_analytics