Social Distancing – the term that has taken the world by storm and is transforming the way we live. Social distancing has become a mantra around the world, transcending languages and cultures.
This way of living has been forced upon us by the fastest growing pandemic the world has ever seen – COVID-19. As per the World Health Organization (WHO), COVID-19 has so far infected almost 4 million people and claimed over 230K lives globally. Around 213 countries have been affected so far by the deadly virus.
The biggest cause of concern is that COVID-19 spreads from person to person through contact or if you’re within close proximity of an infected person. Given how densely populated some areas are, this has been quite a challenge.
The only way to prevent the spread of COVID-19 is Social Distancing. Keeping a safe distance from each other is the ultimate way to prevent the spread of this disease (at least until a vaccine is found).
So this got me thinking – I want to build a tool that can potentially detect where each person is in real-time, and return a bounding box that turns red if the distance between two people is dangerously close. This can be used by governments to analyze the movement of people and alert them if the situation turns serious.
Here’s a taste of the social distancing detection tool we’ll be building:
I would recommend going through the below articles and courses if you need a refresher:
I can vividly recall my initial days learning computer vision. I often got confused between these two terms – Image Classification and Object Detection. I used both of these terms interchangeably assuming the idea behind them was similar. And, as a result, I kept getting confused between deep learning projects. Not ideal!
So, I will kickstart the article by answering this perplexing question – are image classification and object detection one and the same?
Think about it – objects are everywhere! That’s why Object Detection and Image Classification are very popular tasks in computer vision. They have a wide range of applications in defense, healthcare, sports, and the space industry.
The fundamental difference between these two tasks is that image classification identifies an object in an image whereas object detection identifies the object as well as its location in an image. Here’s a classic example to understand this difference:
Well, then how is Object Tracking different from Object Detection?
Object Tracking and Object Detection are similar in terms of functionality. Both tasks involve identifying an object and its location. The main difference between them is the type of data they work on: Object Detection deals with images whereas Object Tracking deals with videos.
Object Detection applied on each and every frame of a video turns into an Object Tracking problem.
As a video is a collection of fast-moving frames, Object Tracking identifies an object and its location from each and every frame of a video.
Object Detection is one of the most challenging problems in computer vision. Having said that, there has been an immense improvement over the past 20 years in this field. We can broadly divide this into two generations – before and after deep learning:
Now, I will discuss some of the popular and widely used techniques for Object Detection.
The simplest approach to build an Object Detection model is through a Sliding Window approach. As the name suggests, an image is divided into regions of a particular size and then every region is classified into the respective classes.
Remember that the regions can be overlapping and varying in size as well. It all depends on the way you want to formulate the problem.
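To make the idea concrete, here is a minimal sketch of how the candidate regions could be generated. The function name `sliding_windows` and its parameters are illustrative assumptions; each region would then be handed to a classifier, which is left out here.

```python
def sliding_windows(image_width, image_height, window_size, stride):
    """Yield (x1, y1, x2, y2) regions, scanning left to right and top to bottom."""
    win_w, win_h = window_size
    for y in range(0, image_height - win_h + 1, stride):
        for x in range(0, image_width - win_w + 1, stride):
            yield (x, y, x + win_w, y + win_h)


# A stride smaller than the window size produces overlapping regions.
regions = list(sliding_windows(image_width=8, image_height=6,
                               window_size=(4, 4), stride=2))
print(len(regions))   # 6 regions for this toy 8x6 image
print(regions[0])     # (0, 0, 4, 4)
```

Even this toy image yields 6 regions from a single window size. A real image scanned with several window sizes produces far more, and every one of them needs its own classification.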
Model Workflow
This method is really simple to implement. But it’s a time-consuming process, as it has to classify a huge number of regions. In the next approach, we will see how to reduce the number of regions that need to be classified.
So how do we make this less time-consuming? The cost can be brought down by discarding the regions that are not likely to contain an object. The process of extracting the regions that are likely to contain an object is known as Region Proposals.
Region proposals have a higher probability of containing an object
Many Region Proposal algorithms have been proposed to select a Region of Interest (ROI). Some of the popular ones are objectness, selective search, category-independent object proposals, etc. R-CNN was proposed with the idea of using one of these external region proposal algorithms.
R-CNN stands for Region-based Convolutional Neural Network. It uses one of the external region proposal algorithms to select the region of interest (ROI).
Model Workflow
The predicted regions can be overlapping and varying in size as well. So, Non-Maximum Suppression (NMS) is used to discard redundant bounding boxes based on the Intersection Over Union (IOU) score:
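As a rough sketch of the suppression step, the code below computes the IOU of two boxes and then greedily keeps the highest-scoring box among heavily overlapping ones. The helper names, the `(x1, y1, x2, y2)` box format, and the 0.5 threshold are assumptions chosen for illustration, not values taken from the R-CNN paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IOU of about 0.68, so it is suppressed. The third box does not overlap anything and survives.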
Certainly, R-CNN’s architecture was the State of the Art (SOTA) at the time of the proposal. But it consumes nearly 50 seconds for every test image during inference because of the number of forward passes to a CNN for feature extraction. As you can observe under the model workflow, every region proposal is passed to a CNN for feature extraction.
For example, if an image has 2000 region proposals, then the number of forward passes to the CNN is around 2000. This inevitably led to another model architecture known as Fast R-CNN.
In order to reduce the inference time, a slight change was made to the R-CNN workflow, resulting in Fast R-CNN. The modification was made in the feature extraction of region proposals.
In R-CNN, feature extraction takes place for each region proposal whereas, in Fast R-CNN, feature extraction takes place only once for an original image. Then the relevant ROI features are chosen based on the location of the region proposals. These region proposals are constructed before passing an image to CNN.
Remember that the input to CNN is the actual image without any ROI:
Model Workflow
During inference, Fast R-CNN consumes nearly 2 seconds for each test image and is about 25 times faster than R-CNN. The reason is the change in how ROI features are extracted. For example, if an image has 2000 region proposals, the number of forward passes to the CNN is still just 1.
Can we bring down the inference time even further? Yes! It’s possible. This led to Faster R-CNN, a SOTA model for object detection tasks.
Faster R-CNN replaces the external region proposal algorithm with a Region Proposal Network (RPN). The RPN learns to propose the regions of interest, which in turn saves a lot of time and computation compared to Fast R-CNN.
Faster R-CNN = Fast R-CNN + RPN
Model Workflow
Faster R-CNN takes close to 0.2 seconds for every test image during inference, which makes it about 10 times faster than Fast R-CNN and about 250 times faster than R-CNN.
Social Distancing is the only way to prevent the spread of COVID-19 right now. Recently, Andrew Ng’s Landing AI team created a Social Distancing Tool using the concepts of Computer Vision. This project is inspired by their work. You can download the video from here.
Time to power up your coding skills!
Note: The code is developed on Google Colab. I would recommend using the same. Change the runtime to GPU prior to installing libraries.
Understanding Detectron 2
Detectron 2 is an open-source library for object detection and segmentation created by the Facebook AI Research team, popularly known as FAIR. Detectron 2 implements state-of-the-art architectures like Faster R-CNN, Mask R-CNN, and RetinaNet for solving different computer vision tasks, such as:
The baseline models of Faster R-CNN and Mask R-CNN are available with 3 different backbone combinations. Please refer to this Detectron-2 GitHub repository for additional details.
Let’s begin!
Install Dependencies
Import Libraries
Reading a video
Read a video and save frames to a folder:
Check the frame rate of a video:
Output: 25.0
Download the pre-trained object detection model from Detectron 2’s model zoo. Once it is loaded, the model is ready for inference:
Read an image and pass it to the model for predictions:
Can you guess the output of the model? Yes, objects and their locations, since it’s an object detection model. We can use Visualizer to draw the predictions on the image:
As you can see here, multiple objects are present in an image, like a person, bicycle, and so on. We are well on our way to building the social distancing detection tool!
Next, understand the objects present in an image:
Have a glance at the bounding boxes of an object:
As different objects are present in an image, let’s identify classes and bounding boxes related to only the people:
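A minimal sketch of this filtering step is shown below. It uses plain Python lists as stand-ins for the predicted class ids and boxes (in Detectron 2 these would come from the `pred_classes` and `pred_boxes` fields of the predicted instances). The class id `0` for "person" is an assumption based on the COCO label order used by the pre-trained models, and `keep_people` is a hypothetical helper name.

```python
PERSON_CLASS_ID = 0  # assumption: "person" is the first class in the COCO label list


def keep_people(classes, boxes, person_class_id=PERSON_CLASS_ID):
    """Return only the boxes whose predicted class is 'person'."""
    return [box for cls, box in zip(classes, boxes) if cls == person_class_id]


# Toy predictions: two people, one bicycle (class 1), one car (class 2).
classes = [0, 1, 0, 2]
boxes = [(10, 20, 50, 120), (60, 80, 140, 130),
         (200, 30, 240, 125), (300, 60, 420, 110)]

person_boxes = keep_people(classes, boxes)
print(person_boxes)  # [(10, 20, 50, 120), (200, 30, 240, 125)]
```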
Understand the format of the bounding box:
Draw a bounding box for one of the people:
Our ultimate goal is to compute the distance between two people in an image. Once we know the bounding box for each person, we can easily compute the distance between any two people. But a bounding box is a rectangle, so the challenge is to select a single point on it that represents the person.
I have chosen the bottom center of the rectangle to represent each person. This point sits where the person touches the ground, so the measured distance does not depend on the height of the person:
Define a function that returns the bottom center of every bounding box:
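One way to write such a function is sketched below, assuming each box is given as `(x1, y1, x2, y2)` in pixels with the origin at the top-left corner of the image, so the bottom edge is the larger y value. The name `bottom_center` is illustrative.

```python
def bottom_center(box):
    """Return the (x, y) midpoint of the bottom edge of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = box
    return (int((x1 + x2) / 2), int(y2))


person_boxes = [(10, 20, 50, 120), (200, 30, 240, 125)]
points = [bottom_center(box) for box in person_boxes]
print(points)  # [(30, 120), (220, 125)]
```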
Compute the bottom center for every bounding box and draw the points on the image:
Define a function to compute the Euclidean distance between every two points in an image:
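A small sketch of the distance computation, using only the standard library. The function name `pairwise_distances` and the choice of returning a dictionary keyed by index pairs are assumptions made for readability.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Map each index pair (i, j), i < j, to the Euclidean distance in pixels."""
    return {
        (i, j): math.hypot(points[i][0] - points[j][0],
                           points[i][1] - points[j][1])
        for i, j in combinations(range(len(points)), 2)
    }


points = [(0, 0), (3, 4), (6, 8)]
print(pairwise_distances(points))
# {(0, 1): 5.0, (0, 2): 10.0, (1, 2): 5.0}
```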
Compute the distance between every pair of points:
Define a function that returns the closest people based on the given proximity distance. Here, proximity distance refers to the minimum distance between two people:
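The sketch below shows one possible shape for this function. It returns both the offending pairs and the set of people involved. The name `find_close_pairs` and the toy coordinates are hypothetical, and distances are in pixels.

```python
import math
from itertools import combinations


def find_close_pairs(points, proximity):
    """Return (pairs, people): index pairs closer than `proximity`, and who is in them."""
    pairs = [
        (i, j)
        for i, j in combinations(range(len(points)), 2)
        if math.hypot(points[i][0] - points[j][0],
                      points[i][1] - points[j][1]) < proximity
    ]
    people = sorted({idx for pair in pairs for idx in pair})
    return pairs, people


points = [(0, 0), (50, 0), (300, 0), (360, 0), (1000, 0)]
pairs, people = find_close_pairs(points, proximity=100)
print(pairs)   # [(0, 1), (2, 3)]
print(people)  # [0, 1, 2, 3]
```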
Set the threshold for the proximity distance. Here, I have chosen that to be 100. Let’s find the people who are within the proximity distance:
From the output, we can observe that 4 people come under the red zone as the distance between them is less than the proximity threshold.
Define a function to change the color of the closest people to red:
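As a sketch of the coloring logic only, the helper below decides which color each box should get. The actual drawing (for example with OpenCV's `cv2.rectangle`) is left out so the example runs without extra libraries. The name `box_colors` is an assumption.

```python
RED = (0, 0, 255)    # OpenCV stores colors in BGR order
GREEN = (0, 255, 0)


def box_colors(num_boxes, close_people):
    """Return one color per box: red for people in `close_people`, green otherwise."""
    close = set(close_people)
    return [RED if i in close else GREEN for i in range(num_boxes)]


colors = box_colors(num_boxes=5, close_people=[0, 1, 2, 3])
print(colors[0] == RED, colors[4] == GREEN)  # True True
```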
Let’s change the color of the closest people to red:
So far, we have seen a step by step procedure on how to apply object detection using Detectron-2, compute the distance between every pair of people, and then finally identify the closest people. We will carry out similar steps on each and every frame of the video now:
Define a function that performs all the steps we covered on each and every frame of the video:
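Putting the earlier steps together, a per-frame function could look roughly like the sketch below. It assumes the person boxes for a frame have already been produced by the detector, so the model call and the drawing are omitted. `annotate_frame` is a hypothetical name.

```python
import math
from itertools import combinations

RED, GREEN = (0, 0, 255), (0, 255, 0)  # BGR order, as used by OpenCV


def annotate_frame(person_boxes, proximity):
    """Return one (box, color) pair per person; red marks people closer than `proximity`."""
    # Represent each person by the bottom center of their (x1, y1, x2, y2) box.
    feet = [((x1 + x2) / 2, y2) for x1, y1, x2, y2 in person_boxes]
    too_close = set()
    for i, j in combinations(range(len(feet)), 2):
        if math.hypot(feet[i][0] - feet[j][0], feet[i][1] - feet[j][1]) < proximity:
            too_close.update((i, j))
    return [(box, RED if i in too_close else GREEN)
            for i, box in enumerate(person_boxes)]


# Two toy frames: people close together in the first, far apart in the second.
frames = [
    [(0, 0, 40, 100), (60, 0, 100, 100)],
    [(0, 0, 40, 100), (400, 0, 440, 100)],
]
results = [annotate_frame(boxes, proximity=100) for boxes in frames]
```

Each entry of `results` can then be drawn onto the corresponding frame before the frames are written back to a video.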
Identify the closest people in each frame and change the color to red:
After identifying the closest people in each frame, convert the frames back to a video. That’s it!
Keep in mind that the projection of the camera also matters a lot while computing the distance between the objects in an image.
In our case, I have not taken the projection of the camera into account, on the assumption that its impact on the estimated distance is minimal. However, the more general approach is to transform each frame into a top view (bird’s-eye view) and then compute the distance between two objects in that view. This is a perspective transformation, and it relies on Camera Calibration.
Keep that in mind if you want to explore this further and customize your own social distancing detection tool.
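For readers who want a starting point, the sketch below shows how a point in the image could be mapped to top-view coordinates once a 3x3 perspective (homography) matrix is available. In practice such a matrix is usually estimated from four hand-picked point correspondences, for example with OpenCV's `cv2.getPerspectiveTransform`. The matrix used here is made up purely for illustration, and `to_top_view` is a hypothetical helper.

```python
def to_top_view(point, homography):
    """Map an image point (x, y) to top-view coordinates using a 3x3 matrix."""
    x, y = point
    h = homography
    w = h[2][0] * x + h[2][1] * y + h[2][2]  # perspective divide term
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


# Made-up matrix: points lower in the image (larger y) are scaled down more.
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.5, 1.0]]

print(to_top_view((10, 2), H))  # (5.0, 1.0)
```

Distances would then be computed between the transformed bottom-center points rather than between the raw pixel coordinates.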
This brings us to the end of our tutorial on how to build your own social distancing tool using computer vision. I hope you have enjoyed the tutorial and found it useful. If you have any comments/queries, kindly leave them in the comments section below and I will reach out to you. And remember:
Stay SAFE (Stay Away From Everyone) to prevent the spread of the COVID-19 pandemic
The code is not working in Google Colab. Error:

--------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
in ()
     14
     15 # import some common detectron2 utilities
---> 16 from detectron2 import model_zoo
     17 from detectron2.engine import DefaultPredictor
     18 from detectron2.config import get_cfg

4 frames
/usr/local/lib/python3.6/dist-packages/detectron2/layers/deform_conv.py in ()
      8 from torch.nn.modules.utils import _pair
      9
---> 10 from detectron2 import _C
     11
     12 from .wrappers import _NewEmptyTensorOp

ImportError: libtorch_cpu.so: cannot open shared object file: No such file or directory
---------------------------------------------------------------------------

NOTE: If your import is failing due to a missing package, you can manually install dependencies using either !pip or !apt. To view examples of installing some common dependencies, click the "Open Examples" button below.
Please install torch 1.5 to resolve the error
Great work!! You are using Euclidean distance for measuring distance, but doesn't the variation due to the depth of the two people affect it? Two people standing one in front of the other will appear close, but they may actually be far apart, or vice versa. How can that be taken into account? Is there a true need for taking that into account?
Hi Aravind, I am trying to do the installation. The packages get installed, but when I try to import detectron I get the error below. Any idea what this issue is?

      8 from torch.nn.modules.utils import _pair
      9
---> 10 from detectron2 import _C
     11
     12 from .wrappers import _NewEmptyTensorOp

ImportError: libtorch_cpu.so: cannot open shared object file: No such file or directory
Hi, installing torch 1.5 resolves the error
Is it possible to do face mask detection on same video as well considering the dimensions of face will be very small in such a scenario?
Hi Karanbir, coming to the first question, yes, it's possible to carry out face detection on same video to identify whether a person is wearing a mask or not. This would be another use case of object Detection & Tracking. However, the task becomes easier when the person is facing close to the camera and quite challenging in the other scenario.
Hello there! Great tool, but it seems that you're not considering perspective, so in real applications this would most likely fail. An idea would be to consider points at floor level (the bottom of the rectangles, assuming they touch the floor at all times, which is quite reasonable) and allow a transformation of space for computing distance (think of it as a grid on the floor). Let me know if you implement it. Otherwise, let me know anyway, and I might give it a try. Great job and regards!
Hi, Perspective Transformation is the next step after the starter provided here. As I mentioned in the article, camera projection plays a significant role while computing the distance between the objects. We can implement it using OpenCV's warpPerspective. The challenge here is to choose the four points from the image for converting to a top view.
Getting an error in the 4th-last line, i.e. the out variable: out = cv2.VideoWriter('sample_output.mp4',cv2.VideoWriter_fourcc(*'DIVX'), 25, size). Kindly explain where the path for sample_output.mp4 is being given. Is it required? And will it generate the sample output video automatically? Note: using macOS.
Hi Aravind, I am very much interested in and eager to deploy this social distancing detector use case practically, but I am just learning deep learning. Would you please tell me the prerequisites or the packages with which I can build this model more efficiently?
Hi Aravind, thanks for such a clear article on the social distancing tool. You inspire me to do something similar. However, I intend to modify it for people standing in a queue, one behind another, who are only partially captured by CCTV. I need your suggestion on whether this will work there too, or whether we should just identify the head position instead of the entire human body. And one more query: can we simply select the midpoint of the identified rectangle to find the distance from the top view? Thanks
Hello, can you help me with the project? How can we do it in real time?