Shivani Sharma — August 29, 2021

This article was published as a part of the Data Science Blogathon

Tracking objects through a sequence of images is one of the most active and in-demand areas of machine learning, and a huge variety of techniques and tools already exists. This article will help you start your journey into the world of computer vision!

First, we will introduce some types of visual tracking techniques. Next, we will explain how to classify them. We’ll also talk about the fundamental aspects of direct visual tracking, with a focus on region-based and gradient-based methods. Finally, we’ll show you how to implement these methods in Python. Let’s start!

Visual Tracking – Introduction

Visual tracking, also known as object tracking or video tracking, is the task of estimating the trajectory of a target object in a scene using visual information. This information can come from a variety of imaging sources: optical cameras, thermal imagers, ultrasound, X-ray, or magnetic resonance.

List of the most common imaging devices:

imaging devices


Moreover, visual tracking is a very popular topic because it has applications in a wide variety of tasks. For example, it is applied in the fields of human-computer interaction, robotics, augmented reality, medicine, and the military.

The following image shows examples of visual tracking applications:

Visual Tracking applications


Now let’s look at how we can categorize the solutions available today.

Classification of visual tracking methods

Well-known visual tracking methods can be classified according to the following components:

visual tracking methods


Let’s go much deeper into each component.

Target representation

First, we need to decide what object we are tracking and how to describe it. This component of visual tracking is called the target representation. There are several typical representations of targets. The main ones are:

target representation


Among these target representations, the bounding box is the most common, because it can easily describe many kinds of objects.

Appearance model

So far, we have looked at several ways to represent our target. Now let's take a look at how to model its appearance. The idea of the appearance model is to describe the target object based on the available visual information.

Image histogram

For example, in the image below, we can see a soccer player in blue running across the field. The player is represented by a bounding box.

image histogram

This bounding box defines a histogram. We usually compute the histogram on a grayscale image, but we can also use a color histogram. In the image above, we can compute the color histogram of the rectangular bounding box and use it to distinguish the target player from the green background.

Let's illustrate this with an example. Suppose the histogram of the reference box is 70% blue and 30% green. When the player moves, we slide the bounding box around the nearby area and look for the place whose histogram is closest to the reference, that is, roughly 70% blue. The position with the best match gives the player's new location, and repeating this in every frame lets us track the player.
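The histogram search described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function names (color_histogram, histogram_intersection, find_best_box) are hypothetical, the frame is synthetic, and histogram intersection is one possible way to compare histograms that the article itself does not prescribe.

```python
import numpy as np

def color_histogram(patch, bins=8):
    """Normalized joint RGB histogram of an H x W x 3 uint8 patch."""
    hist, _ = np.histogramdd(
        patch.reshape(-1, 3).astype(float),
        bins=(bins, bins, bins),
        range=((0, 256), (0, 256), (0, 256)),
    )
    return hist.ravel() / hist.sum()

def histogram_intersection(h1, h2):
    """Similarity in [0, 1]; 1 means the normalized histograms are identical."""
    return np.minimum(h1, h2).sum()

def find_best_box(frame, ref_hist, box_h, box_w, step=4):
    """Slide a box over the frame and keep the position with the best histogram match."""
    best_score, best_xy = -1.0, (0, 0)
    for y in range(0, frame.shape[0] - box_h + 1, step):
        for x in range(0, frame.shape[1] - box_w + 1, step):
            candidate = color_histogram(frame[y:y + box_h, x:x + box_w])
            score = histogram_intersection(ref_hist, candidate)
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy, best_score

# Synthetic frame (an assumption for the demo): green field, blue 16 x 8 "player".
frame = np.zeros((64, 64, 3), dtype=np.uint8)
frame[..., 1] = 200
frame[20:36, 40:48] = (0, 0, 255)

ref_hist = color_histogram(frame[20:36, 40:48])   # histogram of the reference box
best_xy, best_score = find_best_box(frame, ref_hist, box_h=16, box_w=8)
print(best_xy, best_score)
```

A histogram ignores where pixels sit inside the box, which makes it tolerant to deformation but also means that different boxes with the same color mix score equally well.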

Image intensity

In addition, we can use the reference image itself as an appearance model. In this case, the target object is represented as a set of pixel intensities. As the target moves, our goal is to find the region of the current image that best matches the reference image. This process is called template matching: it locates the area of the image that matches a previously defined template. The problem in visual tracking is that the target's image can be deformed, rotated, projected, and so on, so plain template matching does not work well when the image is distorted.

image intensity

We can also represent the target with a filter bank, which computes a response image from the original pixel intensities. Distribution fields are another possible appearance model.

Image features

Another very popular type of appearance model is based on image features. Starting from a reference image of the target, we compute a set of distinguishable features that represent it. Feature detection algorithms are often used to extract these features.

image features


Subspace decomposition

In some cases, subspaces of the reference images are used to model the appearance of an object. These more sophisticated models have proven very useful when the appearance of the tracked object changes over time. In this context, Principal Component Analysis (PCA) and dictionary-based approaches are often used. The idea is to decompose the reference images of the target object into components. For example, let's say we have a dataset of images of 100 people. We compute the mean image and add one component to it. Suppose this component captures the direction in which the person is looking, to the left or to the right. Then we can use this component to find people looking to the right (this is the idea behind Eigenfaces, one of the approaches for recognizing people in images).
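As a rough sketch of the "mean image plus one component" idea, the snippet below runs PCA through a singular value decomposition. The dataset is synthetic and every name in it (images, direction, first_component) is an assumption made for the demo, not something taken from a real face dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dataset: 100 flattened 16 x 16 grayscale images that differ
# mainly along one hidden direction (think "looking left" vs "looking right").
n_images, n_pixels = 100, 16 * 16
base = rng.random(n_pixels)
direction = rng.standard_normal(n_pixels)
direction /= np.linalg.norm(direction)
coeffs = 5.0 * rng.standard_normal(n_images)
noise = 0.01 * rng.standard_normal((n_images, n_pixels))
images = base + coeffs[:, None] * direction + noise

# PCA: subtract the mean image, then take the SVD of the centered data.
mean_image = images.mean(axis=0)
centered = images - mean_image
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
first_component = vt[0]            # rows of vt are the principal components

# Each image is approximated as: mean image + weight * first component.
weights = centered @ first_component
reconstruction = mean_image + weights[:, None] * first_component

print(abs(first_component @ direction))          # close to 1: direction recovered
print(np.abs(reconstruction - images).max())     # small residual
```

The sign of each weight then tells us on which side of the mean an image lies along that component, which is how a single component can separate two viewing directions in this toy setup.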

Next, we’ll focus on the types of appearance models that are often used in region-based tracking methods.

Region-based tracking methods

Region-based tracking comes from the idea of tracking a region, or part, of an image. As we said before, we represent the target object with a bounding box. To keep track of an object enclosed by a bounding box, we need to define an appropriate appearance model. In the example below, the appearance model is an image intensity template. We have a reference image of the target on the left, and we are looking for the best match in the current image.

region based tracking


Now that we have chosen the appearance model for our target object, we need to model its movement in the scene. The tracking problem then becomes one of finding the parameters of the motion model that maximize the similarity between the reference image and the target region in the current image. For example, suppose the target only moves horizontally and vertically in the scene. In this case, a simple translational model with two parameters, t_x and t_y, is sufficient to describe the position of the reference image.

Naturally, if the target object moves in a more complex way, then we need to set up and use more complex transformation models with additional degrees of freedom, as shown below:

transformation models

For example, if we are tracking the cover of a book, we must use a projective model, which has 8 degrees of freedom. On the other hand, if the target is not rigid, we need a deformable model. In that case we could use B-splines or thin-plate splines (TPS) to describe the deformation of the object.

Deformable parametric models:

  • Splines (B-splines, TPS, multivariate)

  • Triangular meshes
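To make the rigid motion models more concrete, here is a small sketch that applies a two-parameter translation and an eight-parameter projective transform to the corners of a bounding box. The helper names (translation_matrix, warp_points) and all numeric values are illustrative assumptions.

```python
import numpy as np

def translation_matrix(tx, ty):
    """Translation with two parameters, written as a 3 x 3 homogeneous matrix."""
    return np.array([[1.0, 0.0, tx],
                     [0.0, 1.0, ty],
                     [0.0, 0.0, 1.0]])

def warp_points(H, points):
    """Apply a 3 x 3 homogeneous transform to an N x 2 array of (x, y) points."""
    pts_h = np.hstack([points, np.ones((len(points), 1))])
    mapped = pts_h @ H.T
    return mapped[:, :2] / mapped[:, 2:3]   # divide by the homogeneous coordinate

# Corners of a 100 x 50 bounding box, as (x, y).
corners = np.array([[0.0, 0.0], [100.0, 0.0], [100.0, 50.0], [0.0, 50.0]])

# Translational model: every corner moves by the same (t_x, t_y).
moved = warp_points(translation_matrix(5.0, -3.0), corners)

# Projective model: 8 free parameters, the last entry is fixed at 1.
H = np.array([[1.0,   0.1,  5.0],
              [0.0,   1.0, -3.0],
              [0.001, 0.0,  1.0]])
projected = warp_points(H, corners)

print(moved)
print(projected)
```

Under the translation the box keeps its shape, while under the projective transform the corners move by different amounts, which is the kind of change we see when a book cover tilts away from the camera.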

To initialize the search for the target's current position, we often use its position in the previous frames. So, given the parameter vector p_{t-1} of our motion model in the previous frame t-1, our task is to find a new vector p_t that best aligns the reference image with the current image.

Similarity function

This brings us to an important question: what counts as the best match between the reference and the current image? To find it, we need to find the part of the current image that most closely resembles the reference image. This means that we have to choose a similarity function f between the reference and the current image, just as in template matching. In the following example, the similarity between the first two images must be greater than the similarity between the second two images.

similarity function

Several similarity functions are used to compare the template with the current image. Here are a few of them:

  • Sum of Absolute Differences (SAD)

  • Sum of Squared Differences (SSD)

  • Normalized Cross-Correlation (NCC)

  • Mutual Information (MI)

  • Structural Similarity Index (SSIM)
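The first three measures in the list are short enough to write out directly. The sketch below uses plain NumPy; the function names and the random test patch are assumptions for illustration. Note that SAD and SSD are dissimilarities (smaller is better), whereas NCC is a similarity (closer to 1 is better).

```python
import numpy as np

def sad(a, b):
    """Sum of Absolute Differences: 0 for identical patches."""
    return np.abs(a.astype(float) - b.astype(float)).sum()

def ssd(a, b):
    """Sum of Squared Differences: 0 for identical patches."""
    return ((a.astype(float) - b.astype(float)) ** 2).sum()

def ncc(a, b):
    """Normalized Cross-Correlation in [-1, 1]; 1 means a perfect linear match."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return (a * b).sum() / denom if denom > 0 else 0.0

rng = np.random.default_rng(0)
template = rng.random((8, 8)) * 200.0
brighter = template + 50.0          # same pattern under brighter lighting

print(sad(template, template), ssd(template, template))   # both 0
print(sad(template, brighter), ssd(template, brighter))   # both large
print(ncc(template, brighter))                             # still about 1
```

The last line shows why the choice matters: a uniform brightness change inflates SAD and SSD, while NCC is unaffected because it subtracts the mean and normalizes.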

To summarize, tracking requires us to choose an appearance model of the target object, a motion model, and a similarity function that measures how similar the reference image is to the current image in the video. Given the parameters p_{t-1} for the previous frame t-1, we need a strategy for finding the new model parameters p_t at the current time t. The simplest approach is to define a local search area around the previous parameters p_{t-1}. In the example below, we search from -20 to +20 pixels along the x-axis and from -20 to +20 pixels along the y-axis around the target's position in the previous frame (assuming the motion is a pure translation).

reference image
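A minimal sketch of this local search, assuming a pure translation and SSD as the similarity function, might look as follows. The function local_search, the synthetic frame, and the chosen positions are hypothetical.

```python
import numpy as np

def ssd(a, b):
    """Sum of Squared Differences between two equally sized patches."""
    return ((a - b) ** 2).sum()

def local_search(frame, template, prev_xy, radius=20):
    """Try every integer shift within +/- radius of the previous position."""
    th, tw = template.shape
    px, py = prev_xy
    best_cost, best_xy = np.inf, prev_xy
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            x, y = px + dx, py + dy
            # Skip candidate boxes that fall outside the frame.
            if x < 0 or y < 0 or x + tw > frame.shape[1] or y + th > frame.shape[0]:
                continue
            cost = ssd(frame[y:y + th, x:x + tw], template)
            if cost < best_cost:
                best_cost, best_xy = cost, (x, y)
    return best_xy, best_cost

rng = np.random.default_rng(1)
frame = rng.random((120, 160))              # synthetic current frame
template = frame[50:66, 70:94].copy()       # target now sits at x=70, y=50

# The previous frame had the target at (60, 45), so we search around that point.
new_xy, cost = local_search(frame, template, prev_xy=(60, 45))
print(new_xy, cost)
```

With a radius of 20 this evaluates up to 41 x 41 = 1681 candidate positions per frame, which is still cheap for two parameters but grows quickly as the motion model gains degrees of freedom.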


If we have prior knowledge of the object's motion, we can avoid an exhaustive search over a large neighbourhood of the object's previous position. For example, we can use the classic Kalman filter or more sophisticated filters such as a particle filter.
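As an illustration of how a motion prior narrows the search, here is a small constant-velocity Kalman filter that predicts where the target should appear in the next frame. The state layout, the noise levels Q and R, and the simulated measurements are all assumptions chosen for the demo.

```python
import numpy as np

dt = 1.0  # one frame between measurements

# State: [x, y, vx, vy]. Constant-velocity transition and position-only measurement.
F = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
M = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)
Q = np.eye(4) * 1e-2   # process noise (assumed)
R = np.eye(2) * 1.0    # measurement noise (assumed)

def predict(x, P):
    """Propagate the state and its covariance one frame ahead."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Correct the prediction with a measured position z = (x, y)."""
    innovation = z - M @ x
    S = M @ P @ M.T + R
    K = P @ M.T @ np.linalg.inv(S)
    return x + K @ innovation, (np.eye(4) - K @ M) @ P

x = np.zeros(4)
P = np.eye(4) * 100.0          # large initial uncertainty
for t in range(1, 11):
    x, P = predict(x, P)
    z = np.array([3.0 * t, 2.0 * t])   # target moving 3 px and 2 px per frame
    x, P = update(x, P, z)

predicted_next, _ = predict(x, P)
print(predicted_next[:2])      # predicted position for frame 11
```

The predicted position can serve as the center of a much smaller search window than the plus or minus 20 pixels used in the exhaustive search.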

Gradient-based methods

Another very popular search strategy is gradient descent. We first select a similarity function that is differentiable with respect to the tracking parameters and has a smooth, convex landscape around the best fit. Then we can use gradient methods to find the optimal parameters of the transformation (motion) model.

In the following example, we have a case where we need to calculate SSD (Sum of Squared Differences).

gradient based method


Suppose the green rectangle is the reference image and we want to check its similarity to a region of the current image (the blue rectangle). We compute the SSD by placing the blue rectangle over the green one and subtracting the two images pixel by pixel. Then we square the differences and add them up. If we get a small number, the two patches are similar. This process is shown in the following figure.

SSD computation

It is important to note that the SSD is a function of the vector p, where p = [x, y] holds the translation parameters we are looking for. Calculating the SSD score for the blue rectangle at offsets of plus or minus five pixels around the optimal alignment point gives us this curve. We can clearly see the convex and smooth nature of the SSD in this example.
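The SSD landscape over offsets of plus or minus five pixels can be reproduced with a few lines of NumPy. The smooth synthetic image (a Gaussian blob) and the window position are assumptions; a real image would give a noisier surface.

```python
import numpy as np

# Smooth synthetic image: a Gaussian blob centered at x=40, y=30.
ys, xs = np.mgrid[0:60, 0:80]
image = np.exp(-((xs - 40) ** 2 + (ys - 30) ** 2) / (2 * 8.0 ** 2))

# The template is a 16 x 16 window at the optimal alignment point.
x0, y0, size = 32, 22, 16
template = image[y0:y0 + size, x0:x0 + size]

# SSD as a function of p = [x, y], for offsets from -5 to +5 pixels.
offsets = np.arange(-5, 6)
surface = np.array([
    [((image[y0 + dy:y0 + dy + size, x0 + dx:x0 + dx + size] - template) ** 2).sum()
     for dx in offsets]
    for dy in offsets
])

min_row, min_col = np.unravel_index(surface.argmin(), surface.shape)
print(offsets[min_col], offsets[min_row])   # offset of the minimum
print(np.round(surface[5], 3))              # 1-D slice along x through the minimum
```

The printed slice is zero at the center and rises smoothly on both sides, which is the bowl shape that gradient-based methods rely on.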



In the example above on the right, we see the two-dimensional function from a bird's-eye view. The minimum is in the center, and high values are located around it. If we plot this function along a single dimension, it looks like this:


Let's say we are searching along the x-direction. First, we choose a starting position for x, say x = 4. Then we calculate the gradient of the SSD function at that point. The sign of the gradient tells us in which direction to move in the current image, and stepping against the gradient takes us toward the minimum of the function. We repeat this until the position stops changing.
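Here is a one-dimensional sketch of that procedure, starting from x = 4. To keep sub-pixel shifts simple, the intensity profile is an analytic Gaussian bump that can be evaluated at any real position; the profile, the step size, and the iteration count are assumptions picked for this toy case.

```python
import numpy as np

def signal(x):
    """Smooth 1-D intensity profile, defined for any real x."""
    return np.exp(-(x - 40.0) ** 2 / (2 * 8.0 ** 2))

def signal_dx(x):
    """Derivative of the intensity profile with respect to x."""
    return -(x - 40.0) / 8.0 ** 2 * signal(x)

coords = np.arange(32, 48, dtype=float)   # pixel positions covered by the template
template = signal(coords)                 # reference intensities at shift 0

def ssd(p):
    """SSD between the profile shifted by p and the template."""
    return ((signal(coords + p) - template) ** 2).sum()

def ssd_gradient(p):
    """Derivative of the SSD with respect to the shift p (chain rule)."""
    return 2.0 * ((signal(coords + p) - template) * signal_dx(coords + p)).sum()

p = 4.0          # starting position
step = 2.0       # step size, tuned by hand for this example
for _ in range(100):
    p -= step * ssd_gradient(p)   # move against the gradient

print(p, ssd(p))   # both approach 0, the true alignment
```

In practice, trackers in the Lucas-Kanade family use Gauss-Newton style updates instead of a fixed step size, but the principle of following the gradient of the similarity function is the same.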

So what is the main advantage of gradient descent? Let's take a transformation model with many degrees of freedom, such as the projective model, which we use to track the board in the following example.



First, let’s explain what multiple degrees of freedom mean. Let’s say we have an original rectangle image and a template image. Note that in the example below, the rectangle in the original image on the left is the projected version of the template image on the right.

original and template image

However, we can no longer compute the SSD directly, because the pixels of the two images do not line up. One way to solve this problem is to detect key points in both images and then use a feature matching algorithm to find their correspondences. However, we can also search using the intensity values of the template image. To do this, we apply a warp to the image. As we explained earlier in this article, we transform the image coordinates with a 3 x 3 transformation matrix in homogeneous coordinates, H = [[h11, h12, h13], [h21, h22, h23], [h31, h32, 1]].

This means that we have 8 degrees of freedom, because the matrix has 8 free parameters and one entry that is fixed at 1. Our original rectangle will now undergo a change in perspective. So to calculate the SSD, in addition to the x and y translation parameters, we also need to find the parameters that represent rotation, scaling, skew, and projection.
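To show what "SSD as a function of 8 parameters" means in code, the sketch below samples the current image at the positions a projective transform assigns to each template pixel and then compares the result with the template. Everything here is an assumption for illustration: the helper names, the nearest-neighbour sampling (real trackers usually interpolate), and the parameter values.

```python
import numpy as np

def homography(params):
    """Build a 3 x 3 matrix from 8 free parameters; the ninth entry is fixed at 1."""
    return np.append(np.asarray(params, dtype=float), 1.0).reshape(3, 3)

def sample_warped(image, H, out_shape):
    """Sample the image at H-mapped template coordinates (nearest neighbour)."""
    h, w = out_shape
    ys, xs = np.mgrid[0:h, 0:w]
    pts = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    mapped = H @ pts
    u = np.rint(mapped[0] / mapped[2]).astype(int)
    v = np.rint(mapped[1] / mapped[2]).astype(int)
    u = np.clip(u, 0, image.shape[1] - 1)   # keep samples inside the image
    v = np.clip(v, 0, image.shape[0] - 1)
    return image[v, u].reshape(h, w)

def ssd_under_warp(image, template, params):
    """SSD between the template and the image warped with the given 8 parameters."""
    warped = sample_warped(image, homography(params), template.shape)
    return ((warped - template) ** 2).sum()

rng = np.random.default_rng(2)
image = rng.random((80, 80))

# Hypothetical true transform: shift, slight rotation-like skew, mild perspective.
true_params = [1.0, 0.05, 20.0, -0.05, 1.0, 10.0, 0.0005, 0.0]
template = sample_warped(image, homography(true_params), (32, 32))

wrong_params = [1.0, 0.05, 23.0, -0.05, 1.0, 10.0, 0.0005, 0.0]
cost_true = ssd_under_warp(image, template, true_params)
cost_wrong = ssd_under_warp(image, template, wrong_params)
print(cost_true, cost_wrong)
```

The cost is zero only at the true parameters, so tracking amounts to searching this 8-dimensional space for the minimum, which is where an exhaustive grid becomes impractical.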

So, the main advantage of gradient descent is that when the tracked object rotates, scales, or deforms, we don't have to try thousands and thousands of parameter combinations to find the best transformation. With gradient descent, we can obtain these parameters with very high precision in just a few iterations, which is a significant saving in computational effort.


In this article, we learned that visual tracking techniques have four main components: appearance models, transformation models, similarity measures, and search strategies. We presented several appearance models and also talked about transformation models, both rigid and non-rigid. In addition, we looked at how to calculate the SSD and how to apply gradient descent, one of the most common search strategies. In future articles, we will continue to use these methods.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
