Shivani Sharma — September 7, 2021

This article was published as a part of the Data Science Blogathon


In this tutorial, we will discuss the implementation of the YOLO object detection system in TensorFlow 2.0.

YOLO (You Only Look Once) is a state-of-the-art object detection network designed by Joseph Redmon. Models in the YOLO family are exceptionally fast and far outperform R-CNN ( Region-Based Convolutional Neural Network ) and similar models, which makes real-time object detection possible.

YOLO reframes object detection as a regression problem: it goes directly from the pixels of the image to bounding box coordinates and class probabilities. Thus, a single convolutional network predicts several bounding boxes and the class probabilities for the objects they contain.

Object detection with YOLOv3: Theory

Since YOLO needs only one look at the image, the sliding-window method is not suitable here. Instead, the image is divided into an S x S grid of cells. Each cell can contain several different objects to be recognized.
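As a tiny illustration of the grid idea (the grid size and center coordinates below are made-up numbers), the cell responsible for an object is the one that contains the object's center:

```python
# Which cell of an S x S grid "owns" an object: the cell containing its
# center. S and the center coordinates are illustrative values.
S = 13
cx, cy = 0.41, 0.67                      # object center, normalized to [0, 1]
cell_col, cell_row = int(cx * S), int(cy * S)
print(cell_col, cell_row)                # 5 8
```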

First, each cell is responsible for predicting a number of bounding boxes. Each cell also predicts a confidence value for every box it proposes, that is, the probability that the box actually contains an object. If a grid cell contains no object, it is important that its confidence values come out low.

When we render all the predictions, we get a map of objects and bounding boxes ordered by confidence.

Object detection

Image 1

Second, each cell is responsible for predicting class probabilities. These are conditional probabilities: they do not say that the cell contains an object, only which class the object would most likely belong to if one were present. For example, if a cell predicts a car, this does not guarantee that a car is actually there. It only says that if an object is present, that object is most likely a car.
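Combining the two predictions can be sketched in a few lines (the confidence and class probabilities below are made-up values):

```python
import numpy as np

# Class-specific score for a box: P(class | object) * P(object).
# The confidence and class probabilities here are illustrative.
confidence = 0.9                              # P(object) for this box
class_probs = np.array([0.05, 0.85, 0.10])    # P(class | object)

scores = confidence * class_probs
best = int(np.argmax(scores))
print(best, round(float(scores[best]), 3))    # 1 0.765
```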

Let’s describe the output of the model in more detail.

YOLO uses anchor boxes (predefined reference box shapes) to predict bounding boxes. The idea behind anchor boxes is to pre-define a set of different shapes, so that each prediction can be combined with an anchor box (in general, we could use even more anchor boxes). These anchors were computed on the COCO ( Common Objects in Context ) dataset using k-means clustering.

Object detection with YOLOv3

Image 2

We have a grid where each cell predicts:

  • For each bounding box:

    • 4 coordinates (x, y, w, h)

    • 1 objectness score, an indicator of the model's confidence that an object is present

  • C class probabilities

If the cell is offset from the top-left corner of the image by (cx, cy), then the predictions correspond to:

bx = sigmoid(tx) + cx
by = sigmoid(ty) + cy
bw = pw * e^tw
bh = ph * e^th

where pw ( width ) and ph ( height ) correspond to the width and height of the anchor box, and tx, ty, tw, th are the raw network outputs. These are the outputs of our neural network. In total there are S x S x [B * (4+1+C)] outputs, where B is the number of bounding boxes a cell of the feature map can predict, C is the number of classes, 4 is for the bounding box coordinates, and 1 is for the objectness prediction. In one pass, we can go from the input image to the output tensor that corresponds to the detected objects in the image. It is also worth noting that YOLOv3 predicts bounding boxes at three different scales.
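As a quick sanity check on that count, using the values that appear later in the article (B = 3 anchors per scale, C = 80 COCO classes) and the coarsest 13 x 13 grid:

```python
# Output tensor size S x S x [B * (4 + 1 + C)] for one YOLOv3 scale,
# with B = 3 anchors per cell and C = 80 COCO classes.
S, B, C = 13, 3, 80
per_cell = B * (4 + 1 + C)     # 4 box coords + 1 objectness + C classes
total = S * S * per_cell
print(per_cell, total)         # 255 43095
```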

Now, if we multiply the class probabilities by the confidence values, we get all the bounding boxes weighted by the probability that they contain a given object.

Simple thresholding rids us of low-confidence predictions. For the next step, it is important to define the IoU ( Intersection over Union ) metric, which is equal to the ratio of the area of intersection of two boxes to the area of their union.

intersection over union

Image 3

After this, duplicates can still remain, and to get rid of them we use non-maximum suppression. Non-maximum suppression works as follows: the algorithm takes the bounding box with the highest probability of belonging to an object, and then, among the remaining boxes in that area, suppresses every box whose IoU with it is too high.
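Before wiring this into TensorFlow, the procedure can be sketched in plain Python (box coordinates, scores, and the threshold below are illustrative; the TensorFlow implementation used later operates on whole tensors at once):

```python
# Minimal sketch of non-maximum suppression on axis-aligned boxes
# given as (x1, y1, x2, y2, score).
def iou(a, b):
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, iou_thresh=0.5):
    # Sort by score, repeatedly keep the best box and drop the
    # overlapping boxes whose IoU with it exceeds the threshold.
    boxes = sorted(boxes, key=lambda b: b[4], reverse=True)
    kept = []
    while boxes:
        best = boxes.pop(0)
        kept.append(best)
        boxes = [b for b in boxes if iou(best, b) < iou_thresh]
    return kept

detections = [(0, 0, 10, 10, 0.9), (1, 1, 10, 10, 0.8), (20, 20, 30, 30, 0.7)]
print(len(nms(detections)))  # 2: the two overlapping boxes collapse into one
```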

Because everything is done in one pass, this model runs almost as fast as plain classification. In addition, all detections are predicted simultaneously, which means the model is implicitly aware of the global context. Simply put, the model can learn which objects usually occur together, their typical relative sizes and positions, and so on.

Yolov3 | Object detection with YOLOv3

Image 4

Implementation of Object detection with YOLOv3 in TensorFlow

The first step in implementing YOLO is preparing the notebook and importing the required libraries.

Following this article, we will build a fully convolutional network ( FCN ) without training it. To apply this network to object detection, we need to download ready-made weights from a pre-trained model; these weights were obtained by training YOLOv3 on the COCO ( Common Objects in Context ) dataset. The weights file can be downloaded from the official website.

# Create a folder for checkpoints with weights.
#!mkdir checkpoints
# Download the file with weights for YOLOv3 from the official site.
#!wget
# Import the required libraries.
import cv2
import numpy as np
import tensorflow as tf
from absl import logging
from itertools import repeat
from PIL import Image
from tensorflow.keras import Model
from tensorflow.keras.layers import Add, Concatenate, Lambda
from tensorflow.keras.layers import Conv2D, Input, LeakyReLU
from tensorflow.keras.layers import MaxPool2D, UpSampling2D, ZeroPadding2D
from tensorflow.keras.regularizers import l2
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.losses import sparse_categorical_crossentropy

yolo_iou_threshold = 0.6        # IoU threshold for non-maximum suppression.
yolo_score_threshold = 0.6      # Score threshold.
weightyolov3 = 'yolov3.weights' # Path to the file with weights.
size = 416                      # Image size.
checkpoints = 'checkpoints/'    # Path to the checkpoint directory.
num_classes = 80                # The number of classes in the model.

Because the order of layers in Darknet ( the open-source NN framework YOLO was originally written in ) differs from tf.keras, loading the weights with a purely functional API is problematic. In this case, the best solution is to create submodels in Keras. TF Checkpoints are recommended for saving nested submodels and are officially supported by TensorFlow.

# Names of the submodels, in the order their weights appear in the file
# (these match the submodel names used in YoloV3 below).
YOLO_V3_LAYERS = [
    'yolo_darknet',
    'yolo_conv_0',
    'yolo_output_0',
    'yolo_conv_1',
    'yolo_output_1',
    'yolo_conv_2',
    'yolo_output_2',
]

# Function to load the weights of the trained model.
def load_darknet_weights(model, weights_file):
    wf = open(weights_file, 'rb')
    # Skip the Darknet file header.
    major, minor, revision, seen, _ = np.fromfile(wf, dtype=np.int32, count=5)
    for layer_name in YOLO_V3_LAYERS:
        s_model = model.get_layer(layer_name)
        for i, layer in enumerate(s_model.layers):
            if not layer.name.startswith('conv2d'):
                continue
            batch_n = None
            if i + 1 < len(s_model.layers) and \
                    s_model.layers[i + 1].name.startswith('batch_norm'):
                batch_n = s_model.layers[i + 1]
            logging.info("{}/{} {}".format(
                s_model.name, layer.name, 'bn' if batch_n else 'bias'))
            ft = layer.filters
            k_size = layer.kernel_size[0]
            in_dim = layer.input_shape[-1]
            if batch_n is None:
                conv_bias = np.fromfile(wf, dtype=np.float32, count=ft)
            else:
                # Darknet stores [beta, gamma, mean, variance];
                # tf.keras expects [gamma, beta, mean, variance].
                bn_weights = np.fromfile(wf, dtype=np.float32, count=4 * ft)
                bn_weights = bn_weights.reshape((4, ft))[[1, 0, 2, 3]]
            c_shape = (ft, in_dim, k_size, k_size)
            c_weights = np.fromfile(wf, dtype=np.float32, count=np.product(c_shape))
            # Darknet shape (out, in, h, w) -> tf shape (h, w, in, out).
            c_weights = c_weights.reshape(c_shape).transpose([2, 3, 1, 0])
            if batch_n is None:
                layer.set_weights([c_weights, conv_bias])
            else:
                layer.set_weights([c_weights])
                batch_n.set_weights(bn_weights)
    assert len(wf.read()) == 0, 'failed to read all weights'
    wf.close()

At the same stage, we define a function to calculate IoU. We also use batch normalization to normalize the results and speed up training. Since tf.keras.layers.BatchNormalization does not work well for transfer learning, we use a slightly modified subclass.

# Function for calculating IoU.
def interval_overlap(int_1, int_2):
    x1, x2 = int_1
    x3, x4 = int_2
    if x3 < x1:
        return 0 if x4 < x1 else (min(x2, x4) - x1)
    return 0 if x2 < x3 else (min(x2, x4) - x3)

def intersectionOverUnion(b1, b2):
    inter_w = interval_overlap([b1.xmin, b1.xmax], [b2.xmin, b2.xmax])
    inter_h = interval_overlap([b1.ymin, b1.ymax], [b2.ymin, b2.ymax])
    inter_area = inter_w * inter_h
    w1, h1 = b1.xmax - b1.xmin, b1.ymax - b1.ymin
    w2, h2 = b2.xmax - b2.xmin, b2.ymax - b2.ymin
    union_area = w1 * h1 + w2 * h2 - inter_area
    return float(inter_area) / union_area

# BatchNormalization that stays frozen unless the layer is trainable.
class BatchNormalization(tf.keras.layers.BatchNormalization):
    def call(self, x, training=False):
        if training is None:
            training = tf.constant(False)
        training = tf.logical_and(training, self.trainable)
        return super().call(x, training)

# Define 3 anchor boxes for each cell (normalized by the 416-px input).
yolo_anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                         (57, 117), (114, 86), (154, 194), (373, 323)],
                        np.float32) / 416
yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])

At each scale, we define 3 anchor boxes for each cell. In our case, if the mask is:

  • 0, 1, 2 – means the first three anchor boxes will be used

  • 3, 4, 5 – means the fourth, fifth and sixth will be used

  • 6, 7, 8 – means that the seventh, eighth, ninth will be used
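To make the mask indexing concrete, here is the same anchor table indexed by the masks (a small sketch; the anchors are shown in pixels, before dividing by the 416-px input size):

```python
import numpy as np

# The nine anchors from the code above (in pixels) and the masks that
# assign three of them to each output scale.
anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                    (57, 117), (114, 86), (154, 194), (373, 323)])
masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])

# The coarse 13x13 grid (large objects) uses the three largest anchors:
print(anchors[masks[0]].tolist())   # [[114, 86], [154, 194], [373, 323]]
# The fine 52x52 grid (small objects) uses the three smallest:
print(anchors[masks[2]].tolist())   # [[10, 13], [16, 30], [33, 23]]
```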

# Function for drawing bounding boxes.
def draw_outputs(img, outputs, class_names, white_list=None):
    boxes, scores, classes, nums = outputs
    boxes, scores, classes, nums = boxes[0], scores[0], classes[0], nums[0]
    wh = np.flip(img.shape[0:2])
    for i in range(nums):
        if white_list is not None and class_names[int(classes[i])] not in white_list:
            continue
        x1y1 = tuple((np.array(boxes[i][0:2]) * wh).astype(np.int32))
        x2y2 = tuple((np.array(boxes[i][2:4]) * wh).astype(np.int32))
        img = cv2.rectangle(img, x1y1, x2y2, (255, 0, 0), 2)
        img = cv2.putText(img, '{} {:.4f}'.format(
            class_names[int(classes[i])], scores[i]),
            x1y1, cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (0, 0, 255), 2)
    return img

Now it’s time to implement YOLOv3. The idea is to use only convolutional layers. Since there are 53 of them, the easiest way is to write a function to which we pass the parameters that change from layer to layer.

implementation architecture | Object detection with YOLOv3

Image 5

Residual blocks are used for feature extraction. A residual block contains several convolutional layers plus skip connections that bypass those layers.
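The idea of the skip connection can be sketched in a couple of lines (the block function here is a toy stand-in for the convolutional layers):

```python
import numpy as np

# A residual connection adds the block's input back to its output, so
# the layers only have to learn a residual F(x), not the full mapping.
def residual(x, block):
    return x + block(x)

x = np.array([1.0, 2.0])
out = residual(x, lambda v: 0.1 * v)   # toy stand-in for the conv layers
print(out.tolist())  # [1.1, 2.2]
```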

residual blocks | Object detection with YOLOv3

Image 6

We build the model with the Keras functional API, which makes it easy to use: with it, we can easily define branches in our architecture ( as in a ResNet block ) and share layers within the architecture.

def DarkConv(x, ft, k_size, strides=1, batch_n=True):
    if strides == 1:
        padding = 'same'
    else:
        x = ZeroPadding2D(((1, 0), (1, 0)))(x)  # top-left half-padding
        padding = 'valid'
    x = Conv2D(filters=ft, kernel_size=k_size,
               strides=strides, padding=padding,
               use_bias=not batch_n, kernel_regularizer=l2(0.0005))(x)
    if batch_n:
        x = BatchNormalization()(x)
        x = LeakyReLU(alpha=0.1)(x)
    return x

def DarknetResidual(x, ft):
    prev = x
    x = DarkConv(x, ft // 2, 1)
    x = DarkConv(x, ft, 3)
    x = Add()([prev, x])
    return x

def DarkBlock(x, ft, blocks):
    x = DarkConv(x, ft, 3, strides=2)
    for _ in repeat(None, blocks):
        x = DarknetResidual(x, ft)
    return x

def Darknet(name=None):
    x = inputs = Input([None, None, 3])
    x = DarkConv(x, 32, 3)
    x = DarkBlock(x, 64, 1)
    x = DarkBlock(x, 128, 2)
    x = x_36 = DarkBlock(x, 256, 8)
    x = x_61 = DarkBlock(x, 512, 8)
    x = DarkBlock(x, 1024, 4)
    return tf.keras.Model(inputs, (x_36, x_61, x), name=name)

def YoloConv(ft, name=None):
    def yolo_conv(x_in):
        if isinstance(x_in, tuple):
            inputs = Input(x_in[0].shape[1:]), Input(x_in[1].shape[1:])
            x, x_skip = inputs
            x = DarkConv(x, ft, 1)
            x = UpSampling2D(2)(x)
            x = Concatenate()([x, x_skip])
        else:
            x = inputs = Input(x_in.shape[1:])
        x = DarkConv(x, ft, 1)
        x = DarkConv(x, ft * 2, 3)
        x = DarkConv(x, ft, 1)
        x = DarkConv(x, ft * 2, 3)
        x = DarkConv(x, ft, 1)
        return Model(inputs, x, name=name)(x_in)
    return yolo_conv

def YoloOutput(ft, n_anchors, n_classes, name=None):
    def yolo_output(x_in):
        x = inputs = Input(x_in.shape[1:])
        x = DarkConv(x, ft * 2, 3)
        x = DarkConv(x, n_anchors * (n_classes + 5), 1, batch_n=False)
        # Reshape to (batch, grid, grid, anchors, 5 + classes).
        x = Lambda(lambda x: tf.reshape(x, (-1, tf.shape(x)[1], tf.shape(x)[2],
                                            n_anchors, n_classes + 5)))(x)
        return tf.keras.Model(inputs, x, name=name)(x_in)
    return yolo_output

def yolo_boxes(pred, anchors, n_classes):
    g_size = tf.shape(pred)[1]
    b_xy, b_wh, score, c_prob = tf.split(pred, (2, 2, 1, n_classes), axis=-1)
    b_xy = tf.sigmoid(b_xy)
    score = tf.sigmoid(score)
    c_prob = tf.sigmoid(c_prob)
    pred_box = tf.concat((b_xy, b_wh), axis=-1)
    grid = tf.meshgrid(tf.range(g_size), tf.range(g_size))
    grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)
    b_xy = (b_xy + tf.cast(grid, tf.float32)) / tf.cast(g_size, tf.float32)
    b_wh = tf.exp(b_wh) * anchors
    b_x1y1 = b_xy - b_wh / 2
    b_x2y2 = b_xy + b_wh / 2
    bbox = tf.concat([b_x1y1, b_x2y2], axis=-1)
    return bbox, score, c_prob, pred_box

Now let’s define a non-maximum suppression function.

def nonMaximumSuppression(outputs, anchors, masks, classes):
    boxes, conf, o_type = [], [], []
    for output in outputs:
        boxes.append(tf.reshape(output[0], (tf.shape(output[0])[0], -1, tf.shape(output[0])[-1])))
        conf.append(tf.reshape(output[1], (tf.shape(output[1])[0], -1, tf.shape(output[1])[-1])))
        o_type.append(tf.reshape(output[2], (tf.shape(output[2])[0], -1, tf.shape(output[2])[-1])))
    bbox = tf.concat(boxes, axis=1)
    confidence = tf.concat(conf, axis=1)
    c_prob = tf.concat(o_type, axis=1)
    scores = confidence * c_prob
    boxes, scores, classes, valid_detections = tf.image.combined_non_max_suppression(
        boxes=tf.reshape(bbox, (tf.shape(bbox)[0], -1, 1, 4)),
        scores=tf.reshape(scores, (tf.shape(scores)[0], -1, tf.shape(scores)[-1])),
        max_output_size_per_class=100,
        max_total_size=100,
        iou_threshold=yolo_iou_threshold,
        score_threshold=yolo_score_threshold)
    return boxes, scores, classes, valid_detections

Main function:

def YoloV3(size=None, chan=3, anchors=yolo_anchors,
           masks=yolo_anchor_masks, classes=80, training=False):
    x = inputs = Input([size, size, chan])
    x_36, x_61, x = Darknet(name='yolo_darknet')(x)
    x = YoloConv(512, name='yolo_conv_0')(x)
    output_0 = YoloOutput(512, len(masks[0]), classes, name='yolo_output_0')(x)
    x = YoloConv(256, name='yolo_conv_1')((x, x_61))
    output_1 = YoloOutput(256, len(masks[1]), classes, name='yolo_output_1')(x)
    x = YoloConv(128, name='yolo_conv_2')((x, x_36))
    output_2 = YoloOutput(128, len(masks[2]), classes, name='yolo_output_2')(x)
    if training:
        return Model(inputs, (output_0, output_1, output_2), name='yolov3')
    # At inference time, decode the raw outputs and apply NMS.
    boxes_0 = Lambda(lambda x: yolo_boxes(x, anchors[masks[0]], classes))(output_0)
    boxes_1 = Lambda(lambda x: yolo_boxes(x, anchors[masks[1]], classes))(output_1)
    boxes_2 = Lambda(lambda x: yolo_boxes(x, anchors[masks[2]], classes))(output_2)
    outputs = Lambda(lambda x: nonMaximumSuppression(x, anchors, masks, classes))(
        (boxes_0[:3], boxes_1[:3], boxes_2[:3]))
    return Model(inputs, outputs, name='yolov3')

Loss function:

def YoloLoss(anchors, classes=80, ignore_thresh=0.5):
    def yolo_loss(y_true, y_pred):
        # 1. Transform the predictions.
        pred_box, pred_obj, pred_class, pred_xywh = yolo_boxes(
            y_pred, anchors, classes)
        pred_xy = pred_xywh[..., 0:2]
        pred_wh = pred_xywh[..., 2:4]
        # 2. Transform the ground-truth boxes.
        true_box, true_obj, true_class_idx = tf.split(
            y_true, (4, 1, 1), axis=-1)
        true_xy = (true_box[..., 0:2] + true_box[..., 2:4]) / 2
        true_wh = true_box[..., 2:4] - true_box[..., 0:2]
        # Give a higher weight to small boxes.
        box_loss_scale = 2 - true_wh[..., 0] * true_wh[..., 1]
        # 3. Invert the box equations.
        grid_size = tf.shape(y_true)[1]
        grid = tf.meshgrid(tf.range(grid_size), tf.range(grid_size))
        grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)
        true_xy = true_xy * tf.cast(grid_size, tf.float32) - \
            tf.cast(grid, tf.float32)
        true_wh = tf.math.log(true_wh / anchors)
        true_wh = tf.where(tf.math.is_inf(true_wh),
                           tf.zeros_like(true_wh), true_wh)
        # 4. Compute the masks.
        obj_mask = tf.squeeze(true_obj, -1)
        true_box_flat = tf.boolean_mask(true_box, tf.cast(obj_mask, tf.bool))
        best_iou = tf.reduce_max(intersectionOverUnion(
            pred_box, true_box_flat), axis=-1)
        ignore_mask = tf.cast(best_iou < ignore_thresh, tf.float32)
        # 5. Compute each loss component.
        xy_loss = obj_mask * box_loss_scale * \
            tf.reduce_sum(tf.square(true_xy - pred_xy), axis=-1)
        wh_loss = obj_mask * box_loss_scale * \
            tf.reduce_sum(tf.square(true_wh - pred_wh), axis=-1)
        obj_loss = binary_crossentropy(true_obj, pred_obj)
        obj_loss = obj_mask * obj_loss + \
            (1 - obj_mask) * ignore_mask * obj_loss
        class_loss = obj_mask * sparse_categorical_crossentropy(
            true_class_idx, pred_class)
        # 6. Sum over the grid to get one value per batch element.
        xy_loss = tf.reduce_sum(xy_loss, axis=(1, 2, 3))
        wh_loss = tf.reduce_sum(wh_loss, axis=(1, 2, 3))
        obj_loss = tf.reduce_sum(obj_loss, axis=(1, 2, 3))
        class_loss = tf.reduce_sum(class_loss, axis=(1, 2, 3))
        return xy_loss + wh_loss + obj_loss + class_loss
    return yolo_loss

The transform_targets function returns a tuple of tensors with shapes:


[N, 13, 13, 3, 6],

[N, 26, 26, 3, 6],

[N, 52, 52, 3, 6]


where N is the number of labels in the batch, and the trailing 6 corresponds to the [x, y, w, h, obj, class] values of a bounding box.


def transform_targets_for_output(y_true, grid_size, anchor_idxs, classes):
    N = tf.shape(y_true)[0]
    y_true_out = tf.zeros(
        (N, grid_size, grid_size, tf.shape(anchor_idxs)[0], 6))
    anchor_idxs = tf.cast(anchor_idxs, tf.int32)
    indexes = tf.TensorArray(tf.int32, 1, dynamic_size=True)
    updates = tf.TensorArray(tf.float32, 1, dynamic_size=True)
    idx = 0
    for i in tf.range(N):
        for j in tf.range(tf.shape(y_true)[1]):
            if tf.equal(y_true[i][j][2], 0):
                continue  # skip empty padding rows
            anchor_eq = tf.equal(
                anchor_idxs, tf.cast(y_true[i][j][5], tf.int32))
            if tf.reduce_any(anchor_eq):
                box = y_true[i][j][0:4]
                box_xy = (y_true[i][j][0:2] + y_true[i][j][2:4]) / 2
                anchor_idx = tf.cast(tf.where(anchor_eq), tf.int32)
                grid_xy = tf.cast(box_xy // (1 / grid_size), tf.int32)
                indexes = indexes.write(
                    idx, [i, grid_xy[1], grid_xy[0], anchor_idx[0][0]])
                updates = updates.write(
                    idx, [box[0], box[1], box[2], box[3], 1, y_true[i][j][4]])
                idx += 1
    return tf.tensor_scatter_nd_update(
        y_true_out, indexes.stack(), updates.stack())

def transform_targets(y_train, anchors, anchor_masks, classes):
    outputs = []
    grid_size = 13
    anchors = tf.cast(anchors, tf.float32)
    a_area = anchors[..., 0] * anchors[..., 1]
    b_wh = y_train[..., 2:4] - y_train[..., 0:2]
    b_wh = tf.tile(tf.expand_dims(b_wh, -2),
                   (1, 1, tf.shape(anchors)[0], 1))
    b_area = b_wh[..., 0] * b_wh[..., 1]
    inters = tf.minimum(b_wh[..., 0], anchors[..., 0]) * \
        tf.minimum(b_wh[..., 1], anchors[..., 1])
    iou = inters / (b_area + a_area - inters)
    anchor_idx = tf.cast(tf.argmax(iou, axis=-1), tf.float32)
    anchor_idx = tf.expand_dims(anchor_idx, axis=-1)
    y_train = tf.concat([y_train, anchor_idx], axis=-1)
    for anchor_idxs in anchor_masks:
        outputs.append(transform_targets_for_output(
            y_train, grid_size, anchor_idxs, classes))
        grid_size *= 2
    return tuple(outputs)

def preprocess_image(x_train, size):
    return tf.image.resize(x_train, (size, size)) / 255

Finally, we can create our model, load the class names (there are 80 of them in the COCO dataset), and load the weights.

yolo = YoloV3(classes=num_classes)
load_darknet_weights(yolo, weightyolov3)
# Names of the 80 COCO classes; assumes a standard coco.names file
# with one class name per line.
class_names = [c.strip() for c in open('coco.names').readlines()]

def detect_objects(img_path, white_list=None):
    img_raw = tf.image.decode_image(open(img_path, 'rb').read(), channels=3)
    img = tf.expand_dims(img_raw, 0)
    img = preprocess_image(img, size)
    boxes, scores, classes, nums = yolo(img)
    img = cv2.imread(img_path)
    img = draw_outputs(img, (boxes, scores, classes, nums), class_names, white_list)
    cv2.imwrite('detected_{}'.format(img_path), img)
    return Image.open('detected_{}'.format(img_path))

detect_objects('test.jpg', ['bear'])

outcome | Object detection with YOLOv3


In this article, we talked about the distinctive features of YOLOv3 and its advantages over other models. We looked at how to implement it using TensorFlow 2.0 (TF must be at least version 2.0).


Image 1-

Image 2 –

Image 3 –

Image 4 –

Image 5 –

Image 6 –*6HDuqhUzP92iXhHoS0Wl3w.png

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
