Object Detection with YOLOv3 with TensorFlow 2.0
This article was published as a part of the Data Science Blogathon
Introduction
In this tutorial, we will discuss the implementation of the YOLO object detection system in TensorFlow 2.0.
YOLO ("You Only Look Once") is a state-of-the-art object detection network designed by Joseph Redmon. Models in the YOLO family are exceptionally fast and far outperform R-CNN (Region-Based Convolutional Neural Network) and similar models in speed. This allows real-time detection of objects.
YOLO reframes object detection as a regression problem: it goes directly from the pixels of the image to bounding box coordinates and class probabilities. Thus, a single convolutional network predicts several bounding boxes and the class probabilities for the areas they contain.
Object Detection with YOLOv3: Theory
Since YOLO only needs one look at the image, the sliding-window method is not suitable here. Instead, the image is divided into a grid of S x S cells. Each cell can contain several different objects for recognition.
First, each cell is responsible for predicting a number of bounding boxes, and for each of those boxes it predicts a confidence value. In other words, this value determines the probability of finding a particular object in a given area. If some cell of the grid does not contain an object, it is important that the confidence value for this area be low.
When we render all the predictions, we get a map of objects and frames, ordered by confidence.
Image 1
Second, each cell is responsible for predicting class probabilities. This does not mean that the cell definitely contains an object; it is only the probability that, if an object is present, it belongs to a given class. For example, if a cell predicts "car", this does not guarantee that a car is actually present. It only says that if an object is present, that object is most likely a car.
Let’s describe the output of the model in more detail.
YOLO uses anchor boxes (predefined fixed frames) to predict bounding boxes. The idea behind anchor boxes is to predefine two different shapes, so that we can combine two predictions with the two anchor boxes (in general, we could use even more anchor boxes). These anchors were calculated on the COCO (Common Objects in Context) dataset using k-means clustering.
Image 2
We have a grid where each cell predicts:

- For each bounding box:
  - 4 coordinates (x, y, w, h)
  - 1 objectness score, which is an indicator of confidence in the presence of an object
- A number of class probabilities
If a cell is offset from the top-left corner of the image by (cx, cy), then the predictions correspond to:

bx = σ(tx) + cx
by = σ(ty) + cy
bw = pw * e^tw
bh = ph * e^th

where pw (width) and ph (height) correspond to the width and height of the anchor box, and tx, ty, tw, th are the raw network outputs. This is the output of our neural network. In total there are S x S x [B * (4 + 1 + C)] outputs, where B is the number of bounding boxes that a cell of the feature map can predict, C is the number of classes, 4 is for the bounding box coordinates, and 1 for the objectness prediction. In one pass, we can go from the input image to the output tensor, which corresponds to the detected objects in the image. It is also worth noting that YOLOv3 predicts bounding boxes at three different scales.
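As a quick sanity check, we can compute the size of this output tensor directly. This is a small sketch assuming the usual YOLOv3 configuration (B = 3 anchor boxes per scale, C = 80 COCO classes, 416 x 416 input):

```python
# Each grid cell predicts B boxes; each box needs 4 coordinates,
# 1 objectness score and C class probabilities.
B, C = 3, 80
depth = B * (4 + 1 + C)  # channels per grid cell

# For a 416x416 input, YOLOv3 predicts on 13x13, 26x26 and 52x52 grids.
for S in (13, 26, 52):
    print('{0}x{0}x{1} -> {2} values'.format(S, depth, S * S * depth))
```

At the coarsest scale alone this already gives 13 x 13 x 255 = 43,095 values per image.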
Now, if we multiply the class probabilities by the confidence values, we get all the bounding boxes weighted by the probability of containing a particular object.
Simple thresholding saves us from low-confidence predictions. For the next step, it is important to define the IoU (Intersection over Union) metric. This metric is equal to the ratio of the area of the intersection of two regions to the area of their union.
Image 3
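To make the metric concrete, here is a minimal, self-contained sketch; the corner-format (xmin, ymin, xmax, ymax) boxes and the `iou` helper are purely illustrative, not part of the model code:

```python
def iou(box_a, box_b):
    # Boxes in (xmin, ymin, xmax, ymax) format.
    inter_w = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Two 2x2 boxes overlapping in a 1x1 square: intersection 1, union 7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # ~0.1429
```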
After this, duplicates can still remain, and to get rid of them we need non-maximum suppression. Non-maximum suppression works as follows: the algorithm takes the bounding box with the highest probability of belonging to an object, and then suppresses the neighboring bounding boxes whose IoU with it is too high.
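The procedure can be sketched in a few lines of plain Python (the box format, scores and the 0.5 threshold are illustrative; the implementation later in this article relies on TensorFlow's built-in combined non-max suppression):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over (xmin, ymin, xmax, ymax) boxes."""
    def iou(a, b):
        iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = ((a[2] - a[0]) * (a[3] - a[1]) +
                 (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union
    # Visit boxes from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop the remaining boxes that overlap the kept one too strongly.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]: the second box duplicates the first
```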
Due to the fact that everything is done in one run, this model will work almost as fast as classification. In addition, all detections are predicted simultaneously, which means that the model is implicitly aware of the global context. Simply put, the model can find out which objects are usually found together, their relative size and location of objects, and so on.
Image 4
Implementation of Object Detection with YOLOv3 in TensorFlow
The first step in implementing YOLO is preparing the notebook and importing the required libraries.
Following this article, we will build a fully convolutional network (FCN) without training it. To apply this network to object detection, we need to download ready-made weights from a pre-trained model. These weights were obtained by training YOLOv3 on the COCO (Common Objects in Context) dataset. The file with the weights can be downloaded from the official site.
```python
# Create a folder for checkpoints with weights.
# !mkdir checkpoints
# Download the file with weights for YOLOv3 from the official site.
# !wget https://pjreddie.com/media/files/yolov3.weights

# Import the required libraries.
import cv2
import numpy as np
import tensorflow as tf
from absl import logging
from itertools import repeat
from PIL import Image
from tensorflow.keras import Model
from tensorflow.keras.layers import Add, Concatenate, Lambda
from tensorflow.keras.layers import Conv2D, Input, LeakyReLU
from tensorflow.keras.layers import MaxPool2D, UpSampling2D, ZeroPadding2D
from tensorflow.keras.regularizers import l2
from tensorflow.keras.losses import binary_crossentropy
from tensorflow.keras.losses import sparse_categorical_crossentropy

yolo_iou_threshold = 0.6        # IoU threshold.
yolo_score_threshold = 0.6      # Score threshold.
weightyolov3 = 'yolov3.weights'  # Path to the file with weights.
size = 416                      # Image size.
checkpoints = 'checkpoints/yolov3.tf'  # Path to the checkpoint file.
num_classes = 80                # The number of classes in the model.

YOLO_V3_LAYERS = [
    'yolo_darknet',
    'yolo_conv_0',
    'yolo_output_0',
    'yolo_conv_1',
    'yolo_output_1',
    'yolo_conv_2',
    'yolo_output_2',
]
```
Because the order of layers in Darknet (the open-source neural network framework YOLO was originally written in) and in tf.keras differs, loading the weights with a purely functional API would be problematic. In this case, the best solution is to create submodels in Keras. TF Checkpoints are recommended for saving nested submodels and are officially supported by TensorFlow.
```python
# Function to load the weights of the trained model.
def load_darknet_weights(model, weights_file):
    wf = open(weights_file, 'rb')
    # Skip the Darknet header: five int32 values (major, minor, revision, images seen).
    major, minor, revision, seen, _ = np.fromfile(wf, dtype=np.int32, count=5)
    for layer_name in YOLO_V3_LAYERS:
        sub_model = model.get_layer(layer_name)
        for i, layer in enumerate(sub_model.layers):
            if not layer.name.startswith('conv2d'):
                continue
            batch_norm = None
            if i + 1 < len(sub_model.layers) and \
                    sub_model.layers[i + 1].name.startswith('batch_norm'):
                batch_norm = sub_model.layers[i + 1]
            logging.info("{}/{} {}".format(
                sub_model.name, layer.name, 'bn' if batch_norm else 'bias'))
            filters = layer.filters
            size = layer.kernel_size[0]
            in_dim = layer.input_shape[-1]
            if batch_norm is None:
                conv_bias = np.fromfile(wf, dtype=np.float32, count=filters)
            else:
                # Darknet stores [beta, gamma, mean, variance];
                # tf.keras expects [gamma, beta, mean, variance].
                bn_weights = np.fromfile(wf, dtype=np.float32, count=4 * filters)
                bn_weights = bn_weights.reshape((4, filters))[[1, 0, 2, 3]]
            # Darknet shape is (out_dim, in_dim, height, width);
            # tf.keras expects (height, width, in_dim, out_dim).
            conv_shape = (filters, in_dim, size, size)
            conv_weights = np.fromfile(wf, dtype=np.float32,
                                       count=np.prod(conv_shape))
            conv_weights = conv_weights.reshape(conv_shape).transpose([2, 3, 1, 0])
            if batch_norm is None:
                layer.set_weights([conv_weights, conv_bias])
            else:
                layer.set_weights([conv_weights])
                batch_norm.set_weights(bn_weights)
    assert len(wf.read()) == 0, 'failed to read all weights'
    wf.close()
```
At the same stage, we must define a function for calculating IoU. We use batch normalization to normalize the results and speed up training. Since tf.keras.layers.BatchNormalization does not work very well for transfer learning, we use a slightly different approach.
```python
# Function for calculating IoU.
def interval_overlap(interval_1, interval_2):
    x1, x2 = interval_1
    x3, x4 = interval_2
    if x3 < x1:
        return 0 if x4 < x1 else min(x2, x4) - x1
    else:
        return 0 if x2 < x3 else min(x2, x4) - x3

def intersectionOverUnion(box1, box2):
    intersect_w = interval_overlap([box1.xmin, box1.xmax], [box2.xmin, box2.xmax])
    intersect_h = interval_overlap([box1.ymin, box1.ymax], [box2.ymin, box2.ymax])
    intersect_area = intersect_w * intersect_h
    w1, h1 = box1.xmax - box1.xmin, box1.ymax - box1.ymin
    w2, h2 = box2.xmax - box2.xmin, box2.ymax - box2.ymin
    union_area = w1 * h1 + w2 * h2 - intersect_area
    return float(intersect_area) / union_area

# BatchNormalization that stays frozen unless the layer is trainable.
class BatchNormalization(tf.keras.layers.BatchNormalization):
    def call(self, x, training=False):
        if training is None:
            training = tf.constant(False)
        training = tf.logical_and(training, self.trainable)
        return super().call(x, training)

# Define 3 anchor boxes per cell for each of the 3 scales,
# normalized to the 416x416 input.
yolo_anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                         (59, 119), (116, 90), (156, 198), (373, 326)],
                        np.float32) / 416
yolo_anchor_masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])
```
At each scale, we define 3 anchor boxes for each cell. In our case, if the mask is:

- 0, 1, 2 – the first three anchor boxes will be used
- 3, 4, 5 – the fourth, fifth and sixth will be used
- 6, 7, 8 – the seventh, eighth and ninth will be used
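A short sketch of how a mask simply indexes into the anchor list (the anchor values below are the standard YOLOv3 COCO anchors in pixels, assumed here for illustration):

```python
import numpy as np

# Nine anchors (width, height) in pixels, ordered from smallest to largest.
anchors = np.array([(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
                    (59, 119), (116, 90), (156, 198), (373, 326)], np.float32)
masks = np.array([[6, 7, 8], [3, 4, 5], [0, 1, 2]])

# The coarse 13x13 grid uses the largest anchors,
# the fine 52x52 grid the smallest ones.
print(anchors[masks[0]])  # three largest anchors
print(anchors[masks[2]])  # three smallest anchors
```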
```python
# Function for drawing bounding boxes.
def draw_outputs(img, outputs, class_names, white_list=None):
    boxes, scores, classes, nums = outputs
    boxes, scores, classes, nums = boxes[0], scores[0], classes[0], nums[0]
    wh = np.flip(img.shape[0:2])
    for i in range(nums):
        if white_list is not None and \
                class_names[int(classes[i])] not in white_list:
            continue
        x1y1 = tuple((np.array(boxes[i][0:2]) * wh).astype(np.int32))
        x2y2 = tuple((np.array(boxes[i][2:4]) * wh).astype(np.int32))
        img = cv2.rectangle(img, x1y1, x2y2, (255, 0, 0), 2)
        img = cv2.putText(img, '{} {:.4f}'.format(
            class_names[int(classes[i])], scores[i]),
            x1y1, cv2.FONT_HERSHEY_COMPLEX_SMALL, 1, (0, 0, 255), 2)
    return img
```
Now it’s time to implement YOLOv3. The idea is to only use convolutional layers. Since there are 53 of them, the easiest way is to create a function into which we will pass important parameters that change from layer to layer.
Image 5
Convolutional layers are used for feature extraction. A residual block contains several convolutional layers plus skip connections that bypass these layers.
Image 6
We build our model with the Keras functional API, which is easy to use. With it, we can conveniently define branches in our architecture (the residual blocks) and share layers within the architecture.
```python
def DarkConv(x, filters, size, strides=1, batch_norm=True):
    if strides == 1:
        padding = 'same'
    else:
        x = ZeroPadding2D(((1, 0), (1, 0)))(x)  # top-left half-padding
        padding = 'valid'
    x = Conv2D(filters=filters, kernel_size=size, strides=strides,
               padding=padding, use_bias=not batch_norm,
               kernel_regularizer=l2(0.0005))(x)
    if batch_norm:
        x = BatchNormalization()(x)
        x = LeakyReLU(alpha=0.1)(x)
    return x

def DarknetResidual(x, filters):
    prev = x
    x = DarkConv(x, filters // 2, 1)
    x = DarkConv(x, filters, 3)
    x = Add()([prev, x])
    return x

def DarknetBlock(x, filters, blocks):
    x = DarkConv(x, filters, 3, strides=2)
    for _ in repeat(None, blocks):
        x = DarknetResidual(x, filters)
    return x

def Darknet(name=None):
    x = inputs = Input([None, None, 3])
    x = DarkConv(x, 32, 3)
    x = DarknetBlock(x, 64, 1)
    x = DarknetBlock(x, 128, 2)
    x = x_36 = DarknetBlock(x, 256, 8)
    x = x_61 = DarknetBlock(x, 512, 8)
    x = DarknetBlock(x, 1024, 4)
    return tf.keras.Model(inputs, (x_36, x_61, x), name=name)

def YoloConv(filters, name=None):
    def yolo_conv(x_in):
        if isinstance(x_in, tuple):
            inputs = Input(x_in[0].shape[1:]), Input(x_in[1].shape[1:])
            x, x_skip = inputs
            x = DarkConv(x, filters, 1)
            x = UpSampling2D(2)(x)
            x = Concatenate()([x, x_skip])
        else:
            x = inputs = Input(x_in.shape[1:])
        x = DarkConv(x, filters, 1)
        x = DarkConv(x, filters * 2, 3)
        x = DarkConv(x, filters, 1)
        x = DarkConv(x, filters * 2, 3)
        x = DarkConv(x, filters, 1)
        return Model(inputs, x, name=name)(x_in)
    return yolo_conv

def YoloOutput(filters, anchors, classes, name=None):
    def yolo_output(x_in):
        x = inputs = Input(x_in.shape[1:])
        x = DarkConv(x, filters * 2, 3)
        x = DarkConv(x, anchors * (classes + 5), 1, batch_norm=False)
        # Reshape to (batch, grid, grid, anchors, 5 + classes).
        x = Lambda(lambda x: tf.reshape(x, (-1, tf.shape(x)[1], tf.shape(x)[2],
                                            anchors, classes + 5)))(x)
        return tf.keras.Model(inputs, x, name=name)(x_in)
    return yolo_output

def yolo_boxes(pred, anchors, classes):
    grid_size = tf.shape(pred)[1]
    box_xy, box_wh, score, class_probs = tf.split(
        pred, (2, 2, 1, classes), axis=-1)
    box_xy = tf.sigmoid(box_xy)
    score = tf.sigmoid(score)
    class_probs = tf.sigmoid(class_probs)
    pred_box = tf.concat((box_xy, box_wh), axis=-1)
    grid = tf.meshgrid(tf.range(grid_size), tf.range(grid_size))
    grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)
    box_xy = (box_xy + tf.cast(grid, tf.float32)) / tf.cast(grid_size, tf.float32)
    box_wh = tf.exp(box_wh) * anchors
    box_x1y1 = box_xy - box_wh / 2
    box_x2y2 = box_xy + box_wh / 2
    bbox = tf.concat([box_x1y1, box_x2y2], axis=-1)
    return bbox, score, class_probs, pred_box
```
Now let’s define a nonmaximum suppression function.
```python
def nonMaximumSuppression(outputs, anchors, masks, classes):
    boxes, conf, out_type = [], [], []
    for output in outputs:
        boxes.append(tf.reshape(output[0], (tf.shape(output[0])[0], -1,
                                            tf.shape(output[0])[-1])))
        conf.append(tf.reshape(output[1], (tf.shape(output[1])[0], -1,
                                           tf.shape(output[1])[-1])))
        out_type.append(tf.reshape(output[2], (tf.shape(output[2])[0], -1,
                                               tf.shape(output[2])[-1])))
    bbox = tf.concat(boxes, axis=1)
    confidence = tf.concat(conf, axis=1)
    class_probs = tf.concat(out_type, axis=1)
    scores = confidence * class_probs
    boxes, scores, classes, valid_detections = \
        tf.image.combined_non_max_suppression(
            boxes=tf.reshape(bbox, (tf.shape(bbox)[0], -1, 1, 4)),
            scores=tf.reshape(scores,
                              (tf.shape(scores)[0], -1, tf.shape(scores)[-1])),
            max_output_size_per_class=100,
            max_total_size=100,
            iou_threshold=yolo_iou_threshold,
            score_threshold=yolo_score_threshold)
    return boxes, scores, classes, valid_detections
```
Main function:
```python
def YoloV3(size=None, channels=3, anchors=yolo_anchors,
           masks=yolo_anchor_masks, classes=80, training=False):
    x = inputs = Input([size, size, channels])
    x_36, x_61, x = Darknet(name='yolo_darknet')(x)
    x = YoloConv(512, name='yolo_conv_0')(x)
    output_0 = YoloOutput(512, len(masks[0]), classes, name='yolo_output_0')(x)
    x = YoloConv(256, name='yolo_conv_1')((x, x_61))
    output_1 = YoloOutput(256, len(masks[1]), classes, name='yolo_output_1')(x)
    x = YoloConv(128, name='yolo_conv_2')((x, x_36))
    output_2 = YoloOutput(128, len(masks[2]), classes, name='yolo_output_2')(x)
    if training:
        return Model(inputs, (output_0, output_1, output_2), name='yolov3')
    # At inference time, decode the raw outputs and apply non-maximum suppression.
    boxes_0 = Lambda(lambda x: yolo_boxes(x, anchors[masks[0]], classes))(output_0)
    boxes_1 = Lambda(lambda x: yolo_boxes(x, anchors[masks[1]], classes))(output_1)
    boxes_2 = Lambda(lambda x: yolo_boxes(x, anchors[masks[2]], classes))(output_2)
    outputs = Lambda(lambda x: nonMaximumSuppression(x, anchors, masks, classes))(
        (boxes_0[:3], boxes_1[:3], boxes_2[:3]))
    return Model(inputs, outputs, name='yolov3')
```
Loss function:
```python
def YoloLoss(anchors, classes=80, ignore_thresh=0.5):
    def yolo_loss(y_true, y_pred):
        pred_box, pred_obj, pred_class, pred_xywh = yolo_boxes(
            y_pred, anchors, classes)
        pred_xy = pred_xywh[..., 0:2]
        pred_wh = pred_xywh[..., 2:4]
        true_box, true_obj, true_class_idx = tf.split(
            y_true, (4, 1, 1), axis=-1)
        true_xy = (true_box[..., 0:2] + true_box[..., 2:4]) / 2
        true_wh = true_box[..., 2:4] - true_box[..., 0:2]
        # Give a higher weight to small boxes.
        box_loss_scale = 2 - true_wh[..., 0] * true_wh[..., 1]
        grid_size = tf.shape(y_true)[1]
        grid = tf.meshgrid(tf.range(grid_size), tf.range(grid_size))
        grid = tf.expand_dims(tf.stack(grid, axis=-1), axis=2)
        true_xy = true_xy * tf.cast(grid_size, tf.float32) - \
            tf.cast(grid, tf.float32)
        true_wh = tf.math.log(true_wh / anchors)
        true_wh = tf.where(tf.math.is_inf(true_wh),
                           tf.zeros_like(true_wh), true_wh)
        obj_mask = tf.squeeze(true_obj, -1)
        # Ignore false positives whose IoU with any true box is above threshold.
        # Note: this assumes an IoU helper that broadcasts over tensors of boxes.
        true_box_flat = tf.boolean_mask(true_box, tf.cast(obj_mask, tf.bool))
        best_iou = tf.reduce_max(intersectionOverUnion(
            pred_box, true_box_flat), axis=-1)
        ignore_mask = tf.cast(best_iou < ignore_thresh, tf.float32)
        xy_loss = obj_mask * box_loss_scale * \
            tf.reduce_sum(tf.square(true_xy - pred_xy), axis=-1)
        wh_loss = obj_mask * box_loss_scale * \
            tf.reduce_sum(tf.square(true_wh - pred_wh), axis=-1)
        obj_loss = binary_crossentropy(true_obj, pred_obj)
        obj_loss = obj_mask * obj_loss + \
            (1 - obj_mask) * ignore_mask * obj_loss
        class_loss = obj_mask * sparse_categorical_crossentropy(
            true_class_idx, pred_class)
        xy_loss = tf.reduce_sum(xy_loss, axis=(1, 2, 3))
        wh_loss = tf.reduce_sum(wh_loss, axis=(1, 2, 3))
        obj_loss = tf.reduce_sum(obj_loss, axis=(1, 2, 3))
        class_loss = tf.reduce_sum(class_loss, axis=(1, 2, 3))
        return xy_loss + wh_loss + obj_loss + class_loss
    return yolo_loss
```
The transform_targets function returns a tuple of tensors with the shapes:

([N, 13, 13, 3, 6], [N, 26, 26, 3, 6], [N, 52, 52, 3, 6])

where N is the number of labels in the batch, and the number 6 corresponds to [x, y, w, h, obj, class] of the bounding box.
```python
@tf.function
def transform_targets_for_output(y_true, grid_size, anchor_idxs, classes):
    N = tf.shape(y_true)[0]
    y_true_out = tf.zeros(
        (N, grid_size, grid_size, tf.shape(anchor_idxs)[0], 6))
    anchor_idxs = tf.cast(anchor_idxs, tf.int32)
    indexes = tf.TensorArray(tf.int32, 1, dynamic_size=True)
    updates = tf.TensorArray(tf.float32, 1, dynamic_size=True)
    idx = 0
    for i in tf.range(N):
        for j in tf.range(tf.shape(y_true)[1]):
            if tf.equal(y_true[i][j][2], 0):
                continue
            anchor_eq = tf.equal(
                anchor_idxs, tf.cast(y_true[i][j][5], tf.int32))
            if tf.reduce_any(anchor_eq):
                box = y_true[i][j][0:4]
                box_xy = (y_true[i][j][0:2] + y_true[i][j][2:4]) / 2
                anchor_idx = tf.cast(tf.where(anchor_eq), tf.int32)
                grid_xy = tf.cast(box_xy // (1 / grid_size), tf.int32)
                indexes = indexes.write(
                    idx, [i, grid_xy[1], grid_xy[0], anchor_idx[0][0]])
                updates = updates.write(
                    idx, [box[0], box[1], box[2], box[3], 1, y_true[i][j][4]])
                idx += 1
    return tf.tensor_scatter_nd_update(
        y_true_out, indexes.stack(), updates.stack())

def transform_targets(y_train, anchors, anchor_masks, classes):
    outputs = []
    grid_size = 13  # coarsest grid for a 416x416 input
    # Find the best anchor for each ground-truth box by IoU on width/height.
    anchors = tf.cast(anchors, tf.float32)
    anchor_area = anchors[..., 0] * anchors[..., 1]
    box_wh = y_train[..., 2:4] - y_train[..., 0:2]
    box_wh = tf.tile(tf.expand_dims(box_wh, -2),
                     (1, 1, tf.shape(anchors)[0], 1))
    box_area = box_wh[..., 0] * box_wh[..., 1]
    intersection = tf.minimum(box_wh[..., 0], anchors[..., 0]) * \
        tf.minimum(box_wh[..., 1], anchors[..., 1])
    iou = intersection / (box_area + anchor_area - intersection)
    anchor_idx = tf.cast(tf.argmax(iou, axis=-1), tf.float32)
    anchor_idx = tf.expand_dims(anchor_idx, axis=-1)
    y_train = tf.concat([y_train, anchor_idx], axis=-1)
    for anchor_idxs in anchor_masks:
        outputs.append(transform_targets_for_output(
            y_train, grid_size, anchor_idxs, classes))
        grid_size *= 2
    return tuple(outputs)

def preprocess_image(x_train, size):
    return (tf.image.resize(x_train, (size, size))) / 255
```
Finally, we can create our model, load the class names (there are 80 of them in the COCO dataset), and load the weights.
```python
yolo = YoloV3(classes=num_classes)
load_darknet_weights(yolo, weightyolov3)
yolo.save_weights(checkpoints)

# The 80 COCO class names are assumed to be in a local 'coco.names' file,
# one name per line.
class_names = [c.strip() for c in open('coco.names').readlines()]

def detect_objects(img_path, white_list=None):
    image = img_path
    img = tf.image.decode_image(open(image, 'rb').read(), channels=3)
    img = tf.expand_dims(img, 0)
    img = preprocess_image(img, size)
    boxes, scores, classes, nums = yolo(img)
    img = cv2.imread(image)
    img = draw_outputs(img, (boxes, scores, classes, nums),
                       class_names, white_list)
    cv2.imwrite('detected_{}'.format(img_path), img)
    detected = Image.open('detected_{}'.format(img_path))
    detected.show()

detect_objects('test.jpg', ['bear'])
```
Outcome
In this article, we talked about the distinctive features of YOLOv3 and its advantages over other models. We looked at how to implement it using TensorFlow 2.0 (TF must be at least version 2.0).
References
Image 1 – https://www.google.com/url?sa=i&url=https%3A%2F%2Ftechhi.netlify.app%2Farticles%2Fhi556404%2Findex.html&psig=AOvVaw3NqRUq7UcPypeYY1mh53yt&ust=1631013840998000&source=images&cd=vfe&ved=0CAsQjRxqFwoTCKjunpWe6vICFQAAAAAdAAAAABAD
Image 2 – https://www.google.com/url?sa=i&url=https%3A%2F%2Ftechhi.netlify.app%2Farticles%2Fhi556404%2Findex.html&psig=AOvVaw3ugO2Kkt3fEeRUZgLeNlmV&ust=1631013745856000&source=images&cd=vfe&ved=2ahUKEwiR3OroneryAhU5ALcAHYulCUgQjRx6BAgAEAk
Image 3 – https://www.google.com/url?sa=i&url=https%3A%2F%2Fwww.slideshare.net%2Frsanaei%2Fconvolutionalneuralnetworks249580831&psig=AOvVaw2j7LjApcCDhm6FvxsrUJsc&ust=1631013680671000&source=images&cd=vfe&ved=0CAsQjRxqFwoTCODXxMid6vICFQAAAAAdAAAAABAD
Image 4 – https://colab.research.google.com/github/maticvl/dataHacker/blob/master/CNN/DataHacker_rs_%20YoloV3%20TF2.0.ipynb
Image 5 – https://www.google.com/url?sa=i&url=https%3A%2F%2Fprogrammersought.com%2Farticle%2F93361001378%2F&psig=AOvVaw1btCsbFPkb28Btv8lOYvap&ust=1631013543269000&source=images&cd=vfe&ved=2ahUKEwj9552IneryAhVFKbcAHa7PCkkQjRx6BAgAEAk
Image 6 – https://miro.medium.com/max/875/1*6HDuqhUzP92iXhHoS0Wl3w.png