Scratch Detection Using Mask RCNN & Yolov5

Abhinav Yadav 10 Feb, 2023 • 8 min read


This project focuses on car scratch detection, in line with the development of autonomous quality-inspection systems for different types of products. In a parking lot, for example, such detection assures clients that their car is safe and sound, and if something does happen, the detection system helps to handle the situation properly.

Further, the techniques learned in this project can be applied to related problems such as quality assurance and second-hand car valuation. I have treated the task as a single-class problem in which dents, damage, and scratches are all labeled as scratch, and I built a basic app around it with Flask. I will walk you through the thoughts, code, algorithms, and knowledge I gathered while doing this project, which I implemented with Mask RCNN and Yolov5.

Prediction using Yolov5

This is the end result of the model.

Learning Objectives

  1. Learn how to perform custom object detection using Mask RCNN and Yolov5.
  2. Make use of transfer learning with models pre-trained on the COCO dataset and a ResNet50 backbone.
  3. Learn why quality data collection and data annotation are an integral, and the most time-consuming, part of any project.

This article was published as a part of the Data Science Blogathon.

Collecting Our Dataset

To collect data, I wrote a scraper that uses Beautiful Soup to pull images from stock-photo websites such as Adobe, iStock photo, etc.

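import requests
from bs4 import BeautifulSoup

url = ''

# make a request to the url
r = requests.get(url)

# create our soup
soup = BeautifulSoup(r.text, 'html.parser')
images = soup.find_all('img')

# loop over every <img> tag found on the page
for image in images:
    name = image['alt']
    link = image['src']
    # download the image and save it under a file-system-safe name
    im = requests.get(link)
    with open(name.replace(' ', '-').replace('/', '') + '.jpg', 'wb') as f:
        f.write(im.content)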


But it didn’t work: most of the images could not be scraped because of the websites’ policies on scraping. I therefore went ahead and downloaded the images directly from iStock photo, Shutter photo, and Adobe.

I started with around 80 images, increasing it to 350 images and further increasing it to around 900 images for the final annotations.

Instance Segmentation with Mask RCNN

Image segmentation partitions an image into regions at the pixel level. Mask RCNN is a model for instance segmentation, a sub-type of image segmentation that delineates each object instance separately. It builds on Faster RCNN: while Faster RCNN has two outputs for each candidate object, a class label and a bounding-box offset, Mask RCNN adds a third output, i.e. the mask of the object.

Mask RCNN Architecture

The architecture of Mask RCNN consists of the following:

  • Backbone Network
  • Region Proposal Network
  • Mask Representation
  • ROI Align

The advantage of using Mask RCNN to detect scratches on cars is that we can work with polygons and not just bounding boxes. Creating a mask on our target enables us to obtain and visualize the result in a more accurate and succinct way.

Let’s start implementing our problem with Mask RCNN.

Importing the Libraries

Importing all the libraries required to implement our Mask RCNN algorithm.

# importing libraries
import pandas as pd
import numpy as np
import cv2
import os
import re
from PIL import Image
import albumentations as A
from albumentations.pytorch.transforms import ToTensorV2
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torch.utils.data import DataLoader, Dataset
from torch.utils.data.sampler import SequentialSampler
from matplotlib import pyplot as plt

Dividing our Dataset

The data is in .csv format and holds the x, y, w, and h coordinates of each bounding box. The images were annotated with make-sense, a data annotation tool. The last 10 image ids are held out for validation.

image_ids = train_df['image_id'].unique()
valid_ids = image_ids[-10:]
train_ids = image_ids[:-10]
# valid and train df
valid_df = train_df[train_df['image_id'].isin(valid_ids)]
train_df = train_df[train_df['image_id'].isin(train_ids)]

Creating a Scratch Class

Next we create our ScratchDataset class, which loads each image, applies the transforms, and returns the image, its target dictionary, and its image id.

class ScratchDataset(Dataset):
    def __init__(self, dataframe, image_dir, transforms=None):
        self.image_ids = dataframe['image_id'].unique()
        self.df = dataframe
        self.image_dir = image_dir
        self.transforms = transforms
    def __getitem__(self, index: int):
        image_id = self.image_ids[index]
        records = self.df[self.df['image_id'] == image_id]
        image = cv2.imread(f'{self.image_dir}/{image_id}.jpg', cv2.IMREAD_COLOR)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB).astype(np.float32)
        image /= 255.0
        boxes = records[['x', 'y', 'w', 'h']].values
        boxes[:, 2] = boxes[:, 0] + boxes[:, 2]
        boxes[:, 3] = boxes[:, 1] + boxes[:, 3]
        area = (boxes[:, 3] - boxes[:, 1]) * (boxes[:, 2]-boxes[:, 0])
        area = torch.as_tensor(area, dtype=torch.float32)
        # there is only one class
        labels = torch.ones((records.shape[0],), dtype=torch.int64)
        # suppose all instances are not crowd
        iscrowd = torch.zeros((records.shape[0],), dtype=torch.int64)
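        target = {}
        target['boxes'] = boxes
        target['labels'] = labels
        target['image_id'] = torch.tensor([index])
        target['area'] = area
        target['iscrowd'] = iscrowd
        if self.transforms:
            sample = {
                'image': image,
                'bboxes': target['boxes'],
                'labels': labels
            }
            sample = self.transforms(**sample)
            image = sample['image']
            target['boxes'] = torch.tensor(sample['bboxes'], dtype=torch.float32)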
        return image, target, image_id
    def __len__(self) -> int:
        return self.image_ids.shape[0]

Here, ‘image_dir’ is the path to the directory where the images are saved.
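The box arithmetic inside `__getitem__` can be checked in isolation. The following is a minimal sketch, assuming the annotations store the top-left corner plus width and height in pixels; the helper names `xywh_to_xyxy` and `box_area` are hypothetical and not part of the original code.

```python
def xywh_to_xyxy(box):
    """Convert (x, y, w, h) to corner format (x_min, y_min, x_max, y_max)."""
    x, y, w, h = box
    return (x, y, x + w, y + h)


def box_area(box_xyxy):
    """Area of a corner-format box, as stored in target['area']."""
    x_min, y_min, x_max, y_max = box_xyxy
    return (x_max - x_min) * (y_max - y_min)


corners = xywh_to_xyxy((50, 30, 200, 100))
print(corners)            # (50, 30, 250, 130)
print(box_area(corners))  # 20000
```

The corner format is what the ‘pascal_voc’ setting in Albumentations and the torchvision detection models expect.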

Data Augmentation

Here we are using Albumentations for data augmentation.

# Albumentations
def get_train_transform():
    return A.Compose([
        # ToTensorV2 converts the image to a CHW tensor; place augmentations before it
        ToTensorV2(p=1.0)
    ], bbox_params={'format':'pascal_voc', 'label_fields':['labels']})
def get_valid_transform():
    return A.Compose([
        ToTensorV2(p=1.0)
    ], bbox_params={'format': 'pascal_voc', 'label_fields':['labels']})

Creating Our Model

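We are going to use a ResNet50 backbone pre-trained on COCO. Note that the snippet below loads torchvision’s Faster RCNN detector, which predicts boxes only; the Mask RCNN counterpart in torchvision is maskrcnn_resnet50_fpn, which additionally needs per-instance masks in the targets.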

# load a model pre-trained on COCO
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
num_classes = 2 # 1 class scratch+ background
# get number of input features for the classifier
in_features = model.roi_heads.box_predictor.cls_score.in_features
# replace the pre-trained head with a new one
model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)

Next, let’s create an Averager class and the training and validation data loaders, which are key components when training our model.

class Averager:
    def __init__(self):
        self.current_total = 0.0
        self.iterations = 0.0
    def send(self, value):
        self.current_total += value
        self.iterations += 1
    @property
    def value(self):
        # mean of all values sent since the last reset
        if self.iterations == 0:
            return 0
        else:
            return 1.0 * self.current_total / self.iterations
    def reset(self):
        self.current_total = 0.0
        self.iterations = 0.0
def collate_fn(batch):
    # keep images and targets as tuples because box counts differ per image
    return tuple(zip(*batch))
train_dataset = ScratchDataset(train_df, DIR_TRAIN, get_train_transform())
valid_dataset = ScratchDataset(valid_df, DIR_TRAIN, get_valid_transform())
# split the dataset in train and test set
indices = torch.randperm(len(train_dataset)).tolist()
# batch size and other loader arguments are not shown here, so PyTorch defaults apply
train_data_loader = DataLoader(train_dataset, collate_fn=collate_fn)
valid_data_loader = DataLoader(valid_dataset, collate_fn=collate_fn)
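Detection batches cannot be stacked into a single tensor because each image carries a different number of boxes, which is why `collate_fn` simply transposes the batch. Here is a small illustration in plain Python, with placeholder strings standing in for real image tensors:

```python
def collate_fn(batch):
    # Turn a list of (image, target, image_id) samples into
    # three tuples: all images, all targets, all ids.
    return tuple(zip(*batch))


batch = [
    ("img_a", {"boxes": [[0, 0, 10, 10]]}, "a"),
    ("img_b", {"boxes": [[5, 5, 20, 20], [1, 1, 4, 4]]}, "b"),
]
images, targets, image_ids = collate_fn(batch)
print(images)     # ('img_a', 'img_b')
print(image_ids)  # ('a', 'b')
```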

Training our Model

We use ‘cuda’ to access the GPU if one is available to us. The optimizer is SGD with weight_decay=0.0005, momentum=0.9, and a learning rate of 0.005; a step scheduler that would decay the learning rate is left commented out.

device = torch.device('cuda') if torch.cuda.is_available() else torch.device('cpu')
images, targets, image_ids = next(iter(train_data_loader))
model.to(device)
params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(params, lr=0.005, momentum=0.9, weight_decay=0.0005)
# lr_scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=3, gamma=0.1)
lr_scheduler = None
num_epochs = 2
loss_hist = Averager()
itr = 1
for epoch in range(num_epochs):
    loss_hist.reset()
    for images, targets, image_ids in train_data_loader:
        # move the batch to the selected device
        images = list(image.to(device) for image in images)
        targets = [{k: v.to(device) for k, v in t.items()} for t in targets]
        # in training mode the model returns a dict of losses
        loss_dict = model(images, targets)
        losses = sum(loss for loss in loss_dict.values())
        loss_value = losses.item()
        loss_hist.send(loss_value)
        # backpropagate and update the weights
        optimizer.zero_grad()
        losses.backward()
        optimizer.step()
        if itr % 50 == 0:
            print(f'Iteration #{itr} loss: {loss_value}')
        itr += 1
    # update the learning rate
    if lr_scheduler is not None:
        lr_scheduler.step()
    print(f'Epoch #{epoch} loss: {loss_hist.value}')
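The commented-out `StepLR` scheduler would multiply the learning rate by `gamma` every `step_size` epochs. Below is a sketch of the resulting schedule in plain Python, using the values from the commented line; the function name `step_lr` is hypothetical.

```python
def step_lr(base_lr, epoch, step_size=3, gamma=0.1):
    """Learning rate at a given epoch under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Print the learning rate for the first seven epochs.
for epoch in range(7):
    print(epoch, step_lr(0.005, epoch))
```

With only 2 epochs, as in the loop above, the rate would never decay, so leaving the scheduler off changes nothing at that setting.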

But I could not run this to completion, as it took more than 10 hours for a meager 80 images.
The computational cost of custom training with Mask RCNN is huge, and it needs a lot of computing power, which wasn’t available to me.

I hope you have a more powerful machine and can run it yourself.

Object Detection Through Yolov5

Yolo is primarily used for object detection and has become a benchmark algorithm for detecting objects in visual data. Yolov5 is released by Ultralytics (on GitHub). It is faster and more efficient than Yolov4, and it generalizes well to new images.


Yolov5 Architecture

The algorithm works based on the following:

  • Residual blocks
  • Bounding box regression
  • Intersection over Union (IoU)
  • Non-Maximum Suppression
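Two of the building blocks listed above, IoU and non-maximum suppression, are easy to illustrate outside the network. This is a minimal pure-Python sketch of the greedy form of NMS, not the Yolov5 implementation itself; the function names and the 0.5 threshold are chosen for illustration.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Return indices of boxes kept after greedy non-maximum suppression."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(10, 10, 110, 110), (20, 20, 120, 120), (200, 200, 260, 260)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```

The second box overlaps the first with an IoU of about 0.68, so it is suppressed, while the third box is far away and survives.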

Yolov5 is faster, smaller, and roughly as accurate as previous versions. Trained on the coco dataset, it works well with bounding boxes.

Let’s start with the implementation of Yolov5 for our problem; I used Google Colab to run the code.

Data Annotation

  • I used the make-sense data annotator for annotating the dataset.
  • When the data is annotated very precisely, i.e., with small, tight boxes, Yolo doesn’t work well enough, since it doesn’t generalize well to small bounding boxes.
  • Data annotation is therefore a bit tricky, and the regions should be annotated uniformly.
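For reference, a Yolov5 label file has one line per box: the class index followed by the box center x, center y, width, and height, all normalized by the image size. The sketch below converts a pixel box to that format; it assumes corner-format input, and the helper name `to_yolo_line` is hypothetical.

```python
def to_yolo_line(box_xyxy, img_w, img_h, class_id=0):
    """Format a pixel (x1, y1, x2, y2) box as a YOLO label line."""
    x1, y1, x2, y2 = box_xyxy
    cx = (x1 + x2) / 2 / img_w   # normalized box center x
    cy = (y1 + y2) / 2 / img_h   # normalized box center y
    w = (x2 - x1) / img_w        # normalized box width
    h = (y2 - y1) / img_h        # normalized box height
    return f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}"


print(to_yolo_line((80, 60, 240, 180), img_w=320, img_h=320))
# 0 0.500000 0.375000 0.500000 0.375000
```

Since scratch is the only class here, every line starts with class index 0.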


First, we load the model:

model = torch.hub.load('ultralytics/yolov5','yolov5s')

We then added the yaml file and the data in the layout Yolo requires (images in one folder and annotations as text files in another) and trained our model with a batch size of 16 and an image size of 320*320.

!cd yolov5 && python train.py --img 320 --batch 16 --epochs 50 --data carScr_up.yaml --weights

Though the Yolo documentation recommends training for 300 epochs to get good results, we brought it down to 50 epochs, and after hyperparameter tuning our model started doing pretty well even within 30 epochs.

For hyperparameter tuning, we use the evolve option provided by Yolo, wherein the model is trained for 10 epochs in each of 300 generations.

!cd yolov5 && python train.py --img 320 --batch 32 --epochs 10 --data carScr_up.yaml --weights --cache --evolve


The results across the experiments were as follows:

Exp   Precision   Recall   mAP_0.5
1     0.003       0.511    0.001
2     0.659       0.311    0.363
3     0.624       0.536    0.512
4     0.572       0.610    0.519

The image below shows predictions from experiment 4; each experiment was trained on a different number of images and annotations. The predictions for cars with scratches are as follows:

Prediction using Yolov5

The precision and recall in this case are low because with Yolo we are dealing with bounding boxes, and these metrics depend on the Intersection over Union (IoU) of the actual and predicted boxes.
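To see how IoU drives these metrics, consider a simplified sketch that counts a prediction as correct only if it overlaps an unmatched ground-truth box with IoU of at least 0.5. It uses greedy matching and made-up boxes, and it is an illustration rather than the exact evaluation code in Yolov5.

```python
def iou(a, b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def precision_recall(preds, truths, iou_threshold=0.5):
    """Greedy one-to-one matching; preds are assumed sorted by confidence."""
    unmatched = list(truths)
    tp = 0
    for p in preds:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= iou_threshold:
            tp += 1
            unmatched.remove(best)
    precision = tp / len(preds) if preds else 0.0
    recall = tp / len(truths) if truths else 0.0
    return precision, recall


# Two thin scratches; the first prediction finds its scratch but is too tall.
truths = [(0, 0, 100, 20), (150, 150, 200, 170)]
preds = [(0, 0, 100, 50), (150, 150, 200, 172)]
print(precision_recall(preds, truths))  # (0.5, 0.5)
```

The first prediction fully covers its scratch, yet its IoU is only 0.4, so it counts as both a false positive and a missed scratch. Thin targets such as scratches are especially sensitive to this effect.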

Let’s look at the metrics obtained after training our dataset with Yolov5 for 50 epochs


We can see that progress stagnates after 20 epochs. Yolo is thus pretty quick to learn the relationship and generalizes well to our problem statement, even though we had fewer than 1000 images.


We can see that Yolov5 works really well for our problem statement. Mask RCNN should suit it too, though I could not train that model to completion. With custom training, and the metrics aside, Yolov5 predicts extremely well, detecting all the scratches and damage in a sample image. We thus have a pretty good model, and along the way we learned how to collect and annotate data and what it takes to train different models.

  • In the above, I have considered damage and scratch as a single class.
  • Data collection and annotation are an integral and exhausting part of this solution.
  • We can definitely do better if we use polygons and increase our dataset size.

PS: This doesn’t work well on cars with no damage, since we trained it only on images of cars with scratches and damage. We can definitely generalize it to suit our needs, for example by adding images of undamaged cars to the training data.

Alternatively, we can follow the link to the research paper mentioned below, where the images are divided into 3*3 grids and used as our training data. This will result in an increase in the ratio of scratch to image, thus generalizing well to the dataset and improving our metrics.
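The grid idea can be sketched as follows. This is only an assumption about how such tiling might be done, not the paper’s actual procedure; the function name `grid_tiles` and the image size are hypothetical.

```python
def grid_tiles(img_w, img_h, rows=3, cols=3):
    """Return (x1, y1, x2, y2) crop boxes that tile the image."""
    tiles = []
    for r in range(rows):
        for c in range(cols):
            # Integer division keeps tile edges aligned with no gaps.
            x1, x2 = c * img_w // cols, (c + 1) * img_w // cols
            y1, y2 = r * img_h // rows, (r + 1) * img_h // rows
            tiles.append((x1, y1, x2, y2))
    return tiles


tiles = grid_tiles(960, 720)
print(len(tiles))  # 9
print(tiles[4])    # (320, 240, 640, 480)
```

Each tile has one ninth of the image area, so a scratch that lies inside a single tile occupies roughly nine times as large a share of its training image. The box annotations would also need to be clipped and shifted to each tile’s coordinates.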

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
