Understanding Label Detection in Invoices using OpenCV
Document image analysis refers to the algorithms and methods used to turn the pixels of an image into a description that a computer can understand. Optical Character Recognition (OCR) uses computer vision to find and read the text in images, and modern OCR models can often produce accurate output within milliseconds. OCR was one of the first problems that computer vision tried to solve, and it has come a long way since then. With the help of OCR models, we built a way to detect labels on invoices, such as the vendor’s name, the bill date, the bill number, the bill amount, and the total number of items. To reach a high level of accuracy, we used an ensemble technique in which separate OCR models handle the detection and the recognition of the labels.
Below are the major learning objectives of this article:
- You will learn how to use OpenCV for label detection on an invoice, such as the invoice number, invoice date, total amount, total number of items, etc.
- You will learn how to get the text’s coordinates from any invoice image.
- You will learn the steps in image preprocessing.
- You will learn how to identify which template a new invoice uses with the help of the template image dataset.
- You will go through code snippets that illustrate the above objectives.
This article was published as a part of the Data Science Blogathon.
- Let’s say we need to detect labels on invoices from different templates and are given a template labels dataset consisting of the labels’ names for several templates.
- The coordinates for the required labels for each template are stored in a table (csv file).
- Layout mapping is done to find the image template for the new invoice so that labels for the new invoice can be found using the coordinates that have already been stored.
- Once the template is found, the coordinates of its labels are retrieved from the table (csv file).
- The extracted coordinates are used to predict the labels of the new invoice.
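The table of label coordinates described above could be stored and queried roughly as in the following minimal sketch. The column names, the template names (vendor_a, vendor_b), and the coordinate values are hypothetical placeholders chosen for illustration; they are not taken from the original dataset.

```python
import csv
import io

# Hypothetical table contents: one row per (template, label) pair.
# Column names and coordinate values are illustrative assumptions.
CSV_TEXT = """template,label,x_min,y_min,x_max,y_max
vendor_a,invoice_number,420,60,580,90
vendor_a,invoice_date,420,95,580,125
vendor_b,invoice_number,50,200,210,230
"""


def load_label_coordinates(csv_file):
    """Return {template: {label: (x_min, y_min, x_max, y_max)}}."""
    table = {}
    for row in csv.DictReader(csv_file):
        box = tuple(int(row[k]) for k in ("x_min", "y_min", "x_max", "y_max"))
        table.setdefault(row["template"], {})[row["label"]] = box
    return table


# In practice csv_file would be an opened csv file; a string buffer
# keeps the sketch self-contained.
coordinates = load_label_coordinates(io.StringIO(CSV_TEXT))
print(coordinates["vendor_a"]["invoice_date"])
```

Keying the table by template first means that, once the template of a new invoice is known, all of its label boxes are available with a single lookup.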
Image Preprocessing of Invoices
Since the input is an image of an invoice, preprocessing is a very important step that helps us get better results. For this, we used skew correction, binarisation, noise filtering, and contour detection.
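    import cv2
    import numpy as np
    import matplotlib.pyplot as plt
    from skimage import io
    from skimage.transform import rotate
    from skimage.color import rgb2gray
    from deskew import determine_skew

    # _img is the path to the invoice image

    # binarisation (adaptiveThreshold expects a single-channel 8-bit image)
    img = cv2.imread(_img, cv2.IMREAD_GRAYSCALE)
    res = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                cv2.THRESH_BINARY, 11, 2)
    plt.figure(figsize=(100, 60))
    plt.imshow(res, 'gray')
    plt.show()

    # noise filtering (the Colored variant expects a 3-channel colour image)
    color = cv2.imread(_img)
    denoised = cv2.fastNlMeansDenoisingColored(color, None, 10, 10, 7, 21)

    # skew correction
    image = io.imread(_img)
    grayscale = rgb2gray(image)
    angle = determine_skew(grayscale)
    rotated = rotate(image, angle, resize=True) * 255
    rotated = rotated.astype(np.uint8)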
Contour detection is done because the invoice occupies a different region in each image and has to be located first. We find the largest contour in the image, crop the image to it, and display the result. The cv2.findContours() function finds the contours, cv2.contourArea() identifies the contour with the largest area, and the image is then cropped to that contour.
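    # thresh is the binarised invoice image; width and height are the
    # desired size of the cropped output
    contours, hierarchy = cv2.findContours(thresh, cv2.RETR_TREE, cv2.CHAIN_APPROX_NONE)

    # Find the biggest contour
    areas = [cv2.contourArea(c) for c in contours]
    max_index = np.argmax(areas)

    # Approximate the biggest contour with a polygon
    epsilon = 0.1 * cv2.arcLength(contours[max_index], True)
    approx = cv2.approxPolyDP(contours[max_index], epsilon, True)

    # Crop the image by mapping the polygon's corners onto an upright rectangle.
    # This assumes approx has exactly four corners, ordered to match points.
    points1 = np.float32(approx).reshape(4, 2)
    points = np.float32([[0, 0], [width, 0], [width, height], [0, height]])
    # The transform matrix must be computed before it is used for warping
    matrix = cv2.getPerspectiveTransform(points1, points)
    result = cv2.warpPerspective(img, matrix, (width, height))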
Extracting Coordinates of Labels of Different Invoice Templates
Next, the MultiOcr model is built with EasyOCR as the detection model and PaddleOCR as the recognition model. It is used to get the coordinates of the labels for each invoice template.
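    import easyocr
    from paddleocr import PaddleOCR

    reader = easyocr.Reader(['en'])  # detection model
    ocr = PaddleOCR(lang='en')       # recognition model

    # detection
    def detect_text_blocks(img_path):
        # width_ths and mag_ratio are the tuned detection parameters
        detection_result = reader.detect(img_path, width_ths=0.7, mag_ratio=1.5)
        text_coordinates = detection_result
        return text_coordinates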
The MultiOcr model finds the coordinates of the label names in the template labels dataset for each template invoice and stores them in a table (csv file). Because the number of items on an invoice can vary, the starting and ending coordinates of the table of invoice items in the invoice image are also recorded, so that the number of items on the invoice can be predicted.
When the size of the table of items changes, the position of labels such as the “total amount” changes too, because the total amount comes after the table of invoice items on any invoice. To solve this problem, a relative positioning method can be used to locate the total amount. This is done by storing the coordinates of the strings around the total amount label. It works because the value (or name) of such a string does not change between different invoices that come from the same template.
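The relative positioning idea can be illustrated with a minimal sketch. It assumes that the OCR output is available as a list of (text, (x, y)) pairs and that the offset from a fixed neighbouring string to the total amount was measured once on the template invoice. The anchor string "Grand Total", the offset, and all coordinates are hypothetical.

```python
def locate_by_anchor(ocr_words, anchor_text, offset):
    """Estimate a label position from a fixed string printed near it.

    ocr_words   -- list of (text, (x, y)) pairs, e.g. from an OCR detector
    anchor_text -- string that is always printed near the target label
    offset      -- (dx, dy) from the anchor to the label, measured once
                   on the template invoice
    Returns the estimated (x, y) of the label, or None if the anchor
    string is not found.
    """
    for text, (x, y) in ocr_words:
        if text.strip().lower() == anchor_text.lower():
            return (x + offset[0], y + offset[1])
    return None


# Hypothetical template measurement: the total amount sits 300 px to the
# right of the string "Grand Total", on the same line.
OFFSET = (300, 0)

# Two invoices from the same template with item tables of different
# length, so "Grand Total" appears at different heights.
short_invoice = [("Invoice No", (40, 60)), ("Grand Total", (40, 500))]
long_invoice = [("Invoice No", (40, 60)), ("Grand Total", (40, 820))]

print(locate_by_anchor(short_invoice, "Grand Total", OFFSET))
print(locate_by_anchor(long_invoice, "Grand Total", OFFSET))
```

Because the estimate is tied to the anchor string rather than to an absolute position, it follows the total amount as the item table grows or shrinks.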
Finding the Template of any Given New Invoice
- To detect the labels of a new invoice, we need to know the template of the invoice. The purpose of the document similarity method is to predict the invoice template.
- As the name suggests, document similarity tells you how similar two documents are. It is measured through the document distance, which can be computed with the cosine similarity method.
- With this method, we can obtain the template of the invoice whose labels are to be predicted.
- For example:
The document similarity method is used on these three images. Image1 and image2 are from the same vendor, and image3 is from a different vendor. The document similarity results are shown below:
- Image1 – Image2 : The distance is 1.000072 (radians)
- Image1 – Image3 : The distance is 1.408562 (radians)
From the document similarity results, we can see that the distance between image1 and image2 is smaller than the distance between image1 and image3. A smaller distance means more similar documents, which matches the fact that image1 and image2 are from the same vendor.
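A document distance in radians, as reported above, can be obtained as the angle between the word-count vectors of two texts. The sketch below is one possible formulation under stated assumptions: the invoice text has already been extracted by OCR, words are split on whitespace, and the sample texts and template names are hypothetical. It is not necessarily the exact procedure used to produce the numbers above.

```python
import math
from collections import Counter


def word_counts(text):
    """Count lowercase, whitespace-separated words in a document."""
    return Counter(text.lower().split())


def document_distance(text_a, text_b):
    """Angle in radians between the word-count vectors of two texts.

    A value near 0 means very similar word distributions;
    pi/2 means the texts share no words.
    """
    a, b = word_counts(text_a), word_counts(text_b)
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    if norm == 0:
        raise ValueError("both documents must contain at least one word")
    # Clamp to guard against rounding slightly above 1.0.
    return math.acos(min(1.0, dot / norm))


def closest_template(new_text, template_texts):
    """Return the name of the template whose text is nearest to new_text."""
    return min(template_texts,
               key=lambda name: document_distance(new_text, template_texts[name]))


# Hypothetical OCR text for two templates and one new invoice.
templates = {
    "vendor_a": "acme traders invoice number date total amount",
    "vendor_b": "globex supplies bill reference due balance",
}
new_invoice = "acme traders invoice number date items total amount"
print(closest_template(new_invoice, templates))
```

The angle is the arccosine of the cosine similarity, so ranking templates by smallest angle is equivalent to ranking them by largest cosine similarity.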
Label Detection of the New Invoice Using Template’s Label Coordinates
Once the template of the new invoice is known, its label coordinates are read from the table (csv file) and used to identify the labels in the invoice image.
Example: When an image of an invoice is given as input, the pipeline first identifies the invoice’s template. The coordinates of that template’s labels are then read from the table (csv file), and the labels on the invoice image are identified with these coordinates.
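One way to apply the stored coordinates is sketched below. It assumes the recognition step returns a list of (box, text) pairs with boxes given as (x_min, y_min, x_max, y_max), and it assigns a text block to a label when the centre of the block falls inside the label's stored box. The matching rule, the label names, and the sample values are illustrative assumptions rather than the article's exact method.

```python
def box_center(box):
    """Return the centre point of an (x_min, y_min, x_max, y_max) box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2, (y_min + y_max) / 2)


def read_labels(ocr_results, label_boxes):
    """Assign recognised text to labels using stored template coordinates.

    ocr_results -- list of ((x_min, y_min, x_max, y_max), text) pairs
    label_boxes -- {label: (x_min, y_min, x_max, y_max)} for the template
    A text block belongs to a label when its centre lies inside the
    label's stored box; matching blocks are joined from left to right.
    """
    values = {}
    for label, (lx0, ly0, lx1, ly1) in label_boxes.items():
        hits = []
        for box, text in ocr_results:
            cx, cy = box_center(box)
            if lx0 <= cx <= lx1 and ly0 <= cy <= ly1:
                hits.append((box[0], text))
        values[label] = " ".join(text for _, text in sorted(hits))
    return values


# Hypothetical stored coordinates for one template.
label_boxes = {
    "invoice_number": (420, 60, 580, 90),
    "invoice_date": (420, 95, 580, 125),
}
# Hypothetical recognition output for a new invoice of that template.
ocr_results = [
    ((430, 65, 520, 85), "INV-1042"),
    ((475, 98, 540, 120), "Jan 2023"),
    ((430, 98, 470, 120), "12"),
    ((40, 60, 200, 90), "ACME Traders"),
]
labels = read_labels(ocr_results, label_boxes)
print(labels)
```

Matching on the block centre tolerates small shifts between scans of the same template, since a block only needs to land mostly inside the stored region.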
Methods to Improve Performance
- During preprocessing, different thresholding methods, such as global thresholding, adaptive mean thresholding, and adaptive Gaussian thresholding, can be tried to get a cleaner image of the invoice.
- For detection and recognition, the MultiOcr model can use several OCR models, such as PyTesseract, PPOCR, EasyOCR, MMOCR, and Keras-OCR. The OCR model that gives the best results is chosen as the final model.
- In the MultiOcr model’s detection step, hyperparameter tuning is done on two parameters: width_ths, which sets the maximum horizontal distance between two bounding boxes to be merged, and mag_ratio, which scales the image up or down by the given factor.
- Several document similarity methods, such as cosine similarity and Euclidean distance, can be compared to improve the results when predicting the template.
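Choosing the OCR model that gives the best results can be framed as scoring each candidate on a small set of labelled crops. The sketch below uses exact-match accuracy as the score, which is an assumption, and the two candidate models are hypothetical lookup tables standing in for wrappers around real OCR libraries.

```python
def exact_match_accuracy(recognise, samples):
    """Fraction of samples for which recognise(image) equals the true text."""
    correct = sum(1 for image, truth in samples if recognise(image) == truth)
    return correct / len(samples)


def pick_best_model(models, samples):
    """Return (name, accuracy) of the best-scoring model on the samples."""
    scores = {name: exact_match_accuracy(fn, samples)
              for name, fn in models.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# Hypothetical labelled crops: (image identifier, ground-truth text).
samples = [
    ("crop_1", "INV-1042"),
    ("crop_2", "12 Jan 2023"),
    ("crop_3", "1,250.00"),
]

# Stand-ins for real OCR wrappers: each maps an image identifier to text.
model_a_output = {"crop_1": "INV-1042", "crop_2": "12 Jan 2023", "crop_3": "1,250.00"}
model_b_output = {"crop_1": "INV-1O42", "crop_2": "12 Jan 2023", "crop_3": "1,250.00"}
models = {"model_a": model_a_output.get, "model_b": model_b_output.get}

best_name, best_score = pick_best_model(models, samples)
print(best_name, best_score)
```

The same loop could score detection and recognition models separately, which fits the MultiOcr idea of pairing the best detector with the best recogniser.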
In conclusion, this work proposes an algorithm for label detection on invoices using the MultiOcr model. It detects the positions of the labels for each template as well as the labels on any new invoice that belongs to one of the given templates. For this, we used EasyOCR as the detection model and PaddleOCR as the recognition model, and this combination gave us better results.
Key takeaways of this article
- Contour detection reaches 85% accuracy, and the MultiOcr model, which combines EasyOCR and PaddleOCR, achieves approximately 95% accuracy.
- The cosine similarity approach determines document similarity with 82.8% precision. False positives may arise if two documents share a large number of terms.
- We have discussed image preprocessing steps, label detection from bills using their coordinates, and invoice template detection.
- We have gone through some basic code snippets and concluded the article with an example.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.