Optical Character Recognition using Pytesseract
This article was published as a part of the Data Science Blogathon
In this blog, we will use computer vision techniques to extract text from images. After a first extraction, we will apply some basic OpenCV functions to the image to enhance it and get more accurate results. This project is useful because it saves the time and effort of retyping text from an image.
- This application can save time for large organizations that need to fetch text from images.
- It can open up the world of “paperless documentation”, which also helps with storage.
- It can also help with automation, as the text can be fetched directly from the images.
We will import the requests library to fetch files and images from URLs.
# Import requests to download the Tesseract data file
import requests
Note: You can also download the Tesseract data file by opening the URL passed to requests.get() below in a browser; the code here is just another way to fetch it.
# Downloading tesseract-ocr file
r = requests.get("https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00/ind.traineddata", stream=True)
Writing data to file to avoid path issues
with open("ind.traineddata", "wb") as file:
    for block in r.iter_content(chunk_size=1024):
        if block:
            file.write(block)
The code above downloads a Tesseract trained-data file, which Tesseract needs for recognition, and saves it at the path given to the open() function.
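The chunked write shown above can be wrapped in a small helper. This is only a sketch: `save_stream` is a hypothetical name, and it accepts any iterable of byte blocks so that it can be tried without network access. With requests, you would pass `r.iter_content(chunk_size=1024)` to it.

```python
import os
import tempfile


def save_stream(blocks, path):
    """Write an iterable of byte blocks to path and return the bytes written."""
    written = 0
    with open(path, "wb") as file:
        for block in blocks:
            if block:  # skip empty chunks
                file.write(block)
                written += len(block)
    return written


# Stand-in for r.iter_content(chunk_size=1024); these blocks are made up.
demo_blocks = [b"abc", b"", b"defg"]
demo_path = os.path.join(tempfile.gettempdir(), "demo.traineddata")
print(save_stream(demo_blocks, demo_path))
```

Returning the byte count makes it easy to notice an empty or truncated download before Tesseract tries to load the file.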
!pip install pytesseract
This command will install the Pytesseract module if you want to install it in a notebook.
Requirement already satisfied: pytesseract in c:\programdata\anaconda3\lib\site-packages (0.3.8)
Requirement already satisfied: Pillow in c:\programdata\anaconda3\lib\site-packages (from pytesseract) (8.0.1)
In this step, we will install the libraries required for OCR, and we will also import an IPython function to clear undesired output.
Installing libraries required for optical character recognition
! apt install tesseract-ocr libtesseract-dev libmagickwand-dev
Importing IPython to clear output which is not important
from IPython.display import HTML, clear_output
clear_output()
Now we will install the Pytesseract and OpenCV libraries, which are the core of our text recognition.
Installing the Pytesseract and OpenCV
! pip install pytesseract wand opencv-python
clear_output()
Importing required libraries
# Import libraries
from PIL import Image
import pytesseract
import cv2
import numpy as np
from pytesseract import Output
import re
In this step, we will open an image, resize it, save it for further use, and display it.
Reading image from URL
image = Image.open(requests.get('https://i.stack.imgur.com/pbIdS.png', stream=True).raw)
image = image.resize((300,150))
image.save('sample.png')
image
Setting the path for tesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
Note: The command above tells Pytesseract where the Tesseract executable is located. If this path does not match your system, Pytesseract will throw an error even when Tesseract is installed.
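If you are unsure where Tesseract is installed, you can look for it before setting the path. The sketch below is one possible approach, not part of Pytesseract: `find_tesseract` and `WINDOWS_DEFAULT` are hypothetical names, and the fallback path is simply the install location used in this article.

```python
import os
import shutil

# Install location used in this article; adjust it for your machine.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(which=shutil.which, exists=os.path.isfile):
    """Return a path to the tesseract executable, or None if it is not found."""
    found = which("tesseract")  # search the system PATH first
    if found:
        return found
    if exists(WINDOWS_DEFAULT):  # fall back to the usual Windows location
        return WINDOWS_DEFAULT
    return None


cmd = find_tesseract()
print(cmd if cmd else "tesseract not found; install it or set the path manually")
```

When a path is returned, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` exactly as in the line above.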
Here we will be extracting the text from the image with custom configuration.
# Simply extracting text from image
custom_config = r'-l eng --oem 3 --psm 6'
text = pytesseract.image_to_string(image, config=custom_config)
print(text)
In the custom configuration, “eng” selects the English language, so Tesseract will recognize English letters; you can also specify multiple languages. “PSM” is the page segmentation mode, which controls how Tesseract splits the page into blocks of text before recognizing the characters (6 assumes a single uniform block of text). “OEM” is the OCR engine mode; 3 is the default, which uses whichever engine is available.
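To make the parts of the configuration string easier to vary, it can be assembled by a small helper. This is a sketch only: `build_config` is a hypothetical function, and the `hin` language code in the example assumes that the corresponding trained-data file is installed. Tesseract expects multiple languages to be joined with `+`.

```python
def build_config(languages, oem=3, psm=6):
    """Build a Tesseract config string such as '-l eng --oem 3 --psm 6'."""
    if not languages:
        raise ValueError("at least one language code is required")
    lang = "+".join(languages)  # multiple languages are joined with '+'
    return f"-l {lang} --oem {oem} --psm {psm}"


print(build_config(["eng"]))                # -l eng --oem 3 --psm 6
print(build_config(["eng", "hin"], psm=7))  # -l eng+hin --oem 3 --psm 7
```

The returned string can be passed as the `config` argument of `pytesseract.image_to_string()`.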
Now we will remove unwanted symbols from the text we extracted by replacing the symbol with an empty string.
# Extracting text from image and removing irrelevant symbols from characters
try:
    text = pytesseract.image_to_string(image, lang="eng")
    characters_to_remove = "!()@—*“>+-/,'|£#%$&^_~"
    new_string = text
    for character in characters_to_remove:
        new_string = new_string.replace(character, "")
    print(new_string)
except IOError as e:
    print("Error (%s)." % e)
In the cell below, we read the image into OpenCV format so that we can process it further. This is required when we need to extract text from complex images.
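The same cleanup can also be written with `str.translate`, which removes all of the listed characters in a single pass. This is an alternative sketch rather than the method used above; `strip_symbols` is a hypothetical helper name, and the sample string is made up.

```python
CHARACTERS_TO_REMOVE = "!()@—*“>+-/,'|£#%$&^_~"

# With three arguments, str.maketrans maps every character in the third
# argument to None, so translate() deletes those characters.
REMOVAL_TABLE = str.maketrans("", "", CHARACTERS_TO_REMOVE)


def strip_symbols(text):
    """Return text with the unwanted symbols removed."""
    return text.translate(REMOVAL_TABLE)


print(strip_symbols("Hel|lo, wor#ld!"))  # Hello world
```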
image = cv2.imread('sample.png') # will read in the array format
Next we convert the image to grayscale so that it is simpler to process: each pixel then holds a single intensity value from 0 to 255 instead of three color channels. We use the cv2.cvtColor() method for this; OpenCV provides more than 150 color-space conversion codes for it.
def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

gray = get_grayscale(image)
Image.fromarray(gray)
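To see what the grayscale conversion computes for one pixel, here is a pure-Python sketch using the standard luma weights (0.299 for red, 0.587 for green, 0.114 for blue). `bgr_to_gray` is a hypothetical helper, and OpenCV's own rounding may differ slightly from this approximation.

```python
def bgr_to_gray(b, g, r):
    """Approximate the gray level of one BGR pixel with the luma weights."""
    return int(round(0.114 * b + 0.587 * g + 0.299 * r))


print(bgr_to_gray(255, 255, 255))  # white stays at full brightness
print(bgr_to_gray(255, 0, 0))      # pure blue becomes a dark gray
```

Green contributes the most and blue the least, which is why a pure blue pixel ends up much darker than a pure green one.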
Now we will blur the image to remove noise. Blurring smooths an image by applying a smoothing filter and is one of the most widely used techniques in image processing. Here we use the cv2.medianBlur() function, which replaces each pixel with the median of its neighborhood.
def remove_noise(image):
    return cv2.medianBlur(image, 5)

noise = remove_noise(gray)
Image.fromarray(noise)
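The idea behind a median blur can be sketched in plain Python on a tiny image stored as nested lists. This is a simplified illustration, not OpenCV's implementation: `median_filter_3x3` is a hypothetical function, it uses a 3x3 window instead of the 5x5 window above, and it leaves border pixels unchanged.

```python
from statistics import median


def median_filter_3x3(img):
    """Replace each interior pixel with the median of its 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # border pixels are copied unchanged
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = [img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
            out[y][x] = int(median(window))
    return out


noisy = [
    [10, 10, 10],
    [10, 255, 10],  # a single bright noise pixel
    [10, 10, 10],
]
print(median_filter_3x3(noisy)[1][1])  # 10
```

Because the median ignores extreme values, an isolated bright or dark pixel is replaced by the value of its surroundings.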
We will perform a threshold transformation here. Thresholding works on a simple concept: with cv2.THRESH_BINARY, every pixel whose value is greater than the threshold becomes white (255) and every other pixel becomes black (0). Adding cv2.THRESH_OTSU lets OpenCV choose the threshold automatically from the image histogram, so the 0 passed as the threshold is ignored. The function used is cv2.threshold().
def thresholding(image):
    # The input must be a grayscale image; cv2.threshold returns (threshold, image)
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

thresh = thresholding(gray)
Image.fromarray(thresh)
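For readers curious about how Otsu's method picks a threshold, here is a pure-Python sketch that works on a flat list of gray levels. It chooses the threshold that maximises the between-class variance. `otsu_threshold` and `binarize` are hypothetical helpers, the pixel values are made up, and the exact threshold OpenCV reports may differ from this simplified version.

```python
def otsu_threshold(pixels):
    """Return the threshold (0-255) that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * hist[i] for i in range(256))
    best_t, best_var = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += hist[t]           # pixels with value <= t
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg  # pixels with value > t
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        var_between = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Pixels above t become white (255); all others become black (0)."""
    return [255 if p > t else 0 for p in pixels]


pixels = [20, 22, 25, 200, 210, 205]  # dark text pixels and light background
t = otsu_threshold(pixels)
print(binarize(pixels, t))
```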
Here we apply an erode transformation. Erosion is one of the most basic morphological operations: it slides a kernel over the image and replaces each pixel with the minimum value under the kernel, which shrinks bright regions and grows dark ones. For dark text on a light background this thickens the strokes, which can help in recognizing characters that are thin, slightly blurred, or distorted. We use the erode() function from the cv2 library.
def erode(image):
    kernel = np.ones((5,5), np.uint8)
    return cv2.erode(image, kernel, iterations=1)

# Use a different name for the result so the erode() function is not overwritten
eroded = erode(gray)
Image.fromarray(eroded)
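Erosion itself is easy to sketch in plain Python: each output pixel is the minimum of its neighbourhood. The version below is an illustration only. `erode_3x3` is a hypothetical function that uses a 3x3 window on nested lists and simply ignores positions outside the image.

```python
def erode_3x3(img):
    """Grayscale erosion: each pixel becomes the minimum of its 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = min(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            )
    return out


# A 3x3 white square on a black background (made-up data).
square = [
    [0, 0, 0, 0, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 255, 255, 255, 0],
    [0, 0, 0, 0, 0],
]
for row in erode_3x3(square):
    print(row)
```

Only the centre pixel of the square survives, which shows how erosion shrinks bright regions by one pixel on every side.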
Here we perform a morphological opening. Opening is an erosion followed by a dilation with the same kernel; it removes small bright specks while largely preserving the shape of larger regions. It is best suited to binary images, such as the output of thresholding.
def opening(image):
    kernel = np.ones((5,5), np.uint8)
    return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)

# Use a different name for the result so the opening() function is not overwritten
opened = opening(gray)
Image.fromarray(opened)
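A minimal pure-Python sketch of opening, assuming a 3x3 window on nested lists: erosion takes the neighbourhood minimum, dilation takes the neighbourhood maximum, and opening applies them in that order. `_filter_3x3` and `opening_3x3` are hypothetical names, and the image is made up.

```python
def _filter_3x3(img, pick):
    """Apply pick (min or max) over each pixel's 3x3 neighbourhood."""
    h, w = len(img), len(img[0])
    return [
        [
            pick(
                img[ny][nx]
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def opening_3x3(img):
    """Opening: erosion (minimum) followed by dilation (maximum)."""
    eroded = _filter_3x3(img, min)
    return _filter_3x3(eroded, max)


# A 3x3 white square plus one isolated white speck at row 2, column 5.
speckled = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 255, 255, 255, 0, 0, 0],
    [0, 255, 255, 255, 0, 255, 0],
    [0, 255, 255, 255, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
for row in opening_3x3(speckled):
    print(row)
```

The square comes back at its original size, while the isolated speck is gone because the erosion step removed it completely.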
Here we try template matching, a method for searching for and finding the location of a template image within a larger image. Because we pass the same image as both the image and the template, we get a similarity of 99.99%. For template matching, we use the matchTemplate() function from the cv2 library.
def match_template(image, template):
    return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)

match = match_template(gray, gray)
match
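When the image and the template have the same size, the match result contains a single score. The sketch below shows the idea behind a normalised correlation coefficient on two flat lists of pixel values. `ccoeff_normed` is a hypothetical helper, not OpenCV code, and the pixel values are made up.

```python
from math import sqrt


def ccoeff_normed(image, template):
    """Normalised correlation coefficient of two equally sized pixel lists."""
    if len(image) != len(template):
        raise ValueError("inputs must have the same number of pixels")
    mean_i = sum(image) / len(image)
    mean_t = sum(template) / len(template)
    di = [p - mean_i for p in image]      # deviations from the image mean
    dt = [p - mean_t for p in template]   # deviations from the template mean
    denom = sqrt(sum(a * a for a in di) * sum(b * b for b in dt))
    if denom == 0:
        return 0.0  # flat patch: the score is undefined, so 0 is reported here
    return sum(a * b for a, b in zip(di, dt)) / denom


patch = [10, 50, 200, 90]
inverted = [255 - p for p in patch]
print(ccoeff_normed(patch, patch))     # identical patches score close to 1
print(ccoeff_normed(patch, inverted))  # an inverted patch scores close to -1
```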
Now we will segregate every character in the text by creating a rectangle around it.
# Drawing rectangle around text
img = cv2.imread('sample.png')
h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img)
for b in boxes.splitlines():
    # Each line holds: character, left, bottom, right, top, page
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)

Image.fromarray(img)
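Each line returned by `image_to_boxes` holds a character followed by its left, bottom, right, and top coordinates and a page number, with y measured from the bottom of the image. OpenCV measures y from the top, which is why the y values are subtracted from the image height. The sketch below isolates that conversion; `box_to_rectangle` is a hypothetical helper and the sample line is made up.

```python
def box_to_rectangle(line, image_height):
    """Convert one image_to_boxes line into two OpenCV corner points."""
    char, left, bottom, right, top, _page = line.split(" ")
    # Tesseract measures y from the bottom edge; OpenCV measures it from the top.
    pt1 = (int(left), image_height - int(bottom))
    pt2 = (int(right), image_height - int(top))
    return char, pt1, pt2


sample_line = "T 12 100 30 130 0"  # made-up values for illustration
print(box_to_rectangle(sample_line, image_height=150))
# ('T', (12, 50), (30, 20))
```

The two points can be passed directly to `cv2.rectangle()` as opposite corners.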
Finally, we can draw rectangles around a specific pattern or word.
# Drawing pattern on specific pattern or word
img = cv2.imread('sample.png')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
keys = list(d.keys())
date_pattern = 'artificially'
n_boxes = len(d['text'])
for i in range(n_boxes):
    if float(d['conf'][i]) > 60:
        if re.match(date_pattern, d['text'][i]):
            (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

Image.fromarray(img)
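The filtering logic can be separated from the drawing so that it is easy to try on its own. The sketch below assumes a dictionary with the same keys that the loop above reads (`text`, `conf`, `left`, `top`, `width`, `height`). `find_word_boxes` is a hypothetical helper and the sample data is made up.

```python
import re


def find_word_boxes(data, pattern, min_conf=60):
    """Return (x, y, w, h) for confident words whose text matches pattern."""
    boxes = []
    for i, word in enumerate(data["text"]):
        if float(data["conf"][i]) > min_conf and re.match(pattern, word):
            boxes.append(
                (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
            )
    return boxes


# Made-up data in the shape of an image_to_data dictionary.
sample = {
    "text": ["", "artificially", "made", "artificially"],
    "conf": ["-1", "96", "91", "40"],
    "left": [0, 10, 120, 10],
    "top": [0, 20, 20, 60],
    "width": [300, 100, 40, 100],
    "height": [150, 18, 18, 18],
}
print(find_word_boxes(sample, r"artificially"))  # [(10, 20, 100, 18)]
```

The second occurrence of the word is skipped because its confidence of 40 is below the threshold of 60.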
We started with learning how to install tesseract which is used for text extraction. Next, we took an image and extracted the text from that image. We learned that we need to use certain image transformation functions of OpenCV in order to extract text from complex images.
Thank you for reading my article 🙂
I hope you guys will like this step-by-step learning of Optical character recognition using Pytesseract. Here’s the repo link.
Here you can access my other articles which are published on Analytics Vidhya as a part of the Blogathon (link)
If you have any queries, you can connect with me on LinkedIn; refer to this link.
Greetings to everyone. I’m currently working at TCS, and previously I worked as a Data Science Associate Analyst at Zorba Consulting India. Along with my full-time work, I have an immense interest in Data Science and in other areas of Artificial Intelligence such as Computer Vision, Machine Learning, and Deep Learning. Feel free to collaborate with me on any project in the above-mentioned domains (LinkedIn).