Optical Character Recognition using Pytesseract

Aman Preet 31 Dec, 2021
7 min read

This article was published as a part of the Data Science Blogathon

Overview

In this blog, we will use computer vision techniques to extract text from images. We will first extract the text directly, and then apply some basic OpenCV functions to the image to clean it up so that the recognition gives more accurate results. A project like this is useful because it saves the time and effort of retyping text from an image.

Optical Character Recognition image
Image Source: Sansan blog

Scope

  • This application could save time for large organizations that need to fetch text from images.
  • It can open up the world of “paperless documentation”, which also helps with storage.
  • It can also help with automation, as the text is fetched directly from the images.

We will import the requests library, which we use to fetch the Tesseract data file and the sample image from their URLs.

# import requests to download the tesseract data file and the sample image
import requests

Note: You can also download the Tesseract data file manually by opening the link that is passed as a parameter to the function below; the code here simply shows another way to download it.

# Downloading tesseract-ocr file
r = requests.get("https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00/ind.traineddata", stream = True)

Writing data to file to avoid path issues

with open("ind.traineddata", "wb") as file:
    for block in r.iter_content(chunk_size=1024):
        if block:
            file.write(block)

Tesseract, the engine behind the Pytesseract library, needs trained data files to run. The code above downloads one and saves it at the path given in the open() function.
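The chunked write pattern can be tried without network access. The sketch below is only an illustration: save_stream is a hypothetical helper name, and the list of byte chunks stands in for what r.iter_content() would yield.

```python
import os
import tempfile

def save_stream(chunks, path):
    """Write an iterable of byte chunks to path, skipping empty chunks.

    Returns the number of bytes written.
    """
    written = 0
    with open(path, "wb") as file:
        for block in chunks:
            if block:  # skip empty keep-alive chunks
                written += file.write(block)
    return written

# Stand-in for r.iter_content(chunk_size=1024); the file name is made up.
fake_chunks = [b"abc", b"", b"def"]
target = os.path.join(tempfile.mkdtemp(), "demo.traineddata")
print(save_stream(fake_chunks, target))

# With a real response one could call r.raise_for_status() first, so that an
# HTTP error page is not silently saved as the data file, and then:
# save_stream(r.iter_content(chunk_size=1024), "ind.traineddata")
```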

!pip install pytesseract

This command installs the Pytesseract module from within a notebook.

Requirement already satisfied: pytesseract in c:\programdata\anaconda3\lib\site-packages (0.3.8)
Requirement already satisfied: Pillow in c:\programdata\anaconda3\lib\site-packages (from pytesseract) (8.0.1)

In this step, we will install the system libraries required for OCR, and we will also import an IPython function to clear output that we do not need.

Installing libraries required for optical character recognition

! apt install tesseract-ocr libtesseract-dev libmagickwand-dev

Importing IPython to clear output which is not important

from IPython.display import HTML, clear_output
clear_output()

Now we will install the Pytesseract and OpenCV libraries, which are the core of our text recognition.

Installing the Pytesseract and OpenCV

! pip install pytesseract wand opencv-python
clear_output()

Importing required libraries

# Import libraries
from PIL import Image
import pytesseract
import cv2
import numpy as np
from pytesseract import Output
import re

In this step, we will open an image, resize it, save it for further use, and display it.

Reading image from URL

image = Image.open(requests.get('https://i.stack.imgur.com/pbIdS.png', stream=True).raw)
image = image.resize((300,150))
image.save('sample.png')
image

Output:

[Image: the resized sample image]
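Note that image.resize((300,150)) forces the image to that exact size whatever its original proportions were. If stretched text turns out to hurt recognition, one option is to compute a target size that keeps the aspect ratio. This is a minimal sketch; scaled_size is a hypothetical helper, and the 600x400 input is a made-up example size, not the size of the sample image.

```python
def scaled_size(width, height, target_width):
    """Return (target_width, new_height) with the original aspect ratio kept."""
    scale = target_width / width
    return target_width, max(1, round(height * scale))

# Made-up original size, for illustration only.
print(scaled_size(600, 400, 300))  # (300, 200)
```

The resulting tuple could then be passed to image.resize() in place of the fixed (300,150).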

Setting the path for tesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

Note: The above line tells pytesseract where the Tesseract executable is installed. If this path does not match your system, pytesseract will throw an error even though Tesseract is installed, so adjust it to your own installation.
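Instead of hard-coding the path, one could first check whether a tesseract executable is already on the system PATH and only then fall back to known locations. This is a sketch under assumptions: find_tesseract is a hypothetical helper, and the fallback path is simply the Windows location used in this article.

```python
import shutil
from pathlib import Path

def find_tesseract(extra_candidates=()):
    """Return a path to a tesseract executable, or None if none is found."""
    on_path = shutil.which("tesseract")  # looks the command up on PATH
    if on_path:
        return on_path
    for candidate in extra_candidates:
        if Path(candidate).is_file():
            return str(candidate)
    return None

tesseract_path = find_tesseract([r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
print(tesseract_path or "tesseract not found; install it or pass the correct path")
```

If a path is found, it can be assigned to pytesseract.pytesseract.tesseract_cmd as shown above.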

Here we will be extracting the text from the image with custom configuration.

# Simply extracting text from image
custom_config = r'-l eng --oem 3 --psm 6' 
text = pytesseract.image_to_string(image,config=custom_config)
print(text)

Output:

[Image: the text extracted from the sample image]

In the custom configuration, “-l eng” selects the English language, i.e. Tesseract will recognize English text; you can also specify multiple languages. “--psm” stands for page segmentation mode, which controls how Tesseract splits the page into blocks of text before recognizing characters; the value 6 used here assumes a single uniform block of text. “--oem” stands for OCR engine mode, and the value 3 selects the default engine based on what is available.
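To make the three options easier to vary, the configuration string can be assembled from its parts. This is a small sketch; build_config is a hypothetical helper, and "hin" is only an example of a second language code, which would need its own trained data file installed. Tesseract accepts several languages joined with "+".

```python
def build_config(languages, oem=3, psm=6):
    """Build a Tesseract config string such as '-l eng --oem 3 --psm 6'."""
    lang_arg = "+".join(languages)  # e.g. ["eng", "hin"] becomes "eng+hin"
    return f"-l {lang_arg} --oem {oem} --psm {psm}"

print(build_config(["eng"]))          # -l eng --oem 3 --psm 6
print(build_config(["eng", "hin"]))   # -l eng+hin --oem 3 --psm 6
```

The returned string can be passed as the config argument of pytesseract.image_to_string().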

Now we will remove unwanted symbols from the text we extracted by replacing the symbol with an empty string.

# Extracting text from image and removing irrelevant symbols from characters
try:
    text=pytesseract.image_to_string(image,lang="eng")
    characters_to_remove = "!()@—*“>+-/,'|£#%$&^_~"
    new_string = text
    for character in characters_to_remove:
        new_string = new_string.replace(character, "")
    print(new_string)
except IOError as e:
    print("Error (%s)." % e)

Output:

[Image: the extracted text after removing the unwanted symbols]
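The same clean-up can be done in a single pass with a translation table instead of one replace() call per character. This is an alternative sketch, not the method used above; the character set is an ASCII subset of the one in the loop, and the sample string is made up to stand in for OCR output.

```python
# ASCII subset of the symbols removed above (an assumption for this sketch).
characters_to_remove = "!()@*>+-/,'|#%$&^_~"

# str.maketrans with a third argument maps each listed character to None.
removal_table = str.maketrans("", "", characters_to_remove)

def strip_symbols(text):
    """Remove every character listed in characters_to_remove."""
    return text.translate(removal_table)

sample = "Hel|lo, (wor*ld)!"  # made-up stand-in for OCR output
print(strip_symbols(sample))  # Hello world
```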

In the cell below, we read the image in OpenCV format so that we can process it further. This is required when we need to extract text from complex images, which is what the OpenCV operations that follow are for.

image = cv2.imread('sample.png') # will read in the array format

Output:

[Image: the sample image as read by OpenCV]

We convert the image to grayscale so that it becomes less complex to process: each pixel then holds a single intensity value from 0 to 255 instead of three color values. Here we use the cv2.cvtColor() method to convert the colored image into grayscale; cv2.cvtColor() supports more than 150 color-space conversions.

Grayscale image

def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = get_grayscale(image)
Image.fromarray(gray)

Output:

[Image: the grayscale version of the sample image]
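To see what the grayscale conversion computes, here is a NumPy-only sketch of the standard weighted sum of the three channels (0.299 R + 0.587 G + 0.114 B). The function name bgr_to_gray is hypothetical, and rounding details may differ slightly from OpenCV's own implementation.

```python
import numpy as np

def bgr_to_gray(image_bgr):
    """Weighted sum of the B, G, R channels, rounded to uint8."""
    b = image_bgr[..., 0].astype(np.float64)
    g = image_bgr[..., 1].astype(np.float64)
    r = image_bgr[..., 2].astype(np.float64)
    gray = 0.114 * b + 0.587 * g + 0.299 * r
    return np.round(gray).astype(np.uint8)

# One row of three BGR pixels: white, black, pure blue.
pixels = np.array([[[255, 255, 255], [0, 0, 0], [255, 0, 0]]], dtype=np.uint8)
print(bgr_to_gray(pixels))  # [[255   0  29]]
```

Each output pixel is a single number, which is why the grayscale image is simpler to process than the three-channel original.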

Now we will blur the image to remove noise from it. Blurring is a technique that smooths an image by applying a smoothing filter, and it is one of the most widely used methods in image processing. Here we use the cv2.medianBlur() function, which replaces each pixel with the median of its neighborhood.

Noise removal

def remove_noise(image):
    return cv2.medianBlur(image,5)
noise = remove_noise(gray)
Image.fromarray(noise)

Output:

[Image: the sample image after noise removal]
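The idea behind a median blur can be shown on a tiny array. The following is a NumPy-only sketch, not OpenCV's implementation: median_filter is a hypothetical name, a 3x3 window is used instead of the 5x5 one above, and the borders are handled by repeating the edge pixels, which is an assumption of this sketch.

```python
import numpy as np

def median_filter(gray, ksize=3):
    """Replace each pixel with the median of its ksize x ksize neighborhood."""
    pad = ksize // 2
    padded = np.pad(gray, pad, mode="edge")  # repeat edge pixels at the border
    h, w = gray.shape
    # One shifted copy of the image per position inside the window.
    windows = np.stack([padded[dy:dy + h, dx:dx + w]
                        for dy in range(ksize) for dx in range(ksize)])
    return np.median(windows, axis=0).astype(gray.dtype)

noisy = np.full((5, 5), 200, dtype=np.uint8)
noisy[2, 2] = 0  # a single dark "pepper" pixel
cleaned = median_filter(noisy)
print(cleaned)
```

Because the isolated dark pixel is never the median of any window, it disappears while the rest of the image is unchanged.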

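We will perform a threshold transformation here. Thresholding works on a simple concept: with cv2.THRESH_BINARY, whenever a pixel value is greater than the threshold it is set to the maximum value (white, 255 here), otherwise it is set to 0 (black). Because we also pass cv2.THRESH_OTSU, the threshold itself is chosen automatically from the image histogram, and the 0 we pass as the threshold is ignored. The function used is cv2.threshold().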

Thresholding

def thresholding(image):
    # arguments: grayscale image, threshold (ignored when Otsu is used), max value, threshold type
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]
thresh = thresholding(gray)
Image.fromarray(thresh)

Output:

[Image: the sample image after thresholding]
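For intuition about how an Otsu threshold is picked, here is a NumPy-only sketch that tries every possible threshold and keeps the one that best separates dark and bright pixels (maximum between-class variance). otsu_threshold is a hypothetical name, the test image is made up, and the exact value returned may differ from the one OpenCV reports when several thresholds are equally good.

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold t that maximizes the between-class variance."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    weight_dark = np.cumsum(prob)            # share of pixels with value <= t
    weight_bright = 1.0 - weight_dark        # share of pixels with value > t
    cum_mean = np.cumsum(prob * levels)
    total_mean = cum_mean[-1]
    valid = (weight_dark > 1e-12) & (weight_bright > 1e-12)
    between = np.zeros(256)
    between[valid] = ((total_mean * weight_dark[valid] - cum_mean[valid]) ** 2
                      / (weight_dark[valid] * weight_bright[valid]))
    return int(np.argmax(between))

# Made-up image: left half dark (50), right half bright (200).
gray_demo = np.full((4, 4), 50, dtype=np.uint8)
gray_demo[:, 2:] = 200

t = otsu_threshold(gray_demo)
binary = np.where(gray_demo > t, 255, 0).astype(np.uint8)
print(t)
print(binary)
```

Pixels above the chosen threshold become white (255) and the rest become black (0), matching the behavior of THRESH_BINARY described above.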

Here we apply an erode transformation, one of the most basic and important morphological operations in image processing. Erosion replaces each pixel with the minimum value found under the kernel, so bright regions shrink and dark regions grow. For dark text on a light background this thickens the character strokes, which can fill small gaps and help recognition when the text is faint or slightly distorted. Here we use the erode() function from the cv2 library.

Erosion

def erode(image):
    kernel = np.ones((5,5),np.uint8)
    return cv2.erode(image, kernel, iterations = 1)
eroded = erode(gray)  # separate name so that the erode() function is not overwritten
Image.fromarray(eroded)

Output:

[Image: the sample image after erosion]
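The effect of erosion on dark text can be checked on a tiny array. This is a NumPy-only sketch in which each pixel becomes the minimum of its neighborhood; erode_min is a hypothetical name, and a 3x3 window is used here rather than the 5x5 kernel above.

```python
import numpy as np

def erode_min(gray, ksize=3):
    """Replace each pixel with the minimum of its ksize x ksize neighborhood."""
    pad = ksize // 2
    padded = np.pad(gray, pad, mode="edge")
    h, w = gray.shape
    shifted = [padded[dy:dy + h, dx:dx + w]
               for dy in range(ksize) for dx in range(ksize)]
    return np.min(np.stack(shifted), axis=0)

# White page with a one-pixel-wide dark vertical stroke in column 3.
page = np.full((5, 7), 255, dtype=np.uint8)
page[:, 3] = 0

thick = erode_min(page)
print((thick == 0).sum(axis=1))  # dark pixels per row after erosion
```

The stroke grows from one to three pixels wide, which illustrates why erosion makes dark characters on a light background look bolder.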

Here we will perform a morphological transformation called opening. Opening is an erosion followed by a dilation with the same kernel: the erosion removes bright details smaller than the kernel, and the dilation restores the size of the regions that survive. It is best suited to binary images, such as the output of the thresholding step, but it also works on grayscale images like the one used here.

Morphology

def opening(image):
    kernel = np.ones((5,5),np.uint8)
    return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)
opened = opening(gray)  # separate name so that the opening() function is not overwritten
Image.fromarray(opened)

Output:

[Image: the sample image after the opening transformation]
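Opening can be reproduced on a small binary array as a neighborhood minimum followed by a neighborhood maximum. The following NumPy-only sketch uses hypothetical names (neighbourhood, opening_sketch), a 3x3 window, and a made-up test image with one large bright block and one isolated bright pixel.

```python
import numpy as np

def neighbourhood(gray, ksize):
    """Stack the ksize*ksize shifted copies of the image (edges repeated)."""
    pad = ksize // 2
    padded = np.pad(gray, pad, mode="edge")
    h, w = gray.shape
    return np.stack([padded[dy:dy + h, dx:dx + w]
                     for dy in range(ksize) for dx in range(ksize)])

def opening_sketch(gray, ksize=3):
    """Erosion (local minimum) followed by dilation (local maximum)."""
    eroded = neighbourhood(gray, ksize).min(axis=0)
    return neighbourhood(eroded, ksize).max(axis=0)

binary_demo = np.zeros((7, 7), dtype=np.uint8)
binary_demo[1:5, 1:5] = 255  # 4x4 bright block that should survive
binary_demo[6, 6] = 255      # isolated bright speck that should vanish

opened_demo = opening_sketch(binary_demo)
print(opened_demo)
```

The block comes back at its original size while the single-pixel speck is removed, which is the typical use of opening as a clean-up step.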

Here we try template matching, a method for searching for and locating a template image inside a larger image. Because we pass the same image as both the image and the template, we get a single normalized score of 1.0, i.e. a perfect match. For template matching, we use the matchTemplate() function from the cv2 library.

Template matching

def match_template(image, template):
    return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)
match = match_template(gray, gray)
match

Output:

array([[1.]], dtype=float32)
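The score of 1 follows from how a normalized correlation coefficient is computed: both inputs have their mean subtracted, are multiplied pixel by pixel, and the sum is divided by the product of their magnitudes. This is a NumPy-only sketch for the special case where image and template have the same size; ccoeff_normed is a hypothetical name and the patch is made-up data.

```python
import numpy as np

def ccoeff_normed(image, template):
    """Normalized correlation coefficient for same-sized image and template.

    Undefined (division by zero) if either input is completely flat.
    """
    img = image.astype(np.float64) - image.mean()
    tpl = template.astype(np.float64) - template.mean()
    denom = np.sqrt((img ** 2).sum() * (tpl ** 2).sum())
    return float((img * tpl).sum() / denom)

patch = (np.arange(64).reshape(8, 8) * 3).astype(np.uint8)  # made-up image
inverted = 255 - patch

print(ccoeff_normed(patch, patch))     # identical inputs
print(ccoeff_normed(patch, inverted))  # exact opposite
```

Identical inputs give the maximum score and an inverted copy gives the minimum, so the value always lies between -1 and 1.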

Now we will mark every character in the text by drawing a rectangle around it.

# Drawing rectangle around text
img = cv2.imread('sample.png')
h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img) 
for b in boxes.splitlines():
    b = b.split(' ')
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)
Image.fromarray(img)

Output:

[Image: the sample image with a rectangle drawn around each character]
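The h - int(...) terms in the loop are there because image_to_boxes() reports each character as "char left bottom right top page" with the origin at the bottom-left of the image, while OpenCV draws with the origin at the top-left. The sketch below isolates that conversion; parse_box_line is a hypothetical helper and the sample line is made up.

```python
def parse_box_line(line, image_height):
    """Convert one image_to_boxes() line to top-left-origin corner points."""
    char, left, bottom, right, top = line.split(" ")[:5]
    left, bottom, right, top = int(left), int(bottom), int(right), int(top)
    top_left = (left, image_height - top)         # flip the y axis
    bottom_right = (right, image_height - bottom)
    return char, top_left, bottom_right

# Made-up box line for a character "A" in a 100-pixel-high image.
print(parse_box_line("A 10 20 30 40 0", image_height=100))
# ('A', (10, 60), (30, 80))
```

The two corner points can be passed directly to cv2.rectangle().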

Finally, we can draw rectangles around a specific pattern or word.

# Drawing rectangle around a specific pattern or word
img = cv2.imread('sample.png')
d = pytesseract.image_to_data(img, output_type=Output.DICT)

word_pattern = 'artificially'

n_boxes = len(d['text'])
for i in range(n_boxes):
    # keep only words recognized with a confidence above 60
    if float(d['conf'][i]) > 60:
        if re.match(word_pattern, d['text'][i]):
            (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
Image.fromarray(img)

Output:

[Image: the sample image with a rectangle drawn around the matched word]
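The filtering logic can be tested without running Tesseract by using a small hand-written dictionary with the same keys as the one returned by image_to_data(). Everything in mock_data is made up for illustration, and matching_boxes is a hypothetical helper. Note one design difference: this sketch uses fullmatch(), which requires the whole word to match, whereas re.match() only anchors at the start of the word.

```python
import re

# Made-up stand-in for pytesseract.image_to_data(img, output_type=Output.DICT).
mock_data = {
    "text":   ["", "artificially", "intelligent", "artificially"],
    "conf":   ["-1", "96", "91", "42"],
    "left":   [0, 10, 120, 10],
    "top":    [0, 5, 5, 40],
    "width":  [300, 100, 95, 100],
    "height": [150, 20, 20, 20],
}

def matching_boxes(data, pattern, min_conf=60):
    """Return (x, y, w, h) for confident words that fully match the pattern."""
    compiled = re.compile(pattern)
    boxes = []
    for i, word in enumerate(data["text"]):
        # conf values may arrive as strings or numbers, hence float()
        if float(data["conf"][i]) > min_conf and compiled.fullmatch(word):
            boxes.append((data["left"][i], data["top"][i],
                          data["width"][i], data["height"][i]))
    return boxes

print(matching_boxes(mock_data, "artificially"))  # [(10, 5, 100, 20)]
```

Only the first occurrence is returned, because the second one has a confidence of 42 and falls below the cut-off.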

Conclusion

We started by learning how to install Tesseract, which is used for text extraction. Next, we took an image and extracted the text from it. We also learned that certain image transformation functions of OpenCV are needed in order to extract text from complex images.

End Notes

Thank you for reading my article 🙂

I hope you guys will like this step-by-step learning of Optical character recognition using Pytesseract. Here’s the repo link.

Here you can access my other articles which are published on Analytics Vidhya as a part of the Blogathon (link)

If you have any queries, you can connect with me on LinkedIn; refer to this link.

About me

Greetings to everyone. I’m currently working at TCS, and previously I worked as a Data Science Associate Analyst at Zorba Consulting India. Along with my full-time work, I have an immense interest in Data Science and in other subsets of Artificial Intelligence such as Computer Vision, Machine Learning, and Deep Learning. Feel free to collaborate with me on any project in the above-mentioned domains (LinkedIn).

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.
