Harika Bonthu — June 24, 2021

This article was published as a part of the Data Science Blogathon

 

Intro

We humans read text almost every minute of our lives. Wouldn’t it be great if our machines could read text the way we do? The bigger question is “How do we make our machines read?” This is where Optical Character Recognition (OCR) comes into the picture.

Optical Character Recognition (OCR)

Optical Character Recognition (OCR) is a technique for reading text from printed or scanned photos and handwritten images and converting it into a digital format that is editable and searchable.

Applications

OCR has plenty of applications in today’s business. A few of them are listed below:

  • Passport recognition in Airports
  • Automation of Data Entry
  • License plates recognition
  • Extracting business card information into a contact list
  • Converting handwritten documents into electronic text
  • Creating searchable PDFs
  • Creating audible files (text to audio)

Some of the open-source OCR tools are Tesseract and OCRopus.

In this article, we will focus on Tesseract OCR. To read the images, we need OpenCV.

Installation of Tesseract OCR:

Download the latest installer for Windows 10 from “https://github.com/UB-Mannheim/tesseract/wiki”. Run the .exe file once it has downloaded.

Note: Don’t forget to copy the installation path. We will need it later, because the path of the tesseract executable must be set in the code if the installation directory is different from the default.

The typical installation path on Windows systems is C:\Program Files.

So, in my case, it is “C:\Program Files\Tesseract-OCR\tesseract.exe”.

Next, to install the Python wrapper for Tesseract, open the command prompt and execute the command “pip install pytesseract”.

OpenCV

OpenCV (Open Source Computer Vision) is an open-source library for computer vision, machine learning, and image processing applications.

OpenCV-Python is the Python API for OpenCV.

To install it, open the command prompt and execute the command “pip install opencv-python”.

 

Build sample OCR Script

1. Reading a sample Image

import cv2

Read the image using cv2.imread() method and store it in a variable “img”.

img = cv2.imread("image.jpg")

If needed, resize the image using cv2.resize() method

img = cv2.resize(img, (400, 400))

Display the image using cv2.imshow() method

cv2.imshow("Image", img)

Wait indefinitely for a key press so that the window stays open (cv2.imshow() needs this call to actually render the window)

cv2.waitKey(0)

Close all open windows

cv2.destroyAllWindows()

2. Converting Image to String

import pytesseract

Set the tesseract path in the code

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

If we do not set the path and the tesseract executable is not on the system PATH, pytesseract raises a TesseractNotFoundError.

[Image: the error shown when the path is not set]
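
One way to avoid a missing-path error is to resolve the executable location before handing it to pytesseract. The sketch below is only an illustration using the standard library: the function name find_tesseract and the fallback path are assumptions (the fallback is the install location used in this article), so adjust them to your own setup.

```python
import os
import shutil

# Assumed default install location on Windows; change it to match your machine.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallbacks=(DEFAULT_WINDOWS_PATH,)):
    """Return a path to the tesseract executable, or None if it is not found."""
    # shutil.which searches the directories listed in the PATH variable.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise try the explicit fallback locations in order.
    for candidate in fallbacks:
        if os.path.isfile(candidate):
            return candidate
    return None


tesseract_path = find_tesseract()
if tesseract_path is None:
    print("Tesseract not found; install it or pass its location explicitly.")
else:
    print("Using Tesseract at:", tesseract_path)
    # With pytesseract imported, the result can then be assigned:
    # pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

Checking the PATH first keeps the script portable, since the hard-coded path is only consulted when the lookup fails.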

To convert an image to string use pytesseract.image_to_string(img) and store it in a variable “text”

text = pytesseract.image_to_string(img)

Print the result

print(text)

Complete code: 

import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
img = cv2.imread("image.jpg")
img = cv2.resize(img, (400, 450))
cv2.imshow("Image", img)
text = pytesseract.image_to_string(img)
print(text)
cv2.waitKey(0)
cv2.destroyAllWindows()

The output of the above code:

[Image: the text extracted by the script]

If we observe the output, the main quote is extracted perfectly, but the philosopher’s name and the text at the very bottom of the image are not obtained.

To extract the text accurately, we need to preprocess the image before passing it to Tesseract. I found this article (https://towardsdatascience.com/pre-processing-in-ocr-fc231c6035a7) quite helpful; refer to it for a better understanding of preprocessing techniques.
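
As a small illustration of what preprocessing can look like, here is a sketch of binarization with Otsu's threshold, which turns a grayscale image into pure black and white. This step is my own choice of example, not necessarily one the linked article recommends, and the function names otsu_threshold and binarize are hypothetical. It is written in plain Python on nested lists so that the idea is visible without extra dependencies.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for an iterable of 0-255 grayscale values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w_bg = 0    # number of pixels at or below the candidate threshold
    sum_bg = 0  # sum of the intensities of those pixels
    for t in range(256):
        w_bg += hist[t]
        sum_bg += t * hist[t]
        w_fg = total - w_bg
        if w_bg == 0:
            continue
        if w_fg == 0:
            break
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        # Between-class variance: large when the two groups are well separated.
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_t, best_var = t, between
    return best_t


def binarize(gray, threshold):
    """Map every pixel above the threshold to 255 and the rest to 0."""
    return [[255 if p > threshold else 0 for p in row] for row in gray]


# A tiny made-up grayscale "image": dark values on the left, bright on the right.
gray = [[10, 12, 200],
        [11, 210, 205]]
t = otsu_threshold(p for row in gray for p in row)
print(binarize(gray, t))
```

In practice, OpenCV offers the same operation directly, for example cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU) on an image converted with cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).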

Perfect! Now that we have the basics, let us look at some simple applications of OCR.

 

1. Building word clouds on Review images

A word cloud is a visual representation of word frequency. The bigger a word appears in the word cloud, the more often it is used in the text.
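
To make the idea of word frequency concrete, here is a minimal sketch that counts words with the standard library. The sample sentence is made up for illustration; a word cloud library performs a similar count internally before sizing the words.

```python
from collections import Counter

# Made-up sample text; in this article the text comes from review images.
sample_text = "good display good sound good software and solid hardware"

# Count how often each whitespace-separated word occurs.
frequencies = Counter(sample_text.split())

# The most frequent words would be drawn largest in a word cloud.
for word, count in frequencies.most_common(3):
    print(word, count)
```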

For this, I took some snapshots of reviews from Amazon for the product Apple iPad 8th Generation.

[Image: sample review image]

Steps: 

  1. Create a list of all the available review images
  2. If needed view the images using cv2.imshow() method
  3. Read text from images using pytesseract
  4. Create a data frame
  5. Preprocess the text – remove special characters, stop words
  6. Build positive, negative word clouds

Step 1: Create a list of all the available review images

import os
folderPath = "Reviews"
myRevList = os.listdir(folderPath)
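
Note that os.listdir() returns every entry in the folder, including non-image files and subfolders. The following sketch shows one possible way to keep only image files; the helper name list_image_files and the set of extensions are assumptions, not part of the original script.

```python
from pathlib import Path

# Assumed set of image extensions; extend it if your screenshots use other formats.
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def list_image_files(folder):
    """Return the sorted names of files in `folder` that look like images."""
    return sorted(
        entry.name
        for entry in Path(folder).iterdir()
        if entry.is_file() and entry.suffix.lower() in IMAGE_EXTENSIONS
    )
```

With this helper, the list could be built as myRevList = list_image_files(folderPath). Sorting the names also makes the order of the reviews reproducible between runs.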

Step 2: If needed view the images using cv2.imshow() method

for image in myRevList:
    img = cv2.imread(f'{folderPath}/{image}')
    cv2.imshow("Image", img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()

Step 3: Read text from images using pytesseract

import cv2
import pytesseract
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
corpus = []
for images in myRevList:
    img = cv2.imread(f'{folderPath}/{images}')
    if img is None:
        corpus.append("Could not read the image.")
    else:
        rev = pytesseract.image_to_string(img)
        corpus.append(rev)
corpus

[Image: the output of the above code, the list of text extracted from the review images]

Step 4: Create a data frame

import pandas as pd
data = pd.DataFrame(list(corpus), columns=['Review'])
data
[Image: the resulting data frame]

Step 5: Preprocess the text – remove special characters, stopwords

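# removing special characters
import re
def clean(text):
    # replace every run of characters other than letters, digits, and spaces with a single space
    return re.sub(r'[^A-Za-z0-9 ]+', ' ', text)
data['Cleaned Review'] = data['Review'].apply(clean)
data

[Image: the data frame with the added “Cleaned Review” column]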

Removing stopwords from the ‘Cleaned Review’ and appending all the remaining words to a list variable “final_list”.

# removing stopwords
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize
nltk.download("punkt")      # tokenizer models used by word_tokenize
nltk.download("stopwords")  # stop word lists used by stopwords.words
# newer NLTK releases may also ask for nltk.download("punkt_tab")
stop_words = set(stopwords.words('english'))

final_list = []
# tokenize every cleaned review and keep the words that are not stop words
for review in data['Cleaned Review']:
    for word in word_tokenize(review):
        if word.lower() not in stop_words:
            final_list.append(word)

Step 6: Build positive, negative word clouds

Install the wordcloud library using the command “pip install wordcloud”.

For the English language, there are predefined lists of positive and negative words called opinion lexicons. These files can be downloaded from the link or directly from my GitHub repo.

Once the files are downloaded, read them in the code and create a list of positive words and a list of negative words.

with open(r"opinion-lexicon-English\positive-words.txt", "r") as pos:
  poswords = pos.read().split("\n")
with open(r"opinion-lexicon-English\negative-words.txt", "r") as neg:
  negwords = neg.read().split("\n")
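
Reading the lexicon files can be made a little more robust. The sketch below is only a suggestion: the function name load_lexicon is hypothetical, and it assumes that header lines in the lexicon files start with a semicolon and that the files are encoded in Latin-1, which is how these lexicons are commonly distributed. Check your own copies before relying on either assumption.

```python
def load_lexicon(path, encoding="latin-1"):
    """Read one word per line into a set, skipping blanks and ';' comment lines."""
    words = set()
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            word = line.strip()
            # Assumption: lines starting with ';' are header comments, not words.
            if word and not word.startswith(";"):
                words.add(word)
    return words
```

A call such as poswords = load_lexicon(r"opinion-lexicon-English\positive-words.txt") would then replace the read-and-split step. Using a set instead of a list also speeds up the membership checks of the form "w in poswords" that follow.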

Importing libraries to generate and show word clouds.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

Positive Word Cloud

# Keep only the words that are present in poswords
pos_in_pos = " ".join([w for w in final_list if w in poswords])
wordcloud_pos = WordCloud(
                      background_color='black',
                      width=1800,
                      height=1400
                     ).generate(pos_in_pos)
plt.imshow(wordcloud_pos)
plt.axis("off")  # hide the axis ticks around the image
plt.show()

[Image: positive word cloud]

The word “good”, being the most used word, catches our attention. If we look back at the reviews, people have written that the iPad has a good display, good sound, and good software and hardware.

Negative Word Cloud

# Keep only the words that are present in negwords
neg_in_neg = " ".join([w for w in final_list if w in negwords])
wordcloud_neg = WordCloud(
                      background_color='black',
                      width=1800,
                      height=1400
                     ).generate(neg_in_neg)
plt.imshow(wordcloud_neg)
plt.axis("off")  # hide the axis ticks around the image
plt.show()

[Image: negative word cloud]

The words expensive, stuck, struck, and disappoint stand out in the negative word cloud. If we look at the context of the word stuck, the review says “Though it has just 3 GB RAM, it never gets stuck”, which is a positive remark about the device.

So, it is a good idea to build bigram/trigram word clouds so that we do not miss the context.
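
A minimal sketch of the bigram idea, using only the standard library, is shown below. The helper name bigram_counts and the sample sentence are made up for illustration; the point is that “gets stuck” is counted together with its neighbor, so “never gets” keeps the negation visible.

```python
from collections import Counter


def bigram_counts(tokens):
    """Count pairs of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))


# Made-up sample sentence modeled on the review quoted above.
tokens = "it never gets stuck and it never gets slow".lower().split()
counts = bigram_counts(tokens)

# Join each pair with an underscore so that it is treated as a single token.
bigram_frequencies = {"_".join(pair): n for pair, n in counts.items()}
print(bigram_frequencies)
```

A dictionary like bigram_frequencies can be passed to WordCloud().generate_from_frequencies() to draw a bigram word cloud. For trigrams, the same pattern works with zip(tokens, tokens[1:], tokens[2:]).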

 

2. Create audible files (Text to Audio)

gTTS (Google Text-to-Speech) is a Python library that interfaces with Google Translate’s text-to-speech API.

To install, execute the command “pip install gtts” in the command prompt.

Import necessary libraries

import cv2
import pytesseract
from gtts import gTTS
import os

Set the tesseract path

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

Read the image using cv2.imread() and grab the text from the image using pytesseract and store it in a variable.

rev = cv2.imread("Reviews/15.PNG")  # a forward slash avoids "\15" being read as an escape sequence
# display the image using cv2.imshow() method
# cv2.imshow("Image", rev)
# cv2.waitKey(0)
# cv2.destroyAllWindows()
# grab the text from image using pytesseract
txt = pytesseract.image_to_string(rev)
print(txt)

Set the language and convert the text to audio using gTTS by passing the text and the language

language = 'en'

outObj = gTTS(text=txt, lang=language, slow=False)
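
OCR output often contains stray line breaks, repeated spaces, and words hyphenated across lines, which can make the generated speech sound choppy. The sketch below shows one possible cleanup step before the text is passed to gTTS; the function name prepare_for_speech is hypothetical and the rules are deliberately simple.

```python
import re


def prepare_for_speech(ocr_text):
    """Flatten OCR output into a single line of text suitable for speech."""
    # Rejoin words that were hyphenated across a line break.
    text = re.sub(r"-\s*\n\s*", "", ocr_text)
    # Collapse newlines, tabs, form feeds, and repeated spaces into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


cleaned = prepare_for_speech("Good dis-\nplay,\n\ngreat   sound\n")
print(cleaned)
```

If the cleaned text is empty, there is nothing to speak, so it is worth checking for that before creating the audio. Be aware that the hyphen rule also joins genuine compound words that happen to break at the end of a line.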

Save the audio file as “rev.mp3”

outObj.save("rev.mp3")

Play the audio file

os.system('rev.mp3')
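
Running the file name as a command relies on Windows file associations. As a hedged alternative, the sketch below picks a launcher per operating system; the function names are hypothetical, and the choice of xdg-open for everything that is not Windows or macOS is an assumption about a typical Linux desktop.

```python
import os
import platform
import subprocess


def default_open_command(system):
    """Return the launcher command for `system`, or None on Windows."""
    if system == "Darwin":
        return ["open"]
    if system == "Windows":
        return None  # handled by os.startfile in play_file
    return ["xdg-open"]  # assumption: a typical Linux desktop


def play_file(path):
    """Open `path` with the default application of the operating system."""
    command = default_open_command(platform.system())
    if command is None:
        os.startfile(path)  # only available on Windows
    else:
        subprocess.run(command + [path], check=False)
```

After saving the audio, play_file("rev.mp3") would open it in the default media player.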

Complete Code: 

import cv2
import pytesseract
from gtts import gTTS
import os
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
rev = cv2.imread("Reviews/15.PNG")

# cv2.imshow("Image", rev)
# cv2.waitKey(0)
# cv2.destroyAllWindows()

txt = pytesseract.image_to_string(rev)
print(txt)
language = 'en'
outObj = gTTS(text=txt, lang=language, slow=False)
outObj.save("rev.mp3")
print('playing the audio file')
os.system('rev.mp3')

End Notes

By the end of this article, we have understood the concept of Optical Character Recognition (OCR) and are familiar with reading images using OpenCV and grabbing the text from images using pytesseract. We have seen two basic applications of OCR: building word clouds and creating audible files by converting text to speech using gTTS.


I hope this article is informative, and please do let me know if you have any queries or feedback related to this article in the comments section. Happy Learning 😊

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
