Optical Character Recognition (OCR) with Tesseract, OpenCV, and Python
This article was published as a part of the Data Science Blogathon
Intro
We humans read text almost every minute of our lives. Wouldn’t it be great if our machines could read text the way we do? The bigger question is: how do we make our machines read? This is where Optical Character Recognition (OCR) comes into the picture.
Optical Character Recognition (OCR)
Optical Character Recognition (OCR) is a technique for reading text from printed or scanned photos and handwritten images and converting it into a digital format that is editable and searchable.
Applications
OCR has plenty of applications in today’s business. A few of them are listed below:
- Passport recognition in Airports
- Automation of Data Entry
- License plate recognition
- Extracting business card information into a contact list
- Converting handwritten documents into electronic images
- Creating Searchable PDFs
- Creating audible files (text to audio)
Open-source OCR tools include Tesseract and OCRopus.
In this article, we will focus on Tesseract OCR and use OpenCV to read the images.
Installation of Tesseract OCR:
Download the latest installer for Windows 10 from “https://github.com/UB-Mannheim/tesseract/wiki” and run the .exe file once it is downloaded.
Note: Don’t forget to copy the installation path. We will need it later to set the path of the Tesseract executable in the code if the installation directory is different from the default.
The typical installation path on Windows systems is C:\Program Files.
So, in my case, it is “C:\Program Files\Tesseract-OCR\tesseract.exe”.
Next, to install the Python wrapper for Tesseract, open the command prompt and execute the command “pip install pytesseract“.
OpenCV
OpenCV (Open Source Computer Vision) is an open-source library for computer vision, machine learning, and image processing applications.
OpenCV-Python is the Python API for OpenCV.
To install it, open the command prompt and execute the command “pip install opencv-python“.
Build sample OCR Script
1. Reading a sample Image
import cv2
Read the image using cv2.imread() method and store it in a variable “img”.
img = cv2.imread("image.jpg")
If needed, resize the image using cv2.resize() method
img = cv2.resize(img, (400, 400))
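Resizing to a fixed (400, 400) stretches any image that is not square, which can distort the characters. A minimal sketch of computing a target size that preserves the aspect ratio is shown here; the helper name `scaled_size` and the default width of 400 are illustrative assumptions, not part of the original script.

```python
def scaled_size(width, height, target_width=400):
    """Return (new_width, new_height) that keeps the aspect ratio."""
    if width <= 0 or height <= 0 or target_width <= 0:
        raise ValueError("dimensions must be positive")
    scale = target_width / width
    # cv2.resize expects integer sizes of at least 1 pixel.
    return target_width, max(1, round(height * scale))


# Example: an 800x600 image scaled to a width of 400 pixels.
print(scaled_size(800, 600))  # (400, 300)
```

With OpenCV, `img.shape[:2]` gives `(height, width)`, so the result can be passed as the size argument: `cv2.resize(img, scaled_size(width, height))`.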
Display the image using cv2.imshow() method
cv2.imshow("Image", img)
Keep the window open until a key is pressed (cv2.waitKey(0) waits indefinitely; without it the window closes immediately or stops responding)
cv2.waitKey(0)
Close all open windows
cv2.destroyAllWindows()
2. Converting Image to String
import pytesseract
Set the tesseract path in the code
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
If the path is not set and the Tesseract executable is not on the system PATH, pytesseract raises a TesseractNotFoundError.
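One way to avoid hard-coding the path is to look for the executable first and fall back to a known location. The following is only a sketch: the helper `find_tesseract` is hypothetical, and the default Windows path is the one used in this article, which may differ on your machine.

```python
import os
import shutil

# Default location assumed from this article's installation; adjust it if
# Tesseract was installed elsewhere.
DEFAULT_WINDOWS_PATH = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(extra_candidates=()):
    """Return a path to the tesseract executable, or None if not found."""
    # 1. Prefer whatever is already on the system PATH.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # 2. Fall back to known install locations.
    for candidate in (*extra_candidates, DEFAULT_WINDOWS_PATH):
        if os.path.isfile(candidate):
            return candidate
    return None


path = find_tesseract()
print(path or "Tesseract not found; set tesseract_cmd manually.")
```

If a path is returned, it can be assigned to `pytesseract.pytesseract.tesseract_cmd` before calling any OCR function.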
To convert an image to string use pytesseract.image_to_string(img) and store it in a variable “text”
text = pytesseract.image_to_string(img)
Print the result
print(text)
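`image_to_string` also accepts `lang` and `config` arguments, which pass options through to the Tesseract command line. The sketch below builds a config string for the page segmentation mode (`--psm`) and OCR engine mode (`--oem`); the helper names `tesseract_config` and `read_text` are hypothetical, and the choice of `--psm 6` (treat the image as a single uniform block of text) is an assumption that may or may not suit your images.

```python
def tesseract_config(psm=3, oem=3):
    """Build a config string for pytesseract's `config` argument."""
    if psm not in range(14):  # Tesseract defines page segmentation modes 0-13
        raise ValueError("psm must be between 0 and 13")
    if oem not in range(4):  # OCR engine modes 0-3
        raise ValueError("oem must be between 0 and 3")
    return f"--oem {oem} --psm {psm}"


def read_text(img, psm=6, lang="eng"):
    """Run OCR on an image array with an explicit page segmentation mode."""
    # Imported here so tesseract_config() can be used without pytesseract.
    import pytesseract
    return pytesseract.image_to_string(img, lang=lang, config=tesseract_config(psm))


print(tesseract_config(psm=6))  # --oem 3 --psm 6
```

Trying a different page segmentation mode is often worthwhile when some parts of an image are recognized and others are missed.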
Complete code:
import cv2
import pytesseract

# Point pytesseract at the Tesseract executable (adjust if installed elsewhere)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

img = cv2.imread("image.jpg")
img = cv2.resize(img, (400, 450))
cv2.imshow("Image", img)

text = pytesseract.image_to_string(img)
print(text)

cv2.waitKey(0)
cv2.destroyAllWindows()
The output for the above code:
If we observe the output, the main quote is extracted perfectly, but the philosopher’s name and the text at the very bottom of the image are not obtained.
To extract the text more accurately, we need to preprocess the image before passing it to Tesseract. I found this article (https://towardsdatascience.com/pre-processing-in-ocr-fc231c6035a7) quite helpful. Refer to it for a better understanding of preprocessing techniques.
Perfect! Now that we have the basics in place, let us see some simple applications of OCR.
1. Building word clouds on Review images
A word cloud is a visual representation of word frequency. The larger a word appears in the cloud, the more often it occurs in the text.
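To make the idea of word frequency concrete, here is a small sketch using only the standard library. The sample sentence is made up for illustration.

```python
from collections import Counter

text = "good display good sound good battery but expensive"
frequencies = Counter(text.split())

# The most frequent word is the one drawn largest in a word cloud.
print(frequencies.most_common(1))  # [('good', 3)]
```

A frequency mapping like this is essentially what a word cloud visualizes; the wordcloud library can also build a cloud directly from such counts via `WordCloud.generate_from_frequencies`.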
For this, I took some snapshots of reviews from Amazon for the product Apple iPad 8th Generation.
Sample image
Steps:
- Create a list of all the available review images
- If needed view the images using cv2.imshow() method
- Read text from images using pytesseract
- Create a data frame
- Preprocess the text – remove special characters, stop words
- Build positive, negative word clouds
Step 1: Create a list of all the available review images
import os

folderPath = "Reviews"
myRevList = os.listdir(folderPath)
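Note that `os.listdir` returns every entry in the folder, including non-image files. A possible refinement, sketched below, keeps only files with common image extensions; the helper `list_images` and the extension set are assumptions for illustration.

```python
import os

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".bmp", ".tif", ".tiff"}


def list_images(folder):
    """Return sorted file names in `folder` that look like images."""
    return sorted(
        name for name in os.listdir(folder)
        if os.path.splitext(name)[1].lower() in IMAGE_EXTENSIONS
    )
```

For example, `myRevList = list_images("Reviews")` would skip stray files such as notebook checkpoints or text notes.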
Step 2: If needed view the images using cv2.imshow() method
for image in myRevList:
    img = cv2.imread(f'{folderPath}/{image}')
    cv2.imshow("Image", img)
    cv2.waitKey(0)
    cv2.destroyAllWindows()
Step 3: Read text from images using pytesseract
import cv2
import pytesseract

pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

corpus = []
for images in myRevList:
    img = cv2.imread(f'{folderPath}/{images}')
    if img is None:
        # cv2.imread returns None when the file cannot be read as an image
        corpus.append("Could not read the image.")
    else:
        rev = pytesseract.image_to_string(img)
        corpus.append(rev)

corpus  # in a notebook, this displays the extracted reviews
Step 4: Create a data frame
import pandas as pd

data = pd.DataFrame(list(corpus), columns=['Review'])
data
Step 5: Preprocess the text – remove special characters, stopwords
# removing special characters
import re

def clean(text):
    # replace every run of characters other than letters, digits, and spaces
    return re.sub(r'[^A-Za-z0-9 ]+', ' ', text)

data['Cleaned Review'] = data['Review'].apply(clean)
data
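To see what this kind of cleaning does to OCR output, here is a small standalone sketch. It is a variant rather than the exact function above: the name `clean_and_collapse` is hypothetical, and it additionally collapses repeated whitespace, which the original step does not do.

```python
import re


def clean_and_collapse(text):
    """Replace non-alphanumeric characters with spaces and collapse whitespace."""
    text = re.sub(r"[^A-Za-z0-9 ]+", " ", text)
    return " ".join(text.split())


print(clean_and_collapse("Great display!!!\nWorth it - 10/10 :)"))
# Great display Worth it 10 10
```

Line breaks produced by Tesseract are treated as special characters here, so multi-line reviews end up on a single line.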
Removing stopwords from the ‘Cleaned Review’ and appending all the remaining words to a list variable “final_list”.
# removing stopwords
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

nltk.download("punkt")      # tokenizer data (recent NLTK releases may also need "punkt_tab")
nltk.download("stopwords")  # stop word lists used below

stop_words = set(stopwords.words('english'))
final_list = []
for review in data['Cleaned Review']:
    for word in word_tokenize(review):
        if word.lower() not in stop_words:
            final_list.append(word)
Step 6: Build positive, negative word clouds
Install the wordcloud library using the command “pip install wordcloud”.
Predefined lists of positive and negative English words, called opinion lexicons, are available. These files can be downloaded from the link or directly from my GitHub repo.
Once the files are downloaded, read them in the code and create lists of positive and negative words.
with open(r"opinion-lexicon-English\positive-words.txt", "r") as pos:
    poswords = pos.read().split("\n")
with open(r"opinion-lexicon-English\negative-words.txt", "r") as neg:
    negwords = neg.read().split("\n")
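Lexicon files of this kind often start with comment lines and contain blank lines, which end up in the word lists when the file is split on newlines. The sketch below is a slightly more defensive loader; it assumes that comment lines start with ";" and that the file is Latin-1 encoded, both of which should be checked against the files you actually download. The helper name `load_lexicon` is hypothetical.

```python
def load_lexicon(path, encoding="latin-1"):
    """Read one word per line, skipping blank lines and ';' comment lines."""
    words = set()
    with open(path, "r", encoding=encoding) as handle:
        for line in handle:
            word = line.strip()
            if word and not word.startswith(";"):
                words.add(word)
    return words
```

Returning a set also makes the later `w in poswords` membership checks faster than with a list.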
Importing libraries to generate and show word clouds.
import matplotlib.pyplot as plt
from wordcloud import WordCloud
Positive Word Cloud
# Choosing only the words that are present in poswords
pos_in_pos = " ".join([w for w in final_list if w in poswords])
wordcloud_pos = WordCloud(
    background_color='black',
    width=1800,
    height=1400
).generate(pos_in_pos)
plt.imshow(wordcloud_pos)

The word “good” being the most used word catches our attention. If we look back at the reviews, people have written reviews saying the iPad has a good display, good sound, good software, and hardware.
Negative Word Cloud
# Choosing only the words that are present in negwords
neg_in_neg = " ".join([w for w in final_list if w in negwords])
wordcloud_neg = WordCloud(
    background_color='black',
    width=1800,
    height=1400
).generate(neg_in_neg)
plt.imshow(wordcloud_neg)

The words expensive, stuck, struck, disappoint stood out in the negative word cloud. If we look at the context of the word stuck, it says “Though it has just 3 GB RAM, it never gets stuck” which is a positive thing about the device.
So, it’s good to build bigram/trigram word clouds to not miss out on the context.
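A minimal sketch of generating bigrams and trigrams from a token list is shown here, using only the standard library. The helper name `ngrams` and the sample phrase are illustrative.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """Return a list of n-grams, each joined into a single string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "it never gets stuck".split()
print(ngrams(tokens, 2))  # ['it never', 'never gets', 'gets stuck']
print(ngrams(tokens, 3))  # ['it never gets', 'never gets stuck']

# Counting n-grams gives frequencies that keep some of the context.
bigram_counts = Counter(ngrams(tokens, 2))
```

In the trigram "never gets stuck", the negation stays attached to "stuck", so the phrase no longer looks like a complaint.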
2. Create audible files (Text to Audio)
gTTS (Google Text-to-Speech) is a Python library that interfaces with Google Translate’s text-to-speech API.
To install, execute the command “pip install gtts” in the command prompt.
Import necessary libraries
import cv2
import pytesseract
from gtts import gTTS
import os
Set the tesseract path
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
Read the image using cv2.imread() and grab the text from the image using pytesseract and store it in a variable.
rev = cv2.imread("Reviews/15.PNG")

# display the image using cv2.imshow() method
# cv2.imshow("Image", rev)
# cv2.waitKey(0)
# cv2.destroyAllWindows()

# grab the text from image using pytesseract
txt = pytesseract.image_to_string(rev)
print(txt)
Set the language and convert the text to audio with gTTS by passing it the text and the language
language = 'en'
outObj = gTTS(text=txt, lang=language, slow=False)
Save the audio file as “rev.mp3”
outObj.save("rev.mp3")
Play the audio file (on Windows, this opens it with the default media player)
os.system('rev.mp3')
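Running the file name as a command relies on Windows file associations and will not work on macOS or Linux. A possible cross-platform alternative is sketched below; the helper `open_command` is hypothetical, and it assumes the `open` (macOS) or `xdg-open` (Linux desktop) utilities are available.

```python
import sys


def open_command(path, platform=sys.platform):
    """Return the command that opens `path` with the default application."""
    if platform.startswith("win"):
        # The empty string is the window title expected by `start`.
        return ["cmd", "/c", "start", "", path]
    if platform == "darwin":
        return ["open", path]
    return ["xdg-open", path]  # common on Linux desktops


print(open_command("rev.mp3", platform="darwin"))  # ['open', 'rev.mp3']
```

The returned list can be handed to `subprocess.run(open_command("rev.mp3"))` to start playback.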
Complete Code:
import cv2
import pytesseract
from gtts import gTTS
import os

# Set the tesseract path (adjust if installed elsewhere)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

rev = cv2.imread("Reviews/15.PNG")
# cv2.imshow("Image", rev)
# cv2.waitKey(0)
# cv2.destroyAllWindows()

txt = pytesseract.image_to_string(rev)
print(txt)

language = 'en'
outObj = gTTS(text=txt, lang=language, slow=False)
outObj.save("rev.mp3")

print('playing the audio file')
os.system('rev.mp3')
End Notes
In this article, we covered the concept of Optical Character Recognition (OCR), reading images with OpenCV, and grabbing text from images with pytesseract. We also saw two basic applications of OCR: building word clouds and creating audible files by converting text to speech using gTTS.
References:
- gTTS documentation
- OpenCV documentation
- pytesseract documentation
- Check out the complete Jupyter Notebook from my GitHub repo
I hope this article is informative, and please do let me know if you have any queries or feedback related to this article in the comments section. Happy Learning 😊