Gaurav Sharma — January 19, 2022

This article was published as a part of the Data Science Blogathon.

Introduction

In my previous article, I discussed three Python projects with code and explained them in detail. I also gave you some examples to try. All those projects were beginner-friendly. This time, we will look at some more Python projects with code. The more projects you build, the better you will get at programming and at the language.

Python Projects with Codes

Image Source: https://realpython.com

Let’s get started!

1. Text Extraction using OpenCV and OCR

OpenCV is a library of programming functions used mainly for computer vision tasks. With it, you can process images, resize images, detect objects, and so on. We will see how to extract text in a snap using contours.

Install these:

pip install pytesseract

pip install opencv-python

Python-tesseract (pytesseract) is a Python wrapper for Google’s Tesseract-OCR engine, which is used to get text from images. You will also need the Tesseract executable on your machine for it to run; download it from here.

Now let’s begin with the text extraction step by step:

1. Convert the image to Gray using cv2.COLOR_BGR2GRAY.

cv2.cvtColor(input_image, cv2.COLOR_BGR2GRAY)

2. Finding contours in the image:

To find contours, use cv2.findContours(). It takes three parameters: the source image, the contour retrieval mode, and the contour approximation method. It returns a Python sequence of all the contours, along with their hierarchy. A contour is simply a NumPy array of the (x, y) coordinates of the boundary points of an object.

3. Apply OCR.

Loop through each contour and get its x, y, width, and height using the cv2.boundingRect() function. Then draw a rectangle on the image using cv2.rectangle(). It takes five parameters: the input image, the top-left corner (x, y), the bottom-right corner (x+w, y+h), the boundary colour of the rectangle, and the thickness of the boundary.

4. Crop the rectangular region and pass it to Tesseract to extract the text. Save the text to a file opened in append mode.

For more details, go through the code comments as well.
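
To make the bounding-box step a little more concrete, here is a small pure-Python sketch of the idea behind cv2.boundingRect() and the crop that follows it. The names bounding_rect and crop are made up for this illustration and are not OpenCV functions, the image is a plain nested list instead of a NumPy array, and the inclusive "+1" reflects my understanding of how OpenCV measures integer points, so treat it as an assumption.

```python
def bounding_rect(points):
    """Return (x, y, w, h) of the smallest upright box enclosing the points.

    Illustrative stand-in for cv2.boundingRect(); not an OpenCV function.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x, y = min(xs), min(ys)
    # +1 because the first and the last pixel column/row both belong to the box
    return x, y, max(xs) - x + 1, max(ys) - y + 1


def crop(image, rect):
    """Cut the box out of a row-major image (a list of rows)."""
    x, y, w, h = rect
    # rows are selected with y, columns with x, as in image[y:y + h, x:x + w]
    return [row[x:x + w] for row in image[y:y + h]]


# a contour is just a list of (x, y) boundary points
contour = [(2, 1), (5, 1), (5, 3), (2, 3)]
rect = bounding_rect(contour)  # (2, 1, 4, 3)
```

Note the order when cropping: rows (y) come first and columns (x) second, which is why the real code slices the image as [y:y + h, x:x + w].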

Code:

import cv2
import pytesseract
# path to the Tesseract-OCR executable on your computer
pytesseract.pytesseract.tesseract_cmd = 'path_to_tesseract.exe'
img = cv2.imread("input.png")  # input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # convert the image to grayscale
# perform Otsu thresholding (inverted, so the text becomes white on black)
ret, img_thresh = cv2.threshold(gray, 0, 255, cv2.THRESH_OTSU | cv2.THRESH_BINARY_INV)
# give the structuring element shape and the kernel size;
# the kernel size increases or decreases the area of the rectangle to be detected
rect_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (18, 18))
# dilate the thresholded image so nearby characters merge into text blocks
dilation = cv2.dilate(img_thresh, rect_kernel, iterations=1)
# find the outer contours of the text blocks
img_contours, hierarchy = cv2.findContours(dilation, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
im2 = img.copy()  # copy on which the rectangles are drawn
# create (or empty) the text file that will hold the results
file = open("Output.txt", "w")
file.close()
# loop through each contour
for contour in img_contours:
    x, y, w, h = cv2.boundingRect(contour)
    rect = cv2.rectangle(im2, (x, y), (x + w, y + h), (0, 255, 0), 2)
    # crop the text block from the original image so the drawn border is not included
    cropped_image = img[y:y + h, x:x + w]
    text = pytesseract.image_to_string(cropped_image)  # apply OCR
    # append the recognised text, followed by a newline
    with open("Output.txt", "a") as file:
        file.write(text)
        file.write("\n")

Input image:

Output image:

2. Convert your PDF File to Audio Speech

Say you have a book in PDF form to read, but you feel too lazy to scroll. How good would it be if that PDF were converted to an audiobook? So, let’s implement this using Python.

We will need these two packages:

pyttsx3: It is for Text to Speech, and it will help the machine speak.

PyPDF2: It is a PDF toolkit. It is capable of extracting document information, merging documents, etc.

Install them using these commands:

pip install pyttsx3
pip install PyPDF2

Steps:

  • Import the required modules.
  • Use PdfFileReader() to read the PDF file.
  • Use the getPage() method to select the page to read from.
  • Extract the text using extractText().
  • Using pyttsx3, speak out the text.

Code:

# import the modules
import PyPDF2
import pyttsx3

# open your PDF file in binary mode
pdf_file = open('Book.pdf', 'rb')

# PdfFileReader object
pdfReaderObj = PyPDF2.PdfFileReader(pdf_file)

# the page you want to start with (page numbers are zero-based, so 12 is the 13th page)
from_page = pdfReaderObj.getPage(12)
content = from_page.extractText()
pdf_file.close()  # the text has been extracted, so the file can be closed

# reading the text aloud
speak = pyttsx3.init()
speak.say(content)
speak.runAndWait()

That’s it! It will do the job. This small script is handy when you would rather listen than read.
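
The code above reads a single page. If you want to listen to several pages in a row, the same steps can be wrapped in a small helper. This is only a sketch: collect_text is a name I made up, and it assumes the reader object offers numPages, getPage() and extractText() as PdfFileReader does in the PyPDF2 version used here (newer PyPDF2 releases rename these to PdfReader, pages and extract_text(), so adjust the names if you are on one of those).

```python
def collect_text(reader, start_page, end_page=None):
    """Join the text of pages start_page .. end_page - 1 (zero-based).

    `reader` is assumed to expose `numPages` and `getPage(i)`, and each
    page `extractText()`, like the PdfFileReader object used above.
    """
    if end_page is None:
        end_page = reader.numPages
    # clamp the range so a request beyond the last page does not raise
    start_page = max(0, start_page)
    end_page = min(end_page, reader.numPages)
    parts = []
    for index in range(start_page, end_page):
        text = reader.getPage(index).extractText()
        if text and text.strip():  # skip pages with no extractable text
            parts.append(text.strip())
    return "\n".join(parts)
```

With the objects from the code above, speak.say(collect_text(pdfReaderObj, 12, 15)) followed by speak.runAndWait() would read out the pages with index 12, 13 and 14.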

Next, you can give this project a GUI using Tkinter or anything else. The GUI could let the user enter the PDF path and the page number to start from, and offer a stop button. Try this!

Let’s move to the next project.

3. Reading mails and downloading attachments from the mailbox

Let’s first understand the benefit of reading a mailbox with Python. Suppose we are working on a project where some data arrives daily by mail as a Word or Excel file, and that file is required as input to a script or a machine learning model. Downloading the file every day and feeding it in by hand would be tedious. If we can automate this step, that is, read the mail and download the required attachment, it would be a great help. So, let’s implement this.

We will use pywin32 to automatically download attachments from a particular mail. It can access Windows applications such as Excel, PowerPoint, Word, and Outlook to perform actions in them. We will focus on Outlook and download attachments from the Outlook mailbox.

Note: This does not need authentication such as an email ID or password. It accesses the Outlook account that is already logged in on your machine. (Keep the Outlook app open while running the script.)

We are not using smtplib here because it can only send emails, not download attachments. So, we will go with pywin32 to download attachments from Outlook, and it will be pretty straightforward. Let’s look at the code.

Command to install: pip install pywin32

Import module

import win32com.client

Now, establish a connection to Outlook.

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")

Let’s try to access Inbox:

inbox = outlook.GetDefaultFolder(number)

This function takes an integer as input, which is the index of the folder (here, the Inbox) in our Outlook app.

To check the index of all folders, just run this code snippet:

import win32com.client
outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
for i in range(50):
    try:
        box = outlook.GetDefaultFolder(i)
        name = box.Name
        print(i, name)
    except Exception:
        # this index does not correspond to a folder, so skip it
        pass

Output:

3 Deleted Items
4 Outbox
5 Sent Items
6 Inbox
9 Calendar

As you can see in the output, the index of the Inbox is 6. So we will use 6 in the function.

inbox = outlook.GetDefaultFolder(6)
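
Bare numbers like 6 are easy to forget, so one option is to give the indices printed above readable names. The sketch below does that with an IntEnum; OutlookFolder is a name I made up for this article and is not part of pywin32, and it only lists the folders from the output above.

```python
from enum import IntEnum


class OutlookFolder(IntEnum):
    """Readable names for the folder indices printed above (illustrative helper)."""
    DELETED_ITEMS = 3
    OUTBOX = 4
    SENT_ITEMS = 5
    INBOX = 6
    CALENDAR = 9


# int() turns a member back into the plain number the function expects, e.g.
#   inbox = outlook.GetDefaultFolder(int(OutlookFolder.INBOX))
inbox_index = int(OutlookFolder.INBOX)
```

This does not change what the script does; it only makes the intent of the call easier to read.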

If you want to print the subject of all the emails in the inbox, use this:

messages = inbox.Items
# get the first email
message = messages.GetFirst()
# loop through all the emails in the inbox;
# GetNext() returns None after the last one, which ends the loop
while message is not None:
    try:
        print(message.Subject)  # get the subject of the email
    except Exception:
        # some inbox items may not expose this property, so skip them
        pass
    message = messages.GetNext()

There are other properties as well, such as message.Subject and message.SentOn, which can be used accordingly.

Downloading Attachment

If you want to print the names of all the attachments in a mail:

for attachment in message.Attachments:
    print(attachment.FileName)

Let’s download an attachment (an Excel file with the extension .xlsx) from a specific sender.

import os
import win32com.client

# folder where the attachments will be saved (change this to your own path)
download_folder_path = "path_to_download_folder"

outlook = win32com.client.Dispatch("Outlook.Application").GetNamespace("MAPI")
inbox = outlook.GetDefaultFolder(6)  # 6 is the Inbox
messages = inbox.Items
message = messages.GetFirst()
# GetNext() returns None after the last mail, which ends the loop
while message is not None:
    try:
        # compare in lower case so the match does not depend on capitalisation
        subject = str(message.Subject).lower()
        sender = str(message.Sender).lower()
        if "data report" in subject and "abc prasad" in sender:
            for attachment in message.Attachments:
                attachment_name = str(attachment.FileName)
                # matches both .xlsx and .XLSX
                if attachment_name.lower().endswith(".xlsx"):
                    attachment.SaveAsFile(os.path.join(download_folder_path, attachment_name))
    except Exception:
        # skip items that do not expose these properties
        pass
    message = messages.GetNext()

Explanation

This is the complete code to download an attachment from the Outlook inbox. Inside the try block, you can change the conditions. For example, I am searching for mails whose subject contains “Data Report” and whose sender name is “ABC prasad”. The script iterates from the first mail in the inbox, and when the condition is true, it checks whether that mail has an attachment with the extension .xlsx or .XLSX. You can change all of these things (subject, sender, file type) to download the file you want. Once it finds the file, it saves it to the path given as “download_folder_path”.
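
If you expect to change the subject, sender or file type often, one option is to move the conditions into a small function that can be tried out without Outlook. The sketch below is only an illustration: wanted_attachment and its parameters are names I made up, the default values are the ones from the example above, and the matching is a simple case-insensitive "contains" check.

```python
def wanted_attachment(subject, sender, file_name,
                      subject_text="data report",
                      sender_text="abc prasad",
                      extensions=(".xlsx",)):
    """Return True when a mail and one of its attachments match the filters."""
    # lower-case both sides so capitalisation does not matter
    subject_ok = subject_text.lower() in str(subject).lower()
    sender_ok = sender_text.lower() in str(sender).lower()
    # str.endswith accepts a tuple, so several file types can be allowed at once
    allowed = tuple(ext.lower() for ext in extensions)
    type_ok = str(file_name).lower().endswith(allowed)
    return subject_ok and sender_ok and type_ok
```

Inside the loop, you could then call wanted_attachment(message.Subject, message.Sender, attachment.FileName) for each attachment and save the file only when it returns True.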

End Notes

We discussed three projects in the previous article and three in this one. I hope these Python projects with code helped you polish your skill set. Get hands-on and try them; you will enjoy coding them. I hope you find this article helpful. Let’s connect on LinkedIn.

Thanks for reading 🙂

Happy coding!

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion. 

About the Author

Gaurav Sharma

Love Programming, Blog writing and Poetry
