Master Guide for Excel Automation Using Python

Kaustubh Gupta 12 May, 2023 • 13 min read

Introduction

Excel is the organization’s go-to tool for one or the other tasks in the workflow. It is the most popular choice among stakeholders. Whether accept it or not, I have seen people rejecting Tableau visuals for Excel. Even in conversations with my current organization’s CEO, he acknowledged that Excel is the only tool that will never vanish. There are various workarounds for adding extensible workflows to enhance the productivity of using excel automation. VBAs, macros, and tools like data-snipper are just the starting point. Openpyxl is an open-source library that brings Excel operations to Python. It provides implementation to manipulate existing as well as create new Excel files on the fly.

Although the use case and scenario where this library can be integrated will be far beyond the scope of this article. Here one can expect an implementation guide starting with the basic I/O of files to build up an automation process of generating reports in a predefined template.

Learning Objectives:

Know about Openpyxl
Understanding the basics of Openpyxl
Exploring various Excel functionalities via Python
Limitations of Openpyxl
Applying the Learning in a toy project

This article was published as a part of the Data Science Blogathon.

Introduction
Why Openpyxl and not Pandas?
Introduction to Openpyxl
Load Workbook Function
Different Worksheets
Adding New Worksheets
Accessing and Modifying Multiple Cells for Excel Automation
Creating New Rows and Columns
Adding Excel Formulas
Merging and Unmerging Cells
Adding Filters to the Table
Adding Charts, Pivot Tables
Adding Images to the Worksheets
Styling Worksheet
Adding Comments for Excel Automation
Scenario-Based Mini Excel Automation
Conclusion

Why Openpyxl and not Pandas?

Most people in the data science field use the Pandas library. The library has a “style” attribute for data frames, use them to beautify the Excel files. It comes in handy when the user doesn’t expect to leverage a lot of Excel functionalities. You can read more about styling data frames from this blog post Style your Pandas DataFrame and Make it Stunning.

By sticking to Pandas alone, a lot of use cases can be achieved using Styler. In fact, Pandas offer openpyxl as an engine while exporting data frames to Excel files. Using standalone Openpyxl opens the door to data not stored natively in Pandas data frames. Getting data connected to sources such as MongoDB, Databricks, or CloudWatch can be directly wrangled and dumped to Excel files in a structured way without Pandas as an intermediate. Add Pivot tables, charts, passwords, and whatnot using Openpyxl!

Introduction to Openpyxl

To set up the library in the local environment, one can use a simple pip command, as with any other Python package. Install the library in a separate environment as a best practice to avoid any dependency conflicts.

pip install openpyxl

There are two ways to start working with Openpyxl:

Create a new Excel file locally in the system and then load it into the code
Create a new Excel file in the code and then save it locally.

Let’s start with the first way. Below is the preview of the file which we will refer:

Sample Excel File | Python | Openpyxl | Excel automation — Sample Excel File

The syntax to load a file in openpyxl is very similar to Pandas.

from openpyxl import load_workbook
wb = load_workbook("basic_sample.xlsx",
                   read_only=False,
                   keep_vba=False,
                   data_only=False,
                   keep_links=True)

Load Workbook Function

The load_workbook function from the openpyxl package can take 4 additional arguments along with the file name to load.

read_only : Loads the file in read-only mode. This can be useful in case the file to be handled is quite large and might be a potential bottleneck for operations to be performed.
keep_vba : This option controls whether VBAs should be retained while loading the file. One can’t trigger any VBA while the file is in python but it can be useful in case VBAs are to be used post python scripting work.
data_only : In case only data is to be loaded from an Excel file, this option can be used to load only the data and no associated formulas would be retained.
keep_links : The Excel can be linked to other Excels as well which acts like a data source. In case retaining those connections is important then this option should be set to True.

Different Worksheets

All the option’s default value is set in the above code. Excel file is a collection of different worksheets, it is important to select the right worksheet to access the desired values. This can be achieved in two ways:

Using the active property of the workbook : If one wants to access the values of a worksheet that was last active in Excel, i.e, the worksheet at which the Excel was closed, this property can be used.
Accessing specific worksheet : The workbook object loaded can be used in a similar way a dictionary object values are accessed, i.e, using the worksheet name as a key.

And to access the cells of these worksheets, the dictionary access method would work and combined with the value property will give the actual value of that specific cell. Let’s try to get the value of the “A1” cell using both worksheet access methods.

wb.active['A1'].value
# Kaustubh

wb['Sheet1']['A1'].value
# Kaustubh

To change the values of the cells, a simple assignment operator works. Let’s change the currently accessed name to something else.

wb['Sheet1']['A1'] = "Funky"
wb.save("new_basic.xlsx")

The save function saves the file in its current state. This is the output of the above code.

Output Excel in a cell | Python | Openpyxl | Excel automation — Output Excel when cell “A1” value is changed

Adding New Worksheets

Excel automation often has different worksheets that are helpful in segregating analysis into isolated places so that these can be used as building blocks in cumulative dashboards. These worksheets can contain new data, pivots, charts, and much more. To add new worksheets to the existing workbooks, the create_sheet function can be used.

Let’s create a new sheet named “aggregated”. To verify that the sheet has been created, the workbook’s “sheetnames” property returns all the sheets of a workbook. And to make this change permanent, the workbook needs to be saved using the save function. The code implementation goes as follows:

wb.create_sheet("aggregated")

print(wb.sheetnames)

# Output: ['Sheet1', 'aggregated']

wb.save("new_basic.xlsx")

Accessing and Modifying Multiple Cells for Excel Automation

It’s time to wrangle the tabular data loaded from the sample file and add some additional values to the existing values. We will discuss the two methods of accessing the cell values and achieve multi-cell values access, let’s look at the 3 possible methods. Before discussing them, let’s understand how the cells are located in Excel.

In Excel, the cells are located in “CharacterNumber” formatting and in rows columns fashion. Some examples are “A1”: A column 1st row, “C67”: C column 67th row, “Z1”: Z column 1st row, etc. It means that every cell has a fixed location. These locations are used in every Excel formula, referencing, and whatnot. Let’s take a look at each one of the access methods.

Scratch Method: This method involves writing the logic to fetch cell locations in raw format without using any functions. By adding 65 to the column integer values and then typecasting them as characters, the character will get transform. Value 65 is the base value as Excel columns start from the capital “A”.
Slicing Method: This method directly returns the cell locations in tuple form by slicing the worksheets. It requires the start and end locations.
In-built function Method: Openpyxl provides functions to iterate over the range of cells provided the rows and columns index values. max_col and max_row control how many cells of the worksheet will return.

Use the above methods for looping. In the below code snippet, all 3 methods return the same result:

# Scratch Method
for row in range(1, 10):
    for col in range(0, 6):
        char = chr(col+65)
        print(wb['Sheet1'][char + str(row)].value, end=' ')
    print('\n')
    
    
# Slicing Method
for row in wb['Sheet1']['A1':'F9']:
    for cell in row:
        print(cell.value, end=' ')
    print('\n')
    
    
# In-build Method
for row in wb['Sheet1'].iter_rows(min_row=1, max_col=6, max_row=9):
    for cell in row:
        print(cell.value, end=' ')
    print('\n')

Let’s try to modify the values of the last column to add the prefix “test_”. The code below is implemented using the slicing method:

for row in wb['Sheet1']['A1':'F9']:
    for cell in row:
        if 'College' in str(cell.value):
            cell.value = 'test_' + cell.value
        print(cell.value, end=' ')
    print('\n')

Resulant data with prefix in the last columnn

Creating New Rows and Columns

There are many instances where new rows and columns need to be added to the existing data. For instance, the currently loaded data misses the headers for the rows. There is also a possibility of adding a new column showing Score %. Let’s make these changes.

To add a new row to the worksheet, the insert_rows function can be used. It takes the index location and the number of rows to be inserted as parameters (by default 1 row). This makes everything moves below by the number of rows inserted. Similarly, the insert_cols function is used for inserting columns at the provided index location. The code below inserts two rows from the start of the sheet and one column at the penultimate location :

wb['Sheet1'].insert_rows(0,2)
wb['Sheet1'].insert_cols(6,1)

Now let’s add the header values to the empty rows.

headers = [
    'Name',
    'Mathematics',
    'Computer Organization',
    'Data Structures and Algorithms',
    'Web Development',
    'Score %',
    'College'
]
for row in wb['Sheet1']['A2':'G2']:
    idx = 0
    for cell in row:
        cell.value = headers[idx]
        idx += 1

Similarly, computations on the existing data can be done to add new data. The following code is calculating the Score % for all the rows and appends those scores in the Score % column.

for row in wb['Sheet1']['B3':'F11']:
    score = 0
    for cell in row[:4]:
        score += int(cell.value)
    row[4].value = round((score/400)*100, 2)

The resultant Excel after all the operations applied so far:

Output Excel with Score % column | Python | Openpyxl | Excel automation — Output Excel with Score % column

Adding Excel Formulas

In the above section, the Score % calculation was done in a very ad-hoc manner and this can be clearly avoided by adding Excel formulas. There are more than 450+ Excel formulas that help to find valuable insights from the data. To calculate Score % in the current case, the average function can be used. This is how the above Score % can be calculated using Excel formulas:

for row in wb['Sheet1']['B3':'H11']:
    row[6].value = f'=AVERAGE({row[0].coordinate}:{row[3].coordinate})'

In the above code, the coordinate property of cells will retrieve the locations of the cells. The results match the calculations done via Python.

Merging and Unmerging Cells

Excel allows users to merge cells to form a bigger cell out of existing cells. This makes sure that the sheet formatting is consistent and that bigger cells occupy fixed space. These mergings are useful in case users want to display headers and create subsections and format sheets. Openpyxl offers two straightforward functions for this functionality: merge_cells and unmerge_cells. Both functions take in the cell range as the parameter and are functions of worksheet clasobject. Two checklist items related to these functions:

After the cells are merged, the merged cell properties and value modification can be done by accessing the first location (top left) of the merged cells.
While using the unmerge_cells function, make sure that the cell range is merged else it will throw an exception.

Continuing the Excel file used in previous sections, let’s merge the cells above the subjects rows and display that combined row as “Subjects”.

wb['Sheet1'].merge_cells('B1:E1')
wb['Sheet1']['B1'].value = 'Subjects'

Adding Filters to the Table

Excel filters are one of the most commonly used features for quick summaries on a subset of data rather than working with whole data. With openpyxl, the Excel files can have the filters already added to the data. One caution is that the filters will not filter the data once added. It will just be present, and the user must interact with the filter once so that it is activated. Let’s add a filter for college:

from openpyxl.worksheet.filters import (
    FilterColumn,
    CustomFilter,
    CustomFilters,
    DateGroupItem,
    Filters,
    )

filters = wb['Sheet1'].auto_filter
filters.ref = "A2:G11"
col = FilterColumn(colId=6)
col.filters = Filters(filter=["test_College2"]) 
filters.filterColumn.append(col)

In the above code:

The auto filter property of the worksheet is used to add a reference to the data range.
In order to select a particular filter value from a column, the FilterColumn class is used. Openpyxl offers other advanced filters as well such as CustomFilter, DateGroupItem, DynamicFilter, ColorFilter, IconFilter and Top10ColorFilter, IconFilter andTop10.
For the FilerColumn type, the filter values need to be passed in list format.
Then the filter is appended to the filter property of the worksheet.

Once the Excel is opened and the filter initial interaction is done, the data is filtered on the selected column filter.

Adding Charts, Pivot Tables

Pivots and charts are essentially the first steps toward summarizing data. One drawback of Openpyxl is it does not support the creation of Pivot tables from the code. It allows loading an existing Pivot table and making changes to that. For instance, if the reference data for Pivot tables are updated, the Pivot tables numbers will be updated once the user opens the updated Excel workbook. Openpyxl does support adding various types of charts such as Area, Bar, Line, Scatter, Pie, etc provided the table range and the other required parameters.

Let’s add a bar chart showing all subjects marks by students.

from openpyxl.chart import BarChart, Series, Reference
chart1 = BarChart()
chart1.type = "col"
chart1.style = 10
chart1.title = "Bar Chart"
chart1.y_axis.title = 'Marks'
chart1.x_axis.title = 'Names'

data = Reference(wb['Sheet1'], min_col=2, min_row=2, max_row=11, max_col=5)
cats = Reference(wb['Sheet1'], min_col=1, min_row=3, max_row=11)
chart1.add_data(data, titles_from_data=True)
chart1.set_categories(cats)
chart1.shape = 4

wb['Sheet1'].add_chart(chart1, "B13")

Here is the breakdown of the code:

Importing BarChart from Openpyxl charts.
Configuring the chart parameters such as Chart and Axis titles.
Adding data and categories values using Reference class and attaching this to chart object using add_data and set_categories methods.
Adding the chart to the worksheet at the desired location.

The resultant graph:

Adding Images to the Worksheets

Apart from adding graphs and charts, Excel supports adding graphics in the form of stock images. These can be useful in formatting the dashboards and tailoring them as per end-user needs. One thing to note here is that Openpyxl does not retain the images when an existing workbook is loaded. The images added via code do retain in Excel. To add images, the Image class from Openpyxl drawing can be used. See an example below to add images:

from openpyxl.drawing.image import Image

excel_logo = Image('Demo_Picture.png')
excel_logo.anchor = 'J4'
excel_logo.width = 130
excel_logo.height = 150
wb['Sheet1'].add_image(excel_logo)

The Image class is just a wrapper on the PIL (Pillow) library. Therefore, all attributes that can be configured in PIL are supported. In the above code, the anchor decides the location of the image, width, and height control the dimensions. The result looks like this:

Styling Worksheet

Almost every formatting option is available in Excel. Whether it’s the range of Fonts, formatting such as Bold, italics, filling color patterns, cell alignment, or adding borders. Openpxl supports all these features with its easy-to-use API over Excel files. So far the demo Excel has undergone a lot of changes and now it’s time to beautify it. The following things are missing from the current Excel:

Borders for Score % Column
Column headers to be bold
Center alignment for text above headers and making it bold
Mark scores less than 30 as red and the rest as green.

See the code implementation below:

from openpyxl.styles import Border, Alignment, Font, Side

font_bold = Font(bold=True)
font_red = Font(color='ffff2e2e')
font_green = Font(color='ff00ff00')

border = Border(left=Side(border_style="thin", color='FF000000'),
                right=Side(border_style="thin", color='FF000000'),
                top=Side(border_style="thin", color='FF000000'),
                bottom=Side(border_style="thin", color='FF000000'))

alignment = Alignment(horizontal='center')

wb['Sheet1']['B1'].alignment = alignment
wb['Sheet1']['B1'].font = font_bold

for row in wb['Sheet1']['A2':'G2']:
    for cell in row:
        cell.font = font_bold

for row in wb['Sheet1']['F3':'F11']:
    for cell in row:
        cell.border = border

for row in wb['Sheet1']['B3':'F11']:
    for cell in row[:4]:
        if int(cell.value) > 30:
            cell.font = font_green
        else:
            cell.font = font_red#import csv

The above code modifies the cell border, font, and alignment properties. See the below output:

Formated Excel using Openpyxl | python — Formated Excel using Openpyxl

Adding Comments for Excel Automation

Excel has a feature where anyone can leave comments on the cells. These comments are helpful when multiple people are working on the same type of analysis. Openpyxl allows adding comments to the cells using the comment property. To illustrate this, let’s add a “need to review” comment on cells where the marks are less than 30.

from openpyxl.comments import Comment
comment = Comment("Need to review", "auto grader")
for row in wb['Sheet1']['B3':'F11']:
    for cell in row[:4]:
        if int(cell.value) < 30:
            cell.comment = comment

This will add a comment on all the cells in the range where the value is less than 30. This can be verified by hovering over the cells where the pinned triangle is added on the top right:

Scenario-Based Mini Excel Automation

Till here, all the essential topics in Excel automation are covered. Here is the repo link to the Excel used and the notebook with all the code covered in this article: Excel automation. To apply all the concepts, here is a small toy example to get started.

Assume you are a Data Analyst working for a University. The University analyses every college assessment in the form of Excel charts. You have decided the automate the whole process. You have prepared one template Excel file that can be used to prefill the data and the charts get auto-updated based on the Pivot table created. The template file looks like this:

Template File for excel automation | python — Template File

Your job is to come up with a script that can

Fill in the scores of students for respective subjects.
Add the Score % column to the existing table and make sure it follows the same formatting and fonts.
Merge the cells above to Display “Unit” in the center.
Highlight low scores cells with Yellow background.

The above-mentioned points are just the starting point. There are endless possibilities to work on this example and make the table more attractive and maybe create an Excel dashboard out of this! The template file can be found in the same repository mentioned above.

Conclusion

Excel automation has its own limitations but ideally, we should aim for shifting from manual work to automation. Openpyxl still has a lot of unexplored options that can help in building more robust workflows. Features such as streaming files from the backend, creating custom filters, and adding password protection are just a few examples. This guide covers the process of reading and manipulating Excel files using Python data structures, and then adding Excel functionalities.

Some Key Takeaways:

Excel automation in Python can present challenges as it requires a deeper understanding of the structure. One can overcome these challenges by using Openpyxl, which allows for easy automation of recurring reporting requirements. Using a templatized Excel sheet, one can quickly fill in data fetched from various sources.
However, not all features of Excel automation are supported via Openpyxl, such as filtering and Pivot tables.
To access advanced features such as running macros and converting Excel to PDFs, one can combine Openpyxl with other libraries such as win32com python client or xlwings for an extended version of Excel automation using Python.

If you want to read/explore every article of mine, then head over to my master article list.

Hope you liked my article on Excel automation using python. For any doubts, queries, or potential opportunities, you can reach out to me via LinkedIn — in/kaustubh-gupta/

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Kaustubh Gupta 12 May 2023

Hi, I am a Python Developer with an interest in Data Analytics and am on the path of becoming a Data Engineer in the upcoming years. Along with a Data-centric mindset, I love to build products involving real-world use cases. I know bits and pieces of Web Development without expertise: Flask, Fast API, MySQL, Bootstrap, CSS, JS, HTML, and learning ReactJS. I also do open source contributions, not in association with any project, but anything which can be improved and reporting bug fixes for them.

Advanced Excel Guide Pandas Python