How to Extract Tabular Data from Doc files Using Python?

Kaustubh Gupta Last Updated : 03 Feb, 2025

Data is everywhere, and nearly every action we take creates some form of it, though it is not always structured. Beginners in data analysis often start with standard formats like CSV or text files, as they are easy to work with using tools like pandas or basic Python file handling. However, real-world data can come in various formats, including documents like Doc files. For example, during an internship assignment, I had to analyze data from a Doc file, which required me to extract tabular data from it. In this article, I’ll explain the ETL process for Doc files: the difference between the Doc and Docx formats, how to convert Doc to Docx using Python, how to extract tabular data from the converted file, and how I created interactive plots from the data.

This article was published as a part of the Data Science Blogathon

Difference Between Doc and Docx
Conversion of Doc to Docx in Python
Reading Docx files in Python
Extract Tabular Data From Doc Files Using Python
Bonus Step: Plot using Plotly
Frequently Asked Questions

Difference Between Doc and Docx

While dealing with Word documents, you will come across two extensions: ‘.doc’ and ‘.docx’. Both are used for Microsoft Word documents, which can be created with Microsoft Word or other word processing tools. The difference is historical: until Word 2007, the “.doc” extension was the standard.

Starting with that version, Microsoft introduced a new extension, “.docx”, the Microsoft Word Open XML Format Document. This format makes files smaller, easier to store, and less prone to corruption. It also opened the door to online tools like Google Docs, which can easily manage Docx files.
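One practical consequence is that the two formats can be told apart by their first bytes: a Docx file is a ZIP archive of XML parts, while a legacy Doc file is an OLE compound binary file. The sketch below relies on those two file signatures. The function name `detect_word_format` is made up for illustration, and the check only identifies the container type; it does not prove the file is a valid Word document.

```python
# File signatures (magic bytes) of the two container formats.
ZIP_MAGIC = b"PK\x03\x04"                        # Docx is a ZIP archive
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # legacy Doc is an OLE compound file


def detect_word_format(path):
    """Return 'docx', 'doc', or 'unknown' based on the file's leading bytes."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header.startswith(ZIP_MAGIC):
        return "docx"
    if header.startswith(OLE_MAGIC):
        return "doc"
    return "unknown"
```

This can help spot files whose extension was changed by renaming rather than by a real conversion.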

Conversion of Doc to Docx in Python

Today, files are created with the Docx extension by default, but there are still many old files with the Doc extension. Docx is the better format for storing and sharing data, but we can’t neglect the data stored in Doc files; it might be of great value. Since python-docx, the library used later in this article, reads only Docx files, we first need to convert the Doc file to Docx format. Depending on the platform, Windows or Linux, there are different ways to do this conversion.

For Windows

Manually, you can open the file in Word and use Save As to store it in the “.docx” format.

We will perform this task using Python. Windows’ Component Object Model (COM) allows Windows applications to be controlled by other applications. pywin32 is the Python wrapper that can talk to COM and automate Windows applications from Python. Microsoft Word must be installed on the machine for this to work. The implementation goes like this:

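import os
from win32com import client as wc
w = wc.Dispatch('Word.Application')
doc = w.Documents.Open(os.path.abspath("file_name.doc"))  # Word expects absolute paths
doc.SaveAs(os.path.abspath("file_name.docx"), 16)  # 16 = default Docx format
doc.Close()
w.Quit()  # release the background Word process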

Breakdown of the code:

First, we import the client from the win32com package. It ships with pywin32, which is not part of the standard Python installation; install it with pip install pywin32.
Next, we create a Dispatch object for the Word application.
Then, we open the document and save it in Docx format. The second argument, 16, is Word’s file-format code for the default Docx format.
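When there are many legacy files, the same COM calls can be wrapped in a loop. The following is a minimal sketch, assuming Windows with Microsoft Word and pywin32 installed. The helper names `docx_target` and `convert_folder` are hypothetical, and win32com is imported inside the function so the module can still be loaded on machines without it.

```python
from pathlib import Path

WD_FORMAT_DOCX = 16  # Word's file-format code for the default Docx format


def docx_target(doc_path):
    """Return the absolute .docx path that sits next to the given .doc file."""
    return Path(doc_path).resolve().with_suffix(".docx")


def convert_folder(folder):
    """Convert every .doc file in `folder` to .docx using a single Word instance."""
    from win32com import client as wc  # assumption: pywin32 is installed (Windows only)

    word = wc.Dispatch("Word.Application")
    converted = []
    try:
        for doc_path in sorted(Path(folder).iterdir()):
            if doc_path.suffix.lower() != ".doc":
                continue  # skip everything that is not a legacy Doc file
            doc = word.Documents.Open(str(doc_path.resolve()))
            try:
                target = docx_target(doc_path)
                doc.SaveAs(str(target), WD_FORMAT_DOCX)
                converted.append(target)
            finally:
                doc.Close()
    finally:
        word.Quit()  # always release the Word process, even after an error
    return converted
```

Reusing one Word instance for the whole folder avoids starting the application once per file.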

For Linux

We can directly use LibreOffice’s built-in converter:

lowriter --convert-to docx testdoc.doc
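To run the same conversion from a Python script, the command can be launched with subprocess. This is a sketch under assumptions: it expects `lowriter` to be on the PATH, adds LibreOffice’s `--headless` and `--outdir` options, and uses hypothetical helper names.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir):
    """Build the LibreOffice command line that converts one .doc file to .docx."""
    return [
        "lowriter",
        "--headless",             # run without opening a window
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir="."):
    """Run the conversion and return the expected path of the new .docx file."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

Keeping the command builder separate from the call that runs it makes the command easy to inspect or log before anything is executed.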

Reading Docx files in Python

Python has a module for reading and manipulating Docx files. It’s called “python-docx”. Here, all the essential functions have been already implemented. You can install this module via pip:

pip install python-docx

I won’t go into detail about how a Docx document is structured, but at an abstract level python-docx exposes three kinds of objects: Document, Paragraph, and Run. For this tutorial, we will be dealing with Paragraph and Document objects. Before moving to the actual code, let us look at the data we will be extracting:

Data in new Docx file

The new Docx file contains a patient’s glucose level readings taken at several intervals. Each data row has an ID, timestamp, type, and glucose level reading. To maintain anonymity, I have blurred out the patient’s name. The procedure to extract this data follows.

Extract Tabular Data From Doc Files Using Python

1. Import the module

import docx

2. Create a Docx file document object and pass the path to the Docx file.

Text = docx.Document('file_name.docx')

3. Create an empty data dictionary

data = {}

4. Create a paragraph object out of the document object. This object can access all the paragraphs of the document

paragraphs = Text.paragraphs

5. Now, we will iterate over all the paragraphs, access the text, and save them into a data dictionary

for i in range(2, len(paragraphs)):
    data[i] = tuple(paragraphs[i].text.split('\t'))

Here I split the text at “\t” because, if you look at one of the rows, the fields are separated by tabs. The loop starts at index 2, which skips the first two paragraphs of the document.

6. Access the values of the dictionary

data_values = list(data.values())

These values are now a list of tuples, which we can pass to a pandas DataFrame. For my use case, I had to follow some additional steps, such as dropping unnecessary columns and converting timestamps. After these steps, I had the final pandas DataFrame built from the initial Doc file.
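The hand-off from paragraph text to pandas can be sketched as follows. The rows and column names here are invented for illustration (they are not the patient data from this article), and `rows_from_texts` is a hypothetical helper that mirrors the tab-splitting loop on plain strings, so the example runs without a Docx file. Unlike the loop above, it also skips blank paragraphs.

```python
import pandas as pd


def rows_from_texts(texts, skip=2):
    """Split tab-separated texts into tuples, skipping the first `skip` entries."""
    return [tuple(text.split("\t")) for text in texts[skip:] if text.strip()]


# Made-up stand-in for [p.text for p in document.paragraphs]
texts = [
    "Glucose report",
    "ID\tTime\tType\tGlucose (mg/dL)",
    "1\t2021-01-01 08:00\t0\t110",
    "",
    "2\t2021-01-01 08:15\t0\t118",
]

columns = ["ID", "Time", "Type", "Glucose (mg/dL)"]
df = pd.DataFrame(rows_from_texts(texts), columns=columns)

# Typical clean-up: parse timestamps, convert readings to numbers, drop unused columns
df["Time"] = pd.to_datetime(df["Time"])
df["Glucose (mg/dL)"] = pd.to_numeric(df["Glucose (mg/dL)"])
df = df.drop(columns=["Type"]).set_index("Time")
```

Everything extracted from paragraph text arrives as strings, so the explicit conversions are what make the columns usable for time-based plots and numeric operations.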

There is a lot more that can be done with the python-docx module. Apart from loading a file, you can create a Docx file with it: add headings and paragraphs, make text bold or italic, and add images, tables, and much more. The module’s full documentation covers these features in detail.

Bonus Step: Plot using Plotly

The main aim of this article was to show you how to extract tabular data from a Doc file into a pandas DataFrame. Let’s complete the ETL cycle and turn this data into visualizations using the Plotly library. If you haven’t used it, Plotly is a visualization library for creating interactive plots.

These plots don’t require much effort as most of the things can be customized. There are many articles on Analytics Vidhya describing the usage of this library. For my use case, here is the configuration for the plot:

import plotly.graph_objects as go

# doc_data is the pandas DataFrame built from the extracted rows
fig = go.Figure()

fig.add_trace(go.Scatter(
    x=doc_data.index,
    # 5-point rolling mean to smooth the readings
    y=doc_data['Historic Glucose (mg/dL)'].rolling(5).mean(),
    mode='lines',
    marker=dict(
        size=20,
        line_width=2,
        colorscale='Rainbow',
        showscale=True,
    ),
    name='Historic Glucose (mg/dL)',
))

fig.update_layout(
    xaxis_tickangle=-45,
    font=dict(size=15),
    yaxis={'visible': True},
    xaxis_title='Dates',
    yaxis_title='Glucose',
    template='plotly_dark',
    title='Glucose Level Over Time',
)

fig.update_layout(hovermode="x")
fig.show()
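The trace above plots a 5-point rolling mean rather than the raw readings. As a quick sketch of what that smoothing does, using invented values:

```python
import pandas as pd

# Invented readings, only to show the arithmetic
readings = pd.Series([100, 110, 120, 130, 140, 150])

smoothed = readings.rolling(5).mean()
# The first four entries are NaN because a full window of 5 values is not yet
# available; from the fifth entry on, each value is the mean of the current
# reading and the four before it.
```

Because of those leading NaN values, the plotted line starts a few points later than the raw data would.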

Conclusion

In this article, I explained what Doc files are, the difference between the Doc and Docx file extensions, how to convert Doc files to Docx, how to load and manipulate Docx files, and finally, how to load the tabular data into a pandas DataFrame.

Frequently Asked Questions

Q1. How do I convert a DOC file to DOCX in Python?

Use the pywin32 library to automate Microsoft Word for conversion, or use unoconv or LibreOffice for an open-source solution.

Q2. How to convert TXT to DOCX in Python?

Use the python-docx library to create a DOCX file and add the text content from the TXT file programmatically.

Q3. Can Python parse a Word document?

Yes, Python can parse Word documents using libraries like python-docx for DOCX files or pywin32 for older DOC files.

Kaustubh Gupta

Kaustubh Gupta is a skilled engineer with a B.Tech in Information Technology from Maharaja Agrasen Institute of Technology. With experience as a CS Analyst and Analyst Intern at Prodigal Technologies, Kaustubh excels in Python, SQL, Libraries, and various engineering tools. He has developed core components of product intent engines, created gold tables in Databricks, and built internal tools and dashboards using Streamlit and Tableau. Recognized as India’s Top 5 Community Contributor 2023 by Analytics Vidhya, Kaustubh is also a prolific writer and mentor, contributing significantly to the tech community through speaking sessions and workshops.
