Revolutionizing Document Processing Through DocVQA

Chetan Khadke 16 Mar, 2023 • 8 min read


DocVQA (Document Visual Question Answering) is a research field in computer vision and natural language processing that focuses on developing algorithms to answer questions related to the content of a document, like a scanned document or an image of a text document. Unlike other types of visual question answering, where the focus is on answering questions related to images or videos, DocVQA is focused on understanding and answering questions based on the text and layout of a document. The main challenge in DocVQA is understanding the document’s context with layout and formatting to answer the questions accurately.

 Cytonn Photography - unsplash

Learning Objectives:

In this article, we will understand the following:

  1. What is DocVQA, and what are its benefits?
  2. What are the challenges faced by using DocVQA?
  3. Discuss the related work of DocVQA.
  4. Understand the basics of the LayoutLM and Flan-T5 models and learn how to install and run them.


Benefits Offered by DocVQA

DocVQA offers several benefits compared to OCR (Optical Character Recognition) technology.

Firstly, DocVQA can not only recognize and extract text from a document, but it can also understand the context in which the text appears. This means it can answer questions about the document’s content rather than simply provide a digital version.

Secondly, DocVQA can handle documents with complex layouts and structures, like tables and diagrams, which can be challenging for traditional OCR systems.

Finally, DocVQA can automate many document-based workflows, like document routing and approval processes, freeing employees to focus on more meaningful work. Potential applications include automating tasks like information retrieval, document analysis, and document summarization.


Challenges Associated with DocVQA

There are several issues and challenges associated with document question answering, including:

  1. Understanding the Context: One of the biggest challenges in document question answering is understanding the context of the document. It is essential to understand the layout, formatting, and language used in the document to answer the questions accurately. This requires models that can handle the document’s structure and content complexity.
  2. Ambiguity: Another significant issue in document question answering is ambiguity. Documents may contain ambiguous or vague language, making it difficult for models to interpret the meaning of the text accurately. This requires models that can handle the nuances of natural language and distinguish between different meanings of the same word or phrase.
  3. Limited Training Data: There is a shortage of large-scale annotated datasets for document question answering, making it challenging to train accurate models. This requires models that can learn from limited amounts of training data and generalize to new documents.
  4. Complex Questions: Document question answering may involve difficult questions requiring accurate reasoning and inference. For example, a query may require combining information from different document parts to arrive at the answer. This requires models that can perform complex reasoning tasks.
  5. Multi-modal Understanding: Some documents may contain text and images, making it essential for models to have multi-modal understanding capabilities to answer questions accurately.

Addressing these challenges requires developing robust and accurate models that can handle the complexity of document question answering. Recent advancements in deep learning and natural language processing have led to significant progress in this field, but there is still much work to be done to develop models that can handle the diversity and complexity of real-world documents.

Several companies and research institutions are developing DocVQA technology. The most notable include Google, Microsoft, IBM, and Amazon.

Google offers Document AI, Microsoft has the LayoutLM model, IBM has the IBM Watson Discovery service, and Amazon offers Amazon Textract, a service that uses machine learning to extract text and data from scanned documents, PDFs, and images.

In this blog, we will discuss the LayoutLM and Flan-T5 models.


LayoutLM is a pre-trained model for document image understanding developed by Microsoft Research. It is based on the BERT architecture and trained on a large-scale document image dataset to understand document layout, structure, and content.

LayoutLM extends the BERT architecture in two ways: an input encoding that combines text with layout information, and pre-training objectives that jointly optimize for language and layout understanding (a masked visual-language modeling task, plus an optional multi-label document classification task).

In the input encoding, each token recognized by OCR is represented by its usual token and 1-D position embeddings plus 2-D position embeddings derived from the coordinates of the token's bounding box on the page. Image features for each token's region, taken from a separate visual model, can additionally be added during fine-tuning.
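To make the layout input concrete, here is a minimal sketch of the box normalization that LayoutLM-style models typically expect: pixel coordinates are rescaled to a 0 to 1000 grid so that pages of different sizes share one coordinate system. The function name and the sample page size are our own assumptions for illustration.

```python
def normalize_box(box, page_width, page_height):
    """Scale a pixel-space (x0, y0, x1, y1) box to a 0-1000 grid."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical word box on a 762 x 1000 pixel page scan
print(normalize_box((381, 100, 571, 130), page_width=762, page_height=1000))
```

Datasets prepared for LayoutLM usually ship boxes that are already normalized this way, so this step matters mainly when you bring your own OCR output.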

LayoutLM can be used for various document understanding tasks, including document classification, information extraction, and visual question answering (VQA). In particular, it has shown promising results in the field of DocVQA.

Several research studies have shown that LayoutLM outperforms other state-of-the-art VQA models on benchmark datasets for DocVQA, indicating its potential for practical applications in document understanding.

The UBIAI[8] tool, which supports LayoutLM extensively, is available for custom training on any dataset. Here, we demo an already trained model from the Hugging Face Hub on a sample from the FUNSD dataset.

The Form Understanding in Noisy Scanned Documents (FUNSD) dataset is a benchmark dataset for form understanding and analysis in the domain of noisy scanned documents. It contains 199 real-world scanned document forms and is designed to challenge models in understanding and extracting information from these types of documents.

The FUNSD dataset labels entities like questions and answers, and is annotated at both the block and token levels. The annotations are provided in a standard format, including both the ground truth labels and the positions of the blocks and tokens.
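As a rough illustration of token-level annotations, the sketch below merges per-token tags into labeled spans. It assumes a BIO tagging scheme (B-QUESTION, I-QUESTION, B-ANSWER, and so on), which is how token-level versions of FUNSD are commonly distributed; the sample words and tags are made up.

```python
def group_entities(words, tags):
    """Merge BIO-tagged tokens into (label, text) entity spans."""
    entities = []
    current_label, current_words = None, []
    for word, tag in zip(words, tags):
        if tag == "O":
            prefix, label = "O", None
        else:
            prefix, label = tag.split("-", 1)
        # A new span starts on every B- tag or whenever the label changes
        if prefix == "B" or label != current_label:
            if current_label is not None:
                entities.append((current_label, " ".join(current_words)))
            current_label, current_words = label, []
        if label is not None:
            current_words.append(word)
    if current_label is not None:
        entities.append((current_label, " ".join(current_words)))
    return entities


# Made-up sample in the assumed BIO format
words = ["Date:", "March", "16", "Page", "Name:", "Chetan"]
tags = ["B-QUESTION", "B-ANSWER", "I-ANSWER", "O", "B-QUESTION", "B-ANSWER"]
print(group_entities(words, tags))
```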

The annotations in the FUNSD dataset have been used to train and evaluate state-of-the-art models for form understanding, like LayoutLM and other pre-trained language models. The publicly available dataset is intended to serve as a benchmark for researchers and practitioners interested in form understanding and analysis in the domain of noisy scanned documents.

1. Install all the dependencies and load the funsd dataset from the datasets package.

!pip install transformers
!pip install torch
!pip install datasets
!pip install pillow

from datasets import load_dataset
from PIL import Image

dataset = load_dataset("nielsr/funsd", split="train")
example = dataset[0]
words = example["words"]
boxes = example["bboxes"]

2. Load and open the image

image = Image.open(example["image_path"])

3. Initialize the pipeline with task “document-question-answering” and model “impira/layoutlm-invoices”

from transformers import pipeline

nlp = pipeline(
    "document-question-answering",
    model="impira/layoutlm-invoices",
    framework="pt",
    # device=0,  # uncomment to run on GPU
)
# Alternative model: "impira/layoutlm-document-qa"

4. Pre-process the words and bounding boxes

words_bbox = []
for word, bbox in zip(words, boxes):
    # The pipeline expects one (word, box) pair per token
    words_bbox.append((word, bbox))

5. Inference on model

prediction = nlp(
    image,
    "What is the advised solution?",  # Specify question here
    word_boxes=words_bbox,
)
print(prediction)

It will generate the answer along with its text span (start and end token index) and the prediction probability.
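A small helper can make that output easier to consume. The sketch below assumes each prediction is a dict with "score", "answer", "start", and "end" keys, and that the pipeline may return either a single dict or a list of them; the sample values and the 0.5 threshold are made up for illustration.

```python
def best_answer(predictions, min_score=0.5):
    """Return the highest-scoring prediction, or None if it is below min_score."""
    if isinstance(predictions, dict):
        predictions = [predictions]
    if not predictions:
        return None
    top = max(predictions, key=lambda p: p["score"])
    return top if top["score"] >= min_score else None


# Made-up predictions in the assumed output shape
sample = [
    {"score": 0.04, "answer": "filter", "start": 14, "end": 14},
    {"score": 0.91, "answer": "replace the filter", "start": 12, "end": 14},
]
print(best_answer(sample))
```

Returning None for low-confidence predictions lets the calling code fall back to manual review instead of trusting a weak answer.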

Depending upon the document, asking a proper set of questions yields the appropriate response with high confidence.


1. Model choice depends upon the Document structure and complexity

2. “impira/layoutlm-document-qa” works well with structured documents

3. Inference on CPU might be slower; consider using the GPU for faster processing


In 2022, Google published the paper “Scaling Instruction-Finetuned Language Models”, which released multiple checkpoints for Flan-T5.

  • FLAN stands for “Fine-tuned LAnguage Net”
  • T5 stands for “Text-To-Text Transfer Transformer”

At the same parameter count, FLAN-T5 generally outperforms T5 across tasks: these models have been fine-tuned on more than 1,000 additional tasks, cover additional languages, and are also fine-tuned on chain-of-thought data.

Chain of Thought (CoT) prompting is a recently developed prompting method that encourages the LLM to explain its reasoning.


In chain-of-thought prompting, the model is prompted to produce intermediate reasoning steps before giving the final answer to a multi-step problem. The idea is that a model-generated chain of thought decomposes the problem into smaller chunks, which can produce better results.
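As a minimal sketch of what such a prompt might look like, the helper below prepends a reasoning instruction to the question. The exact instruction wording and the function name are our assumptions; in practice the phrasing is something to experiment with.

```python
def build_cot_prompt(question, context=""):
    """Build a prompt that asks for reasoning steps before the final answer."""
    lines = ["Answer the following question by reasoning step by step."]
    if context:
        lines.append(f"Context: {context}")
    lines.append(f"Question: {question}")
    return "\n".join(lines)


prompt = build_cot_prompt(
    "What is the gross amount if the net amount is 200 and VAT is 10%?"
)
print(prompt)
```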

FLAN-T5 works well on most natural language processing tasks, like language translation, text classification, and question answering. The model is fast and efficient enough to be incorporated into real-time applications. Additionally, FLAN-T5 can be fine-tuned for specific custom tasks.

Google’s Flan-T5 is available via five pre-trained checkpoints:

  • Flan-T5-small: 80M parameters (~308 MB size)
  • Flan-T5-base: 250M parameters (~990 MB size)
  • Flan-T5-large: 780M parameters (~3 GB size)
  • Flan-T5-XL: 3B parameters (~11 GB size)
  • Flan-T5-XXL: 11B parameters (~45 GB size)
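The listed sizes are roughly what you would expect if every parameter were stored as a 4-byte (32-bit) float. The sketch below works through that arithmetic; the 4-byte assumption is ours, and actual checkpoint files can differ somewhat depending on the storage format and unit convention (GB vs. GiB).

```python
# Approximate parameter counts from the list above
CHECKPOINTS = {
    "flan-t5-small": 80e6,
    "flan-t5-base": 250e6,
    "flan-t5-large": 780e6,
    "flan-t5-xl": 3e9,
    "flan-t5-xxl": 11e9,
}


def fp32_size_gb(num_params, bytes_per_param=4):
    """Estimate checkpoint size in GB, assuming one float32 per parameter."""
    return num_params * bytes_per_param / 1e9


for name, params in CHECKPOINTS.items():
    print(f"{name}: ~{fp32_size_gb(params):.2f} GB")
```

The same function with bytes_per_param=2 gives a rough estimate for half-precision loading, which is one common way to fit the larger checkpoints into limited memory.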

We will demonstrate Flan-T5 with a DocVQA example.

We use a sample invoice from Mendeley Data.


We auto-extract multiple fields, like seller name and address, client name and address, invoice number and date, etc.

1. Install the packages

!pip install transformers PyPDF2

2. Read the document

from PyPDF2 import PdfReader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

reader = PdfReader("/content/invoice_107_charspace_108.pdf")
pdf_text = ""
print("Total pages=",len(reader.pages)) 
page_numbers_to_read = [0] # Specify page number

# Read the given pages
for page in page_numbers_to_read:
    page_text = reader.pages[page].extract_text()
    pdf_text += page_text

3. Load and Inference the model

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-large").to(device)

def query_from_list(query, options, tok_len):
    t5query = f"""Question: "{query}" Context: {options}"""
    inputs = tokenizer(t5query, return_tensors="pt").to(device)
    outputs = model.generate(**inputs, max_new_tokens=tok_len)
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

4. Specify and process the questions.

# Specify your questions here.
questions = [
    "What is the Invoice no?",
    "What is the Invoice issue Date?",
    "What is the Seller name?",
    "What is the Client name?",
    "What is the Client Address?",
    "What is the Seller Address?",
    "What is the Total Net worth?",
    "What is the total Growth worth amount?",
]

# Call the LLM with the extracted PDF text as context for each question
results = {question: query_from_list(question, pdf_text, 30) for question in questions}
print(results)

Output:

{'What is the Invoice no?': ['82545881'],
 'What is the Invoice issue Date?': ['09/25/2011'],
 'What is the Seller name?': ['Campbell, Callahan and Gomez'],
 'What is the Client name?': ['Keller-Crosby'],
 'What is the Client Address?': ['Keller-Crosby 280 Kim Valleys Suite 217 Angelaburgh, DE 97356'],
 'What is the Seller Address?': ['2969 Todd Orchard Apt. 721 Port James, FL 83598'],
 'What is the Total Net worth?': ['221,70'],
 'What is the total Growth worth amount?': ['243,87']}
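The amounts above come back as strings with a comma as the decimal separator, so they usually need a small conversion step before any arithmetic. Here is a minimal sketch, assuming a comma decimal separator and optional spaces as thousands separators; the helper name is hypothetical.

```python
from decimal import Decimal


def parse_amount(text):
    """Convert a string like '221,70' or '1 221,70' to a Decimal."""
    cleaned = text.replace("\u00a0", "").replace(" ", "").replace(",", ".")
    return Decimal(cleaned)


net = parse_amount("221,70")
gross = parse_amount("243,87")
print("Difference between the two amounts:", gross - net)
```

Decimal is used instead of float so that currency values are not affected by binary rounding.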


1. The Flan-T5-large model is around 3 GB in size and consumes only about 5.7 GB of RAM.

 Colab Ram utilization

2. CPU inference time for the Flan-T5-large model is relatively low, around 1–1.5 min for eight questions.

3. The model yields excellent results if a proper prompt is available.

4. Flan-T5-XL and Flan-T5-XXL are capable of simple tabular question answering.

5. Flan-T5-XL works well and understands the context well for DocVQA. Beyond general key-value pairs, it can properly extract context-based answers, e.g., Seller Address or Invoice Date (when multiple dates are present).

6. FLAN-T5 can solve math problems when prompted to give its reasoning.

7. OCR post-processing plays a crucial role in the final extraction. Consider using a reliable OCR mechanism for better results.

8. Consider using High-RAM and a Premium GPU for FLAN-T5-XXL.
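To illustrate the point about OCR post-processing, here is a minimal sketch of a cleanup step that could run on extracted text before it is passed to the model as context. The specific rules (collapsing whitespace and dropping blank lines) are our assumptions; real documents may need more targeted fixes.

```python
import re


def clean_extracted_text(text):
    """Normalize whitespace in OCR or PDF-extracted text."""
    text = text.replace("\u00a0", " ")      # non-breaking spaces to plain spaces
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces and tabs
    text = re.sub(r" *\n *", "\n", text)    # trim spaces around line breaks
    text = re.sub(r"\n{2,}", "\n", text)    # drop blank lines
    return text.strip()


raw = "Invoice  no:\t82545881 \n\n\n Date:  09/25/2011 "
print(clean_extracted_text(raw))
```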


To experiment with Flan-T5-XL and Flan-T5-XXL, you might need the Google Colab Pro version. Please consider the following setup while running on Colab (Menu -> Runtime -> Change runtime type).

Model Name      | Hardware acceleration | GPU Class | Runtime Shape
FLAN-T5-Large   | GPU/CPU               | Standard  | Standard
FLAN-T5-XL      | GPU                   | Standard  | High-RAM
FLAN-T5-XXL     | GPU                   | Premium   | High-RAM


In conclusion, Document Visual Question Answering (DocVQA) is an emerging field of research that aims to understand the content of documents and answer questions about them. DocVQA requires analyzing the visual content of documents, like text, images, and other visual elements, and processing natural language questions to generate relevant answers. There are several challenges to be addressed in DocVQA, like dealing with the complexity of document structures, text variations, and language nuances. However, advancements in deep learning and transformer architectures have improved DocVQA performance. DocVQA has several practical applications, like automating customer service and improving information retrieval systems. It can also revolutionize the legal, financial, and medical domains by automating the processing and analysis of documents.

Overall, DocVQA is an exciting and rapidly evolving field that holds great potential for the future of document processing and natural language understanding. As technology advances, we can expect to see more sophisticated and effective DocVQA systems that can handle complex document structures and answer a wide range of questions accurately and efficiently.

Fine-tuning for specific tasks requires handling the larger models with limited resources. Please get in touch on LinkedIn for more details.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
