Chetan Khadke — Published On April 26, 2023
Advanced Deep Learning NLP Python Technique Unstructured Data


Document information extraction involves using computer algorithms to extract structured data (such as employee name, address, designation, and phone number) from unstructured or semi-structured documents, such as reports, emails, and web pages. The extracted information can be used for various purposes, such as analysis and classification. DocVQA (Document Visual Question Answering) is an approach that combines computer vision and natural language processing techniques to automatically answer questions about a document’s content. This article explores information extraction using DocVQA with Google’s Pix2Struct model.

Learning Objectives

  1. DocVQA usefulness across diverse domains
  2. Challenges and Related Work of DocVQA
  3. Comprehend and implement Google’s Pix2Struct technique
  4. The vital benefit of the Pix2Struct technique

This article was published as a part of the Data Science Blogathon.


DocVQA Use Case

Document extraction automatically extracts relevant information from unstructured documents, such as invoices, receipts, contracts, and forms. The following sectors benefit from it:

  1. Finance: Banks and financial institutions use document extraction to automate tasks such as invoice processing, loan application processing, and account opening. By automating these tasks, document extraction can reduce errors and processing times and improve efficiency.
  2. Healthcare: Hospitals and healthcare providers use document extraction to extract essential patient data from medical records, such as diagnosis codes, treatment plans, and test results. This can help streamline patient care and improve patient outcomes.
  3. Insurance: Insurance companies use document extraction to process claims, policy applications, and underwriting documents. Document extraction can reduce processing times and improve accuracy by automating these tasks.
  4. Government: Government agencies use document extraction to process large volumes of unstructured data, such as tax forms, applications, and legal documents. By automating these tasks, document extraction can help reduce costs, improve accuracy, and improve efficiency.
  5. Legal: Law firms and legal departments use document extraction to extract critical information from legal documents, such as contracts, pleadings, and discovery documents. This improves efficiency and accuracy in legal research and document review.

Document extraction has many applications in industries that deal with large volumes of unstructured data. Automating document processing tasks can help organizations save time, reduce errors, and improve efficiency.


There are several challenges associated with document information extraction. The major challenge is the variability in document formats and structures. For example, different documents may have various forms and layouts, making it difficult to extract information consistently. Another challenge is noise in the data, such as spelling errors and irrelevant information. This can lead to inaccurate or incomplete extraction results.

The process of document information extraction involves several steps.

  • Understand the documents: identify their type, layout, and the fields of interest.
  • Preprocess the documents, which involves cleaning and preparing the data for analysis. Preprocessing can include removing unnecessary formatting, such as headers and footers, and converting the data into plain text.
  • Extract the relevant information from the documents using a combination of rule-based and machine-learning algorithms. Rule-based algorithms use a set of predefined rules to extract specific types of information, such as names, dates, and addresses.
  • Machine learning algorithms use statistical models to identify patterns in the data and extract relevant information.
  • Validate and refine the extracted information. This involves checking the accuracy of the extracted information and making any necessary corrections. This step is vital to ensure the extracted data is accurate and reliable enough for further analysis.
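The rule-based extraction and validation steps above can be sketched in a few lines of Python. This is only an illustration: the two regular expressions, the field names, and the sample sentence (including the phone number) are assumptions made for the example, not rules taken from any real system.

```python
import re
from datetime import datetime

# Hypothetical rules for illustration; real documents need broader patterns.
DATE_RULE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")
PHONE_RULE = re.compile(r"\b(\d{3}-\d{3}-\d{4})\b")


def extract_fields(text):
    """Apply each predefined rule and return the first match per field."""
    fields = {}
    for name, rule in (("date", DATE_RULE), ("phone", PHONE_RULE)):
        match = rule.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def is_valid_date(value):
    """Validation step: accept only real calendar dates in MM/DD/YYYY form."""
    try:
        datetime.strptime(value, "%m/%d/%Y")
        return True
    except (TypeError, ValueError):
        return False


# Invented sample text, used only to exercise the rules.
sample = "Invoice issued 09/25/2011. Contact: 555-123-4567."
fields = extract_fields(sample)
print(fields, is_valid_date(fields["date"]))
```

Rules like these are brittle when layouts or formats change, which is one motivation for the learned approaches discussed next.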

Researchers are developing new algorithms and techniques for document information extraction to address these challenges. These include techniques for handling variability in document structures, such as using deep learning algorithms to learn document structures automatically. They also include techniques for handling noisy data, such as using natural language processing techniques to identify and correct spelling errors.

DocVQA stands for Document Visual Question Answering. It is a task in computer vision and natural language processing that aims to answer questions about the content of a given document image. The questions can be about any aspect of the document text. DocVQA is a challenging task because it requires understanding the document’s visual content and the ability to read and comprehend the text in it. This task has numerous real-world applications, such as document retrieval, information extraction, etc.

LayoutLM, Flan-T5, and Donut

LayoutLM, Flan-T5, and Donut are three models that have been applied to Document Visual Question Answering (DocVQA).

LayoutLM is a pre-trained language model that incorporates layout information, such as the positions of OCR text on the page, alongside the textual content. LayoutLM can be fine-tuned for various NLP tasks, including DocVQA. For example, LayoutLM in DocVQA can help accurately locate the document’s relevant text and other visual elements, which is essential for answering questions requiring context-specific information.
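To make "OCR text positions" concrete, here is a minimal sketch of preparing word boxes for a LayoutLM-style model. It assumes the convention described in the Hugging Face LayoutLM documentation, where each box is scaled to a 0-1000 range relative to the page size. The OCR words, pixel coordinates, and page size below are made up for illustration.

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale a pixel box (x0, y0, x1, y1) to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]


# Hypothetical OCR output: (word, pixel box) pairs on an 850 x 1100 pixel page.
ocr_words = [("Invoice", (85, 110, 255, 165)), ("Total", (425, 880, 510, 935))]
layout_input = [(word, normalize_bbox(box, 850, 1100)) for word, box in ocr_words]
print(layout_input)
```

Normalizing the boxes makes the position features independent of the scan resolution, so the same word location looks the same to the model on a small or large image.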

Flan-T5 is an instruction-tuned version of the T5 text-to-text transformer. It works on text only, so in a DocVQA pipeline it is paired with an OCR step: the recognized text and the question are passed to the model as a prompt, and the model generates the answer. The quality of the answers therefore depends on the quality of the OCR output.

Donut is a deep learning model that reads the document image directly and generates text, including for documents with irregular layouts. The use of Donut in DocVQA can help to accurately extract text from documents with complex layouts, which is essential for answering questions that require specific information. Its significant advantage is that it is OCR-free.

Overall, using these models in DocVQA can improve the accuracy and performance of the system by accurately extracting text and other relevant information from the document images. Please check out my previous blogs on Donut, Flan-T5, and LayoutLM.



The paper presents Pix2Struct from Google, a pre-trained image-to-text model for understanding visually-situated language. The model is pre-trained with a novel objective: parsing masked screenshots of web pages into simplified HTML, which provides a pretraining data source well suited to a range of downstream tasks. In addition to the novel pretraining strategy, the paper introduces a variable-resolution input representation and a more flexible integration of language and vision inputs. As a result, the model achieves state-of-the-art results in six out of nine tasks across four domains: documents, illustrations, user interfaces, and natural images. The following image shows details of the domains considered. (The picture below is on the 5th page of the Pix2Struct research paper.)

Figure from the Pix2Struct paper

Pix2Struct is a pre-trained model that combines the simplicity of purely pixel-level inputs with the generality and scalability provided by self-supervised pretraining from diverse and abundant web data. The model does this by proposing a screenshot parsing objective that requires predicting an HTML-based parse from a partially masked screenshot of a web page. With the diversity and complexity of textual and visual elements found on the web, Pix2Struct learns rich representations of the underlying structure of web pages, which can effectively transfer to various downstream visual language understanding tasks.

Pix2Struct is an image-encoder-text-decoder model based on the Vision Transformer (ViT). However, it proposes a small but impactful change to the input representation to make the model more robust to various forms of visually-situated language. Standard ViT extracts fixed-size patches after scaling input images to a predetermined resolution. This distorts the true aspect ratio of the image, which can be highly variable for documents, mobile UIs, and figures.

Also, transferring these models to downstream tasks with higher resolution is challenging, as the model only observes one specific resolution during pretraining. Pix2Struct instead scales the input image up or down, preserving its aspect ratio, so that the maximum number of fixed-size patches fits within the given sequence length. This approach is more robust to extreme aspect ratios, which are common in the domains Pix2Struct experiments with. Additionally, the model can handle on-the-fly changes to the sequence length and resolution. To handle variable resolutions unambiguously, 2-dimensional absolute positional embeddings are used for the input patches.
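The scaling rule can be illustrated with a little arithmetic. The sketch below is not the library code; it is a plain-Python approximation of the idea, assuming 16 x 16 pixel patches and a budget of 2048 patches (both values are assumptions chosen for the example). It finds the largest grid of patches that keeps the aspect ratio and stays within the budget.

```python
import math


def patch_grid(height, width, patch_size=16, max_patches=2048):
    """Return (rows, cols) of the largest aspect-preserving patch grid
    whose total patch count does not exceed max_patches."""
    # One scale factor for both axes, so the aspect ratio is preserved.
    scale = math.sqrt(max_patches * (patch_size / height) * (patch_size / width))
    rows = max(min(math.floor(scale * height / patch_size), max_patches), 1)
    cols = max(min(math.floor(scale * width / patch_size), max_patches), 1)
    return rows, cols


# A portrait page and a very wide banner both stay within the same budget.
for h, w in [(1100, 850), (400, 2000)]:
    rows, cols = patch_grid(h, w)
    print((h, w), "->", rows, "x", cols, "=", rows * cols, "patches")
```

A fixed square resize would squash the wide image into the same grid as the portrait page. Here the wide image gets few rows and many columns instead, which is the robustness to extreme aspect ratios described above.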

Pix2Struct Provides Two Models

  • Base model: google/pix2struct-docvqa-base (~ 1.3 GB)
  • Large model: google/pix2struct-docvqa-large (~ 5.4 GB)

The Pix2Struct-Large model has outperformed the previous state-of-the-art Donut model on the DocVQA dataset. The LayoutLMv3 model achieves high performance on this task using three components, including an OCR system and pre-trained encoders. However, the Pix2Struct model performs competitively without using in-domain pretraining data and relies solely on visual representations. (We consider only DocVQA results.)


Let us walk through an implementation of DocVQA. For demo purposes, let us consider a sample invoice from Mendeley Data.

Image from Mendeley Data

1. Install the packages

!pip install git+ pdf2image
!sudo apt install poppler-utils

2. Import the packages

from pdf2image import convert_from_path, convert_from_bytes
import torch
from functools import partial
from PIL import Image
from transformers import Pix2StructForConditionalGeneration as psg
from transformers import Pix2StructProcessor as psp

3. Initialize the model with pretrained weights

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
model = psg.from_pretrained("google/pix2struct-docvqa-large").to(DEVICE)
processor = psp.from_pretrained("google/pix2struct-docvqa-large")

4. Processing functions

def generate(model, processor, img, questions):
    # Pair every question with its own copy of the page image, then decode the answers.
    inputs = processor(images=[img for _ in range(len(questions))],
                       text=questions, return_tensors="pt").to(DEVICE)
    predictions = model.generate(**inputs, max_new_tokens=256)
    return zip(questions, processor.batch_decode(predictions, skip_special_tokens=True))

def convert_pdf_to_image(filename, page_no):
    return convert_from_path(filename)[page_no-1]
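
Because the generate function sends one copy of the image per question in a single batch, memory use grows with the number of questions. If that becomes a problem, the questions can be processed in smaller chunks. The helper below is a hypothetical addition, not part of the original notebook: `answer_fn` stands for any callable with the signature `answer_fn(img, questions)`, and a stand-in function is used here so the sketch runs without model weights.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def answer_in_batches(answer_fn, img, questions, batch_size=2):
    """Call answer_fn(img, chunk) per chunk and collect (question, answer) pairs."""
    results = []
    for chunk in batched(questions, batch_size):
        results.extend(answer_fn(img, chunk))
    return results


# Stand-in for the real model call so the sketch runs without model weights.
def fake_answer_fn(img, chunk):
    return zip(chunk, [f"answer {len(q)}" for q in chunk])


demo = answer_in_batches(fake_answer_fn, None, ["q1?", "q22?", "q333?"], batch_size=2)
print(demo)
```

With the real model, `answer_fn` could be a `functools.partial` of the generate function with the model and processor already bound.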

5. Specify the exact path and page number of the PDF file.

questions = ["what is the seller name?",
             "what is the date of issue?",
             "What is Delivery address?",
             "What is Tax Id of client?"]
FILENAME = "/content/invoice_107_charspace_108.pdf"
PAGE_NO = 1  # 1-indexed page of the PDF to query (assumed to be the first page)

6. Generate the answers

image = convert_pdf_to_image(FILENAME, PAGE_NO)
print("pdf to image conversion complete.")
generator = partial(generate, model, processor)
completions = generator(image, questions)
for completion in completions:
    print(completion)
## answers
('what is the seller name?', 'Campbell, Callahan and Gomez')
('what is the date of issue?', '09/25/2011')
('What is Delivery address?', '2969 Todd Orchard Apt. 721')
('What is Tax Id of client?', '941-79-6209')
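
Since the goal is structured data, the (question, answer) pairs can be turned into a record with named fields. The sketch below reuses the answers printed above. The `FIELD_NAMES` mapping and the `to_record` helper are hypothetical names introduced for this example.

```python
import json

# Hypothetical mapping from each question to a field name in the output record.
FIELD_NAMES = {
    "what is the seller name?": "seller_name",
    "what is the date of issue?": "issue_date",
    "What is Delivery address?": "delivery_address",
    "What is Tax Id of client?": "client_tax_id",
}


def to_record(completions):
    """Build a field -> answer dict; unknown questions keep the question as key."""
    return {FIELD_NAMES.get(q, q): a.strip() for q, a in completions}


# The (question, answer) pairs produced in step 6.
answers = [
    ("what is the seller name?", "Campbell, Callahan and Gomez"),
    ("what is the date of issue?", "09/25/2011"),
    ("What is Delivery address?", "2969 Todd Orchard Apt. 721"),
    ("What is Tax Id of client?", "941-79-6209"),
]
record = to_record(answers)
print(json.dumps(record, indent=2))
```

A record in this form can be written to a database or checked against validation rules before further use.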

Try out your own example on Hugging Face Spaces.

HuggingFace space

Notebooks: pix2struct notebook


In conclusion, document information extraction is an essential area of research with applications in many domains. It involves using computer algorithms to identify and extract relevant information from text-based documents. Although several challenges are associated with document information extraction, researchers are developing new algorithms and techniques to address these challenges and improve the accuracy and reliability of the extracted information.

However, like all deep learning approaches, DocVQA has some limitations. For example, it requires a lot of training data to perform well and may struggle with complex documents or rare symbols and fonts. It may also be sensitive to the quality of the input image and, for OCR-based models, to the accuracy of the OCR (optical character recognition) system used to extract text from the document.

Key Takeaways

  1. Pix2Struct understands the context well when answering.
  2. Pix2Struct is among the latest state-of-the-art models for DocVQA.
  3. No external OCR engine is required.
  4. Pix2Struct works better than Donut for similar prompts.
  5. Pix2Struct can be used for tabular question answering.
  6. CPU inference is slow (~1 min per question). The large model can be loaded into 16 GB of RAM.

To learn more, please get in touch on LinkedIn. Please acknowledge this article or repo if you cite it.



The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 
