Enhancing Scientific Document Processing with Nougat

Mobarak Inuwa 17 Nov, 2023

Introduction

In natural language processing and artificial intelligence, the ability to extract valuable insights from unstructured data sources, like scientific PDFs, has become increasingly critical. To address this challenge, Meta AI has introduced Nougat, or “Neural Optical Understanding for Academic Documents,” a state-of-the-art Transformer-based model designed to transcribe scientific PDFs into a common Markdown format. Nougat was introduced in the paper titled “Nougat: Neural Optical Understanding for Academic Documents” by Lukas Blecher, Guillem Cucurull, Thomas Scialom, and Robert Stojnic.

Nougat is the latest addition to Meta AI’s lineup of AI models, and it marks a significant step forward for Optical Character Recognition (OCR) technology. In this article, we’ll explore the capabilities of Nougat, understand its architecture, and walk through a practical example of using this model to transcribe scientific documents.

Learning Objectives

Understand Nougat, Meta AI’s latest Transformer model for scientific documents.
Learn how Nougat builds upon its predecessor, Donut, and introduces a state-of-the-art approach to document AI.
Learn how Nougat works, including its vision encoder, text decoder, and end-to-end training process.
Gain insights into the evolution of OCR technology, from the early days of ConvNets to the transformative power of Swin architectures and auto-regressive decoders.

This article was published as a part of the Data Science Blogathon.

The Birth of Nougat

Nougat is not the first model of its kind. It follows in the footsteps of its predecessor, “Donut,” a model from NAVER Clova that showcased the power of pairing a vision encoder with a text decoder in a Transformer-based model. The concept was simple: feed pixels into the model and receive text output. This end-to-end approach removes complex pipelines and suggests that attention is all that is required.

Let’s briefly discuss the underlying concept of the “vision encoder, text decoder” paradigm that powers models like Nougat. Donut, the predecessor to Nougat, introduced the ability to combine vision and text processing in a single model. Unlike traditional document processing pipelines, these models operate end-to-end, taking raw pixel data and producing textual content. This approach relies on the attention mechanism of the Transformer architecture to relate regions of the page image to the text being generated.

Nougat Takes the Torch

Building upon Donut’s success, Meta AI released Nougat to take OCR to the next level. Like its predecessor, Nougat employs a vision encoder in the form of a Swin Transformer and a text decoder based on mBART. Nougat predicts the Markdown for the text directly from the raw pixels of scientific PDF pages. This represents a shift towards simplifying the transcription of scientific knowledge into the familiar Markdown format.

Meta AI applied the vision-text paradigm to the challenges of scientific documents. PDFs, while widely adopted, are often hard for machines to parse accurately, which makes it difficult to extract meaningful information from them.

PDFs can be a barrier to effective knowledge retrieval due to the loss of semantic information, especially when dealing with mathematical structures. To bridge this gap, Meta AI introduced Nougat.

Why Nougat?

People have traditionally stored scientific knowledge in books and journals, often in the form of PDFs. However, the PDF format often leads to the loss of critical semantic information, especially for mathematical expressions. Nougat fills this gap by performing OCR on scientific documents and converting them into a markup language. This makes scientific knowledge easier to harvest and narrows the gap between human-readable documents and machine-readable text.

Nougat transcribes complex scientific documents without depending on a separate OCR engine, relying instead on the Transformer architecture. This has opened new doors for document AI: scientific knowledge locked away in PDFs can now be extracted and processed with Nougat.

The Journey of OCR

Nougat builds on decades of progress in OCR technology. In the late 1980s, applying Convolutional Neural Networks (ConvNets) to OCR was groundbreaking. However, training an end-to-end system that could read an entire page was nothing more than a dream, given the limitations of the time.

Fast forward to today: Swin architectures, which bring ConvNet-style hierarchical processing to Transformers, together with autoregressive decoders, have made it possible to transcribe entire pages. Like Donut, Nougat follows the vision-text paradigm: a Transformer-based image encoder paired with an autoregressive text decoder.
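
To make the encoder-decoder idea concrete, here is a toy sketch of greedy autoregressive decoding. The `toy_encoder` and `toy_decoder_step` functions are made-up stand-ins, not Nougat's Swin encoder or mBART decoder. Only the loop structure reflects how such models work: the image is encoded once, and text is then emitted one token at a time, conditioned on the image features and the tokens generated so far.

```python
from typing import List

EOS = "</s>"  # hypothetical end-of-sequence token for this toy example


def toy_encoder(pixels: List[List[int]]) -> List[int]:
    """Stand-in for a vision encoder: one 'feature' per pixel row (the row sum)."""
    return [sum(row) for row in pixels]


def toy_decoder_step(features: List[int], generated: List[str]) -> str:
    """Stand-in for one decoder step: pick the next token from features and history."""
    position = len(generated)
    if position >= len(features):
        return EOS
    return f"tok{features[position]}"


def greedy_decode(pixels: List[List[int]], max_length: int = 10) -> List[str]:
    features = toy_encoder(pixels)  # the image is encoded once
    generated: List[str] = []
    for _ in range(max_length):  # tokens are produced one at a time
        token = toy_decoder_step(features, generated)
        if token == EOS:
            break
        generated.append(token)
    return generated


print(greedy_decode([[1, 2], [3, 4], [5, 6]]))  # ['tok3', 'tok7', 'tok11']
```

In a real model, the decoder step is a neural network that attends over the encoder output, and the loop ends at an end-of-sequence token or a length limit.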

Using Nougat: A Practical Example

Now that we’ve explored Nougat, let’s dive into a practical example of how to use this model to transcribe scientific PDFs into a standard Markdown format. We’ll walk through the code step by step, providing explanations and insights along the way. The complete code for this article is available at https://github.com/inuwamobarak/nougat.

Set-Up Environment

First, we install the required libraries: pymupdf, which reads PDF pages as images, and python-Levenshtein and NLTK, which are used for post-processing. We also install Transformers from source.

!pip install -q pymupdf python-Levenshtein nltk
!pip install -q git+https://github.com/huggingface/transformers.git

Load Model and Processor

In this step, we will load the Nougat model and its associated processor to prepare the model for PDF transcription.

from transformers import AutoProcessor, VisionEncoderDecoderModel
import torch

# Load the Nougat model and processor from the hub
processor = AutoProcessor.from_pretrained("facebook/nougat-small")
model = VisionEncoderDecoderModel.from_pretrained("facebook/nougat-small")

Next, we move the model to the GPU if one is available, and fall back to the CPU otherwise.

%%capture
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

Next, we write a function that rasterizes the pages of a PDF paper into images.

from typing import Optional, List
import io
import fitz
from pathlib import Path

def rasterize_paper(
    pdf: Path,
    outpath: Optional[Path] = None,
    dpi: int = 96,
    return_pil=False,
    pages=None,
) -> Optional[List[io.BytesIO]]:
    """
    Rasterize a PDF file to PNG images.

    Args:
        pdf (Path): The path to the PDF file.
        outpath (Optional[Path], optional): The output directory. If None, the images are returned in memory instead. Defaults to None.
        dpi (int, optional): The output DPI. Defaults to 96.
        return_pil (bool, optional): Whether to return the images as in-memory PNG buffers (openable with PIL) instead of writing them to disk. Defaults to False.
        pages (Optional[List[int]], optional): The pages to rasterize. If None, all pages will be rasterized. Defaults to None.

    Returns:
        Optional[List[io.BytesIO]]: In-memory PNG buffers, one per page, if `return_pil` is True, otherwise None.
    """

    pillow_images = []
    if outpath is None:
        return_pil = True
    try:
        if isinstance(pdf, (str, Path)):
            pdf = fitz.open(pdf)
        if pages is None:
            pages = range(len(pdf))
        for i in pages:
            page_bytes: bytes = pdf[i].get_pixmap(dpi=dpi).pil_tobytes(format="PNG")
            if return_pil:
                pillow_images.append(io.BytesIO(page_bytes))
            else:
                with (outpath / ("%02d.png" % (i + 1))).open("wb") as f:
                    f.write(page_bytes)
    except Exception as e:
        # Report the failure instead of silently swallowing it
        print(f"Failed to rasterize PDF: {e}")
    if return_pil:
        return pillow_images

Load PDF

In this step, we load a sample PDF and use the fitz module (PyMuPDF) to convert it into a list of in-memory images, each representing a page from the PDF. We then open the first page as a Pillow image. The sample paper is Crouse et al. (2023).

from huggingface_hub import hf_hub_download
from PIL import Image

filepath = hf_hub_download(repo_id="inuwamobarak/random-files", filename="2310.08535.pdf", repo_type="dataset")

images = rasterize_paper(pdf=filepath, return_pil=True)
image = Image.open(images[0])
image

Generate Transcription

In this step, we prepare the image for input into the Nougat model. We also define custom stopping criteria to control the autoregressive generation process. These criteria determine when the model should stop generating text.

pixel_values = processor(images=image, return_tensors="pt").pixel_values

from transformers import StoppingCriteria, StoppingCriteriaList
from collections import defaultdict

class RunningVarTorch:
    def __init__(self, L=15, norm=False):
        self.values = None
        self.L = L
        self.norm = norm

    def push(self, x: torch.Tensor):
        assert x.dim() == 1
        if self.values is None:
            self.values = x[:, None]
        elif self.values.shape[1] < self.L:
            self.values = torch.cat((self.values, x[:, None]), 1)
        else:
            self.values = torch.cat((self.values[:, 1:], x[:, None]), 1)

    def variance(self):
        if self.values is None:
            return
        if self.norm:
            return torch.var(self.values, 1) / self.values.shape[1]
        else:
            return torch.var(self.values, 1)


class StoppingCriteriaScores(StoppingCriteria):
    def __init__(self, threshold: float = 0.015, window_size: int = 200):
        super().__init__()
        self.threshold = threshold
        self.vars = RunningVarTorch(norm=True)
        self.varvars = RunningVarTorch(L=window_size)
        self.stop_inds = defaultdict(int)
        self.stopped = defaultdict(bool)
        self.size = 0
        self.window_size = window_size

    @torch.no_grad()
    def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
        last_scores = scores[-1]
        self.vars.push(last_scores.max(1)[0].float().cpu())
        self.varvars.push(self.vars.variance())
        self.size += 1
        if self.size < self.window_size:
            return False

        varvar = self.varvars.variance()
        for b in range(len(last_scores)):
            if varvar[b] < self.threshold:
                if self.stop_inds[b] > 0 and not self.stopped[b]:
                    self.stopped[b] = self.stop_inds[b] >= self.size
                else:
                    self.stop_inds[b] = int(
                        min(max(self.size, 1) * 1.15 + 150 + self.window_size, 4095)
                    )
            else:
                self.stop_inds[b] = 0
                self.stopped[b] = False
        return all(self.stopped.values()) and len(self.stopped) > 0

outputs = model.generate(
    pixel_values.to(device),
    min_length=1,
    max_length=3584,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
    return_dict_in_generate=True,
    output_scores=True,
    stopping_criteria=StoppingCriteriaList([StoppingCriteriaScores()]),
)
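
The stopping criterion above appears to work as a repetition detector: it tracks the variance of the top logit over a short sliding window and stops once that signal itself stops changing. The pure-Python sketch below illustrates only the sliding-window variance idea. It is a simplification, not the class above, and `windowed_variance` and the sample values are hypothetical names and data introduced here.

```python
from collections import deque
from statistics import variance
from typing import List, Optional


def windowed_variance(values: List[float], window: int) -> List[Optional[float]]:
    """Sample variance over the trailing `window` values (None until two values exist)."""
    buffer = deque(maxlen=window)  # old values fall out automatically
    result: List[Optional[float]] = []
    for value in values:
        buffer.append(value)
        result.append(variance(list(buffer)) if len(buffer) >= 2 else None)
    return result


# Made-up scores: a "healthy" generation fluctuates, a stuck one does not.
varied = [0.0, 5.0, 1.0, 7.0, 2.0, 9.0]
stuck = [4.0, 4.0, 4.0, 4.0, 4.0, 4.0]

print(windowed_variance(varied, window=3)[-1])  # 13.0
print(windowed_variance(stuck, window=3)[-1])   # 0.0
```

A signal whose windowed variance collapses toward zero is a hint that generation has become repetitive, which is the condition the threshold in the class is compared against.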

Postprocessing

Finally, we decode the generated token IDs into human-readable text and apply post-processing steps to refine the generated Markdown content. The resulting output represents the transcribed content from the scientific PDF.

# outputs.sequences holds the generated token IDs (same as outputs[0])
generated = processor.batch_decode(outputs.sequences, skip_special_tokens=True)[0]

generated = processor.post_process_generation(generated, fix_markdown=False)
print(generated)

The generated output is Markdown text.

That’s how to run inference with Nougat and extract the Markdown text for a page. You can find the complete code for this article at https://github.com/inuwamobarak/nougat. Other links are available at the end of the article.
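
The example above transcribes only the first page. One way to extend it to a whole document is to loop over the rasterized pages and join the per-page Markdown. The helper below is a hypothetical sketch: `transcribe_document` and the `transcribe_page` callable are names introduced here. In practice, `transcribe_page` would wrap the processor, `model.generate`, and post-processing calls shown earlier.

```python
from typing import Callable, Iterable, List


def transcribe_document(
    pages: Iterable[object],
    transcribe_page: Callable[[object], str],
    separator: str = "\n\n",
) -> str:
    """Transcribe each page with `transcribe_page` and join the non-empty results."""
    parts: List[str] = []
    for page in pages:
        text = transcribe_page(page).strip()
        if text:  # skip pages that produced no text
            parts.append(text)
    return separator.join(parts)


# Usage with a stand-in transcriber; a real one would call the Nougat model.
fake_pages = ["page-1", "page-2"]
print(transcribe_document(fake_pages, lambda page: f"# Text of {page}\n"))
```

Keeping the model call behind a callable makes the loop easy to test without loading the model.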

Performance Metrics

A range of metrics was used to assess the performance of Nougat on a test set. These metrics provide a comprehensive view of Nougat’s capabilities in transcribing scientific PDFs into Markdown format.

Edit Distance

The edit distance (Levenshtein distance) is the minimum number of single-character edits needed to change one string into another, counting insertions, deletions, and substitutions. Nougat was evaluated with the normalized edit distance, which divides the calculated distance by the total number of characters. This metric shows how accurately Nougat transcribes content at the character level.
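
As an illustration, here is a small pure-Python sketch of the metric. The normalization is an assumption: the text above does not say which character count is used, so this sketch divides by the length of the longer string, which keeps the value between 0 and 1. The function names are introduced here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions to turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,        # deletion
                    current[j - 1] + 1,     # insertion
                    previous[j - 1] + cost, # substitution or match
                )
            )
        previous = current
    return previous[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance divided by the longer string's length (assumed normalization)."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return levenshtein(prediction, reference) / longest


print(levenshtein("kitten", "sitting"))  # 3
print(round(normalized_edit_distance("kitten", "sitting"), 3))  # 0.429
```

Lower values are better: 0 means the transcription matches the reference exactly.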

BLEU Score

BLEU (Bilingual Evaluation Understudy) was originally designed for evaluating machine translation quality. Here it compares the candidate text generated by Nougat with the reference text and computes a score based on the number of matching n-grams between the two. This shows how closely Nougat’s output matches the original content at the n-gram level.
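
The n-gram matching at the heart of BLEU can be sketched in a few lines. This is only the clipped (modified) n-gram precision for a single reference. Full BLEU also combines several n-gram orders and applies a brevity penalty, which this sketch leaves out. The function names and example sentences are introduced here for illustration.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-token windows of the token list."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate: List[str], reference: List[str], n: int) -> float:
    """Share of candidate n-grams found in the reference, with clipped counts."""
    candidate_counts = Counter(ngrams(candidate, n))
    reference_counts = Counter(ngrams(reference, n))
    total = sum(candidate_counts.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as often as it occurs in the reference
    overlap = sum(
        min(count, reference_counts[gram]) for gram, count in candidate_counts.items()
    )
    return overlap / total


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(round(modified_precision(candidate, reference, 1), 3))  # 0.833
print(round(modified_precision(candidate, reference, 2), 3))  # 0.6
```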

METEOR Score

METEOR, another metric from machine translation, weights recall more heavily than precision. While it is not a standard choice for OCR evaluation, it offers another perspective on how much of the source material’s content Nougat retains. Like BLEU, METEOR helps assess the quality of the transcribed text.

F-measure

The F1 score is the harmonic mean of the precision and recall of Nougat’s transcription. It gives a balanced view of the model’s performance: how much of the generated text is correct and how much of the reference content is captured.
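
A minimal sketch of the calculation follows. It assumes a bag-of-tokens definition of precision and recall, since the text above does not specify how matches are counted. The function name and sample tokens are introduced here for illustration.

```python
from collections import Counter
from typing import List, Tuple


def precision_recall_f1(
    prediction: List[str], reference: List[str]
) -> Tuple[float, float, float]:
    """Token-level precision, recall, and F1 using multiset overlap (an assumption)."""
    predicted_counts = Counter(prediction)
    reference_counts = Counter(reference)
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((predicted_counts & reference_counts).values())
    precision = overlap / len(prediction) if prediction else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f1 = precision_recall_f1("a b c d".split(), "a b e".split())
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.5 0.667 0.571
```

Because F1 is a harmonic mean, it stays low unless both precision and recall are reasonably high.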

Possible Applications of Nougat Beyond Academic Documents

While Nougat has been primarily designed for transcribing academic documents, its applications extend far beyond. Here are some potential areas where Nougat can make a significant impact:

Medical Documents

Nougat can be employed to transcribe medical records and clinical notes. This can help digitize healthcare information and make it easier for medical professionals to retrieve.

Legal Documents

Legal documents, contracts, and court documents commonly exist in PDF format. Nougat can facilitate the transformation of these documents into machine-readable text, streamlining legal processes and research.

Specialized Fields

Nougat’s adaptability allows it to be used in specialized fields like engineering, finance, and more. It can convert technical reports, financial statements, and other domain-specific documents.

Nougat is a milestone in document AI, a practical and efficient solution for transcribing scientific PDFs into a machine-readable Markdown format. Its contributions to document AI are a glimpse into a future where information retrieval is more efficient.

The Future of Scientific Text Recognition

Nougat is always used within the VisionEncoderDecoder framework, and its architecture mirrors that of Donut. Images are fed into the model, and Nougat’s VisionEncoderDecoder generates text autoregressively. The NougatImageProcessor class handles image preprocessing, and NougatTokenizerFast decodes the generated target tokens into the target string. The NougatProcessor combines these classes for feature extraction and token decoding.

This capability is cutting-edge and is likely to see wider adoption soon. Nougat represents the state of document AI: a solution for transcribing scientific PDFs into machine-readable Markdown format. As this model continues to gain traction, it has the potential to change the way researchers and academics interact with scientific literature, making knowledge more readily available and usable in the digital age.

Conclusion

Nougat is more than just a sweet addition to the Meta AI family; it’s a major step in the world of OCR for scientific documents. Its ability to convert complex PDFs into Markdown text is a game-changer for accessing scientific knowledge. As technology continues to grow, Nougat’s impact will resonate in AI, document processing, and beyond.

In a world where access to knowledge is paramount, Nougat is a powerful tool for unlocking the wealth of information stored in scientific PDFs, bridging the gap between human-readable documents and machine-readable text. Its contributions to document AI are a glimpse into a future where information retrieval is more efficient than ever.

Key Takeaways

Nougat is Meta AI’s cutting-edge OCR model for transcribing scientific PDFs into a user-friendly Markdown format.
The model combines a Swin Transformer vision encoder and an mBART-based text decoder, allowing it to work end-to-end.
It shows the strength of the Transformer architecture in simplifying complex tasks like scientific document transcription.
The evolution of OCR technology, from early ConvNets to modern Swin architectures and auto-regressive decoders, has paved the way for Nougat’s capabilities.

Frequently Asked Questions

Q1: What is Nougat, and how does it differ from traditional OCR systems?

A: Nougat is a state-of-the-art OCR model by Meta AI, designed explicitly for scientific PDFs. Unlike traditional OCR systems, Nougat’s use of the Transformer architecture enables it to simplify the entire transcription process by working end-to-end.

Q2: How does Nougat contribute to scientific knowledge?

A: Nougat’s ability to transcribe scientific PDFs into a user-friendly Markdown format makes it easier for researchers, students, and AI systems to access and process scientific information, bridging the gap between human-readable and machine-readable content.

Q3: What is Nougat’s architecture?

A: A Swin Transformer vision encoder and an mBART-based text decoder. These convert PDF images into readable text, eliminating the need for sophisticated pipelines.

Q4: How has OCR technology evolved, and how does Nougat fit into this evolution?

A: OCR technology has come a long way, from early ConvNets to Swin architectures and auto-regressive decoders. Nougat represents a modern solution that leverages these advancements to achieve impressive results in document transcription.

Q5: Is Nougat available for public use, and how can it be integrated into existing systems?

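A: Yes. Pretrained checkpoints such as facebook/nougat-small are available on the Hugging Face Hub and can be loaded with the VisionEncoderDecoderModel and processor classes from the Transformers library, as shown in this article. This makes it straightforward to integrate Nougat into existing systems that need to extract scientific knowledge from PDFs.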