An Introduction to BigBird

Drishti Last Updated : 02 Nov, 2022

This article was published as a part of the Data Science Blogathon.

Source: Canva|Arxiv

Introduction

In 2018, Google AI researchers developed Bidirectional Encoder Representations from Transformers (BERT) for various NLP tasks. However, one of the key limitations of this technique is its quadratic dependency on the sequence length: because of their full attention mechanism, BERT-like models can typically handle sequences of only 512 tokens at a time. To overcome this, Manzil Zaheer, Guru Guruganesh, Avinava Dubey, et al. proposed BigBird, which uses a sparse attention mechanism and can handle sequences up to 8x longer than was previously possible on similar hardware.

In this article, we will take a look at the proposed work in greater detail.

Now, let’s dive in!

Highlights

BigBird employs a sparse attention mechanism that turns the quadratic dependency of transformer-based models into a linear one. It is a universal approximator of sequence functions and retains the theoretical properties of the quadratic, full-attention model.
With BigBird's sparse attention mechanism, the O(n²) complexity of BERT's attention in the sequence length n is reduced to O(n). As a result, the input limit rises from 512 tokens to 4,096 tokens (8 × 512), i.e., BigBird can handle sequences up to 8x longer than was previously possible on similar hardware.
The capability to accommodate longer context allows BigBird to perform substantially better on a variety of NLP tasks, including question answering and summarization.

What is the Impact of the Self-Attention Mechanism in Transformers?

The key advancement in Transformers is the self-attention mechanism, which can be computed in parallel for each token of the input sequence, eliminating the sequential dependency of recurrent neural networks (like LSTMs). This parallelism enables Transformers to leverage the full potential of contemporary SIMD hardware accelerators like GPUs and TPUs, which in turn makes it feasible to train NLP models on datasets of unprecedented size. Pre-training transformers on large-scale datasets has led to significant improvements both in downstream tasks with little data and in tasks with sufficient data, and has thus been a major force behind the widespread use of transformers in contemporary NLP.

The self-attention mechanism removes the constraints imposed by the sequential nature of RNNs by enabling each token in the input sequence to attend independently to every other token in the sequence. However, full self-attention has computational and memory requirements that are quadratic in the sequence length. Moreover, while the corpus can be huge, the sequence length, which provides the context, is comparatively small. With commonly available hardware and model sizes, input sequences of roughly 512 tokens can be handled at a time. This limits the direct applicability of these models to tasks that require a larger context, like question answering (QA), document classification, etc.

Figure 1: Diagram illustrating full all-pair attention, which is obtained by direct matrix multiplication between the query and key matrices.
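To make the quadratic cost concrete, here is a minimal pure-Python sketch of dense attention that also counts the query-key inner products it performs. The function name `full_attention` and the toy inputs are illustrative assumptions, not code from the paper.

```python
import math

def full_attention(queries, keys, values):
    """Dense softmax attention: every query is scored against every key."""
    d = len(queries[0])
    outputs = []
    score_count = 0
    for q in queries:
        # One inner product per (query, key) pair, so n * n scores in total.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        score_count += len(scores)
        # Numerically stable softmax over this query's row of scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted average of the value vectors.
        dim = len(values[0])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)])
    return outputs, score_count

# Toy sequence of 4 tokens with 2-dimensional embeddings (made-up values).
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, score_count = full_attention(x, x, x)
print(score_count)          # 16 scores for 4 tokens
print(512 ** 2, 4096 ** 2)  # 262144 vs 16777216 scores per attention head
```

Going from 512 to 4,096 tokens multiplies the sequence length by 8 but the number of scores by 64, which is why full attention becomes the bottleneck for long inputs.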

Why Did We Need a BigBird-like Model?

As discussed in the previous sections, transformer-based models like BERT have a core limitation: their full attention mechanism has a quadratic dependency (mainly in terms of memory) on the sequence length. This quadratic dependency limits the context size of the model.

These limitations lead us to two questions: 1) Can we obtain the empirical advantages of a fully quadratic self-attention scheme with fewer inner products? 2) Do the sparse attention mechanisms sustain the expressivity and adaptability of the original network?

BigBird addresses both questions by using a sparse attention mechanism that scales linearly with the sequence length. As a result, the context can be scaled up from 512 tokens (the limit in most BERT models) to 4,096 tokens in BigBird. This is especially helpful in tasks where long-range dependencies need to be preserved, e.g., text summarization.

BigBird Architecture

The authors drew inspiration from graph sparsification methods and studied where the proof for the expressiveness of Transformers fails when full attention is relaxed to form the proposed attention pattern. This knowledge helped in developing BIGBIRD.

BigBird is a sparse-attention-based transformer that extends transformer-based models like BERT to 8 times longer sequences so that empirical advantages of a fully quadratic self-attention scheme are retained with fewer computations.

The building blocks of the sparse-attention mechanism used in BIGBIRD are as follows:

Random Attention: All tokens attending to a set of random tokens (r).
Window Local Attention: All tokens attending to a set of local neighboring tokens (w).
Global Attention: A set of global tokens (g) attending all parts of the sequence.

Figure 2: Diagram illustrating different types of attention mechanisms. The last one is BigBird’s sparse attention mechanism.

Let’s take a look at each type of attention mechanism in more detail.

1. Random Attention: Figure 2a illustrates the random attention mechanism with r = 1 and block size 2. Here, every query block attends to r randomly chosen key blocks, meaning that in Figure 2a each query block of size 2 attends to one randomly selected key block of size 2.
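The random pattern can be sketched as a block-level mask. This is a simplified illustration: the helper `random_block_mask` and the fixed seed are assumptions, and unlike a production implementation it does not exclude blocks already covered by window or global attention.

```python
import random

def random_block_mask(num_blocks, r, seed=0):
    """Return a num_blocks x num_blocks 0/1 mask where each query block
    (row) attends to r randomly chosen key blocks (columns)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    mask = [[0] * num_blocks for _ in range(num_blocks)]
    for q in range(num_blocks):
        # Sample r distinct key blocks for this query block.
        for k in rng.sample(range(num_blocks), r):
            mask[q][k] = 1
    return mask

# Six blocks with r = 1: exactly one key block is marked in every row.
mask = random_block_mask(num_blocks=6, r=1)
for row in mask:
    print(row)
```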

2. Window local attention: When creating the blocks, it is ensured that the number of query blocks and the number of key blocks are the same. This helps in defining the block window attention. Each query block with index j attends to the key blocks with indices j − (w − 1)/2 to j + (w − 1)/2, including key block j. Figure 2b shows sliding window attention with w = 3 and block size 2, meaning each query block j attends to key blocks j − 1, j, and j + 1. This ensures that every query attends to at least one block of keys of size b, and at most two blocks, on each side.

Figure 3 further illustrates the idea behind the window attention mechanism in detail for different parameters.

Figure 3: Diagram illustrating how window local attention is obtained (in general) by “blocking” the query and key matrix, copying the key matrix, and rolling the resulting key tensor.
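The "copy and roll" idea from Figure 3 can be sketched on block indices alone. The helpers `roll` and `window_key_blocks` are hypothetical names; note that rolling wraps around at the sequence ends, so in this sketch the first and last query blocks pick up a block from the opposite end.

```python
def roll(blocks, shift):
    """Cyclically shift a list of blocks to the right by `shift` positions."""
    shift %= len(blocks)
    return blocks[-shift:] + blocks[:-shift] if shift else list(blocks)

def window_key_blocks(num_blocks, w):
    """For each query block j, list the key blocks j-(w-1)/2 .. j+(w-1)/2
    by stacking rolled copies of the key block indices."""
    half = (w - 1) // 2
    block_ids = list(range(num_blocks))
    # The copy for offset o places key block (j + o) at position j.
    rolled = [roll(block_ids, -o) for o in range(-half, half + 1)]
    return [[copy[j] for copy in rolled] for j in range(num_blocks)]

windows = window_key_blocks(num_blocks=6, w=3)
print(windows[2])  # [1, 2, 3]: query block 2 attends to key blocks 1, 2, 3
print(windows[0])  # [5, 0, 1]: wrap-around introduced by rolling
```

Because every query block ends up with exactly w key blocks, the result can be stored as a regular tensor rather than a ragged structure.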

3. Global attention: Global attention is also computed in terms of blocks. Figure 2c illustrates the global attention mechanism with g = 1 and block size 2. For BIGBIRD-ITC, this means that one query block and one key block attend to every block.
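A block-level mask for global attention might look as follows. Treating the first g blocks as the global ones is an assumption made for this sketch, and `global_block_mask` is a hypothetical helper name.

```python
def global_block_mask(num_blocks, g):
    """Return a 0/1 block mask in which the first g blocks are global:
    they attend to every block (full rows) and every block attends
    to them (full columns)."""
    mask = [[0] * num_blocks for _ in range(num_blocks)]
    for q in range(num_blocks):
        for k in range(num_blocks):
            if q < g or k < g:
                mask[q][k] = 1
    return mask

mask = global_block_mask(num_blocks=4, g=1)
for row in mask:
    print(row)
# [1, 1, 1, 1]
# [1, 0, 0, 0]
# [1, 0, 0, 0]
# [1, 0, 0, 0]
```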

Figure 2d illustrates the resulting overall attention mechanism used in BigBird. To sum up, the final attention mechanism for BigBird has the following three properties:

– queries attend to r random keys

– each query attends to w/2 tokens to the right of its location and w/2 tokens to the left of its location

– it contains g global tokens, which can be existing tokens or extra added tokens
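The three properties above can be combined into a single block mask to see how the number of attended blocks grows. This is a sketch under stated assumptions: the first g blocks are global, the window is clipped at the sequence ends, random blocks may overlap with the other patterns, and `bigbird_block_mask` is a hypothetical helper.

```python
import random

def bigbird_block_mask(num_blocks, w=3, r=1, g=1, seed=0):
    """Union of global, window and random block attention as a boolean mask."""
    rng = random.Random(seed)
    half = (w - 1) // 2
    mask = [[False] * num_blocks for _ in range(num_blocks)]
    for q in range(num_blocks):
        for k in range(num_blocks):
            is_global = q < g or k < g      # global rows and columns
            is_window = abs(q - k) <= half  # sliding window, clipped at edges
            mask[q][k] = is_global or is_window
        for k in rng.sample(range(num_blocks), r):
            mask[q][k] = True               # r random key blocks per query block
    return mask

def attended_blocks(mask):
    """Count the (query block, key block) pairs that are attended."""
    return sum(sum(row) for row in mask)

# Sparse count grows roughly linearly; the dense count grows quadratically.
for n in (8, 16, 32, 64):
    print(n, attended_blocks(bigbird_block_mask(n)), n * n)
```

Every non-global query block attends to at most g + w + r key blocks regardless of the sequence length, which is where the linear scaling comes from.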

Unfortunately, computing these attention scores by simply multiplying arbitrary pairs of key and query vectors usually requires a gather operation, which is inefficient. On closer examination, however, the global and window attention scores can be calculated without a gather operation.

Figure 4: Overview of BigBird attention computation.

BigBird Attention Computation: Structured block sparsity helps pack the operations of sparse attention compactly, thereby making the method efficient on GPUs and TPUs. Figure 4 shows the transformed dense query and key tensors on the left. The query tensor is obtained by blocking and reshaping, whereas the final key tensor is obtained by concatenating three transformations: the first green columns (which correspond to global attention) are fixed; the center blue columns (which correspond to window local attention) are obtained by appropriately rolling the key tensor; and the last orange columns (which correspond to random attention) require a computationally inefficient gather operation.

Dense multiplication between the query and key tensors effectively computes the sparse attention pattern (except for the first-row block, which is calculated using direct multiplication). The resulting matrix on the right (in Figure 4) is identical to that shown in Figure 2d.
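The packing step can be sketched in plain Python: for each query block, the selected key blocks are concatenated into one small dense key matrix, and a single dense product yields that block's scores. The `key_plan` below is a hand-written, hypothetical selection for interior blocks only (global block 0, the window j − 1 to j + 1, and one "random" block); it is not the paper's implementation.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def to_blocks(vectors, block_size):
    """Split a list of token vectors into consecutive blocks."""
    return [vectors[i:i + block_size] for i in range(0, len(vectors), block_size)]

def packed_scores(queries, keys, block_size, key_plan):
    """Score each query block against its packed (concatenated) key blocks.

    key_plan maps a query-block index to the key-block indices it attends
    to, in the order global, window, random.
    """
    q_blocks = to_blocks(queries, block_size)
    k_blocks = to_blocks(keys, block_size)
    scores = {}
    for j, key_ids in key_plan.items():
        # Concatenate the selected key blocks into one small dense key matrix.
        packed_keys = [k for i in key_ids for k in k_blocks[i]]
        # Dense (block_size x len(packed_keys)) product for this query block.
        scores[j] = [[dot(q, k) for k in packed_keys] for q in q_blocks[j]]
    return scores

# 12 toy tokens with 2-dimensional vectors, i.e. 6 blocks of size 2.
tokens = [[float(t), 1.0] for t in range(12)]
key_plan = {2: [0, 1, 2, 3, 5], 3: [0, 2, 3, 4, 1], 4: [0, 3, 4, 5, 2]}
scores = packed_scores(tokens, tokens, block_size=2, key_plan=key_plan)
print(len(scores[2]), len(scores[2][0]))  # 2 10
```

Each interior query block sees the same number of key blocks (five here), so the per-block score matrices all have the same shape and can be batched into one dense multiplication.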

Potential Applications of BigBird

Some of the applications of BigBird are as follows:

1. Genomics Processing: A genomic sequence is provided as input to the encoder for tasks like methylation analysis, predicting the functional impact of non-coding variants, etc.

2. Question Answering and Long Document Summarization: BigBird can handle sequences up to 8 times longer than BERT, making it suitable for NLP tasks like question answering over long documents and long-document summarization.

3. Search Engine: Since BigBird can handle long context better than BERT, it can be used in search engines.

Limitations of BigBird

The sparse attention mechanisms can’t universally substitute dense attention mechanisms. Moreover, switching to a sparse attention mechanism does incur a cost.

BigBird for Language Modeling Task

For this, we will first install and import all the required packages. Following that, we will load the model (“google/bigbird-roberta-base”) and the corresponding tokenizer with the help of the BigBirdForMaskedLM and AutoTokenizer classes. In addition, we will load the “squad_v2” dataset, mask a word in one of its long passages, and finally decode the model's prediction for the masked token.

!pip install -q transformers datasets sentencepiece

import torch
from transformers import AutoTokenizer, BigBirdForMaskedLM
from datasets import load_dataset

model_name = "google/bigbird-roberta-base"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = BigBirdForMaskedLM.from_pretrained(model_name)
squad_ds = load_dataset("squad_v2", split="train")

# Pick a long article from the dataset (fixed index)
random_long_article = squad_ds[81515]["context"]

# Replace the first occurrence of "maximum" with the tokenizer's mask token
add_mask_token = random_long_article.replace("maximum", tokenizer.mask_token, 1)
inputs = tokenizer(add_mask_token, return_tensors="pt")

# Run the model without tracking gradients
with torch.inference_mode():
    logits = model(**inputs).logits

# Retrieve the index of the [MASK] token and decode the most likely prediction
mask_token_index = (inputs.input_ids == tokenizer.mask_token_id)[0].nonzero(as_tuple=True)[0]
predicted_token_id = logits[0, mask_token_index].argmax(dim=-1)
tokenizer.decode(predicted_token_id)

>> Output: “maximum”

Link to Colab Notebook: https://bit.ly/3fgOYXN
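The sparse pattern can also be tuned through the model configuration. The sketch below only builds configuration objects with the `transformers` library's `BigBirdConfig`, so nothing is downloaded; the values 16 and 2 are arbitrary choices for illustration.

```python
from transformers import BigBirdConfig

# Block-sparse attention with smaller blocks and fewer random blocks
# than the library defaults (illustrative values only).
sparse_config = BigBirdConfig(
    attention_type="block_sparse",
    block_size=16,
    num_random_blocks=2,
)

# Full quadratic attention, as in BERT.
full_config = BigBirdConfig(attention_type="original_full")

print(sparse_config.attention_type, sparse_config.block_size, sparse_config.num_random_blocks)
print(full_config.attention_type)
```

To the best of my knowledge, the same keyword arguments can be passed to `from_pretrained` to override the settings of a pretrained checkpoint; check the library documentation for the version you are using.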

Conclusion

To sum it up, in this article, we learned the following:

BigBird is a sparse-attention-based transformer that extends transformer-based models like BERT to 8 times longer sequences (4096 tokens) in such a manner that empirical advantages of a fully quadratic self-attention scheme are retained with fewer computations.
BigBird satisfies all the known theoretical properties of the full transformer. In particular, it was demonstrated that adding extra global tokens preserves the expressiveness of the model, allowing continuous sequence-to-sequence functions to be expressed with only O(n) inner products.
The extended context modeled by BigBird benefits various NLP tasks like question answering, summarization, long document classification, etc.
The sparse attention mechanisms can’t universally substitute dense attention mechanisms. Moreover, switching to a sparse attention mechanism does incur a cost.

That concludes this article. Thanks for reading. If you have any questions or concerns, please post them in the comments section below. Happy learning!

Link to Research Paper: https://arxiv.org/pdf/2007.14062.pdf

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Drishti

I'm a Researcher who works primarily on various Acoustic DL, NLP, and RL tasks. Here, my writing predominantly revolves around topics related to Acoustic DL, NLP, and RL, as well as new emerging technologies. In addition to all of this, I also contribute to open-source projects @Hugging Face.
