Understanding the XLNet Pre-trained Model

Mounish V 21 May, 2024


XLNet is a generalized autoregressive pretraining method proposed in the paper “XLNet: Generalized Autoregressive Pretraining for Language Understanding”. Unlike earlier models such as BERT, which use masked language modeling (MLM), where certain words are masked and predicted from their context, XLNet employs permutation language modeling (PLM). It trains over randomly sampled permutations of the order in which tokens are predicted, so that each token learns to use context from both sides without any masking. XLNet has various use cases, some of which are explored in this article.

Learning Objectives

  • Understand how XLNet differs from traditional autoregressive models and why it adopts permutation language modeling (PLM).
  • Get familiar with XLNet’s architecture, including input embeddings, Transformer blocks, and self-attention mechanisms.
  • Comprehend the two-stream language modeling approach in XLNet to capture bidirectional context effectively.
  • Explore XLNet’s application domains, including natural language understanding tasks and other applications like question answering and text generation.
  • Learn practical implementation through code demonstrations for tasks such as multiple-choice question answering and text classification.

What is XLNet?

In traditional autoregressive language models like GPT (Generative Pre-trained Transformer), each token in the input sequence is predicted based on the tokens that precede it. However, this sequential nature limits the model’s ability to capture bidirectional dependencies effectively.

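PLM addresses this limitation by changing the order in which tokens are predicted. The tokens keep their original positions, but the model predicts them in a randomly sampled order, so the context for a token can include words on both its left and its right. Across many sampled orders, each token learns to use context from every other position.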
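The idea can be illustrated with a minimal sketch in plain Python. It is not XLNet's implementation; the function names and the example sentence are assumptions made for illustration. It only shows which tokens are visible as context when each token is predicted under a given factorization order.

```python
import random

def sample_factorization_order(seq_len, rng):
    """Sample one random order in which the positions will be predicted."""
    order = list(range(seq_len))
    rng.shuffle(order)
    return order

def prediction_contexts(tokens, order):
    """For each step, return (target token, tokens visible as context).

    The context holds the tokens that come earlier in the factorization
    order, listed in their original left-to-right positions.
    """
    steps = []
    for step, target_pos in enumerate(order):
        visible_positions = sorted(order[:step])
        context = [tokens[p] for p in visible_positions]
        steps.append((tokens[target_pos], context))
    return steps

tokens = ["New", "York", "is", "a", "city"]

# The identity order reproduces a standard left-to-right autoregressive model.
left_to_right = prediction_contexts(tokens, list(range(len(tokens))))

# A different order lets "York" be predicted from words on both sides of it.
permuted = prediction_contexts(tokens, [2, 4, 0, 1, 3])

# During pre-training an order would be sampled at random for each sequence.
rng = random.Random(0)
sampled = prediction_contexts(tokens, sample_factorization_order(len(tokens), rng))

for target, context in permuted:
    print(f"predict {target!r} given {context}")
```

With the left-to-right order, "is" only ever sees "New" and "York". With the permuted order, "York" is predicted from "New", "is", and "city", which is context from both directions.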

Architecture of XLNet

XLNet comprises input embeddings and a stack of Transformer blocks, each with multi-head self-attention, a position-wise feedforward network, layer normalization, and residual connections. It builds on Transformer-XL, adopting its relative positional encoding and segment-level recurrence, which help it model longer contexts. Its main departure from a standard Transformer is the two-stream self-attention used during pre-training.

Two-Stream Language Modeling

In XLNet, a two-stream self-attention mechanism is used during pre-training. With a permuted prediction order, the model must know which position it is predicting without seeing the token at that position, and a single hidden representation cannot do both. XLNet therefore maintains two sets of hidden states that share the same parameters and follow the same factorization order. They differ only in what each position is allowed to see about itself.

Content Stream: Encodes each token together with its context, including the token's own content, as in a standard Transformer.

Query Stream: Encodes the context and the target position, but not the target token's content, so it can predict that token without seeing it.

These streams allow the model to gather contextual information while preventing it from trivially copying the word it is asked to predict. The query stream is needed only during pre-training; fine-tuning uses the content stream alone.
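The difference between the two streams comes down to their attention masks. The sketch below is an illustration in plain Python, not the library's code; the function name and the example order are assumptions. It builds both masks for one factorization order.

```python
def two_stream_masks(order):
    """Build content-stream and query-stream attention masks.

    order[i] is the position predicted at step i. mask[q][k] is True when
    position q may attend to position k.
    """
    n = len(order)
    # rank[p] is the step at which position p appears in the order.
    rank = [0] * n
    for step, pos in enumerate(order):
        rank[pos] = step

    # Content stream: sees earlier positions in the order and itself.
    content_mask = [[rank[k] <= rank[q] for k in range(n)] for q in range(n)]
    # Query stream: sees earlier positions in the order, but never itself.
    query_mask = [[rank[k] < rank[q] for k in range(n)] for q in range(n)]
    return content_mask, query_mask

order = [2, 0, 3, 1]  # predict position 2 first, then 0, then 3, then 1
content_mask, query_mask = two_stream_masks(order)

for name, mask in (("content", content_mask), ("query", query_mask)):
    print(name)
    for row in mask:
        print("  " + " ".join("1" if allowed else "0" for allowed in row))
```

The two masks are identical except on the diagonal: the content stream may attend to its own token, while the query stream may not. In this toy version the first position in the order has an empty query context.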

XLNet vs. BERT

XLNet and BERT are advanced language models that significantly impact natural language processing. BERT (Bidirectional Encoder Representations from Transformers) uses a masked language modeling approach, masking some tokens in a sequence and training the model to predict these masked tokens based on the context provided by the unmasked tokens. This bidirectional context allows BERT to understand the meaning of words based on their surrounding words. BERT’s bidirectional training captures rich contextual information, making it highly effective for various NLP tasks like question answering and sentiment analysis.

XLNet, on the other hand, combines the strengths of autoregressive and autoencoding approaches. It introduces permutation language modeling, which maximizes the expected likelihood of a sequence over permutations of the order in which its tokens are predicted. This enables XLNet to capture bidirectional context without [MASK] tokens, which appear during pre-training but never during fine-tuning. It also avoids BERT's assumption that the masked tokens are independent of one another, thus preserving the dependency among words.
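For reference, the two pre-training objectives can be written side by side. The notation is a simplified form of the one used in the XLNet paper: z is a factorization order drawn from the set of all permutations of the positions 1 to T, x-hat is the corrupted (masked) input, and m_t equals 1 when position t is masked.

```latex
% XLNet: expected log-likelihood over factorization orders z
\max_{\theta} \; \mathbb{E}_{\mathbf{z} \sim \mathcal{Z}_T}
  \left[ \sum_{t=1}^{T} \log p_{\theta}\left(x_{z_t} \mid \mathbf{x}_{\mathbf{z}_{<t}}\right) \right]

% BERT: masked tokens reconstructed independently from the corrupted input
\max_{\theta} \; \sum_{t=1}^{T} m_t \, \log p_{\theta}\left(x_t \mid \hat{\mathbf{x}}\right)
```

In the first objective each token is conditioned on the tokens predicted before it, so dependencies between targets are kept. In the second, every masked token is conditioned only on the corrupted input.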

Additionally, XLNet employs a two-stream attention mechanism to handle context and word prediction better. As a result, the XLNet paper reports stronger results than BERT on a range of benchmark NLP tasks, which the authors attribute to its richer modeling of language context compared with BERT's masking-based approach.

Use Cases of XLNet

Natural Language Understanding (NLU):

XLNet can be used for tasks like sentiment analysis, text classification, named entity recognition, and language modeling. Its ability to capture bidirectional context and relationships within the text makes it suitable for various NLU tasks.

Question Answering:

You can fine-tune XLNet for question-answering tasks, where it reads a passage of text and answers questions related to it. It has shown competitive performance on benchmarks like SQuAD (Stanford Question Answering Dataset).

Text Generation:

Due to its autoregressive nature and ability to capture bidirectional context, XLNet can generate coherent and contextually relevant text. This makes it useful for tasks like dialogue generation, summarization, and machine translation.

Machine Translation:

XLNet can be fine-tuned for machine translation tasks, translating text from one language to another. Although not specifically designed for translation, its powerful language representation capabilities make it suitable for this task when fine-tuned with translation datasets.

Information Retrieval:

XLNet can be used to understand and retrieve relevant information from large volumes of text, making it valuable for applications like search engines, document retrieval, and information extraction.

How to Use XLNet for MCQs?

This code demonstrates how to use the model for multiple-choice question answering.

from transformers import AutoTokenizer, XLNetForMultipleChoice
import torch

tokenizer = AutoTokenizer.from_pretrained("xlnet/xlnet-base-cased")
model = XLNetForMultipleChoice.from_pretrained("xlnet/xlnet-base-cased")

# New prompt and choices
prompt = "What is the capital of France?"
choice0 = "Paris"
choice1 = "London"

# Encode each (prompt, choice) pair; tensors have shape (num_choices, seq_len)
encoding = tokenizer([prompt, prompt], [choice0, choice1], return_tensors="pt", padding=True)

# The model expects (batch_size, num_choices, seq_len), so add a batch dimension
inputs = {k: v.unsqueeze(0) for k, v in encoding.items()}

# Run inference without tracking gradients
model.eval()
with torch.no_grad():
    outputs = model(**inputs)

# logits has shape (batch_size, num_choices); take the highest-scoring choice
predicted_class = torch.argmax(outputs.logits, dim=-1).item()

# The multiple-choice head is randomly initialized until the model is fine-tuned,
# so the prediction from the base checkpoint is not meaningful yet
chosen_answer = choice0 if predicted_class == 0 else choice1
print(f"Predicted Answer: {chosen_answer}")

After defining a prompt and choices, the code encodes them with the tokenizer and passes them through the model to obtain one logit per choice. The predicted answer is the choice with the highest logit. Note that the multiple-choice head is not part of the pre-trained checkpoint and is randomly initialized, so the base model's answer is not reliable. Fine-tuning on a reasonably sized dataset of prompts and choices is needed before the predictions become useful.
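The step from logits to an answer can be sketched in plain Python. The logit values here are hypothetical placeholders, not real model outputs, and the helper names are assumptions made for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pick_choice(choices, logits):
    """Return the highest-scoring choice and its probability."""
    probs = softmax(logits)
    best = max(range(len(choices)), key=lambda i: probs[i])
    return choices[best], probs[best]

# Hypothetical logits, one per choice; real values would come from outputs.logits
choices = ["Paris", "London"]
example_logits = [2.0, 0.5]

answer, confidence = pick_choice(choices, example_logits)
print(f"{answer} ({confidence:.2f})")
```

Softmax does not change which choice wins, since it preserves the ordering of the logits. It only turns the scores into probabilities that are easier to interpret.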

XLNet for Text Classification

The following Python code demonstrates text classification using XLNet.

from transformers import XLNetTokenizer, TFXLNetForSequenceClassification
import tensorflow as tf

import warnings

# Ignore all warnings
warnings.filterwarnings("ignore")

# Define labels (modify as needed)
labels = ["Positive", "Negative"]

# Load tokenizer and pre-trained model
tokenizer = XLNetTokenizer.from_pretrained('xlnet-base-cased')
model = TFXLNetForSequenceClassification.from_pretrained('xlnet-base-cased', num_labels=len(labels))

# Sample text data
text_data = ["This movie was amazing!", "I hated this restaurant."]

# Preprocess text (tokenize and pad to the longest text in the batch;
# the XLNet tokenizer has no predefined maximum length to pad to)
encoded_data = tokenizer(text_data, padding=True, return_tensors="tf")

# Perform classification
outputs = model(encoded_data)
predictions = tf.nn.softmax(outputs.logits, axis=-1)

# Print predictions
for i, text in enumerate(text_data):
    predicted_label = labels[tf.argmax(predictions[i]).numpy()]
    print(f"Text: {text}\nPredicted Label: {predicted_label}")

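The tokenizer preprocesses the sample texts, tokenizing them and padding them to a common length. The model then runs on the encoded data and returns one logit per label. A softmax over the logits gives the predicted probability of each label; for multi-label problems, where several labels can apply at once, a sigmoid would be applied to each logit instead. Because the classification head is newly initialized when loading the base checkpoint, the predictions are arbitrary until the model is fine-tuned on labeled data.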
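The choice between the two output functions can be shown with a short sketch in plain Python. The labels, logits, and threshold below are hypothetical values chosen for illustration.

```python
import math

def sigmoid(x):
    """Map a single logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def single_label_prediction(labels, logits):
    """Exactly one label applies: softmax preserves order, so take the largest logit."""
    return labels[max(range(len(labels)), key=lambda i: logits[i])]

def multi_label_predictions(labels, logits, threshold=0.5):
    """Any number of labels may apply: score each logit independently."""
    return [label for label, z in zip(labels, logits) if sigmoid(z) >= threshold]

# Hypothetical outputs for one review
sentiment = single_label_prediction(["Positive", "Negative"], [1.7, -0.4])
topics = multi_label_predictions(["food", "service", "price"], [2.1, 0.3, -1.5])

print(sentiment)
print(topics)
```

In the single-label case the probabilities compete and sum to 1. In the multi-label case each label is accepted or rejected on its own, so zero, one, or several labels can be returned.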

Conclusion

In summary, XLNet offers an innovative approach to language understanding through permutation language modeling (PLM). By training over sampled permutations of the order in which tokens are predicted, XLNet captures bidirectional context without the need for masking, addressing the limitations of both traditional autoregressive models like GPT and masked language models like BERT.

Frequently Asked Questions

Q1. What is the main difference between XLNet and traditional autoregressive models like GPT?

A. XLNet uses permutation language modeling (PLM), which trains over permutations of the order in which tokens are predicted, unlike traditional autoregressive models, which always predict tokens from the preceding tokens in a fixed left-to-right order. This approach helps XLNet effectively capture bidirectional context.

Q2. How does XLNet differ from BERT in handling language context?

A. While BERT uses masked language modeling (MLM) to predict masked tokens based on their context, XLNet employs permutation language modeling (PLM), which captures bidirectional context without masking. XLNet also uses a two-stream attention mechanism for better context handling and word prediction.

Q3. What are some practical applications of XLNet?

A. XLNet can be used for various natural language understanding tasks such as sentiment analysis, text classification, named entity recognition, and language modeling. It performs well in question answering, text generation, machine translation, and information retrieval tasks.
