We are standing at the intersection of language and machines. I’m fascinated by this topic. Can a machine write as well as Shakespeare? What if a machine could improve my own writing skills? Could a robot interpret a sarcastic remark?
I’m sure you’ve asked these questions before. Natural Language Processing (NLP) aims to answer these very questions, and I must say, there has been groundbreaking research in this field towards bridging the gap between humans and machines.
One of the core ideas in NLP is text classification. If a machine can differentiate between a noun and a verb, or if it can detect a customer’s satisfaction with a product in their review, we can use this understanding for other advanced NLP tasks like understanding context or even generating a brand new story!
That’s primarily the reason we’ve seen a lot of research in text classification. And yes, the advent of transfer learning has definitely helped accelerate the research. We are now able to use a pre-existing model built on a huge dataset and tune it to achieve other tasks on a different dataset.
Transfer learning and pretrained models have 2 major advantages:

- They reduce the cost and time of training, since you fine-tune an existing model instead of building one from scratch
- Because the pretrained model has already learned from a huge dataset, you can get strong results even when your own labeled dataset is small
You can see why there’s been a surge in the popularity of pretrained models. We’ve seen the likes of Google’s BERT and OpenAI’s GPT-2 really take the bull by the horns. I’ll cover 6 state-of-the-art text classification pretrained models in this article.
I assume that you are aware of what text classification is. Here’s a comprehensive tutorial to get you up to date:
We can’t review state-of-the-art pretrained models without mentioning XLNet!
Google’s latest model, XLNet, achieved State-of-the-Art (SOTA) performance on major NLP tasks such as Text Classification, Sentiment Analysis, Question Answering, and Natural Language Inference, along with the essential GLUE benchmark for English. It outperformed BERT and has now cemented itself as the model to beat not only for text classification, but also for advanced NLP tasks.
The core ideas behind XLNet are:

- Permutation Language Modeling, a generalized autoregressive pre-training objective
- Transformer-XL as the underlying architecture
If this sounds complicated, don’t worry! I’ll break this down into simple words.
Autoregressive modeling predicts the next word using the context words that occur either before or after it. However, an autoregressive model can only work in one direction at a time – it can’t process the forward and backward contexts together.
Though BERT’s autoencoding objective did take care of this aspect (it sees context from both sides at once), it had other disadvantages, like assuming that the masked words are independent of each other. To combat this, XLNet proposes a technique called Permutation Language Modeling during the pre-training phase. This technique samples permutations of the factorization order so that the model learns from both the forward and backward directions simultaneously.
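To make this concrete, here is a toy Python sketch of the idea (my own simplification, not XLNet’s actual implementation): sample a factorization order, then predict each token using only the tokens that come before it in that order. Across many sampled orders, every token eventually gets predicted from both its left and its right context.

```python
import random

# Toy illustration of permutation language modeling (not XLNet's real code).
tokens = ["I", "like", "going", "to", "New", "York"]
order = list(range(len(tokens)))
random.shuffle(order)  # a sampled factorization order, e.g. [3, 0, 5, 1, 4, 2]

for step, position in enumerate(order):
    # Each token is predicted from the tokens earlier in the sampled order,
    # which can lie to its left OR its right in the actual sentence.
    context = [tokens[i] for i in order[:step]]
    print(f"predict {tokens[position]!r} from context {context}")
```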
It is no secret that the Transformer architecture has been a game-changer, and XLNet uses Transformer-XL. As we know, transformers were an alternative to recurrent neural networks (RNNs) in the sense that they allow non-adjacent tokens to be processed together as well. This improved understanding of long-distance relations in text. Transformer-XL is basically an enhanced version of the Transformer used in BERT, adding two components:

- a segment-level recurrence mechanism, which caches and reuses hidden states from the previous text segment
- relative positional encodings, which replace absolute positions so that attention stays consistent across recurring segments
As I mentioned previously, XLNet outperformed BERT on almost all tasks, including Text Classification, and achieved SOTA performance on 18 of them!
Here is a summary of how XLNet performs on different Text Classification datasets and the high ranks it has achieved on them:
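If you want to try XLNet on your own classification data, here is a minimal sketch using the Hugging Face transformers library (my assumption for the tooling – the paper’s own code is also on GitHub); the checkpoint name and the number of labels are just illustrative:

```python
import torch
from transformers import XLNetTokenizer, XLNetForSequenceClassification

# Load a pretrained XLNet with a fresh classification head (2 labels here).
tokenizer = XLNetTokenizer.from_pretrained("xlnet-base-cased")
model = XLNetForSequenceClassification.from_pretrained(
    "xlnet-base-cased", num_labels=2
)

inputs = tokenizer("I like going to New York", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # head is untrained: fine-tune before trusting it
print(logits.softmax(dim=-1))
```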
Though ERNIE 1.0 (released in March 2019) was a popular model for text classification, it was ERNIE 2.0 that became the talk of the town in the latter half of 2019. Developed by the tech giant Baidu, ERNIE outperformed Google’s XLNet and BERT on the GLUE benchmark for English.
ERNIE stands for Enhanced Representation through kNowledge IntEgration, and ERNIE 2.0 is an upgraded version of ERNIE 1.0. ERNIE 1.0 was pathbreaking in its own way – it was one of the first models to leverage Knowledge Graphs. This incorporation further enhanced training the model for advanced tasks like Relation Classification and Named Entity Recognition (NER).
Like its predecessor, ERNIE 2.0 brings another innovation to the table in the form of Continual Incremental Multi-task Learning. Basically, this means that the model defines 7 clear pre-training tasks and learns them incrementally: new tasks are introduced one at a time, while the previously learned tasks continue to be trained alongside them, so earlier knowledge is not forgotten.
I really like how intuitive this process is since it follows a human way of understanding text. Our brain doesn’t only register “I like going to New York” as a positive sentence, does it? It simultaneously picks out the nouns “New York” and “I”, understands the verb “like”, and infers that New York is a place.
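To illustrate the training schedule, here is a hypothetical sketch of a continual multi-task loop – the task names and the train_step function are placeholders of mine, not ERNIE 2.0’s actual seven tasks:

```python
import random

# Hypothetical sketch of continual incremental multi-task learning:
# each stage introduces one new task, but batches are still sampled
# from every task seen so far, so earlier skills are retained.
all_tasks = ["masked_words", "sentence_order", "sentence_distance", "sentiment"]

def train_step(task):
    print(f"training step on: {task}")  # placeholder for a real optimizer step

seen_tasks = []
for new_task in all_tasks:
    seen_tasks.append(new_task)           # incrementally add the next task
    for _ in range(3):                    # a few toy steps per stage
        train_step(random.choice(seen_tasks))
```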
ERNIE achieves a SOTA F1-Score of 88.32 on the Relation Extraction Task.
I’ll be honest – I had the most fun studying this model as compared to the others. Google’s new Text-to-Text Transfer Transformer (T5) model uses transfer learning for a variety of NLP tasks.
The most interesting part is that it casts every problem as text-to-text: the input is text, and the output is also text. So, even for a classification task, the input will be text, and the output will be a word rather than a label. This means a single model can handle all the tasks. Not only that, the output of one task can be used as the input for the next task.
The training corpus is an enhanced version of Common Crawl, which is basically text scraped from the web. The paper actually highlights the importance of cleaning this data and clearly elucidates how it was done. Though the crawl produces around 20TB of scraped data per month, most of it is not suitable for NLP tasks.
Even after retaining only text content (pages containing markup, code, etc. were removed), the corpus is still a whopping 750GB, which is much larger than most datasets.
Note: This corpus has also been released as a TensorFlow dataset: c4.
The task to be performed is encoded as a prefix along with the input. As you can see in the diagram above, be it a classification or a regression task, the T5 model still generates new text to produce the output.
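Here is a minimal sketch of this prefix trick using the Hugging Face transformers port of T5 – an assumption on my part for the tooling, with the t5-small checkpoint and the sst2 sentiment prefix described in the paper:

```python
import torch
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# The task is selected purely by the text prefix; the answer comes back as text.
inputs = tokenizer("sst2 sentence: I like going to New York", return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))  # e.g. "positive"
```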
T5 achieves SOTA on more than 20 established NLP tasks – this is rare, and on several of these benchmarks its metrics come close to human performance.
The T5 model follows the recent trend of pre-training on unlabelled data and then fine-tuning on labeled text. Understandably, this model is huge, and it would be interesting to see further research on scaling down such models for wider usage and distribution.
As we have seen so far, the Transformer architecture is quite popular in NLP research. The BP Transformer again uses the transformer, or rather an enhanced version of it, for text classification, machine translation, etc.
However, using a transformer is still a costly process because of its self-attention mechanism. Self-attention just means that we perform the attention operation on the sentence itself, as opposed to between 2 different sentences (that is plain attention). Self-attention helps identify the relationships between the words in a single sentence, but since every token attends to every other token, the cost grows quadratically with sentence length – and it is this that makes transformers expensive.
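A quick sketch of why this is costly (toy dimensions, plain dot-product attention without the learned projections you’d have in a real transformer): the score matrix is n × n, so the work grows quadratically with sentence length.

```python
import torch
import torch.nn.functional as F

n, d = 6, 16                    # 6 tokens, a toy embedding size
x = torch.randn(n, d)           # token embeddings for one sentence
scores = x @ x.T / d ** 0.5     # an n x n matrix: the quadratic bottleneck
attn = F.softmax(scores, dim=-1)
out = attn @ x                  # each token becomes a weighted mix of all tokens
print(scores.shape)             # torch.Size([6, 6])
```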
The Binary-Partitioning Transformer (BPT) aims to improve the efficiency of the self-attention mechanism by treating the transformer as a graph neural network. Essentially, each node in this graph represents an input token.
This is how the BP Transformer works:
Step 1: Divide the sentence into 2 parts recursively until some stopping condition is reached. This is called binary partitioning. So, for example, the sentence “I like going to New York” gets split into progressively smaller spans – the sketch after the note below prints every partition.
Note: A sentence with n words will have 2n – 1 partitions, and in the end, you have a complete binary tree.
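Here is a small sketch of binary partitioning (my own simplification, using midpoint splits); it prints all 2n – 1 = 11 partitions for the six-word example:

```python
def binary_partition(span):
    """Recursively halve a token span until single tokens remain."""
    nodes = [span]
    if len(span) > 1:
        mid = len(span) // 2
        nodes += binary_partition(span[:mid])
        nodes += binary_partition(span[mid:])
    return nodes

sentence = "I like going to New York".split()
for node in binary_partition(sentence):
    print(node)  # 11 nodes in total, forming a complete binary tree
```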
Step 2: Each partition is now a node in the Graph Neural Network. There can be 2 types of edges:

- affiliated edges, which connect a span node to the token nodes contained within it
- contextual edges, which connect a token node to span nodes capturing its context at different scales
Step 3: Perform self-attention on each node of the graph, attending only to its neighboring nodes.
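Building on the self-attention sketch above, the only change needed for graph-restricted attention is a mask: each node scores only against its neighbours. The adjacency matrix below is a placeholder of mine – the real BPT derives it from the partition tree’s edges.

```python
import torch
import torch.nn.functional as F

n, d = 11, 16                       # 11 partition nodes from the example above
x = torch.randn(n, d)               # node representations
adj = torch.eye(n)                  # placeholder adjacency; BPT builds it from the tree
adj[0, 1], adj[1, 0] = 1.0, 1.0     # pretend nodes 0 and 1 are neighbours

scores = x @ x.T / d ** 0.5
scores = scores.masked_fill(adj == 0, float("-inf"))  # drop non-neighbour pairs
attn = F.softmax(scores, dim=-1)    # attention only over each node's neighbours
out = attn @ x
```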
The BPT achieves:
I appreciate this model in the sense that it made me revisit the concept of graphs and venture into looking up Graph Neural Networks. I am admittedly late to the party, but I will surely be exploring more about Graph Neural Networks in the near future!
Neural networks have long been the most popular models for NLP tasks, and they generally outperform the more traditional models. Additionally, representing words with the entities they refer to while building the knowledge base from the corpus has improved model learning.
This means that instead of building the vocabulary from the words in a corpus, we build a bag of entities using Entity Linking. Though there has been prior research on representing the corpus this way, the NABoE model goes a step further by:

- using a neural attention mechanism to weigh the candidate entities detected for each word, and
- combining the word embeddings with the embeddings of only the relevant entities
The Neural Attentive Bag of Entities model uses the Wikipedia corpus to detect the candidate entities associated with a word. For example, the word “Apple” can refer to the fruit, the company, and other possible entities. Once all these candidates are retrieved, the weight of each entity is calculated using a softmax-based attention function. This yields a smaller subset of entities that are relevant to that particular document.
In the end, the final representation of the word is its vectorized embedding combined with the vectorized embeddings of the relevant entities associated with it.
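Here is a toy sketch of that attention step – the dimensions and the similarity function are my simplifications, not the paper’s exact formulation:

```python
import torch
import torch.nn.functional as F

d = 8
word = torch.randn(d)                    # embedding of the mention, e.g. "Apple"
candidates = torch.randn(3, d)           # candidate entities: fruit, company, ...

scores = candidates @ word / d ** 0.5    # how well each entity fits this mention
weights = F.softmax(scores, dim=0)       # softmax-based attention weights
entity_vec = weights @ candidates        # weighted mix of the relevant entities
final = torch.cat([word, entity_vec])    # word embedding combined with entity info
print(weights)
```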
The NABoE model performs particularly well on Text Classification tasks:
Now, it might appear counter-intuitive to study all these advanced pretrained models and then, at the end, discuss a model that uses a (relatively) plain old Bidirectional LSTM to achieve SOTA performance. But that is precisely why I decided to introduce it at the end.
Oftentimes, we miss the forest for the trees. We tend to forget that a simple well-tuned model might achieve just as good results as these complex deep learning models. This paper aims to explain just that.
A combination of a Bidirectional LSTM and regularization is able to achieve SOTA performance on the IMDb document classification task, standing shoulder-to-shoulder with the other bigwigs in this domain.
The most intriguing and noteworthy aspects of this paper are:

- a simple, well-regularized Bidirectional LSTM can match far more complex architectures, and
- careful tuning and regularization do most of the heavy lifting
This minimalistic model uses the Adam optimizer, temporal averaging, and dropout to achieve this high score. The paper empirically compares these results against other deep learning models and demonstrates that this model is simple but effective – the results speak for themselves.
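For reference, here is what a minimal BiLSTM classifier looks like in PyTorch – a sketch with illustrative hyperparameters, not the exact configuration from the paper:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=256, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.5)            # regularization does the heavy lifting
        self.fc = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq, embed)
        outputs, _ = self.lstm(embedded)          # (batch, seq, 2 * hidden)
        pooled = outputs.max(dim=1).values        # max-pool over the time dimension
        return self.fc(self.dropout(pooled))

model = BiLSTMClassifier(vocab_size=30000)
logits = model(torch.randint(0, 30000, (4, 128)))  # a dummy batch of 4 documents
print(logits.shape)                                # torch.Size([4, 2])
```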
This kind of model is a compelling option for industry, where it is important to build production-ready models and still achieve high scores on your metrics.
Here, we discussed the top 6 pretrained models that recently achieved state-of-the-art benchmarks in text classification. These NLP models show that there are many more yet to come, and I will be looking forward to learning about them this year.
The one awesome element in all this research is the availability and open-source nature of these pretrained models. All the above models have a GitHub repository and are available for implementation. Another aspect that cannot be ignored is that they are available on PyTorch as well. This emphasizes how PyTorch is fast replacing TensorFlow as THE platform to build your deep learning models.
I encourage you to try out these models on various datasets and experiment with them to understand how they work. If you have some models in mind which were just as cool but went under the radar last year, do mention them in the comments below!