Summarize Live Twitter Data Using Pre-trained NLP Models
Twitter users spend an average of 4 minutes on Twitter. For about 1 minute of that time, they read content they have already seen. That means users spend around 25% of their time re-reading the same content.
Also, most tweets will never appear on your dashboard. You may get to know the trending topics, but you miss the topics that are not trending. Even within the trending topics, you might only read the top 5 tweets and their comments.
So, how can you avoid wasting time on Twitter?
I would say: summarize all the data for the trending Twitter hashtags. Then you can finish reading all the trending tweets in less than 2 minutes.
In this article, I will explain how you can leverage pre-trained Natural Language Processing (NLP) models to summarize Twitter posts based on hashtags. We will use four pre-trained models for this job: T5, BART, GPT-2, and XLNet.
Why use 4 types of pre-trained models for summarization?
Each pre-trained model has its own architecture and weights, so the summaries these models produce can differ from each other.
Test the Twitter data on the different models, choose the model whose summary is closest to your own understanding of the tweets, and then deploy that model to production.
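One way to run such a comparison is sketched below. The function name compare_summaries and the dictionary of callables are illustrative assumptions, not part of any library; each callable stands for one of the four models and takes the text and returns its summary.

```python
from typing import Callable, Dict


def compare_summaries(
    text: str, summarizers: Dict[str, Callable[[str], str]]
) -> Dict[str, str]:
    """Run every summarizer on the same text and collect the outputs by name."""
    results = {}
    for name, summarize in summarizers.items():
        try:
            results[name] = summarize(text)
        except Exception as exc:
            # One failing model should not stop the whole comparison.
            results[name] = f"<failed: {exc}>"
    return results
```

You would pass in one function per model and then read the returned summaries side by side before picking the model to deploy.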
Let’s start by collecting live Twitter data.
Twitter Live Data
You can get Twitter live data in 2 ways.
- Official Twitter API. Follow this article to get a Twitter dataset.
- Use the Beautiful Soup library to scrape the data from Twitter.
I will use the first option to fetch the data. Once you receive the credentials for the Twitter API, use them to fetch Twitter data through the API.
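The fetching step could look roughly like the sketch below. It assumes the tweepy library (version 4 or later) and a bearer token for the Twitter API v2, which may not match the setup used for this article. The helper names clean_tweet, fetch_tweets, and build_document are my own and purely illustrative.

```python
import re
from typing import List


def clean_tweet(text: str) -> str:
    """Strip URLs, @mentions and the '#' sign, then collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)  # drop links
    text = re.sub(r"@\w+", "", text)          # drop user mentions
    text = text.replace("#", "")              # keep the hashtag word, drop '#'
    return re.sub(r"\s+", " ", text).strip()


def fetch_tweets(bearer_token: str, hashtag: str, limit: int = 50) -> List[str]:
    """Fetch recent English tweets for a hashtag (needs network access and credentials)."""
    import tweepy  # imported lazily so the helpers in this sketch work without it

    client = tweepy.Client(bearer_token=bearer_token)
    response = client.search_recent_tweets(
        query=f"#{hashtag} -is:retweet lang:en",
        max_results=min(max(limit, 10), 100),  # this endpoint accepts 10 to 100
    )
    return [tweet.text for tweet in (response.data or [])]


def build_document(tweets: List[str]) -> str:
    """Clean each tweet and join the non-empty ones into one block of text."""
    cleaned = [clean_tweet(t) for t in tweets]
    return " ".join(c for c in cleaned if c)
```

The output of build_document is the single text that each summarization model in the following sections would receive.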
Now, let’s start summarizing data using pre-trained models one by one.
1. Summarization using T5 Model
T5 is a state-of-the-art model used for various NLP tasks, including summarization. We will use the transformers library to download the pre-trained T5 model and load it in code.
The Transformers library is an open-source library developed and maintained by the Hugging Face team.
Learn more about the T5 model here.
Here is code to summarize the Twitter dataset using the T5 model.
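A minimal sketch of this step follows, assuming the transformers, torch, and sentencepiece packages are installed. The function names, the choice of t5-small, and the generation settings (beam count, length limits) are my assumptions for illustration, not tuned values.

```python
def build_t5_input(text: str) -> str:
    """T5 is a text-to-text model, so the task is given as a prefix."""
    return "summarize: " + " ".join(text.split())


def summarize_with_t5(text: str, model_name: str = "t5-small") -> str:
    """Abstractive summary with a T5 checkpoint (downloads weights on first use)."""
    from transformers import T5ForConditionalGeneration, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained(model_name)
    model = T5ForConditionalGeneration.from_pretrained(model_name)

    # Tokenize to PyTorch tensors and cut the input at 512 tokens.
    inputs = tokenizer(
        build_t5_input(text),
        return_tensors="pt",
        max_length=512,
        truncation=True,
    )
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_beams=4,
        min_length=30,           # lower bound on summary tokens (assumed value)
        max_length=120,          # upper bound on summary tokens (assumed value)
        length_penalty=1.0,
        no_repeat_ngram_size=2,  # avoid repeating the same two-word phrase
        early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

Calling summarize_with_t5 on the joined tweet text returns one summary string; swapping model_name for a larger checkpoint trades speed for quality.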
Observation on Code
- You can use different pre-trained T5 checkpoints, which differ in size and weights. The versions available in the transformers library are t5-small, t5-base, t5-large, t5-3b, and t5-11b.
- The return_tensors value should be pt for PyTorch.
- The maximum input sequence length used to train these pre-trained models is 512 tokens, so keep the max_length value of the tokenizer at 512.
- The summary tends to get longer as the length_penalty value increases. length_penalty=1.0 is the default.
2. Summarization using BART models
BART combines a BERT-style bidirectional encoder with a GPT-style left-to-right decoder in a seq2seq architecture. BART achieves state-of-the-art results on the summarization task.
The pre-trained BART model used here is fine-tuned on CNN/Daily Mail data for summarization, but it also gives good results on the Twitter dataset.
We will take advantage of the Hugging Face transformers library to download the BART model and then load it in code.
Here is code to summarize the Twitter dataset using the BART model.
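The sketch below shows one way to do this, assuming the transformers and torch packages and the facebook/bart-large-cnn checkpoint from the Hugging Face hub. The function name and the default length limits are illustrative assumptions.

```python
def summarize_with_bart(
    text: str,
    model_name: str = "facebook/bart-large-cnn",
    min_length: int = 30,
    max_length: int = 120,
) -> str:
    """Abstractive summary with a BART checkpoint (downloads weights on first use)."""
    if min_length >= max_length:
        raise ValueError("min_length must be smaller than max_length")

    from transformers import BartForConditionalGeneration, BartTokenizer

    tokenizer = BartTokenizer.from_pretrained(model_name)
    model = BartForConditionalGeneration.from_pretrained(model_name)

    # This checkpoint accepts up to 1024 input tokens; longer input is cut off.
    inputs = tokenizer(text, return_tensors="pt", max_length=1024, truncation=True)
    summary_ids = model.generate(
        inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        num_beams=4,
        min_length=min_length,
        max_length=max_length,
        early_stopping=True,
    )
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)
```

The min_length and max_length arguments count tokens of the generated summary, so they are the knobs to turn when the output is too short or too long.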
Observation on Code
- You can increase or decrease the length of the summary using min_length and max_length. Ideally, the summary length should be 10% to 20% of the total article length.
- This model is best suited to summarizing news articles, but it can also give good results on Twitter data.
- Other BART checkpoints such as bart-base, bart-large, and bart-large-mnli are also available, but bart-large-cnn is the one fine-tuned for summarization; bart-large-mnli is fine-tuned for natural language inference.
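The 10% to 20% rule of thumb above can be turned into min_length and max_length values with a small helper. This is only a sketch: the function name summary_length_bounds is hypothetical, and it expects the input length to be measured in the same unit (tokens) that the model uses for its length limits.

```python
from typing import Tuple


def summary_length_bounds(
    n_tokens: int, low_pct: int = 10, high_pct: int = 20
) -> Tuple[int, int]:
    """Return (min_length, max_length) as 10% to 20% of the input length."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    min_length = max(1, n_tokens * low_pct // 100)
    # Keep max_length strictly above min_length, even for very short inputs.
    max_length = max(min_length + 1, n_tokens * high_pct // 100)
    return min_length, max_length
```

For a 500-token input this gives a summary of between 50 and 100 tokens.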
3. Summarization using GPT-2 model
GPT-2 is a large transformer-based language model with 1.5 billion parameters. It is trained to predict the next word, and we can use this ability to summarize Twitter data.
GPT-2 comes in several versions, and the larger versions are more than 1 GB in size.
We will use the bert-extractive-summarizer library to download the GPT-2 models. Learn more about the bert-extractive-summarizer library here.
Use the pip install bert-extractive-summarizer command to install the library.
Here is the code to summarize the Twitter dataset using the GPT-2 model.
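A minimal sketch with the bert-extractive-summarizer library is shown below. The wrapper function summarize_with_gpt2, the key check, and the default values are my assumptions. Note that this library is extractive: it selects sentences from the input rather than writing new ones.

```python
GPT2_KEYS = ("gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl")


def summarize_with_gpt2(
    text: str,
    model_key: str = "gpt2-medium",
    min_length: int = 60,
    max_length: int = 300,
) -> str:
    """Extractive summary using GPT-2 embeddings (downloads weights on first use)."""
    if model_key not in GPT2_KEYS:
        raise ValueError(f"unknown GPT-2 model key: {model_key!r}")

    from summarizer import TransformerSummarizer

    model = TransformerSummarizer(
        transformer_type="GPT2", transformer_model_key=model_key
    )
    # As far as I understand the library, min_length and max_length here bound
    # the character length of the sentences considered, not of the summary.
    return "".join(model(text, min_length=min_length, max_length=max_length))
```

Because the summarizer works on sentences, the tweets should be joined so that sentence boundaries stay recognizable.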
Observation on Code
- The transformer_type value will vary according to the pre-trained model we use.
- You can change the transformer_model_key as required. GPT-2 has four versions: gpt2, gpt2-medium, gpt2-large, and gpt2-xl.
- This library also has min_length and max_length options. You can assign values to them as per your requirement.
4. Summarization using XLNet model
XLNet is an improved version of the BERT model that implements permutation language modeling in its architecture. XLNet is also a bidirectional transformer in which tokens are predicted in random order.
The XLNet model has two versions: xlnet-base-cased and xlnet-large-cased.
Here is the code to summarize the Twitter dataset using the XLNet model.
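The sketch below assumes the same bert-extractive-summarizer library as in the GPT-2 section, with only the transformer type and model key changed. The wrapper name summarize_with_xlnet and its defaults are illustrative.

```python
XLNET_KEYS = ("xlnet-base-cased", "xlnet-large-cased")


def summarize_with_xlnet(
    text: str,
    model_key: str = "xlnet-base-cased",
    min_length: int = 60,
    max_length: int = 300,
) -> str:
    """Extractive summary using XLNet embeddings (downloads weights on first use)."""
    if model_key not in XLNET_KEYS:
        raise ValueError(f"unknown XLNet model key: {model_key!r}")

    from summarizer import TransformerSummarizer

    model = TransformerSummarizer(
        transformer_type="XLNet", transformer_model_key=model_key
    )
    return "".join(model(text, min_length=min_length, max_length=max_length))
```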
Observation on Code
- You can change the value of min_length and max_length as per your requirement.
- This model will trim the input if it exceeds 512 tokens.
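If you do not want long input to be trimmed, one option is to split the text into smaller chunks and summarize each chunk separately. The helper below is a rough sketch under my own assumptions: it counts whitespace-separated words, which only approximates the model's token count, and the default of 350 words is a guess meant to stay under the 512 limit.

```python
from typing import List


def split_into_chunks(text: str, max_words: int = 350) -> List[str]:
    """Split text into consecutive chunks of at most max_words words."""
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[i : i + max_words])
        for i in range(0, len(words), max_words)
    ]
```

Each chunk can then be passed to the summarizer on its own, and the partial summaries joined at the end.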
Other use-cases of Summarization
- Summarize each article and present it to the readers as a summary.
- You can use this method to generate high-quality summaries for SEO, which helps your articles get discovered on Google.
- Summarize the whole comment section of a post, for example on Reddit or Twitter.
- You can summarize the whitepapers, e-books, or blog posts and share them on your social media platform.
In this article, we have summarized live Twitter data using the T5, BART, GPT-2, and XLNet pre-trained models. Each model generates a different summary for the same dataset. The summaries from T5 and BART outperformed those from GPT-2 and XLNet.
These pre-trained models can also summarize articles, e-books, and blogs with human-level performance. In the future, you can expect many more improvements in summarization, which will help you solve many summarization-related tasks.