Alifia Ghantiwala — November 28, 2021

This article was published as a part of the Data Science Blogathon.

Imagine that instead of reading an entire article or research paper, you could read only its most important statements. Text summarization makes this possible. A text summarizer takes a sequence of words (the input article) and returns another sequence of words (the summary); models of this kind are called sequence-to-sequence models. Text summarization is useful in domains such as financial research, question-answering bots, media monitoring, and social media marketing. This article covers text summarization in detail; the topics are listed below.

1) Types of neural text summarization

2) Using a pre-trained summarizer and evaluating its output

3) Understanding BLEU score and its calculation

4) Coding a text summarizer in Python from scratch

Types of neural text summarization

In school, most of us had to condense long articles into succinct summaries. The technique we used was to grasp the underlying idea of the text and write a summary that covered all the important points. Abstractive text summarization works in a similar way: the machine learning model outputs the main idea of the input text using similar words, but not the exact sentences of the input.

The second type is extractive summarization, in which the model's output is a subset of the input text that conveys the main idea of the article. As a personal analogy, you can think of extractive summarization as highlighting the important points of a reference paper that you are trying to understand.

As you may have guessed, extractive summarization is simpler to model than abstractive summarization. In abstractive summarization, the model is expected to understand language and its nuances in order to produce a valid summary. In extractive summarization, the model only has to score the sentences of the input (scoring is discussed in detail later in this article), apply a threshold, and output the most important sentences.

Naturally, there is more research available for extractive summarization than for abstractive summarization. In this article, we look at extractive summarization in more detail.

Using a pre-trained summarizer and evaluating its output

What do we mean by pre-trained models? These are models that have already been trained on large datasets. A model trained on a large amount of data generally predicts better, but collecting that much data is difficult and training on it takes a long time. These are some of the reasons to use a pre-trained model instead of training one from scratch.

We will use the BBC News Summary dataset for this article and bert-extractive-summarizer as the pre-trained model.

The code snippet below installs the necessary libraries.

!pip install bert-extractive-summarizer
!pip install spacy
!pip install transformers # > 4.0.0
!pip install neuralcoref
!python -m spacy download en_core_web_md

After installing the above libraries and downloading the spaCy model, we create the summarizer and pass it a sample text to view its output.

from summarizer import Summarizer

model = Summarizer()

text = "Learning NLP involves understanding basic principles of machine learning which then need to be customized for words. With the advent of using transfer learning for NLP I think it has made a huge progress in terms of its research"

# Calling the model on a string returns the summary as a string.
print(model(text))

The model returns an appropriate summary of our input text.

Now let us use the same model on our BBC News dataset; the snippet below does this. The dataset contains 2225 articles with an average length of 3000 words, so to save execution time I have generated summaries only for the first 10 articles.

from tqdm import tqdm

bert_predicted_summary = []
# Summarize only the first 10 articles to keep the run time short.
for article in tqdm(df['text'][:10]):
    bert_predicted_summary.append(model(str(article)))

The output is shown below: the first text is the summary predicted by the pre-trained model, and the second is the reference summary provided in the dataset.

[Figure: predicted summary from the pre-trained model, followed by the reference summary from the dataset]

Simple preprocessing, such as removing newline characters (\n) and byte-string markers (b') left over from reading raw files, is always recommended. As the popular saying goes, garbage in, garbage out, so we need to clean our input before passing it to our model. I have used simple regular expressions to preprocess the input; the code snippet is below.

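import os
import re

path = '/kaggle/input/bbc-news-summary/bbc news summary/BBC News Summary/News Articles/'
text = []   # cleaned article bodies
type_ = []  # news category (sub-folder name) of each article
for category in os.listdir(path):
    for filename in os.listdir(os.path.join(path, category)):
        # Read raw bytes and decode leniently, so undecodable bytes are dropped
        # and no b'...' residue ends up in the text.
        with open(os.path.join(path, category, filename), 'rb') as f:
            article = f.read().decode('utf-8', errors='ignore')
        # Collapse newlines, tabs and non-breaking spaces into single spaces.
        article = re.sub(r'\s+', ' ', article).strip()
        article = article.lower()
        text.append(article)
        type_.append(category)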

To evaluate the output we use the BLEU score, which the next section of the article covers in detail.

Understanding BLEU score and its calculation

BLEU stands for Bilingual Evaluation Understudy. It is a metric widely used for machine translation, text generation, and other models whose output is a sequence of words. Let us understand how it is calculated.

BLEU scores range between 0 and 1, where 0 means no match between the expected output and the predicted output and 1 means a perfect match. BLEU can be considered a modification of precision that handles sequence outputs.

Consider an example. Suppose our predicted summary (the candidate) is “awesome awesome awesome” and our actual or expected summary (the reference) is “NLP is awesome”. Every word in the predicted output is present in the reference, so its precision is 1; however, we can all agree that it is a pretty bad summary.

To overcome this, BLEU makes a simple modification: it clips the number of times a word is counted in the candidate to the maximum number of times it appears in the reference. In our example, the score now becomes 1/3, as “awesome” appears only once in the reference.
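The clipping step can be sketched in a few lines of Python. This is a minimal illustration, not NLTK's implementation; the function name clipped_unigram_precision and the whitespace tokenization are assumptions made for the example.

```python
from collections import Counter

def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Unigram precision with each candidate word's count clipped to its count in the reference."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # A word is credited at most as many times as it occurs in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

plain = 3 / 3  # unclipped: all three candidate words occur in the reference
clipped = clipped_unigram_precision("awesome awesome awesome", "NLP is awesome")
print(plain, round(clipped, 3))  # 1.0 0.333
```

Counter returns 0 for words missing from the reference, so words that never appear there earn no credit at all.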

Take another example. Let’s say our reference is “I want to learn NLP” and our candidate is “NLP is what I want to learn”. If we consider only unigrams, every word of the reference appears in the candidate and 5 of the 7 candidate words match, giving a unigram precision of 5/7. A scrambled candidate such as “NLP learn I want to” does even better, with a perfect unigram score of 1, even though it is not grammatically correct.

This is why BLEU also considers n-grams (bigrams, trigrams, and 4-grams). The bigrams of our candidate are “NLP is”, “is what”, “what I”, “I want”, “want to”, and “to learn”. Three of these six (“I want”, “want to”, and “to learn”) appear in the reference, so the bigram precision is 3/6 = 0.5. In this way BLEU rewards exactly matching sequences of words between candidate and reference.
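The n-gram version of the same calculation can be sketched as follows. The helper names ngrams and modified_precision are hypothetical, and the text is split on whitespace for simplicity. Modified precision divides the clipped matches by the number of n-grams in the candidate, so the bigram case here is 3 matches out of 6 candidate bigrams.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram matches divided by the number of candidate n-grams."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

reference = "I want to learn NLP"
print(modified_precision("NLP is what I want to learn", reference, 1))  # 5 of 7 unigrams match
print(modified_precision("NLP learn I want to", reference, 1))          # 1.0: word order is ignored
print(modified_precision("NLP is what I want to learn", reference, 2))  # 0.5: 3 of 6 bigrams match
print(modified_precision("NLP learn I want to", reference, 2))          # 0.5: 2 of 4 bigrams match
```

The scrambled candidate drops from a perfect unigram score to 0.5 once bigrams are considered, which is the effect the n-gram terms are meant to capture.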

BLEU also penalizes candidates that are shorter than the reference. To see why, extend the original example and let the candidate be “NLP is” with the reference “NLP is awesome”. Considering bigrams, this candidate would receive a precision of 1, because its only bigram appears in the reference. BLEU therefore multiplies the score by a brevity penalty: divide the length of the reference by the length of the candidate, subtract the result from one, and raise e to that power, i.e. exp(1 - r/c) for reference length r and candidate length c. In our case the penalty is exp(1 - 3/2) ≈ 0.61, reducing the BLEU score from 1 to about 0.61.

We can now see why BLEU is a widely used metric, but it does have flaws; for example, it does not consider meaning. Reading further about the problems with BLEU will give you a better understanding of the metric.
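As a small sketch of the brevity penalty, assuming the standard BLEU definition (1 when the candidate is longer than the reference, exp(1 - r/c) otherwise, with r the reference length and c the candidate length). The function name brevity_penalty is chosen for the example.

```python
import math

def brevity_penalty(candidate_len: int, reference_len: int) -> float:
    """BLEU brevity penalty: 1 if the candidate is longer than the reference, else exp(1 - r/c)."""
    if candidate_len == 0:
        return 0.0
    if candidate_len > reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

# Candidate "NLP is" (2 words) against reference "NLP is awesome" (3 words).
bp = brevity_penalty(len("NLP is".split()), len("NLP is awesome".split()))
print(round(bp, 2))  # 0.61
```

A candidate of the same length as the reference gets exp(0) = 1, so only candidates shorter than the reference are penalized.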

Let us now look at the BLEU scores of the summaries generated by the pre-trained BERT model.

Code snippet for the calculation

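from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu

def calculate_bleu_score(bert_predicted_summary, df):
    for i, predicted in enumerate(bert_predicted_summary):
        # NLTK's BLEU functions expect lists of tokens, not raw strings.
        candidate = word_tokenize(predicted.lower())
        reference = word_tokenize(str(df['summary'][i]).lower())
        # sentence_bleu takes a list of references and a single candidate.
        print(sentence_bleu([reference], candidate))

calculate_bleu_score(bert_predicted_summary, df)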

Output

[Figure: BLEU scores for the first 10 predicted summaries]

With basic preprocessing and without fine-tuning the pre-trained model, each of the first 10 predicted summaries receives a good score, with an average BLEU score of 0.6.

Now let us dig deeper and create our own text summarizer using Python.

Coding a text summarization model in Python from scratch

Why do we need to build an extractive summarizer from scratch when we already have amazing pre-trained models available?
To build intuition, rather than treating the model as a black box that gives us our desired output. As discussed earlier, an extractive summarizer needs to score sentences and return the most important ones as the summary. Many scoring functions are possible; let us consider the following one.
We give each word a score based on its frequency of occurrence in the entire corpus (you would want to remove stop words from your text, as they can skew the frequency counts). Each sentence of an input article is then scored by the sum of the frequencies of the words it contains.

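Implementation in Python is as below.

from nltk.tokenize import word_tokenize

def count_freq():
    """Count how often each word occurs across the whole corpus."""
    res = {}
    for article in df['cleaned_text']:
        for word in word_tokenize(article):
            res[word] = res.get(word, 0) + 1
    return res

word_freq = count_freq()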

The function above builds the dictionary word_freq, which holds the count of every word present in the corpus.

from nltk.tokenize import sent_tokenize, word_tokenize

def sentence_rank(text):
    """Return one weight per sentence: the summed corpus frequency of its words."""
    weights = []
    for sentence in sent_tokenize(text):
        # Words missing from word_freq contribute 0 instead of raising KeyError.
        weights.append(sum(word_freq.get(word, 0) for word in word_tokenize(sentence)))
    return weights

The sentence_rank function gives each sentence a weight equal to the sum of the word counts of all the words in that sentence.

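import numpy as np
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.translate.bleu_score import sentence_bleu

n = 14  # number of sentences to keep in each summary
for i in range(10):
    article = df['cleaned_text'][i]
    sentences = sent_tokenize(article)
    weights = sentence_rank(article)
    # Indices of the n highest-scoring sentences (fewer if the article is short).
    top = np.argsort(weights)[::-1][:n]
    # Use j here: reusing i would overwrite the article index used below.
    candidate = ' '.join(sentences[j] for j in top)
    reference = str(df['summary'][i])
    # BLEU is computed on word tokens, not on raw strings.
    print(sentence_bleu([word_tokenize(reference)], word_tokenize(candidate)))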
The snippet above uses the sentence_rank function to summarize each input article and calculate its BLEU score. n is a hyperparameter that controls the length of the generated summary; after trying several values I chose 14, as it gave a good BLEU score. With this very basic text summarizer we achieve an average BLEU score of 0.5, which is 0.1 lower than what the pre-trained model achieved on the same input.
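The stop-word removal mentioned earlier is not part of the snippets above. As a self-contained sketch of how it could fit into the same frequency-based scoring, the example below uses only the standard library. The tiny STOP_WORDS set, the regex-based sentence splitter, and the summarize function are assumptions made for illustration, and word frequencies are counted within the single input text rather than over the whole corpus.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists such as NLTK's are much longer).
STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "it", "for", "on", "that", "this"}

def summarize(text: str, n: int = 2) -> str:
    """Return the n highest-scoring sentences, kept in their original order."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [
        [w for w in re.findall(r"[a-z0-9']+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    # Frequency of every non-stop word in the text.
    word_freq = Counter(w for words in tokenized for w in words)
    # Sentence score = sum of the frequencies of its words.
    scores = [sum(word_freq[w] for w in words) for words in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return " ".join(sentences[i] for i in sorted(top))

sample = (
    "Text summarization shortens long text. "
    "The weather is nice. "
    "Extractive summarization selects sentences from the text. "
    "It is a sunny day."
)
print(summarize(sample, n=2))
```

Returning the selected sentences in their original order, rather than in rank order, is a design choice that tends to keep the summary readable. Summed frequencies also favor long sentences, so dividing by sentence length is a variation worth trying.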

For improving the text summarizer, we could use

1) TF-IDF scores instead of just using word frequencies
2) Sequence to Sequence Encoder-Decoder models and so on
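To make the first suggestion concrete, here is a minimal sketch of TF-IDF sentence scoring using only the standard library. The helper names (tokenize, idf_table, tfidf_sentence_scores), the two-document toy corpus, and the smoothed IDF formula log((1 + N) / (1 + df)) + 1 are assumptions chosen for the example; other TF-IDF variants would work as well.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

def idf_table(documents):
    """Smoothed inverse document frequency for every word in the corpus."""
    n_docs = len(documents)
    doc_freq = Counter(word for doc in documents for word in set(tokenize(doc)))
    return {word: math.log((1 + n_docs) / (1 + count)) + 1 for word, count in doc_freq.items()}

def tfidf_sentence_scores(document, idf):
    """Score each sentence by the summed TF-IDF weight of its words."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", document) if s.strip()]
    tf = Counter(tokenize(document))  # term frequency within this document
    scores = [sum(tf[w] * idf.get(w, 0.0) for w in tokenize(s)) for s in sentences]
    return list(zip(sentences, scores))

# Toy corpus, invented for the example.
corpus = [
    "The match ended in a draw. The striker scored a late goal.",
    "The bank raised interest rates. The market fell sharply.",
]
idf = idf_table(corpus)
for sentence, score in tfidf_sentence_scores(corpus[0], idf):
    print(round(score, 2), sentence)
```

Words that occur in every document, such as "the", receive the lowest IDF weight, so they contribute less to a sentence's score than words that are specific to one article.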

While there is definitely scope for improvement for our text summarizer, I would end this article here. If you have any suggestions regarding the improvement of the article, feel free to comment below.
A tiny bit about me:
I am Alifia, currently working as an analyst. By writing these articles I try to deepen my understanding of applied machine learning.

