Tired of Reading Long Articles? Text Summarization will make your task easier!

Ekta Last Updated : 24 Oct, 2024

This article was published as a part of the Data Science Blogathon.

Introduction

Millions of web pages exist on the Internet today, and going through that much content to extract information on a single topic is difficult. Google will filter the search results and give you the top ten, but often you still cannot find the content you need. Articles also contain a lot of redundant and overlapping material, which wastes time. A better way to deal with this problem is to summarize large amounts of text into a shorter form.

Table of contents

Introduction
Text Summarization
Text Summarization steps
Obtain Data for Summarization
Text Preprocessing
Convert text to sentences
Finding weighted frequencies of occurrence
Calculate sentence scores
Summary of the article
Conclusion
Frequently Asked Questions

Text Summarization

Text summarization is an NLP technique that condenses a large body of text into a shorter version that keeps its key information.

It is important because it:

Reduces reading time
Helps in better research work
Increases the amount of information that can fit in an area

There are two approaches to text summarization: NLP-based techniques and deep learning techniques.

In this article, we will go through an NLP-based technique that scores sentences by word frequency and makes use of the NLTK library.

Text Summarization steps

Obtain Data
Text Preprocessing
Convert paragraphs to sentences
Tokenizing the sentences
Find weighted frequency of occurrence
Replace words by weighted frequency in sentences
Sort sentences in descending order of weights
Summarizing the Article

Obtain Data for Summarization

To summarize a Wikipedia article, first obtain its URL. We will get the text from that URL by web scraping, which requires the Beautiful Soup library for Python. This library is used to extract the data held within the various HTML tags of the page.

Install it with the command below:

pip install beautifulsoup4

To parse the HTML tags we will further require a parser, that is the lxml package:

pip install lxml

We will try to summarize the Reinforcement Learning page on Wikipedia. Python code for obtaining the data through web scraping:

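import re                 # regular expressions, used later for preprocessing
import urllib.request     # opens the URL and downloads the page

import bs4 as bs          # Beautiful Soup, parses the HTML

# Download the raw HTML of the Wikipedia article
scraped_data = urllib.request.urlopen('https://en.wikipedia.org/wiki/Reinforcement_learning')
article = scraped_data.read()

# Parse the HTML with the lxml parser and collect every <p> element
parsed_article = bs.BeautifulSoup(article, 'lxml')
paragraphs = parsed_article.find_all('p')

# Concatenate the text of all paragraphs into one string
article_text = ""
for p in paragraphs:
    article_text += p.text

print(article_text)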

In this script, we first import the required libraries: BeautifulSoup for web scraping, urllib for opening the URL, and re, the regular-expression library used later for text preprocessing. The urlopen function requests the page and read() returns its content. We then parse that content with a BeautifulSoup object and the lxml parser.

In the Wikipedia articles, the text is present in the <p> tags. Hence we are using the find_all function to retrieve all the text which is wrapped within the <p> tags.
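To make concrete what collecting the text inside <p> tags means, here is a minimal sketch that does the same job on a small inline HTML string. It uses only Python's standard html.parser module rather than Beautiful Soup, so it runs without a network connection. The ParagraphExtractor class, sample_html, and demo_text are hypothetical names introduced for illustration, and this is not how Beautiful Soup itself is implemented.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collects the text found inside <p> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_p = False      # True while we are inside a <p> element
        self._buffer = []       # text chunks of the current paragraph
        self.paragraphs = []    # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.paragraphs.append("".join(self._buffer))
            self._in_p = False

    def handle_data(self, data):
        # Text of nested inline tags such as <b> is kept as well
        if self._in_p:
            self._buffer.append(data)


sample_html = (
    "<html><body><h1>Title</h1>"
    "<p>First <b>bold</b> paragraph.[1]</p>"
    "<div>skip</div>"
    "<p>Second paragraph.</p>"
    "</body></html>"
)

extractor = ParagraphExtractor()
extractor.feed(sample_html)
extractor.close()

# Joined without a separator, mirroring article_text += p.text
demo_text = "".join(extractor.paragraphs)
print(extractor.paragraphs)
```

The heading and the <div> are ignored, while the reference marker [1] survives, which is why the preprocessing step is still needed.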

After scraping, we need to perform data preprocessing on the text extracted.

Text Preprocessing

The first task is to remove all the references made in the Wikipedia article. These references are numbers enclosed in square brackets, such as [1]. The code below replaces each bracketed reference with a space and then collapses runs of whitespace into a single space.

# Remove bracketed reference numbers and collapse extra whitespace
article_text = re.sub(r'\[[0-9]*\]', ' ', article_text)
article_text = re.sub(r'\s+', ' ', article_text)

The article_text object now holds the original text without the reference markers. We do not remove any other words or punctuation marks, because the summary sentences are taken directly from this text.
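A quick way to check the reference removal is to run the two substitutions on a short string. The sketch below uses a made-up sample_text (an assumption, not taken from the scraped page) containing two reference markers, a double space, and a line break.

```python
import re

# Hypothetical sample with reference markers and irregular whitespace
sample_text = (
    "Reinforcement learning is an area of machine learning.[1]  It differs\n"
    "from supervised learning.[23]"
)

# Step 1: replace bracketed reference numbers with a space
no_refs = re.sub(r'\[[0-9]*\]', ' ', sample_text)

# Step 2: collapse any run of whitespace (spaces, newlines) into one space
cleaned = re.sub(r'\s+', ' ', no_refs)

print(repr(cleaned))
```

The sentence-ending periods are untouched, so the cleaned text can still be split into sentences later. A trailing space is left where the last marker was; calling strip() would remove it if that matters.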

Execute the code below to create a second, cleaned copy of the text that will be used to compute the weighted frequencies:

# Removing special characters and digits
formatted_article_text = re.sub('[^a-zA-Z]', ' ', article_text )
formatted_article_text = re.sub(r'\s+', ' ', formatted_article_text)

Here formatted_article_text contains the cleaned article, made up only of letters and spaces. We will use this object to calculate the weighted frequencies, and then look up those frequencies for the words of each sentence in the article_text object.
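The effect of keeping only letters is easiest to see on a sentence with a hyphen and a number. The sample string below is an assumption chosen for illustration.

```python
import re

sample = "Q-learning was introduced in 1989."

# Every character that is not an ASCII letter becomes a space
letters_only = re.sub('[^a-zA-Z]', ' ', sample)

# Collapse the resulting runs of spaces
collapsed = re.sub(r'\s+', ' ', letters_only)

print(repr(collapsed))
print(collapsed.split())
```

Hyphenated terms are split into separate words and numbers disappear entirely, so this copy is suitable for counting words but not for display. That is why the summary sentences are taken from article_text instead.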

Convert text to sentences

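The article text is split into sentences so that each sentence can be scored separately. NLTK's sent_tokenize needs the Punkt tokenizer data, which can be downloaded once with nltk.download('punkt') (recent NLTK releases may ask for 'punkt_tab' instead).

import nltk
sentence_list = nltk.sent_tokenize(article_text)

We tokenize the article_text object because it still has its punctuation, which the sentence tokenizer relies on, while the formatted_article_text object has had punctuation removed.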
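To give a feel for what sentence tokenization does, here is a deliberately naive stand-in written with the standard re module. It is not NLTK's Punkt algorithm; naive_sent_split is a hypothetical helper that simply splits after ., ! or ? when the next word starts with a capital letter.

```python
import re


def naive_sent_split(text):
    """Rough sentence splitter (hypothetical helper, not NLTK's Punkt)."""
    # Split on whitespace that follows ., ! or ? and precedes an uppercase letter
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text.strip())
    return [part for part in parts if part]


demo_sentences = naive_sent_split(
    "Policy iteration consists of two steps. The first is policy evaluation! "
    "Is the second policy improvement? Yes."
)
print(demo_sentences)

# Abbreviations show the limits of the naive rule
print(naive_sent_split("Dr. Smith agrees."))
```

The second call splits after the abbreviation, which is the kind of case a trained tokenizer such as Punkt is designed to handle better.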

Finding weighted frequencies of occurrence

stopwords = nltk.corpus.stopwords.words('english')
word_frequencies = {}
for word in nltk.word_tokenize(formatted_article_text):
    if word not in stopwords:
        if word not in word_frequencies.keys():
            word_frequencies[word] = 1
        else:
            word_frequencies[word] += 1

All English stopwords from the nltk library are stored in the stopwords variable (the list can be downloaded once with nltk.download('stopwords')). We iterate over all the words in formatted_article_text and check whether each one is a stopword. If it is not, we check for its presence in the word_frequencies dictionary. If it does not exist there, we insert it as a key and set its value to 1; if it already exists, we increase its count by 1.

maximum_frequency = max(word_frequencies.values())
for word in word_frequencies.keys():
    word_frequencies[word] = (word_frequencies[word]/maximum_frequency)

To find the weighted frequency, divide the frequency of each word by the frequency of the most frequently occurring word.
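The counting and normalization can be tried on a toy sentence without NLTK. In the sketch below, demo_stopwords is a tiny hand-written stand-in for NLTK's English stopword list and demo_text is made up; collections.Counter replaces the manual if/else counting but produces the same counts.

```python
from collections import Counter

# Tiny stand-in for nltk.corpus.stopwords.words('english') (assumption)
demo_stopwords = {"is", "of", "the", "a", "in", "and"}

demo_text = (
    "learning is the study of learning agents and the actions "
    "of agents in a learning environment"
)

# Count every word that is not a stopword
counts = Counter(w for w in demo_text.split() if w not in demo_stopwords)

# Divide each count by the count of the most frequent word
max_count = max(counts.values())
weighted = {word: count / max_count for word, count in counts.items()}

print(weighted)
```

Here "learning" appears three times and gets the weight 1.0, "agents" appears twice and gets 2/3, and the words that appear once get 1/3.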

A glimpse of the word_frequencies dictionary:

{'Reinforcement': 0.06944444444444445,
 'learning': 0.4583333333333333,
 'RL': 0.013888888888888888,
 'area': 0.013888888888888888,
 'machine': 0.041666666666666664,
 'concerned': 0.027777777777777776,
 'software': 0.013888888888888888,
 'agents': 0.013888888888888888,
 'ought': 0.013888888888888888,
 'take': 0.027777777777777776,
 'actions': 0.1527777777777778,
 'environment': 0.08333333333333333,
 'order': 0.041666666666666664,
 'maximize': 0.041666666666666664,
 'notion': 0.027777777777777776,
 'cumulative': 0.041666666666666664,
…………

Calculate sentence scores

We have calculated the weighted frequencies. Now scores for each sentence can be calculated by adding weighted frequencies for each word.

sentence_scores = {}
for sent in sentence_list:
    for word in nltk.word_tokenize(sent.lower()):
        if word in word_frequencies.keys():
            if len(sent.split(' ')) < 30:
                if sent not in sentence_scores.keys():
                    sentence_scores[sent] = word_frequencies[word]
                else:
                    sentence_scores[sent] += word_frequencies[word]

The sentence_scores dictionary stores the sentences as keys and their scores as values. We iterate over all the sentences and tokenize each one into lowercase words. For every word that exists in word_frequencies, its weighted frequency is added to the sentence's score: if the sentence is not yet in sentence_scores, it is inserted with that frequency as its initial value; otherwise the frequency is added to the existing score. Longer sentences are not considered: only sentences with fewer than 30 words are scored. Note that the words are lowercased here while the keys of word_frequencies keep their original case, so a capitalized key such as 'Reinforcement' is never matched.
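The scoring rule can be written as a small function and checked by hand. This is a sketch under stated assumptions: score_sentences, max_words, and the demo data are hypothetical, words are split on whitespace instead of using nltk.word_tokenize, and the keys of the weights dictionary are assumed to be lowercase.

```python
def score_sentences(sentences, word_weights, max_words=30):
    """Sum the weights of known words for each short sentence (hypothetical helper)."""
    scores = {}
    for sent in sentences:
        words = sent.lower().split()
        # Skip long sentences, as in the article's 30-word limit
        if len(words) >= max_words:
            continue
        # Strip surrounding punctuation before looking up each word
        total = sum(word_weights.get(w.strip('.,;:!?()'), 0.0) for w in words)
        # Only sentences containing at least one weighted word get an entry
        if total > 0:
            scores[sent] = total
    return scores


demo_weights = {"learning": 1.0, "agents": 0.5}
demo_scores = score_sentences(
    ["Learning agents take actions.", "Nothing relevant here."],
    demo_weights,
)
print(demo_scores)
```

The first sentence scores 1.0 + 0.5 = 1.5, and the second has no weighted words, so it gets no entry at all.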

A glimpse of sentence_scores dictionary:

{'Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning.': 2.347222222222222,
 'Reinforcement learning differs from supervised learning in not needing labeled input/output pairs to be presented, and in not needing sub-optimal actions to be explicitly corrected.': 1.5555555555555551,
 'Instead, the focus is on finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge).': 0.4305555555555556,

Summary of the article

The sentence_scores dictionary consists of the sentences along with their scores. The top N sentences can now be used to form the summary of the article.
Here the heapq library is used to pick the 7 highest-scoring sentences to summarize the article.

import heapq
summary_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)
summary = ' '.join(summary_sentences)
print(summary)

Output:

Reinforcement learning is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. In reinforcement learning methods, expectations are approximated by averaging over samples and using function approximation techniques to cope with the need to represent value functions over large state-action spaces. Policy iteration consists of two steps: policy evaluation and policy improvement. The work on learning ATARI games by Google DeepMind increased attention to deep reinforcement learning or end-to-end reinforcement learning. Assuming full knowledge of the MDP, the two basic approaches to compute the optimal action-value function are value iteration and policy iteration. Two elements make reinforcement learning powerful: the use of samples to optimize performance and the use of function approximation to deal with large environments. Many policy search methods may get stuck in local optima (as they are based on local search).
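The selection step can be seen in isolation on a toy dictionary. heapq.nlargest returns the keys ordered from highest to lowest score, so the summary sentences do not necessarily appear in the order they had in the article. The sketch below, with made-up scores, also shows one possible way to put the selected sentences back into document order; that reordering is an optional variation and not part of the code above.

```python
import heapq

# Hypothetical scores, inserted in document order
toy_scores = {
    "First sentence.": 0.4,
    "Second sentence.": 1.3,
    "Third sentence.": 2.1,
    "Fourth sentence.": 0.9,
}

# Pick the two highest-scoring sentences (highest first)
top_two = heapq.nlargest(2, toy_scores, key=toy_scores.get)

# Optional: restore document order, relying on dict insertion order
document_order = list(toy_scores)
top_two_in_order = sorted(top_two, key=document_order.index)

print(top_two)
print(' '.join(top_two_in_order))
```

Whether to keep score order or document order is a design choice; document order tends to read more naturally when the sentences refer to one another.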

Conclusion

Text summarization of articles can be performed with the NLTK and BeautifulSoup libraries, which can save reading time. Deep learning techniques can be used further to obtain better summaries. I look forward to people using this mechanism for summarization.

Frequently Asked Questions

Q1. Are there different types of text summarization?

There are extractive and abstractive summarization methods, each with unique approaches to condensing text.

Q2. Can text summarization be applied to various content formats?

Absolutely, text summarization techniques are versatile and can be used on articles, documents, and online content.

Q3. How does a text summarization project work?

It employs algorithms to analyze and condense text, identifying essential information and creating concise summaries.
